CIS 2210 - Database Management and Design

Chapter 13: Business Intelligence and Data Warehouses

Objectives:

This lesson discusses material from chapter 13. Objectives important to this lesson:

  1. Business intelligence
  2. Decision support data
  3. Data warehouse
  4. Star schemas
  5. Online analytical processing (OLAP)
  6. SQL extensions for OLAP
Concepts:

Chapter 13

Business Intelligence

This chapter is about a Business Intelligence (BI), which we are told means a system of tools and information that helps make business decisions. This is a benefit that has been promised in most of the chapters in this book, starting with the first one.

On page 592, the text provides a graphic of a BI framework, which is explained briefly as being people and processes, using technology to work with external and operational (internal) data to facilitate and enhance the management and governance of a business. Some of the items in the illustration carry over into the next concept, the BI architectural components.

The text lists six components of the BI architecture:

  • ETL tools - Tools for Extraction, Transformation, and Loading of data.
  • Data stores - Large data stores are often referred to as data warehouses, which will contain lots of data to analyze.
  • Query and reporting tools - Tools that we find in any good database to generate information and reports on the data it holds.
  • Data visualization - This simply means tools that present data in graphic ways, like charts and graphs, reports and dashboards.
  • Data monitoring and alerting - Tools that allow the system to call attention to meaningful events.
  • Data analytics - Tools that help users pick the right tools to build and operate business models that show where the business has been and where it might go next. Once you have a idea about what to monitor, analytics can be used to explain a situation, good or bad, and to predict what may happen with or without change.

Page 594 presents a table of tool types, descriptions of what they do, and a list of vendors who supply tools like that.

On page 595, the text tells us that the usual information system (it means a DBMS) is concerned with operational data, information that assists and records the functions performed by employees of an organization. This is what every example database in the text has been about. The example systems hold information about transactions, inventory, personnel, customers, vendors, and other critical entities associated with the organization.

A BI system uses operational data, but it requires more, so that it can focus on strategical and tactical information. A critical component is master data management (MDM), which holds processes that support generally accepted best practices for business. The outputs of this subsystem can be used to make governance decisions, which are decisions about the rules used to run an organization. The MDM processes will look specifically at key performance indicators (KPIs). The text provides a list of commonly used KPIs. Organizations tend to use specific KPIs that make sense for the kind of business that they do.

  • General KPIs are profits, sales, product turnover, product recalls, sales from promotions, and sales by each unit, employee or area.
  • Financial KPIs are earnings per share of stock, profit, sales by each employee, percentage of sales not yet received, and assets.
  • Human resources KPIs are number of applicants per job opening, employee turnover, and employee retention.
  • Education KPIs are related mostly to schools: graduation rates, incoming students, student retention rates, publications by staff, and staff evaluations.

As the text points out, the appropriate KPIs for an organization can't be defined before defining what that organization does, and what its goals are. Goals and KPIs should both be expressed as things to be completed within a specific time limit. This is the T in SMART objectives. Follow the link in the previous sentence to get a quick lesson on SMART goals and objectives. They make sense whether you are trying to make plans for a business or for your own life and happiness.

The text gives us a few pages about the evolution of Business Intelligence systems, but we can skip over it without missing any necessary information. It is more useful to move to page 601, where the text briefly discusses some newer versions of BI systems.

  • SSD storage devices improve search time on any DBMS, and that goes for BI systems as well.
  • BI appliances are devices that can be mounted on your network, can access existing data warehouses, and can be lower cost solutions to adding BI to your organization.
  • BI as a service is a cloud based service that lets you buy/lease as much as you need on a vendor's system, from a vendor's system developer, and from the software packages available from the cloud vendor.
  • Big Data analytics is an approach that is discussed in the next chapter. It uses social media, which may be applicable to your organization, or it may be a solution for someone else, not for you.
  • Personal analytics - OLAP systems are Online Analytical Processing systems. These are mentioned earlier in the chapter as forerunners to BI systems. The text tells us that mobile BI systems access processing in the cloud to allow data gathering and decision making closer to the customer.

Decision Support Data

The next section begins with a warning that a BI system depends greatly on the operational data that it is given. We are reminded that operational data is stored in a typical database. That storage is different from a decision support database in several ways:

  • Time span - Decision support data must cover longer time frames than operational data. It is more strategic. Operational data is about specific points in time, and decision data is about history and projections into the future.
  • Granularity - Decisions vary from highly granular through highly aggregated. A decision maker needs to be able to start at any level and move to a more atomic or more aggregated level, drilling down or drilling up through the data as needed to do so.
  • Dimensionality - The text explains that this relates to examining the data across different parameters, not just across time. You may want to examine your data as it relates to any size area, to customers, to products, and more, depending on the dimensions stored in your database.
  • Currency and volume - Operational data is about keeping the individual data up to date, but decision support data is about large numbers of transactions done for long periods.
  • Table nature - Operation data is typically stored in normalized relations, which results in many tables. Decision support data is stored in aggregated results from those normalized tables.
  • Data models - Operational data models use the normalized tables noted above, and have lots of detail. Decision support models rely on querying that data to create the tables they really want.

More characteristics of the differences between operational and decision support data are shown in the table at the top of page 605, Table 13.5. On page 606, the text shows us two tables that illustrate a difference that causes a need for a decision support system. The first one shows sales figures for each of ten years at a single department in a single store. This data is easy to read. The second table shows some of the data for the same years from each of two stores, each of which has two departments in it. This small change makes the table much longer, and makes it much harder to spot trends. That is what a decision support system is for, especially in more complex organizations.

Data Warehouse

The next part of the chapter talks about the meaning of a data warehouse. In short, it is a huge collection of data for a complex organization that covers many entities over long periods of time.

  • It has to be integrated. In this context that means that data about various entities that are similar is stored in the same way, so it can be compared from one place to another, and from one time to another. For a given attribute, the same values are used across all parts of the organization, in order to make comparisons possible.
  • It has to be subject oriented. Data elements are collected and organized according to topics of interest to the decision makers and to the processes in the BI system.
  • It has to be time variant. This means it has to continue to exist over long periods, so we can examine the history and trends of what happened in the company.
  • It has to be nonvolatile. This means that we have to be certain of the information, and that it is not subject to revision or change. We need to know the truth about what happened, so we can compare any truth to any other.

Table 13.8 on page 609 summarized the text's discussion of these characteristics.

Earlier in the text, a line implied that a data warehouse and a data mart were the same thing. This is clarified on page 610, where we learn that a data mart is a limited data warehouse, about a single subject such as one store in a chain, or one division in a large company. It may be easier to create a BI system from a data mart, but the danger of doing so is its loss of the larger scope: you won't be able to compare the subject to similar subjects across the organization.

Star Schemas

A star schema is a different way to look at your data that is useful in moving from a relational database to a BI database.

Tables in a star schema

In the example above from Wikipedia, you see one Fact table in the center, with three Dimension tables arranged in a star around it. That is the basic system.

  • Facts - objective numeric values about some entity. In the example above, you see four values that apply to sales in our company.
  • Dimensions - These are items in other tables that the elements in the fact table relate to. We ask questions like "are the sales higher or lower on specific days at specific stores for any kind of product?"
  • Attributes - As you can see in the example, a dimension table has attributes that may be of importance in making decisions about our company.
  • Attribute hierarchies - The truth to be gained from the data iis likely to be found in combinations of the attributes across the dimensions. In each dimension, we expect to find what we see in the example above, an hierarchy of data that goes from the general to the specific in that dimension. The illustration on page 615 shows the idea of considering specific combinations of attributes in each dimension. Moving up and down the hierarchies can lead us to new information we can use to make choices about what to do next.
  • What is not obvious from the example is that the Facts table will be very large, containing all the numeric data from all the Dimension table comparisons, in all their logical combinations. This is accomplished by making the primary key of the Facts table a composite of all the primary keys from the Dimension tables.
  • Mulitdimensional imageThe more stars (Dimension tables) you have around the Fact table, the more combinations of attributes this creates in the Facts table. The text illustrates an example with three Dimension tables and only three attributes to care about in each one. That is not realistic. Real star schemas will have many potential stars and lots of attributes in each of them, leading to multidimensional geometry. Take a look at the increasingly complex geometry on this page in Wikipedia. It is a lot more digestible than most of the material I found in a quick search for a lesson on multidimensional geometries. (You can click the image on the right instead of the link. You go to the same page.)

The text offers some advice for improving performance in star schemas:

  • Normalize your dimensional tables.
  • Denormalize your fact tables. They are result tables, so normalizing them is not practical.
  • Create more fact tables to represent different aggregations. This means new schemas.
  • Partition and replicate tables. They get big enough to want to do this.
For further reference, take a look at this Microsoft article that describes star schemas and snowflake schemas.

Online Analytical Processing (OLAP)

On page 621, we start a new topic that covers a concept mentioned earlier: Online Analytical Processing (OLAP). We are told that it concerns three main ideas, which continue to relate to the material above.

  • The Wizrd of Oz behind his curtainMultidimensional data analysis - Having led us into multiple Dimensions, the text offers hope that we can analyze the data held in them. Users of the database will have different interests, causing them to want different views of the data available. The data are more meaningful when presented in different ways, when combined from different dimensions in different aggregations, and when viewed through PKIs that interest the user requesting the data.
  • Advanced database support - OLAP tools need to access data from diifferent kinds (disparate) systems. A useful feature mentioned in the text is the ability to query the system in business terms, what developers have typically called using "natural language". Well, it is natural to the speaker of that language. The idea is that the user can request data in terms the user knows that have been defined in the database as parameters it can actually use.
  • Easy-to-use end-user interfaces - We always have to simplify reality for the users. It is often best if they do not know how the database actually functions. You may wish you were insulated from it as well, at this point.

The text goes on to describe the intensive processing that OLAP requires. This is not directly related to the material you are expected to absorb from the chapter, so let's stop the presentation of new material here.