Multidimensional data models

Multidimensional data models. OLAP operations

The choice of aggregation

In previous lectures (see in particular lecture 3) we know the principle of multidimensional data modeling. Recall that in order to speed up the operation, some aggregation analysis (summary data) can be calculated in advance and ready for use. A particularly convenient model for the storage of such aggregation is a cube of data.

Create a data cube is usually based on the contents of slice data warehouse model presented in the form of a star or snowflake (with a central array of facts). But not always need the data described in the full available precision. Aggregation of values ​​may relate to omit certain dimensions or attributes in the hierarchy to better fit the size of the stored summaries to the real needs and technical capabilities. Therefore, the forms are often multiple (hierarchical) thematic warehouse, storing data podkostki materialized.

Example: The facts gathered data covering sales (50 million. Records, measure: the value), customers identified the postal code (3000 different codes), goods (60 groups, 800 individual names), time (3 years, ie 1000 days), shops ( 18 units). This data model includes tables with celebrities (including) a little over 50 million. records, because the dominant fact table is here. Let's calculate what would be the size of a full, four-dimensional data cubes:

3000 * 800 * 1000 * 18 = 43 200 000 000 cells

This is too much to keep the ankle in memory and use it effectively. Of course, most such cells would be empty cubes, and so it would be possible to use various forms of data compression, but it still does not provide the required speed of access to data. Now consider the version of the aggregate: ignore customer dimension, we consider only the goods in groups, consider the time scale of months. This time, the data cube includes:

60 * 36 * 18 = 38 880 cells (total value)

This data can easily fit in memory. This aggregation does not allow the course to generate all these reports, which could be created originally, but it is for other reports can run about three orders of magnitude faster than the model star.

At this point, there are questions the nature of optimization:


  • Which dimensions, we can ignore?
  • How to aggregate data (sum, min, max, avg, count)?
  • Which attributes of the hierarchy of aggregates, which are left with all the details?
  • How much should arise auxiliary materialized cubes?
  • How to calculate the optimum specific query?

Note that in many cases, the answer to the question about the low level of detail can be inferred from more detailed data cubes. We can look at the problem as the selection of a subset of blocks (see figure below) to be the materialization of the many possible definitions podkostki. The goal of this optimization is usually to minimize the average processing time in practice asked OLAP queries, while maintaining ties imposed by the technical possibilities (size drives, memory, etc.). In the example below, only two materialized data cubes, which allows us to generate a report such as the total sales of individual commodities.

Usually, we can not materialize all podkostek (for n dimensions we have 2n podkostek defined by the choice of a subset of dimensions, not including restrictions on attributes, etc.), so we must choose those that later would be easiest to generate reports. It is quite difficult optimization problem whose solution requires further gathering of information about each of your (potential) of the ankles and the types of reports. Collection of this information in a metadata repository is a navigator on aggregations.

Typical OLAP operations

One of the main analytical activities related to the process is to navigate OLAP aggregations, and thus creating an interactive target of a summary report by selecting appropriate aggregation, data ranges, additional conditions, etc. These operations are often performed in real time, which imposes a significant performance requirements the data source (in this case, the wholesaler is usually thematic). The need to provide an interactive work with data is one of the motivations to create data cubes.

Navigator after aggregations OLAP is a program designed for the analytical user. Its mission is to support the data and preparing reports. Programs of this family provide a convenient overview of multidimensional data model. Another important task of the navigator is to help collect statistics on the use of thematic warehouse. Information asked OLAP queries are stored as metadata in a special database (repository) and can be used for query optimization (eg, the materialization of some aggregation). The metadata warehouse contains a lot of other information, such as a description of the conceptual and logical data, information sources and their integration, daily updates, etc.

Below are the most common operations on data cubes, including their nomenclature. Examples are based on already known data warehouse logical model for the supermarket chains.

Folding (aggregation, consolidation, aggregation, consolidation, roll-up) is more general data viewing, the more aggregated (eg, the summary made the months instead of days). This operation reduces the amount of information displayed, replacing many of the value of one. Another example is the introduction of rolling operations summary (summary and selection of functions such as sum, minimum, arithmetic mean) for the values ​​that have been presented in full detail.

Developing (roll-down, drill-down, drill through) is the inverse of the previous operation. this operation we use to get a summary of the values ​​of its details, for example through the transition from annual to daily time scale, or of a report presenting the sale of a commodity in each region, the report includes individual stores.

Selection (Slicing) cutting out a fragment of data by specifying conditions on the values ​​of dimensions and a list of dimensions to be taken into account in the report. podkostka result is, for example, two-dimensional slice (slice), corresponding to a satisfying the conditions. Dimension values ​​are not present in the resulting table are projected (aggregated), ie if we consider the four-dimensional cube-time shop-good-to-client, which we reject the customer dimension, a dimension item limit (selection) to the value of the ABC television, the result is a slice of time -store only sale includes the ABC television, aggregated over all clients.

Filtering (screening, selection, filtering) is the rejection of some of the data included in the aggregate, for example, by reducing the value of measurement of dimensions or attributes (except for the rejection of entire dimensions, and the only restriction, for example, to a certain time period). Filtering can act on the results of aggregation, so it is quite expensive computationally ROLAP model, if we wanted to calculate the result of an operation directly from the model star.

Narrowing (scoping) is to reduce the data to be taken into account in further analysis. This is similar to the filtering operation and selection, but realized at an earlier operational level, the definition phase podkostek (lower-level thematic warehouse). In practice, the narrowing of the work as automatically add conditions to any further operations on the data cube. Particularly useful when we want to make many, often repeating the analytical operations, always the same (in terms of conditions), a subset of data.

Rotation (pivot, rotate) is the common name of many different types of operations on the dimensions of the data used in the report. Its simplest form is swapped columns and rows in a table (transposition).
A more complex case is the transfer of one dimension of the columns to rows (or vice versa), does it matter if the two-dimensional table in the report constitute more than two dimensions. This can be achieved by providing a number of dimensions at different levels of one axis of the table (see figure below) and amplification of the value of one of the dimensions in each group defined by the second. For example, the time dimension and shop in this example were on one axis, by duplicating all the shops in each year. Note that such a transfer does not change the dimension of the table cells, but only a reorganization (changing the order of cells).

This category includes OLAP operations also sometimes replace one dimension (or attribute) to another, unused.

The above list is not exhaustive list of all OLAP operations provided by each tool, this is the approximate intersection.

No comments: