Query Optimization

Where queries are processed

A data warehouse is a complex database system whose different parts have different requirements for query processing performance. There are at least three main areas, which differ in the type and purpose of the queries they process:

Back end: Query processing in the ODS, the source databases, and the loading programs (wrappers).

These are mostly queries for updating, processing, and cleaning data: simple queries that insert and modify data, with large amounts of data in the response. Some aggregates can already be calculated during data loading (e.g., in the ODS).

Kernel: Query processing in a central data store.

Queries are computed over large amounts of data, but the result is usually small (e.g., a summary). The basic kinds of queries processed in the warehouse kernel are refreshing the thematic warehouses (data marts) and answering user queries that could not be computed in the thematic warehouses.

Top: Query processing in the thematic warehouses (OLAP, DSS, KDD).

Complex queries that return a small set of data: summaries, tables, reports, etc. The aim is that most queries should not have to reach for data from the lower levels (including the kernel).

Methods of query optimization

The basic method of speeding up answers to frequently asked queries in data warehouses is redundancy, that is, duplicating data in aggregated form (as in thematic warehouses, for example). Redundancy in a data warehouse means storing materialized views that are used to compute query results.

Cons:

  • Higher cost of data storage,
  • More difficult to update,
  • Risk of loss of integrity.


The only (but essential) advantage is a significant speed-up of a fixed type of query. To balance the amount of redundant precomputed statistics against query efficiency, different aggregation granularities can be used for newer and older data.
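The trade-off described above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module, which has no native materialized views, so the redundant copy is emulated with an ordinary summary table. The table and column names (sales, sales_monthly, sale_day, amount) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detailed data (hypothetical schema).
    CREATE TABLE sales (sale_day TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-05', 'tea',    10.0),
        ('2024-01-05', 'coffee', 20.0),
        ('2024-01-17', 'tea',     5.0),
        ('2024-02-02', 'coffee', 40.0);

    -- Redundant, pre-aggregated copy: monthly totals per product.
    -- It must be refreshed whenever sales changes (the update cost).
    CREATE TABLE sales_monthly AS
        SELECT substr(sale_day, 1, 7) AS month, product, SUM(amount) AS total
        FROM sales
        GROUP BY substr(sale_day, 1, 7), product;
""")


def monthly_total_from_base(month):
    """Scan the detailed table (slow on large data)."""
    sql = "SELECT SUM(amount) FROM sales WHERE substr(sale_day, 1, 7) = ?"
    return conn.execute(sql, (month,)).fetchone()[0]


def monthly_total_from_summary(month):
    """Read the much smaller pre-aggregated table instead."""
    sql = "SELECT SUM(total) FROM sales_monthly WHERE month = ?"
    return conn.execute(sql, (month,)).fetchone()[0]
```

Both functions return the same value for a given month; the second one touches far fewer rows, at the price of extra storage and a refresh step.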

From the point of view of query execution optimization, we can distinguish the following types of queries:

Transactional queries usually concern only a few records but occur in very large numbers. This type of query is not specific to data warehouses, so it is not subject to special optimization.

Predefined queries are formulated by the administrator and called many times (e.g., for different time ranges). They usually take the form of a parameterized query. They can be optimized in advance.
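A parameterized predefined query might look like the following sketch. It uses Python's sqlite3 module with `?` placeholders; the sales table and the revenue_in_range helper are hypothetical names introduced only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_day TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-05', 10.0), ('2024-01-05', 20.0),
        ('2024-01-17',  5.0), ('2024-02-02', 40.0);
""")

# The query text is fixed in advance; only the time range varies per call.
REVENUE_IN_RANGE = """
    SELECT COALESCE(SUM(amount), 0)
    FROM sales
    WHERE sale_day >= ? AND sale_day < ?
"""


def revenue_in_range(start, end):
    """Run the predefined query for the half-open range [start, end)."""
    return conn.execute(REVENUE_IN_RANGE, (start, end)).fetchone()[0]
```

Because the query shape never changes, an administrator can prepare suitable indexes or summary tables for it once and reuse them for every call.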

Ad hoc queries are asked by analysts and are unpredictable in nature. They need to be optimized "on the fly."

Queries concerning the dimensions. This group covers queries that browse and group dimension attributes. In the star schema this amounts to browsing individual tables; in the snowflake schema it requires joins. Optimization: define groups of related values (predefined queries for frequently used dimension values, such as "products originating from outside Europe") or behavior groups (predefined queries that also depend on the facts, such as "products eagerly bought before the holidays").

Queries concerning the facts. These are queries that generate reports. They can be implemented as predefined queries and optimized in advance, for example by breaking them into smaller queries. In the ROLAP model they require joins.

Query optimization is carried out in several directions, and includes the following techniques:


  • Indexes (optimization of data access)
  • Join indexes
  • Optimization of the order of calculating the result
  • The use of redundancy (materialized views)
  • Materialized view estimation: a technique useful for estimating the optimal execution path of a query.

The use of redundancy in the data warehouse is associated with the view usability problem: which materialized views can be used to compute a specific query? Can the result of query Q1 be computed from the result of query Q2?

Simple examples:

  • queries whose set of rows in the response is restricted by an additional condition (exact match)
  • aggregation queries whose granularity is coarser than that of the original (inexact match)

Can the first query be replaced by another?

This problem is at least as hard as SAT (satisfiability of logical formulas), so in practice it may be exponentially hard. In general, the query containment problem is undecidable. In practice, however, several simple heuristics are applied to detect some of the redundancy and to use the results of one query to compute another.
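One such heuristic can be sketched as follows. This is only an illustrative, sufficient (not necessary) test under strong assumptions: both queries compute the same additive aggregate (such as SUM) over the same base table, and all conditions are simple equalities. The AggQuery class and the answerable_from function are hypothetical names, not part of any database engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggQuery:
    """A simplified aggregate query: GROUP BY columns plus equality filters."""
    group_by: frozenset   # column names
    filters: frozenset    # (column, value) pairs, all combined with AND


def answerable_from(q1, q2):
    """Conservative check: can q1 be computed from the stored result of q2?"""
    # q1 may only group by columns that q2 kept (coarser granularity).
    if not q1.group_by <= q2.group_by:
        return False
    # q2 must not have discarded rows that q1 still needs.
    if not q2.filters <= q1.filters:
        return False
    # Extra conditions of q1 must be checkable on the columns q2 kept.
    extra = q1.filters - q2.filters
    return all(column in q2.group_by for column, _value in extra)


stored = AggQuery(frozenset({"month", "product"}), frozenset())
by_month = AggQuery(frozenset({"month"}), frozenset())
tea_by_month = AggQuery(frozenset({"month"}), frozenset({("product", "tea")}))
by_day = AggQuery(frozenset({"day"}), frozenset())
```

The check answers "yes" only when it is certainly safe; a "no" means the heuristic cannot tell, not that reuse is impossible.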

Indexes

The need for indexes arises from the standard technique for storing data in databases: data is stored row by row, which gives easy access to a whole record (important for OLTP) but makes it hard to filter on the values of a particular column (important for OLAP). Many types of indexes are used; some can be applied automatically, while others have to be used manually (by explicitly specifying them in the query plan).

Projection index: a sequence whose elements are the values of a specified column for successive records.
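A projection index can be sketched in a few lines. The rows list and its column names are made up for illustration; a real engine would store the sequence compactly on disk.

```python
# Row-oriented storage: each record is kept together (hypothetical data).
rows = [
    {"id": 1, "region": "north", "price": 10},
    {"id": 2, "region": "south", "price": 25},
    {"id": 3, "region": "north", "price": 40},
    {"id": 4, "region": "east", "price": 15},
]


def build_projection_index(records, column):
    """Copy one column into its own sequence, in record order."""
    return [record[column] for record in records]


price_index = build_projection_index(rows, "price")

# Filtering now scans only the narrow sequence, not whole records.
# Position i in the index corresponds to record i in rows.
expensive_positions = [i for i, price in enumerate(price_index) if price > 15]
```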

Value-list index: a B-tree whose nodes are the column values, each with an attached list of the records having that value.
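The idea of a value-list index can be sketched as below. Python has no B-tree in its standard library, so a sorted key list searched with bisect stands in for the tree here; that substitution and the function names are assumptions of this sketch.

```python
import bisect
from collections import defaultdict


def build_value_list_index(values):
    """Map each distinct value to the list of record positions holding it."""
    postings = defaultdict(list)
    for row_id, value in enumerate(values):
        postings[value].append(row_id)
    keys = sorted(postings)  # sorted keys play the role of the B-tree
    return keys, dict(postings)


def lookup_range(index, low, high):
    """Return positions of records whose value lies in [low, high]."""
    keys, postings = index
    start = bisect.bisect_left(keys, low)
    stop = bisect.bisect_right(keys, high)
    result = []
    for key in keys[start:stop]:
        result.extend(postings[key])
    return sorted(result)


ages = [30, 10, 20, 10, 30]
age_index = build_value_list_index(ages)
```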

Bitmap index: binary strings, one assigned to each value in a column, containing 1 at the positions of the records that have that value.
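A minimal bitmap index sketch follows, using Python integers as bit strings (bit i stands for record i). The sample columns and helper names are hypothetical.

```python
def build_bitmap_index(values):
    """Return {value: bitmap}, where bit i is set if record i has that value."""
    bitmaps = {}
    for row_id, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps


def rows_of(bitmap):
    """Decode a bitmap back into a list of record positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


colors = ["red", "blue", "red", "green", "blue"]
sizes = ["S", "M", "M", "S", "M"]
color_index = build_bitmap_index(colors)
size_index = build_bitmap_index(sizes)

# Conditions on several columns become cheap bitwise operations.
red_and_medium = color_index["red"] & size_index["M"]
red_or_green = color_index["red"] | color_index["green"]
```

The appeal of this index type is that AND, OR, and NOT conditions combine whole bitmaps at once before any record is fetched.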

Bit-sliced (segment) index: the values are converted to integers, and each bit position of their binary representation is stored separately.
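The following sketch shows one way such an index could work, assuming non-negative integer values. Slice k is a bitmap of the records whose value has bit k set, so a sum can be computed from bit counts alone. The function names are illustrative.

```python
def build_bit_sliced_index(values):
    """Return a list of bitmaps; slice k marks records whose value has bit k set."""
    width = max(values).bit_length() if values else 0
    slices = [0] * width
    for row_id, value in enumerate(values):
        for k in range(width):
            if (value >> k) & 1:
                slices[k] |= 1 << row_id
    return slices


def sliced_sum(slices, row_filter=None):
    """Sum the indexed column, optionally only over records set in row_filter."""
    total = 0
    for k, bitmap in enumerate(slices):
        if row_filter is not None:
            bitmap &= row_filter
        # Each set bit in slice k contributes 2**k to the sum.
        total += (1 << k) * bin(bitmap).count("1")
    return total


quantities = [5, 3, 6, 1]
quantity_slices = build_bit_sliced_index(quantities)
```

The optional row_filter is itself a bitmap, so the result of a bitmap-index condition can be passed in directly to aggregate only the matching records.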

Join indexes: in a star or snowflake schema, any of the above index types can be used to link a fact table with a dimension table.
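As a rough sketch, a join index can be built as a value-list index over the fact table, keyed by an attribute of the dimension table. The products and sales data below, and the function name, are hypothetical.

```python
# Dimension table: prod_id -> attributes (hypothetical data).
products = {
    1: {"name": "tea", "origin": "Asia"},
    2: {"name": "cheese", "origin": "Europe"},
    3: {"name": "coffee", "origin": "Africa"},
}

# Fact table: (prod_id, price) per sale.
sales = [(1, 10.0), (2, 20.0), (1, 5.0), (3, 40.0)]


def build_join_index(dimension, facts, attribute):
    """Map each dimension attribute value to the fact rows that join to it."""
    index = {}
    for fact_row, (prod_id, _price) in enumerate(facts):
        value = dimension[prod_id][attribute]
        index.setdefault(value, []).append(fact_row)
    return index


origin_index = build_join_index(products, sales, "origin")

# The join is precomputed: no lookup in products is needed at query time.
asia_revenue = sum(sales[row][1] for row in origin_index["Asia"])
```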

The sequence of calculations

The order of calculations in a query can affect how quickly the result is computed, because the sizes of the intermediate data can differ significantly. For large data (multi-gigabyte tables), a wrong join order may even make the query impossible to execute. Databases have a built-in optimizer that chooses the best order, but it does not always work well enough. Remember that a join of 10 tables with simple conditions can be performed in n!, that is, more than 3 million, different ways. Many database engines allow the query execution plan to be selected manually.
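The effect of join order on intermediate sizes can be estimated with a toy cost model. The table sizes, selectivities, and the cost definition (sum of intermediate result sizes for a left-deep plan) are all assumptions made for this sketch; real optimizers use far richer statistics.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

# Hypothetical row counts.
SIZES = {"sales": 1_000_000, "prod": 100_000, "store": 10_000}

# Fraction of row pairs that survive each join condition.
# Pairs without an entry have no join condition (cross product).
SELECTIVITY = {
    frozenset({"sales", "prod"}): Fraction(1, 100_000),
    frozenset({"sales", "store"}): Fraction(1, 10_000),
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = Fraction(SIZES[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        size *= SIZES[table]
        for other in joined:
            size *= SELECTIVITY.get(frozenset({table, other}), 1)
        joined.append(table)
        cost += size
    return int(cost)


costs = {order: plan_cost(order) for order in permutations(SIZES)}
best = min(costs, key=costs.get)
worst = max(costs, key=costs.get)
orders_for_ten_tables = factorial(10)
```

In this model, the plans that start by combining the two dimension tables produce a huge cross product first, which is the kind of intermediate result a good join order avoids.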

For summary queries, which are typical of OLAP, grouping and joining are sometimes interleaved. For example, we first aggregate by Prod.prod_id, summing Sprzedaż.cena (thus creating a much smaller intermediate table), then join according to the join criteria, and only at the end group by Prod.rok_wprowadzenia.
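The interleaving can be sketched with Python's sqlite3 module. The identifiers Prod.prod_id, Prod.rok_wprowadzenia, and Sprzedaż.cena come from the text; the column Sprzedaż.prod_id, the alias suma, and the sample data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE Prod (prod_id INTEGER PRIMARY KEY, rok_wprowadzenia INTEGER);
    CREATE TABLE "Sprzedaż" (prod_id INTEGER, cena REAL);
    INSERT INTO Prod VALUES (1, 2019), (2, 2019), (3, 2021);
    INSERT INTO "Sprzedaż" VALUES
        (1, 10.0), (1, 5.0), (2, 20.0), (3, 40.0), (3, 2.5);
''')

# Plain form: join every sale row with its product, then group.
JOIN_THEN_GROUP = '''
    SELECT p.rok_wprowadzenia, SUM(s.cena)
    FROM "Sprzedaż" AS s
    JOIN Prod AS p ON p.prod_id = s.prod_id
    GROUP BY p.rok_wprowadzenia
    ORDER BY p.rok_wprowadzenia
'''

# Interleaved form: pre-aggregate sales per product (small intermediate
# table), join that with Prod, then group by the year of introduction.
GROUP_JOIN_GROUP = '''
    SELECT p.rok_wprowadzenia, SUM(t.suma)
    FROM (SELECT prod_id, SUM(cena) AS suma
          FROM "Sprzedaż"
          GROUP BY prod_id) AS t
    JOIN Prod AS p ON p.prod_id = t.prod_id
    GROUP BY p.rok_wprowadzenia
    ORDER BY p.rok_wprowadzenia
'''

plain_result = conn.execute(JOIN_THEN_GROUP).fetchall()
interleaved_result = conn.execute(GROUP_JOIN_GROUP).fetchall()
```

Both forms give the same totals because SUM is additive; the interleaved form joins one row per product instead of one row per sale.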
