Refreshing (updating) the data warehouse means keeping it consistent with the current state of the source data. This process directly determines how useful the information collected and aggregated in the warehouse is. The quality of the data provided to users, and of the decisions made on its basis, depends on the ability of the warehouse system to pick up and propagate changes in the source information within a reasonable time. An important part of optimizing a data warehouse is therefore finding the most effective answers to the following questions:
- How to detect changes in the source data?
- How to efficiently integrate the changed data into the data warehouse?
- How to update the derived structures (such as data marts and data cubes)?
Refreshing consists of the steps described in the previous lecture: extraction, cleaning, and integration of data. These processes must be repeated regularly (for example every day) for each new batch of data, to keep the data in the warehouse consistent.
Detecting changes in data sources
The diagram below shows the classification of data sources into types, depending on the technical possibilities for detecting new data.
The basic division is between cooperating sources (source databases), which we can modify and equip with mechanisms that help distinguish data already loaded from the rest, and non-cooperating sources, which we cannot modify. Depending on the source, the following change detection techniques are available:
- Snapshots: we create an external copy of the data and periodically compare it with the current state. In the worst case we have to duplicate the entire source database; sometimes it is enough to store only selected data, such as record keys.
- Log tracking: we can monitor the operations that modify data in the source database by following its activity log (e.g., the queries being processed). The log can also serve directly as the source of the data to be updated.
- Delta tables (tables of differences): tables in which the source database records added, deleted, and modified records. This requires active behavior from the source, such as a trigger mechanism.
- Notifying sources: the source itself supplies the data warehouse or the ODS with modified records (usually determined by means of snapshots).
- Replication of the data stream: intercepting the transactions written to the database is an efficient method that is transparent to the sources. It requires an ODS layer responsible for collecting the data captured in real time for later loading into the warehouse.
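The snapshot technique from the list above can be sketched as follows. This is only a minimal illustration, assuming that each snapshot fits in memory as a dictionary from primary key to record; the function name `diff_snapshots` and the sample data are hypothetical.

```python
from typing import Dict, Hashable, Tuple

Row = Tuple                      # the non-key attributes of a record
Snapshot = Dict[Hashable, Row]   # primary key -> record


def diff_snapshots(old: Snapshot, new: Snapshot):
    """Compare two snapshots and return (inserted, deleted, updated)."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated


# Snapshot stored at the previous refresh and the current state of the source.
previous = {1: ("Anna", "Krakow"), 2: ("Piotr", "Gdansk"), 3: ("Ewa", "Lodz")}
current = {1: ("Anna", "Warszawa"), 3: ("Ewa", "Lodz"), 4: ("Jan", "Poznan")}

inserted, deleted, updated = diff_snapshots(previous, current)
print(inserted)  # {4: ('Jan', 'Poznan')}
print(deleted)   # {2: ('Piotr', 'Gdansk')}
print(updated)   # {1: (('Anna', 'Krakow'), ('Anna', 'Warszawa'))}
```

The three result sets are exactly what the warehouse needs to load; the cost is that a full copy of the previous state has to be kept next to the source.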
Propagation of updates
Data warehouses usually contain a collection of materialized views derived from tables stored outside the warehouse. These views must be kept up to date with the changes made to the original tables.
A view is defined as a table that is the result of a query over other relations. A view may be virtual (the data are computed on request) or materialized (the data are computed in advance, so they can be used quickly). In the second case, the views must be refreshed whenever the data in the warehouse changes. Most views stored in data warehouses (e.g., data cubes in the MDDB model) are materialized, because queries have to be answered quickly. As for the moment at which views are refreshed, there are two strategies:
- Deferred update (on demand, at the first use after the data in the warehouse has changed):
  - the first query takes longer;
  - views that are never used do not have to be refreshed;
- Immediate update (when the warehouse is refreshed):
  - the batch update process takes longer;
  - the costly processing can be moved to night hours;
  - part of the update work may turn out to be unnecessary.
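The difference between the two strategies can be sketched with a small model. This is an illustration under simplifying assumptions: the base data is a list of numbers, the view is a single aggregate, and the class `MaterializedView` with its `refresh_count` counter is hypothetical.

```python
from typing import Callable, List, Optional


class MaterializedView:
    """A cached query result that is refreshed either immediately or on demand."""

    def __init__(self, query: Callable[[List[int]], int], immediate: bool):
        self.query = query
        self.immediate = immediate
        self.value: Optional[int] = None
        self.stale = True
        self.refresh_count = 0

    def _refresh(self, base: List[int]) -> None:
        self.value = self.query(base)
        self.stale = False
        self.refresh_count += 1

    def on_base_change(self, base: List[int]) -> None:
        # Immediate: pay the cost during the warehouse load.
        # Deferred: only mark the view as out of date.
        if self.immediate:
            self._refresh(base)
        else:
            self.stale = True

    def read(self, base: List[int]) -> Optional[int]:
        # A deferred view is recomputed at the first use after a change.
        if self.stale:
            self._refresh(base)
        return self.value


base = [10, 20]
eager = MaterializedView(sum, immediate=True)
lazy = MaterializedView(sum, immediate=False)

for new_value in (30, 40, 50):   # three batch loads with no reads in between
    base.append(new_value)
    eager.on_base_change(base)
    lazy.on_base_change(base)

print(eager.read(base), eager.refresh_count)  # 150 3
print(lazy.read(base), lazy.refresh_count)    # 150 1
```

Both views return the same value, but the immediate one was refreshed three times during loading, while the deferred one did a single refresh at the cost of a slower first read.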
The optimal solution requires analyzing how often the views are used and how much an immediate update costs. This is a complex combinatorial problem, and approximate solutions are computed with heuristic methods. Information about the view usage profile is stored in the metadata repository.
The update can simply re-execute the query that defines the view, but this is inefficient. A faster method is to modify the view using the data from the table of differences (delta table), which records all new, deleted, and changed values.
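The two ways of updating a view can be compared on a small example. The sketch below assumes a view defined as the sum of amounts per product and a delta table encoded as a list of signed entries; this encoding and the names `recompute` and `apply_delta` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Fact = Tuple[str, int]     # (product, amount)
Delta = Tuple[str, Fact]   # ("+", fact) for an inserted row, ("-", fact) for a deleted row


def recompute(facts: List[Fact]) -> Dict[str, int]:
    """Full refresh: re-execute the defining query (sum of amount per product)."""
    view: Dict[str, int] = defaultdict(int)
    for product, amount in facts:
        view[product] += amount
    return dict(view)


def apply_delta(view: Dict[str, int], delta: List[Delta]) -> Dict[str, int]:
    """Incremental refresh: touch only the groups mentioned in the delta table."""
    result = dict(view)
    for sign, (product, amount) in delta:
        change = amount if sign == "+" else -amount
        result[product] = result.get(product, 0) + change
    return result


facts = [("tea", 5), ("coffee", 7), ("tea", 3)]
view = recompute(facts)   # {'tea': 8, 'coffee': 7}

# A changed record is written as a deletion of the old value plus an insertion of the new one.
delta = [("+", ("cocoa", 4)), ("-", ("tea", 3)), ("+", ("tea", 6))]
incremental = apply_delta(view, delta)

new_facts = [("tea", 5), ("coffee", 7), ("tea", 6), ("cocoa", 4)]
print(incremental == recompute(new_facts))   # True
```

The incremental version reads three delta entries instead of the whole fact list. In this simplified form a group whose last record is deleted stays in the view with a sum of zero; removing it would require also keeping a row count per group.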
From the standpoint of OLAP applications, the refresh profile is as follows:
Fact tables:
- very frequent updates;
- only new records are added;
- numeric data (easy to maintain).
Dimension tables (dimensions and attributes):
- rarely updated;
- data are added, deleted, and changed;
- the change history must be kept (we record when and how the data changed).
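One possible way of recording when and how dimension data changed is to keep every version of a record together with its validity period. The sketch below is an assumption made for illustration, not a method prescribed by the lecture; the fields `valid_from` and `valid_to` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimensionRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def change_city(history: List[DimensionRow], customer_id: int, city: str, when: date) -> None:
    """Close the current version of the record and append the new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == city:
                return                # nothing changed, keep the history as it is
            row.valid_to = when
    history.append(DimensionRow(customer_id, city, valid_from=when))


def city_on(history: List[DimensionRow], customer_id: int, day: date) -> Optional[str]:
    """Look up the attribute value that was valid on a given day."""
    for row in history:
        if (row.customer_id == customer_id and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.city
    return None


customers = [DimensionRow(1, "Krakow", date(2020, 1, 1))]
change_city(customers, 1, "Warszawa", date(2021, 6, 1))

print(city_on(customers, 1, date(2020, 12, 31)))  # Krakow
print(city_on(customers, 1, date(2021, 6, 1)))    # Warszawa
print(len(customers))                             # 2
```

Because old versions are closed rather than overwritten, facts loaded earlier can still be analyzed against the attribute values that were valid at the time.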
View maintainability
The concept of maintainability refers to situations in which the new content of a view can be computed without calculating the whole view from scratch. The most convenient to update are self-maintainable views, that is, views whose new content can be computed from the current content of the view and the content of the table of differences alone. Many common types of operations turn out to be self-maintainable.
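The idea of computing the new view content only from the current content and the table of differences can be illustrated on aggregates. In the sketch below a view storing a sum together with a row count is updated from the delta alone, while a view storing a minimum cannot always be; the delta encoding and the function names are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Delta = List[Tuple[str, int]]   # ("+", value) inserted, ("-", value) deleted


def maintain_sum_count(view: Tuple[int, int], delta: Delta) -> Tuple[int, int]:
    """New (sum, count) computed from the old view content and the delta only."""
    total, count = view
    for sign, value in delta:
        if sign == "+":
            total, count = total + value, count + 1
        else:
            total, count = total - value, count - 1
    return total, count


def maintain_min(view: int, delta: Delta) -> Optional[int]:
    """New minimum from the old minimum and the delta, or None if that is not possible."""
    current = view
    for sign, value in delta:
        if sign == "+":
            current = min(current, value)
        elif value == current:
            # The minimum itself was deleted; conservatively give up,
            # because the next smallest value is known only to the base table.
            return None
    return current


base = [4, 9, 2]
delta = [("+", 7), ("-", 2)]

print(maintain_sum_count((sum(base), len(base)), delta))  # (20, 3)
print(maintain_min(min(base), delta))                     # None
print(maintain_min(min(base), [("+", 1)]))                # 1
```

With the sum and the count available, the average can be derived as well, so it is maintained without access to the base data. The minimum behaves this way for insertions only; after a deletion of the current minimum the base table has to be consulted.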