Basic architecture of a data warehouse.

Basic architecture of a data warehouse. Conceptual, logical and physical models.

Data Warehouse Architecture

The structure of a data warehouse consists of successive layers of data, each derived by processing the layer below it. The bottom layer is formed by the data sources: the company's operational databases, often geographically dispersed and heterogeneous in terms of access method (ordinary databases in different formats, binary or text files, special sources), logical structure, size and data quality.

The middle layer of the diagram is the central (basic, corporate) data warehouse. It is the primary, non-volatile store for information gathered from the sources, as well as for partial summaries useful in OLAP tasks and decision support. The global data warehouse records the history of the source data: during each periodic update, new, condensed information about the state of the data sources is added and stored alongside the earlier information. The data in a warehouse may go back many years, so the warehouse also serves an archival function.

The next layer consists of local warehouses, created for the users (e.g. the analytical department) and containing selected data in highly aggregated form. They allow fast presentation of summaries used in management, long-term planning, historical trend analysis, and the processing and analysis of integrated information. Local data warehouses are called thematic warehouses (data marts, departmental warehouses). Owing to their smaller size and their ability to work in local mode, they allow more efficient handling of data. They can be implemented as a relational database or as a special multidimensional structure.

Sometimes an intermediate layer, called the operational data store (ODS), is introduced between the data source layer and the global data warehouse. The ODS layer typically contains the results of transforming, integrating and aggregating data from the sources, and itself feeds the global data warehouse directly. The ODS is updated more frequently than the data warehouse and holds much fresher information, but its data are far less aggregated, which makes OLAP tasks difficult to perform on it. Creating an ODS layer can relieve the central data warehouse of tasks related to data updating; it is often also justified on technical grounds (e.g. significant geographical dispersion of the data sources).

An additional element of a data warehouse is the metadata repository. It is meant to hold the current and historical physical, logical and conceptual schemas of the warehouse, including the processes of extraction, transformation, aggregation, cleaning and storage of information, as well as the history of data usage.

Designing a data warehouse means creating conceptual, logical and physical models of the warehouse. These three levels are modelled for all elements of a data warehouse: the central warehouse, ETL processes, thematic warehouses, etc. The levels can be characterized as follows:

The conceptual model describes the structure, content and purpose of the data warehouse at the conceptual level, i.e. in terms of business objectives, using the natural-language terminology of the specialists responsible for the organization. A conceptual model may, for example, state that certain information about customers must be collected, define the concept of 'customer', and indicate the business objectives of the planned analyses.

The logical model is a description in terms of the logical elements of the warehouse's databases and processes: columns, tables, relationships, etc. A description at the logical level resembles a typical database design and is written in a notation such as UML.

The physical model describes the parameters that optimize the operation of the data warehouse, such as indexing, partitioning and data replication, as well as elements such as computer hardware, the network, backup systems, the placement of logical resources, etc.

In terms of the design approach adopted, we distinguish:


  • bottom-up design (from the particular to the general), in which designs are first created for the individual data sources, business divisions, user needs, etc., and are then merged into one overall design;
  • top-down design, in which we begin by creating a business model at the conceptual level and then gradually move down to the design of the required integration of source data.

The second method is more difficult and more expensive, but it avoids the pitfalls associated with integrating (potentially inconsistent) local models.

Note that a data warehouse project is not only an IT undertaking but also an organizational one. The design and implementation of a data warehouse must also establish procedures and instructions, data replication schemes, and the organization of backup storage and transport (e.g. restoring the operations center of one bank would require trucks to transport backup tapes from a building in another city), etc.

Thematic Warehouses (Data Marts)

Thematic warehouses are dedicated subsets of the data, selected and processed for specific types of analysis. Applications (and typical operations) of thematic warehouses include:

  • OLAP: multidimensional data cubes that can be pivoted, drilled down, ... (statistical summaries, reports, graphs)
  • Data mining (concise, targeted description of the data, or automatic classification and pattern detection tasks)
  • GIS (geographic / spatial information systems)
  • other business intelligence tasks (e.g. what-if analysis: what would happen if ...)

The multidimensional data model is one way to speed up typical OLAP summary operations, and is used especially in thematic warehouses. The most frequent analytical tasks are summaries (tables, graphs) of certain quantities, such as the quantity of goods or the amount of money, broken down by certain categories, often over different periods. A typical summary query is one written in SQL using the GROUP BY clause and aggregate functions (e.g. SUM, COUNT). So that such queries do not have to scan the whole (multi-terabyte) body of source data, part of the aggregation can be computed in advance and stored in the form of multidimensional tables, so-called data cubes. What follows is an introduction to multidimensional operations.
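The idea of computing part of the aggregation in advance can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table names (sales, sales_summary), the columns and the rows are assumptions made up for the example, not part of any particular warehouse.

```python
import sqlite3

# Hypothetical detailed data; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "food", 10.0),
        ("2024-01-01", "food", 5.0),
        ("2024-01-01", "toys", 20.0),
        ("2024-01-02", "food", 7.5),
    ],
)

# A typical summary query: it scans the detailed rows every time it runs.
summary_sql = """
    SELECT day, category, SUM(amount) AS total, COUNT(*) AS n
    FROM sales
    GROUP BY day, category
    ORDER BY day, category
"""
direct = conn.execute(summary_sql).fetchall()

# Pre-aggregation: compute the summary once and store the result.
conn.execute("CREATE TABLE sales_summary AS " + summary_sql)

# Later reports read the (much smaller) summary table instead of the details.
per_category = conn.execute(
    "SELECT category, SUM(total), SUM(n) FROM sales_summary "
    "GROUP BY category ORDER BY category"
).fetchall()

print(direct)
print(per_category)
```

Sums and counts can be re-aggregated from partial results in this way; measures such as averages would have to be derived from the stored sums and counts rather than averaged again.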

In the OLAP model, we assume that the database contains facts, described by dimensions and carrying measure values. A fact is the record of a single event of the kind being tracked (such as a sale, a single request to a web server, etc.). In the central data warehouse, such facts may be stored in one or more fact tables. The business characteristics of a fact, such as the type of product concerned or the moment at which the fact occurred, are its dimensions. A measure is a numerical attribute of a fact that is subject to summarization and is presented in tables or charts as the result of a summary (e.g. the value of a sales transaction). The multidimensional model consists in creating a synthetic n-dimensional table whose edges are described by dimensions and whose cells each contain a summarized measure. Such a table makes it convenient to summarize the source data: often it is enough to select two dimensions to obtain the statistical table required for a report.
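A minimal sketch of these notions in Python may help. The Fact class, the three dimensions (product, region, month), the helper functions and the sample figures are all hypothetical, chosen only to show how facts are aggregated into cells and how picking two dimensions yields a report table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    # Dimensions: business characteristics of the event.
    product: str
    region: str
    month: str
    # Measure: the numeric value that gets summarized.
    value: float


facts = [
    Fact("tea", "north", "2024-01", 10.0),
    Fact("tea", "north", "2024-01", 4.0),
    Fact("tea", "south", "2024-02", 6.0),
    Fact("coffee", "north", "2024-02", 8.0),
]

DIMENSIONS = ("product", "region", "month")


def build_cube(facts):
    """Aggregate facts into cells keyed by (product, region, month)."""
    cube = defaultdict(float)
    for f in facts:
        cube[(f.product, f.region, f.month)] += f.value
    return dict(cube)


def project(cube, keep):
    """Sum the cube over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    table = defaultdict(float)
    for key, value in cube.items():
        table[tuple(key[i] for i in idx)] += value
    return dict(table)


cube = build_cube(facts)
report = project(cube, ("product", "region"))  # a two-dimensional report table
print(report)
```

Here the cube is a sparse dictionary holding only non-empty cells; a real OLAP engine would use its own storage format.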

Tasks that go beyond what a thematic warehouse supports are performed on the central data warehouse (e.g. ad hoc queries formulated by the user in SQL). One direction for optimizing a system of thematic warehouses is to minimize the need for such queries: anything that can be computed in a thematic warehouse should be computed there.

Example of a multidimensional model

The source databases contain sales records read from supermarket cash registers. The data warehouse collects information from the source databases of the chain's 300 stores, located in many cities. Customers are identified by the discount cards they use when shopping.

The target cube is 4-dimensional.
A fact in this case is a single sale of one product to one customer (a line item on a receipt).
The dimensions describing the facts are: time, customer, product, shop.
The measures are the sales value and the number of units.
The level of aggregation in the cube corresponds to the level of detail of the dimension descriptions, e.g. time can be divided into days or into quarters, and products can be grouped into product groups.
The content of each cell is an aggregated measure (the total sales of a given product in a given shop, on a given day, to a given customer).
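The effect of choosing a coarser level for the time dimension can be sketched in Python as below. The receipt lines, card numbers and shop names are invented for illustration, and the dictionary-based cube is just one simple way to represent the cells.

```python
from collections import defaultdict
from datetime import date

# Receipt lines: (day, customer, product, shop, value, units).
# All identifiers and figures are made up for illustration.
receipt_lines = [
    (date(2024, 1, 15), "card-001", "milk", "shop-A", 3.0, 2),
    (date(2024, 1, 15), "card-001", "milk", "shop-A", 1.5, 1),
    (date(2024, 2, 3), "card-001", "milk", "shop-A", 3.0, 2),
    (date(2024, 4, 9), "card-001", "bread", "shop-B", 2.0, 1),
]


def by_day(d):
    """Finest time level: one member per calendar day."""
    return d.isoformat()


def by_quarter(d):
    """Coarser time level: one member per quarter."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def build_cube(lines, time_level):
    """Cells keyed by (time, customer, product, shop) hold (value, units)."""
    cube = defaultdict(lambda: [0.0, 0])
    for day, customer, product, shop, value, units in lines:
        cell = cube[(time_level(day), customer, product, shop)]
        cell[0] += value
        cell[1] += units
    return {key: tuple(cell) for key, cell in cube.items()}


daily = build_cube(receipt_lines, by_day)
quarterly = build_cube(receipt_lines, by_quarter)

print(len(daily), len(quarterly))
print(quarterly[("2024-Q1", "card-001", "milk", "shop-A")])
```

With quarters instead of days, the January and February milk purchases fall into the same cell, so the cube has fewer, more aggregated cells.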

Management Systems

Because of the requirements concerning the volume of collected data and the specific kinds of queries processed, the components of a data warehouse usually use more than one management system. As for the source databases and operational data stores, these are usually pre-existing systems (often very diverse) running on one of the common general-purpose database systems (such as Microsoft SQL Server, Oracle, DB2, MySQL and the like). In this case, the warehouse designer has little influence on the choice of solution.

A different type of database system is used in the central data warehouse. Here general-purpose systems are not sufficient, because their scalability is inadequate for data volumes measured in tens of terabytes. In addition, these systems are optimized more for OLTP, that is, for different performance requirements and query types. The central data warehouse therefore uses systems of the VLDB (very large database) class.

An example is the central data warehouse solution offered by Teradata. The solution consists of hardware and software modules, based on a dedicated UNIX variant and appropriate hardware design (such as optical connections). These modules provide high scalability (expanding the system only requires buying additional modules, each of which handles 0.5 TB of data). The whole is completed by a high-performance engine with data compression features.

A separate issue is the software used for OLAP access to data stored in thematic warehouses. The priority here is not handling large volumes of data (thematic warehouses usually hold only a slice of all the stored data) but rather optimizing typical queries and supporting the star schema summary model or multidimensional cubes. Below we discuss the types of solutions available.

Super-relational database management systems. These are database management systems extended to cooperate with OLAP tools by means of so-called super-relational functions: extensions to data storage formats, relational operations and indexing. The data in these systems is stored in a star or snowflake schema (discussed in subsequent lectures), which allows OLAP queries to be optimized automatically.
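As a rough preview of the star layout, the sketch below builds a tiny one in SQLite through Python's sqlite3 module. The table and column names (fact_sales, dim_product, dim_store) and the rows are assumptions for illustration only, and the example does not show any vendor-specific super-relational extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per product / per store.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: keys pointing at the dimensions, plus the measure.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'milk', 'dairy'), (2, 'cheese', 'dairy'), (3, 'bread', 'bakery');
    INSERT INTO dim_store VALUES (1, 'CityA'), (2, 'CityB');
    INSERT INTO fact_sales VALUES (1, 1, 3.0), (2, 1, 7.0), (3, 1, 2.0), (1, 2, 4.0);
""")

# An OLAP-style summary: join the fact table to its dimensions, then group.
rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()

print(rows)
```

The fact table sits at the centre and each dimension table is one join away, which is what gives the layout its name.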

ROLAP (Relational OLAP) architecture. This is a way of accessing data with tools for analysing multidimensional information (data cubes) in which the data source for queries is a hidden, internal relational structure (as in super-relational systems). The user sees a clear data model without needing to know the underlying star or snowflake schema. An example of such a system is the data warehousing solution offered by MicroStrategy. Other ROLAP servers are Red Brick (Informix / IBM) and Sybase. On the assumption that data warehouses should be built gradually, bottom-up, vendors offer ROLAP systems not only for thematic warehouses but also, more generally, for distributed data warehouses.

Multidimensional database systems (MDDB) mirror the way OLAP represents and processes data. An MDDB stores data directly in the form of multidimensional cubes. Each dimension represents one aspect of the data. For example, sales data for a chain of stores may have product, time and place dimensions. With this design, MDDB systems do not need a join operation to process a query about sales along one of these dimensions. For OLAP applications MDDB systems therefore tend to be much more efficient than traditional database systems, but the stored information is harder to update, and the cube itself can grow to a considerable size. Examples of such solutions are Cognos PowerPlay, Business Objects and Brio (inexpensive MDDB systems, so-called desktop OLAP, which are only an interface to the data warehouse). More complex solutions such as HOLAP (hybrid OLAP) provide full integration of a relational data warehouse (where the priority is scalability) with the multidimensional model (where the priority is the efficiency of OLAP tasks) within one complex architecture. Here the leading systems are Hyperion Essbase, Oracle Express and Microsoft OLAP.
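The trade-off of storing a cube directly can be illustrated with a toy dense array in Python. The member lists, the row-major layout and the helper names are assumptions for this sketch; real MDDB products use their own, more elaborate storage formats.

```python
# Hypothetical dimension members; a real cube would have far more.
products = ["milk", "bread", "tea"]
months = ["2024-01", "2024-02"]
cities = ["CityA", "CityB"]

# Position of each member along its axis.
p_idx = {name: i for i, name in enumerate(products)}
m_idx = {name: i for i, name in enumerate(months)}
c_idx = {name: i for i, name in enumerate(cities)}

# Dense storage: one cell per combination, whether or not any sale occurred.
cells = [0.0] * (len(products) * len(months) * len(cities))


def offset(product, month, city):
    """Row-major offset of a cell in the flat array."""
    return (p_idx[product] * len(months) + m_idx[month]) * len(cities) + c_idx[city]


def add_sale(product, month, city, amount):
    cells[offset(product, month, city)] += amount


def total_for_product(product):
    """Sum over month and city: a contiguous slice, no join required."""
    block = len(months) * len(cities)
    start = p_idx[product] * block
    return sum(cells[start:start + block])


add_sale("milk", "2024-01", "CityA", 3.0)
add_sale("milk", "2024-02", "CityB", 4.0)
add_sale("tea", "2024-01", "CityA", 5.0)

print(total_for_product("milk"))
print(len(cells), sum(1 for c in cells if c != 0.0))
```

A cell is found by index arithmetic alone, which is why no join is needed. In return, the array reserves a cell for every combination of members, so its size grows with the product of the dimension sizes even when most cells are empty.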
