How is it different from the usual data warehouse databases?

How is it different from the usual data warehouse databases? Typical applications.

What is a data warehouse?

B. InMon, one of the founders of the naming and data warehousing methodologies, so in 1996, defined this term:

The data warehouse is a collection of integrated, non-volatile, targeted databases, which are used in decision support systems.

Since then, the theory and practice of data warehouse developed considerably, however, the above characterization is still valid. Full-scale data warehouse is a complex undertaking organizational and business, generally considered to be costly because it takes time, software and hardware. On the other hand, the market appears more and more open source modules, which can build a relatively cheap and effective solution, usually "almost" sufficient for the applications.

Data Warehouse Center is a collection of data:

  • integrated - that is expressed in a common language and concepts of database schema, regardless of the source data;
  • non-volatile - that is intended for long term storage, with archival value, in principle not subject to changes other than adding new portions of the data;
  • targeted - that is organized in a manner designed to facilitate the specific analysis tasks that are optimized for processing specific types of queries.

The primary objectives of which it is built data warehouses are:

  • Analytical processing of data (On-Line Analytical Processing, OLAP) - the query (in SQL or other languages ​​supporting this type of operation), allowing the performance of statistical summaries, graphs and reports, summarizing large amounts of data.
  • Support of Decision (Decision Support, DS) - perform more complex analysis, simulation, business scenarios, etc. Related term: Business Intelligence (BI). Decision support can also be combined with more advanced and automated projects related to knowledge discovery in databases (Knowledge Discovery in Databases, KDD).
  • Centralization of data - collecting detailed data coming from various sources, often associated with OLTP databases, often processed and integrated with tools Extract Transform Load (ETL), to provide a wide range of data warehouse for each theme, OLAP or data mining tools (DM .) In this sense, the data warehouse is a central point for the enterprise data infrastructure and application, called the Corporate Information Factory (CIF).
  • Archiving - performed because of legal requirements (some institutions are required to store certain data), where fast data access via SQL is still valid.


OLAP and OLTP

OLAP tasks can be described by comparing it with another popular type of data processing in database systems: OLTP (On-Line Transactional Processing, Transaction Processing). Transaction processing is done in current operations (production) company databases used for daily work. Systems are optimized for maximum performance transactional, high parallelism, and availability. Examples: a banking system that supports reading and modification of customer account balances, financial and accounting system that supports and connected to the supermarket cash registers, database-enabled active content portal, mobile billing system, etc.

Basic features of OLTP systems are:

  • perform a large number of simple queries from many users (often these are the hundreds of queries per second)
  • database system should be optimized for rapid retrieval
  • frequent operations to add, delete and modify individual records
  • requires immediate access to current information

OLAP is processed primarily to create reports (usually predefined) including graphs and tabular. This type of processing is usually designed for a different kind of users: managers, analysts, administrators. Examples: report sales growth of products in different countries, for which the source is a single record of all transactions at the registers 100 stores the network for the past three years, reports summarizing sales and commissions of the bank's customers broken down by month, the types of fees and client groups, a typical monthly statistics Internet traffic to Web servers.

Basic features of OLAP systems are:

  • small number of queries, but for large amounts of data (summaries, etc., may be questions asked every few minutes through a few hundred users)
  • These systems generally only read the information from the database, if the OLAP system is logically separated from the transactional database, the information is periodically replenished (adding a large group of new records)
  • we do not assume the full news information: the data may be available with a lag (best known in advance, eg day), and the same calculation can last from seconds to hours.

The discrepancy between the processing requirements of OLTP and OLAP-type justifies the separation of these tasks. This is one of the reasons why data warehouses are created - separate (logically and physically) systems that use other types of database engines, designed differently constructed content than the production systems (trading) companies. On the other hand, the data warehouse user requirements they put more and more of the elements OLTP to data warehouse functionality, which can be associated, for example, the need to generate reports in real time while loading new data, and likewise with the need for operational reports by many users simultaneously , which is associated with more and more recently popular concept of Operational BI.

Decision Support

Processing OLAP is a tool for decision support. Decision support systems (decision support systems) are created for business executives and analysts looking to minimize costs, more targeted advertising, improve product quality and customer service, risk prediction, etc.
Examples of the types of questions that the system should be able to respond, it (starting from the simplest):


  • what was the sales broken down by month?
  • what group of customers generate 80% of the market?
  • What is the structure (histogram) of purchases?
  • what are the characteristics of our best customers, taking into account these CHARACTERISTICS that we collect about them?
  • how to automatically detect which customers are likely soon going to give up our services?
  • Some of these questions can easily be expressed in SQL, others require a statistical tool, as well as KDD tools and artificial intelligence, eg decision trees, regułowych systems, neural networks, etc.


One of the important segments using decision support systems, OLAP, as well as other data warehouse options, is a CRM (Customer Relationship Management, Customer Relationship Management). It is this department of the company (and its information system), which is responsible for the retention of best customers and increasing sales to existing customers. CRM software solutions and organizational measures to reduce the risk of losing customers (it is assumed that acquiring new customers is expensive and difficult than maintaining the current customer loyalty). The basic tasks of CRM is gathering customer information, improve relationships with customers (via a centralized, comprehensive information about the client, available in multiple, geographically dispersed points of sale). CRM systems also provide technical support (as a data source) marketing campaigns, loyalty, etc.

Business Summary

The basic problem for which a cure is to be building a data warehouse, the continuous arrival of heterogeneous information across the enterprise. The increasing technical and progressive computerization makes this growth is exponential - it is estimated that the amount of data doubles every two years. Today there are many thousands of organizations, which exceed the size of data warehouses, 2 TB, and a 5 TB wholesalers are beginning to be regarded as "small". At the same time demand for the results of complex analyzes performed on both historical data and the freshest.

All these factors drive the boom in the data warehouse. It is a growing market for almost 20% annually and there is no reason to expect in the medium term to reduce interest. Building a data warehouse is often mentioned in business plans, which still do not use them.

Of course, considerable investments associated with building a data warehouse must be paid. Basic benefits of building a data warehouse access to integrated information about the company, helping to make decisions and simplifies the task of marketing departments, analysts and managers. Often, the application lists the data warehouse and CRM applications such as Business Intelligence, whose use may be in a tangible way to reduce the number of customers leaving the competition. In the banking sector data warehouses provide data for fraud detection systems (eg credit card transactions). In telecommunications, building a data warehouse can help meet their legal obligation to store billing data for a long time, while allowing quick access to historical information for the client.

Current trends

Current trends in the data warehouse is based on several further ways of development:

  • Friendly and easy to use tool for analysis
  • Mentioned in the introduction of open source software for data warehousing and ETL, OLAP, BI and DM
  • Extending the functionality of the classic data warehouse in the direction of the basic elements of transactional
  • Extending the functionality of the classic data warehouse towards the ad hoc analysis, not supported by the OLAP
  • New technologies such as data compression, columnar organization, Cloud Computing, or such as MapReduce
The existing technical and business assumptions about the data warehouse are:


  • The typical amount of data:> 10 TB.
  • Typical load queries: about 10 thousand. OLAP queries per day, up to 100 users (analytical).
  • Refreshing the data usually diurnal cycle, the daily gain is> 10 GB, which is charged during the night session.
  • Typical projects: 1 to 3 years.
  • Typical budget: a few million. USD, of which 60% for equipment, 16% of the database software.

Examples of trends in changes to the above assumptions:

Typical budget: a strong emphasis on expenditure restraint, both hardware and software houses.
Refreshing data more often required at short intervals, during the "almost real".

No comments: