Source Data Integration

The integration process is regarded as the most important aspect of a data warehouse. It consists of removing inconsistencies and redundant information from the data passed to the warehouse from the operational environment, which yields a single, consistent image of the data collected by the institution. The integration process involves many aspects, in particular three basic perspectives:


  • Integration from the conceptual perspective. Data integration at the conceptual level establishes a common language that translates the business terms of the conceptual model into the objects found in the sources. It amounts to building an integrated enterprise schema (called a business model) at the level of language (business and technical terms).
  • Integration from the logical perspective. This means resolving the conflicts that arise when different schemas use different terminology for the same data. We can distinguish homonyms (the same name used to refer to different concepts) and synonyms (different names referring to the same concept). Logical integration problems also concern the naming of individual columns and tables, and differences in the modeling methods and database structures used.
  • Integration from the physical perspective. This is the implementation of physical connections between systems, the agreement of technical methods of data transport, and the specific ETL processes (mechanisms).
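
The synonym and homonym conflicts from the logical perspective can be illustrated with a minimal sketch. The source names ("crm", "billing") and all column names are hypothetical, chosen only for the example: synonyms are mapped to one warehouse name, and unmapped columns are prefixed with the source name so that homonyms do not collide silently.

```python
# Hypothetical per-source mappings from local column names to warehouse names.
COLUMN_MAPS = {
    "crm": {"client_id": "customer_id", "client_name": "customer_name"},
    "billing": {"cust_no": "customer_id", "name": "customer_name"},
}

def to_canonical(source, record):
    """Rename the keys of one source record to the shared warehouse vocabulary."""
    mapping = COLUMN_MAPS[source]
    # Columns without a mapping keep their name, prefixed with the source,
    # so that homonyms from different systems cannot collide silently.
    return {mapping.get(k, f"{source}.{k}"): v for k, v in record.items()}

# "client_id" and "cust_no" are synonyms; "status" is a homonym.
crm_row = to_canonical("crm", {"client_id": 7, "client_name": "Acme", "status": "A"})
billing_row = to_canonical("billing", {"cust_no": 7, "name": "Acme", "status": "paid"})
```

A real warehouse would usually keep such mappings in metadata tables rather than in code; the dictionary here only stands in for that.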


Examples of conceptual integration problems:

  • Who is a "customer"? Do all source databases understand this concept in the same way? How do we transform the "customer" from a foreign branch's database into the "customer" as defined in our warehouse? How do we match the sets of attributes describing customers in different source systems?
  • The "sales fact" is identified with an invoice in one system and with the loading of the goods in another.


Data integration at the conceptual level also includes selecting information (omitting data that is irrelevant to the warehouse's assumed conceptual model), removing redundancy in the source data, and defining rules for checking the correctness and quality of data entering the warehouse (e.g. a minimum completeness factor).
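
As an illustration of a quality rule such as a minimum completeness factor, here is a minimal sketch. It assumes that completeness means the fraction of required field values that are filled in, and the threshold of 0.75 is a hypothetical value; the notes themselves do not define either.

```python
def completeness(records, required_fields):
    """Return the fraction of required field values that are actually filled in."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # nothing required, so nothing is missing
    filled = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) not in (None, "")
    )
    return filled / total

MIN_COMPLETENESS = 0.75  # hypothetical threshold, not taken from the source

batch = [
    {"name": "Anna", "city": "Warsaw"},
    {"name": "Jan", "city": ""},  # missing city
]
ratio = completeness(batch, ["name", "city"])
accepted = ratio >= MIN_COMPLETENESS
```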

Data Extraction

On the technical side, cleaning and homogenizing raw data is the job of ETL (Extraction, Transformation, Load) programs. Their task is to retrieve and process data from very different sources:


  • relational databases (e.g. transactional systems)
  • data from legacy systems already existing in the enterprise
  • text files, spreadsheets, recording equipment, binary files, etc.


The extraction process consists of selecting the information that should reach the warehouse and then obtaining it from the source databases. This may require complicated procedures, for example to track data changes over time. An example architecture: an intermediate database periodically polls the sources using SQL, and the collected data are then transformed and loaded into the warehouse.
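
The polling architecture can be sketched with two in-memory SQLite databases standing in for the source and the intermediate (staging) database. The table names, the `updated_at` column and the high-water-mark approach are assumptions made for the example; a source without a reliable modification timestamp would need a different technique.

```python
import sqlite3

# Stand-in for an operational source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 25.5, "2024-01-02")],
)

# Stand-in for the intermediate database.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE orders_stage (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")

def poll(last_seen):
    """Copy rows changed after last_seen into staging; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # INSERT OR REPLACE keeps staging consistent if a row is picked up again.
    staging.executemany("INSERT OR REPLACE INTO orders_stage VALUES (?, ?, ?)", rows)
    staging.commit()
    return rows[-1][2] if rows else last_seen

mark = poll("")  # first poll: every row is new
source.execute("INSERT INTO orders VALUES (3, 7.0, '2024-01-03')")
mark = poll(mark)  # second poll: only order 3 is copied
staged = staging.execute("SELECT COUNT(*) FROM orders_stage").fetchone()[0]
```

ISO-formatted dates are compared as strings here, which works only because that format sorts chronologically.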

The problem of selecting the relevant portions of data to load will be discussed in the lecture on the propagation of updates. For now, here are a few example solutions:


  • Sometimes the data explicitly carry the date on which they were entered into the source systems, or daily updates are made available.
  • If we can modify the source systems, we can install triggers (procedures executed automatically on the source systems). We can also add information (e.g. a column) stating whether a record has already been loaded, or implement a table of differences.
  • If we cannot interfere with the source systems, we must remember which data have already been loaded (a snapshot mechanism: we remember a range of keys, a full list of keys, checksums, or in extreme cases keep a copy of the source data).
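
The snapshot mechanism based on checksums might look roughly like the sketch below. It assumes every row has a unique key column named `id` and that rows fit in memory as dictionaries; both are simplifications made for the example.

```python
import hashlib

def row_checksum(row):
    """Hash a row's values in a fixed (sorted) column order."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_changes(source_rows, snapshot, key="id"):
    """Compare current source rows with the remembered {key: checksum} snapshot."""
    current = {row[key]: row_checksum(row) for row in source_rows}
    new = [k for k in current if k not in snapshot]
    changed = [k for k in current if k in snapshot and current[k] != snapshot[k]]
    deleted = [k for k in snapshot if k not in current]
    return new, changed, deleted, current

day1 = [{"id": 1, "city": "Lodz"}, {"id": 2, "city": "Gdansk"}]
day2 = [{"id": 1, "city": "Krakow"}, {"id": 3, "city": "Poznan"}]

# The first load has an empty snapshot, so every row counts as new.
_, _, _, snapshot = detect_changes(day1, {})
# The second load compares against the checksums remembered from the first.
new, changed, deleted, snapshot = detect_changes(day2, snapshot)
```

Only the keys and checksums are stored between loads, which is much cheaper than keeping a full copy of the source data.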

Some sources can be queried (e.g. with SQL); others require special programs for extracting raw data (wrappers). These programs "wrap" the source systems and allow them to be queried, or introduce active elements that facilitate loading. A wrapper can be arbitrarily complex: consider a hospital data warehouse that receives, for example, information about patients' examinations and treatments. To extract information useful in OLAP and KDD analysis, we need natural language recognition tools (for analyzing patient records) and dedicated image recognition systems.
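
At the simple end of the scale, a wrapper may do no more than make a text file queryable. The sketch below assumes comma-separated input with a header row; the function name `wrap_csv` and the sample data are hypothetical.

```python
import csv
import io
import sqlite3

def wrap_csv(text, table, columns):
    """Expose delimited text as a queryable in-memory SQL table."""
    conn = sqlite3.connect(":memory:")
    # Table and column names come from trusted code here, not from user input.
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    return conn

raw = "patient_id,treatment\n1,MRI\n2,X-ray\n1,ECG\n"
conn = wrap_csv(raw, "treatments", ["patient_id", "treatment"])
# Values are loaded as text, so the key is compared as a string.
count = conn.execute(
    "SELECT COUNT(*) FROM treatments WHERE patient_id = '1'"
).fetchone()[0]
```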

Transformation and data cleansing

Data transformation consists of all operations that adapt the content and format of the data to the needs of the warehouse. Cleaning the data may include, for example:


  • filling in empty values if the column does not allow them
  • changing the format (dates, numbers)
  • changing values (e.g. unit conversion)
  • unifying values (e.g. using dictionaries: detecting typing errors, using one name in place of several alternatives)
  • maintaining (referential) data integrity

Example: sex may be stored as "M / K" in some systems and as "M / F" in others. Systems can also differ in the format of dates, the decimal separator, and even the character encoding (ASCII / EBCDIC).
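
These differences can be handled by small normalization functions, sketched below. The example assumes that "K" denotes female (Polish "kobieta"), that the warehouse stores dates as ISO 8601 strings, and that numbers carry no thousands separators; `cp037` is one of several EBCDIC code pages available in Python and is used only as an illustration.

```python
from datetime import datetime

# Assumed code table: "K" and "F" both mean female, "M" means male.
SEX_CODES = {"K": "F", "F": "F", "M": "M"}

def normalize_sex(value):
    """Map a source-specific sex code to the warehouse code."""
    return SEX_CODES[value.strip().upper()]

def normalize_date(value, source_format):
    """Parse a date in the source's format and emit ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, source_format).date().isoformat()

def normalize_decimal(value):
    """Accept either ',' or '.' as the decimal separator (no thousands separators)."""
    return float(value.replace(",", "."))

def decode_text(raw, encoding):
    """Decode bytes from a source encoding, e.g. 'cp037' for one EBCDIC variant."""
    return raw.decode(encoding)
```

For example, `normalize_date("31.12.2023", "%d.%m.%Y")` and `normalize_date("12/31/2023", "%m/%d/%Y")` both yield the same warehouse value.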

Other examples of data cleaning operations:


  • Filling in a missing postal code based on the address.
  • Detecting and (if possible) repairing errors of spelling, of vocabulary (an error in the name of a city when we have a complete list in the database), and of format (a non-existent phone number with too few digits).
  • Normalizing values: replacing the string "" or a single space with NULL.
  • Separating the first and last name stored in one common text field.
  • Supplementing data from external information sources (e.g. adding a missing postal code using the address and postal code directories).
  • Heterogeneity may result from the different origins of the data (e.g. technical or design assumptions in the source databases). Differences may also arise from ordinary errors made when filling in fields.
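
A few of the cleaning operations listed above can be sketched as follows. The rules are deliberately naive and rest on assumptions: names are split on the last space, and a phone number is taken to be valid when it has exactly nine digits (an assumed national format).

```python
import re

def blank_to_none(value):
    """Replace empty or whitespace-only strings with None (loaded as NULL)."""
    if value is None or value.strip() == "":
        return None
    return value.strip()

def split_name(full_name):
    """Split 'First Last' on the last space; ignores titles, suffixes, etc."""
    first, _, last = full_name.strip().rpartition(" ")
    return (first or None, last)

def valid_phone(value, digits=9):
    """Check the digit count only; 9 digits is an assumed national format."""
    return len(re.sub(r"\D", "", value)) == digits
```

Records that fail such checks are typically flagged for review rather than silently corrected, since an automatic repair can introduce new errors.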


Loading data

Loading data into the warehouse is primarily a technical issue. After transformation, the data are gathered in the central data warehouse; the main constraint here is the performance of the whole process. Loading may involve several gigabytes per day, so it is necessary to optimize loading, parallelize the work, and take any other steps that shorten the loading process. Data loading is usually performed at night, when the warehouse is least loaded with other tasks. Often the warehouse is taken out of normal operation for the duration of the load.

Methods for loading data:


  • transforming and loading record by record: the simplest method, but so inefficient that it is hard to imagine it in a large data warehouse;
  • external processing (merging, sorting) followed by loading of the finished data, often with the use of ODSs (operational data stores);
  • use of dedicated, efficient mechanisms of the target database (binary loading of finished records in the warehouse engine's proprietary format). Sometimes the data stream is also duplicated (individual data-modifying transactions are intercepted) and the insert process is repeated in operational data stores, so that the ready, integrated data can later be moved directly to the warehouse.
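
The difference between the first two methods can be sketched with SQLite standing in for the warehouse; the `sales` table is hypothetical. Dedicated bulk loaders of real warehouse engines are vendor-specific and are not shown here.

```python
import sqlite3

def load_row_by_row(conn, rows):
    """Simplest method: one INSERT and one commit per record."""
    for row in rows:
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
        conn.commit()

def load_batch(conn, rows):
    """Data sorted outside the database, then inserted in a single transaction."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?)", sorted(rows))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
load_row_by_row(conn, [(1, 10.0), (2, 20.0)])
load_batch(conn, [(4, 40.0), (3, 30.0)])
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
```

On a database stored on disk, committing once per batch instead of once per record usually accounts for most of the speed difference between the two functions.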
