Quality factors of data warehouse

Database metadata

Database metadata (metadata repository) is a special data store, that is not part of the data on the company. It is used in all phases of operation of the data warehouse. Content metadata database includes the following aspects of the data warehouse:


  • The prospect of conceptual (business data)
  • Logical perspective (diagram)
  • Physical Perspective
  • Statistics Data
  • Usage Statistics
  • Administrative information
Here, we describe each of these aspects.

Conceptual perspective:

- A formalized description of the contents of warehouses in terms of specific company (real objects, such as customer, vendor).
- Description of the conceptual integration of data.
- Description (business) to the user (eg reports).

This description can be formalized (with auto use to apply) or be expressed in natural language.

Example: Assertions międzymodelowe - defined relationships between concepts occurring in the different models. For example, we can write that the counterparty (the notion of belonging to the model of one of the sources of data) is a special case of the client (the concept of business model). This allows you to quickly locate all the sources from which information about customers.

Logical perspective:

 Diagram of the data in a central data warehouse sources and warehouse theme, in particular:


  • The names of tables and columns,
  • Data types,
  • Types of relationships between tables,
  • Definitions prospects materialized,
  • Definition of referential integrity
  • Definitions of data cubes, dimensions and hierarchies,
  • Description of the logical process of loading data.


Physical perspective:


  • The definition of the physical location of data and their flow.
  • The basic objects of physical perspective: data warehouses and agents.
  • Information about the location of pieces of data that make up the logical data source.
Definitions of agents:

  • Control (such as triggers, notification programs)
  • Transport (carrying out the procedures for loading, cleaning and transforming data, and the propagation of an update)
  • All three perspectives are related metadata - such as data warehouse stores the logical part of the array, contained in the notion of conceptual perspective.


Statistics data and their use:


  • The type of metadata being updated very often (at each data update.)
  • Statistics of data (such as histograms of the size of arrays) are used in process optimization, query and physical model data.
  • Usage Statistics logs (optional) allow you to assess how often redundant information is used (outlook materialized), and locate the repeated inquiries that have not yet been optimized.
  • Ability to automatically optimize operation.
  • Registration of changes in the data and the metadata.


Administrative Metadata:


  • Access to data, definitions of users and their groups.
  • The period for exercise of cyclical activity in the data warehouse (update from source, backup).
  • Other, for example, information about the owners of individual data.


Attempts to complete the standardization of metadata, similar to standard SQL access to data so far failed. Individual database engine manufacturers offer their own solutions for metadata.

Microsoft Repository / Meta Data Engine:


  • Metadata schemas using UML.
  • Used mechanisms: COM (Common Object Model), OEM (Object Exchange Model).

Metadata Interchange Specification (MDIS):


  • Text files with a specific structure (extensible specification).
  • Objects "database", "element", "relationship", etc. For example you can describe the relationships between tables or columns using keywords Equivalent, INCLUDES, DERIVED, etc.


Telos (ConceptBase systems, Semantic Index):

Universal data description language that offers composite objects, hierarchies, relationships and mechanisms of automatic reasoning.

Other: OIM (Open Information Model), CWM (Common Warehouse metamodel)

Quality factors

The need to quantify the quality of the data warehouse is due to several reasons:

determination and verification of design assumptions
determine the direction of the warehouse (elimination of weak points)
identify sources of potential operational problems.
Measures of the figures (obtained from metadata) objectify assessment of the quality and allow you to automate part of the required quality. However, note that the data warehouse design and the overall quality is always the task of multi-criteria optimization: you can not simultaneously optimize all the factors (eg increasing the speed obtained by the materialization of multiple perspectives increases the refresh time, and thus reduces the availability of warehouse). In addition, various quality factors have different meanings for different groups of users.

Here is the list of top-quality factors, the data warehouse in its various aspects. Some factors may take the form of a numerical and be calculated automatically.

Data:

Accuracy: the percentage of data with values ​​consistent with reality.
Completeness of data on the percentage of non-empty values ​​(among those that actually have non-empty value).
Consistency: the degree of compatibility of data formats, such as the percent entries inconsistent with established format.
Verifiability: the amount of data, whose quality can be checked (the percentage of records, columns, etc.).
Time:

Price: the percentage of time with values ​​in line with reality.
Volatility: the percentage of data with values ​​that meet the age requirements (eg data content of the magazine are, at most daily).
Freshness: The percentage of data values ​​entered into the warehouse at the right time.
Action:

Availability of transaction: the percentage of time the system is fully wykorzystywalny (eg, does not update the data).
System Availability: The percentage of time the system is on and takes command.
Availability: the percentage of data (tables, perspectives, columns) which can be used by authorized users.
Interactivity: ease of communication with the user.
Security: the possibility of authentication, resistance to unexpected errors.
Project:

The correctness of the scheme: discrepancy between the real world and the conceptual and logical model.
The completeness of the scheme: the extent to which the model covers the entire enterprise.
Minimality: the simplicity of description of reality.
Interpretowalność: transparency model.
Verifiability: the existence of methods to detect discrepancies.
The quality of metadata: completeness, ease of access and expansion of the database metadata.
Tools:

Speed: the system performance when processing queries.
Functionality: compliance with the requirements of end users.
Usability: ease of use and rich features.
Reliability: the number of common errors and their diagnostics.
Ease of maintenance.
Portability.

From the perspective of the designer and administrator most important are:
- The quality of schemes: correctness, completeness, minimality, auditability, interpretowalność, consistency
- Quality Metadata: easy evolution of metadata
- The quality of implementation of software: functionality, reliability, performance, portability, ease of maintenance.

From a user perspective the most important are:

- Accessibility (easy access via optional)
- Availability of systems and transaction
- Usefulness, safety, appropriateness of time, accuracy, elusiveness.

No comments: