Supporting data mining tasks

The tasks of KDD

The KDD process (Knowledge Discovery in Databases) involves many steps, from preparing the data to interpreting the results of the analysis. Classically, the KDD process is divided into the following steps:

I. Understanding the problem domain and the goals of the analysis
II. Building a working data set
III. Data preparation and cleaning
IV. Choosing a data analysis method
V. Data mining
VI. Interpreting the regularities found
VII. Using the discovered knowledge

Well-known data mining methods, such as decision trees, decision rules and clustering methods, assume that the training data is available in the form of a table, i.e. an array with a fixed set of columns and with records containing complete information about the objects of interest. With a data warehouse, we do not have to worry about collecting, cleaning and verifying the data. The problem in this case is to determine which tables and columns can serve as the source of objects and their attributes, and how to find or calculate the decision value for the training objects. Sometimes this requires considerable effort, as the information may be scattered across a variety of tables connected by a complex relational structure.

Example: car sales

Our chain of stores sells, among other things, cars. These are rare purchases, but very profitable for us. We want to identify customers who are likely to buy a car from us over the coming months. We have a data warehouse that supports, among other things, our CRM system, so we have a database of our customers, their purchases, etc.

Decision table: the object is a customer at a given point in time (i.e. client + date).

Decision: will the client buy a car within the next three months?

The decision table should contain both positive and negative cases (far more of the latter are available).

The attributes of the objects (clients), i.e. the information that is input to the data mining algorithms, often need to be calculated from other columns, and may include for example:

place of residence, date of birth, education (these features can be found ready-made in one of the tables)
total turnover, the number of purchases, the number of products of certain types purchased, etc. - before the moment of analysis (these we have to calculate).

A decision table constructed in this way can serve as input to algorithms that build decision trees (which give us a meaningful description of the data and, at the same time, an automatic classification algorithm).
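The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the purchase records, the attribute names and the 90-day approximation of "three months" are hypothetical and do not come from any real warehouse schema. Each object is a (customer, analysis date) pair, the attributes are computed only from purchases made before the analysis date, and the decision is taken from the period that follows it.

```python
from datetime import date, timedelta

# Hypothetical warehouse extract: (customer_id, purchase_date, product_type, amount)
PURCHASES = [
    (1, date(2023, 1, 10), "accessory", 120.0),
    (1, date(2023, 2, 5), "service", 300.0),
    (1, date(2023, 4, 20), "car", 25000.0),
    (2, date(2023, 1, 15), "accessory", 80.0),
    (2, date(2023, 9, 1), "car", 18000.0),
    (3, date(2023, 2, 1), "service", 150.0),
]

# Assumption: "the next three months" is approximated as 90 days.
HORIZON = timedelta(days=90)


def decision_row(customer_id, analysis_date, purchases=PURCHASES):
    """Build one decision-table row for the object (customer, analysis_date)."""
    own = [p for p in purchases if p[0] == customer_id]
    # Attributes may use only purchases made before the moment of analysis.
    past = [p for p in own if p[1] < analysis_date]
    # The decision looks at the horizon that starts at the moment of analysis.
    future = [p for p in own if analysis_date <= p[1] < analysis_date + HORIZON]
    return {
        "customer_id": customer_id,
        "analysis_date": analysis_date,
        "total_turnover": sum(p[3] for p in past),
        "n_purchases": len(past),
        "n_services": sum(1 for p in past if p[2] == "service"),
        "buys_car": any(p[2] == "car" for p in future),
    }


def build_decision_table(customer_ids, analysis_dates):
    """One row per (customer, date) pair, giving positive and negative cases."""
    return [decision_row(c, d) for c in customer_ids for d in analysis_dates]


table = build_decision_table([1, 2, 3], [date(2023, 3, 1), date(2023, 6, 15)])
```

Taking the same customer at several analysis dates is what produces both positive and negative cases; keeping the attribute window strictly before the analysis date prevents information about the future purchase from leaking into the attributes.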

Example: medical diagnosis

Hospitals collect large amounts of data about patients, the tests and treatments performed, as well as information on finances, personnel, administration, inventory, etc. The first step of a KDD process that aims to use this information for diagnostic purposes is to separate out a thematic data warehouse covering only strictly medical matters, together with other useful information about the patients (e.g. region of residence, lifestyle, occupation, family situation). Building such a thematic warehouse may involve implementing complex wrapper-type systems (analysis of natural-language descriptions, analysis of medical images such as ultrasound images, analysis of signals such as ECG). The next step is to construct a separate, cleaned data table (often in the form of thematic sub-warehouses, one for each medical issue) that will serve as the decision table in the data mining process.

The main problems analysts face when working with medical data are:

A small number of objects, at least as long as the object in the decision table is a patient (and not a patient at a specific point in time). This makes both the choice of database tools and the choice of data mining methods difficult. Raw information on test results, especially in the case of DNA microarray tests or features computed from images or signals, may comprise many thousands of columns and only tens to hundreds of records. Not all databases are prepared to store such a large number of columns, hence transposition (swapping columns and rows) is sometimes used. Similarly, the data mining process requires methods designed specifically for a large number of features (feature selection algorithms are of great importance here).

The complex structure of the stored information. It is not always a rectangular table of individual features. For example, for the purposes of data mining algorithms, information about the medicines administered should not be stored in a single text column in which their names are listed one after another. A feature constructed this way would not be interpreted correctly, for example by systems that build association rules; instead, a separate attribute is needed for each drug, stating whether it was administered or not.

Time, which is important in medical processes and appears at many levels: it may concern both the history of previous illnesses and the description of the current treatment (analysis of cause-and-effect relationships between treatment and test results). This problem was already discussed in the previous example.
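The point about medicines stored in a text column can be illustrated with a minimal sketch. The records, the column name `drugs` and the comma-separated format are assumptions made purely for this example, and the drug names were chosen only for illustration. The text column is replaced by one 0/1 attribute per drug, which is the form that association rule or decision tree algorithms can interpret.

```python
# Hypothetical patient records; the "drugs" column is a free-text list.
RECORDS = [
    {"patient_id": 1, "drugs": "aspirin, metformin"},
    {"patient_id": 2, "drugs": "metformin"},
    {"patient_id": 3, "drugs": ""},
]


def split_drugs(text):
    """Turn a text list such as 'aspirin, metformin' into a set of names."""
    return {name.strip().lower() for name in text.split(",") if name.strip()}


def add_drug_indicators(records):
    """Replace the text column with one 0/1 attribute per drug."""
    # Collect every drug name that occurs anywhere in the data.
    all_drugs = sorted(set().union(*(split_drugs(r["drugs"]) for r in records)))
    rows = []
    for r in records:
        taken = split_drugs(r["drugs"])
        row = {"patient_id": r["patient_id"]}
        for drug in all_drugs:
            row["drug_" + drug] = int(drug in taken)
        rows.append(row)
    return rows


rows = add_drug_indicators(RECORDS)
```

In this sketch every patient ends up with the same fixed set of columns, so the result is again a rectangular decision table; real data would also need spelling variants and synonyms of drug names to be unified first.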
