



1 Email:



Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information
from large databases and using it to make crucial business decisions. Data Mining uses two kinds of models: the predictive
model and the descriptive model. Its tasks include Classification, Regression, Time series analysis,
Prediction, Clustering, and Summarization.
The main aim of Data Mining is Knowledge Discovery in Databases (KDD). KDD is used to derive the patterns that
are useful for data extraction. The KDD process consists of Selection, Preprocessing,
Transformation, Data Mining, and Interpretation.
Data Mining depends chiefly on Classification and Clustering, which provide its strength.
Data mining metrics such as ROI (Return on Investment) are applied to measure the effectiveness of its functions.
Data Mining is related to concepts such as OLTP systems, Fuzzy sets, and Web search engines.
Data Mining can also be extended to Web Mining, Spatial Mining, and Temporal Mining.


One of the reasons behind maintaining any database is to enable the user to find interesting trends in the data. Data
mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data". It uses machine learning, statistical, and visualization techniques to discover and present
knowledge in a form that is easily comprehensible to humans.

Data mining has also been defined as the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business decisions.


Data mining is mainly concerned with the use of software techniques for finding hidden and unexpected patterns and
relationships in sets of data.
There are two data mining models.
1. Predictive Model
2. Descriptive Model

A predictive model makes a prediction about values of data using known results found from different data. The
prediction may be based on other historical data. The predictive data mining tasks are Classification,
Regression, Time series analysis, and Prediction.
Example: - Credit Card Usage

A descriptive model identifies patterns or relationships in data. It is used to explore the properties of the data
examined. Descriptive model data mining tasks are Clustering, Summarization, Association rules, and Sequence
discovery etc.
Example: - Manual Evaluation.



Predictive model          Descriptive model
----------------          -----------------
Classification            Clustering
Regression                Summarization
Time series analysis      Association rules
Prediction                Sequence discovery

Fig: Data Mining Models and Tasks


These describe the functions performed by the data mining techniques.
1. Classification: -
Classification maps data into predefined groups or classes. These classes are always determined before the data is
examined.
2. Regression: -
Regression maps a data item to a real-valued prediction variable. It assumes that the target data fits some known
type of function and then determines which function of that type is "best".
3. Time series analysis: -
With time series analysis, the value of an attribute is examined as it varies over time.
4. Prediction: -
Prediction estimates future data states based on past and current data.
5. Clustering: -
Clustering maps data into groups that are defined by the data alone. That is, similar data are grouped into
clusters. Domain experts maintain these clusters.
6. Summarization: -
Summarization is also called characterization or generalization. It maps data into subsets with associated simple
descriptions. That is, it characterizes the contents of the database.
7. Association rules: -
An association rule is a model that identifies relationships among data items, for example items that are
frequently purchased together.
8. Sequence discovery: -
Sequence discovery is used to determine sequential patterns in data, based on a time sequence of actions.
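
As an illustration of task 7, an association rule such as {bread} => {milk} is commonly scored by its support and confidence. The list above does not name these measures; they are standard in the literature and are used here as an assumption. The basket data and function names are hypothetical. A minimal Python sketch:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated share of transactions with the antecedent that also hold the consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Hypothetical market baskets, one frozenset per transaction.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A rule would normally be kept only if both values exceed thresholds chosen by the analyst.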


Definition of KDD process: -

Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in data.


Initial Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Transformation] →
Transformed Data → [Data Mining] → Model → [Interpretation] → Knowledge

Fig: KDD Process

Importance of KDD process in data mining: -

The KDD process plays an important role in data mining. The data mining process uses the patterns derived
by the KDD process. The KDD process takes the data as input and produces the useful information desired by the
users as the output. It is itself an interactive process, and it requires much elapsed time to produce accurate results.

Steps involved in KDD process: -

➢ Selection: -
The required data is obtained from various databases, files, and non-electronic sources.

➢ Preprocessing: -
Here, erroneous and missing data are identified and, if necessary, corrected using data mining tools.

➢ Transformation: -
Here, data may be encoded or transformed into more generalized, usable, and reduced formats.

➢ Data mining: -
Suitable algorithms are applied to the transformed data to generate the desired results.

➢ Interpretation (Evaluation): -
The results are presented to the users in an effective way (using GUI strategies).
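
The five steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the record layout, the mean-fill rule for missing values, the 100-unit spending threshold, and all function names are hypothetical, and the "mining" step is a simple count standing in for a real algorithm.

```python
from statistics import mean


def select(sources):
    """Selection: merge records from several sources into one target data set."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing: replace missing amounts with the mean of the known ones."""
    known = [r["amount"] for r in records if r["amount"] is not None]
    fill = mean(known)
    return [{**r, "amount": fill if r["amount"] is None else r["amount"]}
            for r in records]


def transform(records):
    """Transformation: reduce each amount to a coarse spending band."""
    return [{"customer": r["customer"],
             "band": "high" if r["amount"] >= 100 else "low"}
            for r in records]


def mine(records):
    """Data mining: count customers per band (stand-in for a real algorithm)."""
    model = {}
    for r in records:
        model[r["band"]] = model.get(r["band"], 0) + 1
    return model


def interpret(model):
    """Interpretation: present the model in a readable form."""
    return ", ".join(f"{band}: {count}" for band, count in sorted(model.items()))


def kdd(sources):
    """Run the five KDD steps in order."""
    return interpret(mine(transform(preprocess(select(sources)))))


# Hypothetical inputs: one database table and one flat file.
database = [{"customer": "a", "amount": 120}, {"customer": "b", "amount": None}]
flat_file = [{"customer": "c", "amount": 40}]
report = kdd([database, flat_file])
print(report)
```

Each function consumes the output of the previous step, which mirrors the data flow shown in the KDD process figure.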


[Figure: Information Retrieval, Databases, Statistics, Algorithms, and Machine Learning feeding into Data Mining]


Fig: Development Process of Data Mining

The following functions are used in the development process of data mining.
1. Induction - Induction is used to proceed from very specific knowledge to more general information.
2. Compression - The compressed description of the data characteristics is found only in the model.
3. Querying - Different types of data mining queries may be developed.
4. Approximation - Approximation helps uncover hidden information about data in large databases.
5. Search problem - The size and efficiency of developing an abstract model are considered.

1. Human interaction: -
The interaction with both domain and technical experts may be needed to assist in interpreting the results.
2. Outliers: -
Outliers are data entries that do not fit into the derived model. These also must be eliminated.
3. Large Data Sets: -
Massive data sets create a number of problems. Sampling and parallelization are tools that are used to reduce
those problems.

4. Multimedia data: -
Multimedia data complicates both the database and the algorithms.
5. Missing data: -
The estimates used for missing data in preprocessing phase can lead to invalid results.
6. Irrelevant data: -
Some attributes in the database might not be used for the current task.
7. Noisy data: -
Incorrect attribute values must be corrected before running data mining applications.
8. Changing data: -
If the database changes, the algorithm must be completely rerun.
9. Integration: -
Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.
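
To make the outlier issue (item 2) concrete, the following sketch flags values that lie far from the mean. The z-score rule and the threshold of 2.0 are assumptions chosen for illustration, not a method stated in the list above; the data and function name are hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical attribute values with one entry that does not fit the rest.
readings = [10, 11, 9, 10, 12, 95]
outliers = find_outliers(readings)
print(outliers)
```

On small samples a single extreme value inflates the standard deviation, so a robust statistic such as the median may be preferable in practice.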


The effectiveness of data mining functions can be measured using ROI (Return On Investment), which takes an overall
usefulness perspective. ROI examines the difference between what the data mining technique costs and what the savings
or benefits from its use are. The metrics used also include the traditional metrics of space and time based on
complexity analysis.
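
As a small worked example, ROI is often expressed as net benefit divided by cost. That ratio form is a common convention and an assumption here, since the text only speaks of the difference between cost and benefit; the figures and function name are hypothetical.

```python
def roi(benefit, cost):
    """Return on investment as (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


# Hypothetical project: 100,000 spent on mining, 150,000 saved as a result.
print(roi(benefit=150_000, cost=100_000))
```

A positive value means the benefits exceed the cost of the data mining effort; a negative value means they do not.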

1. Database/OLTP Systems: -
A database is a collection of data usually associated with some organization. A DBMS is the software used to
access the database.
Database and OLTP (On-Line Transaction Processing) systems can also answer data mining queries, which produce
a KDD object as output, whereas normal queries produce a subset of the database. A KDD object is a rule,
a classification, or a cluster.
2. Fuzzy sets and Fuzzy logic: -
A fuzzy set is a set, F, in which the set membership function, f, is a real-valued function with output in the
range [0,1]: an element x is said to belong to F with degree f(x) and simultaneously to be in ~F with
degree 1-f(x).
Fuzzy sets may be used to describe a number of data mining functions in various database areas.
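
A minimal sketch of a membership function f and its complement 1 - f(x) follows. The fuzzy set "tall" and its 160 cm and 190 cm breakpoints are hypothetical choices made only for illustration.

```python
def tall(height_cm):
    """Membership f(x) in the fuzzy set 'tall': 0 up to 160 cm, 1 from 190 cm, linear between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30


def not_tall(height_cm):
    """Membership in the complement set ~F, defined as 1 - f(x)."""
    return 1.0 - tall(height_cm)


for height in (150, 175, 200):
    print(height, tall(height), not_tall(height))
```

Unlike an ordinary set, an element such as 175 cm belongs partly to both the set and its complement.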
3. Decision support systems: -
Decision support systems (DSS) are comprehensive computer systems and related tools that assist managers in making
decisions and solving problems. A DSS could be enterprise-wide, giving upper-level managers the data
needed to make intelligent business decisions that impact the entire organization. In a DSS, the needed data
is extracted through data mining.
4. Web search engines: -
Web search engines are used to access data and can be viewed as query systems. Search engine queries can
be stated as keyword, Boolean, weighted, and so on. The data being searched consists of pages with
heterogeneous data and extensive hyperlinks.
5. Pattern Matching: -
Pattern matching or pattern recognition finds occurrences of a predefined pattern in data. Pattern matching
is used in text editors, information retrieval, and time series analysis.
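
For text data, finding occurrences of a predefined pattern can be sketched with a simple scan. The function name and sample string are hypothetical, and overlapping matches are deliberately reported.

```python
def find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text, overlaps included."""
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return positions


print(find_all("abababa", "aba"))
```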

1. Classification: -
Classification is the most familiar and most popular data mining technique.
Classification problem is defined as follows.
Given a database D = {t1, t2, …, tn} of tuples (items, records) and a set of classes C = {C1, C2, …, Cm}, the
classification problem is to define a mapping f: D → C where each ti is assigned to one class. A class, Cj, contains
precisely those tuples mapped to it; that is, Cj = {ti | f(ti) = Cj, 1 ≤ i ≤ n, and ti ∈ D}.
Algorithms used for solving classification problems in data mining: -
• Statistical-based algorithms
• Distance-based algorithms
• Decision tree-based algorithms
• Neural network-based algorithms
• Rule-based algorithms
• Combining techniques

Example: - In air force security checking, pattern recognition is one type of classification.
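
The mapping f: D → C can be illustrated with one simple distance-based approach, a nearest-centroid classifier. This is only one of many possible distance-based algorithms; the training data, class labels, and function names are hypothetical.

```python
from math import dist


def centroids(training):
    """Average the feature vectors of each class to get one centroid per class."""
    sums, counts = {}, {}
    for features, label in training:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(total / counts[label] for total in acc)
            for label, acc in sums.items()}


def classify(model, t):
    """The mapping f: assign tuple t to the class with the closest centroid."""
    return min(model, key=lambda label: dist(model[label], t))


def partition(model, database):
    """Build each class Cj as the list of tuples that f maps to it."""
    classes = {label: [] for label in model}
    for t in database:
        classes[classify(model, t)].append(t)
    return classes


# Hypothetical labelled tuples: (features, class).
training = [((1.0, 1.0), "low"), ((2.0, 1.0), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.0), "high")]
model = centroids(training)
classes = partition(model, [(2.0, 2.0), (7.0, 8.0)])
print(classes)
```

Every tuple lands in exactly one class, which matches the definition of the classification problem given above.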
2. Clustering: -
Clustering is defined as grouping similar data according to characteristics found in the actual data.
Example: - An international online catalog company wishes to group its customers based on common
features, including income, age, number of children, marital status, and education. Depending on the type of
advertising, only the required attributes are considered for clustering.
➢ Characteristics: -
• A cluster is a set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the
cluster and any point outside it.

➢ Problems that occur when clustering is applied to a real-world database: -

• Outlier handling and interpreting the semantics of each cluster may be difficult
• Cluster membership may change over time
• No prior knowledge of what data should be used or of the number of clusters to use.
➢ Algorithms used to solve clustering problems: -
• Hierarchical algorithms: Agglomerative algorithms and Divisive algorithms
• Categorical algorithms
• Large database: Sampling algorithms and Compression

➢ Applications of clustering: -
• Plant and animal classification and disease classification
• Image processing, pattern recognition, and document retrieval.
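
An agglomerative (bottom-up hierarchical) algorithm can be sketched as follows. The sketch assumes one-dimensional data, single-link distance, and a fixed target number of clusters; these choices, the customer ages, and the function name are hypothetical.

```python
def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift the position of cluster i.
        clusters[i].extend(clusters.pop(j))
    return [sorted(c) for c in clusters]


# Hypothetical customer ages to be grouped into three clusters.
ages = [1.0, 1.5, 2.0, 10.0, 10.5, 20.0]
groups = agglomerative(ages, 3)
print(groups)
```

The nested search makes this sketch slow on large inputs, which is one reason sampling and compression are listed for large databases.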

1. Web Mining: -
Web mining is mining of data related to the World Wide Web.
Web data can be classified into the following classes.
• Content of actual web pages
• Intra-page structure, which includes the HTML or XML code for the page
• Inter-page structure, which is the actual linkage structure between web pages
• User profiles, which include demographic and registration information obtained about users. This could also
include information found in cookies.
The above classes of Web data may be manipulated using different web mining tasks such as:
Web Content Mining, Web Structure Mining and Web Usage Mining.
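
As a small taste of Web structure mining, the inter-page linkage structure can be held as a mapping from each page to the pages it links to, and inbound links can then be counted. The site layout and function name are hypothetical, and counting inbound links is only one simple structural measure.

```python
def inbound_links(link_graph):
    """Count how many links point at each page in a page -> linked pages mapping."""
    counts = {page: 0 for page in link_graph}
    for targets in link_graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical three-page site.
site = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home", "products"],
}
link_counts = inbound_links(site)
print(link_counts)
```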

2. Spatial Mining: -
Spatial mining is data mining as applied to spatial databases or spatial data.
Specialized operations and data structures are used to access spatial data. Some of these are:
• Spatial queries
• Thematic maps
• Spatial data structures
• Image databases

➢ Spatial data mining primitives: -

Primitive operations between spatial objects are:
• Disjoint
• Equals
• Overlaps or intersects
• Covered by, inside, or contained in
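
These primitives can be sketched for the simplest kind of spatial object, an axis-aligned rectangle. The Rect type, the sample regions, and the decision to treat touching boundaries as overlapping are assumptions made for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box, a hypothetical stand-in for a spatial object."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def disjoint(a, b):
    """True when the two rectangles share no point."""
    return a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin


def equals(a, b):
    """True when both rectangles cover exactly the same area."""
    return a == b


def overlaps(a, b):
    """True when the rectangles share at least one point."""
    return not disjoint(a, b)


def inside(a, b):
    """True when a is covered by (contained in) b."""
    return (b.xmin <= a.xmin and a.xmax <= b.xmax
            and b.ymin <= a.ymin and a.ymax <= b.ymax)


city = Rect(0, 0, 10, 10)
park = Rect(2, 2, 4, 4)
lake = Rect(8, 8, 12, 12)
farm = Rect(20, 20, 30, 30)
print(inside(park, city), overlaps(lake, city), disjoint(farm, city))
```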

Generalization and Specialization: -

The use of a concept hierarchy shows levels of relationships among data. Spatial data mining techniques
have involved both generalization and specialization type approaches.
Spatial rules: -

There are different types of rules found in spatial data mining.

• Spatial characteristic rules
• Spatial discriminant rules
• Spatial association rules
Areas of applications of spatial data mining: -
• GIS systems, Geology, Environmental Science
• Resource management, Agriculture, Medicine, Robotics

3. Temporal Mining: -
Temporal data mining is the mining of temporal data from temporal databases.
Temporal mining involves concepts such as modeling temporal events, time series, and pattern detection.
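
A common first step when looking for patterns in a time series is to smooth it so that the trend is easier to see. The moving average below is one such smoothing technique, chosen here as an assumption; the series and function name are hypothetical.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values in the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


# Hypothetical daily sales figures.
sales = [1, 2, 3, 4, 5]
trend = moving_average(sales, 3)
print(trend)
```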


Retail marketing: Identifying buying patterns of customers, Market basket analysis etc.
Banking: Detecting patterns of fraudulent credit cards, Identifying loyal customers etc.
Insurance: Claims analysis, Predicting which customers will buy new policies etc.
Medicine: Characterizing patient behavior to predict surgery visits etc.

• Atlogix Software Solutions Pvt. Ltd.
• Microsoft India (R & D) Pvt. Ltd.
• MillenniumCare Infocom India Private Ltd


A data warehouse is well equipped to provide data for mining for the following reasons.
• Data quality and consistency are prerequisites for mining, to ensure the accuracy of the predictive
models. Data warehouses are populated with clean, consistent data.
• It is advantageous to mine data from multiple sources to discover as many interrelationships as
possible. Data warehouses contain data from a number of sources.
• Selecting the relevant subsets of records and fields for data mining requires the query capabilities of
the data warehouse.
• The results of a data mining study are useful if there is some way to further investigate the
uncovered patterns. Data warehouses provide the capability to go back to the data source.

Data mining has been implemented to fulfill business requirements by extracting data from databases, and
its results are applied to give decision support systems a forward-looking view.
