Singh
Knowledge Discovery in Databases (KDD) is a
non-trivial process of identifying valid, novel,
potentially useful, and ultimately
understandable patterns in data. The process
is presented in the next transparency.
Data gathering: At this stage data is collected from sources such as data warehouses,
data marts, operational data stores, legacy systems, OLAP servers, and Web crawling.
Data cleansing: This includes elimination of errors and/or bogus data, e.g.,
a recorded patient fever of 125 °F.
Feature extraction: At this stage the miner identifies only the interesting
attributes of the data; e.g., “date acquired” is probably not useful for clustering
celestial objects, as in Skycat.
Pattern extraction and discovery: This is the stage that is often thought of as
“data mining” and is where we shall concentrate our effort by using various data
mining tools.
Visualization: Visualization is the process of representing abstract business or
scientific data as images that can aid in understanding the meaning of the data.
This is a process of presenting the findings in graphic form to the users.
Evaluation of results: Not every discovered pattern is useful, or even true.
Sound judgment is therefore necessary before acting on your software's
conclusions.
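The cleansing and feature-extraction stages above can be sketched as simple filters over records. The field names, values, and the 90–110 °F plausibility range below are hypothetical, chosen only to mirror the fever and “date acquired” examples.

```python
# Hypothetical patient records; field names and values are illustrative only.
raw = [
    {"id": 1, "fever_f": 101.2, "date_acquired": "2024-01-03"},
    {"id": 2, "fever_f": 1250.0, "date_acquired": "2024-01-04"},  # bogus reading
    {"id": 3, "fever_f": 98.6, "date_acquired": "2024-01-05"},
]

# Data cleansing: drop records whose fever reading is physiologically impossible.
clean = [r for r in raw if 90.0 <= r["fever_f"] <= 110.0]

# Feature extraction: keep only the attributes interesting for mining
# ("date_acquired" is dropped, mirroring the Skycat example).
features = [{"fever_f": r["fever_f"]} for r in clean]

print(len(features))  # 2 records survive cleansing
```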
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
▪ Object-oriented and object-relational databases
▪ Spatial databases
▪ Time-series data and temporal data
▪ Text databases and multimedia databases
▪ Heterogeneous and legacy databases
▪ WWW
Data mining tools can analyze the following
types of data.
Numerical: Domain is ordered and can be
represented on the real line (e.g., age, income)
Nominal or categorical: Domain is a finite set
without any natural ordering (e.g., occupation,
marital status, race)
Ordinal: Domain is ordered, but absolute differences
between values are unknown (e.g., preference scale,
severity of an injury)
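The three attribute types can be shown side by side in one hypothetical record; the field names are assumptions for illustration only.

```python
# One hypothetical record illustrating the three attribute types.
record = {
    "age": 42,                    # numerical: ordered, lives on the real line
    "income": 55000.0,            # numerical
    "occupation": "nurse",        # nominal: finite set, no natural ordering
    "marital_status": "single",   # nominal
    "injury_severity": 2,         # ordinal: 1 < 2 < 3 is meaningful, but the
                                  # "distance" between levels is unknown
}

# Numerical and ordinal attributes support comparison...
print(record["age"] < 65)              # True
print(record["injury_severity"] < 3)   # True
# ...but only numerical attributes support meaningful arithmetic:
# (3 - 2) is NOT a meaningful quantity for an ordinal severity scale.
```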
▪ Predictive Modeling (Classification, Regression)
▪ Descriptive Modeling (Segmentation/Clustering, including
partition-based clustering algorithms)
▪ Pattern Recovery/Summarization (relations between fields,
association rule algorithms, visualization)
▪ Dependency Modeling and Causality (Graphical Models,
Density Estimation)
▪ Change & Deviation Detection/Modeling (in data or in
models), Protein Sequencing, Behavioral Sequences
What is Classification?
The goal of classification is to organize and
categorize data in distinct classes.
A model is first created based on the data
distribution.
The model is then used to classify new data:
given the model, a class can be predicted for
new instances.
The goal of prediction is to forecast or deduce the
value of an attribute based on the values of other
attributes.
A model is first created based on the data
distribution.
The model is then used to predict future or
unknown values.
Supervised Classification = Classification
▪ We know the class labels and number of classes.
Unsupervised Classification = Clustering
▪ We do not know the class labels and may not know the
number of classes.
Step 1: Model Construction (learning)
Each record (instance) is assumed to belong to one
of a set of predefined classes, as determined by one
of the attributes, called the class label.
The set of all records used for construction of the
model is called training set.
The model is usually represented in the form of
classification rules (IF-THEN statements) or
decision rules
Step 2. Model Evaluation (Accuracy):
Estimate accuracy rate of the model based on a
test set
The known label of test sample is compared with
the classified result from model
Accuracy rate: percentage of test set samples
correctly classified by the model
The test set must be independent of the training set;
otherwise over-fitting will occur
Step 3. Model Use (Classification):
The model is used to classify unseen instances
(assigning class labels)
Predict the actual value of an attribute
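The three steps above (construction, evaluation, use) can be sketched with a minimal one-rule classifier in plain Python. The weather-style data and the `train_one_rule` helper are hypothetical, not taken from any particular library.

```python
from collections import Counter, defaultdict

def train_one_rule(training_set, attribute):
    """Step 1 (construction): learn IF-THEN rules mapping each value of
    `attribute` to the majority class observed for it in the training set."""
    counts = defaultdict(Counter)
    for record, label in training_set:
        counts[record[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rules, attribute, test_set):
    """Step 2 (evaluation): fraction of test records whose known label
    matches the label the rules assign."""
    hits = sum(rules.get(r[attribute]) == label for r, label in test_set)
    return hits / len(test_set)

# Hypothetical labeled data; the test set is kept independent of training.
train = [({"outlook": "sunny"}, "stay"), ({"outlook": "sunny"}, "stay"),
         ({"outlook": "rain"}, "stay"), ({"outlook": "overcast"}, "play"),
         ({"outlook": "overcast"}, "play")]
test = [({"outlook": "sunny"}, "stay"), ({"outlook": "overcast"}, "play")]

rules = train_one_rule(train, "outlook")          # Step 1
print(accuracy(rules, "outlook", test))           # Step 2: 1.0
print(rules["overcast"])                          # Step 3: classify unseen -> play
```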
Bayesian Classification
Decision Tree Induction
Neural Networks
Association-Based Classification
K-Nearest Neighbor
Case-Based Reasoning
Genetic Algorithms
Fuzzy Sets
Classification and prediction methods can be
compared and evaluated according to the
following criteria:
Predictive Accuracy: This refers to the ability of the
model to correctly predict the class label of new or
previously unseen data.
Speed: It relates to the computation costs involved in
generating and using the model.
Robustness: This is the ability of the model to make
correct predictions given noisy data or data with
missing values.
Scalability: This refers to the ability to construct the
model efficiently given a large amount of data.
Interpretability: This refers to the level of
understanding and insight that is provided by the
model.
Decision Tree Induction
Instances are represented by attribute-value
pairs.
Instances are described by a fixed set of attributes
(e.g., temperature) and their values (e.g., hot).
The easiest situation for decision tree learning
occurs when each attribute takes on a small
number of disjoint possible values (e.g., hot, mild,
cold).
Extensions to the basic algorithm allow handling
real-valued attributes as well (e.g., a floating point
temperature).
The target function has discrete output
values.
A decision tree assigns a classification to each
example.
▪ The simplest case occurs when there are only two possible
classes (Boolean classification).
▪ Decision tree methods can also be easily extended to
learning functions with more than two possible output
values.
A more substantial extension allows learning
target functions with real-valued outputs,
although the application of decision trees in this
setting is less common.
Disjunctive descriptions may be required.
Decision trees naturally represent disjunctive expressions.
The training data may contain errors.
Decision tree learning methods are robust to errors - both errors
in classifications of the training examples and errors in the
attribute values that describe these examples.
The training data may contain missing attribute values.
Decision tree methods can be used even when some training
examples have unknown values (e.g., humidity is known for only a
fraction of the examples).
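The real-valued extension mentioned above is commonly handled by turning a continuous attribute into a binary test against a threshold. The sketch below picks a candidate threshold by minimizing misclassifications; real decision tree learners score splits with information gain instead, and the temperature data is hypothetical.

```python
def best_threshold(values, labels):
    """Candidate binary split for a real-valued attribute: try midpoints
    between adjacent distinct sorted values and pick the one whose two
    sides, each predicting their majority class, misclassify the fewest
    training examples."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)  # (threshold, misclassification count)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        err = (len(left) - max(left.count(c) for c in set(left))
               + len(right) - max(right.count(c) for c in set(right)))
        if err < best[1]:
            best = (t, err)
    return best[0]

# Hypothetical temperatures: split cleanly separates "mild" from "hot".
print(best_threshold([60, 65, 70, 85, 90],
                     ["mild", "mild", "mild", "hot", "hot"]))  # 77.5
```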
Independent Attributes / Condition Attributes: Name, Hair, Height, Weight, Lotion
Dependent Attributes / Decision Attributes: Result

Name   Hair    Height   Weight  Lotion  Result
Sarah  blonde  average  light   no      sunburned (positive)
              No Change   Sunburned   Marginal Sum
Blonde            16          16            32
Not Blonde        20          12            32
Marginal Sum      36          28            64
Sample degrees of freedom calculation:
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
From the chi-square table, the critical value at the
0.05 level with df = 1 is 3.84.
Since the computed chi-square for this table is
approximately 1.02, which is less than 3.84, we accept
the null hypothesis of independence, H0.
We thus conclude, according to the training
examples, that sunburn is independent from
blonde hair, and thus we may eliminate this
antecedent from Rule #1 and Rule #2.
              No Change   Sunburned   Marginal Sum
Lotion            12           0            12
No Lotion          8          12            20
Marginal Sum      20          12            32
Chi-Square = 11.52
df = 1
From the chi-square table, the critical value is again 3.84.
Since 11.52 > 3.84, we reject the null hypothesis of
independence, H0, and accept the alternate
hypothesis of dependence, Ha.
Therefore, according to the training
examples, sunburn is clearly dependent upon
the use of lotion, so we cannot eliminate this
antecedent.
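Both tests above can be reproduced with a short Pearson chi-square routine for 2×2 contingency tables; this is a plain-Python sketch, not a statistics-library call.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table of
    observed counts, given as [[a, b], [c, d]]."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

CRITICAL = 3.84  # chi-square critical value at the 0.05 level, df = 1

# Hair table: rows Blonde / Not Blonde, columns No Change / Sunburned.
print(round(chi_square_2x2([[16, 16], [20, 12]]), 2))  # 1.02 < 3.84 -> independent
# Lotion table: rows Lotion / No Lotion.
print(round(chi_square_2x2([[12, 0], [8, 12]]), 2))    # 11.52 > 3.84 -> dependent
```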