Singh
Knowledge Discovery in Databases (KDD) is a
non-trivial process of identifying valid, novel,
potentially useful, and ultimately
understandable patterns in data. The process
is presented in the next transparency.
Data gathering: At this stage data is collected from sources such as data warehouses,
data marts, operational data stores, legacy systems, OLAP servers, and Web crawling.
Data cleansing: This includes elimination of errors and/or bogus data, e.g.,
a recorded patient fever of 125 °F.
Feature extraction: At this stage the miner identifies only the interesting
attributes of the data; e.g., “date acquired” is probably not useful for clustering
celestial objects, as in Skycat.
Pattern extraction and discovery: This is the stage that is often thought of as
“data mining” and is where we shall concentrate our effort by using various data
mining tools.
Visualization: Visualization is the process of representing abstract business or
scientific data as images that can aid in understanding the meaning of the data.
This is a process of presenting the findings in graphic form to the users.
Evaluation of results: Not every discovered pattern is useful, or even true.
Sound judgment is therefore necessary before acting on your software's
conclusions.
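The cleansing and feature-extraction stages above can be sketched as simple filters over records. The field names, values, and the 90–110 °F plausibility range below are hypothetical, chosen only to mirror the fever and “date acquired” examples.

```python
# Hypothetical patient records; field names and values are illustrative only.
raw = [
    {"id": 1, "fever_f": 101.2, "date_acquired": "2024-01-03"},
    {"id": 2, "fever_f": 1250.0, "date_acquired": "2024-01-04"},  # bogus reading
    {"id": 3, "fever_f": 98.6, "date_acquired": "2024-01-05"},
]

# Data cleansing: drop records whose fever reading is physiologically impossible.
clean = [r for r in raw if 90.0 <= r["fever_f"] <= 110.0]

# Feature extraction: keep only the attributes interesting for mining
# ("date_acquired" is dropped, mirroring the Skycat example).
features = [{"fever_f": r["fever_f"]} for r in clean]

print(len(features))  # 2 records survive cleansing
```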
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
▪ Object-oriented and object-relational databases
▪ Spatial databases
▪ Time-series data and temporal data
▪ Text databases and multimedia databases
▪ Heterogeneous and legacy databases
▪ WWW
Data mining tools can analyze the following
types of data.
Numerical: Domain is ordered and can be
represented on the real line (e.g., age, income)
Nominal or categorical: Domain is a finite set
without any natural ordering (e.g., occupation,
marital status, race)
Ordinal: Domain is ordered, but absolute differences
between values are unknown (e.g., preference scale,
severity of an injury)
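The three attribute types can be shown side by side in one hypothetical record; the field names are assumptions for illustration only.

```python
# One hypothetical record illustrating the three attribute types.
record = {
    "age": 42,                    # numerical: ordered, lives on the real line
    "income": 55000.0,            # numerical
    "occupation": "nurse",        # nominal: finite set, no natural ordering
    "marital_status": "single",   # nominal
    "injury_severity": 2,         # ordinal: 1 < 2 < 3 is meaningful, but the
                                  # "distance" between levels is unknown
}

# Numerical and ordinal attributes support comparison...
print(record["age"] < 65)              # True
print(record["injury_severity"] < 3)   # True
# ...but only numerical attributes support meaningful arithmetic:
# (3 - 2) is NOT a meaningful quantity for an ordinal severity scale.
```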
▪ Predictive Modeling (Classification, Regression)
▪ Descriptive Modeling (Segmentation/Clustering, including
partition-based clustering algorithms)
▪ Pattern Recovery/Summarization (relations between fields,
association rule algorithms, visualization)
▪ Dependency Modeling and Causality (Graphical Models,
Density Estimation)
▪ Change & Deviation Detection/Modeling (in data or in
models), Protein Sequencing, Behavioral Sequences
What is Classification?
The goal of classification is to organize and
categorize data in distinct classes.
A model is first created based on the data
distribution.
The model is then used to classify new data:
given the model, a class can be predicted for
new instances.
The goal of prediction is to forecast or deduce the
value of an attribute based on the values of other
attributes.
A model is first created based on the data
distribution.
The model is then used to predict future or
unknown values.
Supervised Classification = Classification
▪ We know the class labels and number of classes.
Unsupervised Classification = Clustering
▪ We do not know the class labels and may not know the
number of classes.
Step 1: Model Construction (learning)
Each record (instance) is assumed to belong to one
of a set of predefined classes, as determined by one
of the attributes, called the class label.
The set of all records used for construction of the
model is called training set.
The model is usually represented in the form of
classification rules (IF-THEN statements) or
decision rules
Step 2. Model Evaluation (Accuracy):
Estimate accuracy rate of the model based on a
test set
The known label of test sample is compared with
the classified result from model
Accuracy rate: percentage of test set samples
correctly classified by the model
The test set must be independent of the training set;
otherwise over-fitting will occur
Step 3. Model Use (Classification):
The model is used to classify unseen instances
(assigning class labels)
Predict the actual value of an attribute
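The three steps above (construction, evaluation, use) can be sketched with a minimal one-rule classifier in plain Python. The weather-style data and the `train_one_rule` helper are hypothetical, not taken from any particular library.

```python
from collections import Counter, defaultdict

def train_one_rule(training_set, attribute):
    """Step 1 (construction): learn IF-THEN rules mapping each value of
    `attribute` to the majority class observed for it in the training set."""
    counts = defaultdict(Counter)
    for record, label in training_set:
        counts[record[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(rules, attribute, test_set):
    """Step 2 (evaluation): fraction of test records whose known label
    matches the label the rules assign."""
    hits = sum(rules.get(r[attribute]) == label for r, label in test_set)
    return hits / len(test_set)

# Hypothetical labeled data; the test set is kept independent of training.
train = [({"outlook": "sunny"}, "stay"), ({"outlook": "sunny"}, "stay"),
         ({"outlook": "rain"}, "stay"), ({"outlook": "overcast"}, "play"),
         ({"outlook": "overcast"}, "play")]
test = [({"outlook": "sunny"}, "stay"), ({"outlook": "overcast"}, "play")]

rules = train_one_rule(train, "outlook")          # Step 1
print(accuracy(rules, "outlook", test))           # Step 2: 1.0
print(rules["overcast"])                          # Step 3: classify unseen -> play
```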
Bayesian Classification
Decision Tree Induction
Neural Networks
Association-Based Classification
K-Nearest Neighbor
Case-Based Reasoning
Genetic Algorithms
Fuzzy Sets
Classification and prediction methods can be
compared and evaluated according to the
following criteria:
Predictive Accuracy: This refers to the ability of the
model to correctly predict the class label of new or
previously unseen data.
Speed: It relates to the computation costs involved in
generating and using the model.
Robustness: This is the ability of the model to make
correct predictions given noisy data or data with
missing values.
Scalability: This refers to the ability to construct the
model efficiently given a large amount of data.
Interpretability: This refers to the level of
understanding and insight that is provided by the
model.
Decision Tree Induction
Instances are represented by attribute-value
pairs.
Instances are described by a fixed set of attributes
(e.g., temperature) and their values (e.g., hot).
The easiest situation for decision tree learning
occurs when each attribute takes on a small
number of disjoint possible values (e.g., hot, mild,
cold).
Extensions to the basic algorithm allow handling
real-valued attributes as well (e.g., a floating point
temperature).
The target function has discrete output
values.
A decision tree assigns a classification to each
example.
▪ The simplest case occurs when there are only two possible
classes (Boolean classification).
▪ Decision tree methods can also be easily extended to
learning functions with more than two possible output
values.
A more substantial extension allows learning
target functions with real-valued outputs,
although the application of decision trees in this
setting is less common.
Disjunctive descriptions may be required.
Decision trees naturally represent disjunctive expressions.
The training data may contain errors.
Decision tree learning methods are robust to errors - both errors
in classifications of the training examples and errors in the
attribute values that describe these examples.
The training data may contain missing attribute values.
Decision tree methods can be used even when some training
examples have unknown values (e.g., humidity is known for only a
fraction of the examples).
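The real-valued extension mentioned above is commonly handled by turning a continuous attribute into a binary test against a threshold. The sketch below picks a candidate threshold by minimizing misclassifications; real decision tree learners score splits with information gain instead, and the temperature data is hypothetical.

```python
def best_threshold(values, labels):
    """Candidate binary split for a real-valued attribute: try midpoints
    between adjacent distinct sorted values and pick the one whose two
    sides, each predicting their majority class, misclassify the fewest
    training examples."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)  # (threshold, misclassification count)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        err = (len(left) - max(left.count(c) for c in set(left))
               + len(right) - max(right.count(c) for c in set(right)))
        if err < best[1]:
            best = (t, err)
    return best[0]

# Hypothetical temperatures: split cleanly separates "mild" from "hot".
print(best_threshold([60, 65, 70, 85, 90],
                     ["mild", "mild", "mild", "hot", "hot"]))  # 77.5
```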
Independent Attributes / Condition Attributes: Name, Hair, Height, Weight, Lotion
Dependent Attributes / Decision Attributes: Result

Name   Hair    Height   Weight  Lotion  Result
Sarah  blonde  average  light   no      sunburned (positive)
              No Change   Sunburned   Marginal Sum
Blonde            16          16            32
Not Blonde        20          12            32
Marginal Sum      36          28            64
Sample degrees of freedom calculation:
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
From the chi-square table, the critical value at the
0.05 level with df = 1 is 3.84.
Since the computed chi-square for this table is
approximately 1.02, which is less than 3.84, we accept
the null hypothesis of independence, H0.
We thus conclude, according to the training
examples, that sunburn is independent from
blonde hair, and thus we may eliminate this
antecedent from Rule #1 and Rule #2.
              No Change   Sunburned   Marginal Sum
Lotion            12           0            12
No Lotion          8          12            20
Marginal Sum      20          12            32
Chi-Square = 11.52
df = 1
From the chi-square table, the critical value is again 3.84.
Since 11.52 > 3.84, we reject the null hypothesis of
independence, H0, and accept the alternate
hypothesis of dependence, Ha.
Therefore, according to the training
examples, sunburn is clearly dependent upon
the use of lotion, so we cannot eliminate this
antecedent.
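Both tests above can be reproduced with a short Pearson chi-square routine for 2×2 contingency tables; this is a plain-Python sketch, not a statistics-library call.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table of
    observed counts, given as [[a, b], [c, d]]."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

CRITICAL = 3.84  # chi-square critical value at the 0.05 level, df = 1

# Hair table: rows Blonde / Not Blonde, columns No Change / Sunburned.
print(round(chi_square_2x2([[16, 16], [20, 12]]), 2))  # 1.02 < 3.84 -> independent
# Lotion table: rows Lotion / No Lotion.
print(round(chi_square_2x2([[12, 0], [8, 12]]), 2))    # 11.52 > 3.84 -> dependent
```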