Concepts and Techniques
Chapter 7
May 5, 2015
Classification:
  predicts categorical class labels
  constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data
Prediction:
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
  credit approval
  target marketing
  medical diagnosis
  treatment effectiveness analysis
Classification: A Two-Step Process
NAME    RANK            YEARS  TENURED
Mike    Assistant Prof  3      no
Mary    Assistant Prof  7      yes
Bill    Professor       2      yes
Jim     Associate Prof  7      yes
Dave    Assistant Prof  6      no
Anne    Associate Prof  3      no
A classification algorithm is run over the training data to construct the classifier (model):

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'

Unseen data: (Jeff, Professor, 4)
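The learned rule can be encoded directly; a minimal sketch in Python (the function name `tenured` is an illustrative choice):

```python
def tenured(rank, years):
    # Classifier (model) learned from the training data:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return rank == "Professor" or years > 6

# Applying the model to the unseen tuple (Jeff, Professor, 4):
print(tenured("Professor", 4))  # -> True: Jeff is predicted tenured
```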
Test data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Tenured?
Data preparation issues:
  Data cleaning: preprocess data to reduce noise and handle missing values
  Data transformation: generalize and/or normalize data
Evaluating classification methods:
  Predictive accuracy
  Speed and scalability
    time to construct the model
    time to use the model
  Robustness
    handling noise and missing values
  Scalability
    efficiency for disk-resident databases
  Interpretability
    understanding and insight provided by the model
  Goodness of rules
    decision tree size
    compactness of classification rules
Decision tree:
  A flow-chart-like tree structure
  Internal node denotes a test on an attribute
  Branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
  Tree construction: partition examples recursively based on selected attributes
  Tree pruning: identify and remove branches that reflect noise or outliers
Training Dataset

This follows an example from Quinlan's ID3.

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no
age?
  <=30: student?
    no  -> no
    yes -> yes
  31...40: yes
  >40: credit rating?
    excellent -> no
    fair      -> yes
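The tree above maps directly onto nested conditionals; a sketch (the function name and the string encodings of the attribute values are illustrative):

```python
def buys_computer(age, student, credit_rating):
    # Decision tree learned from the buys_computer training data
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31...40":
        return "yes"
    else:  # age == ">40"
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))  # -> yes
```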
Attribute Selection Measure
Information Gain (ID3/C4.5)

Assume there are two classes, P and N, with p examples of class P and n examples of class N. The expected information needed to classify an arbitrary example is

  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes" (p = 9)
Class N: buys_computer = "no"  (n = 5)
I(p, n) = I(9, 5) = 0.940

age      pi  ni  I(pi, ni)
<=30     2   3   0.971
31...40  4   0   0
>40      3   2   0.971

E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Hence
  Gain(age) = I(p, n) - E(age) = 0.246

Similarly,
  Gain(income)        = 0.029
  Gain(student)       = 0.151
  Gain(credit_rating) = 0.048
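The entropy and gain computation above can be verified in a few lines; a sketch using the counts from the training table (function names are illustrative):

```python
from math import log2

def info(p, n):
    """I(p, n): expected information (entropy) for a split of p and n examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * log2(frac)
    return result

# Partition of the 14 examples by age: (pi, ni) per branch
age_branches = [(2, 3), (4, 0), (3, 2)]   # <=30, 31...40, >40
e_age = sum((p + n) / 14 * info(p, n) for p, n in age_branches)

gain_age = info(9, 5) - e_age
# Matches the slide's E(age) and Gain(age) up to rounding
print(f"E(age) = {e_age:.3f}, Gain(age) = {gain_age:.3f}")
```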
The Gini index of a data set T containing examples from several classes is

  gini(T) = 1 - sum_j (pj)^2

where pj is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the Gini index of the split is

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
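A sketch of the two measures under these definitions, with class labels as plain strings (names are illustrative):

```python
def gini(labels):
    """gini(T) = 1 - sum_j (pj)^2 over the class relative frequencies in T."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(t1, t2):
    """Weighted Gini index of a binary split of T into subsets T1 and T2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# Gini of the full buys_computer set (9 yes, 5 no)
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # -> 0.459
```

A pure split (each subset containing a single class) yields gini_split = 0, the best possible value.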
Avoid Overfitting in Classification
Enhancements to basic decision tree induction
Presentation of Classification Results
Bayes' Theorem
Bayesian classification
Estimating a-posteriori probabilities

Bayes' theorem:
  P(C|X) = P(X|C) P(C) / P(X)
Play-tennis example: estimating P(xi|C)

The 14 training examples (attributes outlook, temperature, humidity, windy; classes p and n) give the class priors

  P(p) = 9/14
  P(n) = 5/14

and the conditional probability estimates:

outlook:
  P(sunny|p)    = 2/9    P(sunny|n)    = 3/5
  P(overcast|p) = 4/9    P(overcast|n) = 0
  P(rain|p)     = 3/9    P(rain|n)     = 2/5
temperature:
  P(hot|p)  = 2/9        P(hot|n)  = 2/5
  P(mild|p) = 4/9        P(mild|n) = 2/5
  P(cool|p) = 3/9        P(cool|n) = 1/5
humidity:
  P(high|p)   = 3/9      P(high|n)   = 4/5
  P(normal|p) = 6/9      P(normal|n) = 2/5
windy:
  P(true|p)  = 3/9       P(true|n)  = 3/5
  P(false|p) = 6/9       P(false|n) = 2/5
Classifying an unseen sample X = <rain, hot, high, false>:

  P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p)
              = 3/9 * 2/9 * 3/9 * 6/9 * 9/14 = 0.010582
  P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n)
              = 2/5 * 2/5 * 4/5 * 2/5 * 5/14 = 0.018286

Since P(X|n) P(n) is larger, sample X is classified as class n.
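The two products can be reproduced directly from the estimated probabilities; a sketch (the dictionary layout is an illustrative choice):

```python
from functools import reduce

# P(xi | class) estimates from the 14 play-tennis examples
p_given_p = {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9}
p_given_n = {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5}

x = ["rain", "hot", "high", "false"]  # the unseen sample X

score_p = reduce(lambda a, v: a * p_given_p[v], x, 9/14)  # P(X|p) P(p)
score_n = reduce(lambda a, v: a * p_given_n[v], x, 5/14)  # P(X|n) P(n)

print(round(score_p, 6), round(score_n, 6))  # 0.010582 0.018286
# score_n is larger, so X is classified as class n
```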
Bayesian belief network example: FamilyHistory (FH) and Smoker (S) are the parents of LungCancer (LC); Emphysema and Dyspnea are further nodes in the network. The conditional probability table for LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC      0.8      0.5       0.7       0.1
  ~LC     0.2      0.5       0.3       0.9
Neural Networks

Advantages:
  prediction accuracy is generally high
  robust: works when training examples contain errors
  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  fast evaluation of the learned target function
Criticism:
  long training time
  difficult to understand the learned function (weights)
  not easy to incorporate domain knowledge
A Neuron

An n-dimensional input vector x = (x0, x1, ..., xn) is combined with a weight vector w = (w0, w1, ..., wn): the neuron computes the weighted sum of its inputs, subtracts a bias mu_k, and applies an activation function f to produce the output

  y = f( sum_i wi * xi - mu_k )
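A minimal sketch of this computation, assuming a sigmoid activation and hypothetical input, weight, and bias values:

```python
import math

def neuron_output(x, w, bias, activation=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """y = f(sum_i wi * xi - bias), here with a sigmoid activation f."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return activation(weighted_sum)

# Hypothetical inputs, weights, and bias for illustration
y = neuron_output([1.0, 0.5, -0.2], [0.3, 0.8, 0.5], bias=0.1)
print(round(y, 3))  # -> 0.622
```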
Network Training: Multi-Layer Perceptron

The input vector x_i is fed into the input nodes; each hidden or output node j computes

  I_j = sum_i w_ij O_i + theta_j
  O_j = 1 / (1 + e^(-I_j))

The error is propagated backwards from the output nodes to the hidden nodes:

  output node j:  Err_j = O_j (1 - O_j) (T_j - O_j)
  hidden node j:  Err_j = O_j (1 - O_j) sum_k Err_k w_jk

and the weights and biases are updated with learning rate l:

  w_ij    = w_ij + (l) Err_j O_i
  theta_j = theta_j + (l) Err_j
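The update equations can be sketched as a single training step; the network shape (2 inputs, 2 hidden nodes, 1 output) and the learning rate are illustrative assumptions:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, w_hidden, w_out, theta_hidden, theta_out, l=0.5):
    """One forward + backward pass for a 2-input, 2-hidden, 1-output net."""
    # Forward: I_j = sum_i w_ij * O_i + theta_j, O_j = sigmoid(I_j)
    o_hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + th)
                for ws, th in zip(w_hidden, theta_hidden)]
    o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hidden)) + theta_out)

    # Backward: output error Err_j = O_j (1 - O_j)(T_j - O_j)
    err_out = o_out * (1 - o_out) * (target - o_out)
    # Hidden error Err_j = O_j (1 - O_j) * sum_k Err_k w_jk
    err_hidden = [o * (1 - o) * err_out * w for o, w in zip(o_hidden, w_out)]

    # Updates: w_ij += l * Err_j * O_i ; theta_j += l * Err_j
    w_out = [w + l * err_out * o for w, o in zip(w_out, o_hidden)]
    theta_out += l * err_out
    w_hidden = [[w + l * e * xi for w, xi in zip(ws, x)]
                for ws, e in zip(w_hidden, err_hidden)]
    theta_hidden = [th + l * e for th, e in zip(theta_hidden, err_hidden)]
    return w_hidden, w_out, theta_hidden, theta_out, o_out

# Demo on a hypothetical sample: repeated steps drift the output toward the target
w_h, w_o, t_h, t_o = [[0.1, -0.2], [0.3, 0.4]], [0.5, -0.5], [0.0, 0.0], 0.0
for _ in range(3):
    w_h, w_o, t_h, t_o, out = train_step([1.0, 0.0], 1.0, w_h, w_o, t_h, t_o)
```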
Association-Based Classification
Other classification methods:
  case-based reasoning
  genetic algorithms
Instance-Based Methods

Instance-based learning: store the training examples and delay the processing (lazy evaluation) until a new instance must be classified.
Typical approaches:
  k-nearest neighbor approach
All instances correspond to points in the n-dimensional space. The nearest neighbors are defined in terms of Euclidean distance. The target function could be discrete- or real-valued. For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to xq.

Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
Distance-weighted k-NN: weight the contribution of each of the k neighbors according to its distance to the query point xq, e.g.

  w = 1 / d(xq, xi)^2
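Distance-weighted k-NN for a discrete-valued target can be sketched as follows; the 2-D points and labels are hypothetical:

```python
import math
from collections import defaultdict

def knn_classify(query, examples, k=3):
    """Distance-weighted k-NN: each neighbor votes with w = 1 / d(xq, xi)^2."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(query, point)
        # An exact match dominates the vote; otherwise weight by 1/d^2
        votes[label] += float("inf") if d == 0 else 1.0 / d ** 2
    return max(votes, key=votes.get)

# Hypothetical training points labelled "+" and "-"
data = [((0, 0), "-"), ((0, 1), "-"), ((3, 3), "+"), ((4, 3), "+"), ((4, 4), "+")]
print(knn_classify((3.5, 3.0), data))  # -> "+"
```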
Case-Based Reasoning
Genetic Algorithms
Fuzzy Set Approaches
What Is Prediction?

Prediction models continuous-valued functions. The major method is regression, including non-linear regression.
Predictive Modeling in Databases
Predictive modeling: predict data values or construct generalized models from the database data, and use them for prediction.
Method outline:
  Minimal generalization
  Prediction
Determine the major factors which influence the prediction:
  data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Locally weighted regression minimizes the kernel-weighted squared error over the k nearest neighbors of the query point xq:

  E(xq) = 1/2 * sum over x in k nearest neighbors of xq of (f(x) - f^(x))^2 K(d(xq, x))

which gives the gradient-descent weight update

  delta w_j = eta * sum over x in k nearest neighbors of xq of K(d(xq, x)) (f(x) - f^(x)) a_j(x)

where K is a kernel function decreasing with distance, f^ is the learned approximation of f, and a_j(x) is the j-th attribute of x.
Estimating error rates:
  Partition: training-and-testing
  Cross-validation
  Bootstrapping (leave-one-out)
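The cross-validation partition can be sketched as an index split; the fold count below is an illustrative choice:

```python
def kfold_splits(n, k):
    """Partition indices 0..n-1 into k folds; each fold serves once as test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

for train, test in kfold_splits(10, 5):
    print(len(train), len(test))  # 8 2 for each of the 5 folds
```

The model is trained k times, each time on the training indices, and the k test-set error rates are averaged.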
Summary