Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class: find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible; a test set is used to determine the accuracy of the model.
Examples of Classification Task
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
Decision Tree based methods
Rule-based methods
Memory based reasoning
Neural networks
Naive Bayes and Bayesian belief networks
Support vector machines
Example of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: a decision tree, with Refund as the first splitting attribute:

Refund?
|-- Yes -> NO
`-- No -> MarSt?
    |-- Single, Divorced -> TaxInc?
    |   |-- < 80K -> NO
    |   `-- > 80K -> YES
    `-- Married -> NO
Another Example of Decision Tree

The same training data also fits a different tree, with MarSt as the first splitting attribute:

MarSt?
|-- Married -> NO
`-- Single, Divorced -> Refund?
    |-- Yes -> NO
    `-- No -> TaxInc?
        |-- < 80K -> NO
        `-- > 80K -> YES

There could be more than one tree that fits the same data.
Decision Tree Classification Task
Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the record:

Refund?
|-- Yes -> NO
`-- No -> MarSt?
    |-- Single, Divorced -> TaxInc?
    |   |-- < 80K -> NO
    |   `-- > 80K -> YES
    `-- Married -> NO

Refund = No and Marital Status = Married, so the record reaches the Married leaf: assign Cheat to "No".
Decision Tree Induction
Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t. General procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt is an empty set, then t is a leaf node labeled by the default class yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
Hunt's Algorithm (on the Cheat data)
Step 1: a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: under Refund = No, split on Marital Status: Single, Divorced -> Cheat; Married -> Don't Cheat.
Step 4: under Single, Divorced, split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat. (Refund = Yes and Married remain Don't Cheat.)
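The recursive procedure above can be sketched in Python. The stopping rules follow the slides; the function names, the restriction to categorical attributes, and the use of weighted Gini to pick the splitting attribute are illustrative choices of mine:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attrs, default):
    """Grow a decision tree following Hunt's recursive procedure."""
    if not records:                        # Dt is empty: leaf with default class
        return default
    if len(set(labels)) == 1:              # all records belong to the same class
        return labels[0]
    if not attrs:                          # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]

    def split_score(a):                    # weighted Gini of splitting on attribute a
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        n = len(labels)
        return sum(len(g) / n * gini(g) for g in groups.values())

    best = min(attrs, key=split_score)
    majority = Counter(labels).most_common(1)[0][0]
    branches = {}
    for r, y in zip(records, labels):
        rs, ys = branches.setdefault(r[best], ([], []))
        rs.append(r)
        ys.append(y)
    rest = [a for a in attrs if a != best]
    return {"attr": best,
            "children": {v: hunt(rs, ys, rest, majority)
                         for v, (rs, ys) in branches.items()}}

def classify(tree, record):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree["children"][record[tree["attr"]]]
    return tree

# categorical part of the Cheat training data from the slides
records = [{"Refund": r, "MarSt": m} for r, m in [
    ("Yes", "Single"), ("No", "Married"), ("No", "Single"),
    ("Yes", "Married"), ("No", "Divorced"), ("No", "Married"),
    ("Yes", "Divorced"), ("No", "Single"), ("No", "Married"),
    ("No", "Single")]]
labels = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
tree = hunt(records, labels, ["Refund", "MarSt"], "No")
print(classify(tree, {"Refund": "No", "MarSt": "Divorced"}))  # Yes
```

Pure nodes become leaves immediately, impure nodes are split again, matching the step-by-step growth shown above.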
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
Determine how to split the records: how to specify the attribute test condition? how to determine the best split?
Determine when to stop splitting.
How to Specify Test Condition?
Depends on attribute types: nominal, ordinal, continuous.
Depends on number of ways to split: 2-way split, multi-way split.
Splitting Based on Nominal Attributes
Multi-way split: use as many partitions as distinct values, e.g. CarType: Family / Sports / Luxury.
Binary split: divide the values into two subsets, e.g. CarType: {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}.
Splitting Based on Ordinal Attributes
Multi-way split: use as many partitions as distinct values, e.g. Size: Small / Medium / Large.
Binary split: divide the values into two subsets that preserve the order, e.g. Size: {Small, Medium} vs. {Large}, OR {Small} vs. {Medium, Large}.
What about Size: {Small, Large} vs. {Medium}? (This violates the order property.)
Splitting Based on Continuous Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute. Static: discretize once at the beginning. Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
Binary decision: (A < v) or (A >= v). Consider all possible splits and find the best cut; this can be more compute intensive.
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity: non-homogeneous means a high degree of impurity, homogeneous means a low degree of impurity.
Measures of Node Impurity
Gini index
Entropy
Misclassification error
How to Find the Best Split
Before splitting, the node has impurity M0. Splitting on A produces nodes N1 and N2 with impurities M1 and M2 (weighted average M12); splitting on B produces N3 and N4 with impurities M3 and M4 (weighted average M34).
Gain = M0 - M12 versus M0 - M34: choose the attribute test with the higher gain.
Measure of Impurity: GINI
Gini index for a given node t:
GINI(t) = 1 - Σ_j [p(j|t)]^2
where p(j|t) is the relative frequency of class j at node t. Maximum (1 - 1/nc) when records are equally distributed among all nc classes; minimum (0.0) when all records belong to one class.
Examples:
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Examples for computing GINI
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - 0^2 - 1^2 = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
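The worked examples can be checked with a few lines of Python (the function name and rounding are my own):

```python
def gini(counts):
    """Gini index of a node: 1 - sum over classes of p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# the worked examples from the slides
print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
```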
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)
where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing GINI Index
Splits into two partitions; the weighting of partitions favors larger and purer partitions.
Parent: C1 = 6, C2 = 6; Gini = 0.500.
Split on B: N1 (B = Yes): C1 = 5, C2 = 2; N2 (B = No): C1 = 1, C2 = 4.
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(Children) = 7/12 x 0.408 + 5/12 x 0.320 = 0.371
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class; use the count matrix to make splitting decisions.
Multi-way split: CarType with one partition per value (Family / Sports / Luxury).
Two-way split (find the best partition of values):
{Sports, Luxury}: C1 = 3, C2 = 2; {Family}: C1 = 1, C2 = 4; Gini = 0.400
{Sports}: C1 = 2, C2 = 1; {Family, Luxury}: C1 = 2, C2 = 5; Gini = 0.419
Continuous Attributes: Computing Gini Index
For efficient computation: sort the attribute values, then linearly scan them, each time updating a count matrix and computing the Gini index; choose the split position with the least Gini index.

Sorted values (Taxable Income): 60, 70, 75, 85, 90, 95, 100, 120, 125, 220
Cheat:                          No, No, No, Yes, Yes, Yes, No, No, No, No
Split positions (<= vs. >):     55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230
Gini at each split:             0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420

Best split: Taxable Income <= 97, with Gini = 0.300.
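The scan over candidate split positions can be sketched directly (for clarity this recomputes each partition rather than updating a count matrix incrementally as the slides suggest; the slides also floor the midpoints, e.g. 97 instead of 97.5, while this sketch keeps them exact; helper names are mine):

```python
def gini(counts):
    """Gini index from class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(values, labels, t):
    """Weighted Gini of the two partitions induced by the split value <= t."""
    left = [y for v, y in zip(values, labels) if v <= t]
    right = [y for v, y in zip(values, labels) if v > t]
    n = len(labels)
    total = 0.0
    for part in (left, right):
        if part:
            counts = [part.count(c) for c in set(part)]
            total += len(part) / n * gini(counts)
    return total

# Taxable Income and Cheat from the training data
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

xs = sorted(income)
cands = [(a + b) / 2 for a, b in zip(xs, xs[1:])]   # midpoints between sorted values
best = min(cands, key=lambda t: weighted_gini(income, cheat, t))
print(best, round(weighted_gini(income, cheat, best), 3))  # 97.5 0.3
```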
Alternative Splitting Criteria based on INFO
Entropy at a given node t:
Entropy(t) = - Σ_j p(j|t) log p(j|t)
where p(j|t) is the relative frequency of class j at node t. Maximum (log nc) when records are equally distributed among all nc classes; minimum (0.0) when all records belong to one class.
Examples for computing Entropy
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = -0 log 0 - 1 log 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Splitting Based on INFO: Information Gain
GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)
Parent node p is split into k partitions; n_i is the number of records in partition i. Measures the reduction in entropy achieved by the split: choose the split that achieves the largest gain (used in ID3 and C4.5). Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO: Gain Ratio
GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - Σ_{i=1..k} (n_i / n) log(n_i / n)
Parent node p is split into k partitions; n_i is the number of records in partition i. Adjusts information gain by the entropy of the partitioning (SplitINFO): a higher-entropy partitioning (a large number of small partitions) is penalized. Used in C4.5; designed to overcome the disadvantage of information gain.
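Entropy, information gain and gain ratio can be expressed compactly (function names are mine; counts are per-class counts, and SplitINFO is computed as the entropy of the partition sizes, which matches its formula):

```python
from math import log2

def entropy(counts):
    """Entropy of a node: -sum p log2 p over classes with nonzero count."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def info_gain(parent_counts, children_counts):
    """Reduction in entropy achieved by a split."""
    n = sum(parent_counts)
    children = sum(sum(ch) / n * entropy(ch) for ch in children_counts)
    return entropy(parent_counts) - children

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by SplitINFO (entropy of the partition sizes)."""
    split_info = entropy([sum(ch) for ch in children_counts])
    return info_gain(parent_counts, children_counts) / split_info

print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```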
Splitting Criteria based on Classification Error
Classification error at a node t:
Error(t) = 1 - max_i P(i|t)
Maximum (1 - 1/nc) when records are equally distributed among all nc classes; minimum (0.0) when all records belong to one class.
Examples for computing Error
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 - max(0, 1) = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1/3
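For a two-class node the three impurity measures can be compared side by side as functions of p, the fraction of records in class 1 (a small sketch; the helper names are mine). All three peak at p = 0.5 and vanish at p = 0 or p = 1:

```python
from math import log2

def gini(p):
    """Gini index of a two-class node with class-1 fraction p."""
    return 1 - p ** 2 - (1 - p) ** 2

def entropy(p):
    """Entropy (base 2) of a two-class node with class-1 fraction p."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q)

def error(p):
    """Misclassification error of a two-class node with class-1 fraction p."""
    return 1 - max(p, 1 - p)

for p in (0.0, 1 / 6, 2 / 6, 0.5):
    print(f"p={p:.3f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}  error={error(p):.3f}")
```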
Misclassification Error vs Gini
Parent: C1 = 7, C2 = 3; Gini = 0.42.
Split on A: N1 (Yes): C1 = 3, C2 = 0; N2 (No): C1 = 4, C2 = 3.
Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(Children) = 3/10 x 0 + 7/10 x 0.489 = 0.342
Gini improves, even though the misclassification error stays the same (3/10 before and after).
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class.
Stop expanding a node when all the records have similar attribute values.
Early termination (to be discussed later).
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy comparable to other classification techniques for many simple data sets
Example: C4.5
Simple depth-first construction
Uses information gain
Sorts continuous attributes at each node
Needs the entire data to fit in memory; unsuitable for large datasets (would need out-of-core sorting)
You can download the software from:
http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical Issues of Classification
Underfitting and overfitting
Missing values
Costs of classification
Underfitting and Overfitting (Example)
Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5
Underfitting: when the model is too simple, both training and test errors are large.
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region: an insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
Overfitting results in decision trees that are more complex than necessary.
Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
Need new ways of estimating errors.
Occam's Razor
Given two models with similar generalization errors, one should prefer the simpler model over the more complex model: for complex models, there is a greater chance that the model was fitted accidentally by errors in the data. Therefore, one should include model complexity when evaluating a model.
Minimum Description Length (MDL)
(Figure: person A transmits a classification model for records X1, ..., Xn, plus the identities of the instances the model mislabels, to person B; the best model minimizes Cost(Model) + Cost(Data | Model).)
How to Address Overfitting
Pre-pruning (early stopping rule): stop the algorithm before it grows a fully-grown tree.
Post-pruning: grow the decision tree to its entirety, then trim its nodes bottom-up; if generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is the majority class of the sub-tree's instances.
Example of Post-Pruning
Root node: Class = Yes: 20, Class = No: 10.
Training error (before splitting) = 10/30; pessimistic error = (10 + 0.5)/30 = 10.5/30.
After splitting on A into leaves A1, A2, A3, A4 (each predicting its majority class): training error (after splitting) = 9/30; pessimistic error (after splitting) = (9 + 4 x 0.5)/30 = 11/30.
Since 11/30 > 10.5/30: PRUNE!
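The pruning decision amounts to comparing pessimistic error estimates before and after splitting (a minimal sketch; the function name and the 0.5-per-leaf penalty follow the example above):

```python
def pessimistic_error(errors, leaves, n, penalty=0.5):
    """Pessimistic training error: (misclassifications + penalty per leaf) / n."""
    return (errors + penalty * leaves) / n

# before splitting: one leaf, 10 of 30 records misclassified
before = pessimistic_error(10, 1, 30)   # 10.5/30 = 0.35
# after splitting on A: four leaves, 9 of 30 records misclassified
after = pessimistic_error(9, 4, 30)     # 11/30 = 0.3667 (approx)
print("PRUNE" if after >= before else "KEEP")  # PRUNE
```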
Examples of Post-pruning
Case 1: children with (C0: 11, C1: 3) and (C0: 2, C1: 4)
Case 2: children with (C0: 14, C1: 3) and (C0: 2, C1: 2)
Optimistic error? Do not prune in either case (splitting never increases the optimistic training error).
Pessimistic error (penalty 0.5 per leaf)? Do not prune case 1, prune case 2.
Computing Impurity Measure with Missing Values

             Class=Yes  Class=No
Refund=Yes      0          3
Refund=No       2          4
Refund=?        1          0

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813
Split on Refund (ignoring the record with the missing value):
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 x 0 + 0.6 x 0.9183 = 0.551
Gain = 0.9 x (0.8813 - 0.551) = 0.3303
Distribute Instances
(Figure: the training record with Refund = ? is sent down both branches, with weight 3/9 to the Refund=Yes child and weight 6/9 to the Refund=No child, matching the class distribution of the records whose Refund value is known.)
Classify Instances
New record: Refund = ?, Marital Status = Married.
At the MarSt node (under Refund = No), the weighted class counts are:

           Married  Single  Divorced  Total
Class=No     3        1       0        4
Class=Yes    6/9      1       1        2.67
Total        3.67     2       1        6.67

Probability that Marital Status = Married is 3.67/6.67.

Refund?
|-- Yes -> NO
`-- No -> MarSt?
    |-- Single, Divorced -> TaxInc?
    |   |-- < 80K -> NO
    |   `-- > 80K -> YES
    `-- Married -> NO
Other Issues
Data fragmentation
Search strategy
Expressiveness
Tree replication
Data Fragmentation
Number of instances gets smaller as you traverse down the tree.
Number of instances at the leaf nodes could be too small to make any statistically significant decision.
Search Strategy
Finding an optimal decision tree is NP-hard.
The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution.
Other strategies? Bottom-up; bi-directional.
Expressiveness
A decision tree provides an expressive representation for learning discrete-valued functions, but it does not generalize well to certain types of Boolean functions (e.g. parity) and is not expressive enough for modeling continuous variables, particularly when each test condition involves only a single attribute at a time.
Decision Boundary
The border between two neighboring regions of different classes is known as the decision boundary; for a decision tree the boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees
A test condition may involve multiple attributes, e.g. x + y < 1 separating Class = + from Class = -. This is more expressive, but finding optimal test conditions is computationally expensive.
Tree Replication
The same subtree can appear in multiple branches.
Model Evaluation
Metrics for performance evaluation: how to evaluate the performance of a model?
Methods for performance evaluation: how to obtain reliable estimates?
Methods for model comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation: Confusion Matrix

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)

a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)
Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
If the model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%. Accuracy is misleading here because the model does not detect any Class 1 example.
Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes     Class=No
ACTUAL   Class=Yes     C(Yes|Yes)    C(No|Yes)
CLASS    Class=No      C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification

Cost matrix:      PREDICTED CLASS
C(i|j)            +       -
ACTUAL    +      -1      100
CLASS     -       1        0

Model M1:         PREDICTED CLASS
                  +       -
ACTUAL    +      150      40
CLASS     -       60     250
Accuracy = 80%, Cost = 150 x (-1) + 40 x 100 + 60 x 1 + 250 x 0 = 3910

Model M2:         PREDICTED CLASS
                  +       -
ACTUAL    +      250      45
CLASS     -        5     200
Accuracy = 90%, Cost = 4255

Despite its higher accuracy, M2 has a higher cost.
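The accuracy and cost computations can be checked in a few lines (the helper names are mine; M2's false-positive count of 5 is the value implied by 90% accuracy on 500 instances, since only its other three cells appear explicitly in the slide):

```python
def accuracy(cm):
    """cm = [[a, b], [c, d]]: rows = actual (+, -), columns = predicted (+, -)."""
    (a, b), (c, d) = cm
    return (a + d) / (a + b + c + d)

def cost(cm, cost_matrix):
    """Element-wise product of confusion matrix and cost matrix, summed."""
    return sum(n * k for row, crow in zip(cm, cost_matrix)
               for n, k in zip(row, crow))

C = [[-1, 100], [1, 0]]           # C(i|j) from the slide
M1 = [[150, 40], [60, 250]]
M2 = [[250, 45], [5, 200]]
print(accuracy(M1), cost(M1, C))  # 0.8 3910
print(accuracy(M2), cost(M2, C))  # 0.9 4255
```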
Cost vs Accuracy

Count:            PREDICTED CLASS
                  Class=Yes  Class=No
ACTUAL Class=Yes     a          b
CLASS  Class=No      c          d

N = a + b + c + d; Accuracy = (a + d)/N

Cost:             PREDICTED CLASS
                  Class=Yes  Class=No
ACTUAL Class=Yes     p          q
CLASS  Class=No      q          p

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N [q - (q - p) x Accuracy]

Accuracy is proportional to cost if C(Yes|No) = C(No|Yes) = q and C(Yes|Yes) = C(No|No) = p.
Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
Weighted accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
Precision is biased towards C(Yes|Yes) and C(Yes|No); recall is biased towards C(Yes|Yes) and C(No|Yes); the F-measure is biased towards all except C(No|No).
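A quick sketch of these measures, evaluated on the M1 counts from the cost example (a = 150 true positives, b = 40 false negatives, c = 60 false positives; the function name is mine):

```python
def prf(a, b, c):
    """Precision, recall and F-measure from cells a (TP), b (FN), c (FP)."""
    p = a / (a + c)          # precision
    r = a / (a + b)          # recall
    f = 2 * r * p / (r + p)  # harmonic mean of precision and recall
    return p, r, f

p, r, f = prf(a=150, b=40, c=60)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.714 0.789 0.75
```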
Methods for Performance Evaluation
How to obtain a reliable estimate of performance?
Performance of a model may depend on other factors besides the learning algorithm: class distribution, cost of misclassification, size of training and test sets.
Learning Curve
A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule, e.g.:
Arithmetic sampling (Langley et al.)
Geometric sampling (Provost et al.)
Effect of small sample size: bias in the estimate; variance of the estimate.
Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing
Random subsampling: repeated holdout
Cross validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
Stratified sampling: oversampling vs. undersampling
Bootstrap: sampling with replacement
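The k-fold scheme in the list above can be sketched generically; `estimate` stands for any routine that trains on one set and scores on another (all names here are illustrative):

```python
import random

def kfold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(estimate, data, k=5):
    """Average the score of estimate(train, test) over k train/test splits."""
    folds = kfold_indices(len(data), k)
    scores = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(estimate(train, test))
    return sum(scores) / k

# toy check: a majority-class "classifier" on a list of labels
labels = ["No"] * 7 + ["Yes"] * 3
def majority_acc(train, test):
    pred = max(set(train), key=train.count)
    return sum(y == pred for y in test) / len(test)

print(cross_validate(majority_acc, labels, k=5))
```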
Methods for Model Comparison
How to compare the relative performance among competing models?
87
4/18/2004
88
ROC Curve
Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive.
At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88 (expressed as rates).
ROC Curve
Points of interest (TP, FP):
(0, 0): declare everything to be negative class
(1, 1): declare everything to be positive class
(1, 0): ideal
Diagonal line: random guessing; below the diagonal line, the prediction is opposite of the true class.
Using ROC for Model Comparison
Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
Area under the ROC curve: ideal, area = 1; random guess, area = 0.5.
How to Construct an ROC Curve
Use a classifier that produces a posterior probability P(+|A) for each test instance A. Sort the instances according to P(+|A) in decreasing order: 0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25. Apply a threshold at each unique value of P(+|A); at each threshold count TP, FP, TN, FN, and compute TPR = TP/(TP + FN) and FPR = FP/(FP + TN). Plotting the (FPR, TPR) pairs gives the ROC curve.
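The threshold-sweeping construction can be sketched as follows (the scores and labels below are hypothetical stand-ins, not the slide's instances, whose true-class column did not survive; ties among scores are not grouped, for simplicity):

```python
def roc_points(scores, labels):
    """(FPR, TPR) at every threshold, sweeping from above the top score downward."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]          # threshold above every score: all negative
    tp = fp = 0
    for _, y in pairs:             # lower the threshold one instance at a time
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# hypothetical posterior probabilities and true labels (1 = positive)
pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0])
print(pts)  # starts at (0.0, 0.0) and ends at (1.0, 1.0)
```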
Test of Significance
Given two models, M1 with accuracy = 85% tested on 30 instances and M2 with accuracy = 75% tested on 5000 instances: can we say M1 is better than M2? How much confidence can we place on the accuracy of M1 and M2? Can the difference in performance be explained as a result of random fluctuations in the test set?
Confidence Interval for Accuracy
Each prediction can be regarded as a Bernoulli trial, so the number of correct predictions x has a binomial distribution: x ~ Bin(N, p).
e.g.: Toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N x p = 50 x 0.5 = 25.
Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p, the true accuracy of the model?
For large N, acc has a normal distribution with mean p and variance p(1 - p)/N:
P( Z_{a/2} < (acc - p) / sqrt(p(1 - p)/N) < Z_{1-a/2} ) = 1 - a
(Figure: standard normal density with area 1 - a between Z_{a/2} and Z_{1-a/2}, and a/2 in each tail.)
Solving for p gives the confidence interval:
p = ( 2 N acc + Z^2 +/- Z sqrt(Z^2 + 4 N acc - 4 N acc^2) ) / ( 2 (N + Z^2) ),  with Z = Z_{a/2}
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances (N = 100, acc = 0.8). At 95% confidence (1 - a = 0.95), Z_{a/2} = 1.96.

Standard normal quantiles:
1 - a:  0.99  0.98  0.95  0.90
Z:      2.58  2.33  1.96  1.65

Confidence interval for the true accuracy p as N grows (acc = 0.8, 95% confidence):
N:         50     100    500    1000   5000
p(lower):  0.670  0.711  0.763  0.774  0.789
p(upper):  0.888  0.866  0.833  0.824  0.811
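The interval formula reproduces the table's values (a small sketch; the function name is mine, and small last-digit differences against the slide come from rounding):

```python
from math import sqrt

def acc_confidence_interval(acc, n, z=1.96):
    """Binomial confidence interval for the true accuracy p (95% by default)."""
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    low, high = acc_confidence_interval(0.8, n)
    print(n, round(low, 3), round(high, 3))
```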
Comparing Performance of 2 Models
Given two models, say M1 and M2, which is better? M1 is tested on D1 (size n1) with error rate e1; M2 is tested on D2 (size n2) with error rate e2. Assume D1 and D2 are independent. If n1 and n2 are sufficiently large, then approximately:
e1 ~ N(mu1, sigma1), e2 ~ N(mu2, sigma2)
with the variance of e_i approximated by e_i (1 - e_i) / n_i.
Comparing Performance of 2 Models
To test whether the performance difference is statistically significant, consider d = e1 - e2, which is approximately normal with mean d_t (the true difference) and variance sigma_t^2. Since D1 and D2 are independent, the variances add:
sigma_t^2 = e1 (1 - e1) / n1 + e2 (1 - e2) / n2
At the (1 - a) confidence level: d_t = d +/- Z_{a/2} sigma_t
An Illustrative Example
Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25.
d = |e2 - e1| = 0.1 (2-sided test)
sigma_d^2 = 0.15 x (1 - 0.15)/30 + 0.25 x (1 - 0.25)/5000 = 0.0043
At 95% confidence, Z_{a/2} = 1.96: d_t = 0.100 +/- 1.96 x sqrt(0.0043) = 0.100 +/- 0.128.
The interval contains 0, so the difference may not be statistically significant.
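The significance check from the example can be written directly (function name mine; a difference is deemed significant here only when the whole interval lies above zero):

```python
from math import sqrt

def diff_interval(e1, n1, e2, n2, z=1.96):
    """Confidence interval for the true difference in error rates of two models."""
    d = abs(e2 - e1)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half = z * sqrt(var)
    return d - half, d + half

low, high = diff_interval(0.15, 30, 0.25, 5000)
print(round(low, 3), round(high, 3))  # the interval straddles 0
significant = low > 0
print(significant)  # False
```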
Comparing Performance of 2 Algorithms
Each learning algorithm may produce k models: L1 may produce M11, M12, ..., M1k and L2 may produce M21, M22, ..., M2k. If the models are tested on the same test sets D1, D2, ..., Dk (e.g. via cross-validation), then for each test set compute d_j = e1j - e2j. For small k, d_bar (the mean of the d_j) follows a t-distribution with k - 1 degrees of freedom:
sigma_t^2 = Σ_{j=1..k} (d_j - d_bar)^2 / (k (k - 1))
d_t = d_bar +/- t_{1-a, k-1} sigma_t