Tan, Steinbach, Kumar, Introduction to Data Mining (4/18/2004)
Rule-Based Classifier

Classify records by using a collection of if-then rules.

Rule: (Condition) → y

where Condition is a conjunction of attribute tests and y is the class label.

Examples:
  (Lay Eggs=Yes) → Birds
  (Refund=Yes) → Evade=No
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Application of the Rule-Based Classifier

Rule set learned from the table above:
  R1: (Give Birth=no) ∧ (Can Fly=yes) → Birds
  R2: (Give Birth=no) ∧ (Live in Water=yes) → Fishes
  R3: (Give Birth=yes) ∧ (Blood Type=warm) → Mammals
  R4: (Give Birth=no) ∧ (Can Fly=no) → Reptiles
  R5: (Live in Water=sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule R1 covers the hawk → Birds.
Rule R3 covers the grizzly bear → Mammals.
Rule Coverage and Accuracy

Coverage of a rule: fraction of records that satisfy the antecedent of the rule.
Accuracy of a rule: fraction of covered records that also satisfy the consequent of the rule.

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Rule: (Status=Single) → No
Coverage = 40% (4 of the 10 records are Single), Accuracy = 50% (2 of those 4 have class No).
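As a sketch of the two measures, coverage and accuracy of the rule (Status=Single) → No can be computed over the ten-record table above (the `coverage_and_accuracy` helper is illustrative, not from the slides):

```python
# Training records from the slide's table.
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

def coverage_and_accuracy(records, antecedent, consequent):
    """antecedent: dict of attribute tests; consequent: predicted class."""
    covered = [r for r in records
               if all(r[a] == v for a, v in antecedent.items())]
    coverage = len(covered) / len(records)
    accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)
    return coverage, accuracy

cov, acc = coverage_and_accuracy(records, {"Status": "Single"}, "No")
print(cov, acc)  # -> 0.4 0.5
```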
How Does a Rule-Based Classifier Work?

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

A lemur triggers rule R3, so it is classified as a mammal.
A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
From Decision Trees to Rules

                 Refund
                /      \
             Yes        No
              |          |
             NO     Marital Status
                   /             \
        {Single, Divorced}    {Married}
              |                   |
        Taxable Income           NO
           /       \
        < 80K     > 80K
          |          |
         NO         YES

Each path from the root to a leaf yields one classification rule; the resulting
rule set is mutually exclusive and exhaustive.
Rules Can Be Simplified

Using the decision tree and training data above, the rule extracted for the
{Married} leaf can be simplified:

Initial Rule:    (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Example test record:

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
Rule Ordering Schemes

Rule-based ordering: individual rules are ranked according to their quality.
Class-based ordering: rules that belong to the same class appear together.

Example rule (first under either ordering): (Refund=Yes) → No
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks).
Example: C4.5rules
[Figure: Example of the sequential covering algorithm: (ii) Step 1, (iii) Step 2
(rule R1 found), (iv) Step 3 (rules R1 and R2 found).]
Aspects of sequential covering:
  Rule Growing
  Instance Elimination
  Rule Evaluation
  Stopping Criterion
  Rule Pruning
Rule Growing

Two common strategies:

(a) General-to-specific: start from an empty rule {} → Class=Yes, which covers
    Yes: 3, No: 4, and greedily add the best conjunct:

      Refund=No        Yes: 3, No: 4
      Status=Single    Yes: 2, No: 1
      Status=Divorced  Yes: 1, No: 0
      Status=Married   Yes: 0, No: 3
      ...
      Income>80K       Yes: 3, No: 1

(b) Specific-to-general: start from a specific rule, e.g.

      (Refund=No, Status=Single, Income=85K) → Class=Yes
      (Refund=No, Status=Single, Income=90K) → Class=Yes

    and generalize by dropping conjuncts, e.g. to
    (Refund=No, Status=Single) → Class=Yes.
RIPPER Algorithm:

Start from an empty rule: {} → class.
Add the conjunct that maximizes FOIL's information gain measure:

  R0: {} → class      (initial rule)
  R1: {A} → class     (rule after adding a conjunct)

  Gain(R0, R1) = t × [ log(p1/(p1+n1)) − log(p0/(p0+n0)) ]

where
  t:  number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1
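FOIL's gain can be sketched directly from the formula (`foil_gain` is my name for the helper; t is taken as p1, since every positive instance covered by the more specific R1 is also covered by R0):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 (covers p0 positives,
    n0 negatives) to rule R1 (covers p1 positives, n1 negatives)."""
    t = p1  # positives covered by both rules, for a specialization of R0
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Growing from {} -> Yes (covers Yes: 3, No: 4), as in the rule-growing example:
gain_single = foil_gain(3, 4, 2, 1)    # add Status=Single   (Yes: 2, No: 1)
gain_divorced = foil_gain(3, 4, 1, 0)  # add Status=Divorced (Yes: 1, No: 0)
print(round(gain_single, 3), round(gain_divorced, 3))  # -> 1.275 1.222
```

Note that FOIL's gain prefers Status=Single here: it trades a little purity for higher positive coverage.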
Instance Elimination

Why do we need to eliminate instances?
  Otherwise, the next rule learned is identical to the previous rule.
Why do we remove positive instances?
  To ensure that the next rule is different.
Why do we remove negative instances?
  To prevent underestimating the accuracy of a rule.
  Compare rules R2 and R3 in the diagram.

[Figure: a cloud of + and − instances with rule regions R1, R2, and R3;
class = + inside a region, class = − outside. Instances covered by R1 are
removed before the next rule is grown.]
Rule Evaluation

Metrics:
  Accuracy   = nc / n
  Laplace    = (nc + 1) / (n + k)
  M-estimate = (nc + k·p) / (n + k)

where
  n:  number of instances covered by the rule
  nc: number of covered instances belonging to the rule's class
  k:  number of classes
  p:  prior probability of the rule's class
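The three metrics can be sketched as one-liners (the example numbers below are illustrative, not from the slides). In the slides' form the m-estimate uses m = k, so with p = 1/k it reduces to the Laplace estimate:

```python
def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    # The slides' form uses m = k: (nc + k*p) / (n + k)
    return (nc + k * p) / (n + k)

# Illustrative rule: covers n=50 records, nc=45 of its class, k=2 classes.
print(accuracy(45, 50))            # 0.9
print(laplace(45, 50, 2))          # 46/52, slightly below the raw accuracy
print(m_estimate(45, 50, 2, 0.5))  # equals the Laplace value when p = 1/k
```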
For each rule r, compare the rule set containing r against the rule sets
containing its replacement r* and its revision r', and choose the rule set
that minimizes the total description length (MDL principle).
Indirect Methods

[Figure: a decision tree with root P (Yes/No branches), a node Q under P=No,
a node R under P=Yes, and another node Q under R=Yes; each leaf of the tree is
converted into one rule of the rule set.]
Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds
C4.5 (tree):

        Give Birth?
        /         \
      Yes          No
       |            |
    Mammals   Live In Water?
              /     |      \
            Yes  Sometimes   No
             |      |         |
         Fishes Amphibians  Can Fly?
                            /      \
                          Yes       No
                           |         |
                         Birds    Reptiles

C4.5rules:
  (Give Birth=No, Can Fly=Yes) → Birds
  (Give Birth=No, Live in Water=Yes) → Fishes
  (Give Birth=Yes) → Mammals
  (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
  ( ) → Amphibians

RIPPER:
  (Live in Water=Yes) → Fishes
  (Have Legs=No) → Reptiles
  (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
  (Can Fly=Yes, Give Birth=No) → Birds
  ( ) → Mammals
C4.5 and C4.5rules:

                          PREDICTED CLASS
                  Amphibians  Fishes  Reptiles  Birds  Mammals
ACTUAL Amphibians     2         0        0        0       0
CLASS  Fishes         0         2        0        0       1
       Reptiles       1         0        3        0       0
       Birds          1         0        0        3       0
       Mammals        0         0        1        0       6

RIPPER:

                          PREDICTED CLASS
                  Amphibians  Fishes  Reptiles  Birds  Mammals
ACTUAL Amphibians     0         0        0        0       2
CLASS  Fishes         0         3        0        0       0
       Reptiles       0         0        3        0       1
       Birds          0         0        1        2       1
       Mammals        0         2        1        0       4
Instance-Based Classifiers

[Figure: a set of stored cases with attributes Atr1 ... AtrN and class labels
(A, B, B, C, A, C, B); an unseen case with the same attributes is matched
against the stored cases to predict its class.]

Store the training records and use them directly to predict the class label of
unseen cases.
Nearest neighbor: uses the k closest points (nearest neighbors) to perform
classification.
[Figure: a test record among the training records; the k nearest training
records are chosen and vote on the class.]
Nearest-Neighbor Classifiers

[Figure: an unknown record surrounded by training points; its k nearest
neighbors determine its class.]

1-nearest neighbor: the decision boundary forms a Voronoi diagram of the
training points.
Distance between two points (Euclidean):

  d(p, q) = √( Σ_i (p_i − q_i)² )
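A minimal k-nearest-neighbor classifier using this Euclidean distance; the 2-D training set and the helper name `knn_predict` are made up for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k):
    """train: list of (point, label) pairs. Classify query by majority vote
    among the k nearest training points under Euclidean distance."""
    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "+"), ((2, 1), "+"),
         ((8, 8), "-"), ((9, 8), "-"), ((8, 9), "-")]
print(knn_predict(train, (2, 2), 3))  # -> '+'
```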
Euclidean distance can give counter-intuitive results on high-dimensional
binary data:

  1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1    d = 1.4142
  1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1    d = 1.4142

One solution is to normalize the vectors to unit length.
Example: PEBLS

PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
  Works with both continuous and nominal features.
  For nominal features, the distance between two nominal values is computed
  using the modified value difference metric (MVDM).
Example: PEBLS

Distance between two nominal attribute values:

  d(V1, V2) = Σ_i | n1i/n1 − n2i/n2 |

where n1 is the number of records with value V1, and n1i is the number of
those records that belong to class i.

Class counts from the training table:

  Marital Status counts:
    Class  Single  Married  Divorced
    Yes    2       0        1
    No     2       4        1

  Refund counts:
    Class  Yes  No
    Yes    0    3
    No     3    4

  d(Single, Married)   = |2/4 − 0/4| + |2/4 − 4/4| = 1
  d(Single, Divorced)  = |2/4 − 1/2| + |2/4 − 1/2| = 0
  d(Married, Divorced) = |0/4 − 1/2| + |4/4 − 1/2| = 1
  d(Refund=Yes, Refund=No) = |0/3 − 3/7| + |3/3 − 4/7| = 6/7
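The MVDM distance can be sketched from the class-count table above (`mvdm` is an illustrative helper name; the counts are read off the Marital Status table):

```python
# Class counts per attribute value, from the Marital Status table above.
status_counts = {
    "Single":   {"Yes": 2, "No": 2},
    "Married":  {"Yes": 0, "No": 4},
    "Divorced": {"Yes": 1, "No": 1},
}

def mvdm(counts, v1, v2):
    """Modified value difference metric between two nominal values."""
    n1 = sum(counts[v1].values())
    n2 = sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2)
               for c in counts[v1])

print(mvdm(status_counts, "Single", "Married"))   # -> 1.0
print(mvdm(status_counts, "Single", "Divorced"))  # -> 0.0
```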
Example: PEBLS

Distance between records X and Y:

  Δ(X, Y) = w_X · w_Y · Σ_{i=1..d} d(X_i, Y_i)²

where the weight factor
w_X = (number of times X is used for prediction) / (number of times X predicts
correctly); w_X ≈ 1 if X makes accurate predictions most of the time, and
w_X > 1 if it does not.
Bayes Classifier

A probabilistic framework for solving classification problems.

Conditional probability:

  P(C|A) = P(A, C) / P(A)
  P(A|C) = P(A, C) / P(C)

Bayes theorem:

  P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem: given that a stiff neck (S) occurs in 50% of
meningitis (M) cases, with priors P(M) = 1/50000 and P(S) = 1/20:

  P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
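The arithmetic of the example, spelled out (variable names are mine):

```python
# Bayes theorem applied to the stiff-neck example: P(M|S) = P(S|M) P(M) / P(S)
p_s_given_m = 0.5    # P(stiff neck | meningitis)
p_m = 1 / 50000      # prior P(meningitis)
p_s = 1 / 20         # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # -> 0.0002
```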
Bayesian Classifiers

Consider each attribute and the class label as random variables.
Given a record with attributes (A1, A2, …, An), the goal is to predict class C.
Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
Bayesian Classifiers

Approach: compute the posterior probability P(C | A1, A2, …, An) for all values
of C using Bayes theorem:

  P(C | A1 A2 … An) = P(A1 A2 … An | C) P(C) / P(A1 A2 … An)

Choose the value of C that maximizes P(C | A1, A2, …, An); this is equivalent
to maximizing P(A1 A2 … An | C) P(C), since the denominator is the same for
every class.
Naive Bayes assumption: attributes are independent given the class, so it
suffices to estimate P(Cj) and P(Ai | Cj) for all Ai and Cj from the data.
How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical; Taxable Income is continuous;
Evade is the class.)

For discrete attributes:

  P(Ai | Ck) = |Aik| / Nc

where |Aik| is the number of instances of class Ck with attribute value Ai,
and Nc is the number of instances of class Ck.

Examples:
  P(No) = 7/10, P(Yes) = 3/10
  P(Status=Married | No) = 4/7
  P(Refund=Yes | Yes) = 0
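The discrete estimates can be reproduced directly from the table (`p_attr_given_class` is an illustrative helper):

```python
from collections import Counter

# (Refund, Marital Status, Evade) for the 10 training records above.
records = [("Yes", "Single", "No"), ("No", "Married", "No"),
           ("No", "Single", "No"), ("Yes", "Married", "No"),
           ("No", "Divorced", "Yes"), ("No", "Married", "No"),
           ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
           ("No", "Married", "No"), ("No", "Single", "Yes")]

class_counts = Counter(r[-1] for r in records)
p_no = class_counts["No"] / len(records)  # prior P(No)

def p_attr_given_class(attr_idx, value, cls):
    """P(Ai = value | Class = cls) = |Aik| / Nc."""
    aik = sum(1 for r in records if r[attr_idx] == value and r[-1] == cls)
    return aik / class_counts[cls]

print(p_no)                                    # -> 0.7
print(p_attr_given_class(1, "Married", "No"))  # -> 4/7 ≈ 0.571
print(p_attr_given_class(0, "Yes", "Yes"))     # -> 0.0
```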
How to Estimate Probabilities from Data? (continuous attributes)

For a continuous attribute, assume a normal distribution per (attribute, class)
pair:

  P(Ai | cj) = 1/√(2π σij²) · exp( −(Ai − μij)² / (2 σij²) )

For (Income, Class=No): sample mean μ = 110, sample variance σ² = 2975
(σ = 54.54), so

  P(Income=120 | No) = 1/(√(2π) · 54.54) · exp( −(120 − 110)² / (2 × 2975) )
                     = 0.0072

Example of Naive Bayes Classifier: for the test record
X = (Refund=No, Status=Married, Income=120K):

  P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024

Since P(X|No) P(No) > P(X|Yes) P(Yes), predict Class = No.
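A sketch of the continuous-attribute estimate, reproducing the 0.0072 figure from the class-No incomes (helper names are mine):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Class-conditional density under a normal distribution."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Taxable income values for Class=No in the table above.
incomes_no = [125, 100, 70, 120, 60, 220, 75]
mu = sum(incomes_no) / len(incomes_no)                                # 110.0
var = sum((x - mu) ** 2 for x in incomes_no) / (len(incomes_no) - 1)  # 2975.0

print(round(normal_pdf(120, mu, var), 4))  # -> 0.0072
```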
If one of the conditional probabilities is zero, the entire product becomes
zero. Alternative probability estimates:

  Original:    P(Ai | C) = Nic / Nc
  Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)

where c: number of classes, p: prior probability, m: parameter.
Example of Naive Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A = (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
A: attributes, M: mammals (7 records), N: non-mammals (13 records)

  P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042

  P(A|M) P(M) = 0.06 × 7/20 = 0.021
  P(A|N) P(N) = 0.0042 × 13/20 = 0.0027

P(A|M) P(M) > P(A|N) P(N)  →  Mammals
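The posterior comparison can be checked in a few lines (the fractions are the conditional counts read off the table above):

```python
# Test record A = (Give Birth=yes, Can Fly=no, Live in Water=yes, Have Legs=no)
p_a_given_m = (6/7) * (6/7) * (2/7) * (2/7)       # P(A | mammals)
p_a_given_n = (1/13) * (10/13) * (3/13) * (4/13)  # P(A | non-mammals)

score_m = p_a_given_m * 7 / 20    # P(A|M) P(M), 7 mammals out of 20
score_n = p_a_given_n * 13 / 20   # P(A|N) P(N), 13 non-mammals out of 20

print(round(score_m, 3), round(score_n, 4))        # -> 0.021 0.0027
print("mammals" if score_m > score_n else "non-mammals")  # -> mammals
```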
SVM also works very well with high-dimensional data and avoids the curse of
dimensionality. Another unique aspect of this approach is that it represents
the decision boundary using a subset of the training examples, known as the
support vectors.
Support Vector Machines

Find a linear hyperplane (decision boundary) that separates the data.
[Figure: two candidate decision boundaries B1 and B2 separating the two
classes, with margin hyperplanes b11, b12 and b21, b22. Both separate the
data, but B1 has the larger margin and is therefore preferred.]
Linear SVM decision boundary:  w·x + b = 0

Margin hyperplanes:  w·x + b = +1  and  w·x + b = −1

  f(x) =  1   if w·x + b ≥ 1
         −1   if w·x + b ≤ −1

  Margin = 2 / ||w||²
We want to maximize:  Margin = 2 / ||w||²

This is equivalent to minimizing:  L(w) = ||w||² / 2

subject to the constraints:

  f(xi) =  1   if w·xi + b ≥ 1
          −1   if w·xi + b ≤ −1

This is a constrained optimization problem that can be solved numerically
(e.g., by quadratic programming).
What if the problem is not linearly separable? Introduce slack variables ξi and
minimize

  L(w) = ||w||²/2 + C Σ_{i=1..k} ξi

subject to:

  f(xi) =  1   if w·xi + b ≥ 1 − ξi
          −1   if w·xi + b ≤ −1 + ξi
Characteristics of SVM

SVM has many desirable qualities that make it one of the most widely used
classification algorithms. The following is a summary of its general
characteristics:

1. The SVM learning problem can be formulated as a convex optimization problem,
   for which efficient algorithms are available to find the global minimum of
   the objective function.
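A sketch of the linear SVM decision function and margin for already-learned parameters (the values of w and b below are made up for illustration; the training itself would require solving the quadratic program above):

```python
# Hypothetical learned parameters of a linear decision boundary w·x + b = 0.
w = [1.0, 1.0]
b = -3.0

def f(x):
    """Classify by which side of the boundary x falls on: sign(w·x + b)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

# The slides' margin measure, 2 / ||w||^2.
margin = 2 / sum(wi * wi for wi in w)

print(f((3, 3)), f((0, 0)))  # -> 1 -1
print(margin)                # -> 1.0
```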
Ensemble Methods

Ensemble (or classifier combination) methods are techniques for improving
classification accuracy by aggregating the predictions of multiple classifiers.

An ensemble method constructs a set of base classifiers from the training data
and performs classification by taking a vote on the predictions made by each
base classifier.
General Idea

Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original
        training data D.
Step 2: Build a classifier Ci on each data set Di (C1, C2, …, Ct-1, Ct).
Step 3: Combine the classifiers into a single ensemble classifier C*.
Why does it work? Suppose there are 25 base classifiers, each with error rate
ε = 0.35. If the base classifiers are independent, the ensemble makes a wrong
prediction only when more than half of them are wrong:

  P(ensemble error) = Σ_{i=13..25} C(25, i) ε^i (1 − ε)^{25−i} = 0.06
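The 0.06 figure can be verified by summing the binomial tail (a direct transcription of the formula; `ensemble_error` is my name for it):

```python
from math import comb

def ensemble_error(n, eps):
    """Probability that a majority vote of n independent base classifiers,
    each with error rate eps, is wrong (i.e., more than half of them err)."""
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(n // 2 + 1, n + 1))

print(round(ensemble_error(25, 0.35), 2))  # -> 0.06, far below the base 0.35
```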
Bagging

Sampling with replacement:

Original Data      1   2   3   4   5   6   7   8   9   10
Bagging (Round 1)  7   8   10  8   2   5   10  10  5   9
Bagging (Round 2)  1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)  1   8   5   10  5   5   9   6   3   7

Build a classifier on each bootstrap sample.
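Bootstrap rounds like those above can be generated by sampling with replacement (the seed and round count here are arbitrary):

```python
import random

def bootstrap_sample(data, rng):
    """Sampling with replacement: each round draws len(data) records from the
    original data, so some records appear several times and others not at all."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = list(range(1, 11))
for round_no in range(1, 4):
    print(f"Bagging (Round {round_no}):", bootstrap_sample(original, rng))
```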
Boosting

An iterative procedure that adaptively changes the distribution of the training
data, focusing more on previously misclassified records.

Initially, all N records are assigned equal weights. Unlike bagging, the
weights may change at the end of each boosting round.
Boosting

Records that are wrongly classified will have their weights increased;
records that are classified correctly will have their weights decreased.

Original Data       1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)  4   4   8   10  4   5   4   6   3   4

Example 4 is hard to classify: its weight is increased, so it is more likely
to be sampled again in later rounds.
Example: AdaBoost

Base classifiers: C1, C2, …, CT

Weighted error rate of classifier Ci:

  εi = (1/N) Σ_{j=1..N} wj δ( Ci(xj) ≠ yj )

Importance of a classifier:

  αi = (1/2) ln( (1 − εi) / εi )
Example: AdaBoost

Weight update:

  wi^(j+1) = (wi^(j) / Zj) × exp(−αj)   if Cj(xi) = yi
             (wi^(j) / Zj) × exp(+αj)   if Cj(xi) ≠ yi

where Zj is a normalization factor so that the updated weights sum to 1.

Final classification:

  C*(x) = arg max_y Σ_{j=1..T} αj δ( Cj(x) = y )
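A sketch of one round of AdaBoost weight updates (the ten-point setup with three misclassified points is made up for illustration; after normalization the misclassified points carry half the total weight):

```python
from math import log, exp

def alpha(err):
    """Importance of a base classifier with weighted error rate err."""
    return 0.5 * log((1 - err) / err)

def update_weights(weights, correct, a):
    """Multiply by exp(-a) if correctly classified, exp(+a) if misclassified,
    then normalize by Z so the weights sum to 1."""
    new = [w * exp(-a if c else a) for w, c in zip(weights, correct)]
    z = sum(new)
    return [w / z for w in new]

# 10 points with equal weight 0.1; suppose round 1 misclassifies 3 of them,
# for a weighted error of 0.3.
weights = [0.1] * 10
correct = [True] * 7 + [False] * 3
a = alpha(0.3)
weights = update_weights(weights, correct, a)
print(round(a, 4))                                  # -> 0.4236
print(round(weights[0], 4), round(weights[-1], 4))  # -> 0.0714 0.1667
```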
Illustrating AdaBoost

Initial weights for each data point: 0.1.

[Figure: Boosting round 1 (B1): misclassified points have their weights
increased to 0.4623 while correctly classified points drop to 0.0094;
α = 1.9459.]
Illustrating AdaBoost (continued)

[Figure: Boosting rounds 2 and 3. Round 2 (B2): point weights 0.0009, 0.3037,
and 0.0422; α = 2.9323. Round 3 (B3): point weights 0.0276, 0.1819, and
0.0038; α = 3.8744. The overall classifier combines the three rounds and
separates the + and − points correctly.]
THANK YOU