CS181 Lecture 7:
Decision trees
Decision trees
Classifier model
Nodes: attributes
Edges: attribute values
Leaves: classes
ID3 Pseudocode
[Figure: example decision tree — internal nodes labeled with attributes X1, X2, and X4, edges labeled with values T/F, leaves labeled with classes such as N]
ID3(D, X) =
  Let T be a new tree
  If all instances in D have same class c
    Label(T) = c; return T
  If X = ∅ or no attribute has positive information gain
    Label(T) = most common class in D; return T
  X ← attribute with highest information gain
  Label(T) = X
  For each value x of X
    Dx ← instances in D with X = x
    If Dx is empty
      Let Tx be a new tree
      Label(Tx) = most common class in D
    Else
      Tx = ID3(Dx, X \ {X})
    Add a branch from T to Tx labeled by x
  Return T
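A minimal Python sketch of this pseudocode (not the lecture's code; the Node class, the encoding of instances as (attribute-dict, class) pairs, and the explicit domains argument are choices made here for illustration):

import math
from collections import Counter

class Node:
    """Leaf: label is a class and children is empty.
    Internal node: label is an attribute, children maps value -> subtree."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or {}

def entropy(instances):
    # instances is a list of (attribute-dict, class) pairs
    counts = Counter(c for _, c in instances)
    n = len(instances)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def remainder(instances, attr):
    # expected entropy of the data after splitting on attr
    n = len(instances)
    rem = 0.0
    for value in {inst[attr] for inst, _ in instances}:
        subset = [(inst, c) for inst, c in instances if inst[attr] == value]
        rem += len(subset) / n * entropy(subset)
    return rem

def gain(instances, attr):
    # information gain = decrease in disorder after splitting on attr
    return entropy(instances) - remainder(instances, attr)

def most_common_class(instances):
    return Counter(c for _, c in instances).most_common(1)[0][0]

def id3(instances, attributes, domains):
    # attributes: set of attribute names still available
    # domains: dict mapping each attribute to its list of possible values
    classes = {c for _, c in instances}
    if len(classes) == 1:                              # all instances share a class
        return Node(classes.pop())
    gains = {a: gain(instances, a) for a in attributes}
    if not attributes or max(gains.values()) <= 0:     # nothing useful to split on
        return Node(most_common_class(instances))
    best = max(gains, key=gains.get)                   # highest information gain
    tree = Node(best)
    for value in domains[best]:                        # one branch per possible value
        subset = [(inst, c) for inst, c in instances if inst[best] == value]
        if not subset:                                 # empty branch: most common class in D
            tree.children[value] = Node(most_common_class(instances))
        else:
            tree.children[value] = id3(subset, attributes - {best}, domains)
    return tree

def predict(tree, inst):
    while tree.children:                               # descend until a leaf is reached
        tree = tree.children[inst[tree.label]]
    return tree.label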
Splitting criterion
Entropy of data
- measures disorder of data
Remainder
- expected disorder of data after splitting
Information gain
- decrease in disorder after splitting on attribute

Example
X = {X1, X2, X3, X4}
D =
X1  X2  X3  X4  C
F   F   F   F   P
F   F   T   T   P
F   T   F   T   P
T   T   T   F   P
T   F   F   F   N
T   T   T   T   N
T   T   T   F   N
Example
Entropy(D) = Entropy(4:3)
           = -(4/7) log2(4/7) - (3/7) log2(3/7)
           = 0.98
Consider X1
Entropy(D) = 0.98
X1 = F: 3:0    X1 = T: 1:3
Entropy(3:0) = 0.0
Entropy(1:3) = 0.81
Remainder(X1, D) = (3/7) * 0.0 + (4/7) * 0.81 = 0.46
Gain(X1) = 0.98 - 0.46 = 0.52

Consider X2
Entropy(D) = 0.98
X2 = F: 2:1    X2 = T: 2:2
Remainder(X2, D) = (3/7) * 0.92 + (4/7) * 1.0 = 0.97
Gain(X2) = 0.98 - 0.97 = 0.01

First split
Gain(X1) = 0.52
Gain(X2) = 0.01
Gain(X3) = 0.01
Gain(X4) = 0.01
Split on X1.
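As a quick numerical check of these slides, the entropy/gain helpers from the sketch above reproduce the values (the dataset encoding is the one assumed there):

# The seven-instance dataset D from the example.
D = [({'X1': 'F', 'X2': 'F', 'X3': 'F', 'X4': 'F'}, 'P'),
     ({'X1': 'F', 'X2': 'F', 'X3': 'T', 'X4': 'T'}, 'P'),
     ({'X1': 'F', 'X2': 'T', 'X3': 'F', 'X4': 'T'}, 'P'),
     ({'X1': 'T', 'X2': 'T', 'X3': 'T', 'X4': 'F'}, 'P'),
     ({'X1': 'T', 'X2': 'F', 'X3': 'F', 'X4': 'F'}, 'N'),
     ({'X1': 'T', 'X2': 'T', 'X3': 'T', 'X4': 'T'}, 'N'),
     ({'X1': 'T', 'X2': 'T', 'X3': 'T', 'X4': 'F'}, 'N')]

print(f"Entropy(D) = {entropy(D):.3f}")       # 0.985 (0.98 on the slide)
for a in ['X1', 'X2', 'X3', 'X4']:
    print(f"Gain({a}) = {gain(D, a):.3f}")    # X1 ~0.52; X2, X3, X4 ~0.02
                                              # (the slide's 0.01 comes from rounded intermediates)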
Left side (X1 = F)
D =
X1  X2  X3  X4  C
F   F   F   F   P
F   F   T   T   P
F   T   F   T   P
X = {X2, X3, X4}
All instances have class P, so this branch becomes a leaf labeled P.

Right side (X1 = T)
D =
X1  X2  X3  X4  C
T   T   T   F   P
T   F   F   F   N
T   T   T   T   N
T   T   T   F   N
X = {X2, X3, X4}
All attributes have same information gain.
Break ties arbitrarily.
Choose X2.

So far...
[Figure: partial tree — root X1; the X1 = F branch is a leaf P, the X1 = T branch is the node X2]

Left side (X2 = F)
D =
X1  X2  X3  X4  C
T   F   F   F   N
X = {X3, X4}
Single instance with class N: leaf N.

Right side (X2 = T)
D =
X1  X2  X3  X4  C
T   T   T   F   P
T   T   T   T   N
T   T   T   F   N
X = {X3, X4}
X3: F gives 0:0, T gives 1:2
X4: F gives 1:1, T gives 0:1
X3 has zero information gain.
X4 has positive information gain.
Choose X4.
So far...
[Figure: partial tree — root X1; X1 = F: leaf P; X1 = T: node X2; X2 = F: leaf N; X2 = T: node X4]

Left side (X4 = F)
D =
X1  X2  X3  X4  C
T   T   T   F   P
T   T   T   F   N
X = {X3}

Right side (X4 = T)
D =
X1  X2  X3  X4  C
T   T   T   T   N
X = {X3}

Final tree
[Figure: root X1; X1 = F: leaf P; X1 = T: X2; X2 = F: leaf N; X2 = T: X4; X4 = F: leaf N/P (1:1 tie, no attribute with positive gain); X4 = T: leaf N]
Entropy revisited
Entropy as bits
- the entropy is the number of bits needed, on average, to encode the class of an instance drawn from the data
[Figure: entropy as a function of pT, the proportion of positive instances; 0 at pT = 0 and pT = 1, peaking at 1.0 when pT = 0.5]

Example
[Figure: candidate splits on X2, X3, and X4 with the class counts in each branch and their remainders (0.75, 0.5, 0.74)]
An example
Correct tree
X1  X2  X3  C
F   F   F   N
F   T   T   P
T   F   T   P
T   T   T   N
C = N iff X1 = X2
[Figure: correct tree — root X1, with an X2 test on each branch; leaves N, P, P, N]

Example
Split on X1 or X2?
Gain(X1)? - 0.0
Gain(X2)? - 0.0
Gain(X3)? - 0.31
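The same helpers, applied to this four-instance dataset, show the problem concretely: greedy information gain points at X3 even though the correct tree only uses X1 and X2 (again a sketch reusing the functions above):

xor_data = [({'X1': 'F', 'X2': 'F', 'X3': 'F'}, 'N'),
            ({'X1': 'F', 'X2': 'T', 'X3': 'T'}, 'P'),
            ({'X1': 'T', 'X2': 'F', 'X3': 'T'}, 'P'),
            ({'X1': 'T', 'X2': 'T', 'X3': 'T'}, 'N')]

for a in ['X1', 'X2', 'X3']:
    print(f"Gain({a}) = {gain(xor_data, a):.2f}")   # X1: 0.00, X2: 0.00, X3: 0.31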
Overview
How can we improve the algorithm?
Multi-class attributes
Continuous-valued attributes
Missing information
Attribute costs
Disjunctions
Multi-class attributes
Continuous-valued attributes
What happens if we split on the Date attribute?

Date      Outlook   Temp  Humidity  Wind    Tennis?
8/2/06    Sunny     Hot   High      Weak    N
4/14/07   Rain      Mild  High      Weak    P
5/20/07   Overcast  Hot   High      Weak    P
9/2/07    Sunny     Mild  High      Weak    N
10/16/07  Sunny     Mild  Normal    Strong  P
11/1/07   Sunny     Low   Low       Weak    P
3/7/08    Rain      Low   High      Weak    N
Split information
SplitInformation(D, X) = - Σx (|Dx| / |D|) log2(|Dx| / |D|)
(sum over the values x of attribute X)

GainRatio(D, X) = Gain(D, X) / SplitInformation(D, X)
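A sketch of these two quantities in the same style (math, Counter, and gain come from the earlier sketch); a many-valued attribute such as Date gets a large SplitInformation, so its GainRatio stays small even though its raw gain is maximal:

def split_information(instances, attr):
    # entropy of the attribute's own value distribution (not of the class)
    n = len(instances)
    counts = Counter(inst[attr] for inst, _ in instances)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

def gain_ratio(instances, attr):
    si = split_information(instances, attr)
    return gain(instances, attr) / si if si > 0 else 0.0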
Missing information
[Table: mammography instances with attributes BI-RAD, Age, Shape, Margin, and Density plus a class label; some instances are missing attribute values]

Missing information
[Table: the same data with each partially-known instance replaced by weighted copies, one per possible value of its missing attribute; a Fraction column records the weights, e.g. 0.75/0.25, 0.66/0.33, 1.0]
BI-RAD  Age  Shape  Margin  Density  Class
4       48   4      5       3        1
4       48   4      5       1        1
5       67   3      5       3        1
5       57   4      4       3        1
5       60   4      5       1        1
5       60   3      5       1        1
4       53   4      4       3        1
4       53   3      4       3        1
4       28   1      1       3        0
4       70   1      2       3        0
4       70   3      2       3        0
2       66   1      1       1        0
2       66   1      1       3        0
5       63   3      1       3        0
5       63   3      2       3        0
4       78   1      1       1        0
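One way to realize the fractional-instance idea behind the Fraction column above (a sketch; the lecture's exact bookkeeping may differ): give each instance a weight, and when its value for the split attribute is unknown, send weighted copies down every branch in proportion to the branch sizes.

def split_with_missing(instances, attr):
    """instances: list of (attribute-dict, class, weight) triples, where a
    missing value is stored as None. Returns {value: weighted instances},
    distributing unknown-valued instances across branches by branch weight."""
    known = [t for t in instances if t[0].get(attr) is not None]
    missing = [t for t in instances if t[0].get(attr) is None]
    total = sum(w for _, _, w in known)          # assumes at least one known value
    branches = {}
    for inst, c, w in known:
        branches.setdefault(inst[attr], []).append((inst, c, w))
    for value, subset in branches.items():
        frac = sum(w for _, _, w in subset) / total
        for inst, c, w in missing:               # fractional copy, e.g. weight 0.75 or 0.25
            subset.append((inst, c, w * frac))
    return branches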
Attribute costs
Split on the attribute maximizing Gain²(D, X) / Cost(X)
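A sketch of this criterion, reusing gain and the dataset D from the earlier snippets; the cost values are hypothetical:

def cost_sensitive_score(instances, attr, costs):
    # favor attributes that are informative and cheap to measure
    return gain(instances, attr) ** 2 / costs[attr]

costs = {'X1': 1.0, 'X2': 5.0, 'X3': 1.0, 'X4': 2.0}      # hypothetical measurement costs
best = max(costs, key=lambda a: cost_sensitive_score(D, a, costs))
print(best)                                               # X1: high gain, low cost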
Disjunctions
How about (X1 ∧ X2) ∨ (X3 ∧ X4)?
[Figure: decision tree for (X1 ∧ X2) ∨ (X3 ∧ X4); the subtree testing X3 and then X4 appears twice]

Decision graphs
Can we avoid this repetition?
Identical subtrees can be joined.
[Figure: decision graph in which the repeated X3/X4 subtree is shared]
Overview

Is this good?
Wrong.

Overfitting
Patterns in data
Beginning
- lots of data
Later
- little data

Overfitting example
Training set:
X1  X2  X3  X4  C
F   F   F   F   P
F   F   T   T   P
F   T   F   T   P
T   F   T   F   P
T   T   T   T   N
T   T   F   F   P
Correct tree:
[Figure: tree over X1 and X2]
Possible learned tree:
[Figure: deeper tree that also tests X3 and X4]
Classify ⟨T, T, T, F⟩?

Causes of overfitting
2. Non-deterministic domain
4. Many features

Extreme example
1000 features
True concept is X1 ∨ X2
8 training examples
For a given attribute Xi (i > 2), what is the likelihood that Xi looks predictive on the 8 training examples just by chance?

Bias tradeoff
Can't avoid
- non-deterministic domain
- many features
Restriction bias
- may overfit
Preference bias
- prefer simpler hypotheses even with higher training error
- e.g. prefer shorter trees even if they don't fit the data as well
Overview

Pruning
Remove a subtree
Replace with a leaf
Class of leaf is most common class of data in subtree

Pruning example
[Figure: a tree over X1, X2, X3, and X4 with one subtree replaced by a leaf]

When to prune
Pre-pruning
Post-pruning
- grow complete tree and prune it afterward
- takes into account more information

Pruning methods
1. Validation set
2. Rule post-pruning
3. Chi-squared pruning
4. Minimum description length

[Figure: the data divided into a training set, a validation set, and a test set]
If s ≥ t, prune subtree
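A sketch of pruning with a validation set, under one reading of the "if s ≥ t, prune subtree" rule (s = validation accuracy with the subtree replaced by a leaf, t = accuracy with the subtree kept); it assumes the Node, predict, and most_common_class definitions from the ID3 sketch above:

def accuracy(tree, val_set):
    return sum(predict(tree, inst) == c for inst, c in val_set) / len(val_set)

def prune(tree, data_at_node, val_set, root=None):
    """Bottom-up: try replacing each subtree by a leaf labeled with the most
    common class of the training data reaching it; keep the change whenever
    validation accuracy does not drop."""
    root = root if root is not None else tree
    if not tree.children:
        return
    for value, child in tree.children.items():
        subset = [(inst, c) for inst, c in data_at_node if inst[tree.label] == value]
        if subset:
            prune(child, subset, val_set, root)
    t = accuracy(root, val_set)                    # score with the subtree kept
    saved = (tree.label, tree.children)
    tree.label, tree.children = most_common_class(data_at_node), {}
    s = accuracy(root, val_set)                    # score with the subtree pruned
    if s < t:                                      # keep the prune only when s >= t
        tree.label, tree.children = saved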
Rule post-pruning
[Figure: tennis decision tree — Outlook = Sunny tests Humidity, Outlook = Overcast is a P leaf, Outlook = Rain tests Wind]

All rules
(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Low) → P
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Weak) → P
(Outlook = Rain ∧ Wind = Strong) → N

Prune preconditions
Examples with Outlook = Rain:
Outlook  Temp  Humidity  Wind    Tennis?
Rain     Low   High      Weak    N
Rain     High  High      Strong  N

(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Low) → P
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Weak) → P
(Outlook = Rain) → N

Improved readability
[Figure: the pruned rule set applied to example instances, e.g. (Sunny, Low, Low, Weak) → P and (Rain, High, High, Weak) → N]
Chi-squared
Pre-pruning method
- only need distribution of classes at node

Proposed split
Given instances D
Propose split on X

Notation
- Nx = number of instances in Dx
- Nxc = number of instances in Dx with class c

Absence of pattern
- if X were irrelevant, we would expect N̂xc = Nx · Nc / N instances of class c in Dx
  (Nc = number of instances in D with class c, N = |D|)

Deviation
Deviation = Σx Σc (Nxc − N̂xc)² / N̂xc

Using deviation
- split only if the deviation exceeds a chi-squared cutoff determined by q

Chi-squared
- can use all data for training
- does not consider all information in subtree
- statistical test becomes less valid with less data
- requires specifying cutoff q
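A sketch of the deviation statistic (same instance encoding as before); comparing it against a chi-squared critical value with (number of values − 1) × (number of classes − 1) degrees of freedom, at the chosen cutoff q, decides whether to split:

def chi_squared_deviation(instances, attr):
    """Deviation of the observed branch/class counts from what the
    'absence of pattern' (independence) would predict."""
    n = len(instances)
    class_counts = Counter(c for _, c in instances)                # Nc
    branch_counts = Counter(inst[attr] for inst, _ in instances)   # Nx
    joint = Counter((inst[attr], c) for inst, c in instances)      # Nxc
    dev = 0.0
    for x in branch_counts:
        for c in class_counts:
            expected = branch_counts[x] * class_counts[c] / n      # expected Nxc
            dev += (joint[(x, c)] - expected) ** 2 / expected
    return dev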
Next time
Experimental methodology
Computational learning theory
Description length