1. Consider the training examples shown in Table 1, for a multiway classification
problem.
Table 1
NOTE: "You can consult all the classification formulas in the formula package."
b) Compute the Gini index for the Customer ID attribute.
Each Customer ID value identifies exactly one record, so every partition is pure
and the weighted Gini index of Customer ID is 0.
Car Type
        Luxury   Family   Sports
c0:        1        8        1
c1:        3        0        7
Shirt Size
The branches are named Small, Medium, Large, and Extra Large, respectively,
from left to right.
f) Which attribute is the best?
The best attribute is the one with the lowest impurity; by that measure alone it is Customer ID.
g) Why should Customer ID not be used as the attribute test condition?
Because each Customer ID value is unique, the attribute has no predictive power on
new records: it overfits the training data excessively and produces the highest
generalization error in the final model.
2. Consider the training examples shown in Table 2, for a binary classification
problem.
Table 2
Instance   a1   a2   a3   Target Class
    1       T    T    1        +
    2       T    T    6        +
    3       T    F    5        -
    4       F    F    4        +
    5       F    T    7        -
    6       F    T    3        -
    7       F    F    8        -
    8       T    F    7        +
    9       F    T    5        -
(a) What is the entropy of this collection of training examples with respect to the
positive class?
We have 4 positive and 5 negative examples, so p(+) = 4/9 and p(-) = 5/9.
Then:
Entropy = -(4/9) log2(4/9) - (5/9) log2(5/9) ≈ 0.9911
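The entropy value can be verified numerically with a small helper (a sketch; the function name is ours):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# 4 positive and 5 negative training examples
print(round(entropy([4, 5]), 4))  # 0.9911
```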
(b) What is the information gain of a1 and a2 relative to these training examples?
For the continuous attribute a3, use the brute-force method: compute the entropy at
every candidate split position to find the value v that splits the data with the
lowest entropy.
a3 values:       1  6  5  4  7  3  8  7  5
Sorted values:   1  3  4  5  5  6  7  7  8

Split position    0.5      2.0      3.5      4.5      5.5      6.5      7.5      8.5
                 <=  >    <=  >    <=  >    <=  >    <=  >    <=  >    <=  >    <=  >
"+"               0  4     1  3     1  3     2  2     2  2     3  1     4  0     4  0
"-"               0  5     0  5     1  4     1  4     3  2     3  2     4  1     5  0
Entropy          0.9911   0.8484   0.9885   0.9183   0.9839   0.9728   0.8889   0.9911
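The brute-force scan over a3 can be reproduced with a short script (a sketch; the variable and function names are ours):

```python
import math

# a3 values paired with their class labels (Table 2)
data = [(1, '+'), (6, '+'), (5, '-'), (4, '+'), (7, '-'),
        (3, '-'), (8, '-'), (7, '+'), (5, '-')]

def entropy(pos, neg):
    """Entropy (bits) of a node with pos '+' and neg '-' examples."""
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / (pos + neg)
            h -= p * math.log2(p)
    return h

def split_entropy(v):
    """Weighted entropy of the two-way split a3 <= v vs a3 > v."""
    left = [c for a, c in data if a <= v]
    right = [c for a, c in data if a > v]
    n = len(data)
    return (len(left) / n * entropy(left.count('+'), left.count('-'))
            + len(right) / n * entropy(right.count('+'), right.count('-')))

# Candidate positions: midpoints between distinct sorted values, plus the ends
values = sorted({a for a, _ in data})
splits = ([values[0] - 0.5]
          + [(x + y) / 2 for x, y in zip(values, values[1:])]
          + [values[-1] + 0.5])
best = min(splits, key=split_entropy)
print(best, round(split_entropy(best), 4))  # 2.0 0.8484
```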
The lowest entropy, 0.8484, occurs at the split position v = 2.0:

a3 <= 2.0: [1+, 0-]
a3 >  2.0: [3+, 5-]
Information Gain
a1: 0.9911 - 0.7616 = 0.2294
a2: 0.9911 - 0.9839 = 0.0072
a3: 0.9911 - 0.8484 = 0.1427
The highest information gain is 0.2294, for attribute a1, so a1 is the best
split at the root of the tree.
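The gains for a1 and a2 follow directly from the branch counts in Table 2; a quick sketch (the names `parent`, `gain`, etc. are ours):

```python
import math

def entropy(counts):
    """Entropy (bits) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

parent = [4, 5]                    # 4 "+", 5 "-" at the root
a1 = {'T': [3, 1], 'F': [1, 4]}    # (+, -) counts per branch
a2 = {'T': [2, 3], 'F': [2, 2]}

def gain(branches):
    """Information gain of a split given its per-branch class counts."""
    n = sum(sum(b) for b in branches.values())
    children = sum(sum(b) / n * entropy(b) for b in branches.values())
    return entropy(parent) - children

print(round(gain(a1), 4))  # 0.2294
print(round(gain(a2), 4))  # 0.0072
```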
(e) What is the best split (between a1 and a2) according to the classification error
rate?
Error(a1) = (1 + 1)/9 = 2/9 ≈ 0.222 and Error(a2) = (2 + 2)/9 = 4/9 ≈ 0.444;
a1 has the lowest error rate, so it is the best split.
(f) What is the best split (between a1 and a2) according to the Gini index?
Gini(a1) = (4/9)(0.3750) + (5/9)(0.3200) ≈ 0.3444 and
Gini(a2) = (5/9)(0.4800) + (4/9)(0.5000) ≈ 0.4889;
a1 has the lowest impurity, so it is the best split.
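Both measures for (e) and (f) can be computed from the same branch counts; a minimal sketch (function names are ours):

```python
def error_rate(branches):
    """Classification error of a split: minority count per branch over N."""
    n = sum(sum(b) for b in branches)
    return sum(min(b) for b in branches) / n

def gini_index(branches):
    """Weighted Gini impurity of a split given per-branch class counts."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * (1 - sum((c / sum(b)) ** 2 for c in b))
               for b in branches)

a1 = [(3, 1), (1, 4)]  # (+, -) counts for a1 = T and a1 = F
a2 = [(2, 3), (2, 2)]

print(round(error_rate(a1), 4), round(error_rate(a2), 4))  # 0.2222 0.4444
print(round(gini_index(a1), 4), round(gini_index(a2), 4))  # 0.3444 0.4889
```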
3. Consider the decision tree shown in the figure.
Figure
a) Compute the generalization error rate of the tree using the optimistic
approach.
Instance   Classification
    1          correct
    2          correct
    3          incorrect
    4          correct
    5          correct
    6          correct
    7          correct
    8          incorrect
    9          correct
   10          correct
Two of the ten instances (3 and 8) are misclassified, so the optimistic
(resubstitution) estimate is e = 2/10 = 0.2.
b) Compute the generalization error rate of the tree using the pessimistic
approach (penalty factor Ω = 0.5 for all leaves).
The pessimistic estimate is e_g(T) = (e(T) + Ω × k)/N, where k is the number of
leaf nodes of the tree in the figure.
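Since the figure is not reproduced here, the leaf count below is a hypothetical placeholder (k = 4, our assumption, not from the exercise); only the formula itself is the standard pessimistic estimate:

```python
def pessimistic_error(train_errors, num_leaves, n, factor=0.5):
    """Pessimistic generalization error: (e(T) + factor * k) / N."""
    return (train_errors + factor * num_leaves) / n

# Hypothetical example: 2 training errors, 4 leaves (assumed), 10 instances
print(pessimistic_error(2, 4, 10))  # 0.4
```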