
-----------------------------------------------------------------------------------------------------------------------------------

My Own Studies – Autodidact (*)


Data Mining: Business Intelligence and Practical Machine Learning Tools and Techniques
-----------------------------------------------------------------------------------------------------------------------------------
- Homework 1 -- Group of exercises: Classification “Decision Tree”
- Miguel Caballero Sierra
------------------------------------------------------------------------------------------------------------------------

1. Consider the training examples shown in Table 1 for a multiway classification
problem.

Table 1

a) Compute the Gini index for the overall collection of training examples.

Class Count P(i | t)


C1 10 0,5
C2 10 0,5

NOTE: "You can consult all the classification formulas in the formula package."
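As a sketch, using the class probabilities in the table above:

Gini = 1 - (0,5² + 0,5²) = 0,5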
b) Compute the Gini index for the Customer ID attribute.

.
.
.
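Assuming, as in the textbook version of this exercise, that each Customer ID value occurs in exactly one training record, every child node of this split is pure, so the Gini index of each child is 0 and the weighted Gini index for Customer ID is 0.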

c) Compute the Gini index for the Gender attribute.


d) Compute the Gini index for the Car Type attribute.

Car Type
         Family   Sports   Luxury
   C0:      1        8        1
   C1:      3        0        7
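As a sketch, using the counts above:

Gini(Family) = 1 - ((1/4)² + (3/4)²) = 0,375
Gini(Sports) = 1 - ((8/8)² + (0/8)²) = 0
Gini(Luxury) = 1 - ((1/8)² + (7/8)²) ≈ 0,2188

Weighted Gini(Car Type) = (4/20)(0,375) + (8/20)(0) + (8/20)(0,2188) ≈ 0,1625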

e) Compute the Gini index for the Shirt Size attribute.

Shirt Size
         Small   Medium   Large   Extra-large
   C0:     3        3       2         2
   C1:     2        4       2         2

The branch names are Small, Medium, Large, and Extra-large, respectively, from
left to right.
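As a sketch, using the counts above:

Gini(Small) = 1 - ((3/5)² + (2/5)²) = 0,48
Gini(Medium) = 1 - ((3/7)² + (4/7)²) ≈ 0,4898
Gini(Large) = Gini(Extra-large) = 1 - ((2/4)² + (2/4)²) = 0,5

Weighted Gini(Shirt Size) = (5/20)(0,48) + (7/20)(0,4898) + (4/20)(0,5) + (4/20)(0,5) ≈ 0,4914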
f) Which attribute is the best?
The best attribute is the one with the lowest impurity; by that measure it is Customer ID.

g) Why should Customer ID not be used as the attribute test condition?
Because, although its training impurity is lowest, splitting on Customer ID does not
generalize to new records: the resulting model overfits excessively and has a high
generalization error.
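To cross-check the hand calculations in parts (d) and (e), here is a minimal Python sketch, assuming nothing beyond the class counts given above (the function names gini and gini_split are illustrative):

def gini(counts):
    """Gini impurity of one node, given its class counts [n_C0, n_C1, ...]."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    """Weighted Gini index of a split; branches is a list of per-branch counts."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(b) for b in branches)

car_type   = [[1, 3], [8, 0], [1, 7]]          # Family, Sports, Luxury (C0, C1)
shirt_size = [[3, 2], [3, 4], [2, 2], [2, 2]]  # Small, Medium, Large, Extra-large

print(round(gini_split(car_type), 4))    # 0.1625
print(round(gini_split(shirt_size), 4))  # 0.4914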

2. Consider the training examples shown in Table 2 for a binary classification
problem.

Table 2
Instance a1 a2 a3 Target Class
1 T T 1 "+"
2 T T 6 "+"
3 T F 5 "-"
4 F F 4 "+"
5 F T 7 "-"
6 F T 3 "-"
7 F F 8 "-"
8 T F 7 "+"
9 F T 5 "-"

(a) What is the entropy of this collection of training examples with respect to the
positive class?

[4, 5]  (4 positive and 5 negative examples)
We have:

P(i|t) = fraction of the instances at node t that belong to class i

C = number of classes at node t

Then:
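As a sketch, with 4 positive and 5 negative examples:

Entropy = -(4/9)log2(4/9) - (5/9)log2(5/9) ≈ 0,9911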

NOTE: "You can consult all the classification formulas in the formula package."

(b) What is the information gain of a1 and a2 relative to these training examples?

First, the entropy and information gain for attribute a1:
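As a sketch, from Table 2: a1 = T covers 3 positive and 1 negative examples, and a1 = F covers 1 positive and 4 negative examples, so

Entropy(a1 = T) = -(3/4)log2(3/4) - (1/4)log2(1/4) ≈ 0,8113
Entropy(a1 = F) = -(1/5)log2(1/5) - (4/5)log2(4/5) ≈ 0,7219

Gain(a1) = 0,9911 - ((4/9)(0,8113) + (5/9)(0,7219)) ≈ 0,9911 - 0,7616 ≈ 0,229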


Second, the entropy and information gain for attribute a2:
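Similarly, a2 = T covers 2 positive and 3 negative examples, and a2 = F covers 2 positive and 2 negative examples, so

Entropy(a2 = T) = -(2/5)log2(2/5) - (3/5)log2(3/5) ≈ 0,9710
Entropy(a2 = F) = 1

Gain(a2) = 0,9911 - ((5/9)(0,9710) + (4/9)(1)) ≈ 0,9911 - 0,9839 ≈ 0,007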
(c) For a3, which is a continuous attribute, compute the information gain for
every possible split.

Using the brute-force method: first calculate the entropy for every candidate split
value (v) in order to find the value that splits the data with the lowest weighted entropy.

Remember that the brute-force method is (a code sketch of this procedure is given
after the split table below):

- Sort the values of the attribute.
- Consider a split between every two consecutive values (v): the left part of the
  split is <= v, the right part is > v.
- Calculate the impurity measure for every candidate split value; in this case,
  the entropy.
- Select the split with the lowest impurity.

Attribute values    1      6      5      4      7      3      8      7      5
Sorted values       1      3      4      5      5      6      7      7      8

Split position       0,5    1,5    3,5    4,5    5,5    6,5    7,5    8,5
                    <= >   <= >   <= >   <= >   <= >   <= >   <= >   <= >
"+"                 0  4   1  3   1  3   2  2   2  2   3  1   4  0   4  0
"-"                 0  5   0  5   1  4   1  4   3  2   3  2   4  1   5  0
Entropy             0,991  0,848  0,989  0,918  0,984  0,973  0,889  0,991
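As a cross-check of the table above, a minimal Python sketch of the brute-force search, using the a3 values and class labels from Table 2 (the names entropy and entropy_split are illustrative; the candidate splits are the midpoints between consecutive distinct sorted values):

from math import log2

def entropy(counts):
    # Entropy of one node, given its class counts.
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def entropy_split(values, labels, v):
    # Weighted entropy of the binary split "attribute <= v" vs "attribute > v".
    n = len(values)
    left  = [lab for x, lab in zip(values, labels) if x <= v]
    right = [lab for x, lab in zip(values, labels) if x > v]
    def counts(part):
        return [part.count("+"), part.count("-")]
    return (len(left) / n * entropy(counts(left)) +
            len(right) / n * entropy(counts(right)))

a3     = [1, 6, 5, 4, 7, 3, 8, 7, 5]
labels = ["+", "+", "-", "+", "-", "-", "-", "+", "-"]

# Candidate splits halfway between consecutive distinct sorted values.
vals = sorted(set(a3))
splits = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
for v in splits:
    print(v, round(entropy_split(a3, labels, v), 3))

best = min(splits, key=lambda v: entropy_split(a3, labels, v))
print("best split at", best)  # 2.0, i.e. any split between 1 and 3 (entropy ~0.848)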

The lowest weighted entropy is 0,848, obtained at the split position 1,5 (any split
between the values 1 and 3 gives the same partition), so the split looks like:

a3
<= 1,5        > 1,5

"+"           "+"
[1, 0]        "+"
              "+"
              "-"
              "-"
              "-"
              "-"
              "-"
              [3, 5]

Information gain calculation:
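As a sketch, using the overall entropy from part (a) and the lowest weighted entropy from the split table:

Gain(a3) = 0,9911 - 0,8484 ≈ 0,143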


(d) What is the best split (among the a1, a2, and a3 attributes) according to the
information gain?

Information gain
a1 0,229
a2 0,007
a3 0,143

The highest information gain is 0,229, for attribute a1, so a1 gives the best
split at the root of the tree.

(e) What is the best split (between a1 and a2) according to the classification error
rate?

Calculation of the error rate for the a1 attribute:
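As a sketch: the a1 = T branch predicts "+" (3 of 4 correct) and the a1 = F branch predicts "-" (4 of 5 correct), so

Error rate(a1) = (1 + 1) / 9 = 2/9 ≈ 0,222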


Calculation of the error rate for the a2 attribute:
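Similarly, the a2 = T branch predicts "-" (3 of 5 correct) and the a2 = F branch is a 2/2 tie (at best 2 of 4 correct), so

Error rate(a2) = (2 + 2) / 9 = 4/9 ≈ 0,444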

The a1 attribute has the lowest error rate, so it gives the best split.

(f) What is the best split (between a1 and a2) according to the Gini Index?
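As a sketch, using the same class counts as in part (b):

Gini(a1) = (4/9)(1 - (3/4)² - (1/4)²) + (5/9)(1 - (1/5)² - (4/5)²) ≈ 0,344
Gini(a2) = (5/9)(1 - (2/5)² - (3/5)²) + (4/9)(1 - (2/4)² - (2/4)²) ≈ 0,489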
The a1 attribute has the lowest weighted Gini index, so it gives the best split.
3. Consider the decision tree shown in the figure.

Figure

a) Compute the generalization error rate of the tree using the optimistic
approach.

The optimistic approach (resubstitution estimate) assumes that the training error
extrapolates to the generalization error:

Training error = "Generalization error"

The classification of each training instance by the tree is:

Instance Classification
1 v
2 v
3 x
4 v
5 v
6 v
7 v
8 x
9 v
10 v
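Reading "v" as correctly classified and "x" as misclassified, 2 of the 10 training instances are misclassified, so the optimistic estimate of the generalization error is

e(T) = 2 / 10 = 0,2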
b) Compute the generalization error rate of the tree using the pessimistic
approach (factor = 0,5 for all leaves).
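The pessimistic estimate adds a penalty of 0,5 per leaf to the training errors; the final value depends on the number of leaves k of the tree in the figure, which is not reproduced here:

e'(T) = (training errors + 0,5 k) / N = (2 + 0,5 k) / 10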

REFERENCE: [Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining]
