
DECISION TREE

“Machine Learning”
by Anuradha Srinivasaraghavan & Vincy Joseph
Copyright © 2019 Wiley India Pvt. Ltd. All rights reserved.
 Introduction to Classification and Decision Tree
 Problem Solving Using Decision Trees
 Basic Decision Tree Learning Algorithm
 Iterative Dichotomiser 3 (ID3)
 Popularity of Decision Tree Classifiers
 Steps to Construct a Decision Tree
 Issues in Decision Trees
 Rule-Based Classification

 A tree is built in which the leaf nodes contain the output categories.
 The output class is predicted based on the rules generated from the tree structure.
 Learned trees can also be represented as a set of IF–THEN rules.

 Decision trees can be used to solve problems that have the following features (a small illustrative dataset follows this list):
◦ Instances or tuples are represented as attribute–value pairs, where each attribute takes a small number of disjoint possible values.
◦ The target function has discrete output values, such as yes or no.
◦ Decision trees can represent disjunctive descriptions, which implies that the output of a decision tree can be represented using a rule-based classifier.
◦ Decision trees can be used when the training data contains errors or missing attribute values.
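A minimal sketch of such a representation, using an illustrative play-tennis-style dataset (the attribute names, values, and target are assumptions for illustration, not taken from the book): each instance is a dictionary of attribute–value pairs with a discrete target.

# A tiny illustrative dataset: each instance is a dict of attribute-value
# pairs, and the target ("Play") takes only discrete values (yes/no).
dataset = [
    {"Outlook": "sunny",    "Humidity": "high",   "Wind": "weak",   "Play": "no"},
    {"Outlook": "sunny",    "Humidity": "normal", "Wind": "strong", "Play": "yes"},
    {"Outlook": "rainy",    "Humidity": "high",   "Wind": "strong", "Play": "no"},
    {"Outlook": "overcast", "Humidity": "normal", "Wind": "weak",   "Play": "yes"},
]

# All attributes take a small number of disjoint values, which is the
# setting described above.
attributes = ["Outlook", "Humidity", "Wind"]
target = "Play"

The later sketches in this section reuse this dataset shape (a list of dicts with a discrete target column).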
 Two basic algorithms are the Iterative Dichotomiser 3 (ID3) algorithm and the C4.5 algorithm.
 Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
 Information gain is the main selection measure used.
 The attribute with the best score for the measure is chosen as the splitting attribute for the given tuples.
 Information gain
 Gain ratio
 Gini index

 Entropy is the measure that controls how the data are split.
 It quantifies the homogeneity of the examples.
 Entropy ranges between 0 and 1:
 0 if all members of S belong to the same class.
 1 if there is an equal number of positive and negative examples.

Entropy(S) = −p₊ log₂ p₊ − p₋ log₂ p₋

where p stands for the probability (proportion) of the various classes of instances under consideration.
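As a minimal sketch (not from the book), the entropy of a set of examples can be computed from its class labels:

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

# Homogeneous subset -> entropy 0; balanced subset -> entropy 1.
print(entropy(["yes", "yes", "yes"]))        # 0.0
print(entropy(["yes", "yes", "no", "no"]))   # 1.0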

 Information gain measures the effectiveness of an attribute in classifying the training data (see the sketch below).
 Gain ratio overcomes the bias towards many-valued attributes that is present in information gain.
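A hedged sketch of information gain for a discrete attribute over the list-of-dicts dataset shape used earlier, following Gain(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) · Entropy(S_v); the function names are illustrative.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    base = entropy([row[target] for row in rows])
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder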

 In decision tree learning, the most popular algorithm is the Iterative Dichotomiser 3 (ID3) algorithm.
 Stopping condition of ID3:
 When every element in the subset belongs to the same class, the node is turned into a leaf node and labelled with the name of that class.

 Every element in the subset belongs to the same class, in which case the node is turned into a leaf node.
 There are no more attributes to be selected, but the examples still do not all belong to the same class.
 There are no more examples in the subset, which happens when no example in the parent set matches the attribute value of that branch.

1. It maintains only a single current hypothesis as
it searches through the space of decision trees.
2. It does not have the ability to determine
alternative decision trees.
3. It does not perform backtracking in search.
Hence, there are chances of getting stuck in local
optima.
4. It is less sensitive to errors because information gain, which is a statistical property, is used.

 No domain knowledge is required, so decision trees can be used in exploratory knowledge discovery.
 Classification is based on probability alone.
 They can handle multidimensional data.

 Compute the entropy for the given dataset.
 For every attribute/feature:
◦ Calculate the entropy for each of its categorical values.
◦ Take the weighted average information entropy of the current attribute.
◦ Calculate the gain for the current attribute.
 Pick the attribute with the highest gain as the split.
 Repeat until the desired tree is complete. (A recursive sketch of these steps follows this list.)
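A minimal, hedged ID3 sketch that follows these steps on the list-of-dicts dataset shape used earlier; the entropy and gain helpers are repeated so the block is self-contained, and all names are illustrative, not the book's code.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    base = entropy([r[target] for r in rows])
    total = len(rows)
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Stop: all examples share one class -> leaf labelled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> leaf labelled with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest information gain and split on it.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, remaining, target)
    return tree

On the tiny play-tennis-style dataset sketched earlier, id3(dataset, attributes, "Play") returns a nested dictionary whose internal keys are splitting attributes and whose leaves are class labels.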

 Handling continuous attributes.
 Choosing an appropriate attribute selection

measure.
 Handling training data with missing attribute

values.
 Handling attributes with differing costs.
 Improving computational efficiency.

 Underfitting occurs when a machine learning algorithm cannot capture the underlying trend of the data.
 Overfitting occurs when a machine learning algorithm captures the noise of the data.
 Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H such that h has smaller error than h′ over the training examples, but h′ has a smaller error than h over the entire distribution of instances.
 Approaches that stop the growth of the tree before it reaches the point where it perfectly classifies the training data.
 Approaches that allow the tree to overfit the data, and then post-prune it.

 Use separate datasets for training and for evaluation. This is the training and validation set approach (a sketch follows this list).
 Use the entire dataset for training, but apply a statistical test (e.g., the chi-square test) to estimate whether expanding a node is likely to improve performance beyond the training data.
 Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This is done using the minimum description length principle.
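A hedged illustration of the training and validation set approach, assuming scikit-learn and its bundled iris dataset (the library, dataset, and the use of max_depth as the size control are assumptions, not from the book); the tree size is chosen by the validation score rather than the training score.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out a validation set that plays no part in growing the tree.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

best_depth, best_score = None, -1.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # accuracy on data not used for training
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, best_score)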
 Continuous-valued attributes: use a threshold-based Boolean attribute approach, i.e., split on a test such as A < c (see the sketch after this list).
 Missing attribute values: assign the value that is most common among the training examples at node n.
 Attributes with differing costs: low-cost attributes are preferred over high-cost attributes.
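A minimal sketch of the threshold-based Boolean attribute approach for a continuous attribute: candidate thresholds are taken as midpoints between adjacent sorted values and scored by information gain. The helper names and example values are illustrative.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Turn a continuous attribute into the Boolean test value < c by trying
    midpoints between adjacent sorted values and keeping the best gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        gain = (base
                - (len(left) / len(pairs)) * entropy(left)
                - (len(right) / len(pairs)) * entropy(right))
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

print(best_threshold([64, 65, 68, 70, 71, 72],
                     ["yes", "no", "yes", "yes", "no", "no"]))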

 Using IF–THEN rules for classification:
 Coverage and accuracy of a rule R over a dataset D are given by
coverage(R) = n_covers / |D|
accuracy(R) = n_correct / n_covers
where n_covers is the number of tuples covered by R, n_correct is the number of those tuples that R classifies correctly, and |D| is the number of tuples in D.
 Properties of the rules generated by a rule-based classifier:
◦ Mutually exclusive rules
◦ Exhaustive rules
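A small sketch computing coverage and accuracy for a single IF–THEN rule on a toy dataset (the rule, attributes, and values are illustrative):

# A rule is an antecedent (dict of attribute tests) plus a predicted class.
rule = {"if": {"Outlook": "sunny", "Humidity": "normal"}, "then": "yes"}

data = [
    {"Outlook": "sunny", "Humidity": "normal", "Play": "yes"},
    {"Outlook": "sunny", "Humidity": "high",   "Play": "no"},
    {"Outlook": "rainy", "Humidity": "normal", "Play": "yes"},
]

covered = [row for row in data
           if all(row[a] == v for a, v in rule["if"].items())]
correct = [row for row in covered if row["Play"] == rule["then"]]

coverage = len(covered) / len(data)                          # n_covers / |D|
accuracy = len(correct) / len(covered) if covered else 0.0   # n_correct / n_covers
print(coverage, accuracy)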

 The rule ordering scheme prioritizes the rules.
 Ordering based on class: the classes are sorted in order of decreasing "importance".
 With rule ordering, the triggering rule that appears first in the list has the highest priority, and it fires to return the class prediction (a first-match sketch follows).
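A minimal sketch of first-match evaluation over an ordered rule list, with a default class used when no rule triggers (the rules and default class are illustrative):

# Rules are checked in priority order; the first one whose antecedent
# matches the instance fires and returns its class prediction.
rules = [
    ({"Outlook": "overcast"}, "yes"),
    ({"Outlook": "sunny", "Humidity": "high"}, "no"),
    ({"Outlook": "rainy", "Wind": "strong"}, "no"),
]
default_class = "yes"

def classify(instance, rules, default):
    for antecedent, prediction in rules:
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return prediction     # first triggering rule fires
    return default                # no rule covers the instance

print(classify({"Outlook": "sunny", "Humidity": "high", "Wind": "weak"},
               rules, default_class))   # -> "no"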

 To extract rules from a decision tree, one rule is created for each path from the root node to a leaf node (see the sketch below).
 The splitting criteria along each path are logically ANDed to form the rule antecedent; the leaf node holds the class prediction, which becomes the rule consequent.
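A hedged sketch that walks a nested-dictionary tree (the shape produced by the earlier ID3 sketch) and emits one IF–THEN rule per root-to-leaf path, ANDing the tests collected along the path:

def extract_rules(tree, path=()):
    """Yield (antecedent, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):          # a leaf: the class prediction
        yield path, tree
        return
    (attribute, branches), = tree.items()   # one splitting attribute per node
    for value, subtree in branches.items():
        # AND the current test onto the path and descend.
        yield from extract_rules(subtree, path + ((attribute, value),))

# Example on a hand-written tree in the same nested-dict format.
tree = {"Outlook": {"overcast": "yes",
                    "sunny": {"Humidity": {"high": "no", "normal": "yes"}},
                    "rainy": "no"}}
for antecedent, prediction in extract_rules(tree):
    conditions = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {conditions} THEN Play = {prediction}")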

 For a given rule antecedent, any condition that does not improve the estimated accuracy of the rule can be pruned.
 A sequential covering algorithm can be used for this purpose.
