
On the Complexity of Learning Decision Trees

J. Kent Martin and D. S. Hirschberg


(jmartin@ics.uci.edu) (dan@ics.uci.edu)
Department of Information and Computer Science
University of California, Irvine, CA, 92717

Abstract

Various factors affecting decision tree learning time are explored. The factors which consistently affect accuracy are those which directly or indirectly (as in the handling of continuous attributes) allow a greater variety of potential trees to be explored. Other factors, e.g., pruning and choice of heuristics, generally have little effect on accuracy, but significantly affect learning time. We prove that the time complexity of induction and post-processing is exponential in tree height in the worst case and, under fairly general conditions, in the average case. This puts a premium on designs which produce shallower and more balanced trees. Simple pruning is linear in tree height, contrasted to the exponential growth of more complex operations. The key factor influencing whether simple pruning will suffice is that the split selection and pruning heuristics should be the same and unbiased. The information gain and χ² tests are biased towards unbalanced splits, and neither is admissible for pruning. Empirical results show that the hypergeometric function can be used for both split selection and pruning, and that the resulting trees are simpler, more quickly learned, and no less accurate than trees resulting from other heuristics and more complex post-processing.

Figure 1: TDIDT Tree-Building

    BuildTree(A, V, data)
        if (AllSameClass(data)) then
            MakeLeaf(ClassOf(data))
        else
            A <- |A|
            N <- |data|
            if ((N > 0) and (A > 0)) then
                Initialize(b, best)
                for a = 1 ... A
                    sub <- Partition(data, a, V_a)
                    e <- Heuristic(sub)
                    if (e < best) then
                        best <- e
                        b <- a
                if (not Significance(best)) then
                    MakeLeaf(LargestClass(data))
                else
                    sub <- Partition(data, b, V_b)
                    V <- |V_b|
                    for v = 1 ... V
                        if (A - {b} = ∅) then
                            MakeLeaf(LargestClass(sub_v))
                        else
                            BuildTree(A - {b}, V - V_b, sub_v)

Introduction
This paper studies the complexity of Top-Down Induction of Decision Trees, the TDIDT family of algorithms typified by ID3 (Quinlan 1986) and C4.5 (Quinlan 1993). The input for these algorithms is a set of items, each described by a class label and its values for a set of attributes, and a set of candidate partitions of the data. These algorithms make a greedy, heuristic choice of one of the candidates, and then recursively split each subset until a subset consists only of one class or until the candidates are exhausted. Early algorithms included other stopping criteria, stopping when the improvement achieved by the best candidate was judged to be insignificant. Later algorithms dropped these stopping criteria and added post-processing procedures to prune a tree or to replace subtrees.

We analyze TDIDT algorithms, not only in the usual terms of best and worst case data, but also in terms of the choices available in designing an algorithm and, particularly, in those elements of the choices which generalize over many input sets. At the highest level, there are three such choices: (1) how the set of candidates is chosen (and handling continuous variables, look-ahead, etc.), (2) what heuristic function is used, and (3) whether to stop splitting or to post-process.

Analysis of TDIDT Tree-Building

Figure 1 summarizes the tree-building phase, where the set of candidates has been defined off-line and is summarized by the input parameters A (a vector of partition labels) and V (a matrix of values for splitting on each attribute). Some algorithms re-evaluate the V matrix on every call to this procedure. Other major differences between algorithms lie in the Heuristic function, in the way numeric attributes are handled by the Partition function, and in whether stopping based on a Significance function is performed.
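As a concrete reading of Figure 1, the following is a minimal Python sketch of the same recursive skeleton. The data layout (a list of (class, attribute-dict) pairs), the pluggable heuristic (lower is better, e.g. a negated information gain), and the significance test are illustrative assumptions, not the paper's implementation.

    from collections import Counter, defaultdict

    def build_tree(attrs, data, heuristic, significant):
        """Minimal TDIDT sketch in the spirit of Figure 1 (illustrative only).
        attrs: attribute names still available; data: list of (class, {attr: value});
        heuristic: partition (list of subsets) -> score, lower is better;
        significant: best score -> bool (stop test)."""
        classes = [c for c, _ in data]
        majority = Counter(classes).most_common(1)[0][0] if classes else None
        if len(set(classes)) <= 1 or not attrs:
            return ('leaf', majority)            # AllSameClass / no candidates left

        def partition(items, attr):              # Partition(data, a, V_a)
            subs = defaultdict(list)
            for item in items:
                subs[item[1][attr]].append(item)
            return subs

        best_attr, best_score = None, float('inf')
        for a in attrs:                          # greedy choice among candidates
            score = heuristic(list(partition(data, a).values()))
            if score < best_score:
                best_attr, best_score = a, score

        if not significant(best_score):          # optional stopping rule
            return ('leaf', majority)

        rest = [a for a in attrs if a != best_attr]
        children = {v: build_tree(rest, sub, heuristic, significant)
                    for v, sub in partition(data, best_attr).items()}
        return ('node', best_attr, children)

A caller supplies, for instance, a negated information gain as the heuristic and a simple threshold test as the significance function.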
Typically, the run times of the Heuristic and Partition functions are linear (O(N)), compiling a contingency matrix and computing a function such as information gain. Then, the run time T_B of BuildTree is

    T_B(A, V, data) = O(A N) + Σ_v T_B(A−{b}, V−V_b, sub_v)

which leads to T_B = O(A² N) for a complete tree of height A. (Here, b is the 'best' split, V_b its splitting criteria, |V_b| its arity, and sub_v its v-th subset.)
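To make the O(N) cost of the Heuristic concrete, the sketch below compiles the class-by-branch contingency matrix in a single pass over the N items and computes information gain from the resulting counts. The data layout and function name are assumptions for illustration.

    from collections import defaultdict
    from math import log2

    def information_gain(pairs):
        """pairs: list of (class_label, branch_id) for the N items at a node.
        One O(N) pass builds the contingency matrix; the entropy terms are
        then computed from the (small) matrix of counts."""
        counts = defaultdict(int)          # (branch, class) -> count
        branch_n = defaultdict(int)        # branch -> count
        class_n = defaultdict(int)         # class  -> count
        for cls, br in pairs:
            counts[(br, cls)] += 1
            branch_n[br] += 1
            class_n[cls] += 1
        n = len(pairs)

        def entropy(freqs, total):
            return -sum(f / total * log2(f / total) for f in freqs if f)

        parent = entropy(class_n.values(), n)
        children = sum(
            (bn / n) * entropy([counts[(br, c)] for c in class_n], bn)
            for br, bn in branch_n.items())
        return parent - children           # gain; larger means a 'purer' split

    # e.g. information_gain([('yes', 0), ('no', 0), ('yes', 1), ('yes', 1)])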
Some algorithms build only binary trees, and most allow only binary splits on real-valued attributes. In these cases, the effect is the same as that of increasing the number of candidates. If we simply create an equivalent set of V_i − 1 binary attributes for each candidate (where V_i is the i-th candidate's arity), the time is O(d² N), where d is the dimensionality of the data, d = Σ_i (V_i − 1). For real-valued attributes, V_i is O(N) for each attribute, and these methods could cause the behavior to be O(A² N³).
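The binarization just described can be sketched as follows: one V-ary nominal attribute is expanded into V−1 indicator attributes, with the remaining value encoded implicitly. The names are hypothetical and this is only one of several encodings used in practice.

    def binarize(items, attr, attr_values):
        """Replace nominal attribute `attr` (arity V = len(attr_values)) by
        V-1 binary indicator attributes; the last value is encoded implicitly
        as 'all indicators false'. items: list of dicts of attribute values."""
        kept = attr_values[:-1]                       # V-1 explicit indicators
        new_items = []
        for item in items:
            enc = {k: v for k, v in item.items() if k != attr}
            for val in kept:
                enc[f"{attr}={val}"] = int(item[attr] == val)
            new_items.append(enc)
        new_attrs = [f"{attr}={val}" for val in kept]
        return new_items, new_attrs

    # e.g. binarize([{'color': 'red'}, {'color': 'blue'}], 'color',
    #               ['red', 'green', 'blue'])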
Analysis of real-world data sets is more complicated because the splitting process for any path may terminate before all attributes have been utilized, either because all items reaching the node have the same class or because of some stopping criterion. Thus, the tree height may be less than the number of candidates A, and the leaves may lie at different depths. Then, the time complexity is related to the average height (h) of the tree (weighted by the number of items reaching each leaf).
Looking at Figure 1 in detail, and assuming that all candidates have the same arity V, the time complexity can be modeled as

    T_B(A, V, N) = K_0 + K_1 N + K_3 V + Σ_v T_B(A−1, V, m_v)
                   + A [K_2 + (ε_0 + ε_1 N + ε_2 C + ε_3 V + ε_4 C V)]

where C is the number of classes, K_0, K_2, and K_3 are small overhead constants, K_1 is the incremental cost of the Partition function, and the ε_i are the coefficients of the Heuristic function (ε_0 is a small overhead term, and ε_1 N typically dominates the other terms). Assuming that all leaves lie at the same depth (A−1), this leads to

    T_B ≈ (K_0 + K_3 V) f_1(A, V)
          + (K_2 + ε_0 + ε_3 V + ε_4 C V) f_2(A, V)
          + (ε_1 N + ε_2 C) A(A+1)/2 + K_1 A N

where f_1(A, V) = (V^A − 1)/(V − 1) and f_2(A, V) = (f_1(A+1, V) − (A+1))/(V − 1).
Empirical data from 16 different populations representing a wide range of sample sizes, number of attributes, arity of attributes, and a mixture of discrete and continuous attributes are well-fit by the following pro-rated model:

    T_B ≈ [ (ε_1 N + ε_2 C) A(A+1)/2 + K_1 A N + K_0 (V̄^h − 1)/(V̄ − 1) ] h/(A−1)
          + (ε_1 N + ε_2 C) A + K_0 + K_1 N

where V̄ = d/A is the average branching factor.

For trees of modest height and relatively large samples, the cumulative overhead (K_0) term is insignificant: typically h ≤ 0.3A and N ≥ 25A, and then K_0 (V̄^h − 1)/(V̄ − 1) ≪ ε_1 N A², so that T_B = O(A² N) in most cases. Applications do exist, however, where the exponential growth of this term cannot be ignored. For our longest run time (about 1.5 hrs), A = 100 and a binary tree with average depth h = 30 was built; the factor of 2^30 in the K_0 term is significant here. In the worst case h = A−1 and T_B = O(V^A).
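A short numerical sketch of the geometric-series term K_0 (V̄^h − 1)/(V̄ − 1) makes the exponential overhead concrete. The constants below are arbitrary placeholders, not the fitted values behind the empirical model.

    def geometric_nodes(vbar, h):
        """(Vbar^h - 1) / (Vbar - 1): node count of a complete Vbar-ary tree
        of height h, the factor multiplying K_0 in the cost model."""
        return (vbar**h - 1) / (vbar - 1)

    # For a binary tree (Vbar = 2) the factor roughly doubles with each
    # level, reaching about 2^30 ~ 1e9 at average depth 30:
    for h in (10, 20, 30):
        print(h, geometric_nodes(2, h))      # 1023.0, ~1.05e6, ~1.07e9

    # ... so even a tiny per-node constant K_0 becomes visible:
    k0 = 1e-6                                # hypothetical 1 microsecond per node
    print(k0 * geometric_nodes(2, 30))       # ~1074 'seconds' of pure overhead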
Analysis of Post-Processing

Figure 2: Pessimistic Pruning

    PostProc(dtree, data)
        AsIs, Pruned, Surgery, and Q
            are defined as in Eval(dtree, data)
        if (AsIs ≤ min{Pruned, Surgery})
            return AsIs
        else
            if (Pruned ≤ Surgery) then
                dtree <- MakeLeaf(data)
                return Pruned
            else    /* surgery performed here */
                dtree <- Q
                Surgery <- PostProc(Q, data)
                return Surgery

    Eval(dtree, data)
        N <- |data|
        Q <- nil
        if dtree is a leaf then
            P <- PredictedClass(dtree)
            E <- (N − |P|)
            Pruned = Surgery = AsIs <- f(E, N)
            return AsIs
        else
            F <- SplitInfo(Root(dtree))
            Ldata, Rdata <- Partition(data, F)
            L <- |Ldata| / N
            R <- |Rdata| / N
            AsIs <- L × Eval(Ltree, Ldata) + R × Eval(Rtree, Rdata)
            P <- LargestClass(data)
            E <- (N − |P|)
            Pruned <- f(E, N)
            if (L > R) then Q <- Ltree
            else Q <- Rtree
            Surgery <- Eval(Q, data)
            return min(AsIs, Pruned, Surgery)

Figure 2 summarizes a typical post-processing routine, C4.5's (Quinlan 1993) pessimistic pruning, for binary trees. Examination of Figure 2 reveals that the dominant factors are the height of the tree, H, the data set size, N, the height and weight balance of the tree, and whether a decision node is replaced by a child rather than simply pruned or left unmodified.
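For comparison with Figure 2, the following Python sketch follows the same Eval/PostProc skeleton for binary trees built as nested tuples. The error estimate f(E, N) is reduced to a raw error count here; C4.5's actual pessimistic estimate and the figure's exact bookkeeping are not reproduced.

    from collections import Counter

    def estimate(errors, n):
        # Placeholder for C4.5's pessimistic error estimate f(E, N).
        return errors

    def evaluate(tree, data):
        """Return (as_is, pruned, surgery, larger_child) for `tree` on `data`,
        echoing Eval() in Figure 2. Trees: ('leaf', cls) or ('node', test, lt, rt),
        where test(features) -> True routes the item to the left subtree."""
        n = len(data)
        classes = [c for c, _ in data]
        if tree[0] == 'leaf':
            e = estimate(sum(c != tree[1] for c in classes), n)
            return e, e, e, None
        _, test, ltree, rtree = tree
        ldata = [d for d in data if test(d[1])]
        rdata = [d for d in data if not test(d[1])]
        as_is = evaluate(ltree, ldata)[0] + evaluate(rtree, rdata)[0]
        majority = Counter(classes).most_common(1)[0][0] if classes else None
        pruned = estimate(sum(c != majority for c in classes), n)
        larger = ltree if len(ldata) >= len(rdata) else rtree
        surgery = evaluate(larger, data)[0]
        return as_is, pruned, surgery, larger

    def post_process(tree, data):
        """Keep, prune to a leaf, or replace by the larger child (tree surgery),
        whichever evaluates best -- the PostProc() skeleton of Figure 2."""
        as_is, pruned, surgery, larger = evaluate(tree, data)
        if tree[0] == 'leaf' or as_is <= min(pruned, surgery):
            return tree
        if pruned <= surgery:
            classes = [c for c, _ in data]
            return ('leaf', Counter(classes).most_common(1)[0][0] if classes else None)
        return post_process(larger, data)     # surgery, then re-examine

Unlike the figure, which returns the chosen cost and modifies the tree in place, this sketch returns the resulting (possibly replaced) subtree.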
We denote the time complexity of post-processing and evaluation as T_P(H, N) and T_E(H, N), respectively. In all cases, 0 ≤ H < N and the recursion ends with T_P(0, N) = T_E(0, N) = Θ(N). When tree surgery is not actually performed, but merely evaluated, T_P(H, N) = T_E(H, N).

When the surgery is performed, then T_P(H, N) = T_E(H, N) + T_P(q, N), where q is the height of the child covering the most items (the larger child). In the worst case q is H−1, and so

    T_P(H, N) ≥ T_E(H, N) + T_P(H−1, N) ≥ Σ_{i=0..H} T_E(i, N)

If m is the size of the larger child and r the height of the smaller child, then either q = H−1 or r = H−1, and

    T_E(H, N) = Θ(N) + T_E(q, m) + T_E(r, N−m) + T_E(q, N)    (1)

We prove that T_E(H, N) and T_P(H, N) are Θ(N) in the best case and that their worst-case complexity is Θ(N 2^H). Thus, both T_E() and T_P() have tight bounds of Ω(N) and O(N 2^H).
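The Θ(N 2^H) worst case can be checked by iterating recurrence (1) with the worst-case choices q = r = H−1 and a balanced split m = N/2. The sketch below is only a numerical check of that growth rate under those assumptions, not part of the proof.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def t_eval(h, n):
        """Worst case of recurrence (1): q = r = H-1, m = N/2,
        with the Theta(N) term taken as exactly N."""
        if h == 0 or n <= 1:
            return n
        m = n // 2
        return n + t_eval(h - 1, m) + t_eval(h - 1, n - m) + t_eval(h - 1, n)

    n = 4096
    for h in range(1, 9):
        # The ratio approaches 2, i.e. T_E(H, N) grows like N * 2^H.
        print(h, t_eval(h, n), t_eval(h, n) / t_eval(h - 1, n))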
To infer typical behavior, we note that real-world data are usually noisy, and real attributes are seldom perfect predictors of class. For these reasons, there is usually some finite impurity rate I for each branch of a split. For n instances from a population with rate I, the likelihood P that the branch will contain instances from more than one class is given by P = 1 − (1 − I)^n, and is an increasing function of the subset size n. For such an impure branch, additional splits will be needed to separate the classes. Thus, there is a tendency for larger subsets to have deeper subtrees.
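As a quick illustration of P = 1 − (1 − I)^n, the snippet below tabulates P for a few subset sizes at an assumed impurity rate of 0.05 (an arbitrary example, not a figure from the paper).

    def impure_prob(impurity_rate, n):
        """P = 1 - (1 - I)^n: chance that a branch of n items is not pure."""
        return 1.0 - (1.0 - impurity_rate) ** n

    for n in (5, 20, 50, 200):
        print(n, round(impure_prob(0.05, n), 3))
    # 5 -> 0.226, 20 -> 0.642, 50 -> 0.923, 200 -> ~1.0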
If H_L and H_R are respectively the left and right subtree heights, and n the left subset size, then

    T_E(H, N) = Θ(N) + T_E(H_L, n) + T_E(H_R, N−n)
                + { T_E(H_L, N)   if n ≥ N/2
                  { T_E(H_R, N)   if n < N/2
Now, either H_L = H−1 or H_R = H−1, and the likelihood that H_L = H−1 increases as n increases. If we express this increasing likelihood as Prob(H_L = H−1) = p(x), where x = n/N, then the approximate expected value, t(H, N), of T_E(H, N) is

    t(H, N) ≥ Θ(N) + 0.5 t(H−1, N)
              + p(x) t(H−1, Nx) + (1−p(x)) t(H−1, N(1−x))    (2)

If we assume that p(x) = x, and that x is constant throughout the tree, we can solve Equation 2 by induction on H, obtaining t(H, N) ≥ Θ(N (1+z)^H), where z = 2(x−0.5)².
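The closed form t(H, N) ≥ Θ(N (1+z)^H) with z = 2(x − 0.5)² can be checked numerically by iterating recurrence (2) with p(x) = x, a constant weight balance x, and the Θ(N) term taken as exactly N. The sketch below is such a check under those stated assumptions.

    def t_expected(h, n, x):
        """Iterate recurrence (2) with p(x) = x and constant weight balance x,
        treating the Theta(N) term as exactly N."""
        if h == 0:
            return n
        return (n + 0.5 * t_expected(h - 1, n, x)
                + x * t_expected(h - 1, n * x, x)
                + (1 - x) * t_expected(h - 1, n * (1 - x), x))

    n, x = 10000.0, 0.8
    z = 2 * (x - 0.5) ** 2              # = 0.18 for x = 0.8
    for h in (4, 8, 12):
        ratio = t_expected(h, n, x) / (n * (1 + z) ** h)
        print(h, round(ratio, 2))       # ~3.69, 5.08, 5.79: the ratio rises toward
                                        # (1+z)/z and never drops below 1, so t(H, N)
                                        # grows as Theta(N (1+z)^H) here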
Obtaining a solution is more complex when the weight balance x is not the same for every split, but the solution has a similar form, i.e., t(H, N) ≥ Θ(N) Σ_i Π_j (1 + z_ij), and we should expect a similar result, namely t(H, N) ≥ Θ(N (1 + z̄)^H), where z̄ is a geometric mean of the various z_ij terms and increases with the variance of x. This expectation is borne out by simulation results. t(H, N) is also exponential in H for other forms of p(x), as shown by simulation of a sigmoid p(x) function, which is more plausible than the linear form because of certain boundary constraints for both very large and very small x.

The recurrence (see Equation 2) on which our Θ(N (1+z)^H) result is based gives only a lower bound on the expected behavior, obtained by omitting the expected time to evaluate the shallower subtree at each internal node. The contribution of the omitted subtrees increases as their height or weight increases, and we expect that the incremental run times would be more nearly correlated with a weighted average depth than with the maximum depth of the tree. This expectation is confirmed by simulation outcomes.

Based on these results, we expect run times to be proportional to N (1 + z̄)^h*, where h* is an average height (not necessarily the weight average, h). A very good fit to empirical data spanning 5 decades of run time is obtained using h* = √(h h′), where h and h′ are, respectively, the weight-average heights before and after post-processing.
Discussion

Both tree-building and post-processing have components that are exponential in tree height, which is bounded above by the candidate set's dimensionality. While the dominant goal is to maximize accuracy, we must remember that TDIDT inherently compromises by using greedy, heuristic search and limiting the candidates. A premium is placed on methods which minimize dimensionality without sacrificing accuracy. In particular, approaches which replace a V-ary split with binary splits should be avoided, as this leads to deeper trees and to re-defining the candidates at every node, a considerable computational expense.

We have confirmed these observations experimentally using 16 data sets from various application domains. Ten of these data sets involved continuous attributes which were converted to nominal attributes in two ways: using arbitrary cut-points at approximately the quartile values, and choosing cut-points at local minima of smoothed histograms (the 'natural' cut-points). Of the 26 resulting data sets, 2 involved only binary attributes, and 24 had attributes of different arities. These 24 data sets were converted to all binary splits by simply creating a new binary attribute for each attribute-value pair. (See (Martin 1995) for details of these experiments.)

Multi-way trees consistently have more leaves, are shallower, and are learned more quickly than their binary counterparts. On the few occasions when there is a notable difference in accuracy, the binary trees appear to be more accurate. We show that remapping multi-valued attributes into a new set of binary attributes indirectly expands the candidate set and, by heuristically choosing among a larger set of candidates, it sometimes improves accuracy. However, a large penalty is paid in terms of increased run time and tree complexity.

In the quartiles method, the most unbalanced split is approximately 75/25. The 'natural' cut-points, by contrast, tend to produce very unbalanced splits. The differences in accuracy may be large, and either method may be the more accurate one. The quartiles trees consistently have more leaves, but are shallower (more efficient and more quickly learned).

Post-processing does not improve accuracy, and only occasionally does it significantly reduce the number of leaves or the average height. It does, however, consistently increase the learning time by as much as a factor of 2. In the few cases where post-processing does significantly modify the trees, the unpruned trees are very unbalanced.

The effects of using different heuristics are illustrated using three functions (information gain, orthogonality, and the hypergeometric). There are no significant differences in accuracy between heuristics, and the inferred trees all have about the same number of leaves. The tree height and learning time, however, do vary significantly and systematically. For the worst-case data, learning times differ by a factor of 12. The hypergeometric is better justified on statistical grounds than the other heuristic functions, and it also gives better results empirically.

Stopping and pruning are basically the same operation, differing in that the decision whether to stop is based on only local information, whereas the decision whether to prune is based on information from subsequent splits (look-ahead). The effect of surgery can be achieved by simple pruning or stopping performed on another tree in which the order of the splits is different. This point is very important, since it is the tree surgery which drives the exponential growth of run time vs. tree depth; a simple post-pruning algorithm that does not consider the surgical option would have O(H N) time complexity.

The surgical option can be viewed as an expensive ex post attempt to correct for a bad choice made during tree building, which is necessary because the selection heuristic does not order the splits correctly from the point of view of efficient pruning. Since the choice of heuristic is largely a matter of complexity, not of accuracy, we should prefer heuristics which tend to build balanced, shallow trees in which the splits are ordered so as to allow efficient pruning.

The preceding observations concerning the ordering of splits are doubly important in the context of stopping, since stopping is simply a pruning decision made with less information. In particular, if the split selection heuristic (e.g., information gain) and the stopping criterion (e.g., χ²) differ significantly as to the relative merit of a candidate (and these two do), then our stopping strategy (e.g., choose a candidate based on information gain and accept or reject it based on χ²) will almost certainly lead to poor results. The obvious solution to this dilemma is to use the same evaluation function for ranking splits as for deciding whether to prune/stop. We show that neither information gain nor χ² is suitable for this dual purpose.

χ² and gain approximate the logarithm of the hypergeometric function, the exact likelihood that the observed degree of association between subset membership and class is coincidental. We compare the results of stopping using the hypergeometric to unstopped and post-processed trees. Stopping in this way is not detrimental to accuracy. The stopped trees are markedly simpler than the unstopped trees, and are learned in about half the time (one-fourth the time for post-processing).
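In the two-class, binary-split case, this hypergeometric likelihood can be computed with Fisher's exact test, which sums hypergeometric probabilities over 2×2 tables at least as extreme as the observed one. The sketch below uses scipy for that special case only; the exact metric of (Martin 1995), which handles arbitrary arities and class counts, is not reproduced here, and the 0.05 threshold is purely illustrative.

    from scipy.stats import fisher_exact

    def split_p_value(table):
        """Probability (under the hypergeometric null) that an association at
        least as strong as the observed 2x2 table arises by chance.
        table = [[a, b], [c, d]]: rows are the two branches, columns the classes."""
        _, p = fisher_exact(table, alternative='two-sided')
        return p

    def choose_or_stop(candidate_tables, alpha=0.05):
        """Rank candidate binary splits by the same measure used for stopping:
        pick the split with the smallest p-value, and stop (return None) when
        even the best split is not significant at level alpha."""
        best = min(candidate_tables, key=split_p_value)
        return best if split_p_value(best) <= alpha else None

    # e.g. a fairly pure split vs. an uninformative one
    print(split_p_value([[18, 2], [3, 17]]))   # small p: keep splitting
    print(split_p_value([[10, 10], [9, 11]]))  # large p: stop / prune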
Conclusions

1. Insofar as accuracy is concerned, the important design decisions are those which expand the candidate set. Other factors (e.g., the heuristic and stopping/pruning) generally have little impact on accuracy.

2. For single-attribute, multi-way splits on A discrete variables, the time to build a tree for N items is O(A² N). If all V-ary splits are binarized, this becomes O(d² N), where d is the dimensionality. For continuous attributes, the tree-building time may be O(A² N³). Thus, there is potentially a large payoff for pre-processing to reduce dimensionality.

3. Both tree building and post-processing have components which increase exponentially in tree height. This puts a great premium on design decisions which tend to produce shallower trees.

4. Tree surgery is equivalent to simply pruning an alternative tree in which the root split of the current tree is made last, rather than first. The need for such surgery arises from using biased heuristics and different criteria for selection than for pruning.

5. χ² and information gain are biased and inadmissible, and should not be used. The hypergeometric is admissible, and it allows efficient pruning or stopping. It builds simpler and shallower trees which are no less accurate than those built using other heuristics. Learning time can be reduced as much as an order of magnitude by using the hypergeometric.

References

Martin, J. K. 1995. An exact probability metric for decision tree splitting and stopping. Technical Report 95-16, University of California, Irvine, Irvine, CA.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1:81-106.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
