
9.2. Tree-Based Methods

Principle:
1. Partition the feature space X into a set of rectangles, i.e. homogeneous
regions Rm, by recursive binary partitioning
2. Fit a simple model for Y (e.g. a constant) in each
rectangle

CART: Classification and Regression Trees

• Y continuous → regression model: Regression Trees
• Y categorical → classification model: Classification Trees

Ex: regression case, with Y continuous and two predictors X1, X2:
1. Binary partition on X1 at t1 → 2 regions: Ra = {X1 ≤ t1}, Rb = {X1 > t1}
2. Model Y in Ra and in Rb
3. Recursive partition: split again, e.g. Ra on X2 at t2, Rb on X1 at t3, ...
• Advantage: easy to interpret

• Disadvantage: instability
– An error made at an upper level is propagated to the lower levels

• How to grow a tree?
Choose a splitting variable and a split point
→ minimize an impurity criterion
• How large should we grow the tree?
Need a stopping rule based on an impurity criterion
9.2.2. Regression Trees
• (xi, yi) for i = 1, 2, ..., N, with xi ∈ ℝ^p

• Partition the space into M regions: R1, R2, …, RM.


f(x) = Σ_{m=1}^{M} c_m I(x ∈ Rm)

where ĉ_m = average(y_i | x_i ∈ Rm)
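A minimal sketch of this piecewise-constant model, assuming a single predictor X1 split at a hypothetical point t1 = 0.5 and synthetic data; each ĉ_m is simply the mean of the responses falling in that region:

```python
import numpy as np

# Hypothetical data: one predictor X1, a step in Y at t1 = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.where(x <= 0.5, 1.0, 3.0) + rng.normal(0, 0.1, size=50)

t1 = 0.5
c_a = y[x <= t1].mean()   # c_m for Ra = {X1 <= t1}: average(y_i | x_i in Ra)
c_b = y[x > t1].mean()    # c_m for Rb = {X1 > t1}: average(y_i | x_i in Rb)

def f(x_new):
    """Piecewise-constant prediction: sum over regions of c_m * I(x in R_m)."""
    return np.where(x_new <= t1, c_a, c_b)

print(f(np.array([0.2, 0.9])))   # roughly [1.0, 3.0]
```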
How to grow the regression tree:
• Choose the best splitting variable Xj and split value s
to minimize the sum of squared errors:

Σ_{i=1}^{N} (y_i − f(x_i))²

• Finding the global minimum is computationally infeasible


• Greedy algorithm:
at each level, choose the splitting variable j and split point s as

argmin_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]

where R1(j,s) = {X | Xj ≤ s}, R2(j,s) = {X | Xj > s}, and each inner
minimum is attained by the region mean (see the sketch below).
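A sketch of this greedy search, assuming candidate split points are taken from the observed values of each predictor (the function and the data are illustrative, not from the source):

```python
import numpy as np

def best_split(X, y):
    """Scan all (j, s) pairs and return the one minimizing the two-region
    residual sum of squares; each inner minimum is the region mean."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Hypothetical example where the true split is X0 <= 0.5:
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] <= 0.5, 0.0, 2.0) + rng.normal(0, 0.1, 100)
print(best_split(X, y))   # expect j = 0 and s close to 0.5
```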
How large should we grow the tree?

• Trade-off between accuracy and generalization


– Very large tree: overfit
– Small tree: might not capture the structure

• Strategies:
– 1: Split only when the decrease in error exceeds a threshold
(short-sighted)
– 2: Cost-complexity pruning (preferred):
- Grow a large tree T0, stopping only at a minimum node size (e.g. Nm = 5)
- Pruning = collapsing some internal nodes
→ find the subtree Tα ⊆ T0 that minimizes Cα(T)
Minimize the cost complexity Cα(T), where |T| = number of terminal
nodes in T and α ≥ 0 is a tuning parameter:

Cα(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α |T|

where the first term is the cost (sum of squared errors over the terminal
nodes) and the second term is a penalty on the complexity/size of the tree.

– For each α, there is a unique smallest Tα
– To find Tα: weakest-link pruning
• Each time, collapse the internal node that produces the smallest increase in error
→ this gives a sequence of subtrees containing Tα
• Estimation of α: minimize the cross-validated sum of squares (p. 214)
• Choose the final tree T̂ from this sequence
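To make the trade-off concrete, a small sketch that evaluates Cα(T) for two hypothetical subtrees of a pruning sequence (node sizes and impurities are invented for illustration):

```python
import numpy as np

def cost_complexity(node_sizes, node_impurities, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|."""
    sizes = np.asarray(node_sizes, dtype=float)
    imps = np.asarray(node_impurities, dtype=float)
    return (sizes * imps).sum() + alpha * len(sizes)

# Hypothetical members of a weakest-link pruning sequence:
large_tree = dict(node_sizes=[10, 12, 8, 15, 9, 11],          # |T| = 6
                  node_impurities=[0.30, 0.25, 0.20, 0.35, 0.28, 0.22])
small_tree = dict(node_sizes=[22, 23, 20],                    # |T| = 3
                  node_impurities=[0.40, 0.45, 0.38])

for alpha in (0.0, 2.0, 10.0):
    print(alpha,
          round(cost_complexity(alpha=alpha, **large_tree), 2),
          round(cost_complexity(alpha=alpha, **small_tree), 2))
# alpha = 0 favors the large tree; a large enough alpha favors the pruned one.
```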
9.2.3. Classification Trees
• Y ∈ {1, 2, ..., k, ..., K}
• Classify the observations in node m to the majority class in the node:

k̂(m) = argmax_k p̂_mk

where p̂_mk := proportion of observations of class k in node m

• Splitting: minimize the node impurity Qm(T)

• Pruning: minimize Cα(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α |T|

In regression, Qm(T) = squared error of node m
→ not suitable for classification
→ need other choices of Qm(T)
Define Qm(T) for a node m:

If we classify to the majority class k̂(m) = argmax_k p̂_mk:

– Misclassification error: 1 − p̂_{m,k̂(m)}

– Gini index: Σ_{k≠k'} p̂_mk p̂_mk' = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)

– Cross-entropy (deviance): − Σ_{k=1}^{K} p̂_mk log p̂_mk
• Ex: 2 classes of Y, with p = the proportion in the second class
(see the numerical sketch below)

• Cross-entropy and Gini are more sensitive to changes in the node
probabilities than the misclassification rate
→ e.g. they are lower for a split that produces a pure node
→ To grow the tree: use cross-entropy or Gini
• To prune the tree: use the misclassification rate (or any of the three)
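A numerical sketch of the three impurity measures for a two-class node (p = proportion of the second class); the values are computed directly from the definitions above:

```python
import numpy as np

def node_impurities(p):
    """Misclassification error, Gini index, and cross-entropy for a
    two-class node with class proportions (1 - p, p)."""
    probs = np.array([1.0 - p, p])
    misclass = 1.0 - probs.max()               # 1 - p_hat_{m, k(m)}
    gini = (probs * (1.0 - probs)).sum()       # sum_k p_hat_mk (1 - p_hat_mk)
    nonzero = probs[probs > 0]                 # convention: 0 * log 0 = 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    return misclass, gini, entropy

for p in (0.0, 0.1, 0.5):
    print(p, [round(v, 3) for v in node_impurities(p)])
# All three vanish for a pure node (p = 0) and are maximal at p = 0.5.
```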
9.2.4. Discussions on Tree-Based Methods

• Categorical Predictors, X:
– Problem:
Consider splitting a node t into tL and tR on an
unordered categorical predictor X with q
possible values: there are 2^(q−1) − 1 possible partitions!

– Solution:
– Order the predictor's categories by increasing mean of the
outcome Y.
– Treat the categorical predictor as if it were ordered
→ this gives the optimal split, in terms of squared error
or Gini index, among all 2^(q−1) − 1 possible splits
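A sketch of the ordering trick on hypothetical data, using pandas for the per-category means (column names and values are invented):

```python
import pandas as pd

# Hypothetical data: an unordered categorical predictor with q = 4 levels.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "grey", "green", "grey"],
    "y":     [ 1.0,   3.0,    2.0,     3.5,   0.5,   5.0,    2.5,     4.5 ],
})

# Order the categories by the mean of the outcome within each category...
order = df.groupby("color")["y"].mean().sort_values().index.tolist()
print(order)   # ['red', 'green', 'blue', 'grey']

# ...then treat the predictor as ordered: only q - 1 candidate splits of this
# ordering need to be scanned, instead of 2^(q-1) - 1 subsets.
candidate_splits = [(order[:i + 1], order[i + 1:]) for i in range(len(order) - 1)]
print(candidate_splits)
```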
• Classification: The Loss Matrix
– The consequences of misclassification depend on the class
– Define a loss function L → a K × K loss matrix with entries Lkk' =
loss of classifying a class-k observation as class k'
– Modify the Gini index as Σ_{k≠k'} L_kk' p̂_mk p̂_mk'

→ In the 2-class case this has no effect
→ Alternative: weight the observations in class k by Lkk'
(but this alters the prior probabilities on the classes)

→ In a terminal node m, classify to the class

k̂(m) = argmin_k Σ_l L_lk p̂_ml
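A small sketch of the loss-weighted rule, with an invented 3×3 loss matrix in which misclassifying true class 0 is very costly:

```python
import numpy as np

# L[l, k] = loss of predicting class k when the true class is l.
L = np.array([[0.0, 10.0, 10.0],    # errors on true class 0 are costly
              [1.0,  0.0,  1.0],
              [1.0,  1.0,  0.0]])

p_m = np.array([0.2, 0.5, 0.3])     # class proportions p_hat_ml in node m

print(int(np.argmax(p_m)))          # majority vote: class 1

expected_loss = p_m @ L             # component k is sum_l L[l, k] * p_hat_ml
print(expected_loss, int(np.argmin(expected_loss)))   # loss-aware: class 0
```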
• Missing Predictor Values
– If we have enough training data: discard observations
with missing values
– Fill in (impute) the missing value,
e.g. with the mean of the known values
– For a categorical predictor: create a category called
“missing”
– Surrogate variables
• Choose the primary predictor and split point
• Build a list of surrogate predictors and split points
• The first surrogate predictor best mimics the split by the
primary predictor, the second does second best, ...
• When sending observations down the tree, use the primary predictor first.
If its value is missing, use the first surrogate; if the
first surrogate is missing, use the second; ...
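A sketch of how a surrogate split could be selected, assuming agreement with the primary split is the criterion (variable names, data, and the quantile grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 1, n)             # primary predictor
x2 = x1 + rng.normal(0, 0.1, n)       # correlated -> good surrogate
x3 = rng.uniform(0, 1, n)             # unrelated  -> poor surrogate

primary = x1 <= 0.5                   # primary split, chosen first

def best_surrogate(primary_mask, candidates):
    """Pick the candidate split that best mimics the primary split,
    i.e. sends the largest fraction of observations to the same side."""
    best = (None, None, 0.0)
    for name, x in candidates.items():
        for s in np.quantile(x, np.linspace(0.05, 0.95, 19)):
            agree = max(np.mean((x <= s) == primary_mask),
                        np.mean((x > s) == primary_mask))
            if agree > best[2]:
                best = (name, s, agree)
    return best

print(best_surrogate(primary, {"x2": x2, "x3": x3}))
# When x1 is missing for a new observation, send it down the tree using the
# best surrogate split; if that is also missing, use the next one, and so on.
```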
• Why binary splits?
The problem with multi-way splits is that they fragment the data too quickly,
leaving insufficient data at the next level down.

• Linear Combination Splits
– Split the node not on Xj ≤ s but on Σ_j a_j X_j ≤ s,
with the weights aj and split point s optimized jointly
– Improves predictive power
– Hurts interpretability

• Instability of Trees
• Other trees:
C5.0: after growing the tree, splitting rules can be simplified by dropping
conditions without changing the subset of observations in the node
