
9.2. Tree-Based Methods

Principle:
1. Partition the feature space X into a set of rectangles, i.e. homogeneous
regions Rm, by recursive binary partitioning
2. Fit a simple model for Y (e.g. a constant) in each
rectangle

CART: Classification and Regression Trees

• Y continuous → regression model: Regression Trees
• Y categorical → classification model: Classification Trees

Ex: regression case, with Y continuous and two predictors X1, X2:
1. Binary partition on X1 at t1 → 2 regions: Ra = {X1 ≤ t1}, Rb = {X1 > t1}
2. Model Y in Ra and in Rb
3. Recursive partition: split again, e.g. Ra on X2 at t2, Rb on X1 at t3, ...
• Advantage: easy to interpret

• Disadvantage: instability
– An error made at an upper level is propagated to the lower levels

• How to grow a tree?
Choose a splitting variable and a split point
→ minimize an impurity criterion
• How large should we grow the tree?
Need a stopping rule based on an impurity criterion
9.2.2. Regression Trees
• (xi, yi) for i = 1, 2, ..., N, with xi ∈ ℝ^p

• Partition the space into M regions: R1, R2, …, RM.


f(x) = Σ_{m=1}^{M} c_m I(x ∈ Rm)

where ĉ_m = average(y_i | x_i ∈ Rm)
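A minimal sketch of this piecewise-constant model, assuming a single predictor X1 split at a hypothetical point t1 = 0.5 and synthetic data; each ĉ_m is simply the mean of the responses falling in that region:

```python
import numpy as np

# Hypothetical data: one predictor X1, a step in Y at t1 = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = np.where(x <= 0.5, 1.0, 3.0) + rng.normal(0, 0.1, size=50)

t1 = 0.5
c_a = y[x <= t1].mean()   # c_m for Ra = {X1 <= t1}: average(y_i | x_i in Ra)
c_b = y[x > t1].mean()    # c_m for Rb = {X1 > t1}: average(y_i | x_i in Rb)

def f(x_new):
    """Piecewise-constant prediction: sum over regions of c_m * I(x in R_m)."""
    return np.where(x_new <= t1, c_a, c_b)

print(f(np.array([0.2, 0.9])))   # roughly [1.0, 3.0]
```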
How to grow the regression tree:
• Choose the best splitting variable Xj and split value s
to minimize the sum of squared errors:

Σ_{i=1}^{N} (y_i − f(x_i))²

• Finding the global minimum is computationally infeasible


• Greedy algorithm:
at each level, choose the splitting variable j and split point s as

argmin_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]

where R1(j,s) = {X | Xj ≤ s}, R2(j,s) = {X | Xj > s}, and each inner
minimum is attained by the region mean (see the sketch below).
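A sketch of this greedy search, assuming candidate split points are taken from the observed values of each predictor (the function and the data are illustrative, not from the source):

```python
import numpy as np

def best_split(X, y):
    """Scan all (j, s) pairs and return the one minimizing the two-region
    residual sum of squares; each inner minimum is the region mean."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Hypothetical example where the true split is X0 <= 0.5:
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 2))
y = np.where(X[:, 0] <= 0.5, 0.0, 2.0) + rng.normal(0, 0.1, 100)
print(best_split(X, y))   # expect j = 0 and s close to 0.5
```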
How large should we grow the tree?

• Trade-off between accuracy and generalization


– Very large tree: overfit
– Small tree: might not capture the structure

• Strategies:
– 1: Split only when the decrease in error exceeds a threshold
(short-sighted)
– 2: Cost-complexity pruning (preferred):
- Grow a large tree T0, stopping only at a minimum node size (e.g. Nm = 5)
- Pruning = collapsing some internal nodes
→ find the subtree Tα ⊆ T0 that minimizes Cα(T)
Minimize the cost complexity Cα(T), where |T| = number of terminal
nodes in T and α ≥ 0 is a tuning parameter:

Cα(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α |T|

where the first term is the cost (sum of squared errors over the terminal
nodes) and the second term is a penalty on the complexity/size of the tree.

– For each α, there is a unique smallest Tα
– To find Tα: weakest-link pruning
• Each time, collapse the internal node that produces the smallest increase in error
→ this gives a sequence of subtrees containing Tα
• Estimation of α: minimize the cross-validated sum of squares (p. 214)
• Choose the final tree T̂ from this sequence
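To make the trade-off concrete, a small sketch that evaluates Cα(T) for two hypothetical subtrees of a pruning sequence (node sizes and impurities are invented for illustration):

```python
import numpy as np

def cost_complexity(node_sizes, node_impurities, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|."""
    sizes = np.asarray(node_sizes, dtype=float)
    imps = np.asarray(node_impurities, dtype=float)
    return (sizes * imps).sum() + alpha * len(sizes)

# Hypothetical members of a weakest-link pruning sequence:
large_tree = dict(node_sizes=[10, 12, 8, 15, 9, 11],          # |T| = 6
                  node_impurities=[0.30, 0.25, 0.20, 0.35, 0.28, 0.22])
small_tree = dict(node_sizes=[22, 23, 20],                    # |T| = 3
                  node_impurities=[0.40, 0.45, 0.38])

for alpha in (0.0, 2.0, 10.0):
    print(alpha,
          round(cost_complexity(alpha=alpha, **large_tree), 2),
          round(cost_complexity(alpha=alpha, **small_tree), 2))
# alpha = 0 favors the large tree; a large enough alpha favors the pruned one.
```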
9.2.3. Classification Trees
• Y ∈ {1, 2, ..., k, ..., K}
• Classify the observations in node m to the majority class in the node:

k̂(m) = argmax_k p̂_mk

where p̂_mk := proportion of observations of class k in node m

• Splitting: minimize the node impurity Qm(T)

• Pruning: minimize Cα(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α |T|

In regression, Qm(T) = squared error of node m
→ not suitable for classification
→ need other choices of Qm(T)
Define Qm(T) for a node m:

If we classify to the majority class k̂(m) = argmax_k p̂_mk:

– Misclassification error: 1 − p̂_{m,k̂(m)}

– Gini index: Σ_{k≠k'} p̂_mk p̂_mk' = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)

– Cross-entropy (deviance): − Σ_{k=1}^{K} p̂_mk log p̂_mk
• Ex: 2 classes of Y, with p = the proportion in the second class
(see the numerical sketch below)

• Cross-entropy and Gini are more sensitive to changes in the node
probabilities than the misclassification rate
→ e.g. they are lower for a split that produces a pure node
→ To grow the tree: use cross-entropy or Gini
• To prune the tree: use the misclassification rate (or any of the three)
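A numerical sketch of the three impurity measures for a two-class node (p = proportion of the second class); the values are computed directly from the definitions above:

```python
import numpy as np

def node_impurities(p):
    """Misclassification error, Gini index, and cross-entropy for a
    two-class node with class proportions (1 - p, p)."""
    probs = np.array([1.0 - p, p])
    misclass = 1.0 - probs.max()               # 1 - p_hat_{m, k(m)}
    gini = (probs * (1.0 - probs)).sum()       # sum_k p_hat_mk (1 - p_hat_mk)
    nonzero = probs[probs > 0]                 # convention: 0 * log 0 = 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    return misclass, gini, entropy

for p in (0.0, 0.1, 0.5):
    print(p, [round(v, 3) for v in node_impurities(p)])
# All three vanish for a pure node (p = 0) and are maximal at p = 0.5.
```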
9.2.4. Discussions on Tree-Based Methods

• Categorical Predictors, X:
– Problem:
Consider splitting a node t into tL and tR on an
unordered categorical predictor X with q
possible values: there are 2^(q−1) − 1 possible partitions!

– Solution:
– Order the predictor's categories by increasing mean of the
outcome Y.
– Treat the categorical predictor as if it were ordered
→ this gives the optimal split, in terms of squared error
or Gini index, among all 2^(q−1) − 1 possible splits
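A sketch of the ordering trick on hypothetical data, using pandas for the per-category means (column names and values are invented):

```python
import pandas as pd

# Hypothetical data: an unordered categorical predictor with q = 4 levels.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "grey", "green", "grey"],
    "y":     [ 1.0,   3.0,    2.0,     3.5,   0.5,   5.0,    2.5,     4.5 ],
})

# Order the categories by the mean of the outcome within each category...
order = df.groupby("color")["y"].mean().sort_values().index.tolist()
print(order)   # ['red', 'green', 'blue', 'grey']

# ...then treat the predictor as ordered: only q - 1 candidate splits of this
# ordering need to be scanned, instead of 2^(q-1) - 1 subsets.
candidate_splits = [(order[:i + 1], order[i + 1:]) for i in range(len(order) - 1)]
print(candidate_splits)
```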
• Classification: The Loss Matrix
– The consequences of misclassification depend on the class
– Define a loss function L → a K × K loss matrix with entries Lkk' =
loss of classifying a class-k observation as class k'
– Modify the Gini index as Σ_{k≠k'} L_kk' p̂_mk p̂_mk'

→ In the 2-class case this has no effect
→ Alternative: weight the observations in class k by Lkk'
(but this alters the prior probabilities on the classes)

→ In a terminal node m, classify to the class

k̂(m) = argmin_k Σ_l L_lk p̂_ml
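A small sketch of the loss-weighted rule, with an invented 3×3 loss matrix in which misclassifying true class 0 is very costly:

```python
import numpy as np

# L[l, k] = loss of predicting class k when the true class is l.
L = np.array([[0.0, 10.0, 10.0],    # errors on true class 0 are costly
              [1.0,  0.0,  1.0],
              [1.0,  1.0,  0.0]])

p_m = np.array([0.2, 0.5, 0.3])     # class proportions p_hat_ml in node m

print(int(np.argmax(p_m)))          # majority vote: class 1

expected_loss = p_m @ L             # component k is sum_l L[l, k] * p_hat_ml
print(expected_loss, int(np.argmin(expected_loss)))   # loss-aware: class 0
```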
• Missing Predictor Values
– If we have enough training data: discard observations
with missing values
– Fill in (impute) the missing value,
e.g. with the mean of the known values
– For a categorical predictor: create a category called
“missing”
– Surrogate variables
• Choose the primary predictor and split point
• Build a list of surrogate predictors and split points
• The first surrogate predictor best mimics the split by the
primary predictor, the second does second best, ...
• When sending observations down the tree, use the primary predictor first.
If its value is missing, use the first surrogate; if the
first surrogate is missing, use the second; ...
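A sketch of how a surrogate split could be selected, assuming agreement with the primary split is the criterion (variable names, data, and the quantile grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(0, 1, n)             # primary predictor
x2 = x1 + rng.normal(0, 0.1, n)       # correlated -> good surrogate
x3 = rng.uniform(0, 1, n)             # unrelated  -> poor surrogate

primary = x1 <= 0.5                   # primary split, chosen first

def best_surrogate(primary_mask, candidates):
    """Pick the candidate split that best mimics the primary split,
    i.e. sends the largest fraction of observations to the same side."""
    best = (None, None, 0.0)
    for name, x in candidates.items():
        for s in np.quantile(x, np.linspace(0.05, 0.95, 19)):
            agree = max(np.mean((x <= s) == primary_mask),
                        np.mean((x > s) == primary_mask))
            if agree > best[2]:
                best = (name, s, agree)
    return best

print(best_surrogate(primary, {"x2": x2, "x3": x3}))
# When x1 is missing for a new observation, send it down the tree using the
# best surrogate split; if that is also missing, use the next one, and so on.
```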
• Why binary splits?
The problem with multi-way splits is that they fragment the data too quickly,
leaving insufficient data at the next level down.

• Linear Combination Splits
– Split the node not on Xj ≤ s but on Σ_j a_j X_j ≤ s,
with the weights aj and split point s optimized jointly
– Improves predictive power
– Hurts interpretability

• Instability of Trees
• Other trees:
C5.0: after growing the tree, splitting rules can be simplified by dropping
conditions without changing the subset of observations in the node
