Decision Trees
Michael I. Jordan
MIT
Collaborators
Robert Jacobs (Rochester)
Lei Xu (Hong Kong)
Geoffrey Hinton (Toronto)
Steven Nowlan (Synaptics)
Marina Meila (MIT)
Lawrence Saul (MIT)
Outline
decision trees
probabilistic decision trees
EM algorithm and extensions
model selection, Bayesian computations
empirical results
  - system identification
  - classification
theoretical results
  - training set error
  - test set error
Training set: $\mathcal{X} = \{(x^{(t)}, y^{(t)})\}_{t=1}^{T}$

Test set: $\mathcal{X}_s = \{(x^{(t)}, y^{(t)})\}_{t=1}^{T_s}$
Decision trees
[Figure: a binary decision tree; internal nodes test splits such as x3 < 1.4, x1 < 0.5, and x7 < -2.1, and each leaf outputs a value y.]
Regression trees
[Figure: the same tree structure, but each leaf carries a linear model $y = \beta_i^T x$.]
Some advantages:
often much faster than neural networks
often more interpretable
allow operating points to be utilized
Some disadvantages:
non-smooth regression surface
coordinate dependent
batch methods
Why probabilities?
[Figure: a probabilistic decision tree; branches at depth one are indexed by $i$, at depth two by $ij$, and leaves by $ijk$, with gating probabilities attached to the branches and expert models at the leaves.]
Moments of this mixture distribution are readily computed by tree traversal processes.
Define
$$\mu \equiv E(y\,|\,x), \qquad \mu_i \equiv E(y\,|\,x, \omega_i), \qquad \mu_{ij} \equiv E(y\,|\,x, \omega_i, \omega_{ij})$$
and define
$$g_i \equiv P(\omega_i\,|\,x, \theta), \qquad g_{j|i} \equiv P(\omega_{ij}\,|\,x, \omega_i, \theta_i), \qquad g_{k|ij} \equiv P(\omega_{ijk}\,|\,x, \omega_i, \omega_{ij}, \theta_{ij})$$
Then,
$$\mu = \sum_i g_i\, \mu_i, \qquad \mu_i = \sum_j g_{j|i}\, \mu_{ij}, \qquad \mu_{ij} = \sum_k g_{k|ij}\, \mu_{ijk}, \qquad \mu_{ijk} = f(\beta_{ijk}^T x)$$
[Figure: the tree traversal for computing $\mu$; the gating probabilities $g_i$, $g_{j|i}$, $g_{k|ij}$ label the branches, and the leaf means $\mu_{ijk}$ are blended from the bottom up.]
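The traversal can be written directly as a short recursion. The sketch below is illustrative rather than from the talk: it assumes softmax gating at each node, linear experts at the leaves (f = identity), and a simple nested-dictionary tree layout.

```python
import numpy as np

# Illustrative sketch (not from the slides): computing the mixture mean
#   mu = sum_i g_i sum_j g_{j|i} sum_k g_{k|ij} f(beta_ijk^T x)
# by a recursive traversal of the tree. The softmax gating parameterization
# and the node layout are assumptions.

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

def tree_mean(node, x):
    """Return the conditional mean mu at `node` for input x.

    A leaf is {'beta': weight vector}; an internal node is
    {'gate': matrix V (one row per child), 'children': [...]}.
    """
    if 'beta' in node:                     # leaf: linear expert, f = identity
        return node['beta'] @ x
    g = softmax(node['gate'] @ x)          # gating probabilities at this node
    child_means = [tree_mean(c, x) for c in node['children']]
    return sum(gi * mi for gi, mi in zip(g, child_means))

# Tiny two-level example with hypothetical parameters.
rng = np.random.default_rng(0)
d = 3
leaf = lambda: {'beta': rng.normal(size=d)}
node = lambda kids: {'gate': rng.normal(size=(len(kids), d)), 'children': kids}
tree = node([node([leaf(), leaf()]), node([leaf(), leaf()])])
print(tree_mean(tree, rng.normal(size=d)))
```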
Component Models

Decision models: a multinomial logit (softmax) model over the branches,
$$g_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}, \qquad \eta_i = \theta_i^T x$$
so that
$$P(y\,|\,x, \theta) = g_1^{y_1}\, g_2^{y_2} \cdots g_n^{y_n}$$
where $y_i \in \{0, 1\}$ and $\sum_i y_i = 1$.

The log likelihood is
$$l(\theta; \mathcal{X}) = \sum_p \sum_i y_i^{(p)} \log g_i^{(p)}$$
with gradient
$$\frac{\partial l}{\partial \theta_i} = \sum_p \left(y_i^{(p)} - g_i^{(p)}\right) x^{(p)}$$
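The gradient above leads directly to a simple fitting routine. The following sketch is not from the talk; the learning rate, the synthetic data, and the use of plain gradient ascent (rather than IRLS) are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch: gradient-based fitting of the multinomial logit model,
# using dl/dtheta_i = sum_p (y_i^(p) - g_i^(p)) x^(p).

def softmax_rows(scores):
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_multinomial_logit(X, Y, lr=0.1, n_steps=200):
    """X: (P, d) inputs; Y: (P, n) one-hot targets. Returns theta: (n, d)."""
    P, d = X.shape
    n = Y.shape[1]
    theta = np.zeros((n, d))
    for _ in range(n_steps):
        G = softmax_rows(X @ theta.T)        # g_i^(p) for every p, i
        grad = (Y - G).T @ X                 # stacks sum_p (y_i - g_i) x^T
        theta += lr * grad / P               # gradient ascent on the likelihood
    return theta

# Toy usage with synthetic, roughly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
labels = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]
theta = fit_multinomial_logit(X, Y)
print("training accuracy:",
      np.mean(np.argmax(softmax_rows(X @ theta.T), axis=1) == labels))
```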
EM: Tutorial

Suppose that the problem of maximizing a likelihood would be simplified if the values of some additional variables ("missing variables") were known.
These values are not known, but given the
current values of the parameters, they can
be estimated (the E step).
Treat the estimated values as provisionally
correct and maximize the likelihood in the
usual way (the M step).
We now have better parameter values, so the
E step can be repeated. Iterate.
EM: Tutorial (cont.)

"missing" data: $Z$

"complete" data: $Y = \{X, Z\}$

E step:
$$Q(\theta, \theta^{(t)}) = E\left[\,l_c(\theta; Y)\,|\,X, \theta^{(t)}\right]$$

M step:
$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta, \theta^{(t)})$$

Complete likelihood (with indicator variables $z_i$ attached to the branches):
$$l_c(\theta; Y) = \sum_p \sum_i z_i^{(p)} \log\left[g_i^{(p)}\, P(y^{(p)}\,|\,x^{(p)}, \theta_i)\right]$$

Incomplete likelihood:
$$l(\theta; X) = \sum_p \log \sum_i g_i^{(p)}\, P(y^{(p)}\,|\,x^{(p)}, \theta_i)$$

Because the $z_i$ are indicators, their conditional expectations are posterior probabilities:
$$E(z_i^{(p)}\,|\,x^{(p)}, y^{(p)}) = P(\omega_i^{(p)}\,|\,x^{(p)}, y^{(p)})$$
Example
one-level tree
at each leaf, linear regression with Gaussian errors
For the ith leaf and the tth data point:
$$h_i^{(t)} = \frac{g_i^{(t)}\, e^{-\frac{1}{2}\left\|y^{(t)} - \mu_i^{(t)}\right\|^2}}{\sum_j g_j^{(t)}\, e^{-\frac{1}{2}\left\|y^{(t)} - \mu_j^{(t)}\right\|^2}}$$
where $\mu_i^{(t)} = \beta_i^T x^{(t)}$.

This posterior is a normalized distance measure that reflects the relative magnitudes of the residuals $y^{(t)} - \mu_i^{(t)}$.
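For concreteness, a minimal EM loop for this one-level example might look as follows. This is an illustrative sketch, not the talk's implementation: it assumes unit-variance Gaussian errors, a softmax gate, and a few gradient steps for the gate's M step.

```python
import numpy as np

# Minimal EM sketch for the one-level example: softmax gate over experts,
# linear regression with unit-variance Gaussian errors at each leaf.

def softmax_rows(S):
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def em_mixture_of_regressions(X, y, n_experts=2, n_iters=50, rng=None):
    rng = rng or np.random.default_rng(0)
    P, d = X.shape
    beta = rng.normal(size=(n_experts, d))        # expert weights
    V = np.zeros((n_experts, d))                  # gate weights
    for _ in range(n_iters):
        # E step: h_i^(t) proportional to g_i exp(-0.5 (y - beta_i^T x)^2)
        G = softmax_rows(X @ V.T)
        resid = y[:, None] - X @ beta.T
        H = G * np.exp(-0.5 * resid**2)
        H /= H.sum(axis=1, keepdims=True)
        # M step (experts): weighted least squares for each leaf
        for i in range(n_experts):
            W = H[:, i]
            A = (X * W[:, None]).T @ X
            b = (X * W[:, None]).T @ y
            beta[i] = np.linalg.solve(A + 1e-8 * np.eye(d), b)
        # M step (gate): a few cross-entropy gradient steps toward targets H
        for _ in range(10):
            G = softmax_rows(X @ V.T)
            V += 0.5 * (H - G).T @ X / P
    return beta, V

# Toy data generated from two linear regimes split on the first input.
rng = np.random.default_rng(2)
X = np.hstack([rng.normal(size=(500, 1)), np.ones((500, 1))])  # input + bias
true = np.where(X[:, :1] > 0, 2.0 * X[:, :1], -1.0 * X[:, :1])
y = true.ravel() + 0.1 * rng.normal(size=500)
beta, V = em_mixture_of_regressions(X, y)
print("estimated expert weights:\n", beta)
```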
Posterior probabilities
Define
$$h_i \equiv P(\omega_i\,|\,x, y), \qquad h_{j|i} \equiv P(\omega_{ij}\,|\,x, y, \omega_i), \qquad h_{k|ij} \equiv P(\omega_{ijk}\,|\,x, y, \omega_i, \omega_{ij})$$
in parallel with the prior branch probabilities
$$g_i \equiv P(\omega_i\,|\,x), \qquad g_{j|i} \equiv P(\omega_{ij}\,|\,x, \omega_i), \qquad g_{k|ij} \equiv P(\omega_{ijk}\,|\,x, \omega_i, \omega_{ij}).$$
Then, with $P_{ijk}(y\,|\,x)$ denoting the leaf densities,
$$h_i = \frac{g_i \sum_j g_{j|i} \sum_k g_{k|ij}\, P_{ijk}(y\,|\,x)}{\sum_i g_i \sum_j g_{j|i} \sum_k g_{k|ij}\, P_{ijk}(y\,|\,x)}$$
$$h_{j|i} = \frac{g_{j|i} \sum_k g_{k|ij}\, P_{ijk}(y\,|\,x)}{\sum_j g_{j|i} \sum_k g_{k|ij}\, P_{ijk}(y\,|\,x)}$$
Posterior propagation
[Figure: the tree with posterior probabilities $h_i$, $h_{j|i}$, $h_{k|ij}$ propagated along the branches down to the leaves $ijk$.]
The E step

compute the posterior probabilities (the "up-down" algorithm)

The M step

The Q function decouples into a set of separate maximum likelihood problems

At the nonterminals, fit multinomial logit models, with the posteriors $h_i^{(t)}$, $h_{j|i}^{(t)}$, etc., serving as the targets

At the leaves, obtain weighted likelihoods, where the weights are the product of the posteriors from root to leaf
$$\theta_i^{(t+1)} = \arg\max_{\theta_i} \sum_p \sum_i h_i^{(p)} \log g_i^{(p)}$$
(a cross-entropy cost)

$$\theta_{ij}^{(t+1)} = \arg\max_{\theta_{ij}} \sum_p \sum_i h_i^{(p)} \sum_j h_{j|i}^{(p)} \log g_{j|i}^{(p)}$$
[Figure: the "up-down" pass; the prior probabilities $g_i$, $g_{j|i}$, $g_{k|ij}$ are computed down the tree and the posteriors $h_i$, $h_{j|i}$, $h_{k|ij}$ are formed on the way back up to the leaves $ijk$.]
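A compact sketch of the E step for a two-level tree is given below. It is illustrative only: Gaussian leaf densities and softmax gates are assumed, and the parameter names (V, V_sub, beta) are placeholders.

```python
import numpy as np

# Illustrative E step ("up-down" pass) for a two-level tree with Gaussian
# leaf densities; returns the posteriors h_i and h_{j|i} for one data point.

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def e_step(x, y, V, V_sub, beta):
    """V: (n, d) top-level gate; V_sub: (n, m, d) second-level gates;
    beta: (n, m, d) leaf regression weights."""
    g = softmax(V @ x)                                  # g_i
    g_sub = np.array([softmax(Vi @ x) for Vi in V_sub]) # g_{j|i}
    mu = beta @ x                                       # (n, m) leaf means
    lik = np.exp(-0.5 * (y - mu) ** 2)                  # leaf densities P_ij(y|x)
    branch = g_sub * lik                                # g_{j|i} P_ij(y|x)
    joint = g * branch.sum(axis=1)                      # numerator of h_i
    h = joint / joint.sum()                             # posterior over top branches
    h_sub = branch / branch.sum(axis=1, keepdims=True)  # posterior h_{j|i}
    return h, h_sub

# Tiny usage with random placeholder parameters.
rng = np.random.default_rng(3)
n, m, d = 2, 2, 3
x, y = rng.normal(size=d), 0.7
h, h_sub = e_step(x, y, rng.normal(size=(n, d)),
                  rng.normal(size=(n, m, d)), rng.normal(size=(n, m, d)))
print(h, h_sub, sep="\n")
```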
Model selection
How do we choose the structure of the tree?
Bayesian issues
Dirichlet priors
Gibbs sampling is straightforward

Gaussian approximation of the posterior via the SEM calculation of the Hessian

Mean-field approximation of the posterior
Batch algorithms
[Figure: relative error (0.0 to 1.2) versus training epochs (log scale, 10 to 1000) for backpropagation and HME (Algorithm 2).]
Summary: batch algorithms

Architecture        Relative Error   # Epochs
linear              .31              NA
backprop            .09              5,500
HME (Algorithm 1)   .10              35
HME (Algorithm 2)   .12              39
CART                .17              NA
CART (linear)       .13              NA
MARS                .16              NA
On-line algorithm

For the expert (leaf) networks, the parameters are updated by a recursive least squares rule:
$$\beta_{ij}^{(t)} = \beta_{ij}^{(t-1)} + h_{ij}^{(t)}\, R_{ij}^{(t)}\, x^{(t)} \left(y^{(t)} - (\beta_{ij}^{(t-1)})^T x^{(t)}\right)$$
$$R_{ij}^{(t)} = \frac{1}{\lambda}\left[R_{ij}^{(t-1)} - \frac{R_{ij}^{(t-1)}\, x^{(t)}\, x^{(t)T}\, R_{ij}^{(t-1)}}{\lambda / h_{ij}^{(t)} + x^{(t)T} R_{ij}^{(t-1)} x^{(t)}}\right]$$
where $h_{ij}^{(t)}$ is the product of the posteriors along the path to the leaf and $\lambda$ is a decay parameter.

The gating network parameters are updated by analogous recursive least squares steps, with inverse covariance matrices $S_i$ and $S_{ij}$:
$$S_i^{(t)} = \frac{1}{\lambda}\left[S_i^{(t-1)} - \frac{S_i^{(t-1)}\, x^{(t)}\, x^{(t)T}\, S_i^{(t-1)}}{\lambda + x^{(t)T} S_i^{(t-1)} x^{(t)}}\right]$$
$$S_{ij}^{(t)} = \frac{1}{\lambda}\left[S_{ij}^{(t-1)} - \frac{S_{ij}^{(t-1)}\, x^{(t)}\, x^{(t)T}\, S_{ij}^{(t-1)}}{\lambda / h_i^{(t)} + x^{(t)T} S_{ij}^{(t-1)} x^{(t)}}\right]$$
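One step of the weighted recursive least squares update of the kind used above can be sketched as follows; the function name, the calling convention, and the toy data stream are illustrative assumptions.

```python
import numpy as np

# Sketch of a single weighted recursive least squares update for a linear
# expert (decay parameter lam, posterior weight h).

def rls_update(beta, R, x, y, h, lam=0.99):
    """One weighted RLS step; R is the current inverse covariance matrix."""
    denom = lam / h + x @ R @ x
    R_new = (R - np.outer(R @ x, R @ x) / denom) / lam
    beta_new = beta + h * R_new @ x * (y - beta @ x)
    return beta_new, R_new

# Toy stream: the expert tracks y = w.x with all posterior weights set to 1.
rng = np.random.default_rng(4)
d = 3
w_true = rng.normal(size=d)
beta, R = np.zeros(d), 10.0 * np.eye(d)
for _ in range(500):
    x = rng.normal(size=d)
    y = w_true @ x + 0.05 * rng.normal()
    beta, R = rls_update(beta, R, x, y, h=1.0)
print("error:", np.linalg.norm(beta - w_true))
```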
Classification

Task       Baseline   CART   HME   Bayes
Heart      .44        .22    .18   .18
Pima       .35        .26    .22   .21
Orbitals   .48        .29    .23   .21
Convergence results
(Jordan & Xu, 1994)
For a mixture with $K$ experts, write $P = \mathrm{diag}\left[g_1 P_1, \ldots, g_K P_K\right]$, let $H(\Theta) = \frac{\partial^2 l(\Theta)}{\partial \Theta\, \partial \Theta^T}$ denote the Hessian of the log likelihood, and let $m$ and $M$ characterize its conditioning. Then:

(1) The likelihood gap contracts geometrically,
$$l(\Theta^*) - l(\Theta^{(k)}) \le r^k \left[l(\Theta^*) - l(\Theta^{(0)})\right],$$
and the parameter error satisfies
$$\left\|\Theta^{(k)} - \Theta^*\right\| \le r^{k/2}\sqrt{\frac{2\left[l(\Theta^*) - l(\Theta^{(0)})\right]}{m}},$$
where $0 < |r| < 1$ when $M < 2$.

(2) For any initial point $\Theta^{(0)} \in D$, $\lim_{k \to \infty} \Theta^{(k)} = \Theta^*$ when $M < 2$.
p
y(x) = N (w x)#(v x) + N (w x)#(;v x)
Consider a structurally identically teacher with
weight vectors w w v .
p
Order parameters
The analysis is carried out in terms of the normalized overlaps ($\frac{1}{N}$ times the inner products) between the student weight vectors $(v, w_1, w_2)$ and the teacher weight vectors $(\tilde{v}, \tilde{w}_1, \tilde{w}_2)$, collected into overlap matrices with entries denoted $R_v$, $R_1$, $R_2$, $C_1$, $C_2$, $X_1$, $X_2$, $Y_1$, $Y_2$.
Loss

$$\epsilon(v, w_1, w_2; x) = \frac{1}{2}\left[\tilde{y}(x) - \frac{1}{\sqrt{N}}\, w_1 \cdot x\right]^2 \Theta(v \cdot x) + \frac{1}{2}\left[\tilde{y}(x) - \frac{1}{\sqrt{N}}\, w_2 \cdot x\right]^2 \Theta(-v \cdot x)$$

Training energy:
$$E = \sum_{p=1}^{P} \epsilon\!\left(v, w_1, w_2; x^{(p)}\right)$$
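The expected loss over a Gaussian input distribution can also be estimated by simple Monte Carlo, which is a useful check on closed-form results. The sketch below is illustrative: the teacher and student weights are random placeholders, and only the loss definition above is taken from the slides.

```python
import numpy as np

# Hedged sketch: Monte Carlo estimate of E[epsilon] for the gated two-expert
# model, with a teacher of the same structural form providing the targets.

def student_loss(x, v, w1, w2, y_target):
    """epsilon(v, w1, w2; x) with target y_target, per the definition above."""
    n = len(x)
    w = w1 if v @ x > 0 else w2
    return 0.5 * (y_target - w @ x / np.sqrt(n)) ** 2

def teacher_output(x, vt, w1t, w2t):
    n = len(x)
    w = w1t if vt @ x > 0 else w2t
    return w @ x / np.sqrt(n)

def mc_generalization_error(v, w1, w2, vt, w1t, w2t, n_samples=5000, seed=5):
    rng = np.random.default_rng(seed)
    N = len(v)
    total = 0.0
    for _ in range(n_samples):
        x = rng.normal(size=N)                     # Gaussian input distribution
        total += student_loss(x, v, w1, w2, teacher_output(x, vt, w1t, w2t))
    return total / n_samples

# Placeholder teacher and a slightly perturbed student.
rng = np.random.default_rng(6)
N = 50
vt, w1t, w2t = (rng.normal(size=N) for _ in range(3))
v, w1, w2 = vt + 0.3 * rng.normal(size=N), w1t.copy(), w2t.copy()
print("estimated generalization error:",
      mc_generalization_error(v, w1, w2, vt, w1t, w2t))
```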
Averaging over Gaussian inputs gives the generalization error in closed form in terms of the order parameters; it involves $\cos^{-1}(R_v)$, the sums $R_1 + R_2$ and $C_1 + C_2$, and a cross term $\frac{(X_1 - X_2)(Y_1 - Y_2)}{2\pi\sqrt{1 - R_v^2}}$ (cf. the perceptron).
A Histogram Tree
[Figure: histograms at the nodes of the tree, shown at epochs 0, 9, 19, and 29.]
A Deviance Tree
[Figure: the tree with deviance displayed at each node, nodes indexed $i$, $ij$, and leaves $ijk$.]
Conclusions
A probabilistic approach to decision tree modeling
  - ridge function splits
  - smooth regression functions
  - any GLIM can be used as a leaf model
EM algorithm (and SEM)
Bayesian methods
  - Gibbs sampling
  - mean-field methods