
Bayesian Learning in Probabilistic

Decision Trees
Michael I. Jordan
MIT
Collaborators
Robert Jacobs (Rochester)
Lei Xu (Hong Kong)
Geoffrey Hinton (Toronto)
Steven Nowlan (Synaptics)
Marina Meila (MIT)
Lawrence Saul (MIT)

Outline
decision trees
probabilistic decision trees
EM algorithm and extensions
model selection, Bayesian computations
empirical results
- system identification
- classification
theoretical results
- training set error
- test set error

Some problems with multi-layered neural networks


the learning algorithms are slow
hard to understand the network
hard to build in prior knowledge
poor performance on non-stationary data
not natural for some functions

Supervised learning (aka regression, classification)

We assume that the learner is provided with a
training set:

    X = {(x^(t), y^(t))}_{t=1}^{T}

where x is an input vector and y is an output
vector.

We will gauge performance on a test set:

    X_s = {(x^(t), y^(t))}_{t=1}^{T_s}

Decision trees
[Figure: a decision tree with splits such as x3 < 1.4, x1 < 0.5, and x7 < -2.1 at the
internal nodes and outputs y at the leaves]

drop the data set down the tree


at each node, try to find a split of the input
space (a half-plane) that yields the largest
gain in "purity" on left and right
build a large tree and prune backward to create a nested sequence of trees
pick the best tree from the sequence using
cross-validation

Regression trees
[Figure: a regression tree with splits x3 < 1.4, x1 < 0.5, x7 < -2.1 at the internal
nodes and linear models y = θ_i^T x at the leaves]

splitting is based on the residual sum of squares (RSS), as sketched below
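To make the RSS criterion concrete, here is a minimal sketch (an illustration, not part of the original slides; it uses constant rather than linear leaf predictions, and the function and variable names are my own) of scanning axis-aligned splits and scoring each by the total RSS of its two children:

import numpy as np

def best_rss_split(X, y):
    """Return (feature index, threshold, total RSS) of the best axis-aligned split."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left = X[:, j] <= t
            right = ~left
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, t, rss)
    return best

# Example usage with synthetic data; the split on the second coordinate should be recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] < 0.5, 1.0, -1.0) + 0.1 * rng.normal(size=200)
print(best_rss_split(X, y))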

Some advantages:
often much faster than neural networks
often more interpretable
allow operating points to be utilized
Some disadvantages:
non-smooth regression surface
coordinate dependent
batch methods

Probabilistic Decision Trees


(Hierarchical mixtures of experts, HME)
(Jordan & Jacobs, 1994)

Why probabilities?

smoother regression surface


error bars from likelihood/Bayesian theory
(e.g., SEM algorithm)
convergence results from likelihood/Bayesian
theory
can handle categorical variables and missing
data in principled ways
better performance (e.g., leverage issue)

Probabilistic Decision Trees


drop inputs down the tree and use probabilistic models for decisions
at leaves of trees use probabilistic models to
generate outputs from inputs
use a Bayes' rule recursion to compute posterior credit for nonterminals in the tree
The basic idea is to convert the decision tree into
a mixture model

[Figure: a two-level probabilistic decision tree; nonterminal decision nodes indexed
by i and ij, and leaf models with parameters θ_ij1, ..., θ_ijk, ..., θ_ijn]

Model the decisions in the decision tree using
categorical probability models

let ω_i, ω_ij, ω_ijk, ... represent multinomial decision
variables at the nonterminals

these variables will be treated as "missing"
data (cf. states of an HMM)

each path down the tree defines a component
of a mixture


Decision models at the nonterminals:

    P(ω_i | x, v)
    P(ω_ij | x, ω_i, v_i)
    P(ω_ijk | x, ω_i, ω_ij, v_ij)

Output models at the leaves:

    P(y | x, ω_i, ω_ij, ω_ijk, ..., θ_ijk)

The total probability of an output y given an
input x is given by the sum across all paths from
the root to the leaves:

    P(y | x, θ) = Σ_i P(ω_i | x, v) Σ_j P(ω_ij | x, ω_i, v_i)
                  Σ_k P(ω_ijk | x, ω_i, ω_ij, v_ij) ...
                  P(y | x, ω_i, ω_ij, ω_ijk, ..., θ_ijk)

This is a (conditional) mixture model.

Moments of this mixture distribution are readily
computed by tree traversal processes.

Define

    μ     ≡ E(y | x)
    μ_i   ≡ E(y | x, ω_i)
    μ_ij  ≡ E(y | x, ω_i, ω_ij)
    μ_ijk ≡ E(y | x, ω_i, ω_ij, ω_ijk, ...)

and define

    g_i      ≡ P(ω_i | x)
    g_{j|i}  ≡ P(ω_ij | x, ω_i)
    g_{k|ij} ≡ P(ω_ijk | x, ω_i, ω_ij)

(omitting the parameters for simplicity)

Then,

    μ     = Σ_i g_i μ_i
    μ_i   = Σ_j g_{j|i} μ_ij
    μ_ij  = Σ_k g_{k|ij} μ_ijk
    μ_ijk = f(θ_ijk^T x)
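The traversal can be sketched in a few lines of code (an illustration, not part of the original slides; it uses a two-level tree with scalar linear-Gaussian leaves, so leaves are indexed ij rather than ijk and f is the identity; array shapes and names are assumptions):

import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def hme_mean(x, V_top, V_low, Theta):
    """Predictive mean of a two-level HME at input x.

    V_top: (n, d)    top-level gating parameters v_i
    V_low: (n, m, d) lower-level gating parameters v_ij
    Theta: (n, m, d) leaf (expert) parameters theta_ij, identity link
    """
    g_i = softmax(V_top @ x)              # P(omega_i | x)
    mu = 0.0
    for i in range(V_top.shape[0]):
        g_ji = softmax(V_low[i] @ x)      # P(omega_ij | x, omega_i)
        mu_ij = Theta[i] @ x              # leaf means mu_ij = theta_ij^T x
        mu_i = g_ji @ mu_ij               # blend leaves:    mu_i = sum_j g_{j|i} mu_ij
        mu += g_i[i] * mu_i               # blend branches:  mu   = sum_i g_i mu_i
    return mu

# Example usage with random parameters:
rng = np.random.default_rng(0)
d, n, m = 5, 2, 2
x = rng.normal(size=d)
print(hme_mean(x, rng.normal(size=(n, d)), rng.normal(size=(n, m, d)), rng.normal(size=(n, m, d))))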

[Figure: the gating probabilities g_i, g_{j|i}, g_{k|ij} are computed on a downward
pass through the tree and the leaf means μ_ijk are blended on an upward pass]

Component Models
Decision models

P(ω_i | x, v) is a classification model

any parametric classification model is appropriate;
we use a multinomial logit model

this yields "soft" linear discriminants, a soft
version of a CART/C4.5 tree

Leaf models

we use simple generalized linear models

Regression: linear regression
Binary classification: logistic regression
Multiway classification: multinomial logit model
(can also handle count estimates, failure estimates, etc.)

Multinomial logit model


the deterministic component:

    g_i = e^{η_i} / Σ_j e^{η_j}

where η_i = v_i^T x

soft linear discriminants

- the directions of the v_i determine the orientations
  of the discriminant surfaces (i.e., splits)
- the magnitudes of the v_i determine the
  sharpness of the splits

the probabilistic component:

    P(y | x, v) = g_1^{y_1} g_2^{y_2} ... g_n^{y_n}

where y_i ∈ {0, 1} and Σ_i y_i = 1.

the log likelihood:

    l(v; X) = Σ_p Σ_i y_i^(p) log g_i^(p)

which is the cross-entropy function.

the gradient:

    ∂l/∂v_i = Σ_p (y_i^(p) - g_i^(p)) x^(p)

Computing the Hessian and substituting into the
Newton-Raphson formula yields a simple, quadratically
convergent iterative algorithm known as IRLS
(Iteratively Reweighted Least Squares).
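A minimal sketch of such a Newton/IRLS fit (an illustration, not the talk's code; for clarity it specializes to the two-class case, where the multinomial logit reduces to logistic regression, and the function name and damping constant are assumptions):

import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Two-class multinomial logit (logistic regression) fit by IRLS.

    X: (P, d) inputs, y: (P,) targets in {0, 1}. Returns the weight vector v.
    """
    P, d = X.shape
    v = np.zeros(d)
    for _ in range(n_iter):
        g = 1.0 / (1.0 + np.exp(-X @ v))       # g^(p) = e^eta / (1 + e^eta)
        grad = X.T @ (y - g)                    # gradient: sum_p (y^(p) - g^(p)) x^(p)
        W = g * (1.0 - g)                       # IRLS weights; negative Hessian is X^T diag(W) X
        H = X.T @ (W[:, None] * X) + 1e-8 * np.eye(d)
        v = v + np.linalg.solve(H, grad)        # Newton-Raphson step
    return v

# Example usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=500) > 0).astype(float)
print(irls_logistic(X, y))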

The Log Likelihood

    l(θ; X) = Σ_p log [ Σ_i g_i^(p) Σ_j g_{j|i}^(p) Σ_k g_{k|ij}^(p) ... P_ijk(y^(p) | x^(p)) ]

Problem: The log is outside of the sums.

How can we optimize such a risk function efficiently?
Solution: EM

The EM (Expectation-Maximization) Algorithm


(Baum et al., 1971; Dempster, Laird, & Rubin, 1977)
Special cases:

mixture likelihood clustering (soft K-means)


many missing data algorithms
Baum-Welch algorithm for HMM's
Applications to supervised learning (regression,
classification)?

EM Tutorial

Suppose that the problem of maximizing a
likelihood would be simplified if the values
of some additional variables, called "missing
variables", were known
These values are not known, but given the
current values of the parameters, they can
be estimated (the E step).
Treat the estimated values as provisionally
correct and maximize the likelihood in the
usual way (the M step).
We now have better parameter values, so the
E step can be repeated. Iterate.

EM Tutorial (cont.)

"missing" data:  Z
"complete" data: Y = {X, Z}

"complete" likelihood: l_c(θ; Y)

The complete likelihood is a random variable, so
average out the randomness:

E step:

    Q(θ, θ^(t)) = E[ l_c(θ; Y) | X, θ^(t) ]

This yields a fixed function Q, which can be optimized:

M step:

    θ^(t+1) = arg max_θ Q(θ, θ^(t)).

Applying EM to the HME architecture

The missing data are the unknown values of
the decisions in the decision tree.

Define indicator variables z_i, z_{j|i}, z_{k|ij}, ...

Complete likelihood:

    l_c(θ; Y) = Σ_p Σ_i z_i^(p) Σ_j z_{j|i}^(p) ... log[ g_i^(p) g_{j|i}^(p) ... P_ijk(y^(p) | x^(p)) ]

Incomplete likelihood:

    l(θ; X) = Σ_p log [ Σ_i g_i^(p) Σ_j g_{j|i}^(p) ... P_ijk(y^(p) | x^(p)) ]

We need to compute the expected values of the
missing indicator variables.

Note that, e.g.,

    E(z_i^(p) | x^(p), y^(p)) = P(ω_i^(p) | x^(p), y^(p))

Example

one-level tree
at each leaf, linear regression with Gaussian
errors

For the ith leaf and the tth data point:

    h_i^(t) = g_i^(t) exp( -(1/2) ||y^(t) - μ_i^(t)||² )
              / Σ_j g_j^(t) exp( -(1/2) ||y^(t) - μ_j^(t)||² )

where μ_i^(t) = θ_i^T x^(t).

This posterior is a normalized distance measure
that reflects the relative magnitudes of the
residuals y^(t) - μ_i^(t).
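The E step for this one-level example is a few lines of vectorized code (an illustrative sketch, not the talk's code; it assumes unit-variance Gaussian experts, and the function names and shapes are my own):

import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def posteriors(X, Y, V, Theta):
    """E step for a one-level mixture of linear experts with unit-variance Gaussian noise.

    X: (T, d) inputs, Y: (T,) outputs, V: (n, d) gating parameters,
    Theta: (n, d) expert parameters. Returns h: (T, n) with h[t, i] = h_i^(t).
    """
    G = softmax_rows(X @ V.T)                  # prior gate probabilities g_i^(t)
    Mu = X @ Theta.T                           # expert means mu_i^(t) = theta_i^T x^(t)
    log_lik = -0.5 * (Y[:, None] - Mu) ** 2    # Gaussian log likelihood up to a constant
    log_h = np.log(G + 1e-12) + log_lik
    log_h -= log_h.max(axis=1, keepdims=True)  # Bayes' rule, normalized in log space
    H = np.exp(log_h)
    return H / H.sum(axis=1, keepdims=True)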

Posterior probabilities

    h_i      ≡ P(ω_i | x, y)
    h_{j|i}  ≡ P(ω_ij | x, y, ω_i)
    h_{k|ij} ≡ P(ω_ijk | x, y, ω_i, ω_ij)

(cf. the prior probabilities)

    g_i      ≡ P(ω_i | x)
    g_{j|i}  ≡ P(ω_ij | x, ω_i)
    g_{k|ij} ≡ P(ω_ijk | x, ω_i, ω_ij)

Bayes' rule yields:

    h_i = g_i Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)
          / Σ_i g_i Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)

    h_{j|i} = g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)
              / Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)

    h_{k|ij} = g_{k|ij} P_ijk(y|x)
               / Σ_k g_{k|ij} P_ijk(y|x)



Posterior propagation

[Figure: the posteriors h_i, h_{j|i}, h_{k|ij} are computed recursively, propagating
from the leaves up to the root of the tree]

The E step

compute the posterior probabilities (the "up-down"
algorithm)

The M step

The Q function decouples into a set of separate
maximum likelihood problems

At the nonterminals, fit multinomial logit models,
with the posteriors h_i^(t), h_{j|i}^(t), etc., serving
as the targets

At the leaves, obtain weighted likelihoods where
the weights are the product of the posteriors
from root to leaf

The M step (in more detail)

The maximization of Q(θ, θ^(t)) decouples into
a set of weighted MLE problems:

    v_i^(t+1) = arg max_{v_i} Σ_p Σ_i h_i^(p) log g_i^(p)

    (a cross-entropy cost)

    v_ij^(t+1) = arg max_{v_ij} Σ_p Σ_i h_i^(p) Σ_j h_{j|i}^(p) log g_{j|i}^(p)

    (a weighted cross-entropy cost)

    θ_ijk^(t+1) = arg max_{θ_ijk} Σ_p Σ_i h_i^(p) Σ_j h_{j|i}^(p) ... log P_ijk(y^(p) | x^(p))

    (a general weighted log likelihood)

Each of these is a weighted ML problem for
generalized linear models (GLIM's). They can
be solved efficiently using iteratively reweighted
least squares (IRLS).
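For the one-level example above, the decoupled M step can be sketched as follows (an illustration, not the talk's code; the expert update is an exact weighted least-squares solve, and the gate update takes a few gradient steps on the weighted cross-entropy rather than full IRLS; names and step sizes are assumptions):

import numpy as np

def m_step(X, Y, H, V, n_gate_steps=50, lr=0.1):
    """M step for a one-level mixture of linear experts.

    X: (T, d), Y: (T,), H: (T, n) posteriors from the E step, V: (n, d) current gates.
    Returns updated (V, Theta).
    """
    T, d = X.shape
    n = H.shape[1]

    # Experts: one weighted least-squares problem per leaf,
    # arg max_theta sum_t h_i^(t) log P_i(y^(t) | x^(t)).
    Theta = np.zeros((n, d))
    for i in range(n):
        W = H[:, i]
        A = X.T @ (W[:, None] * X) + 1e-8 * np.eye(d)
        b = X.T @ (W * Y)
        Theta[i] = np.linalg.solve(A, b)

    # Gate: maximize sum_t sum_i h_i^(t) log g_i^(t) by gradient ascent.
    for _ in range(n_gate_steps):
        Z = X @ V.T
        Z -= Z.max(axis=1, keepdims=True)
        G = np.exp(Z)
        G /= G.sum(axis=1, keepdims=True)
        V += lr * (H - G).T @ X / T        # gradient of the weighted cross-entropy
    return V, Theta

Alternating this M step with the E step sketched earlier gives the batch fitting procedure summarized on the next slide.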

HME Parameter Estimation


[Figure: the architecture annotated with the gate probabilities g_i, g_{j|i}, g_{k|ij}
and the posterior probabilities h_i, h_{j|i}, h_{k|ij} used during parameter estimation]

drop the data set down the tree


for each data point, compute the posterior
probabilities for every branch of the tree
at each nonterminal, use the posterior probabilities as (soft) classification targets
at each leaf, fit a local model, where each
data point is weighted by the product of the
posterior probabilities from the root to that
leaf

Model selection
How do we choose the structure of the tree?

initialize with CART or C4.5 (cf. K-means)

- can preserve local variable selection

ridge regression

cross-validation stopping within a fixed deep
hierarchy (EM iterations "grow" the effective
degrees of freedom)

Bayesian issues
Dirichlet priors
Gibbs sampling is straightforward
Gaussian approximation of the posterior via an SEM
calculation of the Hessian
Mean-field approximation of the posterior

Regression: A System Identification Problem


Forward dynamics of a four-joint, three-dimensional
arm
Twelve input variables, four output variables
15,000 points in the training set
5,000 points in the test set
Four-level tree, with binary branches
Compare to backpropagation in an MLP, with
60 hidden units
Compare to CART, MARS

Batch algorithms

[Figure: relative error versus number of training epochs (log scale, 10 to 1000) for
backpropagation and HME (Algorithm 2)]

Summary: batch algorithms

    Architecture        Relative Error   # Epochs
    linear              .31              NA
    backprop            .09              5,500
    HME (Algorithm 1)   .10              35
    HME (Algorithm 2)   .12              39
    CART                .17              NA
    CART (linear)       .13              NA
    MARS                .16              NA

An On-Line Variant of HME

Use techniques from recursive estimation theory
(Ljung & Soderstrom, 1986) to obtain the following
on-line algorithm:

Expert networks:

    U_ij^(t+1) = U_ij^(t) + h_i^(t) h_{j|i}^(t) (y^(t) - μ_ij^(t)) x^(t)T R_ij^(t)

where R_ij is updated as follows:

    R_ij^(t) = λ^{-1} [ R_ij^(t-1) - R_ij^(t-1) x^(t) x^(t)T R_ij^(t-1)
                        / (λ [h_ij^(t)]^{-1} + x^(t)T R_ij^(t-1) x^(t)) ]

where h_ij^(t) = h_i^(t) h_{j|i}^(t), and λ is a decay parameter.

Top-level gating networks:

    v_i^(t+1) = v_i^(t) + S_i^(t) (ln h_i^(t) - ξ_i^(t)) x^(t)

    S_i^(t) = λ^{-1} [ S_i^(t-1) - S_i^(t-1) x^(t) x^(t)T S_i^(t-1)
                       / (λ + x^(t)T S_i^(t-1) x^(t)) ]

Lower-level gating networks:

    v_ij^(t+1) = v_ij^(t) + S_ij^(t) h_i^(t) (ln h_{j|i}^(t) - ξ_ij^(t)) x^(t)

    S_ij^(t) = λ^{-1} [ S_ij^(t-1) - S_ij^(t-1) x^(t) x^(t)T S_ij^(t-1)
                        / (λ [h_i^(t)]^{-1} + x^(t)T S_ij^(t-1) x^(t)) ]

(here ξ_i^(t) = v_i^(t)T x^(t) and ξ_ij^(t) = v_ij^(t)T x^(t) are the gating networks'
linear predictors)
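The expert-network part of this recursive update can be sketched as follows for a scalar-output expert (an illustrative sketch, not the talk's code; the function name, the value of λ, and the initialization R = I/δ are assumptions). The gating updates follow the same matrix-inversion-lemma pattern:

import numpy as np

def online_expert_update(u, R, x, y, h_joint, lam=0.99):
    """One recursive-least-squares style update for a single expert.

    u: (d,) expert weights, R: (d, d) inverse-covariance-like matrix,
    x: (d,) input, y: scalar target, h_joint: posterior weight h_i * h_{j|i}.
    """
    # Update R with the matrix inversion lemma, weighted by the posterior.
    Rx = R @ x
    denom = lam / max(h_joint, 1e-12) + x @ Rx
    R = (R - np.outer(Rx, Rx) / denom) / lam
    # Posterior-weighted correction of the expert weights using the updated R.
    mu = u @ x
    u = u + h_joint * (y - mu) * (R @ x)
    return u, R

# Example usage: start from R = I / delta with a small delta.
d = 4
u, R = np.zeros(d), np.eye(d) / 1e-3
rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=d)
    y = x @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal()
    u, R = online_expert_update(u, R, x, y, h_joint=1.0)
print(u)   # should approach the true weights when h_joint = 1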

Classification

    Task      Baseline   CART   HME   Bayes
    Heart     .44        .22    .18   .18
    Pima      .35        .26    .22   .21
    Orbitals  .48        .29    .23   .21

(Error rates are computed using 10-fold cross-validation)

Convergence results
(Jordan & Xu, 1994)

Theorem 1  Assume that the training set X is
generated by the mixture model (the "realizable" case).

Let us denote

    P = diag[P_1, ..., P_K]

    H(θ*) = ∂²l(θ*) / ∂θ ∂θ^T

where the P_i are covariance matrices of the component
models.

Then, with probability one:

(1) Letting -M, -m (here M > m > 0) be the minimum and
maximum eigenvalues of the negative definite matrix
(P^{1/2})^T H(θ*) (P^{1/2}), we have

    l(θ*) - l(θ^(k)) ≤ r^k [l(θ*) - l(θ^(0))]

    ||P^{-1/2} (θ^(k) - θ*)|| ≤ |r|^{k/2} sqrt( 2 [l(θ*) - l(θ^(0))] / m )

where the rate r, determined by m and M, satisfies
0 < |r| < 1 when M < 2.

(2) For any initial point θ^(0) in D, lim_{k→∞} θ^(k) = θ*
when M < 2.

Test Set Error

(Saul & Jordan, 1995)

Hard split model

    y(x) = (1/√N) (w_1 · x) Θ(v · x) + (1/√N) (w_2 · x) Θ(-v · x)

where Θ(·) is the step function.

Consider a structurally identical teacher with
weight vectors w̃_1, w̃_2, ṽ; its output is denoted ỹ(x).

Order parameters

    R = (1/N) | v·ṽ      v·w̃_1     v·w̃_2   |     | R_v  X_1  X_2 |
              | w_1·ṽ    w_1·w̃_1   w_1·w̃_2 |  =  | Y_1  R_1  C_1 |
              | w_2·ṽ    w_2·w̃_1   w_2·w̃_2 |     | Y_2  C_2  R_2 |

Loss

    ε(v, w_1, w_2; x) = (1/2) { [ỹ(x) - w_1·x]² Θ(v·x) + [ỹ(x) - w_2·x]² Θ(-v·x) }

Empirical risk (training energy)

    E = Σ_{p=1}^{P} ε(v, w_1, w_2; x^(p))
Test set error (under a Gibbs distribution)

    ε_g(R) = 1 - [1 - cos⁻¹(R_v)/π] (R_1 + R_2)/2
               - [cos⁻¹(R_v)/π] (C_1 + C_2)/2
               - (X_1 - X_2)(Y_1 - Y_2) / (2π sqrt(1 - R_v²))

High temperature limit

β → 0 (where β = 1/T in the Gibbs distribution)

α → ∞ (α = P/N)

α̃ remains finite (α̃ = βα, a signal-to-noise ratio)

Results

    ε_g ~ 2/α̃   (cf. the perceptron)

A continuous phase transition at

    α̃_c = π sqrt(1 + π²/8) ≈ 4.695
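As a quick arithmetic check of the quoted critical value (an illustrative check, not part of the original slides):

import numpy as np
# alpha_tilde_c = pi * sqrt(1 + pi^2 / 8)
print(np.pi * np.sqrt(1.0 + np.pi**2 / 8.0))   # ~ 4.695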

[Figure: three panels of curves plotted on a 0-1 vertical scale against a horizontal
axis running from 0 to 15, illustrating the behavior around the phase transition]

A Histogram Tree
[Figure: histograms at the nodes of the tree at epochs 0, 9, 19, and 29]

A Deviance Tree

Hidden Markov Decision Trees


[Figure: a hidden Markov decision tree, with input x and the tree's decision nodes
coupled across successive time steps]

Each decision at a node is dependent on the
decision at the previous moment at that node
- This yields a Markov model at each node

An EM algorithm can be derived, treating
the Markov states as hidden variables
- It combines a forward-backward pass with
  an up-down pass

Conclusions
A probabilistic approach to decision tree modeling
- ridge function splits
- smooth regression functions
- any GLIM can be used as a leaf model
EM algorithm (and SEM)
Bayesian methods
- Gibbs sampling
- mean-field methods
