
Bayesian Learning in Probabilistic

Decision Trees
Michael I. Jordan
MIT
Collaborators
Robert Jacobs (Rochester)
Lei Xu (Hong Kong)
Geoffrey Hinton (Toronto)
Steven Nowlan (Synaptics)
Marina Meila (MIT)
Lawrence Saul (MIT)

Outline
decision trees
probabilistic decision trees
EM algorithm and extensions
model selection, Bayesian computations
empirical results
- system identification
- classification
theoretical results
- training set error
- test set error

Some problems with multi-layered neural networks


the learning algorithms are slow
hard to understand the network
hard to build in prior knowledge
poor performance on non-stationary data
not natural for some functions

Supervised learning (aka regression, classification)

We assume that the learner is provided with a
training set:

    X = {(x^(t), y^(t))}_{t=1}^{T}

where x is an input vector and y is an output
vector.

We will gauge performance on a test set:

    X_s = {(x^(t), y^(t))}_{t=1}^{T_s}

Decision trees
[Figure: a decision tree with splits such as x3 < 1.4, x1 < 0.5, and x7 < -2.1 at the
internal nodes and outputs y at the leaves]

drop the data set down the tree


at each node, try to find a split of the input
space (a half-plane) that yields the largest
gain in "purity" on left and right
build a large tree and prune backward to create a nested sequence of trees
pick the best tree from the sequence using
cross-validation

Regression trees
[Figure: a regression tree with splits x3 < 1.4, x1 < 0.5, x7 < -2.1 at the internal
nodes and linear models y = θ_i^T x at the leaves]

splitting is based on the residual sum of squares (RSS), as sketched below
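To make the RSS criterion concrete, here is a minimal sketch (an illustration, not part of the original slides; it uses constant rather than linear leaf predictions, and the function and variable names are my own) of scanning axis-aligned splits and scoring each by the total RSS of its two children:

import numpy as np

def best_rss_split(X, y):
    """Return (feature index, threshold, total RSS) of the best axis-aligned split."""
    n, d = X.shape
    best = (None, None, np.inf)
    for j in range(d):
        for t in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left = X[:, j] <= t
            right = ~left
            rss = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, t, rss)
    return best

# Example usage with synthetic data; the split on the second coordinate should be recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 1] < 0.5, 1.0, -1.0) + 0.1 * rng.normal(size=200)
print(best_rss_split(X, y))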

Some advantages:
often much faster than neural networks
often more interpretable
allow operating points to be utilized
Some disadvantages:
non-smooth regression surface
coordinate dependent
batch methods

Probabilistic Decision Trees


(Hierarchical mixtures of experts, HME)
(Jordan & Jacobs, 1994)

Why probabilities?

smoother regression surface


error bars from likelihood/Bayesian theory
(e.g., SEM algorithm)
convergence results from likelihood/Bayesian
theory
can handle categorical variables and missing
data in principled ways
better performance (e.g., leverage issue)

Probabilistic Decision Trees


drop inputs down the tree and use probabilistic models for decisions
at leaves of trees use probabilistic models to
generate outputs from inputs
use a Bayes' rule recursion to compute posterior credit for nonterminals in the tree
The basic idea is to convert the decision tree into
a mixture model

[Figure: a two-level probabilistic decision tree; nonterminal decision nodes indexed
by i and ij, and leaf models with parameters θ_ij1, ..., θ_ijk, ..., θ_ijn]

Model the decisions in the decision tree using
categorical probability models

let ω_i, ω_ij, ω_ijk, ... represent multinomial decision
variables at the nonterminals

these variables will be treated as "missing"
data (cf. states of an HMM)

each path down the tree defines a component
of a mixture


Decision models at the nonterminals:

    P(ω_i | x, v)
    P(ω_ij | x, ω_i, v_i)
    P(ω_ijk | x, ω_i, ω_ij, v_ij)

Output models at the leaves:

    P(y | x, ω_i, ω_ij, ω_ijk, ..., θ_ijk)

The total probability of an output y given an
input x is given by the sum across all paths from
the root to the leaves:

    P(y | x, θ) = Σ_i P(ω_i | x, v) Σ_j P(ω_ij | x, ω_i, v_i)
                  Σ_k P(ω_ijk | x, ω_i, ω_ij, v_ij) ...
                  P(y | x, ω_i, ω_ij, ω_ijk, ..., θ_ijk)

This is a (conditional) mixture model.

Moments of this mixture distribution are readily
computed by tree traversal processes.

Define

    μ     ≡ E(y | x)
    μ_i   ≡ E(y | x, ω_i)
    μ_ij  ≡ E(y | x, ω_i, ω_ij)
    μ_ijk ≡ E(y | x, ω_i, ω_ij, ω_ijk, ...)

and define

    g_i      ≡ P(ω_i | x)
    g_{j|i}  ≡ P(ω_ij | x, ω_i)
    g_{k|ij} ≡ P(ω_ijk | x, ω_i, ω_ij)

(omitting the parameters for simplicity)

Then,

    μ     = Σ_i g_i μ_i
    μ_i   = Σ_j g_{j|i} μ_ij
    μ_ij  = Σ_k g_{k|ij} μ_ijk
    μ_ijk = f(θ_ijk^T x)
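The traversal can be sketched in a few lines of code (an illustration, not part of the original slides; it uses a two-level tree with scalar linear-Gaussian leaves, so leaves are indexed ij rather than ijk and f is the identity; array shapes and names are assumptions):

import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def hme_mean(x, V_top, V_low, Theta):
    """Predictive mean of a two-level HME at input x.

    V_top: (n, d)    top-level gating parameters v_i
    V_low: (n, m, d) lower-level gating parameters v_ij
    Theta: (n, m, d) leaf (expert) parameters theta_ij, identity link
    """
    g_i = softmax(V_top @ x)              # P(omega_i | x)
    mu = 0.0
    for i in range(V_top.shape[0]):
        g_ji = softmax(V_low[i] @ x)      # P(omega_ij | x, omega_i)
        mu_ij = Theta[i] @ x              # leaf means mu_ij = theta_ij^T x
        mu_i = g_ji @ mu_ij               # blend leaves:    mu_i = sum_j g_{j|i} mu_ij
        mu += g_i[i] * mu_i               # blend branches:  mu   = sum_i g_i mu_i
    return mu

# Example usage with random parameters:
rng = np.random.default_rng(0)
d, n, m = 5, 2, 2
x = rng.normal(size=d)
print(hme_mean(x, rng.normal(size=(n, d)), rng.normal(size=(n, m, d)), rng.normal(size=(n, m, d))))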

[Figure: the gating probabilities g_i, g_{j|i}, g_{k|ij} are computed on a downward
pass through the tree and the leaf means μ_ijk are blended on an upward pass]

Component Models
Decision models

P(ω_i | x, v) is a classification model

any parametric classification model is appropriate;
we use a multinomial logit model

this yields "soft" linear discriminants, a soft
version of a CART/C4.5 tree

Leaf models

we use simple generalized linear models

Regression: linear regression
Binary classification: logistic regression
Multiway classification: multinomial logit model
(can also handle count estimates, failure estimates, etc.)

Multinomial logit model


the deterministic component:

    g_i = e^{η_i} / Σ_j e^{η_j}

where η_i = v_i^T x

soft linear discriminants

- the directions of the v_i determine the orientations
  of the discriminant surfaces (i.e., splits)
- the magnitudes of the v_i determine the
  sharpness of the splits

the probabilistic component:

    P(y | x, v) = g_1^{y_1} g_2^{y_2} ... g_n^{y_n}

where y_i ∈ {0, 1} and Σ_i y_i = 1.

the log likelihood:

    l(v; X) = Σ_p Σ_i y_i^(p) log g_i^(p)

which is the cross-entropy function.

the gradient:

    ∂l/∂v_i = Σ_p (y_i^(p) - g_i^(p)) x^(p)

Computing the Hessian and substituting into the
Newton-Raphson formula yields a simple, quadratically
convergent iterative algorithm known as IRLS
(Iteratively Reweighted Least Squares).
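A minimal sketch of such a Newton/IRLS fit (an illustration, not the talk's code; for clarity it specializes to the two-class case, where the multinomial logit reduces to logistic regression, and the function name and damping constant are assumptions):

import numpy as np

def irls_logistic(X, y, n_iter=20):
    """Two-class multinomial logit (logistic regression) fit by IRLS.

    X: (P, d) inputs, y: (P,) targets in {0, 1}. Returns the weight vector v.
    """
    P, d = X.shape
    v = np.zeros(d)
    for _ in range(n_iter):
        g = 1.0 / (1.0 + np.exp(-X @ v))       # g^(p) = e^eta / (1 + e^eta)
        grad = X.T @ (y - g)                    # gradient: sum_p (y^(p) - g^(p)) x^(p)
        W = g * (1.0 - g)                       # IRLS weights; negative Hessian is X^T diag(W) X
        H = X.T @ (W[:, None] * X) + 1e-8 * np.eye(d)
        v = v + np.linalg.solve(H, grad)        # Newton-Raphson step
    return v

# Example usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=500) > 0).astype(float)
print(irls_logistic(X, y))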

The Log Likelihood

    l(θ; X) = Σ_p log [ Σ_i g_i^(p) Σ_j g_{j|i}^(p) Σ_k g_{k|ij}^(p) ... P_ijk(y^(p) | x^(p)) ]

Problem: The log is outside of the sums.

How can we optimize such a risk function efficiently?
Solution: EM

The EM (Expectation-Maximization) Algorithm


(Baum et al., 1971; Dempster, Laird, & Rubin, 1977)
Special cases:

mixture likelihood clustering (soft K-means)


many missing data algorithms
Baum-Welch algorithm for HMM's
Applications to supervised learning (regression,
classification)?

EM Tutorial

Suppose that the problem of maximizing a
likelihood would be simplified if the values
of some additional variables, called "missing
variables", were known
These values are not known, but given the
current values of the parameters, they can
be estimated (the E step).
Treat the estimated values as provisionally
correct and maximize the likelihood in the
usual way (the M step).
We now have better parameter values, so the
E step can be repeated. Iterate.

EM Tutorial (cont.)

"missing" data:  Z
"complete" data: Y = {X, Z}

"complete" likelihood: l_c(θ; Y)

The complete likelihood is a random variable, so
average out the randomness:

E step:

    Q(θ, θ^(t)) = E[ l_c(θ; Y) | X, θ^(t) ]

This yields a fixed function Q, which can be optimized:

M step:

    θ^(t+1) = arg max_θ Q(θ, θ^(t)).

Applying EM to the HME architecture

The missing data are the unknown values of
the decisions in the decision tree.

Define indicator variables z_i, z_{j|i}, z_{k|ij}, ...

Complete likelihood:

    l_c(θ; Y) = Σ_p Σ_i z_i^(p) Σ_j z_{j|i}^(p) ... log[ g_i^(p) g_{j|i}^(p) ... P_ijk(y^(p) | x^(p)) ]

Incomplete likelihood:

    l(θ; X) = Σ_p log [ Σ_i g_i^(p) Σ_j g_{j|i}^(p) ... P_ijk(y^(p) | x^(p)) ]

We need to compute the expected values of the
missing indicator variables.

Note that, e.g.,

    E(z_i^(p) | x^(p), y^(p)) = P(ω_i^(p) | x^(p), y^(p))

Example

one-level tree
at each leaf, linear regression with Gaussian
errors

For the ith leaf and the tth data point:

    h_i^(t) = g_i^(t) exp( -(1/2) ||y^(t) - μ_i^(t)||² )
              / Σ_j g_j^(t) exp( -(1/2) ||y^(t) - μ_j^(t)||² )

where μ_i^(t) = θ_i^T x^(t).

This posterior is a normalized distance measure
that reflects the relative magnitudes of the
residuals y^(t) - μ_i^(t).
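The E step for this one-level example is a few lines of vectorized code (an illustrative sketch, not the talk's code; it assumes unit-variance Gaussian experts, and the function names and shapes are my own):

import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def posteriors(X, Y, V, Theta):
    """E step for a one-level mixture of linear experts with unit-variance Gaussian noise.

    X: (T, d) inputs, Y: (T,) outputs, V: (n, d) gating parameters,
    Theta: (n, d) expert parameters. Returns h: (T, n) with h[t, i] = h_i^(t).
    """
    G = softmax_rows(X @ V.T)                  # prior gate probabilities g_i^(t)
    Mu = X @ Theta.T                           # expert means mu_i^(t) = theta_i^T x^(t)
    log_lik = -0.5 * (Y[:, None] - Mu) ** 2    # Gaussian log likelihood up to a constant
    log_h = np.log(G + 1e-12) + log_lik
    log_h -= log_h.max(axis=1, keepdims=True)  # Bayes' rule, normalized in log space
    H = np.exp(log_h)
    return H / H.sum(axis=1, keepdims=True)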

Posterior probabilities

    h_i      ≡ P(ω_i | x, y)
    h_{j|i}  ≡ P(ω_ij | x, y, ω_i)
    h_{k|ij} ≡ P(ω_ijk | x, y, ω_i, ω_ij)

(cf. the prior probabilities)

    g_i      ≡ P(ω_i | x)
    g_{j|i}  ≡ P(ω_ij | x, ω_i)
    g_{k|ij} ≡ P(ω_ijk | x, ω_i, ω_ij)

Bayes' rule yields:

    h_i = g_i Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)
          / Σ_i g_i Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)

    h_{j|i} = g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)
              / Σ_j g_{j|i} Σ_k g_{k|ij} P_ijk(y|x)

    h_{k|ij} = g_{k|ij} P_ijk(y|x)
               / Σ_k g_{k|ij} P_ijk(y|x)



Posterior propagation

[Figure: the posteriors h_i, h_{j|i}, h_{k|ij} are computed recursively, propagating
from the leaves up to the root of the tree]

The E step

compute the posterior probabilities (the "up-down"
algorithm)

The M step

The Q function decouples into a set of separate
maximum likelihood problems

At the nonterminals, fit multinomial logit models,
with the posteriors h_i^(t), h_{j|i}^(t), etc., serving
as the targets

At the leaves, obtain weighted likelihoods where
the weights are the product of the posteriors
from root to leaf

The M step (in more detail)

The maximization of Q(θ, θ^(t)) decouples into
a set of weighted MLE problems:

    v_i^(t+1) = arg max_{v_i} Σ_p Σ_i h_i^(p) log g_i^(p)

    (a cross-entropy cost)

    v_ij^(t+1) = arg max_{v_ij} Σ_p Σ_i h_i^(p) Σ_j h_{j|i}^(p) log g_{j|i}^(p)

    (a weighted cross-entropy cost)

    θ_ijk^(t+1) = arg max_{θ_ijk} Σ_p Σ_i h_i^(p) Σ_j h_{j|i}^(p) ... log P_ijk(y^(p) | x^(p))

    (a general weighted log likelihood)

Each of these is a weighted ML problem for
generalized linear models (GLIM's). They can
be solved efficiently using iteratively reweighted
least squares (IRLS).
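For the one-level example above, the decoupled M step can be sketched as follows (an illustration, not the talk's code; the expert update is an exact weighted least-squares solve, and the gate update takes a few gradient steps on the weighted cross-entropy rather than full IRLS; names and step sizes are assumptions):

import numpy as np

def m_step(X, Y, H, V, n_gate_steps=50, lr=0.1):
    """M step for a one-level mixture of linear experts.

    X: (T, d), Y: (T,), H: (T, n) posteriors from the E step, V: (n, d) current gates.
    Returns updated (V, Theta).
    """
    T, d = X.shape
    n = H.shape[1]

    # Experts: one weighted least-squares problem per leaf,
    # arg max_theta sum_t h_i^(t) log P_i(y^(t) | x^(t)).
    Theta = np.zeros((n, d))
    for i in range(n):
        W = H[:, i]
        A = X.T @ (W[:, None] * X) + 1e-8 * np.eye(d)
        b = X.T @ (W * Y)
        Theta[i] = np.linalg.solve(A, b)

    # Gate: maximize sum_t sum_i h_i^(t) log g_i^(t) by gradient ascent.
    for _ in range(n_gate_steps):
        Z = X @ V.T
        Z -= Z.max(axis=1, keepdims=True)
        G = np.exp(Z)
        G /= G.sum(axis=1, keepdims=True)
        V += lr * (H - G).T @ X / T        # gradient of the weighted cross-entropy
    return V, Theta

Alternating this M step with the E step sketched earlier gives the batch fitting procedure summarized on the next slide.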

HME Parameter Estimation


[Figure: the architecture annotated with the gate probabilities g_i, g_{j|i}, g_{k|ij}
and the posterior probabilities h_i, h_{j|i}, h_{k|ij} used during parameter estimation]

drop the data set down the tree


for each data point, compute the posterior
probabilities for every branch of the tree
at each nonterminal, use the posterior probabilities as (soft) classification targets
at each leaf, fit a local model, where each
data point is weighted by the product of the
posterior probabilities from the root to that
leaf

Model selection
How do we choose the structure of the tree?

initialize with CART or C4.5 (cf. K-means)

- can preserve local variable selection

ridge regression

cross-validation stopping within a fixed deep
hierarchy (EM iterations "grow" the effective
degrees of freedom)

Bayesian issues
Dirichlet priors
Gibbs sampling is straightforward
Gaussian approximation of the posterior via an SEM
calculation of the Hessian
Mean-field approximation of the posterior

Regression: A System Identification Problem


Forward dynamics of a four-joint, three-dimensional
arm
Twelve input variables, four output variables
15,000 points in the training set
5,000 points in the test set
Four-level tree, with binary branches
Compare to backpropagation in an MLP, with
60 hidden units
Compare to CART, MARS

Batch algorithms

[Figure: relative error versus number of training epochs (log scale, 10 to 1000) for
backpropagation and HME (Algorithm 2)]

Summary: batch algorithms

    Architecture        Relative Error   # Epochs
    linear              .31              NA
    backprop            .09              5,500
    HME (Algorithm 1)   .10              35
    HME (Algorithm 2)   .12              39
    CART                .17              NA
    CART (linear)       .13              NA
    MARS                .16              NA

An On-Line Variant of HME

Use techniques from recursive estimation theory
(Ljung & Soderstrom, 1986) to obtain the following
on-line algorithm:

Expert networks:

    U_ij^(t+1) = U_ij^(t) + h_i^(t) h_{j|i}^(t) (y^(t) - μ_ij^(t)) x^(t)T R_ij^(t)

where R_ij is updated as follows:

    R_ij^(t) = λ^{-1} [ R_ij^(t-1) - R_ij^(t-1) x^(t) x^(t)T R_ij^(t-1)
                        / (λ [h_ij^(t)]^{-1} + x^(t)T R_ij^(t-1) x^(t)) ]

where h_ij^(t) = h_i^(t) h_{j|i}^(t), and λ is a decay parameter.

Top-level gating networks:

    v_i^(t+1) = v_i^(t) + S_i^(t) (ln h_i^(t) - ξ_i^(t)) x^(t)

    S_i^(t) = λ^{-1} [ S_i^(t-1) - S_i^(t-1) x^(t) x^(t)T S_i^(t-1)
                       / (λ + x^(t)T S_i^(t-1) x^(t)) ]

Lower-level gating networks:

    v_ij^(t+1) = v_ij^(t) + S_ij^(t) h_i^(t) (ln h_{j|i}^(t) - ξ_ij^(t)) x^(t)

    S_ij^(t) = λ^{-1} [ S_ij^(t-1) - S_ij^(t-1) x^(t) x^(t)T S_ij^(t-1)
                        / (λ [h_i^(t)]^{-1} + x^(t)T S_ij^(t-1) x^(t)) ]

(here ξ_i^(t) = v_i^(t)T x^(t) and ξ_ij^(t) = v_ij^(t)T x^(t) are the gating networks'
linear predictors)
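The expert-network part of this recursive update can be sketched as follows for a scalar-output expert (an illustrative sketch, not the talk's code; the function name, the value of λ, and the initialization R = I/δ are assumptions). The gating updates follow the same matrix-inversion-lemma pattern:

import numpy as np

def online_expert_update(u, R, x, y, h_joint, lam=0.99):
    """One recursive-least-squares style update for a single expert.

    u: (d,) expert weights, R: (d, d) inverse-covariance-like matrix,
    x: (d,) input, y: scalar target, h_joint: posterior weight h_i * h_{j|i}.
    """
    # Update R with the matrix inversion lemma, weighted by the posterior.
    Rx = R @ x
    denom = lam / max(h_joint, 1e-12) + x @ Rx
    R = (R - np.outer(Rx, Rx) / denom) / lam
    # Posterior-weighted correction of the expert weights using the updated R.
    mu = u @ x
    u = u + h_joint * (y - mu) * (R @ x)
    return u, R

# Example usage: start from R = I / delta with a small delta.
d = 4
u, R = np.zeros(d), np.eye(d) / 1e-3
rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=d)
    y = x @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal()
    u, R = online_expert_update(u, R, x, y, h_joint=1.0)
print(u)   # should approach the true weights when h_joint = 1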

Classification

    Task      Baseline   CART   HME   Bayes
    Heart     .44        .22    .18   .18
    Pima      .35        .26    .22   .21
    Orbitals  .48        .29    .23   .21

(Error rates are computed using 10-fold cross-validation)

Convergence results
(Jordan & Xu, 1994)

Theorem 1  Assume that the training set X is
generated by the mixture model (the "realizable" case).

Let us denote

    P = diag[P_1, ..., P_K]

    H(θ*) = ∂²l(θ*) / ∂θ ∂θ^T

where the P_i are covariance matrices of the component
models.

Then, with probability one:

(1) Letting -M, -m (here M > m > 0) be the minimum and
maximum eigenvalues of the negative definite matrix
(P^{1/2})^T H(θ*) (P^{1/2}), we have

    l(θ*) - l(θ^(k)) ≤ r^k [l(θ*) - l(θ^(0))]

    ||P^{-1/2} (θ^(k) - θ*)|| ≤ |r|^{k/2} sqrt( 2 [l(θ*) - l(θ^(0))] / m )

where the rate r, determined by m and M, satisfies
0 < |r| < 1 when M < 2.

(2) For any initial point θ^(0) in D, lim_{k→∞} θ^(k) = θ*
when M < 2.

Test Set Error

(Saul & Jordan, 1995)

Hard split model

    y(x) = (1/√N) (w_1 · x) Θ(v · x) + (1/√N) (w_2 · x) Θ(-v · x)

where Θ(·) is the step function.

Consider a structurally identical teacher with
weight vectors w̃_1, w̃_2, ṽ; its output is denoted ỹ(x).

Order parameters

    R = (1/N) | v·ṽ      v·w̃_1     v·w̃_2   |     | R_v  X_1  X_2 |
              | w_1·ṽ    w_1·w̃_1   w_1·w̃_2 |  =  | Y_1  R_1  C_1 |
              | w_2·ṽ    w_2·w̃_1   w_2·w̃_2 |     | Y_2  C_2  R_2 |

Loss

    ε(v, w_1, w_2; x) = (1/2) { [ỹ(x) - w_1·x]² Θ(v·x) + [ỹ(x) - w_2·x]² Θ(-v·x) }

Empirical risk (training energy)

    E = Σ_{p=1}^{P} ε(v, w_1, w_2; x^(p))
Test set error (under a Gibbs distribution)

    ε_g(R) = 1 - [1 - cos⁻¹(R_v)/π] (R_1 + R_2)/2
               - [cos⁻¹(R_v)/π] (C_1 + C_2)/2
               - (X_1 - X_2)(Y_1 - Y_2) / (2π sqrt(1 - R_v²))

High temperature limit

β → 0 (where β = 1/T in the Gibbs distribution)

α → ∞ (α = P/N)

α̃ remains finite (α̃ = βα, a signal-to-noise ratio)

Results

    ε_g ~ 2/α̃   (cf. the perceptron)

A continuous phase transition at

    α̃_c = π sqrt(1 + π²/8) ≈ 4.695
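As a quick arithmetic check of the quoted critical value (an illustrative check, not part of the original slides):

import numpy as np
# alpha_tilde_c = pi * sqrt(1 + pi^2 / 8)
print(np.pi * np.sqrt(1.0 + np.pi**2 / 8.0))   # ~ 4.695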

[Figure: three panels of curves plotted on a 0-1 vertical scale against a horizontal
axis running from 0 to 15, illustrating the behavior around the phase transition]

A Histogram Tree
[Figure: histograms at the nodes of the tree at epochs 0, 9, 19, and 29]

A Deviance Tree

Hidden Markov Decision Trees


[Figure: a hidden Markov decision tree, with input x and the tree's decision nodes
coupled across successive time steps]

Each decision at a node is dependent on the
decision at the previous moment at that node
- This yields a Markov model at each node

An EM algorithm can be derived, treating
the Markov states as hidden variables
- It combines a forward-backward pass with
  an up-down pass

Conclusions
A probabilistic approach to decision tree modeling
- ridge function splits
- smooth regression functions
- any GLIM can be used as a leaf model
EM algorithm (and SEM)
Bayesian methods
- Gibbs sampling
- mean-field methods
