
Lecture 12: Hidden Markov Models
Machine Learning

Andrew Rosenberg

March 12, 2010

1 / 34

Last Time

Clustering

2 / 34

Today

Hidden Markov Models

3 / 34

Dice Example

Imagine a game of dice. When the croupier rolls 4, 5, or 6 you win; when the croupier rolls 1, 2, or 3 you lose. Model the likelihood of winning: IID multinomials.
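A minimal sketch of this IID multinomial model (not from the lecture; the face probabilities and sample size are illustrative): each roll is an independent categorical draw, and the win probability is the mass on faces 4-6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative face probabilities for a single die (a fair die here).
faces = np.array([1, 2, 3, 4, 5, 6])
p_face = np.full(6, 1.0 / 6.0)

# Exact win probability under the model: mass on faces 4, 5, 6.
p_win = p_face[3:].sum()

# IID rolls: every roll is an independent multinomial (categorical) draw.
rolls = rng.choice(faces, size=10_000, p=p_face)
p_win_mle = np.mean(rolls >= 4)   # maximum-likelihood estimate from data

print(p_win, p_win_mle)
```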

4 / 34

The Moody Croupier

Now imagine that the croupier cheats. There are three dice.
One fair die (fair)
One good for the house (bad)
One good for you (good)

5 / 34

The Moody Croupier

Model the likelihood of winning: IID multinomials with a latent variable (the die choice).

[Figure: graphical model with latent die choice q_i generating observed roll x_i, plate over i ∈ {0, ..., n-1}.]
6 / 34

The Moody Croupier

Model the likelihood of winning: IID multinomials with a latent variable, now allowing a prior over die choices.

[Figure: graphical model with latent die choice q_i generating observed roll x_i, plate over i ∈ {0, ..., n-1}.]
7 / 34

The Moody Croupier


Now what if the dealer is moody? The dealer doesn't like to change the die that often, and doesn't like to switch from the good die to the bad die. No longer IID! The die he uses at time t depends on the die used at time t-1.

[Figure: chain-structured graphical model q_0 → q_1 → ... → q_{T-1}, with each state q_t emitting an observation x_t.]

8 / 34

Sequential Modeling
[Figure: chain-structured graphical model q_0 → q_1 → ... → q_{T-1}, with each state q_t emitting an observation x_t.]

Temporal or sequence model. Markov assumption: future ⊥ past | present,

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, ..., q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model:

p(x, q) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)
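A minimal sketch (not from the lecture; parameter values are illustrative) that evaluates this factorization for a fully observed state and observation sequence, working in log space for numerical stability.

```python
import numpy as np

def joint_log_likelihood(q, x, pi, A, B):
    """log p(x, q) = log p(q_0) + sum_t log p(q_t|q_{t-1}) + sum_t log p(x_t|q_t).

    q  : state indices, shape (T,)
    x  : observation indices, shape (T,)
    pi : initial state distribution, shape (M,)
    A  : transition matrix, A[i, j] = p(q_t = j | q_{t-1} = i), shape (M, M)
    B  : emission matrix,  B[i, k] = p(x_t = k | q_t = i),      shape (M, N)
    """
    ll = np.log(pi[q[0]]) + np.log(B[q[0], x[0]])
    for t in range(1, len(q)):
        ll += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], x[t]])
    return ll

# Illustrative parameters: 2 states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(joint_log_likelihood(np.array([0, 0, 1]), np.array([0, 1, 1]), pi, A, B))
```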

9 / 34

Sequential Modeling
Markov assumption: future ⊥ past | present,

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, ..., q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model:

p(x, q) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)

What form does p(q_t | q_{t-1}) take?

10 / 34

HMMs as state machines

HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable. We can consider HMMs as stochastic automata, or weighted finite state machines.

11 / 34

HMM state machine


[Figure: three-state transition diagram over the dice bad, fair, good. Each die has a self-transition probability of 0.5, with the remaining mass split between the other two dice (transition probabilities of 0.2 and 0.3).]
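A sketch of this transition structure as a matrix; the exact off-diagonal values are hard to recover from the extracted figure, so the numbers below are an assumed reading in which each die keeps itself with probability 0.5.

```python
import numpy as np

# One consistent reading of the diagram (the off-diagonal values are assumptions).
states = ["bad", "fair", "good"]
A = np.array([
    # to:  bad  fair good
    [0.5, 0.3, 0.2],   # from bad
    [0.3, 0.5, 0.2],   # from fair
    [0.2, 0.3, 0.5],   # from good
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution over the next die
```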

12 / 34

HMMs as state machines

HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable: there is no observation of q directly, only a related emission distribution. This makes the HMM a doubly-stochastic automaton.

13 / 34

HMM Applications

Speech Recognition (Rabiner): phonemes from audio cepstral vectors
Language (Jelinek): part-of-speech tags from words
Biology (Baldi): splice sites from gene sequences
Gesture (Starner): words from hand coordinates
Emotion (Picard): emotion from EEG

14 / 34

Types of Variables

Continuous States
E.g., Kalman filters: p(q_t | q_{t-1}) = N(q_t | A q_{t-1}, Q)

Discrete States
E.g., finite state machines: p(q_t | q_{t-1}) = ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j}

Continuous Observations
E.g., time series data: p(x_t | q_t) = N(x_t | μ_{q_t}, Σ_{q_t})

Discrete Observations
E.g., strings: p(x_t | q_t) = ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j}
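A small sketch contrasting the two emission types (illustrative parameters, not from the lecture): discrete emissions are a table lookup, continuous emissions evaluate a per-state Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Discrete observations: emission probabilities are a lookup into a matrix.
eta = np.array([[0.9, 0.1],      # eta[i, j] = p(x_t = j | q_t = i)
                [0.2, 0.8]])
p_discrete = eta[1, 0]           # p(x_t = 0 | q_t = 1)

# Continuous observations: each state owns a Gaussian over the observation.
mu = {0: np.zeros(2), 1: np.ones(2)}          # per-state means (illustrative)
sigma = {0: np.eye(2), 1: 0.5 * np.eye(2)}    # per-state covariances
x_t = np.array([0.3, -0.1])
p_continuous = multivariate_normal.pdf(x_t, mean=mu[0], cov=sigma[0])

print(p_discrete, p_continuous)
```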

15 / 34

HMM Parameters
M states and N-class observations. Complete likelihood from the graphical model:

p(q, x) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)

Marginalize over the unobserved hidden states:

p(x) = ∑_{q_0} ... ∑_{q_{T-1}} p(q, x)

The CPTs are reused across time: θ = {π, A, η}. Writing q_t^i for the indicator that q_t = i and x_t^j for the indicator that x_t = j:

p(q_0) = ∏_{i=0}^{M-1} [π_i]^{q_0^i},  with  ∑_{i=0}^{M-1} π_i = 1

p(q_t | q_{t-1}) = ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j},  with  ∑_{j=0}^{M-1} a_{ij} = 1

p(x_t | q_t) = ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j},  with  ∑_{j=0}^{N-1} η_{ij} = 1
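A sketch of the full parameter set θ = {π, A, η} for the croupier HMM (not from the lecture: the initial distribution, transition values, and die biases are illustrative), with the normalization constraints checked explicitly.

```python
import numpy as np

M, N = 3, 6                      # M states (dice), N observation symbols (faces 1..6)
# state order: bad, fair, good

pi = np.array([1/3, 1/3, 1/3])   # initial die choice (illustrative)

A = np.array([                   # A[i, j] = p(q_t = j | q_{t-1} = i)
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

eta = np.array([                 # eta[i, j] = p(face j+1 | die i); biases are illustrative
    [0.25, 0.25, 0.25, 0.25/3, 0.25/3, 0.25/3],   # "bad": favors faces 1-3
    [1/6] * 6,                                     # "fair"
    [0.25/3, 0.25/3, 0.25/3, 0.25, 0.25, 0.25],   # "good": favors faces 4-6
])

# Each CPT row must be a probability distribution.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(eta.sum(axis=1), 1.0)
```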
16 / 34

HMM Operations

Evaluate
Evaluate the likelihood of a model given data.

Decode
Identify the most likely sequence of states

Max Likelihood
Estimate the parameters.

17 / 34

JTA on HMM

Junction Tree

[Figure: junction tree for the HMM chain, with cliques (q_t, q_{t+1}) and (q_t, x_t) joined by separators over q_t.]

18 / 34

JTA on HMMs
Initialization

ψ(q_0, x_0) = p(q_0) p(x_0 | q_0)
ψ(q_t, q_{t+1}) = p(q_{t+1} | q_t) = A_{q_t, q_{t+1}}
ψ(q_t, x_t) = p(x_t | q_t)
Z = 1
ζ(q_t) = 1
φ(q_t) = 1


19 / 34

JTA on HMMs
Update

Collect up from the leaves; this doesn't change the ζ separators:

ζ*(q_t) = ∑_{x_t} ψ(q_t, x_t) = ∑_{x_t} p(x_t | q_t) = 1

ψ*(q_{t-1}, q_t) = (ζ*(q_t) / ζ(q_t)) ψ(q_{t-1}, q_t) = ψ(q_{t-1}, q_t)
20 / 34

JTA on HMMs
Update

Collect left-to-right over the φ separators; the state-sequence potentials become marginals:

φ*(q_0) = ∑_{x_0} ψ(q_0, x_0) = p(q_0)

ψ*(q_0, q_1) = φ*(q_0) ψ(q_0, q_1) = p(q_0, q_1)

φ*(q_t) = ∑_{q_{t-1}} ψ*(q_{t-1}, q_t) = p(q_t)
21 / 34

JTA on HMMs
Distribute

Distribute to the separators:

ζ*(q_t) = ∑_{q_{t-1}} ψ(q_{t-1}, q_t) = ∑_{q_{t-1}} p(q_{t-1}, q_t) = p(q_t)

ψ*(q_t, x_t) = (ζ*(q_t) / ζ(q_t)) ψ(q_t, x_t) = (p(q_t) / 1) p(x_t | q_t) = p(x_t, q_t)

22 / 34

Introduction of Evidence
p(q | x̃) ∝ p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x̃_t | q_t)

Observe a sequence of data x̃_0, ..., x̃_{T-1}. The observation potentials become slices:

ψ(q_t, x̃_t) = p(x̃_t | q_t)

Collect the ζ separators bottom-up:

ζ(q_t) = ψ(q_t, x̃_t) = p(x̃_t | q_t)

Collect the φ separators to the right:

φ(q_0) = ∑_{x_0} ψ(q_0, x_0) δ(x_0, x̃_0) = p(q_0, x̃_0)
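A small sketch (illustrative arrays, not from the lecture) of what "potentials become slices" means in code: conditioning on the observed symbols turns the emission CPT into one evidence vector per time step.

```python
import numpy as np

eta = np.array([[0.9, 0.1],      # eta[i, j] = p(x_t = j | q_t = i)
                [0.2, 0.8]])

x_obs = np.array([0, 1, 1, 0])   # observed symbol indices x_0 ... x_{T-1}

# Introducing evidence: each observation potential collapses to a column slice
# b[t, i] = p(x_t | q_t = i), which is what the collect pass multiplies in.
b = eta[:, x_obs].T              # shape (T, M)
print(b)
```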


23 / 34

JTA collect
Collecting up and to the left, each potential is updated by its bottom and left separators:

ψ*(q_t, q_{t+1}) = (φ(q_t) / 1)(ζ(q_{t+1}) / 1) ψ(q_t, q_{t+1}) = φ(q_t) p(x̃_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}

φ(q_{t+1}) = ∑_{q_t} ψ*(q_t, q_{t+1}) = ∑_{q_t} φ(q_t) p(x̃_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}

Note:

φ(q_0) = p(x̃_0, q_0)
φ(q_1) = ∑_{q_0} p(x̃_0, q_0) p(x̃_1 | q_1) p(q_1 | q_0) = p(x̃_0, x̃_1, q_1)
φ(q_2) = ∑_{q_1} p(x̃_0, x̃_1, q_1) p(x̃_2 | q_2) p(q_2 | q_1) = p(x̃_0, x̃_1, x̃_2, q_2)
φ(q_{t+1}) = ∑_{q_t} p(x̃_0, ..., x̃_t, q_t) p(x̃_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x̃_0, ..., x̃_{t+1}, q_{t+1})
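This left-to-right collect is exactly the forward (alpha) recursion. A minimal sketch (not from the lecture; parameter values are illustrative), which also produces the likelihood used on the next slide:

```python
import numpy as np

def forward(x_obs, pi, A, eta):
    """phi[t, i] = p(x_0, ..., x_t, q_t = i): the forward messages."""
    T, M = len(x_obs), len(pi)
    phi = np.zeros((T, M))
    phi[0] = pi * eta[:, x_obs[0]]                     # phi(q_0) = p(x_0, q_0)
    for t in range(T - 1):
        # phi(q_{t+1}) = sum_{q_t} phi(q_t) p(q_{t+1}|q_t) p(x_{t+1}|q_{t+1})
        phi[t + 1] = (phi[t] @ A) * eta[:, x_obs[t + 1]]
    return phi

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = forward(np.array([0, 1, 1, 0]), pi, A, eta)
print(phi[-1].sum())   # p(x_0, ..., x_{T-1}): the evaluation step
```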
24 / 34

Evaluation

Compute the likelihood of the sequence. Collection is sufficient. From the previous slide:

φ(q_{t+1}) = ∑_{q_t} p(x_0, ..., x_t, q_t) p(x_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x_0, ..., x_{t+1}, q_{t+1})

So the rightmost node gives:

φ(q_{T-1}) = p(x_0, ..., x_{T-1}, q_{T-1})

The likelihood just requires marginalization over q_{T-1}:

p(x_0, ..., x_{T-1}) = ∑_{q_{T-1}} p(x_0, ..., x_{T-1}, q_{T-1}) = ∑_{q_{T-1}} φ(q_{T-1})

25 / 34

Distribute
But the potentials cannot be read as marginals without the Distribute step of the JTA. The last step of collection was:

ψ*(q_{T-2}, q_{T-1}) = φ(q_{T-2}) ζ(q_{T-1}) ψ(q_{T-2}, q_{T-1}) = φ(q_{T-2}) p(x̃_{T-1} | q_{T-1}) a_{q_{T-2}, q_{T-1}}

Distribute φ** back along the state nodes to the left. Distribute ζ** down from the state nodes to the observation nodes. Update the potentials:

ψ**(q_{T-2}, q_{T-1}) = ψ*(q_{T-2}, q_{T-1})

φ**(q_t) = ∑_{q_{t+1}} ψ**(q_t, q_{t+1})

φ*(q_{t+1}) = ∑_{q_t} ψ*(q_t, q_{t+1})

ψ**(q_t, q_{t+1}) = (φ**(q_{t+1}) / φ*(q_{t+1})) ψ*(q_t, q_{t+1})
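The collect and distribute passes together amount to the forward-backward algorithm. A minimal sketch (illustrative parameters, not from the lecture) that computes the smoothed state posteriors p(q_t | x_0, ..., x_{T-1}):

```python
import numpy as np

def forward_backward(x_obs, pi, A, eta):
    """Return p(q_t = i | x_0, ..., x_{T-1}) for all t, i (smoothed posteriors)."""
    T, M = len(x_obs), len(pi)
    alpha = np.zeros((T, M))          # collect pass: p(x_0..x_t, q_t)
    beta = np.ones((T, M))            # distribute pass: p(x_{t+1}..x_{T-1} | q_t)
    alpha[0] = pi * eta[:, x_obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * eta[:, x_obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (eta[:, x_obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_backward(np.array([0, 1, 1, 0]), pi, A, eta))
```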


26 / 34

Decoding
Decode: given x̃_0, ..., x̃_{T-1}, identify the most likely q_0, ..., q_{T-1}. Now that the JTA is finished, we have marginals in the potentials and separators:

φ(q_t) ∝ p(q_t | x̃_0, ..., x̃_{T-1})
φ(q_{t+1}) ∝ p(q_{t+1} | x̃_0, ..., x̃_{T-1})
ψ(q_t, q_{t+1}) ∝ p(q_t, q_{t+1} | x̃_0, ..., x̃_{T-1})

We need to find the most likely path from q_0 to q_{T-1}: argmax JTA.
Run the JTA but, rather than sums in the update rule, use the max operator. Then find the largest entry in the separators:

q̂_t = argmax_{q_t} φ(q_t)
27 / 34

Viterbi Decoding

Finding an optimal state sequence by exhaustive enumeration is intractable: there are M^T possible paths for M states and T time steps.
T can easily be on the order of 1000 in speech recognition.

Construct a lattice of state transitions.
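A minimal max-product (Viterbi) sketch of dynamic programming over this lattice, in log space (parameters are illustrative, not from the lecture):

```python
import numpy as np

def viterbi(x_obs, pi, A, eta):
    """Most likely state path argmax_q p(q | x) via dynamic programming on the lattice."""
    T, M = len(x_obs), len(pi)
    log_A, log_eta = np.log(A), np.log(eta)
    delta = np.zeros((T, M))              # best log-score of any path ending in state i at time t
    back = np.zeros((T, M), dtype=int)    # backpointers along the lattice
    delta[0] = np.log(pi) + log_eta[:, x_obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_eta[:, x_obs[t]]
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(np.array([0, 1, 1, 0]), pi, A, eta))
```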

28 / 34

Viterbi Decoding

Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top-N paths. This is also known as beam search: a polynomial-time algorithm to approximately decode a lattice.

Algorithm: initialize paths at every state; for each transition, follow only the most likely edge. Or: initialize paths at every state; for each transition, follow only those paths whose likelihood exceeds some threshold.

29 / 34

Viterbi decoding

30 / 34

Maximum Likelihood
Training parameters with observed states. Maximum likelihood (as ever).

l(θ) = log p(q, x)
     = log [ p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t) ]
     = log p(q_0) + ∑_{t=1}^{T-1} log p(q_t | q_{t-1}) + ∑_{t=0}^{T-1} log p(x_t | q_t)
     = log ∏_{i=0}^{M-1} [π_i]^{q_0^i} + ∑_{t=1}^{T-1} log ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j} + ∑_{t=0}^{T-1} log ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j}
     = ∑_{i=0}^{M-1} q_0^i log π_i + ∑_{t=1}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{M-1} q_{t-1}^i q_t^j log a_{ij} + ∑_{t=0}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{N-1} q_t^i x_t^j log η_{ij}

Introduce Lagrange multipliers for the normalization constraints, take partials, and set to zero.


31 / 34

Maximum Likelihood
Training parameters with observed states. Maximum likelihood as ever.
l(θ) = ∑_{i=0}^{M-1} q_0^i log π_i + ∑_{t=1}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{M-1} q_{t-1}^i q_t^j log a_{ij} + ∑_{t=0}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{N-1} q_t^i x_t^j log η_{ij}

Introduce Lagrange multipliers for the constraints

∑_{i=0}^{M-1} π_i = 1,   ∑_{j=0}^{M-1} a_{ij} = 1,   ∑_{j=0}^{N-1} η_{ij} = 1,

take partials, and set to zero. The solutions are the empirical frequencies:

π̂_i = q_0^i

â_{ij} = ( ∑_{t=0}^{T-2} q_t^i q_{t+1}^j ) / ( ∑_{k=0}^{M-1} ∑_{t=0}^{T-2} q_t^i q_{t+1}^k )

η̂_{ij} = ( ∑_{t=0}^{T-1} q_t^i x_t^j ) / ( ∑_{k=0}^{N-1} ∑_{t=0}^{T-1} q_t^i x_t^k )
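These closed-form estimates are just normalized counts; a small sketch (illustrative sequences, not from the lecture):

```python
import numpy as np

def mle_counts(q, x, M, N):
    """Closed-form ML estimates of (pi, A, eta) from one fully observed sequence."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    eta = np.zeros((M, N))
    pi[q[0]] = 1.0                                  # pi_i = q_0^i
    for t in range(len(q) - 1):
        A[q[t], q[t + 1]] += 1                      # transition counts
    for t in range(len(q)):
        eta[q[t], x[t]] += 1                        # emission counts
    # Normalize each row (rows with no counts would need smoothing; omitted here).
    A /= A.sum(axis=1, keepdims=True)
    eta /= eta.sum(axis=1, keepdims=True)
    return pi, A, eta

q = np.array([0, 0, 1, 1, 0])
x = np.array([0, 1, 1, 0, 0])
print(mle_counts(q, x, M=2, N=2))
```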


32 / 34


Expectation Maximization

However, we may not have observed state sequences.


The Moody Croupier

Need to do unsupervised learning (clustering) on the states.


Maximize the expected likelihood given a guess for p(q): Expectation Maximization. Covered when we move to unsupervised techniques.

33 / 34

Bye

Next
Perceptron and Neural Networks

34 / 34
