
Lecture 12: Hidden Markov Models
Machine Learning

Andrew Rosenberg

March 12, 2010

1 / 34

Last Time

Clustering

2 / 34

Today

Hidden Markov Models

3 / 34

Dice Example

Imagine a game of dice. When the croupier rolls 4, 5, or 6 you win; when the croupier rolls 1, 2, or 3 you lose. Model the likelihood of winning: IID multinomials.
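A minimal sketch of this IID multinomial model (not from the lecture; the face probabilities and sample size are illustrative): each roll is an independent categorical draw, and the win probability is the mass on faces 4-6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative face probabilities for a single die (a fair die here).
faces = np.array([1, 2, 3, 4, 5, 6])
p_face = np.full(6, 1.0 / 6.0)

# Exact win probability under the model: mass on faces 4, 5, 6.
p_win = p_face[3:].sum()

# IID rolls: every roll is an independent multinomial (categorical) draw.
rolls = rng.choice(faces, size=10_000, p=p_face)
p_win_mle = np.mean(rolls >= 4)   # maximum-likelihood estimate from data

print(p_win, p_win_mle)
```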

4 / 34

The Moody Croupier

Now imagine that the croupier cheats. There are three dice.
One fair die (fair)
One good for the house (bad)
One good for you (good)

5 / 34

The Moody Croupier

Model the likelihood of winning: IID multinomials with a latent variable (the die choice).

[Figure: graphical model with latent die choice q_i generating observed roll x_i, plate over i ∈ {0, ..., n-1}.]
6 / 34

The Moody Croupier

Model the likelihood of winning: IID multinomials with a latent variable, now allowing a prior over die choices.

[Figure: graphical model with latent die choice q_i generating observed roll x_i, plate over i ∈ {0, ..., n-1}.]
7 / 34

The Moody Croupier


Now what if the dealer is moody? The dealer doesn't like to change the die that often, and doesn't like to switch from the good die to the bad die. No longer IID! The die he uses at time t depends on the die used at time t-1.

[Figure: chain-structured graphical model q_0 → q_1 → ... → q_{T-1}, with each state q_t emitting an observation x_t.]

8 / 34

Sequential Modeling
[Figure: chain-structured graphical model q_0 → q_1 → ... → q_{T-1}, with each state q_t emitting an observation x_t.]

Temporal or sequence model. Markov assumption: future ⊥ past | present,

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, ..., q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model:

p(x, q) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)
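A minimal sketch (not from the lecture; parameter values are illustrative) that evaluates this factorization for a fully observed state and observation sequence, working in log space for numerical stability.

```python
import numpy as np

def joint_log_likelihood(q, x, pi, A, B):
    """log p(x, q) = log p(q_0) + sum_t log p(q_t|q_{t-1}) + sum_t log p(x_t|q_t).

    q  : state indices, shape (T,)
    x  : observation indices, shape (T,)
    pi : initial state distribution, shape (M,)
    A  : transition matrix, A[i, j] = p(q_t = j | q_{t-1} = i), shape (M, M)
    B  : emission matrix,  B[i, k] = p(x_t = k | q_t = i),      shape (M, N)
    """
    ll = np.log(pi[q[0]]) + np.log(B[q[0], x[0]])
    for t in range(1, len(q)):
        ll += np.log(A[q[t - 1], q[t]]) + np.log(B[q[t], x[t]])
    return ll

# Illustrative parameters: 2 states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(joint_log_likelihood(np.array([0, 0, 1]), np.array([0, 1, 1]), pi, A, B))
```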

9 / 34

Sequential Modeling
Markov assumption: future ⊥ past | present,

p(q_t | q_{t-1}, q_{t-2}, q_{t-3}, ..., q_0) = p(q_t | q_{t-1})

Get the overall likelihood from the graphical model:

p(x, q) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)

What form does p(q_t | q_{t-1}) take?

10 / 34

HMMs as state machines

HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable. We can consider HMMs as stochastic automata, or weighted finite state machines.

11 / 34

HMM state machine


[Figure: three-state transition diagram over the dice bad, fair, good. Each die has a self-transition probability of 0.5, with the remaining mass split between the other two dice (transition probabilities of 0.2 and 0.3).]
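A sketch of this transition structure as a matrix; the exact off-diagonal values are hard to recover from the extracted figure, so the numbers below are an assumed reading in which each die keeps itself with probability 0.5.

```python
import numpy as np

# One consistent reading of the diagram (the off-diagonal values are assumptions).
states = ["bad", "fair", "good"]
A = np.array([
    # to:  bad  fair good
    [0.5, 0.3, 0.2],   # from bad
    [0.3, 0.5, 0.2],   # from fair
    [0.2, 0.3, 0.5],   # from good
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution over the next die
```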

12 / 34

HMMs as state machines

HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable: there is no observation of q directly, only a related emission distribution. This makes the HMM a doubly-stochastic automaton.

13 / 34

HMM Applications

Speech Recognition (Rabiner): phonemes from audio cepstral vectors
Language (Jelinek): part-of-speech tags from words
Biology (Baldi): splice sites from gene sequences
Gesture (Starner): words from hand coordinates
Emotion (Picard): emotion from EEG

14 / 34

Types of Variables

Continuous States
E.g., Kalman filters: p(q_t | q_{t-1}) = N(q_t | A q_{t-1}, Q)

Discrete States
E.g., finite state machines: p(q_t | q_{t-1}) = ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j}

Continuous Observations
E.g., time series data: p(x_t | q_t) = N(x_t | μ_{q_t}, Σ_{q_t})

Discrete Observations
E.g., strings: p(x_t | q_t) = ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j}
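A small sketch contrasting the two emission types (illustrative parameters, not from the lecture): discrete emissions are a table lookup, continuous emissions evaluate a per-state Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Discrete observations: emission probabilities are a lookup into a matrix.
eta = np.array([[0.9, 0.1],      # eta[i, j] = p(x_t = j | q_t = i)
                [0.2, 0.8]])
p_discrete = eta[1, 0]           # p(x_t = 0 | q_t = 1)

# Continuous observations: each state owns a Gaussian over the observation.
mu = {0: np.zeros(2), 1: np.ones(2)}          # per-state means (illustrative)
sigma = {0: np.eye(2), 1: 0.5 * np.eye(2)}    # per-state covariances
x_t = np.array([0.3, -0.1])
p_continuous = multivariate_normal.pdf(x_t, mean=mu[0], cov=sigma[0])

print(p_discrete, p_continuous)
```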

15 / 34

HMM Parameters
M states and N-class observations. Complete likelihood from the graphical model:

p(q, x) = p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t)

Marginalize over the unobserved hidden states:

p(x) = ∑_{q_0} ... ∑_{q_{T-1}} p(q, x)

The CPTs are reused across time: θ = {π, A, η}. Writing q_t^i for the indicator that q_t = i and x_t^j for the indicator that x_t = j:

p(q_0) = ∏_{i=0}^{M-1} [π_i]^{q_0^i},  with  ∑_{i=0}^{M-1} π_i = 1

p(q_t | q_{t-1}) = ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j},  with  ∑_{j=0}^{M-1} a_{ij} = 1

p(x_t | q_t) = ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j},  with  ∑_{j=0}^{N-1} η_{ij} = 1
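A sketch of the full parameter set θ = {π, A, η} for the croupier HMM (not from the lecture: the initial distribution, transition values, and die biases are illustrative), with the normalization constraints checked explicitly.

```python
import numpy as np

M, N = 3, 6                      # M states (dice), N observation symbols (faces 1..6)
# state order: bad, fair, good

pi = np.array([1/3, 1/3, 1/3])   # initial die choice (illustrative)

A = np.array([                   # A[i, j] = p(q_t = j | q_{t-1} = i)
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

eta = np.array([                 # eta[i, j] = p(face j+1 | die i); biases are illustrative
    [0.25, 0.25, 0.25, 0.25/3, 0.25/3, 0.25/3],   # "bad": favors faces 1-3
    [1/6] * 6,                                     # "fair"
    [0.25/3, 0.25/3, 0.25/3, 0.25, 0.25, 0.25],   # "good": favors faces 4-6
])

# Each CPT row must be a probability distribution.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(eta.sum(axis=1), 1.0)
```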
16 / 34

HMM Operations

Evaluate
Evaluate the likelihood of a model given data.

Decode
Identify the most likely sequence of states

Max Likelihood
Estimate the parameters.

17 / 34

JTA on HMM

Junction Tree

[Figure: junction tree for the HMM chain, with cliques (q_t, q_{t+1}) and (q_t, x_t) joined by separators over q_t.]

18 / 34

JTA on HMMs
Initialization

ψ(q_0, x_0) = p(q_0) p(x_0 | q_0)
ψ(q_t, q_{t+1}) = p(q_{t+1} | q_t) = A_{q_t, q_{t+1}}
ψ(q_t, x_t) = p(x_t | q_t)
Z = 1
ζ(q_t) = 1
φ(q_t) = 1


19 / 34

JTA on HMMs
Update

Collect up from the leaves; this doesn't change the ζ separators:

ζ*(q_t) = ∑_{x_t} ψ(q_t, x_t) = ∑_{x_t} p(x_t | q_t) = 1

ψ*(q_{t-1}, q_t) = (ζ*(q_t) / ζ(q_t)) ψ(q_{t-1}, q_t) = ψ(q_{t-1}, q_t)
20 / 34

JTA on HMMs
Update

Collect left-to-right over the φ separators; the state-sequence potentials become marginals:

φ*(q_0) = ∑_{x_0} ψ(q_0, x_0) = p(q_0)

ψ*(q_0, q_1) = φ*(q_0) ψ(q_0, q_1) = p(q_0, q_1)

φ*(q_t) = ∑_{q_{t-1}} ψ*(q_{t-1}, q_t) = p(q_t)
21 / 34

JTA on HMMs
Distribute

Distribute to the separators:

ζ*(q_t) = ∑_{q_{t-1}} ψ(q_{t-1}, q_t) = ∑_{q_{t-1}} p(q_{t-1}, q_t) = p(q_t)

ψ*(q_t, x_t) = (ζ*(q_t) / ζ(q_t)) ψ(q_t, x_t) = (p(q_t) / 1) p(x_t | q_t) = p(x_t, q_t)

22 / 34

Introduction of Evidence
p(q | x̃) ∝ p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x̃_t | q_t)

Observe a sequence of data x̃_0, ..., x̃_{T-1}. The observation potentials become slices:

ψ(q_t, x̃_t) = p(x̃_t | q_t)

Collect the ζ separators bottom-up:

ζ(q_t) = ψ(q_t, x̃_t) = p(x̃_t | q_t)

Collect the φ separators to the right:

φ(q_0) = ∑_{x_0} ψ(q_0, x_0) δ(x_0, x̃_0) = p(q_0, x̃_0)
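A small sketch (illustrative arrays, not from the lecture) of what "potentials become slices" means in code: conditioning on the observed symbols turns the emission CPT into one evidence vector per time step.

```python
import numpy as np

eta = np.array([[0.9, 0.1],      # eta[i, j] = p(x_t = j | q_t = i)
                [0.2, 0.8]])

x_obs = np.array([0, 1, 1, 0])   # observed symbol indices x_0 ... x_{T-1}

# Introducing evidence: each observation potential collapses to a column slice
# b[t, i] = p(x_t | q_t = i), which is what the collect pass multiplies in.
b = eta[:, x_obs].T              # shape (T, M)
print(b)
```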


23 / 34

JTA collect
Collecting up and to the left, each potential is updated by its bottom and left separators:

ψ*(q_t, q_{t+1}) = (φ(q_t) / 1)(ζ(q_{t+1}) / 1) ψ(q_t, q_{t+1}) = φ(q_t) p(x̃_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}

φ(q_{t+1}) = ∑_{q_t} ψ*(q_t, q_{t+1}) = ∑_{q_t} φ(q_t) p(x̃_{t+1} | q_{t+1}) a_{q_t, q_{t+1}}

Note:

φ(q_0) = p(x̃_0, q_0)
φ(q_1) = ∑_{q_0} p(x̃_0, q_0) p(x̃_1 | q_1) p(q_1 | q_0) = p(x̃_0, x̃_1, q_1)
φ(q_2) = ∑_{q_1} p(x̃_0, x̃_1, q_1) p(x̃_2 | q_2) p(q_2 | q_1) = p(x̃_0, x̃_1, x̃_2, q_2)
φ(q_{t+1}) = ∑_{q_t} p(x̃_0, ..., x̃_t, q_t) p(x̃_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x̃_0, ..., x̃_{t+1}, q_{t+1})
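This left-to-right collect is exactly the forward (alpha) recursion. A minimal sketch (not from the lecture; parameter values are illustrative), which also produces the likelihood used on the next slide:

```python
import numpy as np

def forward(x_obs, pi, A, eta):
    """phi[t, i] = p(x_0, ..., x_t, q_t = i): the forward messages."""
    T, M = len(x_obs), len(pi)
    phi = np.zeros((T, M))
    phi[0] = pi * eta[:, x_obs[0]]                     # phi(q_0) = p(x_0, q_0)
    for t in range(T - 1):
        # phi(q_{t+1}) = sum_{q_t} phi(q_t) p(q_{t+1}|q_t) p(x_{t+1}|q_{t+1})
        phi[t + 1] = (phi[t] @ A) * eta[:, x_obs[t + 1]]
    return phi

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = forward(np.array([0, 1, 1, 0]), pi, A, eta)
print(phi[-1].sum())   # p(x_0, ..., x_{T-1}): the evaluation step
```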
24 / 34

Evaluation

Compute the likelihood of the sequence. Collection is sufficient. From the previous slide:

φ(q_{t+1}) = ∑_{q_t} p(x_0, ..., x_t, q_t) p(x_{t+1} | q_{t+1}) p(q_{t+1} | q_t) = p(x_0, ..., x_{t+1}, q_{t+1})

So the rightmost node gives:

φ(q_{T-1}) = p(x_0, ..., x_{T-1}, q_{T-1})

The likelihood just requires marginalization over q_{T-1}:

p(x_0, ..., x_{T-1}) = ∑_{q_{T-1}} p(x_0, ..., x_{T-1}, q_{T-1}) = ∑_{q_{T-1}} φ(q_{T-1})

25 / 34

Distribute
But the potentials cannot be read as marginals without the Distribute step of the JTA. The last step of collection was:

ψ*(q_{T-2}, q_{T-1}) = φ(q_{T-2}) ζ(q_{T-1}) ψ(q_{T-2}, q_{T-1}) = φ(q_{T-2}) p(x̃_{T-1} | q_{T-1}) a_{q_{T-2}, q_{T-1}}

Distribute φ** back along the state nodes to the left. Distribute ζ** down from the state nodes to the observation nodes. Update the potentials:

ψ**(q_{T-2}, q_{T-1}) = ψ*(q_{T-2}, q_{T-1})

φ**(q_t) = ∑_{q_{t+1}} ψ**(q_t, q_{t+1})

φ*(q_{t+1}) = ∑_{q_t} ψ*(q_t, q_{t+1})

ψ**(q_t, q_{t+1}) = (φ**(q_{t+1}) / φ*(q_{t+1})) ψ*(q_t, q_{t+1})
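The collect and distribute passes together amount to the forward-backward algorithm. A minimal sketch (illustrative parameters, not from the lecture) that computes the smoothed state posteriors p(q_t | x_0, ..., x_{T-1}):

```python
import numpy as np

def forward_backward(x_obs, pi, A, eta):
    """Return p(q_t = i | x_0, ..., x_{T-1}) for all t, i (smoothed posteriors)."""
    T, M = len(x_obs), len(pi)
    alpha = np.zeros((T, M))          # collect pass: p(x_0..x_t, q_t)
    beta = np.ones((T, M))            # distribute pass: p(x_{t+1}..x_{T-1} | q_t)
    alpha[0] = pi * eta[:, x_obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * eta[:, x_obs[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (eta[:, x_obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_backward(np.array([0, 1, 1, 0]), pi, A, eta))
```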


26 / 34

Decoding
Decode: given x̃_0, ..., x̃_{T-1}, identify the most likely q_0, ..., q_{T-1}. Now that the JTA is finished, we have marginals in the potentials and separators:

φ(q_t) ∝ p(q_t | x̃_0, ..., x̃_{T-1})
φ(q_{t+1}) ∝ p(q_{t+1} | x̃_0, ..., x̃_{T-1})
ψ(q_t, q_{t+1}) ∝ p(q_t, q_{t+1} | x̃_0, ..., x̃_{T-1})

We need to find the most likely path from q_0 to q_{T-1}: argmax JTA.
Run the JTA but, rather than sums in the update rule, use the max operator. Then find the largest entry in the separators:

q̂_t = argmax_{q_t} φ(q_t)
27 / 34

Viterbi Decoding

Finding an optimal state sequence by exhaustive enumeration is intractable: there are M^T possible paths for M states and T time steps.
T can easily be on the order of 1000 in speech recognition.

Construct a lattice of state transitions.
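A minimal max-product (Viterbi) sketch of dynamic programming over this lattice, in log space (parameters are illustrative, not from the lecture):

```python
import numpy as np

def viterbi(x_obs, pi, A, eta):
    """Most likely state path argmax_q p(q | x) via dynamic programming on the lattice."""
    T, M = len(x_obs), len(pi)
    log_A, log_eta = np.log(A), np.log(eta)
    delta = np.zeros((T, M))              # best log-score of any path ending in state i at time t
    back = np.zeros((T, M), dtype=int)    # backpointers along the lattice
    delta[0] = np.log(pi) + log_eta[:, x_obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A          # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_eta[:, x_obs[t]]
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
eta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(np.array([0, 1, 1, 0]), pi, A, eta))
```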

28 / 34

Viterbi Decoding

Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top-N paths. This is also known as beam search: a polynomial-time algorithm to approximately decode a lattice.

Algorithm: initialize paths at every state; for each transition, follow only the most likely edge. Or: initialize paths at every state; for each transition, follow only those paths whose likelihood exceeds some threshold.

29 / 34

Viterbi decoding

30 / 34

Maximum Likelihood
Training parameters with observed states. Maximum likelihood (as ever).

l(θ) = log p(q, x)
     = log [ p(q_0) ∏_{t=1}^{T-1} p(q_t | q_{t-1}) ∏_{t=0}^{T-1} p(x_t | q_t) ]
     = log p(q_0) + ∑_{t=1}^{T-1} log p(q_t | q_{t-1}) + ∑_{t=0}^{T-1} log p(x_t | q_t)
     = log ∏_{i=0}^{M-1} [π_i]^{q_0^i} + ∑_{t=1}^{T-1} log ∏_{i=0}^{M-1} ∏_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j} + ∑_{t=0}^{T-1} log ∏_{i=0}^{M-1} ∏_{j=0}^{N-1} [η_{ij}]^{q_t^i x_t^j}
     = ∑_{i=0}^{M-1} q_0^i log π_i + ∑_{t=1}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{M-1} q_{t-1}^i q_t^j log a_{ij} + ∑_{t=0}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{N-1} q_t^i x_t^j log η_{ij}

Introduce Lagrange multipliers for the normalization constraints, take partials, and set to zero.


31 / 34

Maximum Likelihood
Training parameters with observed states. Maximum likelihood as ever.
l(θ) = ∑_{i=0}^{M-1} q_0^i log π_i + ∑_{t=1}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{M-1} q_{t-1}^i q_t^j log a_{ij} + ∑_{t=0}^{T-1} ∑_{i=0}^{M-1} ∑_{j=0}^{N-1} q_t^i x_t^j log η_{ij}

Introduce Lagrange multipliers for the constraints

∑_{i=0}^{M-1} π_i = 1,   ∑_{j=0}^{M-1} a_{ij} = 1,   ∑_{j=0}^{N-1} η_{ij} = 1,

take partials, and set to zero. The solutions are the empirical frequencies:

π̂_i = q_0^i

â_{ij} = ( ∑_{t=0}^{T-2} q_t^i q_{t+1}^j ) / ( ∑_{k=0}^{M-1} ∑_{t=0}^{T-2} q_t^i q_{t+1}^k )

η̂_{ij} = ( ∑_{t=0}^{T-1} q_t^i x_t^j ) / ( ∑_{k=0}^{N-1} ∑_{t=0}^{T-1} q_t^i x_t^k )
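These closed-form estimates are just normalized counts; a small sketch (illustrative sequences, not from the lecture):

```python
import numpy as np

def mle_counts(q, x, M, N):
    """Closed-form ML estimates of (pi, A, eta) from one fully observed sequence."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    eta = np.zeros((M, N))
    pi[q[0]] = 1.0                                  # pi_i = q_0^i
    for t in range(len(q) - 1):
        A[q[t], q[t + 1]] += 1                      # transition counts
    for t in range(len(q)):
        eta[q[t], x[t]] += 1                        # emission counts
    # Normalize each row (rows with no counts would need smoothing; omitted here).
    A /= A.sum(axis=1, keepdims=True)
    eta /= eta.sum(axis=1, keepdims=True)
    return pi, A, eta

q = np.array([0, 0, 1, 1, 0])
x = np.array([0, 1, 1, 0, 0])
print(mle_counts(q, x, M=2, N=2))
```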


32 / 34


Expectation Maximization

However, we may not have observed state sequences.


The Moody Croupier

Need to do unsupervised learning (clustering) on the states.


Maximize the expected likelihood given a guess for p(q): Expectation Maximization. Covered when we move to unsupervised techniques.

33 / 34

Bye

Next
Perceptron and Neural Networks

34 / 34
