Andrew Rosenberg
1 / 34
Last Time
Clustering
2 / 34
Today
Hidden Markov Models
3 / 34
Dice Example
Imagine a game of dice. When the croupier rolls 4, 5, or 6, you win. When the croupier rolls 1, 2, or 3, you lose. Model the likelihood of winning: IID multinomials.
4 / 34
Now imagine that the croupier cheats. There are three dice.
One fair (fair), one good for the house (bad), one good for you (good).
5 / 34
[Graphical model: die choice $q_i$ with an arrow to roll $x_i$, for $i \in \{0, \ldots, n-1\}$]
6 / 34
Model the likelihood of winning: IID multinomials with a latent variable. Allow a prior over die choices.
[Graphical model: prior over die choice $q_i$, with an arrow from $q_i$ to roll $x_i$, for $i \in \{0, \ldots, n-1\}$]
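To make this concrete, here is a minimal sketch of the latent-die likelihood, assuming illustrative face probabilities and prior (none of these numbers come from the slides):

```python
import numpy as np

# Hypothetical face probabilities for the three dice (rows: fair, bad, good).
# These values are assumptions for illustration only.
dice = np.array([
    [1/6] * 6,                                 # fair
    [0.25, 0.25, 0.25, 0.10, 0.10, 0.05],      # bad: favors 1-3 (the house wins)
    [0.05, 0.10, 0.10, 0.25, 0.25, 0.25],      # good: favors 4-6 (you win)
])
prior = np.array([0.6, 0.2, 0.2])              # assumed prior over the die choice

rolls = np.array([4, 1, 6, 3, 5]) - 1          # observed faces, 0-indexed

# IID model with a latent die per roll:
#   p(x_i) = sum_k p(q_i = k) p(x_i | q_i = k)
per_roll = dice[:, rolls]                      # shape (3, T): p(x_i | q_i = k)
likelihood = np.prod(prior @ per_roll)         # product over IID rolls
print(likelihood)
```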
7 / 34
8 / 34
Sequential Modeling
[Graphical model: Markov chain $q_0 \rightarrow q_1 \rightarrow \cdots \rightarrow q_{T-1}$, with an emission $x_t$ from each state $q_t$]
Temporal or sequence model.
Markov Assumption: future ⊥ past | present
$p(q_t \mid q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t \mid q_{t-1})$
Get the overall likelihood from the graphical model:
$p(\mathbf{x}) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
9 / 34
Sequential Modeling
Markov Assumption: future ⊥ past | present
$p(q_t \mid q_{t-1}, q_{t-2}, q_{t-3}, \ldots, q_0) = p(q_t \mid q_{t-1})$
Get the overall likelihood from the graphical model:
$p(\mathbf{x}) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
What is $p(q_t \mid q_{t-1})$?
10 / 34
HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable. We can consider HMMs as stochastic automata: weighted finite state machines.
11 / 34
[State transition diagram over the three dice (bad, fair, good), with transition probabilities such as 0.2, 0.3, and 0.5 on the edges]
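As a concrete illustration of $p(q_t \mid q_{t-1})$ for this three-die state space, here is a small sketch; the matrix entries are assumed stand-ins, since the exact edge probabilities in the diagram are not fully recoverable:

```python
import numpy as np

# Hypothetical transition matrix over the croupier's three dice, in the spirit
# of the diagram above. Each row must sum to 1 (row-stochastic).
states = ["bad", "fair", "good"]
A = np.array([
    [0.5, 0.3, 0.2],   # from "bad"
    [0.3, 0.5, 0.2],   # from "fair"
    [0.2, 0.3, 0.5],   # from "good"
])
assert np.allclose(A.sum(axis=1), 1.0)

# p(q_t = j | q_{t-1} = i) is a table lookup:
print(A[states.index("fair"), states.index("good")])
```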
12 / 34
HMMs have two variables: state q and emission x. In general the state is an unobserved latent variable. There is no observation of q directly, only a related emission distribution: a doubly-stochastic automaton.
13 / 34
HMM Applications
Speech Recognition (Rabiner): phonemes from audio cepstral vectors
Language (Jelinek): part-of-speech tags from words
Biology (Baldi): splice sites from gene sequences
Gesture (Starner): words from hand coordinates
Emotion (Picard): emotion from EEG
14 / 34
Types of Variables
Continuous States
E.g., Kalman filters: $p(q_t \mid q_{t-1}) = \mathcal{N}(q_t \mid A q_{t-1}, Q)$
Discrete States
E.g., finite state machine: $p(q_t \mid q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j}$
Continuous Observations
E.g., time series data: $p(x_t \mid q_t) = \mathcal{N}(x_t \mid \mu_{q_t}, \Sigma_{q_t})$
Discrete Observations
E.g., strings: $p(x_t \mid q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [b_{ij}]^{q_t^i x_t^j}$
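A short sketch contrasting discrete and continuous emission distributions; the state count, emission table, means, and variances are all assumed toy values:

```python
import numpy as np

# Discrete emissions: p(x_t | q_t = i) is a row of a stochastic matrix B.
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])               # 2 states, 3 observation classes (assumed)
print(B[1, 2])                                # p(x_t = 2 | q_t = 1)

# Continuous emissions: p(x_t | q_t = i) is a Gaussian density N(x_t | mu_i, sigma_i^2).
mu = np.array([0.0, 3.0])                     # assumed per-state means
sigma = np.array([1.0, 0.5])                  # assumed per-state standard deviations

def gaussian_emission(x_t, i):
    z = (x_t - mu[i]) / sigma[i]
    return np.exp(-0.5 * z * z) / (sigma[i] * np.sqrt(2 * np.pi))

print(gaussian_emission(2.5, 1))              # density of x_t = 2.5 under state 1
```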
15 / 34
HMM Parameters
M states and N-class observations. Complete likelihood from the graphical model:
$p(q, \mathbf{x}) = p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t)$
Transitions: $p(q_t \mid q_{t-1}) = \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j}$, with $\sum_{j=0}^{M-1} a_{ij} = 1$
Emissions: $p(x_t \mid q_t) = \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [b_{ij}]^{q_t^i x_t^j}$, with $\sum_{j=0}^{N-1} b_{ij} = 1$
Initial state: $p(q_0) = \prod_{i=0}^{M-1} [\pi_i]^{q_0^i}$, with $\sum_{i=0}^{M-1} \pi_i = 1$
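A minimal sketch of these parameters and the complete-data log likelihood, with assumed sizes (M = 3 states, N = 6 observation classes) and randomly drawn row-stochastic tables:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                      # p(q_0), sums to 1
A = rng.dirichlet(np.ones(3), size=3)               # a_ij = p(q_t = j | q_{t-1} = i), rows sum to 1
B = rng.dirichlet(np.ones(6), size=3)               # b_ij = p(x_t = j | q_t = i), rows sum to 1

def complete_log_likelihood(q, x, pi, A, B):
    """log p(q, x) = log pi[q_0] + sum_t log A[q_{t-1}, q_t] + sum_t log B[q_t, x_t]."""
    ll = np.log(pi[q[0]])
    ll += np.log(A[q[:-1], q[1:]]).sum()
    ll += np.log(B[q, x]).sum()
    return ll

q = np.array([0, 1, 1, 2, 0])                       # a fully observed state sequence
x = np.array([3, 5, 0, 2, 4])                       # the corresponding observations
print(complete_log_likelihood(q, x, pi, A, B))
```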
16 / 34
HMM Operations
Evaluate
Evaluate the likelihood of a model given data.
Decode
Identify the most likely sequence of states
Max Likelihood
Estimate the parameters.
17 / 34
JTA on HMM
[Junction tree for the HMM: cliques $(q_{t-1}, q_t)$ and $(q_t, x_t)$, joined by separators over $q_t$]
18 / 34
JTA on HMMs
Initialization
$\psi(q_{t-1}, q_t) = p(q_t \mid q_{t-1})$
$\psi(q_0, q_1) = p(q_0)\, p(q_1 \mid q_0)$
$\psi(q_t, x_t) = p(x_t \mid q_t)$
$\phi(q_t) = 1$
$\zeta(q_t) = 1$
JTA on HMMs
Update
Collect up from the leaves; this does not change the zeta separators:
$\zeta(q_t) = \sum_{x_t} \psi(q_t, x_t) = \sum_{x_t} p(x_t \mid q_t) = 1$
$\psi(q_{t-1}, q_t) \leftarrow \zeta(q_t)\, \psi(q_{t-1}, q_t) = \psi(q_{t-1}, q_t)$
20 / 34
JTA on HMMs
Update
$\psi(q_0, q_1) = p(q_0)\, p(q_1 \mid q_0) = p(q_0, q_1)$
21 / 34
JTA on HMMs
Distribute
Distribute to the separators:
$\phi(q_t) = \sum_{q_{t-1}} \psi(q_{t-1}, q_t) = \sum_{q_{t-1}} p(q_{t-1}, q_t) = p(q_t)$
$\psi(q_t, x_t) = \phi(q_t)\, p(x_t \mid q_t) = p(q_t, x_t)$
22 / 34
Introduction of Evidence
$p(q \mid \bar{x}) \propto p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(\bar{x}_t \mid q_t)$
Observe a sequence of data. The potentials become slices:
$\psi(q_t, \bar{x}_t) = p(\bar{x}_t \mid q_t)$
$\zeta(q_t) = \psi(q_t, \bar{x}_t) = p(\bar{x}_t \mid q_t)$
JTA collect
Collecting up and to the left, updating the potentials by the left and bottom separators:
$\psi(q_t, q_{t+1}) \leftarrow \phi(q_t)\, \zeta(q_{t+1})\, \psi(q_t, q_{t+1}) = \phi(q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, p(q_{t+1} \mid q_t)$
$\phi(q_{t+1}) = \sum_{q_t} \psi(q_t, q_{t+1})$
Note:
$\phi(q_1) = \sum_{q_0} \psi(q_0, q_1) = p(\bar{x}_0, \bar{x}_1, q_1)$
$\phi(q_2) = \sum_{q_1} p(\bar{x}_0, \bar{x}_1, q_1)\, p(\bar{x}_2 \mid q_2)\, p(q_2 \mid q_1) = p(\bar{x}_0, \bar{x}_1, \bar{x}_2, q_2)$
$\phi(q_{t+1}) = \sum_{q_t} p(\bar{x}_0, \ldots, \bar{x}_t, q_t)\, p(\bar{x}_{t+1} \mid q_{t+1})\, p(q_{t+1} \mid q_t) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})$
24 / 34
Evaluation
Compute the likelihood of the sequence. Collection is sufficient. From the previous slide:
$\phi(q_{t+1}) = \sum_{q_t} \psi(q_t, q_{t+1}) = p(\bar{x}_0, \ldots, \bar{x}_{t+1}, q_{t+1})$
$p(\bar{x}_0, \ldots, \bar{x}_{T-1}) = \sum_{q_{T-1}} \phi(q_{T-1})$
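The collect pass corresponds to the classical forward recursion. A minimal sketch, with assumed toy parameters and observations (in practice one would work in log space or rescale to avoid underflow):

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])                 # p(q_0 = i), assumed
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])                # A[i, j] = p(q_t = j | q_{t-1} = i), assumed
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6],
              [0.3, 0.3, 0.4]])                # B[i, k] = p(x_t = k | q_t = i), assumed
x = [0, 2, 1, 1, 2]                            # observed sequence, assumed

# alpha_t(i) = p(x_0, ..., x_t, q_t = i): this is the separator phi(q_t)
# produced by the collect pass.
alpha = pi * B[:, x[0]]
for t in range(1, len(x)):
    alpha = (alpha @ A) * B[:, x[t]]
print(alpha.sum())                             # p(x_0, ..., x_{T-1})
```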
25 / 34
Distribute
But the potentials cannot be read as marginals without the Distribute step of the JTA.
Last state of collection: $\psi(q_{T-2}, q_{T-1}) = p(\bar{x}_0, \ldots, \bar{x}_{T-1}, q_{T-2}, q_{T-1})$
Distribute along the state nodes to the left, and down from the state nodes to the observation nodes, updating the potentials, e.g. $\phi(q_{t+1}) = \sum_{q_t} \psi(q_t, q_{t+1})$.
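A sketch of the full collect plus distribute computation as the forward-backward recursions, which yields posterior state marginals; the parameters and observations are the same assumed toy values as in the sketch above:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])                 # assumed toy parameters
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6],
              [0.3, 0.3, 0.4]])
x = [0, 2, 1, 1, 2]
T, M = len(x), len(pi)

# Forward (collect): alpha_t(i) = p(x_0..x_t, q_t = i)
alpha = np.zeros((T, M))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]

# Backward (distribute): beta_t(i) = p(x_{t+1}..x_{T-1} | q_t = i)
beta = np.ones((T, M))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

# After the distribute pass the separators are proportional to the posterior
# marginals gamma_t(i) = p(q_t = i | x_0..x_{T-1}).
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)
```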
Decoding
Decode: given $\bar{x}_0, \ldots, \bar{x}_{T-1}$, identify the most likely $q_0, \ldots, q_{T-1}$.
Now that the JTA is finished, we have marginals in the potentials and separators:
$\phi(q_t) \propto p(q_t \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
$\phi(q_{t+1}) \propto p(q_{t+1} \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
$\psi(q_t, q_{t+1}) \propto p(q_t, q_{t+1} \mid \bar{x}_0, \ldots, \bar{x}_{T-1})$
We need to find the most likely path from $q_0$ to $q_{T-1}$: argmax JTA.
Run the JTA, but use the max operator rather than sums in the update rule. Then find the largest entry in the separators:
$\hat{q}_t = \operatorname*{argmax}_{q_t} \phi(q_t)$
27 / 34
Viterbi Decoding
Finding an optimal state sequence by exhaustive search can be intractable: there are $M^T$ possible paths for $M$ states and $T$ time steps.
T can easily be on the order of 1000 in speech recognition.
28 / 34
Viterbi Decoding
Only continue to explore paths with likelihood greater than some threshold, or only continue to explore the top N paths. This is also known as beam search: a polynomial-time algorithm to approximately decode a lattice.
Algorithm: initialize paths at every state; for each transition, either follow only the most likely edge, or follow only those paths whose likelihood is over some threshold.
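For comparison with beam search, here is a minimal sketch of exact max-product (Viterbi) decoding with backpointers, again using assumed toy parameters and observations:

```python
import numpy as np

pi = np.array([0.5, 0.3, 0.2])                      # assumed toy parameters
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6],
              [0.3, 0.3, 0.4]])
x = [0, 2, 1, 1, 2]
T, M = len(x), len(pi)

# delta_t(j) = max over q_0..q_{t-1} of log p(q_0..q_{t-1}, q_t = j, x_0..x_t)
delta = np.zeros((T, M))
back = np.zeros((T, M), dtype=int)
delta[0] = np.log(pi) + np.log(B[:, x[0]])
for t in range(1, T):
    scores = delta[t - 1][:, None] + np.log(A)      # scores[i, j]: come from i into j
    back[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + np.log(B[:, x[t]])

# Backtrace the most likely state sequence.
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print(path)
```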
29 / 34
Viterbi decoding
30 / 34
Maximum Likelihood
Training parameters with observed states. Maximum likelihood (as ever).
$\ell(\theta) = \log p(q, \mathbf{x})$
$= \log\left[ p(q_0) \prod_{t=1}^{T-1} p(q_t \mid q_{t-1}) \prod_{t=0}^{T-1} p(x_t \mid q_t) \right]$
$= \log \prod_{i=0}^{M-1} [\pi_i]^{q_0^i} + \sum_{t=1}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{M-1} [a_{ij}]^{q_{t-1}^i q_t^j} + \sum_{t=0}^{T-1} \log \prod_{i=0}^{M-1} \prod_{j=0}^{N-1} [b_{ij}]^{q_t^i x_t^j}$
$= \sum_{i=0}^{M-1} q_0^i \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q_{t-1}^i q_t^j \log a_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q_t^i x_t^j \log b_{ij}$
Maximum Likelihood
Training parameters with observed states. Maximum likelihood as ever.
$\ell(\theta) = \sum_{i=0}^{M-1} q_0^i \log \pi_i + \sum_{t=1}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} q_{t-1}^i q_t^j \log a_{ij} + \sum_{t=0}^{T-1} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} q_t^i x_t^j \log b_{ij}$
Maximize subject to $\sum_{i=0}^{M-1} \pi_i = 1$, $\sum_{j=0}^{M-1} a_{ij} = 1$, and $\sum_{j=0}^{N-1} b_{ij} = 1$:
$\hat{\pi}_i = q_0^i$
$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} q_{t-1}^i q_t^j}{\sum_{t=1}^{T-1} q_{t-1}^i}$
$\hat{b}_{ij} = \dfrac{\sum_{t=0}^{T-1} q_t^i x_t^j}{\sum_{t=0}^{T-1} q_t^i}$
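These closed-form estimates are just normalized counts. A small sketch with an assumed observed state and emission sequence:

```python
import numpy as np

# Supervised ML estimation (states observed), matching the counting form of the
# estimates above. M, N, and the example sequences are assumed.
M, N = 3, 6
q = np.array([0, 1, 1, 2, 0, 0, 2, 1])        # observed state sequence
x = np.array([3, 5, 0, 2, 4, 4, 1, 5])        # observed emissions

pi_hat = np.zeros(M)
pi_hat[q[0]] = 1.0                            # with one sequence, pi is an indicator of q_0

A_counts = np.zeros((M, M))
np.add.at(A_counts, (q[:-1], q[1:]), 1.0)     # count q_{t-1} -> q_t transitions
A_hat = A_counts / A_counts.sum(axis=1, keepdims=True)

B_counts = np.zeros((M, N))
np.add.at(B_counts, (q, x), 1.0)              # count q_t -> x_t emissions
B_hat = B_counts / B_counts.sum(axis=1, keepdims=True)

print(pi_hat, A_hat, B_hat, sep="\n")
```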
Expectation Maximization
33 / 34
Bye
Next
Perceptron and Neural Networks
34 / 34