• Bayesian Networks
• Implementation
Reinforcement Learning
...
Markov Chain
• Negative rewards
...
MDP
• Markov property:
  • The Markov property means that the outcome of an action depends only on the current state.
  • It is independent of all earlier states, so the past history can be ignored.
• Markov property:
  $P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$
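To make the Markov property concrete, here is a minimal, hypothetical Python sketch of a plain Markov chain (no actions; the state names and probabilities are illustrative assumptions, not taken from the slides). The next-state distribution depends only on the current state:

import random

# Hypothetical transition probabilities: P(next state | current state).
# Note that the sampler never looks at earlier states, only the current one.
TRANSITIONS = {
    'hungry': {'hungry': 0.4, 'happy': 0.3, 'sleepy': 0.3},
    'happy':  {'hungry': 0.4, 'happy': 0.4, 'sleepy': 0.2},
    'sleepy': {'hungry': 0.5, 'happy': 0.3, 'sleepy': 0.2},
}

def next_state(current):
    """Sample S_{t+1} given only S_t = current (the Markov property)."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state = 'happy'
for _ in range(5):
    state = next_state(state)
    print(state)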
HMM
Types
• Profile HMM
• A linear state machine consisting of a series of nodes, each of which corresponds
roughly to a position in the alignment from which it was built.
• Hierarchical HMM
• Each state is considered to be a self-contained probabilistic model; i.e., each state
of an HHMM is itself an HHMM.
• Factorial HMM
• Allows for a single observation to be conditioned on the corresponding variables of a set of
𝐾 independent Markov chains, rather than using a single Markov chain.
• Coupled HMM
• State-space models that form a natural extension of the standard HMM.
• Layered HMM
• Consists of 𝑁 levels of HMMs, where the HMMs on level 𝑖 + 1 correspond to
observation symbols or probability generators at level 𝑖.
.
Applications
• Pattern Recognition / Activity Recognition (e.g., Samsung Health) ...
• Robot Localization …
• Transportation Forecasting
• Gene Prediction
...
Applications
• Language Modeling
• Sentence completion
• Predictive text input
• Classification
• Naive Bayes == unigram model
• Machine translation
• Text-to-Speech
Note) the number of states and observations do not necessarily need to be the same ...
HMM
HMM Model 𝝀 = (𝑨, 𝑩, 𝝅)
• Transition probability 𝒂𝒊𝒋
  • Based on experience, if the baby was in state 𝒙𝒊 an hour ago (time 𝒕 − 𝟏), we may
  know the probability of it being in state 𝒙𝒋 now (time 𝒕).
  $A = [a_{ij}]$ where $a_{ij} = p(x_t^j \mid x_{t-1}^i)$
• Emission probability 𝒆𝒋𝒌
  • Also from experience with many babies, we can infer the probability that we
  observe a behavior 𝒐𝒌 when the baby is in state 𝒙𝒋.
  $B = [e_{jk}]$ where $e_{jk} = p(o_k \mid x_j)$
• If the baby is always in one of the states, the sum of transition probabilities is 1:
  $\sum_j P(x_t^j \mid x_{t-1}^i) = \sum_j a_{ij} = 1$
• If the baby always reacts with one of the behaviors, the sum of emission probabilities is 1:
  $\sum_k P(o_k \mid x_j) = \sum_k e_{jk} = 1$
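As a minimal sketch, the model 𝝀 = (𝑨, 𝑩, 𝝅) for the baby example can be stored as numpy arrays and the two normalization constraints checked directly. The numbers below are illustrative assumptions (loosely based on the tables used later), not the slide's exact values:

import numpy as np

hidden = ['hungry', 'happy', 'sleepy']               # hidden states x_i
behaviors = ['distressed', 'crying', 'smiling']      # observable behaviors o_k

pi = np.array([0.4, 0.3, 0.3])                       # initial distribution pi_i
A = np.array([[0.4, 0.3, 0.3],                       # a_ij = p(x_t = j | x_{t-1} = i)
              [0.4, 0.4, 0.2],
              [0.5, 0.3, 0.2]])
B = np.array([[0.4, 0.4, 0.2],                       # e_jk = p(o_t = k | x_t = j)
              [0.2, 0.01, 0.79],
              [0.5, 0.4, 0.1]])

assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)               # sum_j a_ij = 1
assert np.allclose(B.sum(axis=1), 1.0)               # sum_k e_jk = 1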
...
HMM
Tabular Presentation
Initial State Distribution 𝝅𝒊
  $P(x_1 = \text{Hungry}) = 0.4$
  $P(x_1 = \text{Happy}) = 0.3$
  $P(x_1 = \text{Sleepy}) = 0.3$
[Tables: transition probabilities $a_{ij}$ (rows: present state, columns: future state) and emission probabilities $e_{jk}$ (rows: hidden state, columns: observed behavior) were shown here.]
Pattern Recognition
• Given the model 𝝀 = (𝑨, 𝑩, 𝝅) and a sequence of observations 𝑶 = {𝒐𝟏, 𝒐𝟐, …, 𝒐𝑻}, how do we find 𝑷(𝑶 ∣ 𝝀)?
• Solution: Forward Algorithm
[Trellis figure: hidden states $x_0, x_1, x_2, \ldots, x_{t-1}, x_t$ emit observations $o_1, o_2, \ldots, o_{t-1}, o_t$; the observation at time $t$ depends only on the state at time $t$ (e.g., $o_2$ depends on $x_2$).]
...
HMM
...
The Forward Algorithm
We need to compute:
$\arg\max_{x^i \in \{\text{hungry},\,\text{happy},\,\text{sleepy}\}} p(\text{smiling}_t \mid x_t^i, \text{hungry}_{t-1}) = \arg\max_{x^i} p(o_t^k = \text{smiling} \mid x_t^i,\; x_{t-1}^j = \text{hungry})$
(i.e., which value of the current state $x_t^i$ gives the highest probability, given that the previous state was hungry)
...
The Forward Algorithm
...
The forward algorithm
...
The Forward Algorithm
$P(\mathcal{X}, O) = \sum_{r=1}^{R} P(O \mid \mathcal{X}_r)\, P(\mathcal{X}_r)$
But we know that
$P(\mathcal{X}_r) = \prod_{t=1}^{T} P(x_t^j \mid x_{t-1}^i) = \prod_{t=1}^{T} a_{ij}$
and
$P(O \mid \mathcal{X}_r) = \prod_{t=1}^{T} P(o_t^k \mid x_t^j) = \prod_{t=1}^{T} e_{jk}$
therefore
$P(\mathcal{X}, O) = \sum_{r=1}^{R} \prod_{t=1}^{T} a_{ij}\, e_{jk}$
...
The Forward Algorithm
$P(\mathcal{X}, O) = \sum_{r=1}^{N^T} \prod_{t=1}^{T} a_{ij}\, e_{jk}$
Complexity: $O(N^T \cdot T)$, since the sum runs over all $N^T$ possible state sequences.
...
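For intuition, a rough back-of-the-envelope operation count (N and T here are illustrative assumptions; the forward algorithm discussed next brings the cost down to roughly N^2 per time step):

# Illustrative operation counts: naive enumeration vs. the forward algorithm.
N, T = 3, 15                       # assumed: 3 hidden states, 15 observations
naive_ops = (N ** T) * T           # enumerate all N^T state sequences, T steps each
forward_ops = (N ** 2) * T         # forward recursion: about N^2 work per time step
print(naive_ops, forward_ops)      # 215233605 vs. 135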
The Forward Algorithm
Complexity Reduction
• Partial probability (over all previous observations, including the current one):
$\alpha_i(t) = p(x_t^i, o_{1:t})$
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i, x_{t-1}^j, o_{1:t-1})\; p(x_t^i \mid x_{t-1}^j, o_{1:t-1})\; p(x_{t-1}^j, o_{1:t-1})$
where the three factors reduce to $p(o_t \mid x_t^i)$, $p(x_t^i \mid x_{t-1}^j)$, and $\alpha_j(t-1)$.
...
The Forward Algorithm
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i, x_{t-1}^j, o_{1:t-1})\; p(x_t^i \mid x_{t-1}^j, o_{1:t-1})\; p(x_{t-1}^j, o_{1:t-1})$
therefore
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
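A minimal numpy sketch of this recursion (assuming the pi, A, B arrays from the earlier sketch, and obs given as a sequence of observation indices):

import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm sketch: alpha[t, i] = p(x_t = i, o_1..o_t)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_i(1) = pi_i * e_i(o_1)
    for t in range(1, T):
        # alpha_i(t) = p(o_t | x_t^i) * sum_j p(x_t^i | x_{t-1}^j) * alpha_j(t-1)
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha, alpha[-1].sum()                # P(O | lambda) = sum_i alpha_i(T)

# Example call: three hourly observations, e.g. smiling, crying, distressed
# alpha, prob = forward([2, 1, 0], pi, A, B)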
Suppose we observed the baby for three hours and its behavior at each hour was:
...
The Forward Algorithm
Trellis representation (sl: sleepy, ha: happy, hu: hungry)
• Observation at 𝒕 = 𝟏: 'smiling' (sm)
[Trellis figure: the three states at each time step, starting from the initial probabilities 0.4 (hu), 0.3 (ha), 0.3 (sl).]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
• Observation at 𝒕 = 𝟏: smiling (sm)
• We repeat the same procedure for 'happy':
[Trellis figure: incoming terms such as $P(sm \mid ha)\, P(ha \mid ha)\, \alpha_{ha}(0)$ are summed to give $\alpha_{ha}(1)$, and similarly for $\alpha_{ha}(2)$.]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
• Observation at 𝒕 = 𝟏: smiling (sm)
• We repeat the same procedure for 'hungry':
[Trellis figure: incoming terms such as $P(sm \mid hu)\, P(hu \mid hu)\, \alpha_{hu}(0)$ are summed to give $\alpha_{hu}(1)$, and similarly for $\alpha_{hu}(2)$.]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
Observation at 𝒕 = 𝟐: distressed
We follow the same procedure as for time 𝑡 = 1.
[Trellis figure for the second observation.]
.
• The second problem:
• Decoding (inference): to find the sequence of hidden
states based on the current model and observation.
• Given 𝑶 = {𝑶𝟏, 𝑶𝟐, …, 𝑶𝑻} and the model 𝝀, how do we find the best
sequence of states 𝑿 = {𝒙𝟏, 𝒙𝟐, …, 𝒙𝑻} corresponding to the
observations?
• Solution: Viterbi algorithm
...
Viterbi Algorithm
• Just like the trellis, but maintaining only the highest probability at
each timestep 𝒕.
...
Viterbi Algorithm
The Algorithm
• We defined:
• 𝒂𝒊𝒋 : the transition probability from state 𝒊 to 𝒋.
• 𝝅𝒊 : the initial probability of being in state 𝒊.
• 𝑿: the set of possible states.
• Initialization step:
  $V_{1,k} = p(o_1 \mid x_1^k)\, \pi_k \quad \forall\, k \in X$
• Recursion step:
  • For each timeslice $t > 1$
  • For each state $k \in X$:
  $V_{t,k} = \max_{j \in X} p(o_t \mid x_t^k)\, a_{jk}\, V_{t-1,j}$ (the Viterbi value: the maximum probability of being in state 𝒌 at time 𝒕)
Note) in the Forward algorithm we took a sum, but here we take a max!
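A minimal numpy sketch of the Viterbi recursion (same assumed pi, A, B arrays as in the forward sketch; backpointers recover the best state sequence):

import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi sketch: V[t, k] = highest probability of any state path ending in k at t."""
    T, N = len(obs), len(pi)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]                     # V_{1,k} = p(o_1 | x_1^k) * pi_k
    for t in range(1, T):
        # V_{t,k} = max_j p(o_t | x_t^k) * a_jk * V_{t-1,j}   (max instead of sum)
        scores = V[t - 1][:, None] * A           # scores[j, k] = V_{t-1,j} * a_jk
        back[t] = scores.argmax(axis=0)
        V[t] = B[:, obs[t]] * scores.max(axis=0)
    path = [int(V[-1].argmax())]                 # backtrack the most likely sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), V[-1].max()

# path, best_prob = viterbi([2, 1, 0], pi, A, B)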
Trellis representation
Step 0
• Step 0 simply lists all possible states at time 0 with their corresponding probability values: $\alpha_{hu}(0) = 0.4$, $\alpha_{ha}(0) = 0.3$, $\alpha_{sl}(0) = 0.3$ (the initial distribution $\pi$).
• We do not decide which state is chosen at this stage.
...
Trellis representation
Step 1
Out of the 3 inputs, the maximum value is selected.
[Trellis figure: candidate terms such as $P(sm \mid ha)\, P(ha \mid ha)\, \alpha_{ha}(0)$ are compared and only the maximum is kept as $\alpha_{ha}(1)$, and similarly for $\alpha_{ha}(2)$.]
...
...
Baum-Welch
Forward-backward algorithm
Refining the model
• The transition and observation probabilities come from
somewhere, but they may not always be available, or may not be accurate.
• Idea: use an estimation technique (e.g. EM) to find parameters
that better fit the observations, i.e., learning.
...
Baum-Welch
Forward-backward algorithm
...
Baum-Welch
Forward-backward algorithm
• Now, we need to update our probabilities.
• Remember that $a_{ij}$ is the transition probability $p(x_t^j \mid x_{t-1}^i)$ and $e_{jk}$ is the emission probability $p(o_t^k \mid x_t^j)$.
$\sum_{t=1}^{T} \gamma_i(t)$ (the expected number of time steps spent in state $i$; see the re-estimation formulas below)
...
Baum-Welch
Forward-backward algorithm
where
$p(O \mid a_{ij}, e_{jk}) = \sum_{s=1}^{N} \alpha_s(t-1)\, \beta_s(t)$
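For reference, the standard (textbook) Baum-Welch re-estimation quantities are sketched below, using the common convention $\beta_i(t) = p(o_{t+1:T} \mid x_t^i)$; the slide's own update formulas did not survive extraction, so this is an assumption about their intended form:

\gamma_i(t) = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_s \alpha_s(t)\,\beta_s(t)}, \qquad
\xi_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, e_{j,\,o_{t+1}}\, \beta_j(t+1)}{\sum_s \alpha_s(t)\,\beta_s(t)}

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}, \qquad
\hat{e}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t)\,[o_t = k]}{\sum_{t=1}^{T} \gamma_j(t)}, \qquad
\hat{\pi}_i = \gamma_i(1)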
...
Baum-Welch
Forward-backward algorithm
.
Implementation
• We want to model the probability that the baby will be in each of
three states in the future, given its current state.
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
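The snippet below uses a list of observable states and a transition DataFrame q_df that were presumably defined on an earlier slide (not shown here); a hypothetical reconstruction, consistent with the node list and edge weights printed below, would be:

# Hypothetical reconstruction of the observable states and transition matrix q_df
# (names and values inferred from the printed output further down).
states = ['distressed', 'crying', 'smiling']
q_df = pd.DataFrame([[0.4, 0.4, 0.5],
                     [0.3, 0.4, 0.3],
                     [0.3, 0.2, 0.2]],
                    index=states, columns=states)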
def _get_markov_edges(Q):
    # Collect (origin, destination) -> transition probability from the DataFrame.
    edges = {}
    for col in Q.columns:
        for idx in Q.index:
            edges[(idx, col)] = Q.loc[idx, col]
    return edges

edges_wts = _get_markov_edges(q_df)
pprint(edges_wts)

{('crying', 'crying'): 0.4,
 ('crying', 'distressed'): 0.3,
 ('crying', 'smiling'): 0.3,
 ('distressed', 'crying'): 0.4,
 ('distressed', 'distressed'): 0.4,
 ('distressed', 'smiling'): 0.5,
 ('smiling', 'crying'): 0.2,
 ('smiling', 'distressed'): 0.3,
 ('smiling', 'smiling'): 0.2}
• After that, we are ready to create our graph: the graph object,
nodes, edges, and labels.
# graph object
G = nx.MultiDiGraph()

# nodes correspond to states
G.add_nodes_from(states)
print(f'Nodes:\n{G.nodes()}\n')

# edges represent transition probabilities
for k, v in edges_wts.items():
    tmp_origin, tmp_destination = k[0], k[1]
    G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
print(f'Edges:')
pprint(G.edges(data=True))

pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos)

# edge labels
edge_labels = {(n1, n2): d['label'] for n1, n2, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
nx.drawing.nx_pydot.write_dot(G, 'pet_baby_markov.dot')
Nodes: ['distressed', 'crying', 'smiling']
Edges: OutMultiEdgeDataView([
('distressed', 'distressed', {'weight': 0.4, 'label': 0.4}),
('distressed', 'crying', {'weight': 0.4, 'label': 0.4}),
('distressed', 'smiling', {'weight': 0.5, 'label': 0.5}),
('crying', 'distressed', {'weight': 0.3, 'label': 0.3}),
('crying', 'crying', {'weight': 0.4, 'label': 0.4}),
('crying', 'smiling', {'weight': 0.3, 'label': 0.3}),
('smiling', 'distressed', {'weight': 0.3, 'label': 0.3}),
('smiling', 'crying', {'weight': 0.2, 'label': 0.2}),
('smiling', 'smiling', {'weight': 0.2, 'label': 0.2})])
• Now, we move to the HMM.
• If we see the baby is distressed, how can we infer whether the reason is being
hungry or sleepy?
• In this situation, the true state of the baby is unknown to us, thus "hidden"
from us.
• One way to model this situation is to assume that the baby has observable
behaviors that represent the true, hidden state.
print(a_df)
a = a_df.values
print('\n', a, a.shape, '\n')
print(a_df.sum(axis=1))

        Hungry  Happy  Sleepy
Hungry     0.4    0.4     0.5
Happy      0.3    0.4     0.3
Sleepy     0.3    0.2     0.2

 [[0.4 0.4 0.5]
 [0.3 0.4 0.3]
 [0.3 0.2 0.2]] (3, 3)

Hungry    1.3
Happy     1.0
Sleepy    0.7
dtype: float64
• Next, we create the emission (observation) probability matrix: the
probability of each observable behavior, given the hidden state the baby is in.
• 𝑀 × 𝑂: 𝑀 is the number of hidden states and 𝑂 is the number of possible
observations.
observable_states = states
b_df = pd.DataFrame(columns=observable_states, index=hidden_states)
b_df.loc[hidden_states[0]] = [0.4, 0.4, 0.2]
b_df.loc[hidden_states[1]] = [0.2, 0.01, 0.79]
b_df.loc[hidden_states[2]] = [0.5, 0.4, 0.1]
print(f'Edges:')
pprint(G.edges(data=True))
pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw_networkx(G, pos)

Edges: OutMultiEdgeDataView([
('Hungry', 'Hungry', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'Happy', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'Sleepy', {'weight': 0.5, 'label': 0.5}),
('Hungry', 'distressed', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'crying', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'smiling', {'weight': 0.2, 'label': 0.2}),
('Happy', 'Hungry', {'weight': 0.3, 'label': 0.3}),
('Happy', 'Happy', {'weight': 0.4, 'label': 0.4}),
('Happy', 'Sleepy', {'weight': 0.3, 'label': 0.3}),
('Happy', 'distressed', {'weight': 0.2, 'label': 0.2}),
('Happy', 'crying', {'weight': 0.01, 'label': 0.01}),
('Happy', 'smiling', {'weight': 0.79, 'label': 0.79}),
('Sleepy', 'Hungry', {'weight': 0.3, 'label': 0.3}),
('Sleepy', 'Happy', {'weight': 0.2, 'label': 0.2}),
('Sleepy', 'Sleepy', {'weight': 0.2, 'label': 0.2}),
('Sleepy', 'distressed', {'weight': 0.5, 'label': 0.5}),
('Sleepy', 'crying', {'weight': 0.4, 'label': 0.4}),
('Sleepy', 'smiling', {'weight': 0.1, 'label': 0.1})])

          hungry  happy  sleepy  distressed  crying  smiling
hungry       0.4    0.4     0.5         0.4     0.4      0.2
happy        0.3    0.4     0.3         0.2    0.01     0.79
sleepy       0.3    0.2     0.2         0.5     0.4      0.1
The dotfile:
For example, if the baby is “smiling”, there is a high probability that it is “happy” and a very low probability
that it is “hungry”.
The dotfile in graph:
[Graph figure: the hidden states (Hungry, Happy, Sleepy) with transition edges among them and emission edges to the observed behaviors (smiling, crying, distressed), labeled with the weights listed above.]
• Suppose we have the following observation sequence of the baby's behavior, encoded numerically:
obs_map = {'distressed': 0, 'crying': 1, 'smiling': 2}
obs = np.array([1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 0, 1, 0, 1])

b = b_df.values            # emission matrix as a numpy array (mirrors a = a_df.values)
nStates = np.shape(b)[0]
T = np.shape(obs)[0]
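As a possible next step, this sequence could be fed to the forward and Viterbi sketches given earlier; the usage below is hypothetical (it assumes those sketch functions, and note that the printed a matrix rows do not all sum to 1, so the resulting numbers are purely illustrative):

pi = np.array([0.4, 0.3, 0.3])      # initial distribution from the pi table
alpha, prob_obs = forward(obs, pi, a, b)
path, best_prob = viterbi(obs, pi, a, b)
print('P(O | lambda) =', prob_obs)
print('Most likely hidden states:', [hidden_states[i] for i in path])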