• Bayesian Networks
• Implementation
Reinforcement Learning
...
Markov Chain
• Negative rewards
...
MDP
• Markov property:
  • The Markov property means that the outcome of an action depends only on the current state.
  • It is independent of all earlier states, so the past history can be ignored.
• Markov property:
  $P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$
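To make the Markov property concrete, here is a minimal, hypothetical Python sketch of a plain Markov chain (no actions; the state names and probabilities are illustrative assumptions, not taken from the slides). The next-state distribution depends only on the current state:

import random

# Hypothetical transition probabilities: P(next state | current state).
# Note that the sampler never looks at earlier states, only the current one.
TRANSITIONS = {
    'hungry': {'hungry': 0.4, 'happy': 0.3, 'sleepy': 0.3},
    'happy':  {'hungry': 0.4, 'happy': 0.4, 'sleepy': 0.2},
    'sleepy': {'hungry': 0.5, 'happy': 0.3, 'sleepy': 0.2},
}

def next_state(current):
    """Sample S_{t+1} given only S_t = current (the Markov property)."""
    states, probs = zip(*TRANSITIONS[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state = 'happy'
for _ in range(5):
    state = next_state(state)
    print(state)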
HMM
Types
• Profile HMM
• A linear state machine consisting of a series of nodes, each of which corresponds
roughly to a position in the alignment from which it was built.
• Hierarchical HMM
• Each state is considered to be a self-contained probabilistic model; i.e., each state
of an HHMM is itself an HHMM.
• Factorial HMM
• Allows for a single observation to be conditioned on the corresponding variables of a set of
𝐾 independent Markov chains, rather than using a single Markov chain.
• Coupled HMM
• State-space models that form a natural extension of the standard HMM.
• Layered HMM
• Consists of 𝑁 levels of HMMs, where the HMMs on level 𝑖 + 1 correspond to
observation symbols or probability generators at level 𝑖.
.
Applications
• Pattern Recognition / Activity Recognition (e.g., Samsung Health) ...
• Robot Localization …
• Transportation Forecasting
• Gene Prediction
...
Applications
• Language Modeling
• Sentence completion
• Predictive text input
• Classification
• Naive Bayes == unigram model
• Machine translation
• Text-to-Speech
Note) the number of states and observations do not necessarily need to be the same ...
HMM
HMM Model 𝝀 = (𝑨, 𝑩, 𝝅)
• Transition probability 𝒂𝒊𝒋
  • Based on experience, if the baby was in state 𝒙𝒊 an hour ago (time 𝒕 − 𝟏), we may
  know the probability of it being in state 𝒙𝒋 now (time 𝒕).
  $A = [a_{ij}]$ where $a_{ij} = p(x_t^j \mid x_{t-1}^i)$
• Emission probability 𝒆𝒋𝒌
  • Also from experience with many babies, we can infer the probability that we
  observe a behavior 𝒐𝒌 when the baby is in state 𝒙𝒋.
  $B = [e_{jk}]$ where $e_{jk} = p(o_k \mid x_j)$
• If the baby is always in one of the states, the sum of transition probabilities is 1:
  $\sum_j P(x_t^j \mid x_{t-1}^i) = \sum_j a_{ij} = 1$
• If the baby always reacts with one of the behaviors, the sum of emission probabilities is 1:
  $\sum_k P(o_k \mid x_j) = \sum_k e_{jk} = 1$
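As a minimal sketch, the model 𝝀 = (𝑨, 𝑩, 𝝅) for the baby example can be stored as numpy arrays and the two normalization constraints checked directly. The numbers below are illustrative assumptions (loosely based on the tables used later), not the slide's exact values:

import numpy as np

hidden = ['hungry', 'happy', 'sleepy']               # hidden states x_i
behaviors = ['distressed', 'crying', 'smiling']      # observable behaviors o_k

pi = np.array([0.4, 0.3, 0.3])                       # initial distribution pi_i
A = np.array([[0.4, 0.3, 0.3],                       # a_ij = p(x_t = j | x_{t-1} = i)
              [0.4, 0.4, 0.2],
              [0.5, 0.3, 0.2]])
B = np.array([[0.4, 0.4, 0.2],                       # e_jk = p(o_t = k | x_t = j)
              [0.2, 0.01, 0.79],
              [0.5, 0.4, 0.1]])

assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)               # sum_j a_ij = 1
assert np.allclose(B.sum(axis=1), 1.0)               # sum_k e_jk = 1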
...
HMM
Tabular Presentation
Initial State Distribution 𝝅𝒊
  $P(x_1 = \text{Hungry}) = 0.4$
  $P(x_1 = \text{Happy}) = 0.3$
  $P(x_1 = \text{Sleepy}) = 0.3$
[Tables: transition probabilities $a_{ij}$ (rows: present state, columns: future state) and emission probabilities $e_{jk}$ (rows: hidden state, columns: observed behavior) were shown here.]
Pattern Recognition
• Given the model 𝝀 = (𝑨, 𝑩, 𝝅) and a sequence of observations 𝑶 = {𝒐𝟏, 𝒐𝟐, …, 𝒐𝑻}, how do we find 𝑷(𝑶 ∣ 𝝀)?
• Solution: Forward Algorithm
[Trellis figure: hidden states $x_0, x_1, x_2, \ldots, x_{t-1}, x_t$ emit observations $o_1, o_2, \ldots, o_{t-1}, o_t$; the observation at time $t$ depends only on the state at time $t$ (e.g., $o_2$ depends on $x_2$).]
...
HMM
...
The Forward Algorithm
We need to compute:
$\arg\max_{x^i \in \{\text{hungry},\,\text{happy},\,\text{sleepy}\}} p(\text{smiling}_t \mid x_t^i, \text{hungry}_{t-1}) = \arg\max_{x^i} p(o_t^k = \text{smiling} \mid x_t^i,\; x_{t-1}^j = \text{hungry})$
(i.e., which value of the current state $x_t^i$ gives the highest probability, given that the previous state was hungry)
...
The Forward Algorithm
...
The forward algorithm
...
The Forward Algorithm
$P(\mathcal{X}, O) = \sum_{r=1}^{R} P(O \mid \mathcal{X}_r)\, P(\mathcal{X}_r)$
But we know that
$P(\mathcal{X}_r) = \prod_{t=1}^{T} P(x_t^j \mid x_{t-1}^i) = \prod_{t=1}^{T} a_{ij}$
and
$P(O \mid \mathcal{X}_r) = \prod_{t=1}^{T} P(o_t^k \mid x_t^j) = \prod_{t=1}^{T} e_{jk}$
therefore
$P(\mathcal{X}, O) = \sum_{r=1}^{R} \prod_{t=1}^{T} a_{ij}\, e_{jk}$
...
The Forward Algorithm
$P(\mathcal{X}, O) = \sum_{r=1}^{N^T} \prod_{t=1}^{T} a_{ij}\, e_{jk}$
Complexity: $O(N^T \cdot T)$, since the sum runs over all $N^T$ possible state sequences.
...
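For intuition, a rough back-of-the-envelope operation count (N and T here are illustrative assumptions; the forward algorithm discussed next brings the cost down to roughly N^2 per time step):

# Illustrative operation counts: naive enumeration vs. the forward algorithm.
N, T = 3, 15                       # assumed: 3 hidden states, 15 observations
naive_ops = (N ** T) * T           # enumerate all N^T state sequences, T steps each
forward_ops = (N ** 2) * T         # forward recursion: about N^2 work per time step
print(naive_ops, forward_ops)      # 215233605 vs. 135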
The Forward Algorithm
Complexity Reduction
• Partial probability (over all previous observations, including the current one):
$\alpha_i(t) = p(x_t^i, o_{1:t})$
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i, x_{t-1}^j, o_{1:t-1})\; p(x_t^i \mid x_{t-1}^j, o_{1:t-1})\; p(x_{t-1}^j, o_{1:t-1})$
where the three factors reduce to $p(o_t \mid x_t^i)$, $p(x_t^i \mid x_{t-1}^j)$, and $\alpha_j(t-1)$.
...
The Forward Algorithm
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i, x_{t-1}^j, o_{1:t-1})\; p(x_t^i \mid x_{t-1}^j, o_{1:t-1})\; p(x_{t-1}^j, o_{1:t-1})$
therefore
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
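A minimal numpy sketch of this recursion (assuming the pi, A, B arrays from the earlier sketch, and obs given as a sequence of observation indices):

import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm sketch: alpha[t, i] = p(x_t = i, o_1..o_t)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_i(1) = pi_i * e_i(o_1)
    for t in range(1, T):
        # alpha_i(t) = p(o_t | x_t^i) * sum_j p(x_t^i | x_{t-1}^j) * alpha_j(t-1)
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha, alpha[-1].sum()                # P(O | lambda) = sum_i alpha_i(T)

# Example call: three hourly observations, e.g. smiling, crying, distressed
# alpha, prob = forward([2, 1, 0], pi, A, B)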
Suppose we observed the baby for three hours and its behavior at each hour was:
...
The Forward Algorithm
Trellis representation (sl: sleepy, ha: happy, hu: hungry)
• Observation at 𝒕 = 𝟏: 'smiling' (sm)
[Trellis figure: the three states at each time step, starting from the initial probabilities 0.4 (hu), 0.3 (ha), 0.3 (sl).]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
• Observation at 𝒕 = 𝟏: smiling (sm)
• We repeat the same procedure for 'happy':
[Trellis figure: incoming terms such as $P(sm \mid ha)\, P(ha \mid ha)\, \alpha_{ha}(0)$ are summed to give $\alpha_{ha}(1)$, and similarly for $\alpha_{ha}(2)$.]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
• Observation at 𝒕 = 𝟏: smiling (sm)
• We repeat the same procedure for 'hungry':
[Trellis figure: incoming terms such as $P(sm \mid hu)\, P(hu \mid hu)\, \alpha_{hu}(0)$ are summed to give $\alpha_{hu}(1)$, and similarly for $\alpha_{hu}(2)$.]
$\alpha_i(t) = \sum_j p(o_t \mid x_t^i)\; p(x_t^i \mid x_{t-1}^j)\; \alpha_j(t-1)$
...
The Forward Algorithm
Observation at 𝒕 = 𝟐: distressed
We follow the same procedure as for time 𝑡 = 1.
[Trellis figure for the second observation.]
.
• The second problem:
• Decoding (inference): to find the sequence of hidden
states based on the current model and observation.
• Given 𝑶 = {𝑶𝟏, 𝑶𝟐, …, 𝑶𝑻} and the model 𝝀, how do we find the best
sequence of states 𝑿 = {𝒙𝟏, 𝒙𝟐, …, 𝒙𝑻} corresponding to the
observations?
• Solution: Viterbi algorithm
...
Viterbi Algorithm
• Just like the trellis, but maintaining only the highest probability at
each timestep 𝒕.
...
Viterbi Algorithm
The Algorithm
• We defined:
• 𝒂𝒊𝒋 : the transition probability from state 𝒊 to 𝒋.
• 𝝅𝒊 : the initial probability of being in state 𝒊.
• 𝑿: the set of possible states.
• Initialization step:
  $V_{1,k} = p(o_1 \mid x_1^k)\, \pi_k \quad \forall\, k \in X$
• Recursion step:
  • For each timeslice $t > 1$
  • For each state $k \in X$:
  $V_{t,k} = \max_{j \in X} p(o_t \mid x_t^k)\, a_{jk}\, V_{t-1,j}$ (the Viterbi value: the maximum probability of being in state 𝒌 at time 𝒕)
Note) in the Forward algorithm we took a sum, but here we take a max!
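A minimal numpy sketch of the Viterbi recursion (same assumed pi, A, B arrays as in the forward sketch; backpointers recover the best state sequence):

import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi sketch: V[t, k] = highest probability of any state path ending in k at t."""
    T, N = len(obs), len(pi)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = pi * B[:, obs[0]]                     # V_{1,k} = p(o_1 | x_1^k) * pi_k
    for t in range(1, T):
        # V_{t,k} = max_j p(o_t | x_t^k) * a_jk * V_{t-1,j}   (max instead of sum)
        scores = V[t - 1][:, None] * A           # scores[j, k] = V_{t-1,j} * a_jk
        back[t] = scores.argmax(axis=0)
        V[t] = B[:, obs[t]] * scores.max(axis=0)
    path = [int(V[-1].argmax())]                 # backtrack the most likely sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), V[-1].max()

# path, best_prob = viterbi([2, 1, 0], pi, A, B)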
Trellis representation
Step 0
• Step 0 simply lists all possible states at time 0 with their corresponding probability values: $\alpha_{hu}(0) = 0.4$, $\alpha_{ha}(0) = 0.3$, $\alpha_{sl}(0) = 0.3$ (the initial distribution $\pi$).
• We do not decide which state is chosen at this stage.
...
Trellis representation
Step 1
Out of the 3 inputs, the maximum value is selected.
[Trellis figure: candidate terms such as $P(sm \mid ha)\, P(ha \mid ha)\, \alpha_{ha}(0)$ are compared and only the maximum is kept as $\alpha_{ha}(1)$, and similarly for $\alpha_{ha}(2)$.]
...
...
Baum-Welch
Forward-backward algorithm
Refining the model
• The transition and observation probabilities come from
somewhere, but they may not always be available, or may not be accurate.
• Idea: use an estimation technique (e.g. EM) to find parameters
that better fit the observations, i.e., learning.
...
Baum-Welch
Forward-backward algorithm
...
Baum-Welch
Forward-backward algorithm
• Now, we need to update our probabilities.
• Remember that $a_{ij}$ is the transition probability $p(x_t^j \mid x_{t-1}^i)$ and $e_{jk}$ is the emission probability $p(o_t^k \mid x_t^j)$.
$\sum_{t=1}^{T} \gamma_i(t)$ (the expected number of time steps spent in state $i$; see the re-estimation formulas below)
...
Baum-Welch
Forward-backward algorithm
where
$p(O \mid a_{ij}, e_{jk}) = \sum_{s=1}^{N} \alpha_s(t-1)\, \beta_s(t)$
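For reference, the standard (textbook) Baum-Welch re-estimation quantities are sketched below, using the common convention $\beta_i(t) = p(o_{t+1:T} \mid x_t^i)$; the slide's own update formulas did not survive extraction, so this is an assumption about their intended form:

\gamma_i(t) = \frac{\alpha_i(t)\,\beta_i(t)}{\sum_s \alpha_s(t)\,\beta_s(t)}, \qquad
\xi_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, e_{j,\,o_{t+1}}\, \beta_j(t+1)}{\sum_s \alpha_s(t)\,\beta_s(t)}

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}, \qquad
\hat{e}_{jk} = \frac{\sum_{t=1}^{T} \gamma_j(t)\,[o_t = k]}{\sum_{t=1}^{T} \gamma_j(t)}, \qquad
\hat{\pi}_i = \gamma_i(1)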
...
Baum-Welch
Forward-backward algorithm
.
Implementation
• We want to model the probability that the baby will be in each of
three states in the future, given its current state.
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pprint import pprint
%matplotlib inline
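The snippet below uses a list of observable states and a transition DataFrame q_df that were presumably defined on an earlier slide (not shown here); a hypothetical reconstruction, consistent with the node list and edge weights printed below, would be:

# Hypothetical reconstruction of the observable states and transition matrix q_df
# (names and values inferred from the printed output further down).
states = ['distressed', 'crying', 'smiling']
q_df = pd.DataFrame([[0.4, 0.4, 0.5],
                     [0.3, 0.4, 0.3],
                     [0.3, 0.2, 0.2]],
                    index=states, columns=states)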
def _get_markov_edges(Q):
    # Collect (origin, destination) -> transition probability from the DataFrame.
    edges = {}
    for col in Q.columns:
        for idx in Q.index:
            edges[(idx, col)] = Q.loc[idx, col]
    return edges

edges_wts = _get_markov_edges(q_df)
pprint(edges_wts)

{('crying', 'crying'): 0.4,
 ('crying', 'distressed'): 0.3,
 ('crying', 'smiling'): 0.3,
 ('distressed', 'crying'): 0.4,
 ('distressed', 'distressed'): 0.4,
 ('distressed', 'smiling'): 0.5,
 ('smiling', 'crying'): 0.2,
 ('smiling', 'distressed'): 0.3,
 ('smiling', 'smiling'): 0.2}
• After that, we are ready to create our graph: the graph object,
nodes, edges, and labels.
# graph object
G = nx.MultiDiGraph()

# nodes correspond to states
G.add_nodes_from(states)
print(f'Nodes:\n{G.nodes()}\n')

# edges represent transition probabilities
for k, v in edges_wts.items():
    tmp_origin, tmp_destination = k[0], k[1]
    G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
print(f'Edges:')
pprint(G.edges(data=True))

pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos)

# edge labels
edge_labels = {(n1, n2): d['label'] for n1, n2, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
nx.drawing.nx_pydot.write_dot(G, 'pet_baby_markov.dot')
Nodes: ['distressed', 'crying', 'smiling']
Edges: OutMultiEdgeDataView([
('distressed', 'distressed', {'weight': 0.4, 'label': 0.4}),
('distressed', 'crying', {'weight': 0.4, 'label': 0.4}),
('distressed', 'smiling', {'weight': 0.5, 'label': 0.5}),
('crying', 'distressed', {'weight': 0.3, 'label': 0.3}),
('crying', 'crying', {'weight': 0.4, 'label': 0.4}),
('crying', 'smiling', {'weight': 0.3, 'label': 0.3}),
('smiling', 'distressed', {'weight': 0.3, 'label': 0.3}),
('smiling', 'crying', {'weight': 0.2, 'label': 0.2}),
('smiling', 'smiling', {'weight': 0.2, 'label': 0.2})])
• Now, we move to the HMM.
• If we see the baby is distressed, how can we infer whether the reason is being
hungry or sleepy?
• In this situation, the true state of the baby is unknown to us, thus "hidden"
from us.
• One way to model this situation is to assume that the baby has observable
behaviors that represent the true, hidden state.
print(a_df)
a = a_df.values
print('\n', a, a.shape, '\n')
print(a_df.sum(axis=1))

        Hungry  Happy  Sleepy
Hungry     0.4    0.4     0.5
Happy      0.3    0.4     0.3
Sleepy     0.3    0.2     0.2

 [[0.4 0.4 0.5]
 [0.3 0.4 0.3]
 [0.3 0.2 0.2]] (3, 3)

Hungry    1.3
Happy     1.0
Sleepy    0.7
dtype: float64
• Next, we create the emission (observation) probability matrix: the
probability of each observable behavior, given the hidden state the baby is in.
• 𝑀 × 𝑂: 𝑀 is the number of hidden states and 𝑂 is the number of possible
observations.
observable_states = states
b_df = pd.DataFrame(columns=observable_states, index=hidden_states)
b_df.loc[hidden_states[0]] = [0.4, 0.4, 0.2]
b_df.loc[hidden_states[1]] = [0.2, 0.01, 0.79]
b_df.loc[hidden_states[2]] = [0.5, 0.4, 0.1]
print(f'Edges:')
pprint(G.edges(data=True))
pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw_networkx(G, pos)

Edges: OutMultiEdgeDataView([
('Hungry', 'Hungry', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'Happy', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'Sleepy', {'weight': 0.5, 'label': 0.5}),
('Hungry', 'distressed', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'crying', {'weight': 0.4, 'label': 0.4}),
('Hungry', 'smiling', {'weight': 0.2, 'label': 0.2}),
('Happy', 'Hungry', {'weight': 0.3, 'label': 0.3}),
('Happy', 'Happy', {'weight': 0.4, 'label': 0.4}),
('Happy', 'Sleepy', {'weight': 0.3, 'label': 0.3}),
('Happy', 'distressed', {'weight': 0.2, 'label': 0.2}),
('Happy', 'crying', {'weight': 0.01, 'label': 0.01}),
('Happy', 'smiling', {'weight': 0.79, 'label': 0.79}),
('Sleepy', 'Hungry', {'weight': 0.3, 'label': 0.3}),
('Sleepy', 'Happy', {'weight': 0.2, 'label': 0.2}),
('Sleepy', 'Sleepy', {'weight': 0.2, 'label': 0.2}),
('Sleepy', 'distressed', {'weight': 0.5, 'label': 0.5}),
('Sleepy', 'crying', {'weight': 0.4, 'label': 0.4}),
('Sleepy', 'smiling', {'weight': 0.1, 'label': 0.1})])

          hungry  happy  sleepy  distressed  crying  smiling
hungry       0.4    0.4     0.5         0.4     0.4      0.2
happy        0.3    0.4     0.3         0.2    0.01     0.79
sleepy       0.3    0.2     0.2         0.5     0.4      0.1
The dotfile:
For example, if the baby is “smiling”, there is a high probability that it is “happy” and a very low probability
that it is “hungry”.
The dotfile in graph:
[Graph figure: the hidden states (Hungry, Happy, Sleepy) with transition edges among them and emission edges to the observed behaviors (smiling, crying, distressed), labeled with the weights listed above.]
• Suppose we have the following observation sequence of the baby's behavior, encoded numerically:
obs_map = {'distressed': 0, 'crying': 1, 'smiling': 2}
obs = np.array([1, 1, 2, 1, 0, 1, 2, 1, 0, 2, 2, 0, 1, 0, 1])

b = b_df.values            # emission matrix as a numpy array (mirrors a = a_df.values)
nStates = np.shape(b)[0]
T = np.shape(obs)[0]
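As a possible next step, this sequence could be fed to the forward and Viterbi sketches given earlier; the usage below is hypothetical (it assumes those sketch functions, and note that the printed a matrix rows do not all sum to 1, so the resulting numbers are purely illustrative):

pi = np.array([0.4, 0.3, 0.3])      # initial distribution from the pi table
alpha, prob_obs = forward(obs, pi, a, b)
path, best_prob = viterbi(obs, pi, a, b)
print('P(O | lambda) =', prob_obs)
print('Most likely hidden states:', [hidden_states[i] for i in path])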