Pair HMM
HMM for pairwise sequence alignment, which
incorporates affine gap scores.
Hidden States
Match (M)
Insertion in x (X)
Insertion in y (Y)
Observation Symbols
Match (M): {(a,b) | a, b in Σ}.
Insertion in x (X): {(a,-) | a in Σ}.
Insertion in y (Y): {(-,a) | a in Σ}.
Pair HMMs
[Figure: state diagram of the pair HMM with states Begin, M, X, Y, and End]
Alignment: a path, i.e. a hidden state sequence
A T - G T T A T
A T C G T - A C
M M Y M M X M M
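The correspondence between alignment columns and hidden states can be sketched in a few lines of Python. This is only an illustration: the function name `state_path` and the gap character `-` are assumptions, not part of the slides.

```python
def state_path(x_row, y_row, gap="-"):
    """Label each alignment column with a pair-HMM state (M, X or Y)."""
    if len(x_row) != len(y_row):
        raise ValueError("alignment rows must have equal length")
    path = []
    for a, b in zip(x_row, y_row):
        if a != gap and b != gap:
            path.append("M")          # residue aligned to residue
        elif a != gap:
            path.append("X")          # residue in x against a gap in y
        elif b != gap:
            path.append("Y")          # residue in y against a gap in x
        else:
            raise ValueError("a column cannot contain two gaps")
    return path


x_row = "A T - G T T A T".split()
y_row = "A T C G T - A C".split()
print(" ".join(state_path(x_row, y_row)))  # prints: M M Y M M X M M
```

Applied to the example alignment, the sketch reproduces the state sequence shown on the slide.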
Multiple sequence alignment
(Globin family)
Profile model (PSSM)
[Figure: chain of match states, Begin → ... → Mj → ... → End]
Components of profile HMMs
Delete states
No emission prob.
Cost of a deletion
M→D, D→D, D→M
Each D→D might be different
[Figure: match-state chain Begin, Mj, End with delete state Dj]
Full structure of profile HMMs
[Figure: full profile HMM with delete states Dj, insert states Ij, and match states Mj between Begin and End]
Deriving HMMs from multiple
alignments
Key idea behind profile HMMs
Model representing the consensus for the
alignment of sequences from the same family
Not the sequence of any particular member
HBA_HUMAN ...VGA--HAGEY...
HBB_HUMAN ...V----NVDEV...
MYG_PHYCA ...VEA--DVAGH...
GLB3_CHITP ...VKG------D...
GLB5_PETMA ...VYS--TYETS...
LGB2_LUPLU ...FNA--NIPKH...
GLB1_GLYDI ...IAGADNGAGV...
*** *****
Deriving HMMs from multiple
alignments
Basic profile HMM parameterization
Aim: assign higher probability to sequences
from the family
Parameters
Probability values: trivial to estimate if many
independent aligned sequences are given.
$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$
Length of the model: chosen by heuristics or in a systematic way
Sequence conservation: entropy profile
of the emission probability distributions
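As a rough illustration of an entropy profile, the sketch below computes the Shannon entropy (in bits) of each match-state emission distribution. Low entropy indicates a conserved position. The distributions used here are made-up example values, and the function names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution {symbol: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def entropy_profile(emissions):
    """Entropy of each position's emission distribution, in model order."""
    return [entropy(dist) for dist in emissions]


# Hypothetical emission distributions for a three-position DNA model.
emissions = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully conserved
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},      # two equally likely residues
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no conservation
]
print(entropy_profile(emissions))
```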
Searching with profile HMMs
Main usage of profile HMMs
Detecting potential sequences in a family
Matching a sequence to the profile HMMs
Viterbi algorithm or forward algorithm
Comparing the resulting probability with
random model
$P(x \mid R) = \prod_i q_{x_i}$
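A minimal sketch of the random-model score, computed in log space to avoid underflow on long sequences. The background frequencies `q` below are an assumed uniform DNA distribution chosen only for illustration.

```python
import math


def log_prob_random(x, q):
    """log P(x | R) = sum over i of log q[x_i] for the random model R."""
    return sum(math.log(q[c]) for c in x)


# Assumed uniform background over DNA; real applications would use
# frequencies estimated from data.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(log_prob_random("ACGT", q))
```

A log-odds score is then the log probability of the sequence under the profile HMM (from the Viterbi or forward algorithm) minus this value.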
Searching with profile HMMs
Viterbi algorithm (optimal log-odds
alignment)
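$$V^{M}_{j}(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max\begin{cases} V^{M}_{j-1}(i-1) + \log a_{M_{j-1}M_j},\\ V^{I}_{j-1}(i-1) + \log a_{I_{j-1}M_j},\\ V^{D}_{j-1}(i-1) + \log a_{D_{j-1}M_j};\end{cases}$$
$$V^{I}_{j}(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max\begin{cases} V^{M}_{j}(i-1) + \log a_{M_jI_j},\\ V^{I}_{j}(i-1) + \log a_{I_jI_j},\\ V^{D}_{j}(i-1) + \log a_{D_jI_j};\end{cases}$$
$$V^{D}_{j}(i) = \max\begin{cases} V^{M}_{j-1}(i) + \log a_{M_{j-1}D_j},\\ V^{I}_{j-1}(i) + \log a_{I_{j-1}D_j},\\ V^{D}_{j-1}(i) + \log a_{D_{j-1}D_j}.\end{cases}$$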
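The log-odds Viterbi recurrences for the match, insert, and delete states can be sketched as follows. The data layout is an assumption made for this example: `a[(s, j, t)]` is the transition probability from state type `s` in column `j` to state type `t`, Begin is treated as M_0 and End as M_{L+1}, and all numbers in the toy model are invented.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi_log_odds(x, e_match, e_insert, a, q):
    """Best log-odds score of x against a profile HMM with L match states.

    e_match[j], e_insert[j]: emission dicts for M_j (j=1..L) and I_j (j=0..L).
    a[(s, j, t)]: transition prob. from s_j; t in 'M','D' moves to column
    j+1, t == 'I' stays in column j. Missing entries count as probability 0.
    """
    L, n = len(e_match) - 1, len(x)
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # Begin state

    def best(j, i, t):
        # Best predecessor score plus the log transition into state type t.
        return max(V[s][j][i] + safe_log(a.get((s, j, t), 0.0)) for s in "MID")

    for j in range(L + 1):
        for i in range(n + 1):
            if i >= 1:
                c = x[i - 1]
                if j >= 1:
                    odds = safe_log(e_match[j].get(c, 0.0) / q[c])
                    V["M"][j][i] = odds + best(j - 1, i - 1, "M")
                odds = safe_log(e_insert[j].get(c, 0.0) / q[c])
                V["I"][j][i] = odds + best(j, i - 1, "I")
            if j >= 1:
                V["D"][j][i] = best(j - 1, i, "D")
    return best(L, n, "M")  # transition into End


# Toy one-column model with invented parameters.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
e_match = [None, {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
e_insert = [q, q]  # insert emissions equal to background: log-odds 0
a = {("M", 0, "M"): 0.9, ("M", 0, "D"): 0.1,
     ("M", 1, "M"): 0.9, ("M", 1, "I"): 0.1, ("D", 1, "M"): 1.0}
print(viterbi_log_odds("A", e_match, e_insert, a, q))
```

Setting the insert emissions equal to the background distribution, as in the toy model, makes the insert log-odds term vanish; this is a common simplification rather than something the slides prescribe.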
[Figure: profile HMM with states Begin, Mj, Ij, End and two flanking states Q]
Variants for non-global alignments
Overlap alignments
Only transitions to the first model state are allowed.
Used when the match is expected to be either present as a whole or
absent
A transition to the first delete state allows the first
residue to be missing
[Figure: overlap-alignment model with flanking states Q around the Begin, Mj, Ij, Dj, End profile]
Variants for non-global alignments
Repeat alignments
Transition from right flanking state back to random
model
Can find multiple matching segments in query string
[Figure: repeat-alignment model with states Begin, Q, Mj, Ij, Dj, End]
Estimation of probabilities
Maximum likelihood (ML) estimation
given the observed frequency $c_{ja}$ of residue a in position j:
$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$
Simple pseudocounts
qa: background distribution
A: weight factor
$e_{M_j}(a) = \frac{c_{ja} + A q_a}{A + \sum_{a'} c_{ja'}}$
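Both estimators can be written as one small function, since setting the weight A to 0 recovers the maximum likelihood estimate. The sketch below is illustrative; the function name, the uniform background, and the choice A = 4 are assumptions.

```python
def match_emissions(counts, q, A=0.0):
    """Estimate e_Mj(a) = (c_ja + A*q_a) / (A + sum_a' c_ja').

    counts: observed residue counts in column j; q: background distribution;
    A: pseudocount weight (A = 0 gives the maximum likelihood estimate).
    """
    total = sum(counts.values())
    if A + total == 0:
        raise ValueError("no observations and no pseudocounts")
    return {a: (counts.get(a, 0) + A * q[a]) / (A + total) for a in q}


q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background
counts = {"G": 3}                                  # column with three G residues

print(match_emissions(counts, q))          # ML: unseen residues get probability 0
print(match_emissions(counts, q, A=4.0))   # pseudocounts: unseen residues get 1/7
```

With pseudocounts, residues that never appear in the column keep a nonzero probability, which avoids assigning probability zero to sequences that differ from the training alignment at one position.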
Optimal model construction:
mark columns
(a) Multiple alignment (x marks the match columns):

        x  x  .  .  .  x
bat     A  G  -  -  -  C
rat     A  -  A  G  -  C
cat     A  G  -  A  A  -
gnat    -  -  A  A  A  C
goat    A  G  -  -  -  C
        1  2  .  .  .  3

(b) Profile-HMM architecture:

[Figure: profile HMM with states beg, M1 to M3, end; insert states I0 to I3; delete states D1 to D3; columns numbered 0 to 4]

(c) Observed emission/transition counts:

                        0  1  2  3
match emissions    A    -  4  0  0
                   C    -  0  0  4
                   G    -  0  3  0
                   T    -  0  0  0
insert emissions   A    0  0  6  0
                   C    0  0  0  0
                   G    0  0  1  0
                   T    0  0  0  0
state transitions  M-M  4  3  2  4
                   M-D  1  1  0  0
                   M-I  0  0  1  0
                   I-M  0  0  2  0
                   I-D  0  0  1  0
                   I-I  0  0  4  0
                   D-M  -  0  0  1
                   D-D  -  1  0  0
                   D-I  -  0  2  0
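The counts in panel (c) follow mechanically from the alignment once the match columns are marked. The sketch below shows one way to derive them; the function name, the 0-based column indices, and the convention of treating Begin as M_0 and End as M_{L+1} are assumptions of this example.

```python
from collections import Counter


def count_from_alignment(rows, marked, gap="-"):
    """Count emissions and transitions implied by marking match columns.

    rows: aligned sequences; marked: sorted 0-based indices of match columns.
    Returns (match_emit, insert_emit, trans), each a list indexed by model
    position j = 0..L; trans[j]["M-D"] counts transitions out of column j.
    """
    L = len(marked)
    pos_of = {col: j + 1 for j, col in enumerate(marked)}
    match_emit = [Counter() for _ in range(L + 1)]
    insert_emit = [Counter() for _ in range(L + 1)]
    trans = [Counter() for _ in range(L + 1)]
    for row in rows:
        prev, j = "M", 0                       # start in Begin (M_0)
        for col, c in enumerate(row):
            if col in pos_of:                  # marked column: match or delete
                state = "D" if c == gap else "M"
                trans[j][prev + "-" + state] += 1
                j = pos_of[col]
                if state == "M":
                    match_emit[j][c] += 1
                prev = state
            elif c != gap:                     # unmarked column: insertion
                trans[j][prev + "-I"] += 1
                insert_emit[j][c] += 1
                prev = "I"
        trans[j][prev + "-M"] += 1             # final transition into End
    return match_emit, insert_emit, trans


rows = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]  # bat, rat, cat, gnat, goat
match_emit, insert_emit, trans = count_from_alignment(rows, marked=[0, 1, 5])
print(dict(trans[2]))
```

Run on the five-sequence example with columns 1, 2, and 6 marked, the sketch reproduces the counts tabulated in panel (c), for instance six A and one G emitted from the insert state at position 2.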
Optimal model construction
MAP (match-insert assignment)
Recursive calculation of a number Sj
Sj: log prob. of the optimal model for alignment up to and
including column j, assuming j is marked.
Sj is calculated from Si and summed log prob. between i
and j.
Tij: summed log prob. of all the state transitions between
marked i and j.
$T_{ij} = \sum_{x,y \in \{M,D,I\}} c_{xy} \log a_{xy}$
$c_{xy}$ are obtained from the partial state paths implied by marking i and j.
Optimal model construction
Algorithm: MAP model construction
Initialization:
$S_0 = 0$, $M_{L+1} = 0$.
Recurrence: for j = 1,..., L+1:
$S_j = \max_{0 \le i < j} \left[ S_i + T_{ij} + M_j + I_{i+1,j-1} \right]$
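The recurrence is a standard dynamic program over candidate marked columns, and its shape can be sketched independently of how the score terms are computed. In the sketch below, `T`, `M`, and `I` are passed in as functions; the toy scores, the function name, and the traceback via back-pointers are assumptions added for illustration, since the slides do not define the score terms in detail.

```python
def map_marked_columns(L, T, M, I):
    """Maximise S_j = max over 0 <= i < j of S_i + T(i, j) + M(j) + I(i+1, j-1).

    Columns 0 and L+1 stand for Begin and End. Returns the optimal score
    S_{L+1} and the list of marked columns recovered by traceback.
    """
    S = [0.0] + [float("-inf")] * (L + 1)
    back = [0] * (L + 2)
    for j in range(1, L + 2):
        for i in range(j):
            score = S[i] + T(i, j) + M(j) + I(i + 1, j - 1)
            if score > S[j]:
                S[j], back[j] = score, i
    marked, j = [], back[L + 1]
    while j > 0:                      # follow back-pointers to column 0
        marked.append(j)
        j = back[j]
    return S[L + 1], marked[::-1]


# Hypothetical toy scores: skipping columns costs 1 each, column 2 is a
# poor match column, and insert emissions are free.
match_score = {1: -0.5, 2: -5.0, 3: -0.5, 4: 0.0}   # M_{L+1} = 0
score, marked = map_marked_columns(
    3,
    T=lambda i, j: -(j - i - 1),
    M=lambda j: match_score[j],
    I=lambda lo, hi: 0.0,
)
print(score, marked)  # prints: -2.0 [1, 3]
```

In the toy run, column 2 is left unmarked because treating it as an insert column is cheaper than paying its poor match score.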
[Figure: binary tree with leaves 1 to 4 and internal nodes 5 to 7 (V5, V6, V7); branch lengths t1 = 2, t2 = 2, t3 = 5, t4 = 8, t5 = 3, t6 = 3; node labels I1, I2, I3, I4, I1+I2, I1+I2+I3]
$\psi_j = (Q_j, \pi_j, \tau_j, \beta_j)$:
$Q_j$: substitution rate matrix
$\pi_j$: background frequencies
$\tau_j$: binary tree
$\beta_j$: branch lengths
The Phylogenetic Model
$\pi_j = (\pi_{A,j}, \pi_{C,j}, \pi_{G,j}, \pi_{T,j})$
$\kappa_j$ represents the transition/transversion rate ratio for $\psi_j$
The "-" entries indicate quantities required to normalize each row.
State sequences in Phylo-HMMs