Lawrence Rabiner
Center for Advanced Information Processing (CAIP)
Rutgers University
Piscataway, NJ 08854
lrr@caip.rutgers.edu
File Location: (c:\data\LaTex files\speech recognition course\digit recognition project\word recognition project HMMs.tex)
Word HMMs
The observation vector $O_i$ is of the form:
\[
O_i = \left( c_1^{(i)}, c_2^{(i)}, \ldots, c_p^{(i)} \right) \tag{1}
\]
or, with the first differences (delta cepstrum) appended:
\[
O_i = \left( c_1^{(i)}, \ldots, c_p^{(i)}, \Delta c_1^{(i)}, \ldots, \Delta c_p^{(i)} \right) \tag{2}
\]
or, with both the first and second differences (delta and delta-delta cepstrum) appended:
\[
O_i = \left( c_1^{(i)}, \ldots, c_p^{(i)}, \Delta c_1^{(i)}, \ldots, \Delta c_p^{(i)}, \Delta^2 c_1^{(i)}, \ldots, \Delta^2 c_p^{(i)} \right) \tag{3}
\]
The first difference (delta cepstrum) is computed as:
\[
\Delta c_l[m] = \left( \sum_{k=-K}^{K} k \, c_{l+k}[m] \right) G, \quad 1 \le m \le Q, \quad G = 0.375 \tag{4}
\]
and the second difference (delta-delta cepstrum) is obtained by applying the same weighted difference to the delta cepstrum:
\[
\Delta^2 c_l[m] = \left( \sum_{k=-K}^{K} k \, \Delta c_{l+k}[m] \right) G_1, \quad m = 1, 2, \ldots, Q, \quad G_1 = 0.375 \tag{5}
\]
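The following is a minimal numpy sketch of Eqs. (4)-(5) and the composite feature vector of Eq. (3). The context half-width $K = 3$ and the edge-replication padding are assumptions, since the text does not fix either choice.

```python
import numpy as np

def delta_features(c, K=3, G=0.375):
    """Weighted first difference of a cepstral sequence (Eq. 4).

    c    : array of shape (num_frames, Q), cepstral coefficients per frame.
    K, G : context half-width and gain; K = 3 is an assumed value,
           G = 0.375 follows the text.
    """
    padded = np.pad(c, ((K, K), (0, 0)), mode="edge")  # replicate edge frames
    delta = np.zeros_like(c)
    for k in range(-K, K + 1):
        delta += k * padded[K + k : K + k + len(c)]
    return G * delta

def full_feature_vector(c):
    """Per-frame feature vector of Eq. (3): cepstra + delta + delta-delta."""
    d = delta_features(c)    # Eq. (4)
    dd = delta_features(d)   # Eq. (5): the same difference applied to the deltas
    return np.concatenate([c, d, dd], axis=1)
```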
For the HMM model of each digit, we assume the state transition coefficients
are of the form:
\[
a = \begin{cases}
a_{ii} = 0.95, \quad a_{i,i+1} = 0.05 & \text{states } 1 \text{ to } NS-1 \\
a_{NS,NS} = 1.0 & \text{state } NS
\end{cases} \tag{6}
\]
with all other transition coefficients equal to zero.
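As a concrete illustration, here is a minimal numpy sketch of the transition matrix of Eq. (6); this is one way to realize the stated coefficients, not code from the original project.

```python
import numpy as np

def left_right_transitions(NS):
    """Left-to-right (Bakis) transition matrix of Eq. (6):
    states 1..NS-1 self-loop with 0.95 and advance with 0.05;
    the final state NS is absorbing."""
    A = np.zeros((NS, NS))
    for i in range(NS - 1):
        A[i, i] = 0.95       # self-loop
        A[i, i + 1] = 0.05   # advance to the next state
    A[NS - 1, NS - 1] = 1.0  # final state
    return A
```

Each row sums to one, as required of transition probabilities.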
If we assume that the feature vector (observation) probability density is a mixture of Gaussians, we can write the probability of observing feature vector $O$ in state $j$ of the HMM model (for either training or testing) as:
\[
b_j(O) = \sum_{m=1}^{M} g_{mj} \, \mathcal{N}\!\left( O, \mu_{mj}, U_{mj} \right) \tag{7}
\]
where:
$j$ is the state of the model, $j = 1, 2, \ldots, NS$
$m$ is the mixture number, $m = 1, 2, \ldots, M$
$O$ is the observation vector of cepstral coefficients and possibly delta and delta-delta cepstral coefficients
$g_{mj}$ is the mixture gain for the $m$th mixture in the $j$th state
$\mathcal{N}$ is a Gaussian density function
$\mu_{mj}$ is the mean vector for the $m$th mixture in state $j$
$U_{mj}$ is the covariance of the $m$th mixture in state $j$ (assumed to be a diagonal covariance matrix)
Combining terms, the probability of observing feature vector $O$ in state $j$ of the HMM model is:
\[
b_j(O) = \sum_{m=1}^{M} \frac{g_{mj} \displaystyle\prod_{q=1}^{Q} \exp\!\left[ -\frac{(O[q] - \mu_{mjq})^2}{2 \sigma_{mjq}^2} \right]}{(2\pi)^{Q/2} \left( \displaystyle\prod_{q=1}^{Q} \sigma_{mjq}^2 \right)^{1/2}} \tag{8}
\]
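The following is a minimal numpy sketch of Eq. (8), evaluated in the log domain for numerical stability (a standard trick, not something the text prescribes); all variable names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def log_b(O, gains, means, variances):
    """Log of Eq. (8): diagonal-covariance Gaussian mixture density.

    O         : observation vector, shape (Q,)
    gains     : mixture gains g_mj for one state j, shape (M,)
    means     : mu_mjq, shape (M, Q)
    variances : sigma^2_mjq, shape (M, Q)
    """
    Q = len(O)
    # Per-mixture log normalizer: -(Q/2) log(2 pi) - (1/2) sum_q log sigma^2
    log_norm = -0.5 * (Q * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    # Per-mixture log of the product of exponentials in the numerator
    log_exp = -0.5 * (((O - means) ** 2) / variances).sum(axis=1)
    # Log of the mixture sum over m
    return logsumexp(np.log(gains) + log_norm + log_exp)
```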
Training Procedure
Figure 3: Use of the Viterbi algorithm to resegment words into HMM states in an optimal manner.
For a vocabulary of $V$ words, we independently form a whole word HMM for each individual word, $v_i$, $i = 1, 2, \ldots, V$, using the 3-step training procedure outlined below (a minimal sketch of the resulting loop follows the list):

Step 1 - Initialization
- assume an initial uniform segmentation of each training token of each word, $v_i$, into states
- determine the mean vector, $\mu_{mj}$, and the diagonal covariance matrix, $\sigma^2_{mj}$, for all mixtures and states (assume that the number of mixtures is 1 for the time being)
- this step gives initial word models for the $V$ words in the vocabulary, namely $\lambda_1, \lambda_2, \ldots, \lambda_V$

Step 2 - Viterbi alignment and segmentation
- resegment each training utterance into states using the Viterbi algorithm, as shown in Figure 3

Step 3 - Iteration
- re-estimate the model parameters ($\mu_{mj}$, $\sigma^2_{mj}$) from the re-segmented utterances, repeating Steps 2 and 3 until the segmentation stabilizes
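Below is a minimal sketch of this segmental training loop under the stated single-mixture assumption. Here `align_fn` is a hypothetical helper standing in for the Viterbi resegmentation of Figure 3 (for example, one built from the log Viterbi sketch at the end of this section), and the iteration count is an arbitrary choice.

```python
import numpy as np

def train_word_model(tokens, NS, align_fn, n_iter=3):
    """Segmental training of one whole word HMM (Steps 1-3).

    tokens   : list of (T_i, Q) arrays, the training tokens of one word.
    NS       : number of HMM states.
    align_fn : hypothetical resegmentation helper; maps (token, means,
               variances) to a list of NS per-state frame arrays (Figure 3).
    """
    # Step 1: initial uniform segmentation of every token into NS states
    segments = [np.array_split(tok, NS) for tok in tokens]
    for _ in range(n_iter):
        # Single-Gaussian estimate per state: mean vector and diagonal covariance
        means = [np.concatenate([seg[j] for seg in segments]).mean(axis=0)
                 for j in range(NS)]
        variances = [np.concatenate([seg[j] for seg in segments]).var(axis=0)
                     for j in range(NS)]
        # Step 2: Viterbi resegmentation; Step 3: re-estimate on the next pass
        segments = [align_fn(tok, means, variances) for tok in tokens]
    return means, variances
```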
Viterbi decoding finds the best alignment path between the feature vectors of the whole word input signal and the HMM model states using a 5-step procedure. If we assume that the whole word model, $\lambda_v$, for the $v$th word in the vocabulary is of the form:
\[
\lambda_v = \{ \pi_i^v, a_{ij}^v, b_i^v \}, \quad 1 \le i, j \le NS, \quad 1 \le v \le V \tag{9}
\]
and we assume that there are $T$ frames in the feature vector sequence of the whole word model, then the implementation of log Viterbi decoding (we omit the superscript $v$ in the steps below for ease of notation) is the following:
1. Preprocessing
\[
\tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le NS \tag{10}
\]
\[
\tilde{b}_i(O_t) = \log[b_i(O_t)], \quad 1 \le i \le NS, \quad 1 \le t \le T \tag{11}
\]
\[
\tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le NS \tag{12}
\]
2. Initialization
\[
\tilde{\delta}_1(i) = \log(\delta_1(i)) = \tilde{\pi}_i + \tilde{b}_i(O_1), \quad 1 \le i \le NS \tag{13}
\]
\[
\psi_1(i) = 0, \quad 1 \le i \le NS \tag{14}
\]
3. Recursion
\[
\tilde{\delta}_t(j) = \max_{1 \le i \le NS} \left[ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij} \right] + \tilde{b}_j(O_t), \quad 2 \le t \le T, \quad 1 \le j \le NS \tag{15}
\]
\[
\psi_t(j) = \arg\max_{1 \le i \le NS} \left[ \tilde{\delta}_{t-1}(i) + \tilde{a}_{ij} \right], \quad 2 \le t \le T, \quad 1 \le j \le NS \tag{16}
\]
4. Termination
\[
\tilde{P}^* = \max_{1 \le i \le NS} \left[ \tilde{\delta}_T(i) \right] \tag{17}
\]
\[
q_T^* = \arg\max_{1 \le i \le NS} \left[ \tilde{\delta}_T(i) \right] \tag{18}
\]
5. Backtracking
\[
q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1 \tag{19}
\]
Figure 4: Use of log Viterbi decoding to find optimal decoding path for aligning
frames of a word token with a given word model.
Figure 4 shows the computation of the best path using the log Viterbi decoding
algorithm given above.
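As a concrete rendering of the five steps, here is a minimal numpy sketch of log Viterbi decoding; it uses zero-based indexing instead of the one-based notation of Eqs. (10)-(19), and the argument layout is an assumption.

```python
import numpy as np

def log_viterbi(pi, A, log_b):
    """Log-domain Viterbi decoding, Eqs. (10)-(19).

    pi    : initial state probabilities, shape (NS,)
    A     : state transition matrix a_ij, shape (NS, NS)
    log_b : log observation likelihoods log b_j(O_t), shape (T, NS)
    Returns the best log score P* and the optimal state path q*.
    """
    T, NS = log_b.shape
    with np.errstate(divide="ignore"):          # log(0) -> -inf for forbidden moves
        log_pi, log_A = np.log(pi), np.log(A)   # Step 1: preprocessing
    delta = np.empty((T, NS))
    psi = np.zeros((T, NS), dtype=int)
    delta[0] = log_pi + log_b[0]                # Step 2: initialization
    for t in range(1, T):                       # Step 3: recursion
        scores = delta[t - 1][:, None] + log_A  # scores[i, j] = delta_{t-1}(i) + a~_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    P_star = delta[-1].max()                    # Step 4: termination
    q = np.empty(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):              # Step 5: backtracking
        q[t] = psi[t + 1, q[t + 1]]
    return P_star, q
```

Transitions with $a_{ij} = 0$ map to $-\infty$ in the log domain, so the max in the recursion can never select them.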