
SPEAKER RECOGNITION MODELS

Kin Yu, John Mason & John Oglesby


Speech Research Group, Department of Electrical and Electronic Engineering
University of Wales Swansea SA2 8PP, UK
e-mail: k.yu@swansea.ac.uk & eemasonj@swansea.ac.uk
Phone: +44 792 294564
Fax: +44 792 295686

ABSTRACT

This paper evaluates continuous density hidden Markov models (CDHMM), dynamic time warping (DTW) and distortion-based vector quantisation (VQ) for speaker recognition, across incremental amounts of training data. In comparing VQ and CDHMMs for text-independent (TI) speaker recognition, it is shown that VQ performs better than an equivalent CDHMM with one training version, but is outperformed by the CDHMM when trained with ten training versions. In text-dependent (TD) experiments, a comparison of DTW, VQ and CDHMMs shows that DTW outperforms VQ and CDHMMs for sparse amounts of training data, but that with more data the performance of each model is indistinguishable. Further analysis shows the TD architecture to be superior to TI for speaker recognition, and TD digit performance illustrates zero, 1 and 9 to be good discriminators.
1. INTRODUCTION

Several analytical approaches have been applied to the task of speaker recognition, many of which originate in speech recognition. Dynamic time warping (DTW), vector quantisation (VQ) and continuous density hidden Markov models (CDHMM) are three of the most common approaches and are the three considered in this paper.

Irvine [1], in text-dependent (TD) experiments, compares the three approaches considered here, concluding that VQ provides the best performance. Matsui and Furui [2] compare VQ, discrete HMMs and CDHMMs for text-independent (TI) speaker recognition, illustrating improved performance of CDHMMs over discrete HMMs, and of VQ over CDHMMs with limited amounts of training data.
This paper concerns a comparison of DTW, VQ and CDHMMs for TI and TD recognition, and also shows performance trends in each case as more training data is made available. The emphasis of the experiments is on the performance of the models under incremental amounts of training data, in an attempt to identify the best approach. Consider the scenario of a single enrolment session where the client might reasonably be expected to utter just one or two versions of the digit set. Under these circumstances, which approach to recognition gives the best performance: TD or TI; DTW, VQ or CDHMM? This paper attempts to address these questions.

2. APPROACHES TO RECOGNITION

The VQ technique used for experimentation follows that of the classical generalised Lloyd or Linde, Buzo and Gray (LBG) algorithm [3]. Among the first to apply this technique to speaker recognition were Soong et al [4] and Buck et al [5]. Recognition uses a Euclidean distance metric. VQ is inherently TI, though a level of text dependency can be achieved with appropriate selection of training data. Thus TI and TD experiments are performed using this modelling technique.
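To make the modelling step concrete, the sketch below trains one codebook per speaker by LBG-style binary splitting with k-means refinement, and identifies a test utterance by minimum average Euclidean distortion. It is a minimal reconstruction of the general technique under our own assumptions (function names, split perturbation, iteration count), not the authors' implementation.

    import numpy as np

    def train_codebook(frames, size=32, iters=10):
        """LBG: grow a codebook by binary splitting, then refine with k-means.
        frames: (N, d) array of feature vectors from one speaker's training data."""
        codebook = frames.mean(axis=0, keepdims=True)
        while len(codebook) < size:
            codebook = np.vstack([codebook * 1.01, codebook * 0.99])  # split each codeword
            for _ in range(iters):                                    # k-means refinement
                dist = ((frames[:, None] - codebook[None]) ** 2).sum(-1)
                nearest = dist.argmin(axis=1)
                for k in range(len(codebook)):
                    if (nearest == k).any():
                        codebook[k] = frames[nearest == k].mean(axis=0)
        return codebook

    def distortion(frames, codebook):
        """Average Euclidean distance from each frame to its nearest codeword."""
        dist = ((frames[:, None] - codebook[None]) ** 2).sum(-1)
        return np.sqrt(dist.min(axis=1)).mean()

    def identify(frames, codebooks):
        """Return the speaker whose codebook quantises the test frames best."""
        return min(codebooks, key=lambda spk: distortion(frames, codebooks[spk]))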
The DTW technique chosen for experimentation is an implementation proposed by Furui [6], but here using static features only. There are no global constraints on the warp path except for fixed endpoints, and recognition is performed by summing a 'city-block' metric over a three-point unweighted symmetric warp path. Since the method of template construction is inherently TD, only TD experiments are reasonable using this technique.
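A minimal sketch of such a matcher follows: fixed endpoints, no global path constraint, and a symmetric three-point local path (diagonal, horizontal and vertical steps, all unweighted) accumulating a city-block (L1) frame metric. The final normalisation by path length is our assumption; this is a generic reconstruction, not the exact implementation of [6].

    import numpy as np

    def dtw_distance(test, ref):
        """DTW between a test utterance (T, d) and a reference template (R, d)."""
        T, R = len(test), len(ref)
        frame_dist = np.abs(test[:, None] - ref[None]).sum(-1)  # (T, R) city-block metric
        D = np.full((T, R), np.inf)
        D[0, 0] = frame_dist[0, 0]                              # fixed start point
        for i in range(T):
            for j in range(R):
                if i == 0 and j == 0:
                    continue
                # symmetric three-point local path, all transitions unweighted
                prev = min(D[i - 1, j - 1] if i and j else np.inf,
                           D[i - 1, j] if i else np.inf,
                           D[i, j - 1] if j else np.inf)
                D[i, j] = frame_dist[i, j] + prev
        return D[-1, -1] / (T + R)                              # fixed end point, length-normalised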
The hidden Markov model toolkit (HTK [7]) is used in these experiments for training and recognition, with diagonal covariance matrices assumed in the estimation of the output distributions. For TI experiments the CDHMMs are configured ergodically. For TD experiments, 'left-to-right' zero-skip constrained CDHMMs are used, which imposes a time structure on the resulting model.
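The two topologies differ only in which state transitions are permitted, as the sketch below illustrates. These are generic transition matrices, not HTK model-definition syntax, and the uniform probabilities are placeholder initial values rather than trained ones.

    import numpy as np

    def left_to_right_transitions(n_states):
        """Left-to-right, zero-skip: each state allows only a self-loop
        or a step to the next state, imposing a time structure."""
        A = np.zeros((n_states, n_states))
        for i in range(n_states - 1):
            A[i, i] = A[i, i + 1] = 0.5
        A[-1, -1] = 1.0                      # final state absorbs
        return A

    def ergodic_transitions(n_states):
        """Ergodic: every state is reachable from every other in one step,
        so no time structure is imposed (suits the TI configuration)."""
        return np.full((n_states, n_states), 1.0 / n_states)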
2.1 Speech database and pre-processing

The recognition task is performed on a subset of the BT Millar speech database. This database is collected in a quiet environment, using a high quality microphone. During collection, each speaker responds to a visual prompt to utter isolated digits (one to nine, zero, oh and nought) in a random order, a total of five times in each of the five sessions, the approximate timings of which are indicated in Figure 1. The database correspondingly contains 25 repetitions of each of the vocabulary items from each speaker. The sessions take place over a period of approximately three months, with speakers encouraged to divide sessions evenly across this period. The speech is recorded at 20 kHz using 16 bits (linear) per sample. In these experiments the data is bandpass filtered to telephone bandwidth and downsampled to 8 kHz prior to feature extraction.
Fig. 1. Illustration of the segmentation of the database, collected over a period of three months, into training and testing sets
The database is divided into training and testing sets. The first ten versions, i.e. the first two collection sessions, are reserved for training, with the remaining fifteen repetitions reserved for testing.

An incremental training data set selection is performed until all the data from the training set is exhausted; thus a series of experiments using one through ten training versions is utilised. A subset of speakers is adopted: the data from twenty males, all of approximately the same age, is used, and the vocabulary is reduced to ten digits, 1 through 9 and zero.

Mel-scale warped cepstra from a Hamming window of 32 ms with 50% overlap are used to parameterise the speech, and pooled inverse variance weighting is applied to each of the 14 cepstral coefficients.
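A sketch of this front end is given below, assuming a 16 ms frame shift (50% overlap of the 32 ms window), a 24-channel triangular mel filterbank, DCT cepstra with the zeroth coefficient dropped, and weights formed from the inverse of variances pooled over the whole training population. The filterbank size and similar details are our assumptions; the paper specifies only the window, the overlap and the 14 weighted coefficients.

    import numpy as np

    def mel_cepstra(signal, fs=8000, frame_ms=32, n_filt=24, n_ceps=14):
        """Mel-warped cepstra from Hamming-windowed frames with 50% overlap."""
        flen = int(fs * frame_ms / 1000)
        hop = flen // 2
        window = np.hamming(flen)
        # Triangular filterbank on a mel-spaced frequency grid.
        mel = lambda f: 2595 * np.log10(1 + f / 700.0)
        imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
        edges = imel(np.linspace(0, mel(fs / 2), n_filt + 2))
        bins = np.floor((flen + 1) * edges / fs).astype(int)
        fbank = np.zeros((n_filt, flen // 2 + 1))
        for k in range(n_filt):
            lo, mid, hi = bins[k], bins[k + 1], bins[k + 2]
            fbank[k, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[k, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        # DCT-II basis to decorrelate the log filterbank energies.
        dct = np.cos(np.pi / n_filt * (np.arange(n_filt) + 0.5)[:, None]
                     * np.arange(n_filt)[None])
        ceps = []
        for start in range(0, len(signal) - flen + 1, hop):
            spectrum = np.abs(np.fft.rfft(signal[start:start + flen] * window)) ** 2
            logmel = np.log(fbank @ spectrum + 1e-10)
            ceps.append((logmel @ dct)[1:n_ceps + 1])   # keep c1..c14, drop c0
        return np.array(ceps)

    def pooled_inverse_variance_weights(all_training_frames):
        """Weight each cepstral dimension by the inverse of its variance,
        pooled across the entire training population."""
        return 1.0 / np.var(all_training_frames, axis=0)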
3. MODEL PARAMETERS

In DTW the model parameters are fully determined by the training data and the vocabulary.

In contrast, in the cases of VQ and CDHMMs, decisions on model parameters need to be made. For VQ the primary factor is the codebook size. The CDHMM case is less straightforward, since the model topology is also a variable; as a consequence, two primary parameters are the number of states and the number of mixtures in each state.

In the following sections we look at recognition performance for VQ and CDHMMs in terms of the respective model parameters, for both TI and TD conditions.
3.1 Text-independent parameters

Results from experiments with various CDHMM topologies trained with 1, 5 and 10 versions are summarised in Figure 2a. From these results it is noticed that the performance of the model correlates highly with the total number of mixtures in the model, i.e. the number of states times the number of mixtures per state. The trend shown here for 10-version training (10vt) has also been observed by others [2][8], also in TI experiments. The profiles for 1vt, and to a lesser extent 5vt, show the effects of insufficient training data on parameter estimation.

Given that near-optimal performance lies in the three profiles of the thirty-two-mixture equivalent models (2m16s, 8m4s and 32m1s), the 32m1s configuration is chosen for subsequent text-independent experiments, a form which has been used by other researchers [2][8][9]. For a comparison we require the second modelling technique, VQ, to be of a similar size. Figure 2b shows codebook performance as a function of the codebook rate and the amount of training data. Noticeable performance differences occur between 16 and 32 codewords, and between 32 and 64 codewords. Above 64 codewords, performance improvements are small. A 32-element VQ codebook is chosen, despite its slightly sub-optimal performance, to be similar in size to the CDHMM.

Fig. 2. TI parameter variation: %error against (a) total number of mixtures for ergodic CDHMMs, (b) number of training versions for VQ codebooks (notation: 32m1s = 32 mixtures, 1 state; 1vt = 1 version training)
3.2 Text-dependent parameters

Results for the corresponding TD experiments are shown for CDHMM and VQ in Figure 3a and Figure 3b respectively. For VQ, the performance again improves with the codebook size, with little improvement beyond a codebook size of 8.

For the text-dependent CDHMM (Figure 3a), the TD results show a clear minimum region when the total number of mixtures is between 8 and 16. Similar curves are observed for different amounts of training data. Within this region the state/mixture combinations which give the best performances are 5m2s, 2m6s and 1m8s, suggesting that performance is little affected by the state transition parameters of the CDHMM.

Hence in the TD case, an 8-element VQ codebook and a constrained 8-state, single-mixture CDHMM are chosen to compare with the DTW modelling technique.

Fig. 3. TD parameter variation: %error against (a) total number of mixtures for constrained CDHMMs (10vt), (b) number of training versions for VQ codebooks (notation: 1m8s = 1 mixture, 8 states)

4. PERFORMANCE COMPARISON

TI performance: Figure 4a illustrates the identification performance of a 32-element VQ codebook and a 32-mixture, single-state CDHMM. For 1- and 2-version training VQ performs better than the CDHMM, but for 7-, 8-, 9- and 10-version training the CDHMM outperforms the simpler modelling technique. Between these two regions the performance of the two classifiers is essentially the same. Clearly the CDHMM requires more training data than an equivalently sized VQ.

Fig. 4. Performance comparisons for (a) TI and (b) TD against number of training versions (curves: (a) 32 VQ and 32m1s CDHMM; (b) 8 VQ, DTW and 1m8s CDHMM)
TD performance: Figure 4b illustrates the identification performance of an 8-element VQ codebook, DTW and a single-mixture, 8-state CDHMM. DTW is consistently the best performer. The VQ and CDHMM show similar trends to those of the TI experimental results (Figure 4a), with a cross-over in the region of 6-version training beyond which the CDHMM gives a lower error rate than VQ. Performances for the three approaches converge with an increasing number of training versions. McNemar's test with a 95% confidence level is considered at the 1-, 5- and 10-training-version points. In summary, we can conclude that with 1-version training the difference between VQ and DTW is not significant. However, at 5- and 10-version training the superiority of DTW over both VQ and CDHMM is significant.
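For reference, McNemar's test compares two classifiers on the same trials by counting the discordant pairs, i.e. trials that exactly one of the two classifiers gets right. A minimal sketch follows, assuming per-trial correctness vectors and the usual chi-squared approximation with continuity correction; the paper does not state its exact test construction.

    import numpy as np
    from scipy.stats import chi2

    def mcnemar(correct_a, correct_b, alpha=0.05):
        """McNemar's test on paired trial outcomes of two classifiers.
        correct_a, correct_b: boolean arrays, one entry per test trial."""
        a, b = np.asarray(correct_a), np.asarray(correct_b)
        n01 = int(np.sum(a & ~b))    # trials A gets right and B gets wrong
        n10 = int(np.sum(~a & b))    # trials B gets right and A gets wrong
        # Chi-squared statistic with continuity correction, 1 degree of freedom.
        stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10) if n01 + n10 else 0.0
        p = 1 - chi2.cdf(stat, df=1)
        return p, p < alpha          # p-value and significance at the given level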
4.1 Comments on TD and TI

In both VQ and CDHMM experiments TD performance is better than TI. Table I emphasises the point of TD superiority by comparing values for the best text-dependent and text-independent VQ codebooks, irrespective of size. Table II reinforces this difference across the three modelling techniques by ordering the best overall models for 1, 5 and 10 training versions.

No. of training versions   TI 256 VQ   TD 16 VQ
1                          24.5%       23.5%
5                          11.8%       9.2%
10                         5.2%        3.8%

TABLE I. Summary of best TI and TD VQ performance

No. of training versions   First           Second           Third
1                          TD DTW 21.5%    TD 16 VQ 23.5%   TI 32m1s CDHMM 35.83%
5                          TD DTW 5.8%     TD 32 VQ 9.1%    TD 1m8s CDHMM 13.4%
10                         TD DTW 2.8%     TD 32 VQ 3.8%    TD 1m8s CDHMM 3.9%

TABLE II. Summary of best overall performance with 1, 5 and 10 training versions (NB the best results are all TD)

It is noticed from Figure 3a for TD CDHMMs that, irrespective of the structure of the CDHMM, it is the total number of mixtures in the model which dominates the performance characteristic. As mentioned above, in the region of good performance, 8 to 16 mixtures, the best topologies include 5m2s, 2m6s and 1m8s, suggesting any influence of transition information is negligible. This leads to the conclusion that subdividing the training data to form multiple models for a speaker is a principal criterion for improving performance, whether a VQ or a CDHMM is chosen. This subclassification also explains the good performance of the DTW technique, where the problem is inherently divided according to the vocabulary and acoustic segments of the training data.
4.2 Digit performance

Text-dependent DTW digit performance is illustrated in Figure 5a, which plots the digits (from worst to best performance at 4 training versions) 4, 2, 7, 3 and zero, and in Figure 5b, which plots the digits (again from worst to best performance at 4 training versions) 8, 6, 5, 1 and 9.

Fig. 5. DTW text-dependent digit performance for (a) 2, 3, 4, 7 and zero and (b) 1, 5, 6, 8 and 9

The best digit across the range of training versions is zero (this is found to be true for both VQ and CDHMMs, although not shown). The good performance can be attributed to (i) its length: zero is found to be on average the longest utterance, hence presenting more information for both testing and training, and (ii) the voiced fricative of its first phoneme, shown by Parris and Carey [10] to be a particularly useful phoneme in speaker recognition.

Consistently good performance across the various training versions is also illustrated for digits 1 and 9. The worst performers are digits 4, 8, 6 and 2.

A large variation is observed across the digits. For example, Figure 5a shows digit 4 performing badly, while digit zero performs well across all training versions, with a performance difference of 6.3% at their closest point. Hence, in a password system consisting only of digits, judicious choice could significantly improve performance.
5. CONCLUSION

Perhaps the most surprising overall finding presented in this paper is the superior performance of DTW over both VQ and the CDHMM. As mentioned above, the CDHMM performance is likely to improve with certain parameters estimated from pooled data, with only the means being updated on a speaker-specific basis.

This can be viewed as one step in moving the CDHMM towards a DTW or a VQ approach and, continuing in this vein, DTW may be viewed as merely a degenerate case of the CDHMM. In turn, the VQ approach may be regarded as a degenerate case of DTW. Considering first the latter pair, the essential difference between VQ and DTW is the inherent time-alignment within DTW, and the results indicate that some speaker-specific time-sequence information within speech, completely lost in VQ, is captured by DTW. In contrast, the lack of recognition sensitivity to the number of CDHMM states suggests that the state transition probabilities do not themselves contribute to discrimination, but serve merely to align speech events to states.

These observations raise the question of how a CDHMM might be customised to harness the time-sequence information, thereby equalling or outperforming the DTW approach. This can be done by assigning each observation in the reference template as a state, and flattening the variances and the transition probabilities. This does not, however, prevent bad parameter estimates with small amounts of training data using existing algorithms.
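A sketch of this construction is given below, assuming single-Gaussian, diagonal-covariance states: each template frame supplies the mean of one state, while the variances and transition probabilities are set to shared, flat values so that only the means carry speaker-specific information. This is our illustration of the idea as stated, not an implementation evaluated in the paper.

    import numpy as np

    def template_to_cdhmm(template, flat_var=1.0):
        """Build a degenerate left-to-right CDHMM from a DTW template:
        one single-Gaussian state per template frame, with flattened
        (shared) variances and flattened transition probabilities."""
        n_frames, dim = template.shape
        means = template.copy()                         # state k emits N(frame_k, flat_var * I)
        variances = np.full((n_frames, dim), flat_var)  # flattened: no per-state variance
        A = np.zeros((n_frames, n_frames))
        for i in range(n_frames - 1):
            A[i, i] = A[i, i + 1] = 0.5                 # flattened: stay or advance, equally likely
        A[-1, -1] = 1.0
        return means, variances, A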
An alternative question is how to customise a DTW system, which is shown to harness some time-sequence information, into a more generalised form. One answer is to apply variances to each observation in the reference template, and through an adaptive training procedure adjust the variances accordingly for each speaker. This would utilise the advantages of the DTW approach, which provides good performance with small amounts of training data.

6. ACKNOWLEDGEMENTS

The authors wish to thank BT Labs for the use of the Millar database, and for their continuing financial support for this work.

REFERENCES

[1] D. A. Irvine and F. J. Owens. A comparison of speaker recognition techniques for telephone speech. Proc. Eurospeech-93, 3:2275-2278, 1993.
[2] T. Matsui and S. Furui. Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs. IEEE Trans. Speech and Audio Processing, 2:456-459, 1994.
[3] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Communications, 28:84-95, 1980.
[4] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang. A vector quantization approach to speaker recognition. Proc. ICASSP-85, 1:387-390, March 1985.
[5] J. T. Buck, D. K. Burton, and J. E. Shore. Text-dependent speaker recognition using vector quantisation. Proc. ICASSP-85, 1:391-394, 1985.
[6] S. Furui. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoustics, Speech and Signal Processing, 29:254-272, 1981.
[7] S. J. Young and P. C. Woodland. HTK: Hidden Markov Model Toolkit V1.4 User Manual. Cambridge University Engineering Department, Speech Group, 1992.
[8] X. Zhu, Y. Gao, S. Ran, F. Chen, I. Macleod, B. Millar, and M. Wagner. Text-independent speaker recognition using VQ, mixture Gaussian VQ and ergodic HMMs. Proc. ESCA-1994, pages 55-58, 1994.
[9] J. de Veth and H. Bourlard. Comparison of hidden Markov model techniques for speaker verification. Proc. ESCA-94, 1994.
[10] E. S. Parris and M. J. Carey. Discriminative phonemes for speaker identification. Proc. ICSLP-94, 4:1843-1846, 1994.
