Documente Academic
Documente Profesional
Documente Cultură
SPEECH RECOGNITION
Accuracy (%)
short-time analysis is performed, where the instantaneous envelope 45
R t +T
is averaged and log compressed A = log( t00 [a(t)]2 dt), whereas
for the frequency estimation an amplitude weighting is performed 40
[8] R t0 +T
t
f (t)[a(t)]2 dt 35
Fw = 0R t0 +T (3) LogAmpl
Freq
t0
[a(t)]2 dt Bandwidth
For the bandwidth estimation both frequency and amplitude 30
0 5 10 15 20 25 30 35
components are considered, and similar weighting with the squared number of bands
There is a close relationship of the above frequency and band- Considering the number of analysis bands, previous studies (e.g.
width estimates, with the first and second normalized spectral mo- [3]) have reported an optimal number of analysis bands, beyond
ments. More specifically under certain conditions they are consid- which there is a degradation in the recognition performance. This re-
ered to be equivalent [1]. The general n-th spectral moment of a sult is replicated in Fig. 1, where the short-time frequency and band-
short-time resonant signal rk (t), corresponding to the output of the width feature recognition rates are plotted in relation to the number
k-th filter in a filterbank analysis, is defined as of analysis bands (TIMIT phone recognition task, see also the next
Z section). For comparison, the recognition rates using energy-based
Skn = |Rk ()| n d (8) features are also plotted (log amplitude without DCT).
0 We can see that energy- and frequency-based features have sim-
where Rk () is the fourier transform of rk (t). The n-th normalized ilar performance until around 12 to 16 filters. Further increase in
spectral moment is defined as the number of filters gives no improvement for the energy features,
whereas for the frequency features a serious degradation is observed.
Nkn = Skn /Sk0 (9) Bandwidth features, in general, have lower performance, and similar
The standard MFCC features can be considered as the DCT of the behavior to frequencies. For the frequency and bandwidth features,
(log) zero order spectral moment (S 0 ) with exponent = 2. The the degradation is due to the narrowing of the filters, since their over-
first normalized spectral moment (N 1 ) has also been used for speech lap is kept at 50%. The bandwidth reduction results in a high influ-
recognition [3], termed as Spectral Subband Centroids. ence from the harmonics of the fundamental frequency3 . This is es-
2 DESA is actually a family of efficient algorithms that use various dis- 3 If a filter is narrow including only one strong harmonic, the frequency
crete time approximations of the continuous TEO [9, 10]. estimate Fw is dominated by the frequency of the harmonic.
Correlation of Log Amplitude features (test/dr1/faks0/sa1) Correlation for DCT of Log Ampl. features (test/dr1/faks0/sa1)
pecially pronounced in the lower and phonetically critical bands if a
2 2
log scale is used (such as in our case), where the bandwidth becomes 4 4
comparable to the interharmonic distance. This is probably one of 6 6
the reasons that some previous efforts involving frequency features 8 8
alent overlap based on the energy overlap4 , which for the triangular 14 14