
SPEECH FEATURE EXTRACTION USING INDEPENDENT COMPONENT ANALYSIS

Jong-Hwan Lee (1), Ho-Young Jung (2), Te-Won Lee (3), Soo-Young Lee (1)

(1) Brain Science Research Center and Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Gu, Taejon, 305-701 Korea (TEL: +82-42-869-8031, FAX: +82-42-869-8570, E-mail: jhlee@neuron.kaist.ac.kr)
(2) Electronics and Telecommunications Research Institute, 161 Kajong-dong, Yusong-Gu, Taejon, 305-350, Korea
(3) Computational Neurobiology Laboratory, The Salk Institute, 10010 N. Torrey Pines Road, La Jolla, California 92037, USA, and the Institute for Neural Computation, University of California, San Diego, USA
ABSTRACT

In this paper, we propose new speech features obtained by applying independent component analysis to human speech. When independent component analysis is applied to speech signals for efficient encoding, the adapted basis functions resemble Gabor-like features. The trained basis functions contain some redundancy, so we select a subset of them by a reordering method. The basis functions are ordered almost monotonically from low-frequency to high-frequency vectors, which is consistent with the fact that human speech carries much more information in the low-frequency range. These features can be used in automatic speech recognition systems, and the proposed method gives much better recognition rates than conventional mel-frequency cepstral features.

1. INTRODUCTION

Speech signals are characterized by higher-order statistical structure. Independent component analysis (ICA) has been used to extract feature vectors based on these higher-order statistics from natural scenes and musical sound [1], [2]. These features are localized in both time (space) and frequency. However, no such features have previously been extracted from human speech for speech recognition. In this paper, we report the extraction of Gabor-like features from natural human speech. The extracted speech features look like bandpass filters in that they have center frequencies and limited bandwidths. In many filter-bank approaches, bandpass filters are designed to have mel-scaled center frequencies by a mathematical procedure, and their bandwidths are likewise determined by abstract mathematical properties. In auditory-model feature extraction, the filter bank instead mimics the characteristics of the basilar membrane (BM). In the inner ear's cochlea, the input speech signal induces mechanical vibration of the basilar membrane, and each position along the membrane responds to localized frequency information in the speech signal.
In auditory-based feature extraction, each bandpass filter is therefore modeled on these frequency characteristics of the basilar membrane.
This research was supported by the Korean Ministry of Science and Technology under the Brain Science & Engineering Research Program.

In contrast, the basis vectors trained in this paper reflect the statistical properties of the input speech better than other filter-bank methods. For each time frame, feature coefficient vectors are obtained using the trained basis vectors. Finally, recognition rates with the ICA-based features are compared to those with mel-frequency cepstral coefficients (MFCCs) on isolated-word recognition tasks.

2. EXTRACTING SPEECH FEATURES USING ICA

To extract independent feature vectors from speech signals, an ICA algorithm is applied to a large number of human speech segments. An ICA network is trained to obtain independent components u from a speech segment x, and the trained weight matrix W maps x to the basis-function coefficients u. ICA assumes the observation x is a linear mixture of the independent components u. If A denotes the inverse matrix of W, then the columns of A are the basis feature vectors of the observation x:

u = W x,    x = A u
To extract the basis functions, one trains either the mixing matrix A or the unmixing matrix W; here we trained the unmixing matrix W.
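As an illustration of this linear model (a minimal numpy sketch, not the authors' code; the dimension N = 4 and the random A are assumptions for the demo), the coefficients u are recovered from an observation x by the inverse of the mixing matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                          # illustrative segment size; the paper uses N = 50

# Hypothetical mixing matrix A whose columns are the basis functions
A = rng.normal(size=(N, N))
W = np.linalg.inv(A)           # unmixing matrix: W = A^{-1}

u = rng.laplace(size=N)        # sparse (Laplacian) source coefficients
x = A @ u                      # observed speech segment: x = A u
u_rec = W @ x                  # recovered coefficients: u = W x

assert np.allclose(u, u_rec)
```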
Figure 1: ICA network for training the basis vectors (inputs x_1, ..., x_N; sources u_i = (Wx)_i; outputs y_i = g(u_i)).

The learning rule is based on maximization of the joint entropy H(y), and is given by [3]

ΔW ∝ ∂H(y)/∂W                                          (1)
   = [W^T]^(-1) + ( (∂p̂(u)/∂u) / p̂(u) ) x^T            (2)

where p̂(u) denotes the approximation of the probability density function of the speech signal, p̂(u_i) = ∂y_i/∂u_i = ∂g(u_i)/∂u_i. Here, g(u) is a nonlinear function that approximates the cumulative distribution function of the source signal u [3]. The natural gradient is introduced to improve the convergence speed [4]. In particular, it does not require inverting the matrix W, and provides the following rule:

ΔW ∝ (∂H(y)/∂W) W^T W = [I − φ(u) u^T] W,              (3)

where φ(u) is related to the source probability density function and is called the score function.
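The natural-gradient rule of Eq. (3) with a sign score function can be sketched in numpy as follows (an illustrative batch implementation; the segment size, batch size, and synthetic Laplacian data here are placeholders, not the paper's settings):

```python
import numpy as np

def infomax_update(W, X, lr=0.001):
    """One natural-gradient step of Eq. (3): dW = (I - phi(u) u^T) W,
    averaged over a batch of segments stored as the columns of X."""
    N, batch = W.shape[0], X.shape[1]
    U = W @ X                          # estimated source coefficients
    phi = np.sign(U)                   # score function for Laplacian sources
    dW = (np.eye(N) - (phi @ U.T) / batch) @ W
    return W + lr * dW

rng = np.random.default_rng(1)
W = np.eye(8)                          # identity initialization, as in Sec. 4
X = rng.laplace(size=(8, 100))         # stand-in for whitened speech segments
for _ in range(200):
    W = infomax_update(W, X)
assert np.isfinite(W).all()
```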

Figure 3: (a) Ordered basis vectors (column vectors of the mixing matrix A); (b) their frequency spectra.

The ICA network is composed of N inputs and N outputs, and N basis vectors are obtained as the columns of the N-by-N matrix A (A = W^(-1)).

3. SELECTION OF DOMINANT FEATURE VECTORS

For speech recognition, one may select dominant feature vectors from the N basis vectors. The ICA algorithm finds as many independent components as the dimensionality of the input, which may result in redundant components. Several techniques have been proposed to reduce this redundancy [5]. In this paper, the contribution of the basis vectors to the speech signal and the variability of the basis-vector coefficients are considered. The contribution measures the power of a basis vector in the speech signals, and the L2-norm (||a_i||, where a_i is the i-th column vector of A) represents the relative importance of the basis vectors. Therefore, from the N basis vectors ordered by decreasing L2-norm, M dominant feature vectors can be selected. The variability denotes the variance of the basis-vector coefficients, which likewise indicates the relative importance of the basis vectors for recognizing speech signals. Fig. 4(a) shows the L2-norm of the reordered basis vectors.

Figure 2: (a) Ordered row vectors of the unmixing matrix W; (b) their frequency spectra.

Using the learning rule in Eq. (3), W is iteratively updated in a gradient-ascent manner until convergence. Let N denote the size of the speech segments, which are randomly drawn from the training speech signals. Fig. 1 shows the basis-vector training network.


Fig. 4(b) shows the coefficient variance of the corresponding basis vectors. The two ordering methods yield almost the same basis-vector order, and basis vectors beyond about the 30th are negligible in both contribution and variability. The selected M feature vectors constitute an M-channel filter bank and provide a spectral vector for every time frame.
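The two orderings (contribution by column L2-norm, variability by coefficient variance) can be sketched as follows; the random A and the synthetic segments are stand-ins for the trained mixing matrix and the real speech data:

```python
import numpy as np

def order_basis_vectors(A, X):
    """Return basis-vector orderings by contribution (column L2-norm of A)
    and by variability (variance of the coefficients u = W x)."""
    W = np.linalg.inv(A)
    U = W @ X                                  # coefficients for all segments
    by_norm = np.argsort(-np.linalg.norm(A, axis=0))
    by_var = np.argsort(-U.var(axis=1))
    return by_norm, by_var

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 50))                  # stand-in for the trained A
X = A @ rng.laplace(size=(50, 1000))           # synthetic speech segments
by_norm, by_var = order_basis_vectors(A, X)
A_dominant = A[:, by_norm[:20]]                # keep the top M = 20 vectors
assert A_dominant.shape == (50, 20)
```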
Figure 5: (a) Center frequencies of the ICA-trained basis vectors; (b) center frequencies of the MFCC filter banks.

Figure 4: (a) Basis-vector contributions to the speech signals; (b) variance of the basis-vector coefficients.

The choice of a sign score function corresponds to assuming a Laplacian distribution of real speech-signal components. It improves the coding efficiency of the speech signals, since most of the coefficients in u are almost zero and only a few important pieces of information about the speech signal are encoded in the tails of the Laplacian distribution. Three hundred sweeps through the whole segment set were performed, and W was updated every 100 input speech segments. The learning rate was fixed at 0.001 during the first 100 sweeps, 0.0005 during the next 100 sweeps, and 0.0001 during the last 100 sweeps. The obtained unmixing vectors are shown in Fig. 2(a), and Fig. 2(b) shows their frequency magnitude spectra. Finally, 50 basis vectors were obtained from the columns of the inverse matrix of W. Figs. 3(a) and (b) show the 50 basis vectors and their frequency magnitude spectra, ordered by their contribution to the speech signals. The corresponding contribution values are shown in Fig. 4(a). Many of the learned basis functions are localized in both frequency and time and resemble Gabor-like filters. In Fig. 3(b), the basis vectors are ordered almost monotonically from low-frequency to high-frequency components, which results from the relatively larger energy at low frequencies in human speech.

4. TRAINING USING REAL DATA AND RECOGNITION EXPERIMENTS

To train the basis vectors from natural human speech, 75 phonetically balanced Korean words uttered by 59 speakers were used. Speech segments of 50 samples each, i.e., 3.1 ms intervals at a 16 kHz sampling rate, were randomly generated. A total of 10^5 segments were generated, and each segment was pre-whitened to improve convergence speed [1]. The whitening filter W_z is:


W_z = < (x − <x>) (x − <x>)^T >^(−1/2)

This removes both first- and second-order statistics from the input data x and makes the covariance matrix of x equal to the identity matrix. The unmixing matrix W of the ICA network was then obtained from the speech segments by the learning rule of Eq. (3). W was initialized to the identity matrix, and the score function φ(u) was assumed to be the sign function.
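The whitening step can be sketched by an eigendecomposition of the covariance matrix (an assumed implementation; the paper does not specify how the inverse square root was computed):

```python
import numpy as np

def whitening_matrix(X):
    """W_z = <(x - <x>)(x - <x>)^T>^(-1/2), via eigendecomposition."""
    Xc = X - X.mean(axis=1, keepdims=True)     # remove first-order statistics
    C = (Xc @ Xc.T) / Xc.shape[1]              # covariance matrix
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 5000)) * np.arange(1.0, 11.0)[:, None]
Wz = whitening_matrix(X)
Xw = Wz @ (X - X.mean(axis=1, keepdims=True))
# covariance of the whitened data is the identity matrix
assert np.allclose((Xw @ Xw.T) / Xw.shape[1], np.eye(10), atol=1e-6)
```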

Figure 6: Block diagram of feature extraction: the input speech signal is projected onto the N basis functions, M of them are selected, and frame analysis of the M channel outputs yields a frame-vector sequence of M spectral components.

The ICA-based features were applied to an isolated-word recognition task. The vocabulary consists of 75 Korean words, uttered once by 38 speakers to form the training data and by 10 speakers to form the test data. Whole-word models were used for classification, each represented by a 15-state left-to-right continuous-density hidden Markov model (CDHMM). Speech features were extracted with the top M feature vectors in Fig. 3(a) over a 30 ms time window every 10 ms. The M spectral component energies were scaled logarithmically, and 13 cepstral coefficients were extracted. Fig. 6 shows the block diagram of this feature-extraction process. For comparison, standard MFCC features were extracted using a filter bank with the 18 mel-scaled center frequencies shown in Fig. 5(b); the 18 logarithmically scaled spectral components were transformed into 13 cepstral coefficients by the discrete cosine transform (DCT). Figs. 5(a) and (b) show the center frequencies of the ICA basis vectors and of the MFCC filter bank. In contrast to the MFCC filter bank, the ICA basis vectors have approximately linearly distributed center frequencies; the 30th basis vector has a center frequency of about 4600 Hz, and this range provides sufficient spectral information to recognize the speech signals.
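The cepstral back end described above (log channel energies followed by a DCT down to 13 coefficients) can be sketched as follows; the DCT-II basis and the M = 20 random channel energies are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def cepstra(log_energies, n_cep=13):
    """DCT-II of the log channel energies -> cepstral coefficients."""
    M = len(log_energies)
    k = np.arange(n_cep)[:, None]
    n = np.arange(M)[None, :]
    basis = np.cos(np.pi * k * (n + 0.5) / M)  # DCT-II basis, (n_cep x M)
    return basis @ log_energies

rng = np.random.default_rng(4)
log_e = rng.normal(size=20)                    # hypothetical M = 20 channels
c = cepstra(log_e)
assert c.shape == (13,)
```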

Table 1: Recognition results.

Method               Error rate (%)
MFCC                 3.8
Proposed, 10 basis   5.1
Proposed, 20 basis   2.0
Proposed, 30 basis   2.4
Proposed, 40 basis   3.9
Proposed, 50 basis   4.3

Table 1 shows the performance of the standard MFCC features and of the proposed feature-extraction method for various values of M. When 20 feature vectors were selected in the contribution order shown in Fig. 4(a), the proposed method yielded a 47.4% error reduction relative to the standard MFCC. This result shows that only a few active coefficients of u are sufficient for encoding the speech signal.

5. CONCLUSION

In this paper, we proposed new speech features for speech recognition using the information-maximization ICA algorithm. ICA was successfully applied to extract features that efficiently encode speech signals. Many of the extracted features are localized in both time and frequency and closely resemble Gabor filters. These speech features form a new filter bank, and each filter provides localized spectral components for every time frame of the input speech. The new features demonstrated better recognition performance than the standard MFCC features; the ICA basis vectors capture the higher-order structure of speech signals better than the MFCC filter bank.

6. REFERENCES

[1] Bell A.J. and Sejnowski T.J.: The independent components of natural scenes are edge filters, Vision Research, 1997, 37, (23), pp. 3327-3338
[2] Bell A.J. and Sejnowski T.J.: Learning the higher-order structure of a natural sound, Network: Computation in Neural Systems, 1996, 7, pp. 261-266
[3] Lee T.W.: Independent Component Analysis - Theory and Applications, Kluwer Academic Publishers, 1998
[4] Amari S., Cichocki A., and Yang H.: A new learning algorithm for blind signal separation, Advances in Neural Information Processing Systems, 1996, 8, pp. 757-763
[5] Bartlett M.S., Lades H.M., and Sejnowski T.J.: Independent component representations for face recognition, Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology; Conference on Human Vision and Electronic Imaging III, San Jose, California, January 1998
[6] Oja E.: The nonlinear PCA learning rule in independent component analysis, Neurocomputing, 1997, 17, (1), pp. 25-46
[7] Olshausen B.A. and Field D.J.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, 1996, 381, pp. 607-609
