
A Comparison of Combined Classifier Architectures for Arabic Speech Recognition

E. M. Essa¹, A. S. Tolba¹ and S. Elmougy²

¹ Dept. of Computer Science, Faculty of Computer and Information Sciences, Mansoura University, Mansoura 35516, Egypt
² Dept. of Computer Science, College of Computer and Information Sciences, King Saud Univ., Riyadh 11543, Saudi Arabia
ehab_essa@mans.edu.eg, astolba@mans.edu.eg, mougy@ccis.ksu.edu.sa

Abstract-Combined classifiers offer a solution to pattern classification problems that arise from variation in the data acquisition conditions, in the signal representing the pattern to be recognized, and in the classifier architecture itself. This paper studies the effect of classifier architecture on the overall performance of an Arabic speech recognition system. Five different proposed combined classifier architectures are studied and their performance is compared. Boosting is another type of combined classifier that can improve the performance of almost any learning algorithm. We investigate the effect of combining neural networks with AdaBoost.M1 and propose an enhancement of the AdaBoost.M1 algorithm. It is found that the proposed enhanced AdaBoost.M1 outperforms both the ensemble-based architectures and the modular architectures.

I. INTRODUCTION

More recently, it has become apparent that there are many tasks which cannot be effectively solved by training a single classifier. There are two main approaches for combining classifiers: the ensemble-based approach (sometimes called the committee framework) and the modular approach [1]. In the ensemble-based approach, a set of classifiers is trained on what is essentially the same task, and the outputs of the networks are then combined; the aim is to obtain a more reliable and accurate ensemble output than would be obtained by selecting the best single classifier. This can be contrasted with the modular approach, in which a problem is decomposed into a number of subtasks; such decomposition may be accomplished either by explicit means or automatically [1].

Speech recognition is the process of finding the sequence of meaningful words that were spoken. The main goal is to convert a given speech signal into text written by a computer. Speech recognition is a special case of pattern recognition, which includes two phases: training and testing. The extraction of features relevant for classification is common to both phases. During the training phase, the parameters of the classification model are estimated; during the testing or recognition phase, the features of test patterns are matched against the trained model of each class [2]. A number of factors affect automatic speech recognition, such as room acoustics, background noise, speaker or speech variability, and microphone characteristics. These speech variations create mismatches between the training data and the test data and lead to a decrease in system performance.

In this paper, we focus on Arabic speech recognition. Arabic is a Semitic language and one of the oldest languages in the world; it is considered the sixth most widely used language today [3]. Standard Arabic has 34 basic phonemes, of which six are vowels and 28 are consonants. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has at least 12 vowels [4]. Researchers face many problems such as different dialects, homophones, rich morphology and the absence of diacritization. Arabic is a collection of different dialects (e.g., Gulf Arabic, Egyptian Arabic, North African Arabic). In general, there are two classes of Arabic: classical Arabic and modern standard Arabic (MSA). In this paper, we are concerned with MSA.

Many techniques have been used in speech recognition, such as DTW, HMM and neural networks. This paper investigates combined classifier techniques. A combined classifier is a set of diverse individual classifiers, trained on the same data or on subsets of the data, whose outputs are combined to obtain the final decision. A set of different combined classifier architectures based on neural networks has been designed and implemented. Boosting is a common method to combine a set of weak classifiers; here, we propose an enhancement of the well-known AdaBoost.M1 algorithm with a neural network as the base classifier.

II. PREPROCESSING

The speech signals are sampled at 16 kHz with 16-bit quantization. Some preprocessing is applied to the speech data to transform it into a format that can be processed more easily and effectively in the subsequent stages; the signal goes through the following steps.

A. End-Point Detection

End-point detection separates raw speech in a continuously recorded signal from other parts of the signal (e.g., background noise). Manual segmentation is preferable when creating a speech database because it is more accurate and controlled, whereas automatic segmentation is suitable for real-time systems. Since there is no standard database for modern standard Arabic [5], a database of spoken Arabic words is built here and manual segmentation is used to separate speech from background noise.

B. Pre-emphasis

The signal-to-noise ratio is a measure of signal strength relative to background noise. Pre-emphasis is used to improve the overall signal-to-noise ratio by increasing the magnitude of the high frequencies with respect to the overall signal spectrum. Pre-emphasis can be applied after digitization of the speech signal through the first-order Finite Impulse Response (FIR) filter:

H(z) = 1 - α z^(-1)    (1)



where α is the pre-emphasis parameter, set to a value close to 1 (here 0.93), which gives more than 20 dB of amplification of the high-frequency spectrum [6].

C. Frame Blocking

Spectral evaluation is reliable only when the signal characteristics are invariant with respect to time; for speech this holds only within short time intervals, called frames [6]. Each speech signal is divided into a fixed number of frames of variable length to deal with the non-uniform word lengths in the recordings, since the input layer of an LVQ or Back-Propagation neural network must have the same size for every input vector.

D. Windowing

Each frame is multiplied by a Hamming window to gradually attenuate the amplitude at both ends of the extraction interval and prevent an abrupt change at the endpoints [7]. The Hamming window is defined as:

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  n = 0, ..., N - 1    (2)

III. FEATURE EXTRACTION

Feature extraction is an essential pre-processing step for speech recognition. It transforms the raw signal data into higher-level characteristic variables. Several feature extraction techniques exist, such as FFT, LPC, the real cepstrum, MFCC and perceptual linear prediction. In this paper, we use the MFCC method for feature extraction.

A. Mel Frequency Cepstrum Coefficients (MFCC)

MFCCs are commonly derived as follows [8]: the Fourier transform is applied to the windowed speech segment; the spectrum is warped along its frequency axis onto the mel-frequency scale and passed through a bank of 24 triangular band-pass filters; and the Discrete Cosine Transform of the list of mel log-amplitudes is taken, as if it were a signal:

X_k = Σ_{n=0}^{N-1} x_n cos(π (n + 1/2) k / N),  k = 0, ..., L    (3)

where N is the number of triangular band-pass filters, L is the number of mel-scale cepstral coefficients and x_n is the logarithm of the weighted average of the speech energy in the bandwidth defined by the n-th mel filter. Delta parameters are then computed by taking the difference of the coefficients between consecutive frames.
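To make the chain above concrete, the following is a minimal NumPy sketch of the preprocessing and MFCC steps (the pre-emphasis of Eq. (1), a fixed frame count, Hamming windowing, a 24-filter mel bank, the DCT of Eq. (3) and the delta parameters). It is only an illustration under assumptions, not the authors' implementation: the mel filter bank uses a standard textbook construction, and the FFT size and other numerical details are placeholders.

```python
import numpy as np

def preemphasize(signal, alpha=0.93):
    """First-order FIR pre-emphasis, Eq. (1): y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, n_frames=20):
    """Split a word into a fixed number of frames (of variable length) and apply a Hamming window."""
    return [f * np.hamming(len(f)) for f in np.array_split(signal, n_frames)]

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    """Triangular mel band-pass filters (standard textbook construction, assumed here)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc(frame, fb, n_ceps=12, n_fft=512):
    """Log mel energies followed by the DCT of Eq. (3); the first n_ceps coefficients are kept."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    x = np.log(fb @ power + 1e-10)          # x_n: log energy in each mel band
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * (n + 0.5) * k / N)) for k in range(1, n_ceps + 1)])

def deltas(cepstra):
    """Delta parameters: difference of coefficients between consecutive frames."""
    c = np.asarray(cepstra)
    return np.diff(c, axis=0, prepend=c[:1])
```

Stacking the 12 MFCCs and 12 deltas of each of the 20 frames then gives the fixed-length pattern that the networks described below are trained on.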

IV. NEURAL NETWORKS

An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. In this paper, Back-Propagation and LVQ neural networks are used as the basic units of the combined classifiers. A Back-Propagation network is a feed-forward multi-layer network with an input layer, an output layer, and at least one hidden layer. It propagates the input through all layers to the output layer, where the errors are determined; these errors are then propagated back through the network from the output layer to the preceding layers [9]. LVQ is a supervised version of Kohonen's SOM: a feed-forward net with a single hidden layer of neurons, fully connected with the input layer, which applies a nearest-neighbor rule and a winner-takes-all paradigm [10].

V. PROPOSED COMBINED CLASSIFIER ARCHITECTURES

A combined classifier is based on a number of neural-network classifiers; each classifier must be capable of assigning one of the classes to a given pattern, and the outputs of the classifiers are then combined to obtain the final decision. Different architectures are examined in this work, covering both the ensemble-based approach and the modular approach. Obviously, the neural networks should differ from each other; there is no advantage in combining identical nets. Here, we implement a number of ensembles by varying the network architecture, type, weights and training data. The first and second ensembles are examples of the ensemble-based approach; the third, fourth and fifth ensembles are examples of the modular approach.

The first ensemble is designed using a set of individual Back-Propagation neural networks with different parameters: different numbers of hidden neurons, learning rates and numbers of epochs (Fig. 1). Each classifier starts from random initial weights. The same training data is applied to each individual Back-Propagation network and the final decision is obtained using a simple majority vote rule.

Figure 1. BKP Based Combined Classifier (block diagram: speech signal -> pre-processor -> BKP_1, BKP_2, ..., BKP_n -> voting -> final decision).
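As an illustration of this first ensemble, the sketch below trains several differently parameterized networks on the same data and combines them with a simple majority vote. It is a sketch under assumptions: scikit-learn's MLPClassifier stands in for the paper's Back-Propagation (BKP) networks, the parameter grid only mirrors the ranges quoted later in the experiments, and integer class labels are assumed.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_voting_ensemble(X, y, hidden=(50, 100, 150, 200), lrs=(0.01, 0.03, 0.05), epochs=(100, 300, 500)):
    """Train one MLP per parameter combination, all on the same training data."""
    nets = []
    for nh in hidden:
        for lr in lrs:
            for ep in epochs:
                nets.append(MLPClassifier(hidden_layer_sizes=(nh,), learning_rate_init=lr,
                                          max_iter=ep).fit(X, y))
    return nets

def majority_vote(nets, X):
    """Final decision: the class predicted by the largest number of individual networks."""
    preds = np.stack([net.predict(X) for net in nets]).astype(int)   # shape (n_nets, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```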

The second ensemble is constructed with two learning levels: the first level consists of a Back-Propagation neural network and an LVQ neural network, while the second level consists of a single Back-Propagation neural network (Fig. 2). In the first level, the two neural networks are trained on the same training set, and the training data is then examined to determine which patterns need more training; the second classification level analyzes only the patterns that are classified incorrectly by the first-level networks. The final decision is taken by first checking the two first-level networks: if they agree on a class, that class is the decision; if not, the second-level network adjudicates between them.

The third ensemble is designed by segmenting the training data into a number of pieces, each containing 2 to 3 words, with each word included in two pieces; a number of Back-Propagation neural networks are then used to train on and recognize each individual piece (Fig. 3). Testing is accomplished by passing the pattern to the Back-Propagation networks trained on the different pieces. For example, if BKP1 decides that a certain pattern belongs to class 2, this is verified by testing BKP2: if it also decides that the pattern belongs to class 2, class 2 is counted as a possible solution. This procedure is repeated for each BKP, and the final decision is made according to which path gives the highest likelihood.
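One plausible reading of this decision procedure is sketched below: every word class is covered by two of the overlapping sub-networks, a class is kept as a candidate only when both of its networks predict it, and the candidate with the highest combined score wins. The word grouping and the use of summed posterior estimates as the "likelihood" are assumptions for illustration, loosely following Fig. 3, not the exact setup of the paper.

```python
import numpy as np

# Hypothetical overlapping word groups (7 sub-networks over 10 word classes, each class covered twice),
# loosely following the labels visible in Fig. 3; the real grouping may differ.
GROUPS = [(1, 2, 3), (2, 3, 4), (4, 5, 6), (5, 6, 7), (7, 8, 9), (8, 9, 10), (10, 1)]

def modular_decision(nets, x):
    """nets[i] is the classifier trained on GROUPS[i]; x is a single feature vector."""
    candidates = {}
    for cls in range(1, 11):
        covering = [i for i, g in enumerate(GROUPS) if cls in g]
        # keep the class only if every network that covers it agrees on it
        if all(nets[i].predict([x])[0] == cls for i in covering):
            score = sum(nets[i].predict_proba([x])[0][list(nets[i].classes_).index(cls)]
                        for i in covering)
            candidates[cls] = score
    return max(candidates, key=candidates.get) if candidates else None
```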
Figure 2. The second Combined Classifier structure (block diagram: speech signal -> pre-processor -> first-level BKP and LVQ classifiers -> matching; agreeing classifications go directly to the final decision, non-agreeing classifications are passed to a second-level BKP classifier).

The fourth ensemble is based on the third ensemble, but instead of a single Back-Propagation neural network, a combination of three neural networks is used for each training piece.

Figure 3. The third Combined Classifier structure (block diagram: speech signal -> pre-processor -> BKP1 (words 1-2-3), BKP2 (words 2-3-4), BKP3 (words 4-5-6), ..., BKP7 (words 10-1) -> decision process -> final decision).

The fifth ensemble is like the third ensemble, but each neural network is responsible for training on and recognizing 2 words only instead of 3.

VI. BOOSTING NEURAL NETWORKS

Boosting is a general method for improving the performance of any learning algorithm. It is a method for finding a highly accurate classifier on the training set by combining weak hypotheses [11]. The most popular algorithm is AdaBoost, introduced by Freund and Schapire in 1995 [12], which has successfully solved many practical problems of earlier boosting approaches. AdaBoost was originally designed for binary problems; it can be extended to multi-class classification problems [12], and one such extension is AdaBoost.M1.

A. AdaBoost.M1

Boosting works by repeatedly running a given weak learning algorithm on various distributions over the training data, and then combining the classifiers produced by the weak learner into a single composite classifier.

Input: a sequence of N examples {(x_1, y_1), ..., (x_N, y_N)} with labels y_i ∈ Y = {1, ..., k}; the number of hidden neurons nh; the number of training epochs; an integer T specifying the number of iterations.
Initialize: the weight vector w_i^1 = 1/N, for i = 1, ..., N.
Do for t = 1, 2, ..., T:
  1. Normalize w^t.
  2. Call WeakLearn (an MLP with the specified nh and number of epochs), providing it with the weights w^t; get back a hypothesis h_t : X -> Y.
  3. Calculate the error of h_t: ε_t = Σ_{i=1}^{N} w_i^t e_i^t, where e_i^t = 1 if h_t(x_i) ≠ y_i and 0 otherwise. If ε_t > 1/2, set T = t - 1 and abort the loop.
  4. Set α_t = (1/2) log((1 - ε_t) / ε_t).
  5. Set the new weight vector to w_i^{t+1} = w_i^t exp(2 α_t e_i^t).
Output the final hypothesis: h_f(x) = arg max_{y ∈ Y} Σ_{t=1}^{T} α_t [h_t(x) = y].

Figure 4. Pseudo code of the AdaBoost.M1 algorithm.

Fig. 4 shows the AdaBoost.M1 learning algorithm [12]. The AdaBoost.M1 algorithm uses a set of training data {(x_1, y_1), ..., (x_N, y_N)}, where x_i is the i-th feature vector of the observed signal and y_i is its label from the set Y of possible labels. In addition, the boosting algorithm has access to another, unspecified learning algorithm called the weak learning algorithm, denoted WeakLearn; here, WeakLearn is a Back-Propagation neural network. The boosting algorithm calls WeakLearn repeatedly in a series of rounds. On round t, the booster provides WeakLearn with a weight distribution w^t over the training set. Next, AdaBoost.M1 sets a parameter αt; intuitively, αt measures the importance assigned to ht. Then, the weight vector w^t is updated, which leads to an increase of the weight


for the data misclassified by ht. Therefore, the weight tends to concentrate on "hard" examples. The final hypothesis hf is a weighted vote of the weak hypotheses: for a given instance x, hf outputs the label y that maximizes the sum of the weights αt of the weak hypotheses predicting that label. The weight of hypothesis ht is defined to be αt, so that greater weight is given to hypotheses with lower error.
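For concreteness, here is a minimal sketch of the AdaBoost.M1 loop of Fig. 4 with a back-propagation network as WeakLearn. It is an illustration, not the authors' code: scikit-learn's MLPClassifier stands in for the BKP network, and because this implementation does not accept per-example weights, the weight distribution w^t is applied by resampling the training set, which is one common way of realizing the "providing with the weight w^t" step.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def adaboost_m1(X, y, T=13, n_hidden=300, epochs=1000, seed=0):
    """AdaBoost.M1 (Fig. 4) with an MLP weak learner; weights are applied by resampling."""
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.full(N, 1.0 / N)                       # w_i^1 = 1/N
    hyps, alphas = [], []
    for t in range(T):
        w = w / w.sum()                           # 1. normalize w^t
        idx = rng.choice(N, size=N, p=w)          # 2. emulate the weight distribution by resampling
        h = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=epochs).fit(X[idx], y[idx])
        miss = (h.predict(X) != y).astype(float)  # e_i^t
        eps = float(np.dot(w, miss))              # 3. weighted training error
        if eps > 0.5:                             #    abort condition
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))   # 4. alpha_t
        w = w * np.exp(2.0 * alpha * miss)        # 5. increase the weight of misclassified data
        hyps.append(h)
        alphas.append(alpha)
    return hyps, alphas

def adaboost_m1_predict(hyps, alphas, X):
    """h_f(x) = argmax_y sum_t alpha_t [h_t(x) = y] (assumes every weak learner saw all classes)."""
    classes = hyps[0].classes_
    votes = np.zeros((len(X), len(classes)))
    for h, a in zip(hyps, alphas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += a * (pred == c)
    return classes[votes.argmax(axis=1)]
```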

B. Proposed Enhanced Boosting Neural Network

As noted, the AdaBoost.M1 algorithm does not use the full output of the base classifier in making the final decision; it depends only on the predicted label and the value of αt. In the proposed enhanced AdaBoost.M1 algorithm, the output of the base classifier should be the posterior probability, not just a hard class label. The posterior probability P(yi|x), the probability that the correct label y of x is yi, can be interpreted as the probability that the generated label is correct, given the object x. In this case, the final hypothesis is given by:

h_f(x) = arg max_{y ∈ Y} Σ_{t=1}^{T} α_t P(y | x)_t    (4)

where P(y|x)_t is the vector of posterior probabilities over the classes produced by the MLP at iteration t. The final hypothesis hf is again a weighted vote of the weak hypotheses: for a given instance x, hf outputs the label y that maximizes the sum, over the iterations t, of the posterior probability of each class multiplied by the weight αt of the corresponding weak hypothesis. Feed-forward neural networks such as Back-Propagation networks usually have outputs in [0, 1]; for well-trained and sufficiently large networks, these outputs approximate the posterior probabilities.
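A sketch of how the enhanced combination rule of Eq. (4) could be implemented, assuming each weak learner exposes class posterior estimates (as predict_proba does for the MLP used in the earlier sketch); the helper is illustrative rather than the authors' code and assumes every weak learner was trained on all classes.

```python
import numpy as np

def enhanced_adaboost_predict(hyps, alphas, X):
    """Eq. (4): h_f(x) = argmax_y sum_t alpha_t * P_t(y | x)."""
    score = None
    for h, a in zip(hyps, alphas):
        proba = h.predict_proba(X)            # P_t(y | x), shape (n_samples, n_classes)
        score = a * proba if score is None else score + a * proba
    return hyps[0].classes_[score.argmax(axis=1)]
```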

VII. EXPERIMENTAL RESULTS

The corpus was created from 10 Arabic words. Ten (male) speakers were asked to utter each word 6 times. The words were manually segmented to build the training and testing databases. The speech database consists of 600 utterances (10 speakers, 10 words, 6 repetitions), divided into 300 utterances (10 speakers, 10 words, 3 repetitions) for the training set and the same number for the testing set. The Arabic vocabulary consists of ten words whose English glosses are: "that", "Allah", "day", "say", "was", "Moses", "fear", "instead of them", "love" and "between you". The training set (TR) is used to train the classifiers. The testing set (TS) is a collection of utterances that were not used in the training phase. Each pattern is segmented into a fixed number of frames of variable length; here the number of frames is 20. Each frame is multiplied by a Hamming window. The extracted feature vector consists of 12 MFCC features and 12 delta MFCC features, giving 24 coefficients in total.

A. Experimental Results of the Proposed Combined Classifiers

In the first proposed ensemble, each neural network has a different number of hidden neurons (50-200), learning rate (0.01-0.05) and number of training epochs (100-500). After training the individual classifiers, the majority vote rule is applied to obtain the final decision for each input pattern. The total initial number of classifiers generated is 125. A smaller number of weak classifiers is then selected for the combination: the selection depends on the performance of each individual classifier on the training data; an initial set of classifiers is selected, its performance evaluated, and weak classifiers are added to the combination until there is no change for two extra additions or the performance deteriorates. The best performance, ranging from 87% to 91%, resulted from a combination of 27 Back-Propagation classifiers.

The second proposed ensemble has two different types of neural networks in the first level, a Back-Propagation network (150 hidden neurons, 0.02 learning rate) and an LVQ network (200 hidden neurons, 0.05 learning rate), trained on the same training data. In the second level there is a single Back-Propagation network (300 hidden neurons, 0.01 learning rate), trained on the patterns misclassified by the first level. The results are shown in Table I.

TABLE I
RESULTS OF ENSEMBLE-BASED METHODS

Combination Type             Ensemble Method        Accuracy (TR)   Accuracy (TS)
Ensemble-based approaches    The first ensemble     100%            94.3%
                             The second ensemble    94%             81%

The third proposed ensemble has 7 Back-Propagation neural networks, each with a single hidden layer of 200 neurons, trained for 500 epochs with a 0.01 learning rate; each is responsible for training on and recognizing 3 words (2 words for the last one). The fourth proposed ensemble has 7 groups, each containing 3 Back-Propagation neural networks that vary in the number of hidden neurons; the decision of each group is made by majority voting among the three networks. The fifth proposed ensemble has 10 Back-Propagation neural networks, each with two hidden layers (50 neurons in the first layer and 5 neurons in the second), trained for 500 epochs with a 0.01 learning rate. The results are shown in Table II.
TABLE II
RESULTS OF MODULAR ENSEMBLE METHODS

Combination Type      Ensemble Method        Accuracy (TR)   Accuracy (TS)
Modular approaches    The third ensemble     92%             77%
                      The fourth ensemble    97%             79%
                      The fifth ensemble     96.3%           79.3%

The ensemble-based classifiers resulted in better performance for both the training and testing sets compared to the modular classifiers.


B. AdaBoost.M1 Experimental Results

AdaBoost.M1 is a type of ensemble-based classifier that is based on varying the training data according to a weight distribution. In the next experiment, a set of classifiers based on Back-Propagation neural networks is combined using the AdaBoost.M1 algorithm. A Back-Propagation network with 300 hidden neurons, a 0.01 learning rate and 1000 epochs was used as the base classifier and examined with different numbers of iterations, as shown in Fig. 5.

Figure 5. Recognition rate of AdaBoost.M1 with Back-propagation.

An iteration number of 13 gives 94.6% correct classification as the best result for recognizing the 10 words, and 100% correct classification of the training set is obtained for any number of iterations.

C. Enhanced AdaBoost.M1 Experimental Results

The final hypothesis depends on the value of αt, which is inversely proportional to the classifier error. In the proposed enhanced AdaBoost.M1, the final hypothesis depends on the posterior probability output of each individual classifier, multiplied by αt as the weight of that classifier. Many classifiers produce posterior probability outputs directly, such as the Back-Propagation neural network, radial basis function (RBF) networks and the group method of data handling (GMDH), and many other classifiers can obtain posterior probabilities from their outputs [13]. Here, we use Back-Propagation as the base classifier with 300 hidden neurons and 1000 epochs. The results, compared with the original AdaBoost.M1, are shown in Fig. 6.

Figure 6. Comparing the results of AdaBoost.M1 and Enhanced AdaBoost.M1 with Back-propagation.

VIII. CONCLUSION

In this paper, we have proposed different combined classifier architectures based on neural networks, varying the initial weights, architecture, type and training data, to recognize Arabic isolated words. The ensemble-based approaches show a large improvement in recognition rate compared to the modular approaches, which have poor discriminatory capability. AdaBoost.M1 showed a significant performance increase on the testing set. It was further enhanced by changing the computation of the final decision to depend on two confidence measures for each weak classifier: the posterior probability of the classifier's output and the value of αt. This modification improved performance relative to AdaBoost.M1 by about 2%. In future work, we will apply the proposed enhanced AdaBoost.M1 to different types of weak classifiers.

REFERENCES

[1] A. J. C. Sharkey, "On Combining Artificial Neural Nets," Connection Science, vol. 8, no. 3/4, pp. 299-314, 1996.
[2] K. Samudravijaya, "Speech and Speaker Recognition: A Tutorial," Proceedings of the International Workshop on Technology Development in Indian Languages, Kolkata, pp. 3-18, 2003.
[3] K. Kirchhoff, "Novel Speech Recognition Models for Arabic," Johns Hopkins University Summer Research Workshop, Final Report, 2002.
[4] H. Satori, M. Harti, and N. Chenfour, "Introduction to Arabic Speech Recognition Using CMUSphinx System," International Symposium on Computational Intelligence and Intelligent Informatics (ISCIII '07), pp. 31-35, 2007.
[5] M. Awadalla, F. E. Z. Abou Chadi, and H. H. Soliman, "Development of an Arabic speech database," Information and Communications Technology: Enabling Technologies for the New Knowledge Society, ITI 3rd International Conference, pp. 89-100, 2005.
[6] C. Becchetti and L. Prina Ricotti, Speech Recognition: Theory and C++ Implementation, Chichester: John Wiley & Sons, 1999.
[7] S. Furui, Digital Speech Processing, Synthesis and Recognition, 2nd ed., CRC, 2000.
[8] F. Zheng, G. Zhang and Z. Song, "Comparison of Different Implementations of MFCC," J. Computer Science & Technology, vol. 16, no. 6, pp. 582-589, 2001.
[9] C. G. Le, "Application of a Back-propagation Neural Network to Isolated-Word Speech Recognition," Master's thesis, Naval Postgraduate School, Monterey, CA, 1993.
[10] L. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms and Applications, Prentice Hall, 1994.
[11] H. Schwenk and Y. Bengio, "Boosting Neural Networks," Neural Computation, vol. 12, no. 8, pp. 1869-1887, August 2000.
[12] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, August 1997.
[13] R. P. W. Duin and D. M. J. Tax, "Experiments with Classifier Combining Rules," Lecture Notes in Computer Science, vol. 1857, Springer, Berlin, pp. 16-29, 2000.
