
A Comparative Study of Arabic Speech Recognition

Prof. Onsy Abdel Alim Ali, Dr. Mohamed M. Moselhy, Eng. Aya Bzeih
Electrical Engineering Department, Faculty of Engineering, Beirut Arab University, Beirut, Lebanon
Onsy2066@hotmail.com

Abstract—Speech recognition is a computer technology that enables a device to recognize and understand spoken words by using signal processing techniques to extract features from spoken Arabic speech. A comparative study is performed in two different ways. The first comprises two tests: an objective test, in which the computer has to recognize the recorded data, and a subjective test, in which 15 persons judged the recognition of the tested materials. The second comparison is performed across different transmission media: acoustical speech (direct), telephone (PSTN, PBX), wireless (cellular), and Internet (VoIP). The results show that the objective recognition rates for all the Arabic words in the different testing materials transmitted over a medium are lower than those for direct speech, and are lowest when the recognizer (a neural network) is trained with data from the direct transmission medium. The subjective recognition rates are higher than the objective ones.

I. INTRODUCTION
Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Rudimentary speech recognition software has a limited vocabulary of words and phrases and may only identify them if they are spoken very clearly. Speech recognition is performed by digitizing the sound and matching its pattern against stored patterns. Currently available devices are largely speaker-dependent (they recognize the speech of only one or two persons) and recognize discrete speech (speech with pauses between words) better than normal (continuous) speech. Their major applications are in assistive technology, helping people work around their disabilities [1]. The Arabic language has a standard pronunciation, which is basically the one used to recite the Quran. The same pronunciation is used in newscasts, discourses, and formal situations of all types. As in other widely used languages, dialects of Arabic pronounce some letters in a different manner. This paper concentrates on the dialectal style of pronunciation. Most languages, including Arabic, can be decomposed into a set of distinctive sounds, or phonemes. The Arabic language consists of 42 phonemes divided into four major classes: vowels (V), semivowels, diphthongs, and consonants (C) [2].

II. CHOOSING THE ARABIC SPEECH MATERIALS
The choice of the Arabic testing materials is based on the following principles and on the International Phonetic Alphabet (IPA) for Arabic speech [2]. Six groups are tested for long (VV) and short (V) vowels. One of the main Arabic vowels is placed in the middle of each word; at the beginning is one consonant from a group of Arabic consonants, and at the end of the word a specific consonant is selected. If the vowel is long (VV), the syllable structure is /CVVC/; if the vowel is short (V), the syllable structure is /CVC/.

III. SIGNAL PROCESSING AND FEATURE EXTRACTION
The first step in constructing an Automatic Speech Recognition (ASR) system is to digitally sample the data so that it can be processed by a computer. A 10-12 kHz sampling rate is satisfactory because it is high enough to capture the first five formants for most talkers, though it may not capture all the unvoiced energy. Once the speech signal has been digitized, the discrete-time representation is usually analyzed within short-time intervals. Depending upon the application, an analysis window of 5-25 ms is selected, within which the speech signal is assumed to be time invariant, or quasi-stationary. Different feature extraction methods can be used for speech recognition, such as the Short-Time Fourier Transform (STFT), the short-time energy function, the zero-crossing rate, endpoint detection, Linear Prediction Coding (LPC), and Mel-Frequency Cepstral Coefficients (MFCC). The method used in this paper is Mel-Frequency Cepstral Coefficients.

A. Cepstrum Calculation [1], [3]
The cepstrum is defined as the inverse Fourier transform of the log-magnitude spectrum, which is a real, even function; hence its inverse Fourier transform is also real and even. Under this condition the sine part of the inverse Fourier transform can be discarded, keeping only the cosine part:

c(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, e^{j\omega m}\, d\omega \qquad (1)

c(m) = \frac{1}{\pi} \int_{0}^{\pi} \log S(\omega) \cos(\omega m)\, d\omega \qquad (2)

where S(\omega) is the spectral density of the speech signal and c(m) are the cepstrum coefficients. c(m) decays rapidly as a function of m; in practice, c(m) is close to zero for large m, and to capture the important information in the speech signal at least m = 12 coefficients are kept. The cepstral coefficients c(0), c(1), and c(2) represent the average energy, the spectral tilt, and the degree to which the spectral energy is clustered around \omega = \pi/2, respectively.
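As an illustration, a minimal numpy sketch of this cepstrum computation might look as follows. The 25 ms frame length at a 10 kHz sampling rate and the Hamming window are assumptions consistent with the analysis-window discussion above, not specifics stated by the paper.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Real cepstrum of one windowed speech frame: inverse transform
    of the log spectrum, per equations (1)-(2)."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_spec = np.log(np.abs(spectrum) ** 2 + 1e-12)  # log S(w), guarded against log(0)
    # log S(w) is real and even, so the inverse transform reduces to
    # its cosine part and yields real, even coefficients c(m).
    c = np.fft.irfft(log_spec)
    return c[:n_coeffs]  # keep c(0)..c(12); higher terms decay rapidly

# Example: one 25 ms frame at 10 kHz sampling (250 samples)
frame = np.random.randn(250)
print(real_cepstrum(frame))
```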



B. Mel-Frequency Cepstral Coefficients (MFCC) [3]
The Mel-Frequency Cepstral Coefficients are defined as the inverse Fourier transform of the log mel-frequency spectrum. The mel-frequency spectrum is obtained by multiplying the frequency power spectrum of the signal by the critical-band filter spectrum. The desired M cepstral coefficients are given by:
c_n = \sum_{k=1}^{20} X_k \cos\!\left( n \left( k - \tfrac{1}{2} \right) \frac{\pi}{20} \right), \quad n = 1, 2, \ldots, M \qquad (3)

where X_k are the log-energy outputs of a predefined set of 20 bandpass filters. The coefficient c_0 represents the average energy in the speech frame and is discarded in some systems as a form of amplitude normalization; c_1 reflects the energy balance between low and high frequencies, with higher values indicating a sonorant and lower values suggesting frication. Higher cepstral coefficients reflect increasing spectral detail.
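The following sketch evaluates equation (3) directly, assuming the 20 log filterbank energies X_k have already been computed; the mel filterbank itself is omitted, and the function name and the choice of M = 12 are illustrative.

```python
import numpy as np

def mfcc_from_filterbank(log_energies, n_coeffs=12):
    """Cepstral coefficients per equation (3): a discrete cosine
    transform of the 20 log mel filterbank energies X_k."""
    K = len(log_energies)            # K = 20 bandpass filters
    n = np.arange(1, n_coeffs + 1)   # n = 1..M (c0 handled separately)
    k = np.arange(1, K + 1)
    basis = np.cos(np.outer(n, k - 0.5) * np.pi / K)  # cos(n (k - 1/2) pi / K)
    return basis @ log_energies

# Example with 20 simulated log filter energies
X = np.log(np.random.rand(20) + 1.0)
print(mfcc_from_filterbank(X))
```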

IV. TRANSMISSION MEDIA
Speech can be transmitted through different transmission media. Recognition of Arabic dialectal speech in mobile communication services has been studied [4]. In this paper the comparison is performed for different transmission media: acoustical speech (direct), telephone (PSTN, PBX), Internet (VoIP), and wireless (cellular).

A. The Public Switched Telephone Network (PSTN) [5]
The PSTN is the worldwide collection of interconnected public telephone networks that were designed primarily for voice traffic. The PSTN is a circuit-switched network in which a dedicated circuit (channel) is established for the duration of the transmission. All PSTN calls run across 64 kbps switched connections (one DS0 of bandwidth). The DS0 data rate allows a 4 kHz frequency range to be delivered.

B. Private Branch Exchange (PBX) [5]
A PBX is a telephone system within an enterprise that switches calls between enterprise users on local lines while allowing all users to share a certain number of external phone lines. The main purpose of a PBX is to save the cost of requiring a line for each user to the telephone company's central office. Private branch exchanges originally used analog technology; today, PBXs use digital technology.

C. Voice over IP (VoIP) [6]
Voice over IP, or Internet telephony, is a real-time interactive audio application. The idea is to use the Internet as a telephone network with some additional capabilities. Instead of communicating over a circuit-switched network, this application allows communication between two parties over the packet-switched Internet. Two protocols have been designed to handle this type of communication: SIP and H.323.

D. Cellular Telephony [7]
Cellular telephony is designed to provide communication between two moving units, called mobile stations (MSs), or between one mobile unit and one stationary unit, often called a land unit. A service provider must be able to locate and track a caller, assign a channel to the call, and transfer the channel from base station to base station as the caller moves out of range. To make this tracking possible, each cellular service area is divided into small regions called cells. Each cell contains an antenna and is controlled by a small office, called the base station (BS). Each base station, in turn, is controlled by a switching office, called a mobile switching center (MSC). The MSC coordinates communication between all the base stations and the telephone central office. The Global System for Mobile Communication (GSM) is a European standard that was developed to provide a common second-generation technology for all of Europe. GSM uses two bands for duplex communication. Each band is 25 MHz in width, shifted toward 900 MHz, and is divided into 124 channels of 200 kHz separated by guard bands.

V. ARTIFICIAL NEURAL NETWORKS (ANN) [8]
Neural networks are pattern-matching devices whose processing architectures are based on the neural structure of the human brain. A neural network consists of simple interconnected processing units (neurons). The strengths of the interconnections between units are variable and are known as weights. Many architectures or configurations are possible, though a popular structure is the multilayer perceptron (MLP), in which the processing units are arranged in layers consisting of an input layer (X), a number of hidden layers (Z), and an output layer (Y). Weighted interconnections connect each unit in a given layer to every unit in an adjacent layer. The network is said to be feedforward in that there are no interconnections between units within a layer and no connections from outer layers back towards the input. The output of each processing unit, Y_j, is some nonlinear function of the weighted sum of the outputs from the previous layer, that is:


Y_j = f\!\left( \sum_{i=0}^{N-1} W_{ji} X_i + a_j \right), \quad j = 0, 1, \ldots, N-1 \qquad (4)

where a_j is a bias value added to the sum of the inputs X_i weighted by the weights W_{ji}, and f represents some nonlinear function, such as the hyperbolic tangent. Thus, by adjusting the weights W_{ji}, the MLP may be used to represent complex nonlinear mappings between a pattern vector presented to the input units and classification patterns appearing on the output units. In pattern-matching applications, the network is trained by presenting a pattern vector at the input layer and computing the outputs. The output is then compared with some desired output, a set of output-unit values that identifies the input pattern. The error between the actual output and the desired output is computed and back-propagated through the network to each unit. The input weights of each unit are then adjusted to minimize this error. This process is repeated until the actual output matches the desired output to within some predefined error limit. Pattern recognition then involves presenting the unknown pattern to the input nodes of the trained network and computing the values of the output nodes, which identify the pattern. Training and adapting MLP features for Arabic speech recognition has been studied [9].
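A minimal sketch of the forward pass in equation (4) for a one-hidden-layer MLP might look as follows; the layer sizes are illustrative (12 cepstral inputs, 8 hidden units, 5 output classes), and tanh stands in for the nonlinearity f.

```python
import numpy as np

def mlp_forward(x, W_hidden, a_hidden, W_out, a_out):
    """Forward pass of a one-hidden-layer MLP, applying equation (4)
    at each layer: Y = f(W X + a), with f = tanh."""
    z = np.tanh(W_hidden @ x + a_hidden)   # hidden layer Z
    y = np.tanh(W_out @ z + a_out)         # output layer Y
    return y

# Example: 12 cepstral inputs, 8 hidden units, 5 output classes (one per word)
rng = np.random.default_rng(0)
x = rng.standard_normal(12)
W1, a1 = rng.uniform(-0.5, 0.5, (8, 12)), rng.uniform(-0.5, 0.5, 8)
W2, a2 = rng.uniform(-0.5, 0.5, (5, 8)), rng.uniform(-0.5, 0.5, 5)
print(mlp_forward(x, W1, a1, W2, a2))
```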
A. Algorithm
During training, each output unit compares its computed activation y_k with its target value t_k to determine the associated error for that pattern with that unit. Based on this error, the factor \delta_k (k = 1, ..., n) is computed, which is used to distribute the error at output unit y_k back to all units in the previous layer (the hidden units connected to y_k) and to update the weights between the output and the hidden layer. In a similar manner, the factor \delta_j (j = 1, ..., p) is computed for each hidden unit Z_j. It is not necessary to propagate the error back to the input layer, but \delta_j is used to update the weights between the hidden layer and the input layer. After all of the \delta factors have been determined, the weights for all layers are adjusted simultaneously. The adjustment to the weight w_{jk} (from hidden unit Z_j to output unit y_k) is based on the factor \delta_k and the activation z_j of the hidden unit Z_j. The adjustment to the weight v_{ij} (from input unit X_i to hidden unit Z_j) is based on the factor \delta_j and the activation x_i of the input unit.
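A hedged sketch of one such training step is given below, assuming a tanh activation and a squared-error criterion (the paper does not state either); biases are omitted for brevity, and the function name and learning rate are illustrative.

```python
import numpy as np

def backprop_step(x, t, V, W, lr=0.1):
    """One training step for a single hidden layer, following the text:
    compute delta_k at the outputs, propagate back to get delta_j,
    then adjust both weight matrices simultaneously."""
    z = np.tanh(V @ x)                          # hidden activations z_j
    y = np.tanh(W @ z)                          # output activations y_k
    delta_k = (t - y) * (1.0 - y ** 2)          # output error factor (tanh derivative)
    delta_j = (W.T @ delta_k) * (1.0 - z ** 2)  # hidden error factor
    W += lr * np.outer(delta_k, z)              # update hidden-to-output weights w_jk
    V += lr * np.outer(delta_j, x)              # update input-to-hidden weights v_ij
    return y
```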
B. Choices of initial weights

The choice of initial weights will influence whether the network reaches a global (or only a local) minimum of the error and, if so, how quickly it converges. The update of the weight between two units depends on both the derivative of the upper unit's activation function and the activation of the lower unit. For this reason, it is important to avoid choices of initial weights that would make it likely that either activations or derivatives of activations are zero. The initial weights must not be too large, or the initial input signals to each hidden or output unit will be likely to fall in the region where the derivative of the sigmoid function has a very small value (the so-called saturation region). On the other hand, if the initial weights are too small, the network input to a hidden or output unit will be close to zero, which also causes extremely slow learning. A common procedure is to initialize the weights (and biases) to random values between -0.5 and 0.5. The values may be positive or negative because the final weights after training may be of either sign as well.
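A short sketch of this initialization procedure; the uniform range is the one recommended above, while the function itself is illustrative.

```python
import numpy as np

def init_weights(n_in, n_out, seed=None):
    """Initialize weights and biases uniformly in [-0.5, 0.5], keeping
    unit inputs out of the sigmoid saturation region."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, (n_out, n_in))   # weight matrix
    a = rng.uniform(-0.5, 0.5, n_out)           # bias vector
    return W, a
```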
VI. RESULTS
The data used in the analysis were obtained from six groups of isolated Arabic words for a male speaker, shown in Table 1. Every word is uttered ten times, and each group of words is recorded through the different transmission media mentioned above. The word groups G1, G3, and G5 are used for long vowels and G2, G4, and G6 for short vowels.

Table 1: Used word groups, five words each, for long (G1, G3, G5) and short (G2, G4, G6) vowels.

A sample of the results obtained in this study [10] is presented in Table 2 through Table 7 for the G1 and G2 word groups.

A. Objective Test

I) Training of ANN with the same media data: The recognition rates of G1 and G2 via the different transmission media are summarized in Table 2 and Table 3. The term To stands for an office telephone (PBX), Tc for a cabinet telephone (PSTN), M for mobile (wireless), and Skype for Internet (VoIP).

Table 2: Recognition rates of G1 (long) via different media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
Direct   83.3%    83.3%    66.7%    83.3%    66.7%
To-To    100%     100%     50%      66.7%    83.3%
Tc-To    66.7%    50%      50%      66.7%    66.7%
M-To     83.3%    66.7%    50%      50%      83.3%
Skype    83.3%    66.7%    33.3%    50%      83.3%



Table 3: Recognition rates of G2 (short) via different media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
Direct   83.3%    83.3%    50%      83.3%    83.3%
To-To    66.7%    50%      50%      50%      33.3%
Tc-To    33.3%    33.3%    33.3%    50%      16.6%
M-To     66.7%    0.0%     50%      50%      16.6%
Skype    33.3%    16.6%    33.3%    50%      66.7%


II) Training of ANN with direct data: In this section the ANN is trained using the direct-medium data for all the groups, while the tested data are the words transmitted through the different media. Four utterances of each word in a group are used for training and six for testing. Table 4 and Table 5 give the recognition rates of G1 and G2.

Table 4: Recognition rates of G1 (long), ANN trained on direct data and tested on the other media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
To-To    0.0%     33.3%    16.6%    33.3%    16.6%
Tc-To    66.7%    50%      83.3%    66.7%    33.3%
M-To     16.6%    66.7%    16.6%    66.7%    0.0%
Skype    16.6%    0.0%     50%      16.6%    0.0%

Table 5: Recognition rates of G2 (short), ANN trained on direct data and tested on the other media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
To-To    0.0%     50%      0.0%     0.0%     0.0%
Tc-To    83.3%    0.0%     0.0%     0.0%     33.3%
M-To     16.6%    33.3%    50%      33.3%    16.6%
Skype    0.0%     0.0%     33.3%    16.6%    16.6%
B. Subjective Test

The motivation of this test is to study the effect of the transmission media on human recognition of isolated Arabic words; its results are compared with those of the objective test. The subjective test was carried out in the anechoic chamber of the EE department, Faculty of Engineering, BAU (Debbieh Campus). Fifteen persons judged the recognition of the tested materials. The words for each medium are recorded in a different sequence, and dummy words are stuffed into the data. In the direct test, every four persons sat together in the acoustic lab in the anechoic chamber, heard the recorded words from the analog recorder, and wrote down the words they heard. In the other media tests, every person heard the recorded data individually through earphones and wrote down the words he or she heard.

Table 6: Recognition rates of G1 (long) in the subjective test via different media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
Direct   100%     65%      100%     75%      30%
To-To    78.6%    78.6%    64.3%    92.5%    50%
Tc-To    66.7%    66.7%    53.3%    73.3%    40%
M-To     86.7%    80%      20%      86.7%    66.7%
Skype    100%     76.5%    94.1%    100%     70.6%

Table 7: Recognition rates of G2 (short) in the subjective test via different media
Media    Word 1   Word 2   Word 3   Word 4   Word 5
Direct   95%      100%     100%     100%
To-To    71.4%    85.7%    85.7%    78.6%    85.7%
Tc-To    66.7%    100%     20%      33.3%
M-To     93.3%    73.3%    26.7%    86.7%    86.7%
Skype    100%     94.1%    82.4%    94.1%    100%

VII. CONCLUSION
When the ANN is trained with data from the same medium, the objective recognition rates for all the Arabic words in the different groups transmitted over the Tc-To, M-To, and Skype media are lower than those for the direct and To-To transmission media. The objective recognition rates were lowest when the ANN was trained with the direct transmission medium. Comparing the objective and subjective tests, the subjective recognition rates were higher. Generally, it may be concluded that the words with long vowels in all groups have higher scores than those with short vowels. In particular, the words with the long vowel /a/ possess somewhat higher scores than those with the short vowel, and the recognition rate for words with the long vowel /a/ is higher than for those with the vowels /u/ and /i/.

REFERENCES
[1] D. O'Shaughnessy, Speech Communications: Human and Machine, 2nd ed., Wiley-IEEE Press, New York, 1999.
[2] H. M. A. Shehata, "Voice Message Priorities Using Neuro-Fuzzy Mood Identifier," M.Sc. thesis, Alexandria, Egypt, 2006.
[3] F. J. Owens, Signal Processing of Speech (Macmillan New Electronics), Palgrave Macmillan, 1993.
[4] Q. Zhou and I. Zitouni, "Arabic Dialectical Speech Recognition in Mobile Communication Services," in Speech Recognition and Applications, p. 550, I-Tech, Vienna, Austria, Nov. 2008.
[5] B. A. Forouzan, Data Communications and Networking, 3rd ed., McGraw-Hill, 2003.
[6] D. Minoli, Voice Over IPv6: Architectures for Next Generation VoIP Networks, Newnes, New York, 2006.
[7] S. Haykin, Communication Systems, 4th ed., Wiley, 2004.
[8] L. Fausett, Fundamentals of Neural Networks: Architectures, Algorithms and Applications, Prentice Hall, 1993.
[9] J. Park, F. Diehl, M. J. F. Gales, M. Tomalin, and P. C. Woodland, "Training and Adapting MLP Features for Arabic Speech Recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2009), pp. 4461-4464.
[10] A. Bzeih, "A Comparative Study for Arabic Speech Recognition," M.Sc. thesis, Beirut, Lebanon, 2011.


