
FILIPINO SPEECH PHONEME CLASSIFICATION USING A REDUCED FEATURE SET FOR MULTI-LAYER PERCEPTRON AND SUPPORT VECTOR MACHINE CLASSIFIERS

Joel P. ILAO, Michael L. ABUNDO, Prospero C. NAVAL Jr. and Rowena Cristina L. GUEVARA

University of the Philippines, Diliman, Philippines

Abstract: This paper describes the design and development of a phoneme classifier for Filipino speech. Noise-free speech samples were taken from a subset of the Filipino Speech Corpus (FSC) [3]. Speech features based on the first nine Mel Frequency Cepstral Coefficients (MFCCs) and their first and second temporal derivatives are computed from the windowed speech samples, resulting in a 27-dimension feature set. This feature set was reduced to 10 dimensions using Fisher's Linear Discriminant Analysis. Multilayer Perceptron (MLP) classification accuracies for both the original and reduced feature sets were then compared. It was noted that across the 140-node, 500-node and 760-node hidden-layer MLP architectures, classification performance slightly degrades with an increasing number of hidden-layer nodes. An SVM-based classifier with a polynomial kernel, employing the Sequential Minimal Optimization (SMO) method, was also implemented, and was shown to be inferior in performance to the MLP-based classifiers. The reduced 10-dimension feature set resulted in faster classification, at the expense of slightly lower classification performance for all the investigated classifiers.

Key words: Phone Classification, Multi-Layer Perceptron, Support Vector Machines


1 INTRODUCTION
Automatic Speech Recognition (ASR) systems have started to pervade mainstream applications in different areas of human endeavor: from automated dictation systems used in desktop and mobile technologies, to transcription services used in the medical and call center industries and in legal proceedings, to recognition of broadcast news, and as part of assistive technologies for people with physical impairments and disabilities [1],[3]. There is an abundance of possible applications of ASR because speech is the most natural form of human expression. However important speech recognition systems can be, one issue that needs to be understood and addressed is the complexity of spoken language itself: the number of possible words contained in the language set (vocabulary), the different ways by which ideas are expressed and delivered (grammar and style), the contexts in which words are used (semantics), and the fact that among the world languages there is a significant degree of variability in terms of the language elements just described. Add to this the fact that individual native speakers also vary to an extent in the manner by which they speak (e.g., voice qualities and pitch, accents, and sentence style preferences).

In the area of Acoustical Phonetics, a speech utterance is seen as a time-sequenced concatenation of the most basic utterance types, called phonemes [4],[5]. In a typical ASR system, a fundamental task, then, is to first determine the phonemes comprising any given speech waveform taken from a human speaker using a microphone. Then, based on a language model, increasing levels of interpretation can be made, from phoneme sequences to words, then sentences, and lastly, paragraphs.

Various works focusing on the major languages (e.g., English, Spanish, Mandarin) [1],[3] have produced quite mature research on ASR systems, some even with commercially-oriented applications; however, this maturity level still does not apply to the other world languages, including the Filipino spoken language. Modest efforts on automated speech recognition of Filipino spoken words have been made in the Digital Signal Processing (DSP) Laboratory of the University of the Philippines - Diliman, among which is a Phoneme Recognizer based on a Multi-layer Perceptron (MLP). A previous local study identified 46 unique phonemes for spoken Filipino, and an MLP-based phoneme classifier was designed, achieving, at best, a 71.5% classification rate using 27 acoustical features extracted from a database of manually transcribed recordings of the Filipino Speech Corpus (FSC) [3]. No previous effort has been made to systematically reduce the dimensions of the corresponding feature set of the FSC for phoneme recognition; in light of this, and of the fact that Support Vector Machines (SVMs) have been gaining prominence in the area of automated classification using multi-dimensional feature sets, this project aims to find a way of effectively reducing the original 27-dimension feature set, and to design a corresponding SVM-based classifier that can be compared against the more popular MLP-based classifier.

1.1 The Filipino Speech Corpus (FSC) and the DSP46 Phoneme Set
Most languages have only around 20 to 40 phonemes, with a few reaching up to 100 phonemes [5]. World languages share common phonemes (e.g., the cardinal vowels /i/, /a/, and /u/ and the consonants /p/, /t/ and /k/).
Various speech corpora for specific language domains were built for benchmarking purposes, each with corresponding transcription notations [3],[4],[5]. For the English language, the two dominant transcription notations used are (1) the International Phonetic Alphabet and (2) the ARPABet [4]. The notational standards used by various speech corpora (e.g., the Texas Instruments - Massachusetts Institute of Technology or TIMIT corpus, the Switchboard Transcription Project, and the Buckeye corpus) have enabled researchers to compare the performance of speech recognition algorithms applied on these common data sets.

The aforementioned speech corpora are based on the English language, and the diversity of acoustical characteristics across spoken languages has underscored the need to build speech corpora for languages other than English. For the Filipino spoken language, the DSP Laboratory of the University of the Philippines – Diliman [3] has developed a speech database of recordings of read and spontaneous Filipino speech and their corresponding transcriptions. The Filipino Speech Corpus (FSC), as the database is called, contains recordings from 100 speakers, and runs for a total of 100 hours. The speech samples were recorded at a 44.1 kHz sampling rate and were downsampled to 16 kHz. All samples have 16-bit floating point representations. The speakers for the read speech used the same set of words and syllables. The 'wav' files are named following a notation that conveniently indicates the gender, age range, and speech content category of the corresponding speech file.

The FSC project also identified 46 phonemes (named the DSP46 phoneme set) for the Filipino spoken language. The actual phonemes and their corresponding voiced/unvoiced classifications are listed in Table 1.

Table 1 The DSP46 Phoneme Set

Phone No.   Phone Type      Phone No.   Phone Type
    0       B                   23      I
    1       D                   24      O
    2       G                   25      U
    3       K                   26      Ag
    4       P                   27      At
    5       T                   28      Ak
    6       J                   29      Ad
    7       Ts                  30      og
    8       F                   31      Hu
    9       S                   32      Aw
   10       Sh                  33      Ay
   11       V                   34      Oy
   12       Z                   35      Iw
   13       M                   36      El
   14       N                   37      Em
   15       Ng                  38      En
   16       Q                   39      Ha
   17       L                   40      He
   18       R                   41      Hi
   19       W                   42      Ho
   20       Y                   43      Pau
   21       A                   44      Epi
   22       E                   45      Ow

1.2 Phoneme Recognizers and ASR systems


Phoneme classification from speech waveforms serves as a foundation for ASR systems, and various recognizers have been implemented and customized for some of the major world languages. For example, phoneme recognizers for the English language using the TIMIT and NTIMIT speech corpora have achieved 77% and 67.4% recognition rates, respectively, using 39 phoneme types and a binary-pair partitioned Neural Network classifier with 500 hidden nodes [10]. For the Spanish language, the HMM-based JANUS ASR system [11] has achieved a 27% word error rate (WER) for non-simultaneous conversations, and a 32% WER for simultaneous conversations. As for the Mandarin language, a 73.1% classification rate was achieved on the Mandarin Broadcast News Corpus, using a 2000-hidden-node MLP [3]. For the systems just mentioned, neural networks are the preferred architectures for speech recognition, as they are flexible enough to adapt to data sets that form complex decision regions, which is inherent in speech data. It can also be seen that the current state of research falls within the range of 70% classification accuracy. This value will serve as a benchmark in determining the performance of the phoneme recognizers developed for this project.

1.3 The Mel Frequency Cepstral Coefficients
Mel Frequency Cepstral Analysis is used to describe the short-term spectral envelope of a speech signal, and is considered one of the most accurate methods of representing a signal for speech recognition. This method works by transforming the frequency components of a particular speech signal to conform with a model of how the human ear receives and perceives spoken language. In Mel Frequency Cepstral Analysis, the human auditory system is modeled as a bank of constant-Q, nonuniformly-spaced filters, which means that signals in the frequency domain need to be warped onto the Mel frequency scale in order to make it possible to conveniently subject them to linear methods of analysis. This concept originated with Davis and Mermelstein (D&M) in 1980, who also proposed combining the frequency-warped outputs with the Discrete Cosine Transform (DCT); the results are called the Mel Frequency Cepstral Coefficients (MFCCs), now popularly used in ASR systems [6].
To summarize, the following are the main steps in calculating MFCCs (sketched in code below):
a) Segmenting the digitized speech signal through a Hamming window.
b) Converting to the frequency domain by the Fast Fourier Transform (FFT).
c) Passing the spectrum through the Mel frequency filter bank.
d) Calculating the log-magnitude spectrum of the filter bank output.
e) Calculating the cosine transform of the filter bank output.
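As a concrete illustration, the sketch below reproduces steps (a) to (e) and the 27-dimension feature vector described in Section 2.1 in Python, with librosa's MFCC implementation as an assumed stand-in for the authors' own code. The 16 kHz rate, 10-ms Hamming windows with 5-ms skips, and nine coefficients come from the paper; librosa's default Mel filter-bank settings are an assumption.

```python
import numpy as np
import librosa  # assumed stand-in; the paper does not name its toolkit

def extract_features(wav_path):
    # Load the recording at the FSC's 16 kHz rate
    signal, sr = librosa.load(wav_path, sr=16000)
    # Steps (a)-(e): 10-ms Hamming windows with 5-ms skips,
    # FFT, Mel filter bank, log magnitude, and DCT
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=9,
        n_fft=int(0.010 * sr), hop_length=int(0.005 * sr),
        window="hamming")
    # First and second temporal derivatives (deltas and delta-deltas)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # Stack into one 27-dimension feature vector per frame
    return np.vstack([mfcc, delta, delta2]).T  # shape: (frames, 27)
```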
1.4 Fisher's Linear Discriminant Analysis for Dimensionality Reduction
Principal Components Analysis (PCA) is a statistical technique commonly used to find patterns in high-dimensional data, and would naturally be the first option considered for dimensionality reduction. However, since the training data has already been labeled, it is better to exploit the labelling information to determine the most appropriate transformation to a lower-dimensional feature space, one that guarantees maximum separability among clusters of different classes and cohesiveness among feature vectors classified alike. The criteria just described are well addressed by another statistical method for feature reduction, Fisher's Linear Discriminant Analysis (LDA) [2].

Simply put, Fisher's LDA chooses a reduced feature set that maximizes the between-class scatter, while minimizing the within-class scatter, of a labelled feature set. The between-class scatter may be defined as the average sum of squared differences (SSD) between the mean feature vector of a particular class type i and the overall mean, whereas the within-class scatter can be seen as the total variance among feature vectors of the same classification type.
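For reference, these two scatter criteria can be written explicitly in the standard form of [2]; the following is textbook material rather than the paper's own notation. With C classes, $\mu_i$ and $N_i$ the mean and frame count of class $i$, and $\mu$ the overall mean:

$$S_W = \sum_{i=1}^{C} \sum_{x \in \mathcal{D}_i} (x - \mu_i)(x - \mu_i)^{T}, \qquad S_B = \sum_{i=1}^{C} N_i \,(\mu_i - \mu)(\mu_i - \mu)^{T}$$

Fisher's LDA then seeks the projection matrix $W$ that maximizes $J(W) = \lvert W^{T} S_B W \rvert \,/\, \lvert W^{T} S_W W \rvert$; its columns are the leading eigenvectors of $S_W^{-1} S_B$, and the eigendecomposition referred to in Section 2.2 is exactly this one.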
1.5 Support Vector Machine (SVM) Classification
An SVM is a method used to estimate decision surfaces separating two classes of data. SVMs have been used for numerous classification applications, and phoneme classification is no exception. Although Neural Networks (NNs) are still the state of the art, SVMs are now becoming a subject of study and research. The reader is kindly referred to [8] for discussions of the SVM concept.
2 MATERIALS AND METHODS
2.1 Preparing the Training and Test Data Sets
Twenty-five (25) files comprising a hand-labeled subset of the FSC are used for training and testing of the MLP- and SVM-based classifiers. These files consist of over 6.6 hours of recorded speech from 13 speakers, of whom 7 are male and 6 are female. An 80% training set / 20% testing set partitioning was used. Two classifiers are then investigated: (1) an MLP-based classifier, and (2) an SVM-based classifier.

For feature extraction, the recordings were segmented into 10-ms Hamming-windowed frames, with 5-ms skips. Twenty-seven (27) features were extracted for each frame: the first 9 MFCCs, their first derivative values (called the deltas), and the corresponding second derivative values (referred to as the delta-deltas). The pre-processing stage originally resulted in 2,376,712 frames. This feature set was down-sampled by a factor of 180, resulting in 13,009 frames.
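A small sketch of this preparation is given below, under stated assumptions: the paper does not say how the 13,009-frame subsample was drawn, so uniform striding stands in for it, and scikit-learn's splitter stands in for the 80/20 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_sets(features, labels, factor=180, train_frac=0.80, seed=0):
    # Thin the frame set by the stated factor of 180; uniform
    # striding is our assumption, not the authors' documented method
    X, y = features[::factor], labels[::factor]
    # 80% training / 20% testing partition, as in Section 2.1
    return train_test_split(X, y, train_size=train_frac,
                            random_state=seed)
```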
2.2 Feature Reduction using Fisher's Linear Discriminant Analysis [2]
Fisher's LDA was applied to the 27-dimension original feature set. Fig 1 shows the cumulative sum plot of the normalized eigenvalues after eigendecomposition. It can be seen from the plot that it is possible to reduce the feature set to 10 dimensions by choosing a 95% threshold when selecting the top M eigenvalues. Empirical performance tests using the SVM classifier would show that the reduced feature set better discriminates the phoneme classes from each other, as indicated by better accuracy rates compared to classifiers trained using the original 27-dimension feature set.
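A minimal NumPy sketch of this reduction step follows, assuming the standard scatter-matrix formulation of Section 1.4; the authors' exact numerical procedure is not published, so treat this as illustrative.

```python
import numpy as np

def fisher_lda_reduce(X, y, energy=0.95):
    # Compute within- and between-class scatter over labeled frames
    mu, d = X.mean(axis=0), X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)        # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)          # between-class scatter
    # Eigendecomposition of Sw^-1 Sb (pinv for numerical safety)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    eigvals = eigvals.real[order]
    eigvecs = eigvecs.real[:, order]
    # Keep the top-M directions whose normalized eigenvalue
    # cumulative sum first reaches the threshold (95% gave M = 10)
    cum = np.cumsum(eigvals) / eigvals.sum()
    M = int(np.searchsorted(cum, energy)) + 1
    W = eigvecs[:, :M]
    return X @ W, W   # reduced features and the transformation matrix
```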
2.3 The Multi-Layer Perceptron-based Classifier
MLP-based classifiers, being the most popular choice for speech recognition, and the family to which the best-performing classifiers in the literature belong, were implemented. In particular, single-hidden-layer MLPs with varying numbers of hidden nodes were implemented, in order to investigate how classifier performance is affected by the number of hidden nodes. The MLPs use the Backpropagation learning algorithm, with a logistic objective function and the Mean Square Error (MSE) as the performance metric. Fig 2 shows the training graph of the 500-hidden-node MLP, trained over 3000 single-pass epochs, with the MSE finally settling at 0.913%. The knee of the training graph is located roughly in the area of the 250th epoch. It was observed that, regardless of the number of hidden nodes, the knee is consistently located within this vicinity; hence, the comparative analysis of performances among the different MLPs was based on 300-epoch graphs. All the MLP-based classifiers were evaluated using both the original (27-dimension) and reduced (10-dimension) feature sets.
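For concreteness, an equivalent single-hidden-layer network can be declared as below. scikit-learn is an assumed stand-in (the paper does not name its implementation), and note one deliberate mismatch: MLPClassifier optimizes log-loss, whereas the authors trained against MSE.

```python
from sklearn.neural_network import MLPClassifier

def make_mlp(n_hidden=500, n_epochs=300):
    # Single hidden layer with logistic activation, trained by
    # gradient-descent backpropagation, as in Section 2.3
    return MLPClassifier(
        hidden_layer_sizes=(n_hidden,),  # 140, 500 or 760 in the study
        activation="logistic",
        solver="sgd",
        max_iter=n_epochs)               # comparisons used 300 epochs

# Usage sketch:
# err = 1.0 - make_mlp(500).fit(X_train, y_train).score(X_test, y_test)
```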

2.4 The Support Vector Machine-based Classifier
One of the objectives of this project is to benchmark the performance of SVM-based classification in terms of accuracy in classifying phoneme X versus non-phoneme X. This is a 2-way classification problem and can easily be handled by SVMs.

The SVM model is first trained in the same manner as the NN: by feeding it input together with the correct output label. From this training session, the model becomes more "learned" and hence more accurate when fed with test data.

Fig 1 Cumulative sum plot of normalized eigenvalues (versus eigenvalue rank) using Fisher's LDA on the FSC

Fig 2 Training Graph of the MLP-based Phoneme Recognizer

Since the objectives include seeing whether SVM-based classifiers perform close to NN-MLPs, and whether the feature reduction affects the accuracy of classification, we can look at the classification accuracies of both the raw, unreduced 27-dimensional feature set and the reduced 10-dimensional feature set in a 2-way classification (phoneme X versus non-phoneme X), as well as the over-all performance when the SVM is tasked to identify the particular phoneme among all 46 phonemes.

We decided to use a multi-model SVM classifier to act as the multi-class classifier for phoneme identification, extending the 2-way SVM classifier concept by using many different models for different differentiation classes, i.e., one model for phoneme 1 versus the others, another model for phoneme 2 versus the others, and so on. We can then classify each of the phonemes by looking at which of the models yields a "-1" label; we can interpret this as a "neuron" firing, and thus map the current instance, whose feature set is fed to all models, onto the phoneme that corresponds to the model number of the neuron that fired.
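A one-vs-rest sketch of this multi-model scheme is given below, using scikit-learn's libSVM-backed SVC (an SMO-style solver, consistent with [7] and the abstract's polynomial kernel). The kernel degree and the use of the decision margin for tie-breaking are assumptions, and the sign convention is flipped relative to the paper's "-1 fires" description.

```python
import numpy as np
from sklearn.svm import SVC

def train_multi_model_svm(X, y, n_phones=46):
    # One 2-way model per phoneme: phoneme k versus all the others
    models = []
    for k in range(n_phones):
        clf = SVC(kernel="poly", degree=3)   # degree is an assumption
        clf.fit(X, np.where(y == k, 1, -1))
        models.append(clf)
    return models

def classify_frame(models, x):
    # Feed the frame to all 46 models; the phoneme whose model
    # "fires" with the largest margin wins
    scores = [m.decision_function(x.reshape(1, -1))[0] for m in models]
    return int(np.argmax(scores))
```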
2.5 Error Metrics for Performance Measurements
The classifiers were compared based on the overall classification error rate and the per-phoneme classification error rate, whose formulas are shown below:

$$\mathrm{Overall\_Err} = \frac{1}{N}\sum_{i=1}^{N}\operatorname{sgn}\left(\left|\mathrm{Ideal}_i - \mathrm{Actual}_i\right|\right) \tag{1a}$$

$$\mathrm{Per\_phone\_Err}_i = \frac{1}{N(i)}\sum_{j=1}^{N(i)}\operatorname{sgn}\left(\left|\mathrm{Ideal}(i)_j - \mathrm{Actual}(i)_j\right|\right) \tag{1b}$$

where Ideal_i and Actual_i are the human-ascertained phoneme label and the output of the developed classifier, respectively, for the ith frame; Ideal(i)_j and Actual(i)_j are the human-ascertained phoneme label and the output of the developed classifier, respectively, for the jth frame with phoneme label i; and N(i) is the total number of frames human-labeled with value i. Note that i is an integer value corresponding to the phoneme labels of Table 1.
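Both metrics reduce to simple mismatch counts; a minimal sketch, assuming integer phone labels as in Table 1:

```python
import numpy as np

def error_rates(ideal, actual, n_phones=46):
    # Eq. (1a): fraction of frames whose label and output disagree
    ideal, actual = np.asarray(ideal), np.asarray(actual)
    overall_err = float(np.mean(ideal != actual))
    # Eq. (1b): the same fraction restricted to the N(i) frames
    # human-labeled with phone type i
    per_phone_err = {}
    for i in range(n_phones):
        mask = ideal == i
        if mask.any():
            per_phone_err[i] = float(np.mean(actual[mask] != i))
    return overall_err, per_phone_err
```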
3 RESULTS AND DISCUSSION
The MLP-based classifier's performance was tested using the overall classification error and the per-phoneme classification error metrics. The overall classification error on the test data set is computed to be 38.855%, whereas the average per-phoneme classification error is 70.398%. While the classifier performed poorly on some phonemes, even registering 100% per-phoneme errors on some phoneme types, it can be argued that the phoneme types on which the classifier failed have very small frequency counts in the training data set (see Fig 4), which kept them from significantly impressing themselves on the MLP architectural parameters during the training process. The best-represented phonemes, namely the vowels (types 21 to 25), the /s/ and /n/ phonemes (types 9 and 14, respectively), and the /pau/ phoneme (type 43), however, posted individual per-phoneme error rates typically below 30% (with the exception of /e/, with a 45.71% error rate). These phonemes registered an average per-phoneme error of 27.19%, which is comparable to the MLP-based classifiers designed for other major languages.

Fig 5 shows a graph comparing the accuracies of the 2-way SVM classifier when using the raw features and the reduced features. We can observe the same trend and very close accuracy levels for all the phonemes whether we use the raw feature set or the reduced feature set.

The resulting SVM classifier has a 43.20% over-all classification error. Fig 6 summarizes the classification errors per phoneme.

4 CONCLUSIONS
This project has endeavored to reduce the 27-dimensional feature set used in ASR for the Filipino Speech Corpus, using Fisher's LDA, a statistical method that exploits the labelling information of the training set, as opposed to the unsupervised feature reduction approach of PCA. Using a 95% threshold value on the resulting normalized eigenvalue cumulative sum plot, the feature set dimensions were reduced to 10. Examination of the corresponding transformation matrix used in feature reduction also showed that the MFCCs more accurately discriminate speech data classified according to phoneme type, compared to their first and second derivative values.

This project also investigated how an MLP-based classifier would perform when using the reduced feature set. The classifier registered an overall classification error of 38.855%, and an average per-phoneme error of 70.398%. Different representation levels, suggested by the disparate frequencies of occurrence of phoneme types in the training set, account for the varying per-phoneme error rates of the classifier. Focusing on the classifier's performance on the best-represented phonemes, however, indicated that the MLP-based classifier is at par with current well-established research on other major languages.

An SVM-based phoneme classifier was also designed, the performance of which was benchmarked against the designed MLP-based classifier. It was shown, via inspection of the per-phoneme error graphs, that the SVM-based classifier is comparable in performance with the MLP-based classifier, with approximately just a 5% difference between the respective classifier performances. It was noted, though, that the MLP-based classifier still had a slightly higher overall classification rate than the SVM-based classifier. The reduction operation using Fisher's LDA was also proven effective, as shown by improvements in the performance of the SVM-based phoneme classifier when compared with using the original 27-dimensional feature set.

The phoneme classifiers would have better performance if the training data set were made larger, such that all phoneme types are adequately represented during the training process. In terms of feature selection, other features can also be computed (e.g., Perceptual Linear Prediction or PLP coefficients, and auditory energy ratios), and the technique used in this study for effective feature reduction can be applied to them, in order to determine which features actually contribute to improving ASR performance. Also, it would be worthwhile to investigate whether other neural network architectures in the literature, used in ASR systems, can also be effectively applied to phoneme classification for the Filipino spoken language.

ACKNOWLEDGMENTS
The authors would like to acknowledge the Office of the Vice Chancellor for Research and Development of the University of the Philippines – Diliman for funding the Filipino Speech Corpus project, and the Department of Science and Technology for the Engineering Research and Development for Technology (ERDT) Scholarship grant given to the first two authors.
Fig 3 Per-phoneme error plot for the MLP-based classifier (classification error versus phone type)

Fig 4 Frequency counts of the phonemes in the training data set (frequency count versus phone type)

Fig 5 Two-way SVM classification error (in percent) per phone type, using the raw (27-dimension) and reduced (10-dimension) feature sets

Fig 6 Per-phoneme error plots for the SVM-based classifier

REFERENCES
[1] Cetin, O. et al. (2007). Monolingual and Crosslingual Comparison of Tandem Features Derived from Articulatory and Phone MLPs. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding; 2007 December 9-13; p. 36-41.
[2] Duda, R.O. et al. (2001). Pattern Classification, 2nd ed. NY: John Wiley & Sons.
[3] Guevara, R.C.L. et al. (2003). Speaker Independent Continuous Speech of the Filipino Speech Corpus. [Undergraduate Thesis]. Diliman, Philippines: University of the Philippines. 94p. (Available at the UP-Diliman library)
[4] Jurafsky, D. & Martin, J. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education, Inc., NJ.
[5] O'Shaughnessy, D. (2000). Speech Communications: Human and Machine, 2nd ed. IEEE Press: NY.
[6] Skowronski, M. & Harris, J. (2003). Improving the Filter Bank of a Classic Speech Feature Extraction Algorithm. Proceedings of the IEEE International Symposium on Circuits and Systems; Bangkok, Thailand; vol. IV; May 25-28, 2003; p. 281-284.
[7] The libSVM (v.2.88) SVM Library with Matlab Interface. (online: http://www.csie.ntu.edu.tw/~cjlin/libsvm)
[8] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag: New York.
[9] Wassner, H. & Chollet, G. (1996). New Time-Frequency Derived Cepstral Coefficients for Automatic Speech Recognition. Proceedings of the European Signal Processing Conference.
[10] Zahorian, S.A. et al. (1997). Phone Classification with Segmental Features and a Binary-Pair Partitioned Neural Network Classifier. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing; vol. 2; 1997 April 21-24; p. 1011-1014.
[11] Zhan, P. et al. (1996). Janus II: Towards Spontaneous Spanish Speech Recognition. Proceedings of the International Conference on Spoken Language Processing; Philadelphia, USA; vol. 4; October 3-6, 1996; p. 2285-2288.
