Joel P. ILAO, Michael L. ABUNDO, Prospero C. NAVAL Jr. and Rowena Cristina L. GUEVARA
Abstract: This paper describes the design and development of a phoneme classifier intended for Filipino speech. Noise-free speech samples were taken from a subset of the Filipino Speech Corpus (FSC) [3]. Speech features based on the first nine Mel-Frequency Cepstral Coefficients (MFCCs) and their first and second temporal derivatives were computed from the windowed speech samples, resulting in a 27-dimensional feature set. This feature set was reduced to 10 dimensions using Fisher's Linear Discriminant Analysis (LDA). Multilayer Perceptron (MLP) classification accuracies for both the original and reduced feature sets were then compared. For the 140-node, 500-node, and 760-node hidden-layer MLP architectures, classification performance degraded slightly with an increasing number of hidden-layer nodes. An SVM-based classifier with a polynomial kernel, trained using the Sequential Minimal Optimization (SMO) method, was also implemented; its performance was shown to be inferior to that of the MLP-based classifiers. The reduced 10-dimensional feature set resulted in faster classification, at the expense of slightly lower classification performance for all investigated classifiers.
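As a sketch of how the 27-dimensional feature vectors described above can be assembled, the snippet below stacks nine MFCCs with their first and second temporal derivatives. The MFCC matrix here is a random placeholder, not actual FSC data, and `np.gradient` is just one common way to approximate the temporal derivatives; the paper does not specify its exact delta computation.

```python
import numpy as np

def stack_mfcc_deltas(mfcc: np.ndarray) -> np.ndarray:
    """Stack 9 MFCCs with their first and second temporal
    derivatives, giving a 27-dimensional vector per frame.

    mfcc: array of shape (9, n_frames).
    """
    d1 = np.gradient(mfcc, axis=1)    # first temporal derivative
    d2 = np.gradient(d1, axis=1)      # second temporal derivative
    return np.vstack([mfcc, d1, d2])  # shape (27, n_frames)

# Placeholder MFCC matrix: 9 coefficients over 100 frames.
rng = np.random.default_rng(0)
features = stack_mfcc_deltas(rng.standard_normal((9, 100)))
print(features.shape)  # (27, 100)
```

Each column of the result is one frame's 27-dimensional feature vector, ready for the LDA reduction step described next.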
[Figure: cumulative sum of normalized eigenvalues vs. eigenvalue rank]
Fig 1 Cumulative Sum Plots of Normalized Eigenvalues using Fisher's LDA on FSC
Since our objectives include determining whether SVM-based classifiers perform close to NN-MLPs and whether feature reduction affects classification accuracy, we can examine the classification accuracies of both the raw, unreduced 27-dimensional feature set and the reduced 10-dimensional feature set in a two-way classification (phoneme X versus non-phoneme X), as well as the overall performance when the SVM is tasked to identify the particular phoneme among all 46 phonemes.

We decided to use a multi-model SVM classifier to act as the multi-class classifier for phoneme identification. Extending the two-way SVM classifier concept by using many different models for different differentiation classes, i.e. one model for phoneme 1 versus others, another model for phoneme 2 versus others, etc., we can classify each of the phonemes by looking at which of the models yields a "-1" label. We can interpret this as a "neuron" firing, and thus map the current instance, whose feature set is fed to all models, onto the phoneme that corresponds to the model number of the neuron that fired.

2.5 Error Metrics for Performance Measurements
The classifiers were compared based on the overall classification error rate and the per-phoneme classification error rate, whose formulas are shown below:

\[ \mathrm{Overall\_Err} = \frac{1}{N} \sum_{i=1}^{N} \operatorname{sgn}\left( Ideal_i - Actual_i \right) \tag{1a} \]

\[ \mathrm{Per\text{-}phone\_Err}_i = \frac{1}{N(i)} \sum_{j=1}^{N(i)} \operatorname{sgn}\left( Ideal(i)_j - Actual(i)_j \right) \tag{1b} \]

where Ideal_i and Actual_i are the human-ascertained phoneme label and the output of the developed classifier, respectively, for the i-th frame. Ideal(i)_j and Actual(i)_j are the human-ascertained phoneme label and the output of the developed classifier, respectively, for the j-th frame with phoneme label i. N(i) is the total number of frames with human-assigned label value i. Note that i is an integer value corresponding to the phoneme labels of Table 1.

3 RESULTS AND DISCUSSION
The MLP-based classifier's performance was tested using the overall classification error and the per-phoneme classification error metrics. The overall classification error on the test data set is computed to be 38.855%, whereas the average per-phoneme classification error is 70.398%. While the classifier performed poorly on some phonemes, even registering 100% per-phoneme errors on some phoneme types, it can be argued that the phoneme types on which the classifier failed have very small frequency counts in the training data set (see Fig 4), which kept them from significantly impressing on the MLP architectural parameters during the training process. The best-represented phonemes, namely the vowels (types 21 to 25), the /s/ and /n/ phonemes (types 9 and 14, respectively), and the /pau/ phoneme (type 43), however, posted individual per-phoneme error rates typically below 30% (with the exception of /e/, at a 45.71% error rate). These phonemes registered an average per-phoneme error of 27.19%, which is comparable to the MLP-based classifiers designed for other major languages.

Fig 5 shows a graph comparing the accuracies of the two-way SVM classifier when using the raw features and the reduced features. We can observe the same trend and very close accuracy levels across all phonemes whether the raw feature set or the reduced feature set is used.

The resulting SVM classifier has a 43.20% overall classification error. Fig 6 summarizes the classification errors per phoneme.

4 CONCLUSIONS
This project has endeavored to reduce the 27-dimensional feature set used in ASR for the Filipino Speech Corpus using Fisher's LDA, a statistical method that exploits the labelling information of the training set, as opposed to the unsupervised feature reduction approach of PCA. Using a 95% threshold value on the resulting normalized eigenvalue cumulative sum plot, the feature set dimensions were reduced to 10. Examination of the corresponding transformation matrix used in feature reduction also showed that the MFCCs more accurately discriminate speech data classified according to phoneme types, compared to their first- and second-derivative values.

This project also investigated how an MLP-based classifier would perform when using the reduced feature set. The classifier was able to register an overall classification error of 38.855% and an average per-phoneme error of 70.398%. Different representation levels, suggested by the disparate frequencies of occurrence of phoneme types in the training set, account for the varying per-phoneme error rates of the classifier. Focusing on the classifier's performance on the best-represented phonemes, however, indicated that the MLP-based classifier is on par with current well-established research on other major languages.

An SVM-based phoneme classifier was also designed, the performance of which was benchmarked against the designed MLP-based classifier. Inspection of the per-phoneme error graphs showed that the SVM-based classifier is comparable in performance with the MLP-based classifier, with approximately a 5% difference between the respective classifier performances. It was noted, though, that the MLP-based classifier still had a slightly higher overall classification rate than the SVM-based classifier. The reduction operation using Fisher's LDA was also shown to be effective, as evidenced by improvements in the performance of the SVM-based phoneme classifier compared with using the original 27-dimensional feature set.

The phoneme classifiers would perform better if the training data set were enlarged so that all phoneme types are adequately represented during the training process. In terms of feature selection, other features can also be computed (e.g. Perceptual Linear Prediction or PLP coefficients, auditory energy ratios) and the feature-reduction technique used in this study can likewise be applied, in order to determine which features actually contribute to improving ASR performance. It would also be worthwhile to investigate whether other neural network architectures in the ASR literature could be effectively applied to phoneme classification for the Filipino spoken language.

ACKNOWLEDGMENTS
The authors would like to acknowledge the Office of the Vice Chancellor for Research and Development of the University of the Philippines – Diliman for funding the Filipino Speech Corpus project, and the Department of Science and Technology for the Engineering Research and Development for Technology (ERDT) Scholarship grant given to the first two authors.
[Figure: classification error vs. phone type]
[Figure: frequency count vs. phone type]
[Figure: classification error (in percent) vs. phone type]
Fig 5 Two-way SVM Classification Performance using Raw & Reduced Feature Sets
Fig 6 Per-phoneme Error Plots for the SVM-Based Classifier
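For concreteness, the overall and per-phoneme error rates of Eqs. (1a) and (1b) can be computed as in the following sketch, interpreting sgn of the label difference as counting any mismatch between the ideal and actual labels as one error. The label arrays are invented examples, not FSC results.

```python
import numpy as np

def overall_err(ideal: np.ndarray, actual: np.ndarray) -> float:
    """Eq. (1a): fraction of the N frames whose classifier output
    differs from the human-ascertained label."""
    return float(np.mean(ideal != actual))

def per_phone_err(ideal: np.ndarray, actual: np.ndarray, i: int) -> float:
    """Eq. (1b): fraction of the N(i) frames with true label i
    that were misclassified."""
    mask = ideal == i
    return float(np.mean(ideal[mask] != actual[mask]))

# Invented example: 6 frames, integer phoneme labels.
ideal = np.array([1, 1, 2, 2, 3, 3])
actual = np.array([1, 2, 2, 2, 3, 1])
print(overall_err(ideal, actual))       # 2 of 6 wrong
print(per_phone_err(ideal, actual, 3))  # 1 of 2 frames of label 3 wrong
```

Averaging `per_phone_err` over all 46 phoneme labels gives the average per-phoneme error figures quoted in Sections 3 and 4.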