
Copyright: 978-1-4673-1711-5/12/$31.00 © 2012 IEEE
ICSES 2012 International Conference on Signals and Electronic Systems
WROCLAW, POLAND, September 18-21, 2012







Evaluation of Time Domain Features for
Voiced/Non-voiced Classification of Speech

F. Ykhlef
System Architecture and Multimedia Division,
CDTA,
Cite 20 Aout 1956, Baba Hassen, Algiers, Algeria,
fykhlef@cdta.dz

L. Bendaouia
Embedded System Laboratory,
Cergy-Pontoise University,
6 rue du Ponceau, Cergy, France,
lotfi.bendaouia@ensea.fr


Abstract: In this paper, we evaluate several time domain features for the voiced/non-voiced classification of the speech signal. We have selected three features, the autocorrelation function (ACF), the average magnitude difference function (AMDF) and the weighted ACF (WACF), to form three different classifiers. Experiments were conducted on the TIMIT database in clean and noisy environments; white noise extracted from the NOISEX92 database was added to validate the developed classifiers. We establish an overall ranking of these classifiers based on the average value of the percentage of classification accuracy (Pc).
I. INTRODUCTION
Accurate and reliable voiced/non-voiced classification of
speech is a crucial preprocessing step in many speech
processing applications and it is essential in most analysis
and synthesis systems. The goal of classification is to de-
termine whether the speech production system involves
vibration of the vocal folds or not. For example, voicing
determination is vital in the pitch detection problem: an accurate voicing decision can significantly improve the performance of a pitch detector [1].
Voiced speech consists of periodic or quasi-periodic sounds made when there is significant glottal activity (vibration of the vocal folds). Unvoiced speech is a non-periodic, random excitation sound caused by air passing through a narrow constriction of the vocal tract. Unvoiced sounds include the main classes of consonants, namely voiceless fricatives, occlusives and stops. When both quasi-periodic and random excitations are present simultaneously (called mixed excitation, as in voiced fricatives), the speech is classified as voiced, because the vibration of the vocal folds is part of the speech act. In other contexts, mixed excitation could be treated as a separate class [2]. The non-voiced region includes silence and unvoiced speech [3].
A variety of classifiers for robust voiced/non-voiced classification have been reported in the literature [1], [2], [4]. The majority of them use hybrid approaches for the voicing decision, which combine time and frequency acoustical features.
Several features have been used in the literature: the energy of the signal (E), the zero crossing rate (ZCR), the autocorrelation function (ACF), the average magnitude difference function (AMDF), the weighted ACF (WACF), the cepstral function (CEP), discrete wavelet transform (DWT) coefficients, the first coefficient of a p-th order linear prediction analysis, and the harmonic measure [3], [4]. The combination of
evidence from multiple features can be performed using statistical models such as neural networks, Gaussian mixture models or hidden Markov models [4]. Such a combination can yield significantly stronger classifiers, depending on the number of features incorporated in the model. On the other hand, hardware implementation and real-time applications require reducing the number of features in order to decrease the computational complexity. The performance of these classifiers, in terms of the percentage of classification accuracy (Pc) and especially in noisy environments, depends on the choice of a suitable feature.
The ACF, AMDF and WACF are widely used as input attributes to these classification models. They are preferred for their low computational cost and precise estimation. Furthermore, these features are commonly used in the development of pitch detection techniques, since they provide significant instants of pitch periods; thus, they can be used for both classification and pitch detection. In our study, we focus on their use for speech classification. They are also time-based functions, which is generally a preferable characteristic in real-time applications. The purpose of this paper is to study separately the performance of each feature for voiced/non-voiced classification in clean and noisy environments and to establish an overall ranking of the features. To achieve this purpose, we have developed three classification schemes, each of which uses only one feature, without the need for pre- or post-processing stages.
We have used manually segmented speech signals from the TIMIT database to measure the success of the classification into voiced and non-voiced frames. The performance of the developed classifiers is evaluated using additive white noise extracted from the NOISEX92 database, with input SNRs (signal-to-noise ratios) ranging from 30 dB down to -5 dB.
Our paper is organized as follows. In Section II, the detailed implementations of the three classifiers are reviewed. In Section III, the evaluation and comparison of the features for voiced/non-voiced classification are given. Section IV gives a conclusion and future perspectives.


Fig. 1: Block diagram of the correlation classifiers (i = 1: ACF; i = 3: WACF). The speech frame is low-pass filtered (0-700 Hz), $\varphi_i(\tau)$ is computed for $\tau = n_1, \dots, n_2$, its maximum peak $\alpha_i$ is found, and the frame is classified as voiced if $\alpha_i > \theta_{0i}$, non-voiced otherwise.
II. VOICED/NON-VOICED CLASSIFIERS
In this section, detailed explanations of the three distinct classifiers are provided. Each classification scheme is based on just one acoustical feature among the ACF, the AMDF and the WACF.
A. ACF
The first classifier is based on the ACF. Fig. 1 (i = 1) shows a block diagram of the classification scheme. It is based on frame-by-frame processing of the speech signal using a stationary rectangular window of 22.5 ms duration. The speech signal has a sampling rate (Fs) of 16 kS/s. The frame x(n) is low-pass filtered to 700 Hz (a 20-point linear-phase finite impulse response low-pass filter (LPF)) in order to eliminate the formant structure of the speech signal. The first stage of the processing is the computation of the ACF using equation (1):
$$\varphi_1(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau), \qquad \tau = n_1, \dots, n_2 \qquad (1)$$

where N is the length of the frame, equal to 360 samples, and $\tau$ is the lag number. $n_1$ and $n_2$ represent the range of the ACF computation, which corresponds to the frequency band of the fundamental frequency (namely from 70 to 600 Hz); they are set to 26 and 200 samples, respectively, for a sampling rate of 16 kS/s. The characteristic of $\varphi_1(\tau)$ is that it exhibits a large peak for voiced frames which decreases for non-voiced frames.
The second stage of the processing is the extraction of the largest peak of the ACF, denoted $\alpha_1$. In the third stage, this peak is compared to a fixed threshold $\theta_{01}$: if $\alpha_1$ is greater than $\theta_{01}$, the frame is classified as voiced; otherwise, it is classified as non-voiced.
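To make the scheme concrete, the following Python sketch implements the three stages. It is a minimal illustration, not the authors' implementation: scipy's firwin/lfilter stand in for the paper's 20-point FIR low-pass filter, and all names (acf, acf_classify, theta_01) are our own.

```python
# A minimal sketch of the ACF classifier (eq. (1) and Fig. 1, i = 1).
import numpy as np
from scipy.signal import firwin, lfilter

FS = 16000          # sampling rate: 16 kS/s
N = 360             # frame length: 22.5 ms at 16 kS/s
N1, N2 = 26, 200    # lag range covering the fundamental frequency band

def acf(x, n1=N1, n2=N2):
    """Non-normalized autocorrelation phi_1(tau), eq. (1), for tau = n1..n2."""
    return np.array([np.dot(x[:len(x) - tau], x[tau:]) / N
                     for tau in range(n1, n2 + 1)])

def acf_classify(frame, theta_01):
    """Return True (voiced) when the largest ACF peak alpha_1 exceeds theta_01."""
    lpf = firwin(20, 700.0, fs=FS)   # 20-point FIR LPF, passband 0-700 Hz
    x = lfilter(lpf, 1.0, frame)     # remove the formant structure
    alpha_1 = acf(x).max()           # largest peak over the lag range
    return alpha_1 > theta_01
```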



Fig. 2: Block diagram of the AMDF classifier. The speech frame is low-pass filtered (0-700 Hz), $\varphi_2(\tau)$ is computed for $\tau = n_1, \dots, n_2$, its global minimum valley $\alpha_2$ is found, and the frame is classified as voiced if $\alpha_2 \le \theta_{02}$, non-voiced otherwise.

It should be noted that it is possible to use a normalized version of the autocorrelation function to obtain the voicing decision. In that case, we would be obliged to add a silence detector to eliminate the wrong decisions (silence frames classified as voiced ones) that could otherwise appear, as presented in [5]. In this study we have preferred the non-normalized version of the ACF in order to reduce the number of features in the classification (we do not need a silence detector for voiced/non-voiced classification).
B. AMDF
Fig. 2 shows a block diagram of the classification based on the AMDF. The classifier follows the same steps as the previous one, except that the feature is replaced by the AMDF given by equation (2):

$$\varphi_2(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} \left| x(n) - x(n+\tau) \right|, \qquad \tau = n_1, \dots, n_2 \qquad (2)$$

The values of the variables N, $n_1$ and $n_2$ are kept the same as in the ACF classifier.
The AMDF is a variation of the ACF: instead of correlating the input speech at various delays, where multiplications and summations are performed at each lag value, a difference signal is formed between the delayed speech and the original, and the absolute magnitude is taken at each delay value. Unlike the ACF, however, the AMDF calculations require no multiplications, a desirable property for hardware and real-time applications [6]. The characteristic of $\varphi_2(\tau)$ is that several valleys appear periodically for voiced speech. The global minimum valley $\alpha_2$ is used for the voicing decision: if it is lower than or equal to a fixed threshold $\theta_{02}$, the frame is classified as voiced; otherwise, it is classified as non-voiced. As with the ACF, the AMDF is the only feature used in the classification, and there is no need for a silence detector.
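A hypothetical AMDF counterpart of the previous sketch only swaps the feature; it reuses FS, N, N1, N2, firwin and lfilter from the ACF sketch, and the names remain illustrative. Note that, apart from the 1/N scaling, only subtractions and absolute values are formed at each lag.

```python
def amdf(x, n1=N1, n2=N2):
    """AMDF phi_2(tau), eq. (2): mean absolute difference at each lag."""
    return np.array([np.sum(np.abs(x[:len(x) - tau] - x[tau:])) / N
                     for tau in range(n1, n2 + 1)])

def amdf_classify(frame, theta_02):
    """Return True (voiced) when the global minimum valley alpha_2 <= theta_02."""
    x = lfilter(firwin(20, 700.0, fs=FS), 1.0, frame)
    alpha_2 = amdf(x).min()          # global minimum valley
    return alpha_2 <= theta_02
```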




C. WACF
The WACF was first proposed by T. Shimamura [7] in 2001 for pitch detection. Exploiting the fact that the AMDF has characteristics similar to those of the ACF, the ACF is weighted by the reciprocal of the AMDF to form what is called the WACF. In this paper, the WACF is used to classify the speech frames into voiced and non-voiced classes. The classification scheme is shown in the block diagram of Fig. 1 (i = 3), where $\varphi_3(\tau)$ denotes the WACF given by the following equation:

$$\varphi_3(\tau) = \frac{\varphi_1(\tau)}{\varphi_2(\tau) + k}, \qquad \tau = n_1, \dots, n_2 \qquad (3)$$
where $\varphi_1(\tau)$ is the ACF and $\varphi_2(\tau)$ is the AMDF. The values of the variables $n_1$ and $n_2$ are kept the same as in the ACF classifier. $k$ is a fixed number used to avoid divergence when directly inverting $\varphi_2(\tau)$ at lag zero (because $\varphi_2(0) = 0$). In our case this constant is not of high importance, because $\varphi_3(\tau)$ is only computed between $n_1$ and $n_2$; in this study, it is set to 0.1. The voicing decision follows the same steps as in the ACF classifier: the maximum peak found, $\alpha_3$, is this time compared to $\theta_{03}$.
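Reusing the acf and amdf helpers from the sketches above, a minimal WACF version could read as follows (k = 0.1 as in the paper; the names remain illustrative):

```python
K = 0.1   # fixed constant from the paper; avoids dividing by phi_2(0) = 0

def wacf(x, n1=N1, n2=N2, k=K):
    """WACF phi_3(tau), eq. (3): ACF weighted by the reciprocal of (AMDF + k)."""
    return acf(x, n1, n2) / (amdf(x, n1, n2) + k)

def wacf_classify(frame, theta_03):
    """Return True (voiced) when the largest WACF peak alpha_3 exceeds theta_03."""
    x = lfilter(firwin(20, 700.0, fs=FS), 1.0, frame)
    return wacf(x).max() > theta_03
```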
III. PERFORMANCE EVALUATION
A. Criteria of the test
The performance of the three classifiers was tested on a
speech database which was hand labeled into voiced/non-
voiced regions. Three measures were used:
- voiced speech classified as non-voiced (VNV error),
- non-voiced speech classified as voiced (NVV error),
- percentage of classification accuracy (Pc):

$$P_c = 1 - (A \cdot \mathrm{VNV} + B \cdot \mathrm{NVV}) \qquad (4)$$

where A and B represent the percentages of voiced and non-voiced frames in the speech utterance, respectively. The two types of error listed above occur during the initial classification of speech into voiced and non-voiced regions, when non-voiced speech is considered voiced or voiced speech is misclassified as non-voiced.
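Assuming per-frame boolean reference labels and classifier decisions (True = voiced), the three measures could be computed as in this sketch; the function name and signature are ours, not the paper's.

```python
import numpy as np

def score(ref_voiced, hyp_voiced):
    """Return (Pc, VNV, NVV) in percent, following eq. (4)."""
    ref = np.asarray(ref_voiced, dtype=bool)
    hyp = np.asarray(hyp_voiced, dtype=bool)
    vnv = 100.0 * np.mean(~hyp[ref])    # voiced frames classified non-voiced
    nvv = 100.0 * np.mean(hyp[~ref])    # non-voiced frames classified voiced
    a = ref.mean()                      # A: fraction of voiced frames
    b = 1.0 - a                         # B: fraction of non-voiced frames
    pc = 100.0 - (a * vnv + b * nvv)    # eq. (4), expressed in percent
    return pc, vnv, nvv
```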
B. Speech Data
The performance of the developed classifiers is evaluated on the TIMIT database [8]. The speech underwent extensive manual labeling before it could be used. The spoken material consisted of a set of 26 rich English sentences from the TIMIT database (13 female and 13 male) that contain several dialects and acoustic forms (weak voiced speech and fast voiced/non-voiced transitions). The percentage of voiced speech samples in each utterance is maintained at 50% (A and B equal to 0.5 in equation (4)) by appending the required duration of silence.



TABLE 1. PERFORMANCE OF VOICED/NON-VOICED CLASSIFICATION (%) UNDER WHITE NOISE
(each cell lists Pc / VNV / NVV)

SNR   | ACF                  | AMDF                 | WACF
------|----------------------|----------------------|---------------------
Clean | 97.58 / 2.20 / 2.62  | 95.92 / 2.93 / 5.23  | 97.66 / 2.37 / 2.31
30 dB | 97.11 / 3.28 / 2.49  | 95.85 / 2.89 / 5.39  | 97.62 / 2.74 / 2.01
20 dB | 97.05 / 3.43 / 2.47  | 95.63 / 2.44 / 6.30  | 97.60 / 2.86 / 1.93
10 dB | 96.52 / 3.60 / 3.34  | 83.90 / 0.28 / 31.91 | 96.57 / 5.25 / 1.60
5 dB  | 93.97 / 3.24 / 8.81  | 62.84 / 0.04 / 74.28 | 95.28 / 6.21 / 3.21
0 dB  | 63.74 / 0.21 / 72.31 | 50.00 / 0.00 / 100.0 | 87.29 / 7.71 / 17.69
-5 dB | 50.75 / 0.04 / 98.45 | 50.00 / 0.00 / 100.0 | 67.08 / 4.55 / 61.28
Two experienced persons performed the manual classification of the spoken material. The original time waveform was used as the primary tool, with the spectrogram used only in the few cases where the waveform was insufficient to make the decision.
C. Performance of the classifiers
The three classifiers developed in this paper are threshold-based classifiers. The performance of these classification schemes is directly related to the optimal choice of the thresholds, which are given as follows:
- $\theta_{01}$ for the ACF,
- $\theta_{02}$ for the AMDF,
- $\theta_{03}$ for the WACF.
The main purpose of this study is to assess each time domain feature for the voicing classification of English speech using an optimal decision level. Consequently, we need to find an optimal threshold for each feature in order to evaluate the global performance of the classifiers. A practical approach is to seek a value that gives the optimal classification of each utterance in the clean environment and then to use it in the noisy environment. Once the peaks (or valleys) $\alpha_1$, $\alpha_2$ and $\alpha_3$ are computed for every frame in the utterance, the thresholds $\theta_{01}$, $\theta_{02}$ and $\theta_{03}$ are respectively found by computing the median value of these peaks (or valleys), as sketched below. These thresholds are updated for every utterance in the speech database.
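The median rule can be sketched as follows; feature_peak is a hypothetical callable, introduced here for illustration, that maps one clean frame to its alpha value (maximum ACF/WACF peak or minimum AMDF valley).

```python
import numpy as np

def select_threshold(clean_frames, feature_peak):
    """Per-utterance threshold: median of the per-frame peak (or valley)
    values computed on the clean utterance, then reused on its noisy versions."""
    return float(np.median([feature_peak(f) for f in clean_frames]))

# Example for the ACF classifier, reusing the Section II helpers:
# theta_01 = select_threshold(
#     frames, lambda f: acf(lfilter(firwin(20, 700.0, fs=FS), 1.0, f)).max())
```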
The performance of the three classifiers is reported in Table 1 for clean and noisy speech. White noise from the NOISEX92 database was added in order to obtain the noisy speech.
All the classifiers achieve a good Pc in the clean environment, which degrades as the SNR of the added noise decreases. The WACF has the best performance and ranks first at every tested SNR.


The observed degradation of all the classifiers is essentially due to the NVV error, which increases as the SNR decreases. The ACF classifier ranks second, with a large NVV error at an SNR of -5 dB. Finally, the AMDF ranks third, with a 100% NVV error at an SNR of 0 dB.
IV. CONCLUSION
This paper reported the results of a performance evaluation of three voiced/non-voiced classification schemes, each of which uses only one time domain feature. Based on a variety of error measurements, the performance of the developed classifiers was characterized at different SNRs of added white noise. It has been shown that the percentage of classification accuracy degrades as the SNR decreases, and that this degradation is essentially due to the NVV error. The ranking of the classifiers was established based on the percentage of classification accuracy: the WACF classifier ranks first, followed by the ACF and the AMDF, respectively.
Combining evidence from several features would improve the classification accuracy; however, the computational complexity would increase.
In future work, the performance of the studied features will be evaluated for other noise types. Furthermore, evidence from multiple features will be combined using a statistical model.
ACKNOWLEDGMENTS
The authors would like to thank D. Belabdi and Y. Boudjahfa, postgraduate English students at Batna University and the ENS school of Algiers, respectively, for their helpful suggestions and language support.
REFERENCES
[1] S. Ahmadi and A. S. Spanias, "Cepstrum-Based Pitch Detection Using a New Statistical V/UV Classification Algorithm," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 333-338, 1999.
[2] Y. Qi and B. R. Hunt, "Voiced-Unvoiced-Silence Classification of Speech Using Hybrid Features and a Network Classifier," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 250-255, 1993.
[3] N. Dhananjaya and B. Yegnanarayana, "Voiced/Nonvoiced Detection Based on Robustness of Voiced Epochs," IEEE Signal Processing Letters, vol. 17, no. 3, pp. 273-276, 2010.
[4] B. S. Atal and L. R. Rabiner, "A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 3, pp. 201-212, 1976.
[5] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 399-418, 1976.
[6] Z. Yu-Min, W. Zhen-Yang, L. Hai-Bin, and Z. Lin, "Modified AMDF Pitch Detection Algorithm," International Conference on Machine Learning and Cybernetics, vol. 1, pp. 470-473, Xi'an, 2003.
[7] T. Shimamura and H. Kobayashi, "Weighted Autocorrelation for Pitch Extraction of Noisy Speech," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 727-730, 2001.
[8] J. S. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus," Linguistic Data Consortium, Philadelphia, PA, 1993.
