Sunteți pe pagina 1din 18

Available online at www.sciencedirect.

com

Speech Communication 81 (2016) 54–71


www.elsevier.com/locate/specom

Significance of analytic phase of speech signals in speaker verification


Karthika Vijayan∗, Pappagari Raghavendra Reddy, K. Sri Rama Murty
Department of Electrical Engineering, Indian Institute of Technology Hyderabad, India
Received 1 July 2015; received in revised form 22 January 2016; accepted 6 February 2016
Available online 26 February 2016

Abstract
The objective of this paper is to establish the importance of phase of analytic signal of speech, referred to as the analytic phase, in
human perception of speaker identity, as well as in automatic speaker verification. Subjective studies are conducted using analytic phase
distorted speech signals, and the adversities occurred in human speaker verification task are observed. Motivated from the perceptual studies,
we propose a method for feature extraction from analytic phase of speech signals. As unambiguous computation of analytic phase is not
possible due to the phase wrapping problem, feature extraction is attempted from its derivative, i.e., the instantaneous frequency (IF). The IF is
computed by exploiting the properties of the Fourier transform, and this strategy is free from the phase wrapping problem. The IF is computed
from narrowband components of speech signal, and discrete cosine transform is applied on deviations in IF to pack the information in smaller
number of coefficients, which are referred to as IF cosine coefficients (IFCCs). The nature of information in the proposed IFCC features is
studied using minimal-pair ABX (MP-ABX) tasks, and t-stochastic neighbor embedding (t-SNE) visualizations. The performance of IFCC
features is evaluated on NIST 2010 SRE database and is compared with mel frequency cepstral coefficients (MFCCs) and frequency domain
linear prediction (FDLP) features. All the three features, IFCC, FDLP and MFCC, provided competitive speaker verification performance
with average EERs of 2.3%, 2.2% and 2.4%, respectively. The IFCC features are more robust to vocal effort mismatch, and provided relative
improvements of 26% and 11% over MFCC and FDLP features, respectively, on the evaluation conditions involving vocal effort mismatch.
Since magnitude and phase represent different components of the speech signal, we have attempted to fuse the evidences from them at the
i-vector level of speaker verification system. It is found that the i-vector fusion is considerably better than the conventional scores fusion.
The i-vector fusion of FDLP+IFCC features provided a relative improvement of 36% over the system based on FDLP features alone, while
the fusion of MFCC+IFCC provided a relative improvement of 37% over the system based on MFCC alone, illustrating that the proposed
IFCC features provide complementary speaker specific information to the magnitude based FDLP and MFCC features.
© 2016 Elsevier B.V. All rights reserved.

Keywords: Analytic phase; Instantaneous frequency; Feature extraction; MP-ABX tasks; t-SNE visualization; Speaker verification.

1. Introduction to textual message, language of communication, emotion and


health state of the speaker. From such a composite signal, the
Speaker verification is the task of verifying the claimed extraction of speaker-specific features that help in discrimi-
identity of a person from his/her voice. It is an important nating the speakers well, has to be realized. Speaker-specific
task in the field of speech processing, finding applications characteristics are mainly a result of anatomical structure,
in the areas of voice access control, telephone banking and like vocal tract shape and size, and learned speaking habits,
forensics (Kinnunen and Li, 2010). Speaker verification sys- like dialect and prosody. The anatomical structure plays an
tem requires extraction of speaker-specific information from important role in characterizing the speaker and hence, we
the speech signal. In addition to information about the speaker need to analyze the speech signal and extract features rep-
identity, the speech signal conveys rich information related resenting the anatomical structure of the speech production
mechanism.
∗ Corresponding author. Tel.: +91 9581145556. One of the major goals of signal analysis is to infer the
E-mail addresses: ee11p011@iith.ac.in, karthikavijayan@gmail.com characteristics of the underlying system from the signal. In
(K. Vijayan), ee12m1023@iith.ac.in (P. Raghavendra Reddy), the case of speech analysis, we need to extract information
ksrm@iith.ac.in (K. Sri Rama Murty).

http://dx.doi.org/10.1016/j.specom.2016.02.005
0167-6393/© 2016 Elsevier B.V. All rights reserved.
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 55

about the vocal tract system (VTS) and excitation source from Another well known algorithm for AM-FM decomposition
the observed speech signal. Since speech is a natural signal, it of a NB signal is energy separation algorithm (ESA) (Maragos
is not amenable for closed-form mathematical representation. et al., 1993a, 1993b). This method utilizes nonlinear Teager-
However, natural signals can be analyzed by expanding them Kaiser energy operator, which calculates energy of a mono-
using a complete set of basis functions, having precise math- component signal as the product of its squared amplitude and
ematical representation. From a mathematical point of view, frequency (Kaiser, 1990). The instantaneous characteristics of
there are several ways of achieving this signal decomposition. the signal are then obtained by applying the ESA algorithm
The Fourier transform is a prominent approach for signal de- (Maragos et al., 1993a). Comprehensive comparison between
composition, in which an arbitrary signal is expressed as a Hilbert transform method and ESA method can be found in
linear combination of complex sinusoids (Oppenheim et al., Vakman (1996) and Potamianos and Maragos (1994). Phase
1999). The set of coefficients of these complex sinusoids rep- locked loops and extended Kalman filters have also been ex-
resents relative contributions of different frequencies, and is plored for demodulation of NB signals (Gill and Gupta, 1972;
called the spectrum. The spectrum, in general, is complex- Pai and Doerschuk, 2000; Pantazis et al., 2011).
valued and it is often advantageous to express it in terms Generalization of AM-FM demodulation techniques to
of its magnitude and phase. In the case of speech signals, multi-component wideband (WB) signals, like speech, is not
prominent peaks in the magnitude spectrum, referred to as straightforward. It is not physically meaningful to interpret the
formants (Quatieri, 2001), convey information about the res- instantaneous amplitude and phase of a multi-component sig-
onances of the VTS. The locations of the formants are in- nal (Boashash, 1992). The most common solution for AM-FM
fluenced by the anatomical structure of the VTS, and hence, decomposition of a WB signal is to pass the signal through a
are important for speaker recognition. For example, location bank of NB filters, and then apply the preferred NB decompo-
of the first formant, being inversely proportional to the length sition algorithm on the output of each filter (Potamianos and
of the vocal tract, might be helpful in inferring the height Maragos, 1996). This strategy is called multiband demodu-
of a speaker (Greisbach, 1999). Most of the state-of-the-art lation analysis, and is similar to phase vocoder in speech
speaker recognition systems use features extracted from the processing (Quatieri, 2001). Another popular approach for
magnitude spectrum of the speech signal. Mel frequency cep- AM-FM separation of WB signals is the empirical mode de-
stral coefficients (MFCCs) (Davis and Mermelstein, 1980) composition (Huang et al., 1998), which uses the extrema of
and linear prediction cepstral coefficients (LPCCs) (Makhoul, the signal to obtain intrinsic mode functions (IMF). Although
1975), which represent the gross envelope of the magnitude this method provides highly accurate signal representations,
spectrum, are the commonly used features for speaker recog- straightforward implementation of sifting procedure produces
nition (Kinnunen and Li, 2010). Though there were attempts mode mixing (Huang et al., 1998). That is, a specific signal
to extract features from the phase spectrum of the speech may not be separated into the same IMFs every time. This
signal, they were not as popular as their magnitude counter- problem makes it hard to implement feature extraction, model
parts (Alsteris and Paliwal, 2007; Picone, 1993). Since Fourier training and pattern recognition, since a feature is no longer
transform is a weighted average of the signal, over the entire fixed at one labeling index.
duration, the Fourier spectra of signals with time-varying fre- The AM-FM analysis, especially the amplitude component,
quency content is not physically meaningful (Cohen, 1995). has been used in speech processing applications. The locations
For example, the Fourier transform of a Gaussian modulated of the formants were estimated from AM-FM decomposition
chirp signal in Fig. 1(a) does not offer any insight into its of speech signals using linear adaptive or fixed filter-banks
time-varying characteristics. Hence, short-time Fourier trans- (Atal and Shadle, 1978; Potamianos and Maragos, 1996; Rao
form (STFT) is used to analyze the time-varying characteris- and Kumaresan, 2000). Features extracted from instantaneous
tics of the VTS (Rabiner and Schafer, 1978). amplitude envelopes of NB components were used for speech
Amplitude modulated and frequency modulated (AM-FM) and speaker recognition (Gowda et al., 2015; Kinnunen, 2006;
signal decomposition provides an alternative way of ana- Sadjadi et al., 2012; Shannon et al., 1995). Frequency domain
lyzing time-varying frequency content in a signal. In this linear prediction (FDLP) features, derived from all-pole mod-
analysis, a narrowband (NB) signal (predominantly mono- els of amplitude envelopes, were found to be either compara-
component) is decomposed into instantaneous amplitude and ble or better than the conventional MFCC features for speech
phase components. Several methods have been proposed to processing applications (Athineos and Ellis, 2007; Ganapathy
accomplish such a decomposition (Cohen, 1995; Gianfelici et al., 2014).
et al., 2007; Griffiths, 1975; Kumaresan and Rao, 1999; Both AM and FM components are essential for exact re-
Maragos et al., 1993a, 1993b; Potamianos and Maragos, construction of the original signal. Perceptual studies also
1999; Quatieri, 2001; Quatieri et al., 1997). The analytic sig- assert that FM component is important for human percep-
nal representation obtained through the Hilbert transform is tion of speech signals, especially in noisy conditions (Wolfe
the most commonly used method for AM-FM decomposi- et al., 2009; Won et al., 2014; Zeng et al., 2005). However,
tion (Boashash, 1992; Cohen, 1995). Fig. 1(b) and (c) shows the FM component has received lesser prominence than the
the instantaneous amplitude and analytic phase variations ob- AM component in mainstream speech processing. This could
tained from the analytic signal representation of a Gaussian be due to the inevitable phase wrapping problem associated
modulated chirp signal in Fig. 1(a). with the computation of FM component, or the analytic phase
56 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

a
1

0.5

-0.5

-1
b 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

0.5

0
c 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2


0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
d
150
Frequency (Hz)

100

50

0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Time (ms)

Fig. 1. Interpretation of instantaneous frequency: (a) Gaussian modulated chirp signal with frequency sweep between 20 Hz and 100 Hz, (b) instantaneous
amplitude, (c) analytic phase and (d) instantaneous frequency.

(Quatieri, 2001). It is difficult to analyze the analytic phase of explored for speech recognition (Paliwal and Atal, 2003).
the Gaussian modulated quadratic chirp shown in Fig. 1(c), Features extracted from median smoothed IF were used for
as all the values are wrapped between −π and +π . However, speaker recognition (Thiruvaran et al., 2006). In order to over-
the derivative of the unwrapped analytic phase referred to as come the artifacts associated with computation of IF, all-pole
the instantaneous frequency (IF), can be computed without modeling of NB signals was attempted (Thiruvaran et al.,
getting affected by phase wrapping. The IF of a signal can 2008). In this approach, the trajectory of phase angles of poles
be interpreted as the frequency of the sinusoid that locally in the z-domain was used to estimate IF. The IF computed
fits the given signal. Hence, the IF of a NB signal depicts from all-pole modeling of NB signals, was found to provide
the time-varying frequency content in the signal. The IF of complementary speaker-specific information to the MFCC
the chirp signal, in Fig. 1(d), shows the quadratic variation features (Nosratighods et al., 2009). Feature extracted from
of frequency in the NB signal shown in Fig. 1(a). the zero crossing rates of the NB components, which approx-
The IF, because of the sudden shifts in analytic phase, ex- imate the IFs, were used for speaker recognition (Thiruvaran
hibits spurious peaks in the regions where the instantaneous et al., 2009). In Saratxaga et al. (2009), relative phase shifts
amplitude is close to zero. Several workarounds were pro- of the NB components from the fundamental frequency of the
posed to minimize this artifact and derive features from IF. speech signal were used to estimate the frequency deviation.
The IF features extracted from higher amplitude regions, were Features extracted from the first and second spectral moments
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 57

of NB components, which represent the average frequency tems are combined, the resulting system is remarkably better
and bandwidth, were used for speech recognition (Tsiakoulis than any one of them, illustrating the complementary nature
et al., 2010). Features derived from amplitude weighted IF of information captured in all the three features.
were explored for speech recognition (Dimitriadis et al., 2005;
Tsiakoulis et al., 2009; Yin et al., 2011), speaker recognition 2. Perceptual importance of analytic phase in speaker
(Grimaldi and Cummins, 2008) and whispered speaker recog- verification
nition (Sarria-Paja et al., 2013). Though amplitude weighting
reduces the spurious behavior of IF, in regions of lower am- Most of the current-day speaker verification systems use
plitudes, it does not reflect true characteristics of IF. Also, in features derived from either magnitude spectrum of the
higher amplitude regions, the IF characteristics get dominated Fourier transform or amplitude envelope of the analytic sig-
by amplitude characteristics. nal, and completely ignore the phase in both the cases. Even
In this paper, we present a systematic study to establish though there were several perceptual studies demonstrating
the importance of analytic phase of speech signals in hu- the importance of phase in speech recognition (Wolfe et al.,
man perception of speaker identity as well as in automatic 2009; Won et al., 2014; Zeng et al., 2005), there were not
speaker verification. We have conducted perceptual studies many attempts to establish the perceptual importance of phase
to illustrate the significance of the analytic phase in human in speaker recognition. In this section, perceptual importance
perception of speaker identity (Section 2). Perceptual studies of the analytic phase in speaker verification is studied through
are conducted on speech stimuli synthesized from original in- subjective experiments. The speech stimuli, for the perceptual
stantaneous amplitude envelope and distorted analytic phase. studies, are created by ignoring the original analytic phase and
When the analytic phase is tampered, human subjects found replacing it with random values as suggested in Paliwal and
it difficult to verify the speaker identity, illustrating its impor- Alsteris (2005).
tance in human perception. Taking the clue from perceptual The analytic signal representation of a continuous time,
studies, we have developed a method for extracting features finite energy real signal s(t) is defined as (Cohen, 1995):
from the analytic phase of the speech signal for automatic
z(t ) = s(t ) + jsh (t ) (1)
speaker verification (Section 3). Features are extracted from
the IF of the speech signal, which is the derivative of the an- where sh (t) is the Hilbert transform of s(t), and is given by
alytic phase. In the proposed method, the IF is computed by
sh (t ) = F −1 {Sh ( j)} (2)
exploiting the properties of the Fourier transform, and does
not involve explicit computation of the analytic phase. Hence, where F −1 denotes inverse Fourier transform. The Sh (j) is
the proposed IF computation is free from the phase wrap- obtained by manipulating the Fourier transform of s(t), S(j)
ping problem, unlike the methods reported in Grimaldi and as (Cohen, 1995):
Cummins (2008) and Yin et al. (2011). Discrete cosine trans- ⎧
⎨+ jS( j), <0
form is applied on deviations in IF computed from NB Sh ( j) = 0, =0 (3)
components of speech, to obtain the IF cosine coefficients ⎩
− jS( j), >0
(IFCCs). A pilot study on the feasibility of using proposed IF
features for speaker verification was reported in our previous The signals of the form in (1), obtained through Hilbert trans-
work (Vijayan et al., 2014). In the current study, we present form, satisfy the Cauchy-Riemann conditions for differentia-
an extensive analysis on nature of information content in the bility, and hence the name “analytic” (Cohen, 1995). The an-
proposed IFCCs, and demonstrate their suitability for both alytic signal z(t) can be expressed in polar form as
text-dependent and text-independent speaker recognition sys- z(t ) = a(t )e jθ (t ) , (4)
tems. The speaker-dependent nature of the proposed IFCCs is
illustrated using Minimal Pair ABX (MP-ABX) measures, and where a(t) and θ (t) represent the amplitude envelope and an-
t-stochastic neighborhood embedding (t-SNE) visualizations alytic phase, respectively, and are given by (Cohen, 1995)
(Section 4). The MP-ABX measures (Schatz et al., 2013), 
which are based on template matching of triphone stimuli, a(t ) = s2 (t ) + sh2 (t ), (5a)
establish the suitability of IFCC features for text-dependent  
−1 sh (t )
speaker recognition. The suitability of proposed features for θ (t ) = t an . (5b)
text-independent speaker recognition, which involves statis- s(t )
tical pattern recognition, is illustrated using t-SNE visual- If s(t) is a NB signal, then the amplitude envelope a(t)
izations (van der Maaten and Hinton, 2008) and validated and analytic phase θ(t) can be interpreted as the AM and FM
using speaker verification experiments. The performance of components of the signal, as illustrated in Fig. 1. As speech
the proposed IFCC features is evaluated on NIST-2010 SRE is a WB signal, it is passed through a bank of NB filters
database, which involves different channel conditions and vo- to obtain multiple NB components. For the subjective exper-
cal effort conditions, and compared with systems based on iments, we have used a 40 channel filter-bank with linearly
FDLP and MFCC features (Section 5). Finally, we explored spaced, triangular shaped filters, each having 50% overlap
two different approaches for combining evidences from dif- with the adjacent filters. The NB components obtained from
ferent features. When the evidences from all the three subsys- such a filter-bank can be added to exactly reconstruct the
58 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

a /f/ /l/ /sh/ /a/ /t/


1

-1
b 1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25

0.5

-0.5

c 1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25

0.8

0.6

0.4

0.2

0
d 1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25


1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25
e
0.5

-0.5

1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25


Time (s)

Fig. 2. AM-FM decomposition: (a) speech signal, (b) NB component, (c) amplitude envelope, (d) analytic phase and (e) analytic phase distorted speech
segment. The speech content is a segment from the utterance ‘rifle shot’.

original signal. A segment of speech signal and its NB com- (Paliwal and Alsteris, 2005). It is observed that the phase
ponent from a filter centered at 500 Hz are shown in Fig. 2(a) distorted speech signals synthesized from different realiza-
and (b), respectively. The amplitude envelope and the analytic tions of the random process sound similar. Hence, the quality
phase, obtained from the analytic signal of the NB component of the phase distorted speech signal does not depend on any
in Fig. 2(b), are shown in Fig. 2(c) and (d), respectively. particular realization of the random process.
Every NB component from the filter-bank output can be Subjective experiments were conducted, on pairs of phase
decomposed into its corresponding amplitude envelope and distorted speech stimuli, to study the perceptual importance
analytic phase, as illustrated in Fig. 2. Notice that the speech of analytic phase. Each subject was asked listen to a pair of
signal can be exactly reconstructed from the amplitude en- speech stimuli, and decide whether both the stimuli belong to
velopes and the analytic phases of all the NB components the same speaker or not. The test pairs were formed from 5
(Quatieri, 2001). In order to study the importance of ana- male and 3 female speakers, and there were no cross-gender
lytic phase, a phase distorted speech signal is synthesized test pairs. The test pairs were grouped into 5 different sets,
by replacing the analytic phase with Independent and Iden- where each set contains 2 test pairs from matched speakers
tically Distributed (IID) random process, with a marginal and 3 test pairs from mismatched speakers. The speech stim-
distribution uniformly distributed in the interval (−π , π ] uli were synthesized from TIMIT database, which contains
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 59

Table 1 also wrapped into the interval (−π , π ], making it impossi-


Human speaker verification rates for different types of speech stimuli. ble to unambiguously determine the exact phase angle at any
Type of stimuli Miss rate (%) False alarm rate (%) given instant of time. Fig. 2(d) shows the analytic phase of
Analytic phase distorted speech 14.48 70.80 the NB component in Fig. 2(b). The phase angles are wrapped
Clean speech 10.42 12.28 between −π and π , and it is difficult to draw any direct infer-
ences about the signal from analytic phase in Fig. 2(d). This
could be a potential reason for lesser prominence received
data collected from native American English speakers in dif- by analytic phase, compared to its amplitude counterpart, in
ferent dialects (Garofolo et al., 1993). All speech signals were mainstream signal processing.
sampled at a frequency of 16 kHz. The IF, which is the time derivative of the unwrapped an-
Thirty normal hearing subjects, aged between 20 and 30, alytic phase, can be used to deal with the phase wrapping
participated in this perceptual evaluation. The subjects were problem. The IF of a NB signal, with unwrapped analytic
not native English speakers. However, all of them can read, phase θ (t), is given by (Cohen, 1995)
write and speak fluent English. The subjects were neither
d
familiar with the speakers, nor received any prior training θ  (t ) = θ (t ). (6)
through repeated listening of the database under considera- dt
tion. The subjects heard the test stimuli pairs, multiple times, Notice that the analytic phase obtained from (5b) cannot be
monaurally through headphones. Each subject was asked to used to compute the IF, as it represents wrapped analytic
listen to pairs of stimuli from one of the five different test sets, phase. The analytic phase obtained from (5b) has to be un-
and decide whether each pair belongs to the same speaker wrapped before taking the derivative. Even though there are
or not. Hence, there were a total of 150 (30 × 5) listen- a variety of phase unwrapping algorithms, which attempt to
ing tests, of which 60 were genuine tests (matched speak- preserve continuity of the phase function, they are usually
ers) and 90 were impostor tests (mismatched speakers). The ad-hoc, complex and unreliable (Karam, 2006). In order to
performance of the human subjects, on speaker verification circumvent the issues associated with the phase unwrapping,
task with analytic phase distorted speech, evaluated in terms we directly compute the IF θ  (t), without explicitly differen-
of miss rate and false alarm rate, is given in Table 1. The tiating θ (t).
performance of the human subjects, on clean speech data, is The IF can be obtained by differentiating the logarithm of
also given for comparison. The error rates on clean speech the analytic signal z(t) in (4), and then equating the imaginary
data are comparable to 13% EER reported on human assisted parts (Murty and Yegnanarayana, 2008), and is given by,
 
speaker recognition evaluation on unfamiliar speakers (van  z (t )
Dijk et al., 2013). The false alarm rate on clean speech data θ (t ) = Im (7)
z(t )
is significantly lower than the false alarm rate on phase dis-
torted speech data. where Im{.} denotes imaginary part of a complex quantity
Fig. 2 (e) shows the analytic phase distorted signal synthe- and z (t) is the time derivative of the analytic signal z(t).
sized from clean speech signal in Fig. 2(a). When the analytic The derivative of the analytic signal can be computed using
phases of NB components are replaced with the realizations differentiation property of the Fourier transform as follows
from IID random process, the quasi periodicity property of (Oppenheim et al., 1999):
the voiced segments is destroyed. Hence, the resulting signal z (t ) = j F −1 {Z ( j )} (8)
sounds like whispered speech. It is known that human sub-
jects find it difficult to verify speaker identity from whispered where Z(j) is the Fourier transform of z(t). Thus, the IF can
speech (Orchard and Yarmey, 1995; Pollack et al., 1954). In be expressed as
 −1
our study on analytic phase distorted speech, the subjects de- F {Z ( j)}
cided that both the stimuli belong to the same speaker in most θ  (t ) = Re (9)
F −1 {Z ( j)}
of the cases. This explains the relatively low miss rate and
where Re{.} denotes real part. The computation of IF of a
high false alarm rate on speaker verification task with analytic
discrete-time NB signal can be implemented as
phase distorted speech. These subjective experiments illustrate

the significance of analytic phase in conveying speaker spe-
 2π FD −1 (k Z[k ])
cific information. θ [n] = Re , (10)
N FD −1 (Z[k])

3. Feature extraction from analytic phase of speech signals where FD −1 denotes inverse discrete Fourier transform
(DFT), N is the length of the NB signal and Z[k] is the DFT
Even though analytic phase is important for human per- of the analytic signal z[n], obtained from the NB signal s[n],
ception of speaker identity, it is not easy to process it for as explained in Marple (1999). As the proposed IF compu-
feature extraction as it suffers from phase wrapping problem tation does not involve computation of analytic phase, it is
(Quatieri, 2001). The phase angle θ (t), computed using the free from the phase wrapping problem. Notice that, the stud-
four quadrant inverse tangent as in (5b), is restricted to the ies reported in Grimaldi and Cummins (2008) and Yin et al.
interval (−π , π ]. The values of phase outside this interval are (2011) compute IF from wrapped analytic phase and energy
60 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

a /f/ /l/ /sh/ /a/ /t/


0.6
0.4
0.2
0
b 1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25

800
Frequency (Hz)

600

400

200

c 1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25


700
Frequency (Hz)

600

500
fi 500Hz
400

300
1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25
d 3000
Frequency (Hz)

2000

1000

0
1.8 1.85 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25
Time (s)

Fig. 3. The properties of IF: (a) amplitude envelope, (b) IF, (c) smoothed IF (fi denotes the center frequency of the filter in Hz) and (d) spectrogram for a
segment from the utterance ‘rifle shot’.

weighted analytic phase, respectively, and hence do not solve where {.}∗ denotes the complex conjugate. The spurious fluc-
the phase wrapping problem. tuations in the IF of a speech signal can be attributed to two
The unwrapped analytic phase of a NB signal s(t) cen- main reasons:
tered around the frequency  can be expressed as θ (t ) =
t + φ(t ), where φ(t) is the deviation of the analytic phase 1. Since IF computation in (11) involves division by
from linear phase component. Hence, the IF of the NB signal squared instantaneous amplitude of the NB component,
can be expressed as θ  (t ) =  + φ  (t ), where φ  (t) is the de- the IF exhibits large fluctuations when the amplitude is
viation of the IF from the center frequency. In this work, we closer to zero. The large fluctuations of IF in the un-
use features extracted from deviation of IF from the center voiced region, from 1.95 s to 2.05 s in Fig. 3(b), are
frequency for speaker verification. Here, the IF deviations are mainly due to low energy of NB component in that
computed directly from analytic signal, whereas the IF devi- region, as shown in Fig. 3(a).
ation measures are indirectly computed using relative phase 2. The fluctuations of IF in the voiced regions, from
shifts and zero crossing rates in Saratxaga et al. (2009) and 1.87 s to 1.93 s and from 2.1 s to 2.19 s in Fig. 3(b),
Thiruvaran et al. (2009), respectively. Fig. 3(b) shows the can be attributed to the impulse-like nature of excita-
IF of a NB component of the speech signal around 500 Hz tion source. During speech production, the impulse re-
shown in Fig. 2(b). The IF in Fig. 3(b) is centered around sponses of VTS initiated at successive glottal closure
500 Hz, and exhibits spurious fluctuations on either side of instants (GCI) are superposed to produce the speech
500 Hz, making it difficult to interpret and analyze the char- signal. The superposition of impulse responses is mani-
acteristics of VTS from it. The reasons for these fluctuations fested in the NB component as phase discontinuity, and
can be explained by rewriting (7) as results in large amplitude peaks at GCI locations in the
IF. The quasi-periodic peaks in the voiced regions of
IF shown in Fig. 3(b), correspond to the GCI locations.
Im{z (t )z∗ (t )} This property of IF was exploited for GCI extraction in
θ  (t ) = (11)
a2 (t ) Murty and Yegnanarayana (2008).
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 61

a
4 trail Philip Steels

Frequency (kHz)
3

0
b 1.4 1.6 1.8 2 2.2 2.4 2.6
4
Frequency (kHz)

0
1.4 1.6 1.8 2 2.2 2.4 2.6
Time (s)

Fig. 4. (a) Pyknogram and (b) spectrogram for a segment from the utterance of ‘Author of the danger trail, Philip Steels, etc.’.

Hence the fluctuations of IF in unvoiced regions are due to illustrating the effectiveness of IF in capturing the formant
lower denominator values, and the fluctuations of IF in voiced transitions.
regions are due to higher numerator values. Since the numera- Since deviation of IF computed using appropriately wide
tor and denominator of IF in (11) contribute to fluctuations in bandwidth filters convey information about formant transi-
different regions, we propose to smooth them separately be- tions, which in turn reflect anatomy of the underlying VTS,
fore computing their ratio. Both the numerator and denomina- we propose to use deviation of IF to extract features for
tor are smoothed using a moving average rectangular window speaker verification. The IF deviations are segmented into
of 25 ms. The IF computed from smoothed numerator and de- overlapping short-time frames of 25 ms duration, shifted by
nominator components is shown in Fig. 3(c). The deviation of 10 ms, and temporal average of IF deviations is computed to
the smoothed IF, from the center frequency of 500 Hz, can be obtain L-dimensional IF coefficients (IFCs) for every frame.
attributed to the time-varying frequency content in the speech In order to represent the information in IFCs with fewer num-
signal. In the region from 2.05 s to 2.2 s, there is a dominant ber of coefficients, we have applied discrete cosine trans-
frequency (first formant of /a/) slightly above the 500 Hz, as form (DCT) on the IFCs. We have chosen DCT over prin-
shown in the spectrogram in Fig. 3(d). The IF rises above cipal component analysis (PCA) for dimensionality reduc-
500 Hz in that region, reflecting the presence of dominant tion of features because, PCA is a data-dependent analysis
frequency above 500 Hz. In a similar way, the IF goes below where the basis functions are learned from the given dataset,
the center frequency, when there exists a dominant frequency whereas DCT holds a constant set of basis irrespective of data
slightly below the center frequency. (Yoshida et al., 2007). Hence PCA may be proven critical in
The deviation of IF from the center frequency reflects on cases of mismatch between training and testing datasets. Thus
the location of the dominant frequency with respect to the we choose the first few coefficients in the DCT transformed
center frequency. Hence the IF grossly represents formant domain of IFCs, referred to as the IF cosine coefficients
transitions, when it is computed using a filter with appro- (IFCCs), along with their first and second order derivatives as
priately wide bandwidth capable of capturing the transitions features for speaker verification. The algorithm for extracting
around the center frequency. In order to analyze the IF of IFCC features from speech signal is given in Algorithm 1.
a WB speech signal s(t) in every band, it is passed through
an L channel filter-bank with linearly spaced NB filters, hav- 4. Nature of information in IF features
ing 3-dB bandwidth B, centered at ωi , i = 1, 2, . . . , L, to ob-
tain multiple NB components si (t). The deviation of IF from Speech signal conveys information about several factors
center frequency, φi (t ), is computed for every NB compo- including textual message, speaker identity, language of com-
nent si (t ), ˜i = 1, 2, . . . L. The deviations of IF obtained us- munication, health and emotional state of the speaker, etc.
ing a linearly spaced filter-bank with Gaussian shaped filters, Hence any feature extracted from speech signal contains in-
(L = 40 and B = 400 Hz), is shown as in Fig. 4(a). This fig- formation about all these factors, to varying degrees. Minimal
ure is referred to as pyknogram (Potamianos and Maragos, pair ABX (MP-ABX) tasks (Schatz et al., 2013) have been
1996), which is a scatter plot denoting the time-frequency proposed to analyze the nature of information captured in
representation of IFs from different filters in the filter-bank. a feature representation. Usually, the effectiveness of a fea-
The pyknogram clearly shows the formant transitions, and ture representation is evaluated by its performance in recog-
are in agreement with those in the spectrogram in Fig. 4(b), nition tasks, which depends on the efficiency of the modeling
62 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

Table 3
Algorithm 1 IFCC feature extraction from speech signals.
Percentage errors in MP-ABX tasks for IFC and IFCC feat (IFCC+ + ).
1: Preemphasize speech signal s[n], with preemphasis coefficient of ‘G’ denotes Gaussian shaped filters, ‘T’ denotes triangular shaped filters and
0.97 ‘R’ denotes rectangular shaped filters.
2: Compute N point DFT of the signal s[n] to obtain S[k], k =
1, 2, . . . ,N MP-ABX IFC 13 IFCC 20 IFCC 20 IFCC 20 IFCC
tasks feat (G) feat (G) feat (T) feat (R)
3: Design an L channel filter-bank with linearly spaced Gaussian
shaped filters in the frequency domain. Let the filter coefficients PaT 30.36 28.79 27.75 27.00 39.07
in the frequency domain be Wi [k], for i = 1, 2,...,L and k = TaP 21.55 22.31 19.89 21.68 30.46
1, 2, . . . ,N
4: for i = 1 to L do
5: Perform NB filtering of s[n] through ith filter: Si [k] ←
S[k ]Wi [k ]
6: Compute analytic signal zi [n] from si [n] as in (Marple, 1999) using dynamic time warping (DTW) with cosine similarity
7: Compute smoothed IF θi [n] as in (10), after smoothing the metric.
numerator and denominator separately with a moving average The TIMIT acoustic phonetic continuous speech corpus
rectangular window of 25 ms (Garofolo et al., 1993) was chosen to study the nature of
8: Compute IF deviations φi [n] ← θi [n] − ωi , where ωi ∈ [0˜π ) IFCC features using MP-ABX tasks. The database contains
is the center frequency of ith NB filter recordings of 6300 phonetically balanced English sentences,
9: end for consisting of 10 sentences from each of the 630 speakers.
10: Segment φi [n], i = 1, 2, . . . ,L into short-time frames of dura- All the sentences were transcribed and segmented at phone
tion as 25 ms, shifted by 10 ms level using 61 phone labels. In this evaluation, the 61 TIMIT
11: Average IF deviations within each frame to obtain L-dimensional
phone labels were collapsed into 39 phones, as proposed in
IFCs
Lee and Hon (1989). In order to evaluate the MP-ABX tasks,
12: Apply DCT on IFCs and retain first few coefficients to obtain
IFCCs all possible triphones and their corresponding durations were
13: Append IFCCs with their first and second order derivatives collected from the database. A total of 6,321,458 triphone
triplets were used to evaluate the PaT and TaP tasks. All the
utterances were sampled at 16 kHz for this analysis.
Table 2 The IFCs were extracted from NB components of speech
Explanation of MP-ABX tasks on triphone triplets. signal using a 40 channel filter-bank with linearly spaced
Task A B X Success Gaussian shaped filters, each having a bandwidth of 400 Hz.
The resulting 40-dimensional IFC features were used to eval-
PaT /beg/ SP1 /bag/ SP1 /bag/ SP2 B
TaP /bag/ SP2 /bag/ SP1 /beg/ SP1 B uate performance on MP-ABX task. The performance of IFC
features on PaT and TaP tasks, evaluated in terms of percent-
age error, is given in Table 3. The error on TaP task is lower
than the error on PaT task, indicating the suitability of IFCs
strategy as well. Hence, the final performance cannot be solely for speaker recognition. The performance on MP-ABX tasks
attributed to the feature representation. The MP-ABX tasks do was evaluated using IFCC features as well. In the case of
not require any system modeling and render an easy way to IFCC features, the number of coefficients retained after per-
evaluate the effectiveness of a particular feature representation forming DCT is an important parameter. A smaller number of
for a given task, especially in the context of speech/speaker IFCCs may not be enough to represent the information cap-
recognition (Schatz et al., 2013). These tasks are evaluated tured in IFCs. The percentage errors obtained, by retaining
using three different triphone stimuli – A, B and X, in which 13 and 20 IFCCs, are given in Table 3. The performance of
A and B differ from each other by minimal contrast, and X the system improved when 20 IFCCs were retained, justifying
matches better with either A or B (Schatz et al., 2013). The the dimensionality reduction step using DCT. For rest of the
contrast between A and B could be due to difference in a studies in this work, 20-dimensional IFCCs along with their
phoneme or difference in the speaker. Depending on the types first and second order derivatives (20 IFCCs+20 + 20)
of contrast, there are three different MP-ABX tasks, namely, are used as features.
phoneme across context (PaC), phoneme across talker (PaT) The effectiveness of IFCC features in capturing speaker-
and Talker across phoneme (TaP). The PaT and PaC tasks are specific information, critically depends on accuracy of the
aimed at checking the effectiveness of a feature in discrim- IF computed from speech signal. The computation of IF is
inating the phonemes, irrespective of variability in speaker mainly affected by the parameters of the filter-bank, namely,
and context, respectively. On the other hand, TaP task brings number of channels, shape of filters, center frequencies and
forth the ability of a feature in discriminating the speakers. bandwidths of individual filters. In our earlier studies, it was
In this study, we have chosen the PaT and TaP tasks to study found that linearly spaced equi-bandwidth filters are more
the nature of IFCCs. For each of the MP-ABX tasks under suitable for IF computation than mel-spaced varying band-
consideration, the triphone triplets and the ground truth were width filters (Reddy et al., 2015; Vijayan et al., 2014). In the
formed as illustrated in Table 2 (Schatz et al., 2013). Distance case of mel filter-bank, the bandwidth increases at high fre-
between features extracted from pairs of stimuli is calculated quencies, making the computation of IF less reliable. Hence,
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 63

Table 4 Table 5
Percentage errors in TaP task for IFCC features computed using different Percentage errors in MP-ABX tasks for different features.
number of channels L, and bandwidths of the filters B.
MP-ABX tasks IFCC FDLP MFCC
B(Hz)/L 50 200 400 600 800
PaT 27.8 26.8 23.6
20 39.33 22.57 20.04 21.79 24.54 TaP 19.9 21.2 22.1
40 37.41 21.21 19.89 21.89 24.59
60 30.57 21.15 19.91 21.85 24.6 4.3. Comparison of IFCC with FDLP and MFCC

The performance of IFCC features on MP-ABX tasks is


we have used linearly spaced equi-bandwidth filters for all compared with the FDLP and MFCC features. The FDLP
the subsequent studies. features are extracted from amplitude envelope of NB com-
ponents of speech signal, while the MFCC features are ex-
4.1. Effect of shape of filters in the filter-bank tracted from the magnitude spectrum of the speech signal.
In either case 20 cepstral coefficients along with their first
The performance of IFCC features extracted using rect- and second order derivatives are used as features. The per-
angular, triangular and Gaussian shaped filters is given in formance of the features on MP-ABX tasks is given in
Table 3. The features extracted using Gaussian shaped fil- Table 5. The performance of MFCC features is best on the
ters performed best on the TaP task. The lowest performance PaT task, indicating its ability to match phonemes irrespective
of features extracted using rectangular filters can be attributed of speaker variability. On the other hand, the IFCC features
to their sudden discontinuities, leading to poor IF estimates. delivered superior performance in the TaP task, indicating
Gaussian shaped filters, on the other hand, have smoother its efficiency in capturing speaker specific information. The
edges making computation of IF more reliable. In this work, FDLP features demonstrate a speaker-specific nature rather
we use linearly spaced Gaussian shaped filters for rest of the than speech-specific nature, by performing better in TaP task.
studies. Notice that Gaussian shaped NB filter in frequency It can be seen that PaT and TaP are contradicting tasks and
domain is equivalent to cosine modulated Gaussian window the same features cannot be the best on both of them. The
in the time domain, which is essentially a Gabor filter (Gabor, MFCC features delivered decent performance on both tasks,
1946). Gabor filter-bank, along with ESA algorithm, has been justifying its selection as the default choice in most of the
used for multiband demodulation analysis of speech signals in speech processing applications.
Maragos et al. (1993a) and Potamianos and Maragos (1996). The MP-ABX tasks, which are based on template match-
ing, are appropriate for studying the suitability of features for
text-dependent speaker recognition. On the other hand, the
4.2. Effect of bandwidth and number of channels suitability of features for text-independent speaker recogni-
tion, involving statistical pattern recognition, can be studied
The number of channels L, and the filter bandwidths B by analyzing their distributions.
should be chosen to cover the complete spectrum of the
speech signal. The performance of IFCC features extracted 4.4. t-SNE visualization of features
using different number of channels L, and bandwidths B is
given in Table 4. When the number of channels is small, the The t-SNE is a technique for dimensionality reduction, that
filter bandwidths have to be large to cover the entire spec- is particularly well suited for embedding high dimensional
trum. On the other hand, when the number of channels is data into a 2-D/ 3-D space, which can be visualized in a
large, the filter bandwidths can be decreased to reduce the scatter plot (van der Maaten and Hinton, 2008). This method
overlap among adjacent filters. If the bandwidth of a filter is preserves the distance between the corresponding data points,
very large, its output ceases to be NB, making the IF com- such that similar points in high dimensional space are mapped
puted from it inaccurate. On the other hand, if the bandwidth to nearby points in 2D space and dissimilar points are mapped
of a filter is too small, it cannot capture the formant transi- to distant points. In this study, we have used t-SNE visual-
tions in the vicinity of its center frequency. The percentage izations to illustrate the speaker discrimination capabilities of
error on TaP task, given in Table 4, is considerably less for different features. For this study, we have recorded 3 repe-
the bandwidths 200, 400 and 600 Hz, in comparison with titions of the words, heed, hood and hod, from 2 different
50 Hz (very low bandwidth) and 800 Hz (high bandwidth). speakers. The 20 dimensional IFCC features extracted from
The IFCC features extracted using 400 Hz bandwidth filters each word is visualized as a 2D scatter plot in Fig. 5, where
performed best, irrespective of the number of filters. This ob- the points marked as ‘boxes’ and ‘asterisks’ belong to dif-
servation is consistent with the selection of 400 Hz bandwidth ferent speakers. The t-SNE visualizations for 20-dimensional
filters for formant tracking (Potamianos and Maragos, 1996). FDLP and MFCC features are also shown in Fig. 5, for com-
Based on the above analyses, we have used a 40 channel parison. The IFCC features belonging to different speakers are
filter-bank with linearly spaced Gaussian filters having center well separable, even though they represent the same linguistic
frequencies at ωi , i = 1, 2, . . . , 40 and a 3 dB bandwidth of content, illustrating their speaker discrimination ability. Sim-
400 Hz, for IFCC extraction. ilar observations can be made on FDLP features as well. On
64 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

IFCC : /a/ from hod IFCC : /i/ from heed IFCC : /u/ from hood
20 20 20
Spkr1 Spkr 1 Spkr1
15 Spkr2 15 Spkr2 15 Spkr2

10 10 10

5 5 5

0 0 0

-5 -5 -5

-10 -10 -10

-15 -15 -15

-20 -20 -20


-20 -10 0 10 20 -10 0 10 20 -20 -10 0 10

FDLP : /a/ from hod FDLP : /i/ from heed FDLP : /u/ from hood
20 15 20
Spk Spkr1 Spkr1
15 Spkr2 Spkr2 15 Spkr2
10
10 10
5
5 5

0 0 0

-5 -5
-5
-10 -10
-10
-15 -15

-20 -15 -20


-10 0 10 20 -10 0 10 -10 0 10 20
MFCC : /a/ from hod MFCC : /i/ from heed MFCC : /u/ from hood
15 15 15
Spkr1 Spkr1 Spkr1
Spkr2 Spkr2 Spkr2
10 10 10

5 5 5

0 0 0

-5 -5 -5

-10 -10 -10

-15 -15 -15


-10 -5 0 5 10 -15 -10 -5 0 5 10 15 -15 -10 -5 0 5 10 15

Fig. 5. t-SNE visualization of different features.

the other hand, the overlap of MFCC features, belonging to from interview scenario were made using room microphone
different speakers, can be attributed to the similarity in their channel, and each recording has a duration of either 3 min
linguistic content. or 8 min. Another important feature of NIST 2010 database
is that several of these recordings were made at particularly
5. Speaker verification evaluations using NIST 2010 high or particularly low vocal effort. In order to study the
database robustness of a speaker recognition system against the differ-
ences in speaking scenario, transmission channel and vocal
The performance of the analytic phase based IFCC features effort, different evaluation conditions were defined in NIST
was evaluated on the core task of NIST 2010 SRE database 2010 SRE task, as given in Table 6. There are a total of
(NIST, 2010), and compared with the performances of mag- 610,748 speaker verification trials in the core task spanning
nitude based MFCC and FDLP features. The NIST 2010 across 9 evaluation conditions. The details of trials in each
database consists of speech data recorded from conversational evaluation condition is also given in Table 6 (NIST, 2010). All
scenario and interview scenario. In the case of conversational the speech data were sampled at 8 kHz and quantized with 8
scenario, the data were recorded from telephone (wired and bit precision using μ law compression. All the speech signals
wireless) channel and room microphone channel, and the du- were band-limited to the range fL = 125 Hz to fH = 3800 Hz
ration of each recording is around 5 min. All the recordings for feature extraction.
K. Vijayan et al. / Speech Communication 81 (2016) 54–71 65

Table 6
Description of evaluation conditions (Eval. cond.) in core tasks of NIST 2010 SRE. ‘Int’ denotes interview, ‘Conv’ denotes
conversational speech, ‘Tele’ denotes telephone channel and ‘Mic’ denotes room microphone channel. ‘lv’ denotes speech
with low vocal effort and ‘hv’ denotes the same with high vocal effort.

Eval. cond. number Explanation No: genuine trials No: impostor trials No: total trails
1 Int vs. Int (same Mic) 2166 61,393 63,559
2 Int vs. Int (different Mic) 7582 214,802 222,384
3 Int vs. Conv-Tele 1640 56,844 58,484
4 Int vs. Conv-Mic 2403 85,426 87,829
5 Conv-Tele vs. Conv-Tele 708 62,398 63,106
6 Conv-Tele vs. Conv-Tele-hv 361 28,311 28,672
7 Conv-Mic vs. Conv-Mic-hv 366 28,626 28,992
8 Conv-Tele vs. Conv-Tele-lv 298 28,306 28,604
9 Conv-Mic vs. Conv-Mic-lv 298 28,293 28,591

The IFCC features were extracted using a 37 channel filter- the sequence of features extracted from reference and test
bank, with linearly spaced Gaussian shaped filters, having a utterances. In the case of text-independent speaker verifica-
3 dB bandwidth of 400 Hz. As the frequency range under tion, the features from the reference and test utterances cannot
consideration for speaker verification is 125 Hz to 3800 Hz, be compared directly, as they are not temporally aligned. In
3 filters were dropped from the choice of 40 filters dis- such situations, the underlying probability density functions
cussed in Section 4.2. The separation between adjacent fil- (PDF) of the reference and test utterances are estimated and
ters is ( fH − fL )/36=102.08 Hz, and center frequencies of compared to obtain a similarity score. The unknown PDF
the filters is given by fi = fL + 102.08(i − 1), i = 1, 2, . . . 37. is usually approximated as a convex combination of Gaus-
Smoothed IF deviations were computed for each of the NB sian density functions, popularly known as Gaussian mixture
components as explained in Algorithm 1. The IF deviations model (GMM). The parameters of the GMM λ, are estimated
were averaged over short-time windows of 25 ms duration, from features using expectation maximization (EM) algorithm
shifted by 10 ms, to obtain 37-dimensional IFC features. The (Duda et al., 2000). Two GMMs λR and λT can be trained on
IFCCs were obtained by applying DCT on IFCs and retain- the features extracted from the reference and test utterances,
ing the first 20 coefficients in the transformed domain. The respectively, and the similarity between the models λR and λT
20-dimensional IFCCs, together with their first and second can be used to make a decision (Reynolds and Rose, 1995).
order derivatives, were used to build the speaker verification Since GMM training involves estimation of several parame-
system. ters, it requires large amounts of training data, which is not
The FDLP features were extracted, using a 96 channel feasible in practice.
filter-bank with linearly spaced rectangular filters, as ex- In order to cater the large training data requirement, it
plained in Ganapathy et al. (2014, 2011). 30th order LP anal- was proposed to adapt a universal background model (UBM)
ysis was performed, in frequency domain, on subband seg- to the reference and test utterances to obtain λR and λT
ments of duration 1 s to obtain subband FDLP envelopes. (Reynolds et al., 2000), respectively. UBM is essentially a
These FDLP envelopes are gain normalized and warped to 37 GMM trained using a large database, collected from many
mel bands, which are temporally integrated over segments of speakers, to represent the speaker-independent distribution of
25 ms duration, shifted by 10 ms, to obtain 37-dimensional the features in the acoustic space. We have trained a 1024
power spectral estimates. A time domain linear prediction, mixture UBM, with diagonal covariance matrices, based on
of order 19, is performed to model these power spectral the assumption that the dimensions of feature vectors are in-
estimates and 20 cepstral coefficients are computed. These dependent. The dataset for UBM training is constructed from
cepstral coefficients, along with their first and second order NIST: 2003, 2004, 2005, 2006 and 2008 and switchboard cel-
derivatives, constitute the FDLP features. lular part 1 and part 2. The resulting dataset includes 5855
The MFCCs are extracted by performing STFT analysis recordings from 1641 male speakers, and 7875 recordings
over hamming windowed segments of speech with duration from 2132 female speakers.
25 ms, shifted by 10 ms. The resultant power spectrum is The parameters of the UBM (λU ) are adapted to the fea-
warped using a 24 channel mel filter-bank and 20 cepstral tures from reference/test utterance, by employing maximum a
coefficients are computed, which are appended with their posteriori (MAP) adaptation (Reynolds et al., 2000). Since the
first and second order derivatives to obtain MFCC features. amount of data available for reference/test is limited, only the
Speaker verification systems are built using IFCC, FDLP and mean vectors are adapted, retaining the weights and covari-
MFCC features. ance matrices of the UBM. The mean vectors of all the com-
In the task of speaker verification, we need to compare two ponent densities of GMM are concatenated to form a 61440-
speech utterances and decide whether they belong to same dimensional (1024 × 60) GMM supervector (Campbell et al.,
speaker or not. In the case of text-dependent speaker verifi- 2006). The GMM supervector contains speaker-specific infor-
cation, this can be achieved by computing similarity between mation along with other information like transmission channel
66 K. Vijayan et al. / Speech Communication 81 (2016) 54–71

a b c
30 30 30
Spkr1 Spkr 1 Spkr 1
Spkr 2 Spkr 2 Spkr 2
20 Spkr 3 20 Spkr 3 20 Spkr 3
Spkr 4 Spkr 4 Spkr 4

10 10 10

0 0 0

-10 -10 -10

-20 -20 -20

-30 -30 -30


-30 -20 -10 0 10 20 30 -30 -20 -10 0 10 20 30 -30 -20 -10 0 10 20 30

Fig. 6. t-SNE visualization of i-vectors obtained from speaker verification system using (a) IFCC, (b) FDLP and (c) MFCC features.

The GMM supervector contains speaker-specific information along with other information like the transmission channel and acoustic environment. In order to reduce the dimensionality of the supervector, it is typically projected onto a low dimensional subspace, termed the total variability subspace, as follows:

    s = sU + Tw                                                    (12)

where s is the GMM supervector of an utterance, sU is the speaker and channel independent supervector of the UBM, T is a low-rank total variability matrix (T-matrix) and w is a weight vector with a standard normal prior, referred to as the identity vector or i-vector (Dehak et al., 2011). The T-matrix is trained using the sufficient statistics of the adapted GMM models, obtained from a large amount of background data, by employing the EM algorithm as described in Dehak et al. (2011). We have trained a 400-dimensional total variability subspace, using all available data from the same databases considered for UBM training. The dataset under consideration involves 23,303 recordings from 1641 male speakers and 31,530 recordings from 2132 female speakers. The resultant T-matrix is used to obtain the low dimensional representation of speaker-specific information in the form of 400-dimensional i-vectors (Dehak et al., 2011). The sufficient statistics of the reference and test models λR and λT are used to obtain their corresponding i-vectors wR and wT.
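Concretely, the i-vector is the posterior mean of w given the utterance's zeroth- and first-order statistics (Dehak et al., 2011). A numpy sketch for a diagonal-covariance UBM is given below, with shapes matching the 1024-mixture, 60-dimensional setup above; the variable names are illustrative.

    import numpy as np

    def extract_ivector(T_mat, var, n, F, mu):
        # T_mat: (61440, 400) T-matrix; var: (61440,) stacked UBM variances;
        # n: (1024,) soft counts; F: (1024, 60) first-order stats; mu: UBM means.
        f = (F - n[:, None] * mu).reshape(-1)       # centred statistics
        N = np.repeat(n, mu.shape[1])               # counts expanded per dimension
        TtSi = T_mat.T / var                        # T' Sigma^{-1}
        L = np.eye(T_mat.shape[1]) + (TtSi * N) @ T_mat   # posterior precision
        return np.linalg.solve(L, TtSi @ f)         # 400-dim i-vector w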
The effectiveness of i-vectors in discriminating speakers is demonstrated using t-SNE visualizations in Fig. 6. The 400-dimensional i-vectors extracted from multiple utterances of 4 speakers, using IFCC features, are projected onto a 2D plane. The projected i-vectors of different speakers are clearly separable, and can be classified using simple linear discriminant functions (Duda et al., 2000). Similar behavior can be observed from the projected i-vectors extracted from FDLP and MFCC features, as shown in Fig. 6(b) and (c), respectively. Even though the raw feature vectors of different speakers overlap, as shown in Fig. 5, the i-vectors derived from these features provide clear separation between the speakers.
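Projections like those in Fig. 6 can be reproduced in outline with scikit-learn's implementation of t-SNE (van der Maaten and Hinton, 2008); the perplexity value below is an assumption, and ivectors/labels are placeholders for the extracted i-vectors and speaker identities.

    import numpy as np
    from sklearn.manifold import TSNE

    # ivectors: (num_utterances, 400) array; labels: speaker id per row
    proj = TSNE(n_components=2, perplexity=30,
                init="pca", random_state=0).fit_transform(ivectors)
    # scatter-plot proj[:, 0] against proj[:, 1], coloured by labels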
The i-vectors extracted using the T-matrix contain both speaker and channel variabilities. In order to compensate for the channel effects, probabilistic linear discriminant analysis (PLDA) was employed on the i-vectors (Prince and Elder, 2007). The PLDA model was trained using the i-vectors derived from multiple utterances recorded from speakers, by using the EM algorithm with a minimum divergence re-estimation step. A 200-dimensional eigen voice subspace and a 100-dimensional eigen channel subspace constitute the PLDA model, with a noise term characterized by a diagonal covariance matrix. The dataset used for training the PLDA model includes all databases used to train the T-matrix. We have chosen only those speakers contributing at least 6 recordings to form the dataset. There are 803 male speakers and 1184 female speakers, contributing 23,201 and 31,164 recordings, respectively, for PLDA training. The likelihood ratio based on hypothesis testing between the i-vectors of the reference and test utterances, wR and wT, was used as the confidence score. A higher score indicates that the reference and test utterances belong to the same speaker; otherwise, they belong to different speakers.
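PLDA trial scoring admits several equivalent formulations. The sketch below uses the two-covariance view (Prince and Elder, 2007), in which B is the across-speaker covariance induced by the eigen voice subspace and W collects the eigen channel and noise terms; it assumes centred i-vectors and is an illustrative simplification, not the exact routine used by the toolkit.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def plda_llr(w_ref, w_test, B, W):
        # Log-likelihood ratio: same-speaker vs different-speaker hypotheses.
        T = B + W                                   # total covariance
        x = np.concatenate([w_ref, w_test])
        zero = np.zeros_like(B)
        same = np.block([[T, B], [B, T]])           # speaker factor shared
        diff = np.block([[T, zero], [zero, T]])     # speaker factors independent
        m = np.zeros(len(x))
        return mvn.logpdf(x, m, same) - mvn.logpdf(x, m, diff)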
The speaker verification system, starting from feature normalization to PLDA scoring, was built using the open source Alize 3.0 toolkit (Larcher et al., 2013). The performance of the speaker verification system was evaluated in terms of equal error rate (EER) and minimum detection cost function (minDCF) (NIST, 2010). Table 7 gives the EERs and minDCF values obtained from the speaker verification systems built on IFCC, FDLP and MFCC features, over all trials from 9 different evaluation conditions.

The speaker verification systems built on IFCC, FDLP and MFCC features provide competitive performances, with average EERs of 2.3%, 2.2% and 2.4%, respectively. The performances of all the three features are better on the microphone channel conditions (1, 2, 4, 7 and 9) than on their telephone channel counterparts. This could be due to the band-limiting and distortion introduced by the telephone channel.
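EER and minDCF follow the NIST 2010 evaluation plan; as a simplified stand-in for the official scoring tools, the EER can be computed from target and non-target score lists as follows.

    import numpy as np

    def equal_error_rate(tar, non):
        # Sweep thresholds; EER is where miss and false-alarm rates cross.
        thr = np.sort(np.concatenate([tar, non]))
        miss = np.array([np.mean(tar < t) for t in thr])
        fa = np.array([np.mean(non >= t) for t in thr])
        i = np.argmin(np.abs(miss - fa))
        return 0.5 * (miss[i] + fa[i])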

Table 7
Performance obtained for NIST 2010 speaker verification using different features.

Eval. cond.        EER (%)                  minDCF (×1e3)
number         IFCC  FDLP  MFCC        IFCC   FDLP   MFCC
1               1.4   1.2   1.3        0.195  0.223  0.227
2               2.1   1.6   1.8        0.242  0.297  0.308
3               2.3   2.2   2.0        0.304  0.285  0.299
4               1.6   1.4   1.8        0.234  0.194  0.213
5               3.3   2.7   2.0        0.378  0.308  0.274
6               4.3   5.4   5.0        0.564  0.632  0.470
7               3.1   3.4   5.0        0.414  0.437  0.434
8               1.7   1.3   1.3        0.158  0.223  0.232
9               0.5   0.7   1.7        0.185  0.126  0.465
Average         2.3   2.2   2.4        0.297  0.303  0.325

The FDLP features are more robust to microphone mismatch compared to the IFCC and MFCC features. The FDLP features suffered a relative degradation of 33% from condition 1 (matched microphone) to condition 2 (mismatched microphone), while the IFCCs and MFCCs suffered 50% and 38% relative degradations, respectively. The IFCC features performed best on three of the four conditions involving vocal effort mismatch (conditions 6–9). On an average, the IFCC features provided 26% and 11% relative improvements in EER compared to the MFCC and FDLP features, respectively, illustrating the robustness of analytic phase based features to vocal effort mismatch. The amplitude of the speech signal is affected by differences in vocal effort, which leads to variability in magnitude based features like MFCC and FDLP. On the other hand, IFCCs, being derived from the analytic phase of the speech signal, are potentially more robust to amplitude variability. Hence, the IFCC features are more robust to intra-speaker vocal effort differences.

Table 8 provides the gender-wise performance of all the three features in terms of the EERs and minDCF values. On an average, the MFCC features performed best for male speakers (in terms of minDCF values), while the IFCC features performed best for the female speakers (in terms of EER and minDCF). In particular, the IFCC features provided considerable improvements in EER for female speakers in evaluation conditions 6, 7 and 9, where vocal effort mismatch is involved. The performance of all the three features is better on male speakers than their respective performances on female speakers. It is known that the performance of speech and speaker recognition systems on female speakers is, in general, inferior to their performance on male speakers. This is because of the higher fundamental frequency of female speakers, which leads to a more sparsely sampled magnitude spectrum. Since the IFCC features are extracted from the analytic phase, the effect of the fundamental frequency is lesser.

Table 8
Performance obtained for NIST 2010 speaker verification using different features for male and female speakers.

Eval.          Male speakers                           Female speakers
cond.      EER (%)           minDCF (×1e3)         EER (%)           minDCF (×1e3)
number  IFCC FDLP MFCC    IFCC  FDLP  MFCC      IFCC FDLP MFCC    IFCC  FDLP  MFCC
1        0.7  0.5  0.9    0.123 0.127 0.093      2.1  1.8  1.8    0.212 0.270 0.233
2        1.0  0.7  1.1    0.169 0.210 0.177      3.0  2.3  2.4    0.301 0.357 0.349
3        1.1  1.2  1.4    0.246 0.210 0.238      3.4  3.1  2.7    0.322 0.372 0.341
4        1.0  0.9  1.6    0.202 0.169 0.149      1.8  1.6  2.0    0.268 0.206 0.285
5        2.6  2.3  1.4    0.272 0.240 0.128      3.9  3.1  2.6    0.401 0.347 0.398
6        2.8  3.4  2.5    0.466 0.427 0.230      4.9  7.7  6.0    0.606 0.716 0.622
7        1.4  1.4  2.2    0.313 0.263 0.156      4.4  5.1  7.4    0.471 0.605 0.599
8        1.7  0.8  0.8    0.067 0.209 0.017      1.9  1.7  1.7    0.196 0.229 0.304
9        0.4  0.9  1.0    0.103 0.043 0.273      0.2  0.7  1.7    0.219 0.168 0.560
Average  1.41 1.34 1.43   0.218 0.211 0.162      2.84 3.01 3.14   0.333 0.363 0.409

As the magnitude based features (MFCC and FDLP) and the phase based features (IFCC) are derived from different components of the speech signal, they may carry complementary speaker-specific information. Even though the MFCC and FDLP features are both extracted from the magnitude spectrum of the speech signal, the signal processing steps involved in their extraction are different. While the speech signal is segmented in the time domain for MFCC extraction, it is segmented in the frequency domain (NB filtering) for FDLP and IFCC extraction. Hence, all the three features may contribute complementary information, which can collectively help in improving the performance of the speaker verification system.

5.1. Fusion of evidences from different features

The evidences from the IFCC, FDLP and MFCC features can be combined at different levels to improve the performance of the speaker verification system. At the lowest level, a combined i-vector based speaker verification system can be built by concatenating the different feature representations. As the individual dimensions of the concatenated features need not be uncorrelated, their distribution can be modeled either using a GMM with full covariance matrices, or using a GMM with a huge number of diagonal covariance mixtures (Reynolds et al., 2000). Both these approaches are computationally intensive and, hence, not attempted in this work. At the highest level, the confidence scores obtained from the different systems can be combined.
However, in such an approach, the weighting given to the individual systems has to be fixed a priori. The weights depend on the performances of the individual systems, as well as on the dynamic ranges of their confidence scores. In order to overcome these issues, we attempt to combine the systems at an intermediate level, using the i-vectors.

5.1.1. Score fusion

In this experiment, the confidence scores obtained from two different systems are combined using a convex combination. Since the confidence scores of all the systems are in the same dynamic range and their performances are similar, the scores are combined with equal weights. That is, the average of the individual scores is used as the fused score. The performances of all the possible fusion combinations, evaluated in terms of EER and minDCF, are given in Table 9. The EERs of the fused systems are consistently lower than the EERs of the individual systems, demonstrating the complementary nature of the information captured in the different feature representations. The improvement in performance is better when magnitude based features are combined with phase based features (FDLP+IFCC, MFCC+IFCC) than when two magnitude based features are combined (MFCC+FDLP), in an average sense. When the evidences from the MFCC and IFCC features are combined, the EER improved by 26% in comparison with the system based on MFCC features alone. Among all the possible feature combinations, the system based on MFCC and IFCC features performed best, with an EER of 1.76%. The performance further improved with a convex combination of scores from all the three features (IFCC+FDLP+MFCC).

Table 9
Performance obtained for NIST 2010 speaker verification with fusion of scores from different systems. ('Be. In. Sy. Scr.' denotes the best score obtained from any of the individual systems.)

Eval.             EER (%)                                      minDCF (×1e3)
cond.    FDLP+  MFCC+  MFCC+  MFCC+IFCC+  Be. In.     FDLP+  MFCC+  MFCC+  MFCC+IFCC+  Be. In.
number   IFCC   IFCC   FDLP   FDLP        Sy. Scr.    IFCC   IFCC   FDLP   FDLP        Sy. Scr.
1        1.2    1.2    1.1    1.1         1.2         0.187  0.190  0.211  0.193       0.195
2        1.5    1.5    1.3    1.3         1.6         0.231  0.239  0.250  0.232       0.242
3        1.9    1.6    1.9    1.6         2.0         0.251  0.278  0.248  0.258       0.285
4        1.1    1.1    1.3    1.1         1.4         0.172  0.185  0.187  0.173       0.194
5        2.3    2.3    2.1    1.9         2.0         0.292  0.219  0.234  0.193       0.274
6        3.9    3.6    3.9    3.6         4.3         0.511  0.373  0.434  0.418       0.470
7        2.9    2.8    3.6    2.8         3.1         0.320  0.278  0.323  0.275       0.414
8        1.1    1.0    1.0    0.8         1.3         0.133  0.064  0.138  0.077       0.158
9        0.4    0.7    1.0    0.7         0.5         0.079  0.209  0.224  0.126       0.126
Average  1.81   1.76   1.91   1.65        –           0.242  0.226  0.250  0.216       –
In-set fusion –
average  1.77   1.67   1.88   1.68        –           0.247  0.223  0.252  0.230       –
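The equal-weight convex combination amounts to averaging the per-trial scores of the individual systems; a minimal sketch:

    import numpy as np

    def fuse_equal(score_lists):
        # score_lists: one array of trial scores per system, aligned by trial
        return np.mean(np.stack(score_lists), axis=0)

    # fused = fuse_equal([scores_mfcc, scores_ifcc])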
In scores fusion, it is crucial to optimize the weights of the linear combination, so as to normalize the deviations in the distributions of the scores. In order to check the best performance that can be achieved with the best possible weights, we have performed an in-set fusion using the Bosaris toolkit (Brümmer and de Villiers, 2011). In the in-set fusion, the weights involved in the convex combination are learned from the same set of scores using logistic regression. The average performance of the in-set fusion of different features is given in Table 9. The performance of the in-set fusion is marginally better than the fusion with equal weights.
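As a scikit-learn stand-in for the Bosaris routine (illustrative; not the toolkit's actual API), the combination weights can be learned by logistic regression on the stacked scores:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def learn_fusion_weights(scores_a, scores_b, is_target):
        # is_target: 1 for same-speaker trials, 0 otherwise
        S = np.column_stack([scores_a, scores_b])
        coef = LogisticRegression().fit(S, is_target).coef_.ravel()
        return coef / coef.sum()   # normalise (assuming positive coefficients)

    # Note: learning and scoring on the same trials over-fits
    # (Hautamaki et al., 2013), which motivates the i-vector fusion below.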
In in-set fusion, we have used the same set of scores to obtain the optimal weights for scores fusion. In reality, however, we need to learn the optimal weights from some other independent dataset. Optimizing the weights using a limited set of scores usually results in over-fitting, leading to unreliably fused scores (Hautamaki et al., 2013). In order to overcome the issues associated with scores fusion, we have attempted the fusion of speaker verification systems at the i-vector level (McLaren et al., 2013).

5.1.2. i-vector fusion

In this approach, a PLDA model was trained on hybrid i-vectors obtained by concatenating the i-vectors extracted from different features (McLaren et al., 2013). For example, in order to combine the evidences from the IFCC and FDLP features, the 400-dimensional i-vectors extracted from the individual features were concatenated to form an 800-dimensional hybrid i-vector. Then a PLDA model was trained on the hybrid i-vectors, with a 400-dimensional eigen voice subspace, a 200-dimensional eigen channel subspace and a noise term with a diagonal covariance matrix. At the time of testing, the confidence scores are evaluated on the hybrid i-vectors extracted from the reference and test utterances. A similar approach is followed to combine the evidences from the other pairs of speaker verification systems, i.e., FDLP+IFCC, MFCC+IFCC and MFCC+FDLP. The performances of the hybrid i-vector systems, evaluated in terms of EER and minDCF, are shown in Table 10. The hybrid i-vector systems delivered consistently better performance than the individual systems.
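Hybrid i-vectors are plain concatenations; a minimal sketch (PLDA training on the hybrid vectors is then carried out exactly as before):

    import numpy as np

    def hybrid_ivector(w_ifcc, w_fdlp):
        # Two 400-dim i-vectors -> one 800-dim hybrid i-vector
        return np.concatenate([w_ifcc, w_fdlp])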

Table 10
Performance obtained for NIST 2010 speaker verification with fusion of i-vectors from different systems. ('Be. In. Sy. Scr.' denotes the best score obtained from any of the individual systems.)

Eval.             EER (%)                                      minDCF (×1e3)
cond.    FDLP+  MFCC+  MFCC+  MFCC+IFCC+  Be. In.     FDLP+  MFCC+  MFCC+  MFCC+IFCC+  Be. In.
number   IFCC   IFCC   FDLP   FDLP        Sy. Scr.    IFCC   IFCC   FDLP   FDLP        Sy. Scr.
1        0.8    0.7    0.7    0.5         1.2         0.122  0.145  0.142  0.089       0.195
2        1.0    1.0    0.8    0.6         1.6         0.152  0.151  0.183  0.111       0.242
3        1.6    1.4    1.4    1.2         2.0         0.197  0.173  0.131  0.080       0.285
4        0.8    0.8    1.1    0.5         1.4         0.122  0.112  0.144  0.071       0.194
5        2.4    2.3    1.9    1.7         2.0         0.247  0.228  0.195  0.155       0.274
6        2.8    3.6    3.6    2.6         4.3         0.503  0.428  0.388  0.338       0.470
7        2.0    2.0    2.6    1.4         3.1         0.275  0.264  0.279  0.176       0.414
8        1.1    1.0    1.0    0.8         1.3         0.092  0.057  0.097  0.040       0.158
9        0.3    0.5    0.8    0.4         0.5         0.082  0.238  0.251  0.132       0.126
Average  1.42   1.47   1.54   1.08        –           0.199  0.200  0.201  0.132       –

Fig. 7. t-SNE visualization of i-vectors for 4 different speakers: (a) derived from IFCC system, (b) derived from FDLP system and (c) hybrid i-vectors. [Each panel plots the 2-D projections for the four speakers, Spkr 1–4.]

The effectiveness of hybrid i-vectors in enhancing the discrimination between speakers is illustrated in Fig. 7. For this study, we have considered multiple utterances from 4 speakers which were originally misdetected by both the FDLP and IFCC based systems. Fig. 7(a) and (b) show the t-SNE visualizations of the i-vectors derived from the IFCC and FDLP features, respectively. In the case of the IFCC features, the utterances of speaker-2 are associated with speaker-4. On the other hand, in the case of the FDLP features, the utterances of speaker-3 are closely associated with speaker-2. In either case, there were several pair-wise misdetections due to the proximity of the i-vectors of one speaker to another. When the i-vectors from FDLP and IFCC are concatenated, the t-SNE projections of the hybrid i-vectors, in Fig. 7(c), clearly discriminate all the 4 speakers, illustrating the effectiveness of the hybrid i-vectors.

The performance of the i-vector fusion is consistently better than that of the scores fusion. In the case of i-vector fusion also, the FDLP+IFCC and MFCC+IFCC combinations resulted in better performance than the MFCC+FDLP combination (in an average sense), illustrating the complementary nature of the magnitude and phase components. When the i-vectors from all the three features are concatenated (MFCC+FDLP+IFCC), the resulting hybrid i-vector system outperformed all the individual systems and their pair-wise combinations by a large margin. The i-vector fusion of all the three features provided a 50% relative improvement in EER compared to the best individual system, based on FDLP features. Hence, the i-vector fusion provides an effective way of combining complementary speaker-specific information from different feature representations.

6. Conclusions

The significance of the analytic phase of speech signals is studied using perception tests and automatic speaker verification experiments. When the analytic phase of speech signals is distorted, the resulting signal sounded like whispered speech, and human subjects found it difficult to verify the speaker identity. This finding motivated us to explore features from the analytic phase for speaker verification. Since computation of the analytic phase suffers from the phase wrapping problem, we have used its derivative, i.e., the instantaneous frequency (IF), as its representative for feature extraction. The IF is computed using properties of the Fourier transform, without getting affected by the phase wrapping problem. The artifacts associated with the IF computation are highlighted, and a smoothing strategy is suggested to minimize the artifacts.
The deviations of the smoothed IF from the center frequency are used to extract the IFCC features, which are used as representatives of the analytic phase of speech signals. The performance of IFCC features on MP-ABX tasks indicated the speaker-specific nature of the information. The performance of the IFCC features, on the NIST 2010 SRE database, is comparable to the performance of FDLP and MFCC features. When the evidence from IFCC features is combined with FDLP and MFCC features, the performances of the systems improved, illustrating the complementary speaker-specific information in the analytic phase of the speech signal.

Acknowledgments

We would like to thank the anonymous reviewers for their creative suggestions and constructive criticisms, which helped to improve the content of this paper.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.specom.2016.02.005.

References

Alsteris, L.D., Paliwal, K.K., 2007. Short-time phase spectrum in speech processing: a review and some experimental results. Digital Signal Process. 17 (3), 578–616.
Atal, B.S., Shadle, C.H., 1978. Decomposing speech into formants: a new look at an old problem. J. Acoust. Soc. Am. 64 (S1), S162.
Athineos, M., Ellis, D.P.W., 2007. Autoregressive modeling of temporal envelopes. IEEE Trans. Signal Process. 55 (11), 5237–5245.
Boashash, B., 1992. Estimating and interpreting the instantaneous frequency of a signal – part 1: fundamentals. Proc. IEEE 80 (4), 520–538.
Brümmer, N., de Villiers, E., 2011. The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF. NIST SRE'11 Analysis Workshop, Atlanta.
Campbell, W., Sturim, D., Reynolds, D., 2006. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13 (5), 308–311.
Cohen, L., 1995. Time-frequency Analysis: Theory and Applications. Signal Processing Series. Prentice Hall, Inc., Upper Saddle River, NJ, USA.
Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28 (4), 357–366.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19 (4), 788–798.
van Dijk, M., Orr, R., van der Vloed, D., van Leeuwen, D., 2013. A human benchmark for automatic speaker recognition. In: Biometric Technologies in Forensic Science, pp. 39–45.
Dimitriadis, D., Maragos, P., Potamianos, A., 2005. Robust AM-FM features for speech recognition. IEEE Signal Process. Lett. 12 (9), 621–624.
Duda, R.O., Hart, P.E., Stork, D.G., 2000. Pattern Classification, second ed. Wiley-Interscience, New York, NY, USA.
Gabor, D., 1946. Theory of communication. J. Inst. Electr. Eng. 93 (26), 429–457.
Ganapathy, S., Mallidi, S., Hermansky, H., 2014. Robust feature extraction using modulation filtering of autoregressive models. IEEE/ACM Trans. Audio Speech Lang. Process. 22 (8), 1285–1295.
Ganapathy, S., Pelecanos, J., Omar, M.K., 2011. Feature normalization for speaker verification in room reverberation. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), pp. 4836–4839.
Garofolo, J.S., Lamel, L.F., Fisher, W.M., Fiscus, J.G., Pallett, D.S., Dahlgren, N.L., 1993. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus.
Gianfelici, F., Biagetti, G., Crippa, P., Turchetti, C., 2007. Multicomponent AM-FM representations: an asymptotically exact approach. IEEE Trans. Audio Speech Lang. Process. 15 (3), 823–837.
Gill, G., Gupta, S.C., 1972. First-order discrete phase-locked loop with applications to demodulation of angle-modulated carrier. IEEE Trans. Commun. 20 (3), 454–462.
Gowda, D., Saeidi, R., Alku, P., 2015. AM-FM based filter bank analysis for estimation of spectro-temporal envelopes and its application for speaker recognition in noisy reverberant environments. In: Proceedings of Interspeech, pp. 1166–1170.
Greisbach, R., 1999. Estimation of speaker height from formant frequencies. Int. J. Speech Lang. Law 6 (2), 265–277.
Griffiths, L.J., 1975. Rapid measurement of digital instantaneous frequency. IEEE Trans. Acoust. Speech Signal Process. 23 (2), 207–222.
Grimaldi, M., Cummins, F., 2008. Speaker identification using instantaneous frequencies. IEEE Trans. Audio Speech Lang. Process. 16 (6), 1097–1111.
Hautamaki, V., Kinnunen, T., Sedlak, F., Lee, K.A., Ma, B., Li, H., 2013. Sparse classifier fusion for speaker verification. IEEE Trans. Audio Speech Lang. Process. 21 (8), 1622–1631.
Huang, N.E., Shen, Z., Long, S.R., Wu, M.C., Shih, H.H., Zheng, Q., Yen, N.-C., Tung, C.C., Liu, H.H., 1998. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. R. Soc. Lond. A: Math. Phys. Eng. Sci. 454 (1971), 903–995.
Kaiser, J., 1990. On a simple algorithm to calculate the 'energy' of a signal. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), pp. 381–384. doi:10.1109/ICASSP.1990.115702.
Karam, Z.N., 2006. Computation of One-dimensional Unwrapped Phase. MIT (Master's thesis).
Kinnunen, T., 2006. Joint acoustic-modulation frequency for speaker recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), p. I. doi:10.1109/ICASSP.2006.1660108.
Kinnunen, T., Li, H., 2010. An overview of text-independent speaker recognition: from features to supervectors. Speech Commun. 52 (1), 12–40.
Kumaresan, R., Rao, A., 1999. Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications. J. Acoust. Soc. Am. 105 (3), 1912–1924.
Larcher, A., Bonastre, J.-F., Fauve, B.G., Lee, K.-A., Lévy, C., Li, H., Mason, J.S., Parfait, J.-Y., 2013. ALIZE 3.0 – open source toolkit for state-of-the-art speaker recognition. In: Proceedings of Interspeech, pp. 2768–2772.
Lee, K.-F., Hon, H.-W., 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. Acoust. Speech Signal Process. 37 (11), 1641–1648.
van der Maaten, L.J.P., Hinton, G.E., 2008. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9 (9), 2579–2605.
Makhoul, J., 1975. Linear prediction: a tutorial review. Proc. IEEE 63 (4), 561–580.
Maragos, P., Kaiser, J., Quatieri, T., 1993a. Energy separation in signal modulations with application to speech analysis. IEEE Trans. Signal Process. 41 (10), 3024–3051.
Maragos, P., Kaiser, J., Quatieri, T., 1993b. On amplitude and frequency demodulation using energy operators. IEEE Trans. Signal Process. 41 (4), 1532–1550.
Marple, S.L.J., 1999. Computing the discrete-time "analytic" signal via FFT. IEEE Trans. Signal Process. 47 (9), 2600–2603.
McLaren, M., Scheffer, N., Graciarena, M., Ferrer, L., Lei, Y., 2013. Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 6773–6777.
Murty, K.S.R., Yegnanarayana, B., 2008. Epoch extraction from speech signals. IEEE Trans. Audio Speech Lang. Process. 16 (8), 1602–1613.
NIST, 2010. The NIST Year 2010 Speaker Recognition Evaluation Plan. URL http://www.nist.gov/itl/iad/mig/upload/NIST_SRE10_evalplan-r6.pdf
Nosratighods, M., Thiruvaran, T., Epps, J., Ambikairajah, E., Ma, B., Li, H., 2009. Evaluation of a fused FM and cepstral-based speaker recognition system on the NIST 2008 SRE. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), pp. 4233–4236.
Oppenheim, A.V., Schafer, R.W., Buck, J.R., 1999. Discrete-time Signal Processing, second ed. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Orchard, T.L., Yarmey, A.D., 1995. The effects of whispers, voice-sample duration, and voice distinctiveness on criminal speaker identification. Appl. Cogn. Psychol. 9 (3), 249–260.
Pai, W.-C., Doerschuk, P., 2000. Statistical AM-FM models, extended Kalman filter demodulation, Cramer-Rao bounds, and speech analysis. IEEE Trans. Signal Process. 48 (8), 2300–2313.
Paliwal, K.K., Alsteris, L.D., 2005. On the usefulness of STFT phase spectrum in human listening tests. Speech Commun. 45 (2), 153–170.
Paliwal, K.K., Atal, B.S., 2003. Frequency-related representation of speech. In: Proceedings of Eurospeech, pp. 65–68.
Pantazis, Y., Rosec, O., Stylianou, Y., 2011. Adaptive AM-FM signal decomposition with application to speech analysis. IEEE Trans. Audio Speech Lang. Process. 19 (2), 290–300.
Picone, J., 1993. Signal modeling techniques in speech recognition. Proc. IEEE 81 (9), 1215–1247.
Pollack, I., Pickett, J.M., Sumby, W.H., 1954. On the identification of speakers by voice. J. Acoust. Soc. Am. 26 (3), 403–406.
Potamianos, A., Maragos, P., 1994. A comparison of the energy operator and the Hilbert transform approach to signal and speech demodulation. Signal Process. 37 (1), 95–120.
Potamianos, A., Maragos, P., 1996. Speech formant frequency and bandwidth tracking using multiband energy demodulation. J. Acoust. Soc. Am. 99 (6), 3795–3806.
Potamianos, A., Maragos, P., 1999. Speech analysis and synthesis using an AM-FM modulation model. Speech Commun. 28 (3), 195–209.
Prince, S., Elder, J., 2007. Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings of IEEE International Conference on Computer Vision (ICCV '07), pp. 1–8.
Quatieri, T., 2001. Discrete-time Speech Signal Processing: Principles and Practice, first ed. Prentice Hall Press, Upper Saddle River, NJ, USA.
Quatieri, T., Hanna, T., O'Leary, G., 1997. AM-FM separation using auditory-motivated filters. IEEE Trans. Speech Audio Process. 5 (5), 465–480.
Rabiner, L.R., Schafer, R.W., 1978. Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs, NJ, USA.
Rao, A., Kumaresan, R., 2000. On decomposing speech into modulated components. IEEE Trans. Speech Audio Process. 8 (3), 240–254.
Reddy, P.R., Vijayan, K., Murty, K.S.R., 2015. Analysis of features from analytic representation of speech using MP-ABX measures. In: Proceedings of Interspeech, pp. 593–597.
Reynolds, D., Rose, R., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3 (1), 72–83.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 10 (1–3), 19–41.
Sadjadi, S.O., Hasan, T., Hansen, J.H.L., 2012. Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition. In: Proceedings of Interspeech, pp. 1696–1699.
Saratxaga, I., Hernaez, I., Erro, D., Navas, E., Sanchez, J., 2009. Simple representation of signal phase for harmonic speech models. IEEE Electron. Lett. 45 (7), 381–383.
Sarria-Paja, M., Falk, T., O'Shaughnessy, D., 2013. Whispered speaker verification and gender detection using weighted instantaneous frequencies. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '13), pp. 7209–7213.
Schatz, T., Peddinti, V., Bach, F., Jansen, A., Hermansky, H., Dupoux, E., 2013. Evaluating speech features with the Minimal-Pair ABX task: analysis of the classical MFC/PLP pipeline. In: Proceedings of Interspeech, pp. 1–5.
Shannon, R.V., Zeng, F.-G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270 (5234), 303–304.
Thiruvaran, T., Ambikairajah, E., Epps, J., 2006. Speaker identification using FM features. In: Proceedings of 11th Australian International Conference on Speech Science and Technology, pp. 148–152.
Thiruvaran, T., Ambikairajah, E., Epps, J., 2008. Extraction of FM components from speech signals using all-pole model. IEEE Electron. Lett. 44 (6), 449–450.
Thiruvaran, T., Nosratighods, M., Ambikairajah, E., Epps, J., 2009. Computationally efficient frame-averaged FM feature extraction for speaker recognition. IEEE Electron. Lett. 45 (6), 335–337.
Tsiakoulis, P., Potamianos, A., Dimitriadis, D., 2009. Short-time instantaneous frequency and bandwidth features for speech recognition. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '09), pp. 103–106.
Tsiakoulis, P., Potamianos, A., Dimitriadis, D., 2010. Spectral moment features augmented by low order cepstral coefficients for robust ASR. IEEE Signal Process. Lett. 17 (6), 551–554.
Vakman, D., 1996. On the analytic signal, the Teager-Kaiser energy algorithm, and other methods for defining amplitude and frequency. IEEE Trans. Signal Process. 44 (4), 791–797.
Vijayan, K., Kumar, V., Murty, K.S.R., 2014. Feature extraction from analytic phase of speech signals for speaker verification. In: Proceedings of Interspeech, pp. 1658–1662.
Wolfe, J., Schafer, E.C., Heldner, B., Mulder, H., Ward, E., Vincent, B., 2009. Evaluation of speech recognition in noise with cochlear implants and dynamic FM. J. Am. Acad. Audiol. 20 (7), 409–421.
Won, J., Shim, H., Lorenzi, C., Rubinstein, J., 2014. Use of amplitude modulation cues recovered from frequency modulation for cochlear implant users when original speech cues are severely degraded. J. Assoc. Res. Otolaryngol. 15 (3), 423–439.
Yin, H., Hohmann, V., Nadeu, C., 2011. Acoustic features for speech recognition based on gammatone filterbank and instantaneous frequency. Speech Commun. 53 (5), 707–715.
Yoshida, H., Jain, A., Ichalkaranje, A., Ichalkaranje, N., 2007. Advanced Computational Intelligence Paradigms in Healthcare – 1. Studies in Computational Intelligence. Springer Berlin Heidelberg.
Zeng, F.-G., Nie, K., Stickney, G.S., Kong, Y.-Y., Vongphoe, M., Bhargave, A., Wei, C., Cao, K., 2005. Speech recognition with amplitude and frequency modulations. Proc. Natl. Acad. Sci. USA 102 (7), 2293–2298.