

CHAPTER 1

OVERVIEW OF THE PROJECT

This chapter includes the introduction to the project, the problem statement, motivation, objective, methodology adopted, tools required, and the organisation of the report.

1.1 INTRODUCTION

The world is becoming more and more cosmopolitan, and with the exponential growth of information technologies, human-to-machine communication is becoming increasingly common, in particular voice interaction with a machine. For complete interaction the computer should recognise human speech; however, some applications do not need full speech recognition but only identification of the spoken language. Moreover, this identification may need to be done relatively quickly, so taking the time to transcribe the entire speech may be a waste of time.

Automatic spoken Language Identification (LID) is one part of such a human-to-machine interface, and even if the task may seem easy for a human ear, language identification is hard for a computer. LID refers to the process of having a spoken language recognised by a computer [Zissman and Berkling, 2001]. LID systems could be used, for instance, in call centres to automatically find an interlocutor who matches the language of the caller, or to automatically classify speech data from a huge database.

Moreover, in some applications, an automatic LID system should be very quick, for example in the case of an emergency call, where each second could be vital. In the case reported here, the LID system is being built as a prototype for the firm Snell, and the specific requirements for the LID system that was developed were provided by this firm and are presented in the last chapter of the report.

Automatic spoken LID is not new. Research has been carried out since the 1970s, and LID systems have become more and more complex. The simplest systems use only acoustic features; then, to increase accuracy, phonotactic features such as phoneme recognition are added. The most complicated systems use word-level approaches and analyse the syntax of the language. The top performance of this kind of system is high [Zissman, 1996].

However, this top performance is obtained under strict laboratory conditions, and existing LID techniques need to be significantly improved for real applications. This project investigates several key issues related to applying LID in real applications.

In general, there are two main approaches to building LID systems. The first approach generally deals with only acoustic features and unlabelled data: the language identification features are extracted, then the model is trained. The typical model is a Gaussian Mixture Model (GMM) [Reynolds, 2008] for each language to be identified, because of the GMM's high performance and easy computation.

The second approach is to use a phoneme recogniser of one language, or several recognisers in parallel. Then, models for different languages are built and used to identify the phoneme sequences of a test sample of speech. The resulting sequences can then be used in decision-making.

The report is organised as follows. The first chapters present the challenge of LID and the necessary knowledge of signal processing. Then each chapter describes a step of an LID system, including speech enhancement techniques to reduce background noise, feature extraction, language modelling methods, and different decision-making techniques according to the purpose of the system.

Finally, the last chapter presents the implementation done for the purpose of this project and shows its results in a real application.

1.2 PROBLEM STATEMENT

In a multilingual country like India, there are many places, such as airports, call centres and service centres, where recognition of multiple languages is required to provide services.


1.3 MOTIVATION

The main motive of this research is to build a language identification system which can recognise a language when it is given as input, either in real time or as a recording.

1.4 OBJECTIVE

The main objective of a language identification system is to identify the language in a given utterance and to achieve accurate results on the shortest possible speech segments by using an automatic language identification system.

1.5 METHODOLOGY ADOPTED

The system works in two phases, i.e., a training phase and a testing phase.

Methodology for Training phase

1. Input the audio signal.

2. Convert it to vocal features.

3. Apply feature extraction methods.

4. Apply the GMM model for training.

5. Train a GMM for each language to generate the language model.

Methodology for Testing phase

1. Input the audio signal.

2. Convert it to vocal features.

3. Apply feature extraction methods (MFCC, GFCC, PLP, SDC and combinations of these features).

4. Apply the GMM model to the generated features.

5. Classify the features according to the trained dataset.

6. Identify the language based on the classified features.

A MATLAB sketch of this training and testing flow is given below.
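The following MATLAB sketch illustrates the training and testing flow above under a few assumptions: the Statistics Toolbox routine gmdistribution.fit is available (it is in R2013a), the language names and file names are placeholders, and extract_features is a hypothetical helper that would return one feature vector per frame (e.g. MFCC or SDC values as the rows of a matrix).

% Hedged sketch of the two phases above. extract_features is hypothetical.
languages = {'hindi', 'telugu', 'tamil'};                 % example label set
models = cell(size(languages));

% ---- Training phase ----
for i = 1:numel(languages)
    X = extract_features(sprintf('train_%s.wav', languages{i}));  % frames x dims
    models{i} = gmdistribution.fit(X, 32, 'Regularize', 1e-3);    % 32-component GMM
end

% ---- Testing phase ----
Y = extract_features('test_utterance.wav');
logL = zeros(1, numel(languages));
for i = 1:numel(languages)
    logL(i) = sum(log(pdf(models{i}, Y)));                % total log-likelihood
end
[~, best] = max(logL);
fprintf('Identified language: %s\n', languages{best});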


1.6 TOOLS REQUIRED

 MATLAB software (R2013a version).

 A speech database of 12 languages.

1.7 ORGANIZATION OF THE REPORT

Chapter 1 presents an overview of the project, its motivation, objectives, goals and applications. Chapter 2 presents the literature survey regarding the project; this survey helped greatly in obtaining a complete description of the project. Chapter 3 presents an introduction to speech, which is basic to our domain. Chapter 4 presents speech enhancement, which includes the signal-to-noise ratio, the effect of noise on the speech signal, etc. Chapter 5 presents language identification systems, including the basic block diagrams of language identification systems and a detailed explanation of their blocks. Chapter 6 presents GMM modelling techniques, including the different phases of the GMM modelling technique and their detailed explanation. Chapter 7 presents the conclusion and the future scope of language identification systems.


CHAPTER 2

LITERATURE SURVEY

2.1 INTRODUCTION
In the previous chapter, the introduction to the project, the problem statement, motivation, objective, methodology adopted, tools used and organization of the report were discussed. In this chapter, the literature survey of the project is discussed.

2.2 GENESIS OF THE REPORT

Language identification using phoneme recognition and phonotactic modelling, followed by n-gram language models, is based on PRLM. A gender-dependent acoustic model is presented. This method is used to enhance speech recognition performance; however, the gender-dependent accuracy is low, so this framework could be improved for the gender-dependent case.

Language Identification (LID) based on language-dependent phone recognition uses various features and their combinations, extracted by language-dependent recognisers, where the evaluation is performed on the same database. Two techniques are used:

 Forward and backward bigram-based language models.

 Context-dependent duration models.

Both methods are used for language identification; the backward bigram is used to capture backward phonetic constraints only.

SVM-based speaker verification using the GMM model: Gaussian mixture models with universal background models (UBMs) have become the standard approach for speaker recognition. A speaker model is obtained by MAP adaptation of the means of the UBM. A GMM supervector is constructed by stacking the means of the adapted mixture components. Recent research has shown that factor analysis of this GMM supervector is an effective strategy for variability compensation. The system builds a support vector machine using this strategy, called the GMM supervector.

Acoustic, phonetic and discriminative approaches to automatic language identification present three techniques, GMM, phone recognition and support vector machine classification, but fully satisfactory accuracy is not achieved.

The total variability model uses the i-vector approach, based on Joint Factor Analysis (JFA), for speaker verification. The JFA model, based on speaker and channel components, comprises two distinct spaces: the speaker space, defined by the eigenvoice matrix V, and the channel space, represented by the eigenchannel matrix U. In the total variability approach only a single space is used instead of two, referred to as the total variability space. The total variability matrix contains the eigenvectors with the largest eigenvalues of the total variability covariance matrix. Given an utterance, the new speaker- and channel-dependent GMM supervector is defined as follows:

M = m + Tw

where m is the speaker- and channel-independent supervector, T is the total variability matrix and w is the i-vector.

Joint factor analysis versus eigenchannels in speaker recognition: this work introduced two approaches to the problem of session variability in Gaussian mixture model (GMM) based speaker verification, eigenchannels and joint factor analysis, and found that factor analysis was significantly more effective than eigenchannel modelling.

In the proposed system, JFA methods are used to improve accuracy in the speaker verification task.

For speaker verification, front-end factor analysis represents another speaker verification framework, in which factor analysis is used to define a new low-dimensional space that models both speaker and channel variability.

I-vectors in the context of phonetically-constrained short utterances for speaker verification: the future scope of this method is to explore the effect of phonetic information on i-vector normalisation by considering the relationship between the current speaker discrimination scoring and the phonetic distance between short-duration utterances.


Language recognition in the i-vector space: with the idea of so-called i-vectors, every utterance is represented by a fixed-length, low-dimensional feature vector. This novel approach to language recognition gives excellent performance over all conditions; a future extension is to try to obtain the i-vectors from the utterances and the corresponding sufficient statistics in a more direct manner.

Methods based on the i-vector model for speaker recognition present a study of how current factor analysis techniques perform when utterance lengths are significantly reduced. The problems of short utterances with factor analysis approaches will be investigated in the future.

2.3 CONCLUSION
In this chapter, the relevant references were reviewed and the analysis of the literature survey of the project was discussed.


CHAPTER 3

INTRODUCTION TO SPEECH

3.1 Human Speech

The first step toward completing this project is to understand how human speech is formed and what the differences between languages are.

3.1.1 Speech Formation

A very detailed explanation of speech production and a description of language can be found in [Ladefoged, 1993], which is a reference book in this field.

Phoneme

Language is used to communicate, so speech is a series of messages. These messages are divided into words in a specific order, and each word is a sequence of phonemes. A phoneme is defined by the International Phonetic Association as "the smallest segmental unit of sound employed to form meaningful contrasts between utterances" [Association, 1999]. Phonemes are used as symbols for sound distinction. They can be split into two classes, the consonants and the vowels, which are defined by [Huang et al., 2001] as follows:


 Consonants are articulated in the presence of constrictions in the throat, or obstructions in the mouth (tongue, teeth, lips), to the air flow from the lungs as we speak.

 Vowels are articulated without major constrictions and obstructions.

The source and the filter

A speech sound is produced at the vocal cords by the air flow coming from the lungs; this sound then resonates in the oral or nasal cavities according to the position of the mouth's organs. One of the most important theories of speech processing is to model the airflow at the vocal cords as a source e[n] and the resonance in the mouth as a filter h[n]:

x[n] = e[n] ∗ h[n].

The separation between source and filter is a big challenge in speech recognition and in good feature extraction. However, it is necessary, because some characteristics (like phoneme classification) depend mostly on the filter.

3.1.2 language Discrimination and its difficulties

Language Discrimination

Some characteristics are useful to discriminate languages. Following is a list of the


most important characteristics for language recognition, but more detailed studies can
be found in the books [Fromkin et al., 2010, Comrie, 2011].

 Phonetic: Each language has its own set of phonemes, with a specific frequency of occurrence for each. Even if two languages share the same phoneme, the rules governing which phonemes may precede or follow it can be different. These rules are called the "phonotactic constraints".
 Prosody: This refers to the rhythm of a language. For some languages, prosodic
characteristics are useful in distinguishing words and meaning. These can be the
stress, the duration of phonemes or the pitch. For example some languages, like
Mandarin, are tonal, so pitch is highly relevant in this case.


 Vocabulary: Obviously, each language has its own words, but not all languages share the same roots, or at least the construction of the words is different.
 Syntax: Each language has its own word order, placing words in a sentence
according to their function.

Difficulties

In the previous section we saw which features characterise a language.


Muthusamy showed that humans can accurately recognise a language if they know it
[Muthusamy et al., 1994]. However, this is still an unsolved problem for a machine.
Each language is just a set of rules and yet it could sound very different depending on,
for example, the speaker or accent. There are some characteristics which may make the
recognition task difficult for a machine:

 Each speaker has his own voice therefore he/she speaks the language
differently. Indeed a speaker’s vocal anatomy is unique, so the harmonics
produced during speaking are not the same.

Furthermore, two speakers may not pronounce a phoneme in the same way because of their accents. That is why a language cannot be reduced to one speaker, and LID systems must be trained with a broad range of speakers. Moreover, the pitch of a male voice may be very different from the pitch of a female or child's voice, so LID systems generally distinguish male from female voices. A study of the role of pitch can be found in chapter six of [Huang et al., 2001].

 When we speak, emotions emphasise a word or a situation. However, emotions may distort the conventional pronunciation of words and make the decision of an LID system more difficult. Studies of the modification of phonemes according to emotion can be found for the Farsi language in [Ververidis and Kotropoulos, 2006] and for the Chinese language in [Yuan et al., 2002].

The worst inconvenience for an LID system is environmental noise. Indeed, it may interfere with the speaker's voice and add irrelevant information for the LID system, which may then make mistakes. Environmental noises, such as background music or street sounds, therefore need to be filtered out before proceeding to language recognition.

3.2 Digital Signal Processing

The sound produced by our vocal tract is a continuous signal. However it is


saved in a discrete (or digital) way on our computer. This digital signal needs to be
transformed to extract good features for language recognition. This section describes
the principal treatment applicable to a digital signal for LID systems.

3.2.1 Window functions

Sounds produced when we speak change very quickly; however, we want to study one sound at a time, to ensure that we work with a portion of signal which has stable properties. Therefore, the signal is cut into several frames, and each frame is extracted according to a window function. The most used windows are the rectangular and the Hamming windows. The rectangular window of size N is the simplest one: the rectangular windowing of a signal s[n] simply keeps the N contiguous samples of s[n] in the region of interest. However, at each window edge the change may be very abrupt, so to reduce this edge effect the general Hamming window is applied. Its generalised equation, for any α ∈ [0,1], is:

w[n] = (1 − α) − α cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1    (3.1)

For the specific Hamming window, α is equal to 0.46. A classical Hamming window in the time domain is shown in Figure 3.1.
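A minimal MATLAB sketch of this generalised Hamming window, using base functions only (the Signal Processing Toolbox function hamming(N) gives the same curve for α = 0.46); the 16 kHz rate and 20 ms length are example values, not requirements of this report.

N     = 320;                                      % 20 ms frame at 16 kHz
alpha = 0.46;                                     % 0.46 -> Hamming, 0.5 -> Hann
n     = (0:N-1)';
w     = (1 - alpha) - alpha * cos(2*pi*n/(N-1));  % equation (3.1)
plot(n/16000, w); xlabel('time (s)'); ylabel('amplitude');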

The Fourier Transform is a transformation of a continuous function in the time domain into another function in the frequency domain. It represents the energy of the signal as a function of frequency, and the plot of the amplitude against the frequency is called the spectrum. The Fourier transform of a function f(t) is:

F(ω) = ∫ f(t) e^(−jωt) dt    (3.2)


Figure 3.2: Plot of a 20 ms hamming window in the time domain.

In the case of a periodic signal x_T[n] with period T, we can apply the Discrete Fourier Transform (DFT). It transforms the signal into a sum of T harmonic sinusoids and represents the signal by T coefficients. The DFT of the periodic signal x_T[n] is defined as:

X_T[k] = Σ(n=0..T−1) x_T[n] e^(−j2πkn/T),  k = 0, …, T − 1    (3.3)

Figure 3.2 shows an example of the absolute DFT coefficients of a speech signal. The DFT of a periodic signal normally requires T² operations; however, a faster algorithm called the Fast Fourier Transform (FFT) needs only about T log T operations. These algorithms and their explanations can be found in [Brigham, 1988].
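The short sketch below, under the same kind of assumptions (placeholder file name, example frame and hop sizes), cuts a signal into overlapping Hamming-windowed frames and computes the magnitude of the DFT of each frame with the FFT.

[x, fs] = audioread('speech.wav');              % placeholder file name
x   = x(:, 1);                                  % keep one channel
N   = round(0.020 * fs);                        % 20 ms frames
hop = round(0.010 * fs);                        % 10 ms hop (50% overlap)
w   = 0.54 - 0.46 * cos(2*pi*(0:N-1)'/(N-1));   % Hamming window
nFrames = floor((length(x) - N) / hop) + 1;
spec = zeros(N, nFrames);
for t = 1:nFrames
    frame = x((t-1)*hop + (1:N));
    spec(:, t) = abs(fft(frame(:) .* w));       % magnitude of the DFT
end
imagesc(20*log10(spec(1:floor(N/2), :) + eps)); axis xy;   % crude spectrogram in dB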

3.2.2 DIGITAL SIGNAL PROCESSING

Transfer function

A filter is more easily expressed in the frequency domain through its transfer function H(z). The transfer function of a time-invariant filter is defined as the z-transform of the output signal (the filtered input) Y(z) divided by the z-transform of the input signal X(z):

H(z) = Y(z) / X(z)    (3.4)

Where the z-transform is simply a generalisation of the Fourier transform in which z = e^(jω). The z-transform of a digital signal s[n] is defined as:

S(z) = Σ(n) s[n] z^(−n)    (3.5)

Moreover, it transpires that H(z) is the z-transform of the impulse response h[n]. Thus, the transform of the output signal can simply be computed by multiplying H(z) and the transform of the input signal X(z).

3.2.3 Discrete Cosine Transform

The Discrete Cosine Transform (DCT) is similar to the DFT; it transforms the signal
into a sum of T sinusoids. More works about DCT can be found in the book [Rao and
Yip, 1990]. There are several definitions of the DCT, however it is more commonly
defined as:

(3.6)

Figure 3.6: shows an example of DCT coefficients applied to a speech signal.

3.2.4 Digital filters

A digital filter is a system which modifies characteristics of a digital signal.


These could be characteristics of the phase or frequencies of a digital signal. However,
the most used filters are those that work on the spectral coefficients of a signal. So, here
we are interested in filters which reduce or increase the energy of different frequencies
of a digital signal.

3.3 DISCRETE COSINE TRANSFORM

The advantage of the DCT is its data compression capacity. Indeed, as a speech signal is formed by a majority of low frequencies, most of the signal information is concentrated in the first DCT coefficients. So the signal can be approximated with fewer coefficients. On this point DCT performs better than DFT, because the DFT tries to simulate a periodic signal which ends at the same amplitude that it begins (this is in general unlikely for an ordinary signal).
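A small MATLAB illustration of this energy-compaction property, with the DCT-II basis built by hand so that no toolbox is needed (the toolbox function dct would do the same job); the synthetic signal and the choice of keeping 20 coefficients are only examples.

T = 256;
x = cumsum(randn(T, 1));                  % stand-in for a low-frequency-dominated signal
k = (0:T-1)';
n = 0:T-1;
D = cos(pi/T * k * (n + 0.5));            % DCT-II basis matrix, as in equation (3.6)
c = D * x;                                % DCT coefficients
c(21:end) = 0;                            % keep only the first 20 coefficients
x_hat = (2/T) * (D' * c) - c(1)/T;        % inverse DCT-II (with the k = 0 correction)
plot(1:T, x, 'b', 1:T, x_hat, 'r');
legend('original', '20-coefficient DCT approximation');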

Figure 3.3: Plot of a speech waveform (above) and its DFT (below).


Impulse response

A time-invariant filter is fully represented by its impulse response h[n]. This means that h[n] is the output signal produced by the filter when the input signal is the impulse signal δ[n], where δ[n] is expressed as:

δ[n] = 1 if n = 0, and δ[n] = 0 otherwise    (3.8)

It can be seen metaphorically as a very brief impulse, like a hammer blow. The output signal is then computed by the convolution of the input signal and the impulse response of the filter:

y[n] = x[n] ∗ h[n] = Σ(k) x[k] h[n − k]    (3.9)
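A short MATLAB sketch of these two relations: feeding the impulse into a filter returns its impulse response, and filtering is the convolution of the input with that response. The 3-point moving-average filter is an arbitrary example.

h = [1 1 1] / 3;                  % impulse response of a simple smoothing filter
delta = [1 zeros(1, 9)];          % discrete impulse delta[n]
disp(conv(delta, h));             % feeding the impulse returns h[n] (zero padded)
x  = randn(1, 100);               % some input signal
y  = conv(x, h);                  % output = convolution of input and impulse response
y2 = filter(h, 1, x);             % same filter written through its transfer function H(z)
                                  % (y2 equals the first 100 samples of y)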


CHAPTER-4

SPEECH ENHANCEMENT
Nowadays, with the hugely increasing use of mobile phones with
embedded microphones, an LID system must be prepared to deal with all kinds of
noises which reduce its performance. Indeed, white noise from a telephone channel,
street noise or music may merge with speech and disturb the information contained in
the speech sample. Noise might have to be removed by pre-processing the signal in
order to gain good performance for a recognition task. This chapter describes some
speech enhancement techniques.

4.1 Signal to noise ratio

The signal to noise ratio (SNR) is an indicator that measures the quality of a signal according to the level of background noise. For an LID task, the SNR is better thought of as a speech to noise ratio. If the SNR is low, the strength of the noise may be too high compared to the strength of the speech; in such a case, the speech may be corrupted and unintelligible. Conversely, a high SNR means good intelligibility of the speech. The SNR is computed as follows:

SNR = Ps / Pn    (4.1)

Where Ps and Pn are the powers of the speech and of the noise respectively, and the power Px of a signal x[n] of period T is defined as:

Px = (1/T) Σ(n=0..T−1) x[n]²    (4.2)


As we often work in the frequency domain, it can be useful to compute the power in the frequency domain. This can be done according to Parseval's theorem for a discrete-time signal:

Σ(n=0..T−1) |x[n]|² = (1/T) Σ(k=0..T−1) |X[k]|²    (4.3)

Where X[k] is the Discrete Fourier Transform of the discrete signal x[n]. Thus the expression of the power of a signal x[n] becomes:

Px = (1/T²) Σ(k=0..T−1) |X[k]|²    (4.4)

The SNR is often expressed in decibels (dB) according to the conversion:

SNR(dB) = 10 log10(Ps / Pn)    (4.5)
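The following MATLAB sketch computes the SNR of a synthetic noisy signal in the time domain and checks the power computation through Parseval's theorem; the tone-plus-noise signal is purely illustrative.

fs = 16000;
t  = (0:fs-1)'/fs;
s  = 0.5 * sin(2*pi*220*t);                  % stand-in for the "speech" signal
n  = 0.05 * randn(size(s));                  % additive background noise
Ps = mean(s.^2);                             % speech power, as in equation (4.2)
Pn = mean(n.^2);                             % noise power
snr_lin = Ps / Pn;                           % equation (4.1)
snr_db  = 10 * log10(snr_lin);               % equation (4.5)
Ps_freq = sum(abs(fft(s)).^2) / length(s)^2; % power via the DFT, equations (4.3)-(4.4)
fprintf('SNR = %.1f dB, Ps = %.4g, Ps (via FFT) = %.4g\n', snr_db, Ps, Ps_freq);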

4.2 Spectral noise subtraction

4.2.1 Basic idea

Spectral noise subtraction is a speech enhancement method. Its aim is to remove broadband noise from a signal while keeping the intelligibility of the speech. To understand the idea of spectral subtraction, we need to know the effects of noise corruption on a speech sample. Figures 4.1 and 4.2 compare a clean speech sample without noise and the same speech sample to which uncorrelated white noise has been added. White noise is a signal with a flat spectrum, which means that each frequency has the same magnitude. Figure 4.1 represents the waveform (in the time domain) of these clean and noisy speech samples. We can see that the mean is unaltered, but the variance is slightly increased. Figure 4.2 shows the same clean and noisy speech samples, but this time in the frequency domain, by plotting the magnitude of the signals against frequency. We can see this time that the noise has increased the mean of the magnitude, and the variance is altered too.

Because of this last observation, spectral subtraction has been investigated to reduce the noise in a given signal. The spectral subtraction method acts in the frequency domain, and the idea is to counteract this increase in the mean of the magnitude spectrum.

Figure 4.1: Plots of speech signals in time domain. Above, a clean speech signal. Below, same
speech signal but corrupted by uncorrelated white noise.


Speech enhancement techniques generally make the assumption that, on a short-term frame, a speech signal x[k] corrupted by uncorrelated noise is equal to the sum of a clean speech signal s[k] and a noise signal n[k] [Berouti et al., 1979]:

x[k] = s[k] + n[k]    (4.6)

Which can be expressed in the frequency domain by computing the Discrete Fourier Transform:

X(ω) = S(ω) + N(ω)    (4.7)

Figure 4.2: Plots of magnitude spectra of speech (in frequency domain). Above, a clean speech
signal. Below, same speech signal but corrupted by uncorrelated white noise.


An estimate of the noise spectrum is then subtracted from the noisy speech spectrum, where the exponent b can be equal to one or two according to the applied method: b = 1 for magnitude spectral subtraction and b = 2 for power spectral subtraction. The estimation of the noise spectrum may be performed during frames without speech (which are determined by a voice activity detector), so the mean is computed on a subset of these voiceless frames. Moreover, this noise estimation can be smoothed in frequency according to the kind of noise [Berouti et al., 1979]; for example, the estimation of a typical white noise may be set to be flat in frequency.

Because the first assumption is not completely true, the magnitude of the clean speech spectrum estimate may turn out to be negative. So, it is set to zero when this is the case. The clean speech spectrum estimation becomes:

(4.8)

(4.9)

4.2.2 Improved method

The method described above does remove some background noise; however, some noise remains and, worse, it creates a new kind of noise. This new noise is due to the crude clamping of the negative values to zero. Indeed, in real applications, the background noise might not be a perfect white noise, so the power spectrum of the noise may not be completely flat and may contain peaks and valleys. Thus, valleys and slight peaks are reduced to a magnitude of zero by the spectral subtraction, and only the highest peaks remain. These isolated peaks create a new, specific noise.


This new kind of noise is called musical noise because the bands of frequencies responsible for it are very narrow and spaced apart, so they may be perceived as tones. Figure 4.3 (adapted from [Westall et al., 1998]) illustrates this problem.

In order to counteract the emergence of these noises and to remove more


background noise, two coefficients α and β are added to the equation:

Figure 4.3: Illustration of musical noise created by spectral subtraction (adapted from [Westall
et al., 1998]).

(4.10)

Apart from that, the method remains the same. This method is described by a block diagram in Figure 4.4, adapted from [Berouti et al., 1979]. Finally, figure 4.5 shows the result of the spectral subtraction applied to the noise-corrupted speech plotted at the bottom of figure 4.1. We can see that the signal is cleaner, but not as good as the original noiseless signal at the top of figure 4.1.


Figure 4.4: Block diagram of the spectral noise subtraction method (adapted from [Berouti et al., 1979]).


Figure 4.5: Plots of speech signals in time domain. Above: a speech signal
corrupted by uncorrelated white noise. Below: result of spectral subtraction
applied to the noisy speech above.
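A minimal MATLAB sketch of one frame of spectral subtraction with the over-subtraction factor α and the floor factor β (power domain, b = 2). The noise power spectrum is estimated here from a synthetic noise-only segment; in a real system it would come from frames flagged as speech-free by a VAD, and the enhanced frames would be recombined by overlap-add, which is omitted for brevity. The values of α and β are illustrative.

fs = 16000; N = 320;
w  = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));           % Hamming window
noise = 0.1 * randn(10*N, 1);                        % noise-only recording
clean = sin(2*pi*300*(0:N-1)'/fs);                   % stand-in for a speech frame
frame = clean + 0.1 * randn(N, 1);                   % noisy frame to enhance

noiseFrames = reshape(noise, N, 10);
noisePSD = mean(abs(fft(noiseFrames .* repmat(w, 1, 10))).^2, 2);  % averaged noise power

alpha = 2.0; beta = 0.02;
X  = fft(frame .* w);
P  = abs(X).^2;                                      % noisy power spectrum
Pc = max(P - alpha * noisePSD, beta * noisePSD);     % subtraction plus spectral floor
s_hat = real(ifft(sqrt(Pc) .* exp(1i*angle(X))));    % enhanced frame, noisy phase kept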

4.3 Noise power estimation

Some speech enhancement techniques, like the spectral subtraction described


above, need a noise power estimation to work. A good noise estimator is essential for
good performance in speech enhancement methods especially for spectral subtraction
which highly relies on it. Indeed, if the estimation is too low, background noise may
persist and if it is too high, the speech could be affected. This section describes a noise
power spectral density estimation proposed by [Martin, 2001], which does not require
a voice activity detector, and therefore does not use an explicit threshold to detect
speech activity. It is based instead on minimum statistics by tracking spectral minima
in the spectral domain for each frequency band. First, as above, we assume that the noisy input signal x[k] is equal to the sum of a clean speech signal s[k] and a noise signal n[k]. Then, we express this equation in the spectral domain by computing the FFT for each frame λ:

(4.11)

A periodogram is a plot of signal power against the frames. A periodogram generally fluctuates greatly from frame to frame. To understand how tracking the minima of the signal power could work, we define a smooth version P_λ[k] of the signal periodogram. The smooth version aims to reduce the oscillations while keeping the global shape of the signal power:

(4.12)

Where α is a coefficient, between zero and one, that determines the "smoothness" of the periodogram. Figure 4.6 shows an example of a periodogram of a signal for a specific frequency bin k, and its smooth version. We can see that the smooth version of the periodogram follows the global shape of the original periodogram without all the short-time high variations. High peaks are assumed to be frames with voice, and the valleys are frames without voice (so they contain only noise). To obtain a noise power estimation, the idea is to track the minima of the smoothed power within a sliding window. Ideally, the size of this sliding window should be as wide as the number of frames between two valleys.

Figure 4.6: Plots of a periodogram (above) and its smooth version (below).

Figure 4.7 illustrates this noise power estimation method. The sliding window has been taken small for the purpose of the example. However, this method suffers from several issues, as [Martin, 2001] describes: First, wide peaks in the smoothed signal power could be larger than the sliding window, so the estimation may incorporate speech parts. Then, the estimation is biased toward lower values; indeed, a single frame with a very low power is enough for the estimation to keep this value for the whole length of the sliding window. Finally, if the noise increases suddenly, the noise power estimation will lag by the length of the sliding window. [Martin, 2001] also gives a more realistic


noise power spectral density estimation by changing the computation of the smoothed periodogram, for example by making the smoothing coefficient α adaptive rather than fixed.

Figure 4.7: Plots of a smooth periodogram in blue and its noise power estimation by minima
tracking in red.
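The sketch below illustrates the minimum-statistics idea for a single frequency bin: the periodogram is smoothed with a coefficient α as in equation (4.12), and the noise power is estimated as the minimum of the smoothed values over a sliding window of D frames. The synthetic "speech bursts" and the values of α and D are illustrative only.

rng(0);
nFrames   = 400;
noisePow  = 0.5 * ones(1, nFrames);
speechPow = 4 * (mod(floor((1:nFrames)/40), 2) == 1);          % bursts of "speech"
perio = (noisePow + speechPow) .* (0.5 + rand(1, nFrames));    % fluctuating power in bin k

alpha = 0.9;  D = 80;
P    = zeros(1, nFrames);                     % smoothed periodogram
Nest = zeros(1, nFrames);                     % noise power estimate
P(1) = perio(1);
for t = 2:nFrames
    P(t)    = alpha * P(t-1) + (1 - alpha) * perio(t);  % equation (4.12)
    Nest(t) = min(P(max(1, t-D+1):t));                  % sliding-window minimum
end
plot(1:nFrames, P, 'b', 1:nFrames, Nest, 'r');
legend('smoothed periodogram', 'minimum-statistics noise estimate');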

4.4 Voice activity detector

A voice activity detector (VAD) is a system that determines which frames contain speech and which do not. To achieve this, we may apply a likelihood ratio test between two hypotheses for each frame. Keeping the same notations and assumptions as above, this leads to the two hypotheses:

(4.13)

Then, [Sohn et al., 1999] assumes that the real and imaginary parts of an FFT coefficient can be statistically modelled as Gaussian random variables. With this statistical assumption, the probability densities for each frequency bin k become:


(4.14)

where λ_S[k] and λ_N[k] are the variances of S[k] and N[k] respectively.

Then, it can be easily derived that log L(X) becomes:

(4.15)

Where ξ_k = λ_S[k]/λ_N[k] and γ_k = |X[k]|²/λ_N[k].

λ_N may be estimated according to noise power estimation methods like the one in section 4.3, so we only need to estimate the parameters ξ_k. This may be done by maximising the likelihood: we compute the derivative of the log-likelihood, set it to zero, and obtain:

(4.16)

However, this test is biased. Indeed, γ_k − log γ_k − 1 is always positive, so H0 is more likely to be accepted than H1. [Sohn et al., 1999] proposes a method to reduce this bias; moreover, it presents a more realistic VAD which takes previously made decisions into account for each new decision.

Figure 4.8 shows the result of this VAD applied to a speech signal. The top spectrogram is from the original speech signal and the bottom one is obtained after removing the frames without speech detected by the VAD. Above each spectrogram, the waveform of the signal is plotted. We can see that the blanks in the speech are removed without altering the frequencies and the energy of each frame. More blanks could be removed; however, this may alter the speech more. For LID, a VAD can be useful in reducing the size of the data by removing irrelevant information, so that more data can be processed in the same amount of time.
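A simplified MATLAB sketch of such a detector, under several assumptions: the noise variance per bin is estimated from synthetic noise-only frames, the maximum-likelihood value ξ_k = γ_k − 1 is substituted into the statistic (giving the γ_k − log γ_k − 1 form mentioned above), and a fixed, arbitrary threshold replaces the bias correction and hang-over scheme of [Sohn et al., 1999].

N = 320; fs = 16000;
w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));
noiseRef = 0.1 * randn(N, 50);                                 % noise-only frames
lambdaN  = mean(abs(fft(noiseRef .* repmat(w, 1, 50))).^2, 2); % noise variance per bin

frame = sin(2*pi*250*(0:N-1)'/fs) + 0.1*randn(N, 1);           % frame to classify
gamma = abs(fft(frame .* w)).^2 ./ lambdaN;                    % a posteriori SNR per bin
llr   = gamma - log(gamma) - 1;                                % per-bin statistic (see text)
isSpeech = mean(llr) > 0.5;                                    % fixed, arbitrary threshold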


Figure 4.8: Plots of spectrograms to illustrate the effect of a VAD. Above,


the original speech spectrogram. Below, spectrogram of the same speech
after applying a VAD.

4.5 RASTA filtering

Relative Spectra (RASTA) filtering is a speech enhancement technique used to reduce the channel noise in a speech sample [Hermansky and Morgan, 1994]. RASTA filtering relies on the fact that the spectrogram of speech produced by the vocal tract fluctuates much more than the spectrogram of channel noise, which globally stays stationary. The idea of RASTA filtering is to suppress the slow variations by means of a band-pass filter applied to the trajectory of each spectral component. This removes the nearly constant magnitudes of the short-term spectrum in each spectral component; [Hermansky and Morgan, 1994] give the transfer function of this kind of filter.


CHAPTER 5

LANGUAGE IDENTIFICATION SYSTEMS

Now that the understanding of the formation of human speech and the methods
of speech enhancement have been reviewed, the functioning of LID systems can be
examined.

5.1 Overview

A LID system takes advantage of the characteristics described in section 3.1.2 which distinguish one language from another. The level of difficulty of these characteristics is shown in figure 5.1, adapted from [Tong et al., 2006]. To keep it simple and to stay with systems of reasonable computation time, the LID systems presented below use either acoustic or phonotactic features. The training of each LID system generally follows the same steps, represented in figure 5.2. During pre-treatment, thanks to Hamming windows, the input signal is cut into several overlapping frames of 20 or 30 ms [Zissman, 1996]. These frames are overlapped to compensate for the attenuation at the window edges. Next, features are computed from the frames and then used to train a model for each language. This results in one or more models that will be used during the recognition phase in order to determine the language of the test data.

The next sections present the acoustic features, the phonotactics-based systems and finally how complementary systems can be fused to improve their performance. However, only models from acoustic features were implemented in this project, because of resource and time limitations. Indeed, labelled data for a phonotactics-based system were not available, and model training takes several days on the computers that were at my disposal, whereas multiple models are needed to run relevant experiments on a fused system.


Figure 5.1: Five levels of LID features (adapted from [Tong et al., 2006]).

Figure 5.2: Block diagram of the training part of a LID system.


5.2 ACOUSTIC FEATURES

One acoustic feature dominates the field of LID: Mel-frequency cepstral


coefficients (MFCCs). They have been implemented in many systems and have given satisfying results [Leung et al., 2009, Cookman et al., 2011, Martin et al., 2006, Tong et al., 2006].

However, a better set of features derived from the MFCCs has been discovered; they are called the shifted delta cepstral coefficients (SDC). To prove their efficiency, these new coefficients have been used with a GMM of 2048 Gaussian components [Torres-Carrasquillo et al., 2002]. They have also been used with an SVM classifier [Campbell et al., 2006] with a Generalised Linear Discriminant Sequence (GLDS) kernel. These two classification methods are described in chapter 6. According to these two studies, produced by the same team, the GMM obtained a slightly better score than the SVM at the NIST Language Recognition Evaluation, whose evaluation procedure can be found in [A. Martin, 2005]. Given the good results of these algorithms, the prototype implemented for this project is based on these two previous studies.


The definition and the computation algorithm of the SDCs are presented below,
beginning with the definition and computation algorithm of the MFCCs.

5.2.1 The MFCCs

The computation of the MFCC is illustrated by the block diagram in figure 5.3.

Figure 5.3: Block diagram of the MFCCs computation.

Once the input signal has been windowed, each frame is projected into the frequency domain thanks to the DFT, as described in section 3.2.1. Then a Mel filter-bank is applied to the projected frame. This filter-bank rescales the frame to the Mel scale, which is presented in figure 5.4. The Mel scale represents the frequency a human ear actually perceives for a given sound frequency.

Figure 5.5 shows an example of a Mel filter-bank; each filter (triangle) leads to one coefficient.

Then, we take the logarithm to get coefficients in decibels. Finally the DCT is applied to concentrate the information in the first coefficients, as described in section 3.2.3.

Figure 5.4: Plot of Mel scale.


Figure 5.5: Plot of a Mel filter-bank of 24 filters (crosses indicate centre frequency of each
filter).
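A compact MATLAB sketch of this MFCC pipeline for a single windowed frame, using base functions only. The FFT size, the number of mel filters (24) and the number of cepstral coefficients (13) are common choices rather than values prescribed by this report, and the random frame merely stands in for real windowed speech.

fs = 16000; N = 400;
frame = randn(N, 1);                                   % stand-in for a windowed speech frame
nfft  = 512; nFilt = 24; nCeps = 13;

spec = abs(fft(frame, nfft)).^2;                       % power spectrum
spec = spec(1:nfft/2+1);

mel  = @(f) 2595 * log10(1 + f/700);                   % Hz -> mel
imel = @(m) 700 * (10.^(m/2595) - 1);                  % mel -> Hz
edges = imel(linspace(mel(0), mel(fs/2), nFilt + 2));  % filter edge frequencies (Hz)
bins  = (0:nfft/2) * fs / nfft;                        % frequency of each FFT bin

fbank = zeros(nFilt, nfft/2 + 1);                      % triangular mel filter-bank
for m = 1:nFilt
    lo = edges(m); ce = edges(m+1); hi = edges(m+2);
    up   = (bins - lo) / (ce - lo);
    down = (hi - bins) / (hi - ce);
    fbank(m, :) = max(0, min(up, down));
end

logE = log(fbank * spec + eps);                        % log filter-bank energies
D    = cos(pi/nFilt * (0:nCeps-1)' * ((0:nFilt-1) + 0.5));  % DCT-II basis
mfcc = D * logE;                                       % first nCeps cepstral coefficients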

5.2.2 The SDCs

The SDCs are computed from the MFCCs ci. They have four parameters N-d-P-k, "where N is the number of cepstral coefficients computed at each frame, d represents the time advance and delay for the delta computation, k is the number of blocks whose delta coefficients are concatenated to form the final feature vector, and P is the time shift between consecutive blocks" [Torres-Carrasquillo et al., 2002].

At a frame t, the SDC feature vector is the concatenation of the ∆c(t, i) for i = 0, …, k − 1, where ∆c(t, i) = c(t + iP + d) − c(t + iP − d). Figure 5.6 shows an illustration of this computation.

Figure 5.6: Computation of the SDC feature vector at frame t for parameters N-d-P-k (adapted
from [Torres-Carrasquillo et al., 2002]).


An SDC feature vector at a frame t therefore uses kP consecutive frames of cepstral coefficients. This may explain the accuracy of SDCs in discriminating languages.
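A sketch of this computation in MATLAB; the 7-1-3-7 parameter set is the configuration commonly quoted in the literature and is used here only as an example, and the random matrix stands in for real MFCC trajectories.

N = 7; d = 1; P = 3; k = 7;                    % SDC parameters N-d-P-k
nFrames = 200;
C = randn(N, nFrames);                         % MFCC trajectories, one column per frame

sdc = zeros(N * k, nFrames);
for t = 1:nFrames
    for i = 0:k-1
        idxPlus  = min(nFrames, max(1, t + i*P + d));          % clamp at the signal edges
        idxMinus = min(nFrames, max(1, t + i*P - d));
        sdc(i*N + (1:N), t) = C(:, idxPlus) - C(:, idxMinus);  % delta c(t, i)
    end
end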

5.3 Phonotactics-based systems

Phonotactics-based systems may lead to a better recognition rate than acoustic


based systems. However these kinds of systems need labelled data, which is much more
complicated to find or produce. Therefore such systems are less flexible because, to add
a language to the system, we need labelled data from this language.

To use phonotactics features, a phoneme recogniser needs to be trained. Once


again, MFCCs are generally used to determine the phoneme in a frame. Then, models
for the desired languages have to be created by a generative model of phonetic n-gram
sequence, for example. The phoneme recogniser does not need to be in the same
language as the model; the same phoneme recogniser can be used for all models. Figure
5.7 shows an example of these LID systems.

Figure 5.7: Phoneme Recognition followed by Language Modelling (PRLM).

However, phoneme recognisers in different languages can be used in parallel, with each phoneme recogniser followed by all the language models. A Parallel Phoneme Recognition followed by Language Modelling (PPRLM) system provides state-of-the-art language recognition performance [Zissman, 1996]. Figure 5.8 shows an example of a PPRLM system. In this case the scores from the different models should be fused before making a decision, because the system produces several scores for the same language. The fusion techniques are presented in the next section.


Figure 5.8: Parallel Phoneme Recognition followed by Language Modelling (PPRLM).

5.4 Fused systems

If two or more language recognisers are complementary, then combining them


could increase the language recognition rate. That is why language recogniser fusion
has been developed by researchers. Fusion is also useful to combine the scores of a
PPRLM system [Zissman, 1997]. This section presents two fused systems.

5.4.1 Product-rule fusion

Given a speech utterance, a score vector is computed by each recogniser; this vector contains the scores of all the target languages. To get the final score for each language, one needs to fuse the scores of the different recognisers. For a system with K recognisers and L languages, the score for an input utterance X and a language l is computed from the normalised likelihoods produced by the K recognisers.

Normalisation guarantees that the outputs from the different phoneme recognisers are in a common range, and the product combines them into the final score across all the recognisers. However, this assumes that the individual PRLM systems are independent. Moreover, a near-zero score from a recogniser k may have a big impact on the final score. Despite that, it appears to be an effective way to fuse scores [Zissman, 1996].
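A minimal MATLAB sketch of this product rule under the stated independence assumption; the score matrix is random and purely illustrative, and the per-recogniser normalisation used here (scores scaled to sum to one over the languages) is one simple choice among several.

K = 3;  L = 5;                               % 3 recognisers, 5 target languages
scores = rand(K, L);                         % scores(k, l): score for language l by recogniser k

normScores = scores ./ repmat(sum(scores, 2), 1, L);   % normalise each recogniser's outputs
fused = prod(normScores, 1);                           % product rule across recognisers
[~, best] = max(fused);
fprintf('Fused decision: language %d\n', best);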


5.4.2 Gaussian-based fusion

A Gaussian-based fusion has been proposed by Zissman [Zissman, 1997]. The


principle is as follows: each recogniser produces a score (or it could be a vector of
scores) for each language to be identified. Then Linear Discriminant Analysis is applied
to the score vector to project it into a smaller dimensional space where the languages
are well discriminated. Finally, the projected score vector is processed by a Gaussian
classifier for each language which produces a final score for this language.


CHAPTER-6

GMM MODELLING TECHNIQUES


As described in the previous chapter, LID systems need classification models to
model the languages that have to be identified. This chapter describes classification
methods commonly used in this field.

6.1 Overview

Two very famous classifiers are widely used in LID because of their high performance and easy computation: Support Vector Machines and the Gaussian Mixture Model. The Support Vector Machine (SVM) is a data classification technique. An SVM produces a model from training data which will predict the class of testing examples. The principle is to "find a linear separating hyperplane with the maximal margin in this space" [Hsu et al., 2003]. This linear hyperplane then determines the class of a testing sample according to which side of the boundary it falls on (for a binary classification). However, the input data may not be linearly separable; SVM therefore allows us to map the training data into a higher-dimensional space, where the optimal separating hyperplane can then be computed. The derivation of the SVM equations is explained in detail in [Vapnik, 1998]. The Gaussian Mixture Model (GMM) is an unsupervised classification technique which, applied to LID, tries to model the phonetic sounds of a language by approximating their probability distribution. It is the modelling method retained for this project because it offers a large number of degrees of freedom, is easy to train and, as we saw in the previous chapter, gave better results than the SVM. The GMM is described in more depth in the next section.

6.2. GAUSSIAN MIXTURE MODEL

The Gaussian Mixture Model (GMM) is an unsupervised classification technique. The idea of a GMM is to approximate the probability distribution of the samples from a class C by a linear combination of K Gaussian distributions:

p(x | C) = Σ(i=1..K) π_i N(x; µ_i, Σ_i)    (6.1)


Where the weights have to satisfy the constraint Σ(i=1..K) π_i = 1 and N is the normal distribution:

N(x; µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −(x − µ)ᵀ Σ⁻¹ (x − µ) / 2 )    (6.2)

where D is the dimension of the feature vectors.

Figure 6.1 illustrates three plots of the estimation of a dataset distribution in one dimension by a GMM with two, three and five components. The dataset has been created by the fusion of three normally distributed datasets. The plots show that two components are not enough to estimate this distribution accurately. However, three components seem fine, which is logical given the dataset composition. We can conjecture that the more components a GMM has, the more precisely the data distribution is estimated.

6.2.1 GMM training

The training of a GMM consists in finding the parameters which best estimate the probability distribution of the training data X = (x1, …, xT). These parameters are the weight π_i, the mean µ_i and the covariance matrix Σ_i of each Gaussian distribution (or component) N_i. Let us call Θ the set of parameters of an N-component GMM; this means:

Θ = {π_1, µ_1, Σ_1, …, π_N, µ_N, Σ_N}    (6.3)

There are several ways to train a GMM; some methods can be found in [McLachlan and Peel, 2000].


Figure 6.1: Plots of the estimation (in red) by a GMM of a data distribution in one dimension (in blue) with 2, 3 and 5 components. The components of each GMM are also plotted in green.

The most common training method is to maximise the likelihood by estimating the Gaussians' parameters with the Expectation Maximisation (EM) algorithm proposed by [Dempster et al.]. Given a training data set X = (x1, …, xT) and assuming independence between the xi vectors, the likelihood of the parameters Θ is:

L(Θ) = Π(i=1..T) p(xi | Θ)    (6.4)

Maximising L(Θ) is equivalent to maximising log L(Θ) because the logarithm is strictly increasing. In our case, the logarithm also simplifies the following maths. So we are looking for the Θ̂ which maximises:

log L(Θ) = Σ(i=1..T) log ( Σ(j=1..N) π_j N(xi; µ_j, Σ_j) )    (6.5)

However, this maximisation is non-linear and very complex. That is why the EM algorithm is used. The EM algorithm is an iterative process which builds an estimate of the maximum from the estimate of the previous step. The idea is to estimate which component each sample x_i belongs to, in order to make the maximisation problem simple. So new variables z_ij are introduced: z_ij is equal to 1 if the sample x_i belongs to the component j, and 0 otherwise. The complete log-likelihood then becomes:

(6.6)


Then the algorithm is in two steps at each iteration:

Expectation step

At a given iteration k, the expectation step is to compute the expectation of the


zij according to the Bayes’ rule (this associates each zij to one component):

(6.7)

Maximisation step

The maximisation step finds Θ(k+1) which maximises the function:

(6.8)

This can be solved component by component; the parameters of component j at step k+1 can be computed as follows:

(6.9)

(6.10)

These two steps are repeated until convergence. Convergence is obtained when:

(6.11)
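The following MATLAB sketch implements these two steps for a one-dimensional GMM on synthetic data, so that the expectation and maximisation updates can be seen explicitly. In the project itself a library routine such as gmdistribution.fit would normally be used, and a fixed number of iterations replaces the convergence test of equation (6.11).

rng(1);
x = [randn(300,1) - 3; randn(300,1); 0.5*randn(300,1) + 4];   % data drawn from 3 Gaussians
T = numel(x); K = 3;

ppi  = ones(1, K) / K;                        % initial weights
mu   = [-1 0 1];                              % initial means
sig2 = ones(1, K);                            % initial variances
for iter = 1:100
    % E-step: responsibility z(i,j) = P(component j | sample x_i)
    dens = zeros(T, K);
    for j = 1:K
        dens(:, j) = ppi(j) * exp(-(x - mu(j)).^2 / (2*sig2(j))) / sqrt(2*pi*sig2(j));
    end
    z = dens ./ repmat(sum(dens, 2), 1, K);
    % M-step: re-estimate weights, means and variances
    Nj  = sum(z, 1);
    ppi = Nj / T;
    mu  = (x' * z) ./ Nj;
    for j = 1:K
        sig2(j) = sum(z(:, j) .* (x - mu(j)).^2) / Nj(j);
    end
end
disp([ppi; mu; sig2]);                        % estimated parameters, one component per column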

6.2.2 GMM testing

To score how well a testing utterance vector X = (x1, …, xT) belongs to a class, the likelihood of this vector is computed. For a multi-class classification task, a GMM needs to be built for each desired class in the classifier, so the likelihood is computed for each GMM. Then, a back-end decision system is used to compare the scores and select the most likely language.


CHAPTER 7

CONCLUSION AND FUTURE SCOPE

7.1 CONCLUSION
This dissertation was about automatic spoken language identification. It presented the technical background and the steps necessary for understanding and building an LID system, including relevant features, classification methods and speech enhancement techniques. For the purpose of this project, an LID system was created by implementing some of the techniques described in this report. Because of time limitations, model fusion has not been experimented with, as model training can indeed take several days. Moreover, because of resource limitations, the path of phoneme recognition (which can lead to speech recognition) has not been explored, since labelled data were not available for this purpose. That is why this work concentrated on simple models with acoustic features. The implementation was tested on twelve languages. It gave satisfying results on audiobooks (above 80% on utterances of ten seconds and longer) and can be used to detect the language of a speaker through a microphone thanks to a GUI.
The Indian Institute of Technology Kharagpur Multi-Lingual Indian Language Speech Corpus (IITKGP-MLILSC) is used during the course of the present study. A minimum of ten speakers, including both male and female, is present for each language. From each speaker, 5–10 minutes of data is recorded at a 16 kHz sampling rate with 16 bits per sample, such that a minimum of one hour of data is available for each language. In this study, spectral features extracted from the speech data are used for the task of language identification. The conventional Gaussian mixture modelling technique is used to develop language models for language identification. The accuracy of the LID system depends not only on the feature vector but also on the parameters of the GMM, such as the dimension of the feature vectors, the number of feature vectors and the number of mixture components. In this work, the performance of the language identification system is analysed in the speaker-independent case only, i.e., data from different speakers is used for training and testing the language models. From the entire available data set, speech data from two speakers (1 male and 1 female) is omitted during the process of developing the language models, and this data from the two speakers (who are not involved in the training process) is used to test the LID system.
To analyse the influence of the length of the testing speech sample on the performance of LID, the performance is computed for testing speech samples of various lengths, such as 3 sec, 5 sec and 10 sec. Multiple LID systems are developed by varying the number of mixture components from 8 to 64 in order to analyse the influence of the number of mixture components on the performance of LID. The performance of LID is computed for 50 different test cases from the testing data set and averaged over all the test cases.

7.2 FUTURE SCOPE

 Different residual behavioural features may be extracted in addition to MFCC for speaker verification.

 In this project the GMM modelling technique was considered; in future work many other techniques may be used, such as JFA, i-vectors, etc.


REFERENCES:

1. Ambikairajah, E., Li, H., Wang, L., Yin, B., and Sethu, V.: Language
identification: a tutorial. Circuits and Systems Magazine, IEEE, 11(2),
2011, pp. 82–108.
2. Chelba, C., Silva, J., and Acero, A.: Soft indexing of speech content for search
in spoken documents. Computer Speech & Language, 21(3), 2007,
pp. 458–478.
3. Cimarusti, D., and Ives, R.B.: Development of an Automatic Identification
System of Spoken Languages: Phase 1. Proc. ICASSP82, Vol. 7, May 1982, pp.
1661–1664.
4. Navrátil, J.: Spoken Language Recognition A step Toward Multi-linguality in
Speech Processing, IEEE Trans. Speech Audio Processing, Vol. 9, September
2001, pp. 678–685.
5. Foil, J.T.: Language Identification Using Noisy Speech, Proc. ICASSP86,
April 1986, pp. 861– 864.
6. Thyme-Gobbel, A.E., Hutchins, S.E.: On using prosodic cues in automatic
language identification, International Conference on Spoken Language
Processing, Vol. 3, 1996, pp. 1768–1772.
7. Bhaskararao, P.: Salient phonetic features of Indian languages in speech
technology. Sadhana, 36(5), 2011, pp. 587–599.
8. Schultz, T., Rogina, I., and Waibel, A.: LVCSR- Based Language
Identification. Proc. ICASSP96, May 1996, pp. 781–784.
9. Kadambe, S., Hieronymus, J.: Language identification with phonological and
lexical models. Proc. ICASSP95, Vol. 5, 1995, pp. 3507-3511.
10. Thomas, H.L., Parris, E.S., and Write, J.H.: Recurrent substrings and data
fusion for language recognition. International Conference on Spoken Language
Processing, Vol. 2, 1998, pp. 169-173.
11. Quatieri, T. F.: Discrete-Time Speech Signal Processing: Principles and
Practice. Engle- wood Cliffs, NJ, USA: Prentice-Hall, 2002.
12. Zissman, M. A.: Comparison of four approaches to automatic language
identification of telephone speech. IEEE Transactions on Speech and Audio
Processing, 4(1), 1996, 31.


13. Pellegrino, F., and André-Obrecht, R.: Automatic language identification: an


alternative approach to phonetic modelling. Signal Processing, 80(7), 2000, pp.
1231–1244.
14. Torres-Carrasquillo, P. A., Reynolds, D., and Deller Jr, J. R.: Language
identification using Gaussian mixture model tokenization. In Acoustics,
Speech, and Signal Processing (ICASSP), Vol. 1, IEEE, May 2002, pp. I-757.


APPENDIX A
SOFTWARE REQUIREMENT
OVERVIEW OF MATLAB
MATLAB is a high performance language for technical computing. It integrates
computation visualization and programming in an easy to use environment. MATLAB
stands for matrix laboratory. It was originally written to provide easy access to matrix
software developed by the LINPACK (linear system package) and EISPACK (Eigen
system package) projects. MATLAB is therefore built on a foundation of sophisticated
matrix software in which the basic element is a matrix that does not require pre-
dimensioning.
MATLAB is a programming package specifically designed for quick and easy
scientific calculations and I/O. It has literally hundreds of built in functions for a wide
variety of computations and many toolboxes designed for specific research disciplines,
including statistics, optimization, solution of partial differential equations, data
analysis.
The MATLAB help function and browser can be used to find any additional
features that one may need or want to use. MATLAB is a high-performance language for
technical computing. It integrates computation, visualization, and programming
environment. MATLAB is a modern programming language environment: it has
sophisticated data structures, contains built in editing and debugging tools, and supports
object-oriented programming. These factors make MATLAB an excellent tool for
teaching and research.
MATLAB has many advantages compared to conventional computer languages
for solving technical problems. MATLAB is an interactive system whose basic data
element is an array that does not require dimensioning. It also has easy to use graphics
commands that make the visualization of results immediately available. Specific
applications are collected in packages referred to as toolboxes. There are toolboxes for
signal processing, symbolic computation, control theory, simulation, optimization, and
several other fields of applied science and engineering.
Typical uses of MATLAB
The typical usage areas of MATLAB are


1. Math and computation


2. Algorithm development
3. Data acquisition
4. Data analysis, exploration and visualization
5. Scientific and engineering graphics
MATLAB is an interactive system whose basic data element is an array that does not
require dimensioning. This allows you to solve many technical computing problems,
especially those with matrix and vector formulations, in a fraction of the time it would
take to write a program in a scalar non-interactive language such as C or FORTRAN.
MATLAB features a family of add-on application-specific solutions called
toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and
apply specialized technology. Toolboxes are comprehensive collections of MATLAB
functions (M-files) that extend the MATLAB environment to solve particular classes
of problems. Areas in which toolboxes are available include signal processing, image
processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and
many others.

Features of MATLAB:

The GUI construction


The Graphical User Interface (GUI) is an interactive system that helps to establish
good communication between the user and the program. The functional operation of
a GUI is comparable to that of Applets in JAVA. The MATLAB Toolbox provides
several functions to create GUI main frames.
The GUIs can be created by the GUIDE (Graphical User Interface Development
Environment) which is a package in MATLAB Toolbox. The GUI makes the process
so easy to operate and reduces the risk. GUIDE, the MATLAB Graphical User Interface
development environment, provides a set of tools for creating graphical user interfaces
(GUIs). These tools greatly simplify the process of designing and building GUIs. We
can use the GUIDE tools to develop.
Lay out the GUI: Layout Editor, we can lay out a GUI easily by clicking and dragging
GUI components -- such as panels, buttons, text fields, sliders, menus, and so on -- into
the layout area.


A. Program the GUI: GUIDE automatically generates an M-file that controls how
the GUI operates. The M-file initializes the GUI and contains a framework for all the
GUI callbacks -- the commands that are executed when a user clicks a GUI component.
Using the M-file editor, we can add code to the callbacks to perform the required functions.
B. GUIDE stores a GUI in two files, which are generated the first time we
save or run the GUI:
C. A FIG-file, with extension .fig, which contains a complete description of the
GUI layout and the components of the GUI: push buttons, menus, axes, and so on.
D. An M-file, with extension .m, which contains the code that controls the GUI,
including the callbacks for its components.
E. These two files correspond to the tasks of laying out and programming the GUI.
When we lay out the GUI in the Layout Editor, our work is stored in the FIG-file.
When we program the GUI, our work is stored in the M-file.
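To make this division of labour concrete, the following is a minimal sketch of what one callback in the GUIDE-generated M-file might look like; the component tags fileEdit and signalAxes and the plotted quantity are illustrative assumptions, not taken from the project GUI.

% Callback executed when a push button is pressed (sketch only; the tags
% 'fileEdit' and 'signalAxes' are hypothetical component names).
function plotButton_Callback(hObject, eventdata, handles)
fname = get(handles.fileEdit, 'String');   % file name typed by the user
[y, fs] = wavread(fname);                  % read the speech signal
t = (0:length(y)-1)/fs;                    % time axis in seconds
plot(handles.signalAxes, t, y);            % draw into the GUI axes
xlabel(handles.signalAxes, 'Time (s)');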
The MATLAB Application Program Interface (API)
This is a library that allows you to write C and FORTRAN programs that interact
with MATLAB. It includes facilities for calling routines from MATLAB (dynamic
linking), calling MATLAB as a computational engine, and for reading and writing
MAT-files.
MATLAB Working Environment
MATLAB Desktop
The MATLAB Desktop is the main MATLAB application window. The desktop contains
five sub-windows: the command window, the workspace browser, the current directory
window, the command history window, and one or more figure windows, which are
shown only when the user displays a graphic.
Figure A.2: MATLAB desktop
The command window is where the user types MATLAB commands and expressions
at the prompt (>>) and where the output of those commands is displayed. MATLAB
defines the workspace as the set of variables that the user creates in a work session.
The workspace browser shows these variables and some information about them.
Double-clicking on a variable in the workspace browser launches the Array Editor,
which can be used to obtain information about the variable and, in some instances, to
edit certain of its properties.
The Current Directory tab, above the Workspace tab, shows the contents of the current
directory, whose path is shown in the current directory window. For example, in the
Windows operating system the path might be C:\MATLAB\Work, indicating that the
directory "work" is a subdirectory of the main directory "MATLAB", which is installed
in drive C. Clicking on the arrow in the current directory window shows a list of
recently used paths. Clicking on the button to the right of the window allows the user
to change the current directory.
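The same information can also be obtained from the command line; the commands below are standard MATLAB commands, shown here only as a brief illustration (the directory name is an example):

pwd                       % print the current working directory
cd('C:\MATLAB\Work');     % change the current directory
whos                      % list workspace variables with their size and class
clear y1                  % remove a variable from the workspace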
MATLAB uses a search path to find M-files and other MATLAB-related files, which
are organized in directories in the computer file system. Any file run in MATLAB must
reside in the current directory or in a directory that is on the search path. By default, the
files supplied with MATLAB and the MathWorks toolboxes are included in the search path.
The easiest way to see which directories are on the search path, or to add or modify the
search path, is to select Set Path from the File menu on the desktop and then use the
Set Path dialog box. It is good practice to add any commonly used directories to the
search path to avoid repeatedly having to change the current directory.
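The search path can also be inspected and changed programmatically; a short sketch follows (the added directory is only an example, not a path used in the project):

path                                     % display the current search path
addpath('C:\MATLAB\Work\myfunctions');   % add a directory to the search path
which wavread                            % show where a function is found on the path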
The Command History window contains a record of the commands a user has
entered in the command window, including both current and previous MATLAB
sessions. Previously entered MATLAB commands can be selected and re-executed
from the command history window by right-clicking on a command or sequence of
commands. This action launches a menu from which to select various options in
addition to executing the commands. This is a useful feature when experimenting
with various commands in a work session.
Using the MATLAB Editor to create M-Files
The MATLAB editor is both a text editor specialized for creating M-files and a
graphical MATLAB debugger. The editor can appear in a window by itself, or it can be
a sub-window in the desktop. M-files are denoted by the extension .m, as in filename.m.
The MATLAB editor window has numerous pull-down menus for tasks such as saving,
viewing, and debugging files. Because it performs some simple checks and also uses
colour to differentiate between various elements of code, this text editor is
recommended as the tool of choice for writing and editing M-functions. Typing edit at
the prompt opens the editor, and typing edit filename opens the M-file filename.m in an
editor window, ready for editing. As noted earlier, the file must be in the current
directory, or in a directory in the search path.
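For example (the script name below is purely illustrative):

edit                      % open a blank editor window
edit mfcc_train_script    % open, or offer to create, mfcc_train_script.m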
Getting Help
The principal way to get help online is to use the MATLAB Help Browser, opened as
a separate window either by clicking on the question mark symbol (?) on the desktop
toolbar, or by typing helpbrowser at the prompt in the command window. The Help
Browser is a web browser integrated into the MATLAB desktop that displays
Hypertext Markup Language (HTML) documents. The Help Browser consists of two
panes: the help navigator pane, used to find information, and the display pane, used to
view the information. Self-explanatory tabs in the navigator pane are used to
perform a search.
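Help is also available directly from the command window; the commands below are standard MATLAB commands, shown only as an illustration:

help fft           % print short help text for a function in the command window
doc fft            % open the full reference page in the Help Browser
lookfor gaussian   % search the first help line of all functions for a keyword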
APPENDIX B

SOURCE CODE

Program for MFCC train:
% Extract MFCC features for the training data after removing silence.
% Assumes one subdirectory per language under 'trainwavdata', each holding
% .wav files, and the project feature function mfcc_rasta_delta_pkm_v1.
clear all;
clc;
x1=0;
a=dir('trainwavdata/*');

for i=3:length(a)                       % skip the '.' and '..' entries
    allwav=dir(fullfile('trainwavdata',a(i).name,'*.wav'));

    for j=1:length(allwav)
        fname=fullfile('trainwavdata',a(i).name,allwav(j).name);
        [y fs]=wavread(fname);
        clear fname;

        % Simple energy-based silence removal: keep only 100-sample blocks
        % whose average energy exceeds 5% of the mean signal energy.
        sig=y.*y;
        E=mean(sig);
        Threshold=0.05*E;
        k=1;
        for b=1:100:(length(sig)-100)
            if((sum(sig(b:b+100)))/100 > Threshold)
                dest(k:k+100)=y(b:b+100);
                k=k+100;
            end;
        end;
        clear fs Threshold E sig y;

        dest=dest';

        % Concatenate the speech segments of all files of this language
        if j==1
            x1=dest;
        else
            x1=vertcat(x1,dest);
        end;
        clear dest;
    end;

    % Extract the MFCC-based features for the pooled speech of this language
    % (39-dimensional, matching the GMM training scripts below).
    y1=mfcc_rasta_delta_pkm_v1(x1,8000,13,26,20,10,1,1,2);
    save(fullfile('mfcc_train',a(i).name),'y1');
    clear y1 x1;
end;
Program for MFCC test:

% Extract the MFCC features of the testing data after removing silence.
% The directory layout mirrors the training script: one subdirectory per
% language under 'testwavdata'.
clear all;
clc;
a=dir('testwavdata/*');

for i=3:length(a)                       % skip '.' and '..'
    allwav=dir(fullfile('testwavdata',a(i).name,'*.wav'));

    for j=1:length(allwav)
        fname=fullfile('testwavdata',a(i).name,allwav(j).name);
        [y FS]=wavread(fname);

        % Energy-based silence removal, as in the training script
        sig=y.*y;
        E=mean(sig);
        Threshold=0.05*E;
        k=1;
        for b=1:100:(length(sig)-100)
            if((sum(sig(b:b+100)))/100 > Threshold)
                dest(k:k+100)=y(b:b+100);
                k=k+100;
            end;
        end;
        dest=dest';
        clear FS Threshold E sig y;

        % One feature file is saved per test utterance
        y1=mfcc_rasta_delta_pkm_v1(dest,8000,13,26,20,10,1,1,2);
        mkdir(fullfile('mfcc_test',a(i).name));
        save(fullfile('mfcc_test',a(i).name,regexprep(allwav(j).name,'\.wav$','')),'y1');
        clear y1 dest fname;
    end;
end;
Program for GMM training:

1. (16-component GMMs)

% Train one 16-component diagonal-covariance GMM per language from the
% training features. gmm, gmminit and gmmem are Netlab toolbox functions.
clear all;
clc;
a=dir('mfcc_train');
for i=3:length(a)
    dim=39;                        % feature vector dimension
    centres=16;                    % number of mixture components
    MIX=gmm(dim,centres,'diag');
    load(fullfile('mfcc_train',a(i).name));
    foptions(14)=1;                % limit k-means initialisation to one iteration
    MIX=gmminit(MIX,y1,foptions);
    %MIX.priors
    OPTIONS(1)=1;                  % display the error value during training
    OPTIONS(14)=75;                % maximum of 75 EM iterations
    [MIX,OPTIONS,ERRLOG]=gmmem(MIX,y1,OPTIONS);
    save(fullfile('allcleanmodels_16',a(i).name),'MIX');
    clear y1 MIX;
end;

2. (32-component GMMs)

% Identical to the script above except for the number of mixture
% components and the output directory.
clear all;
clc;
a=dir('mfcc_train');
for i=3:length(a)
    dim=39;
    centres=32;
    MIX=gmm(dim,centres,'diag');
    load(fullfile('mfcc_train',a(i).name));
    foptions(14)=1;
    MIX=gmminit(MIX,y1,foptions);
    %MIX.priors
    OPTIONS(1)=1;
    OPTIONS(14)=75;
    [MIX,OPTIONS,ERRLOG]=gmmem(MIX,y1,OPTIONS);
    save(fullfile('allcleanmodels_32',a(i).name),'MIX');
    clear y1 MIX;
end;
Program for GMM testing:

%% Testing with GMM modelling
% Score every test feature file against every language model and build a
% confusion matrix; the decision is the model with the highest average
% frame log-likelihood.
clear all;
clc;

a=dir('allcleanmodels_16/*.mat');       % one trained GMM per language
b=dir('mfcc_test/*');                   % test feature directories
confmat=zeros(length(a));

for i=3:length(b)                       % skip '.' and '..'
    c=dir(fullfile('mfcc_test',b(i).name,'*.mat'));
    for j=1:length(c)
        load(fullfile('mfcc_test',b(i).name,c(j).name));
        d=zeros(length(a),1);
        for k=1:length(a)
            load(fullfile('allcleanmodels_16',a(k).name));
            d(k)=mean(log(gmmprob(MIX,y1)));    % average log-likelihood (Netlab gmmprob)
            clear MIX;
        end;
        d=d';
        ak=find(d==max(d));                     % index of the best-scoring model
        confmat(i-2,ak)=confmat(i-2,ak)+1;      % row: true language, column: decision
        clear d y1;
    end;
end;

% Identification accuracy: correct decisions (the diagonal of the confusion
% matrix) divided by the total number of test files; the denominator assumes
% two test files per language.
correct=0;
for g=1:length(a)
    correct=correct+confmat(g,g);
end;
percentage=(correct/(2*length(a)))*100;