CHAPTER 1
1.1 INTRODUCTION
The world is becoming increasingly cosmopolitan, and with the exponential
growth of information technologies, human-to-machine communication is becoming
more and more common, in particular voice interaction with a machine. For complete
interaction the computer should recognise human speech; however, some applications
do not need full speech recognition but only identification of the spoken language.
Moreover, this identification may need to be performed relatively quickly, so taking
the time to decode the entire utterance may be a waste of time.
Automatic spoken LID is not new. Research has been carried out since the 1970s, and
LID systems have become more and more complex. The simplest systems use only
acoustic features; then, to increase accuracy, phonotactic features like phoneme
recognition are added. The most complicated systems use word-level approaches and
analyse the syntax of the language. The top performance of these kinds of systems is
high [Zissman, 1996].
However, this top performance is obtained under strict laboratory conditions, and
existing LID techniques need to be significantly improved for real applications. This
project investigates several key issues related to applying LID in real
applications.
In general, there are two main approaches to building LID systems. The first
approach deals only with acoustic features and unlabelled data: the language
identification features are extracted, then the model is trained. The typical
model used is a Gaussian Mixture Model (GMM) [Reynolds, 2008] for each
language to be identified, because of the GMM's high performance and easy
computation. The second approach additionally uses phonotactic information,
typically a phoneme recogniser followed by language models, as described in a later
chapter.
The report is organised as follows. The first chapter presents the challenge of
LID and the necessary background in signal processing. Each subsequent chapter then
describes a step of an LID system, including speech enhancement techniques to reduce
background noise, feature extraction, language modelling methods and different
decision-making techniques according to the purpose of the system.
Finally, the last chapter presents the implementation done for the purpose of this
project and shows its results in a real application.
In a multilingual country like India, there are many places, such as airports,
call centres and service centres, where recognition of multiple languages is required
to provide services.
1.3 MOTIVATION
1.4 OBJECTIVE
The system works in two phases, i.e., a training phase and a testing phase:
1. Take the input signal.
2. Apply feature extraction methods (MFCC, GFCC, PLP, SDC and combinations of
these features).
Chapter 1 presents an overview of the project, its motivation, objectives, goals and
applications. Chapter 2 presents a literature survey for the project; this survey
helped greatly in getting a complete description of the project. Chapter 3 presents
an introduction to speech, which is basic to our domain. Chapter 4 presents speech
enhancement, which includes the signal to noise ratio, the effect of noise on the speech signal, etc.
Chapter 5 presents language identification systems, including the basic block
diagram of a language identification system and a detailed explanation of its blocks.
CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
In the previous chapter, the introduction to the project, the problem statement,
motivation, objective, methodology adopted, tools used and organization of the
project were discussed. In this chapter, the literature survey for the project is discussed.
a viable strategy for variability compensation. One system builds a support vector
machine using a strategy called the GMM supervector.
The total variability model uses the i-vector approach based on Joint Factor Analysis
(JFA) for speaker verification. The JFA model, based on speaker and channel components,
comprises two distinct spaces: the speaker space, characterized by the
eigenvoice matrix V, and the channel space, represented by the eigenchannel
matrix U. In the total variability approach, only a single space is used instead of two,
referred to as the total variability space. The total variability matrix contains the
eigenvectors with the largest eigenvalues of the total variability covariance matrix.
Given an utterance, the new speaker- and channel-dependent GMM supervector is defined as follows:
M = m + Tw
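As a concrete illustration, the relation M = m + Tw is simply an affine map from the low-dimensional i-vector w into the supervector space. The following Python/NumPy sketch uses toy dimensions and random values for m, T and w (a real system would have a supervector of tens of thousands of dimensions and an i-vector of a few hundred):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only: a real system might use a 1024-component
# UBM with 39-dimensional features and a ~400-dimensional total
# variability space.
sv_dim, tv_dim = 12, 3

m = rng.normal(size=sv_dim)             # UBM mean supervector
T = rng.normal(size=(sv_dim, tv_dim))   # total variability matrix
w = rng.normal(size=tv_dim)             # i-vector for one utterance

# Speaker- and channel-dependent supervector: M = m + Tw
M = m + T @ w

assert M.shape == (sv_dim,)
```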
Work based on the i-vector model for speaker recognition has studied how current
factor analysis techniques perform when utterance lengths are significantly reduced.
The problems that short utterances pose for factor analysis approaches will be
investigated in future.
2.3 CONCLUSION
In this chapter, the referenced literature was reviewed and the analysis of the
literature survey for the project was discussed.
CHAPTER 3
INTRODUCTION TO SPEECH
The first step toward completing this project is to understand how human speech is
formed and what the differences between languages are.
Phoneme
A speech sound is produced at the vocal cords by the airflow coming from the
lungs; this sound then resonates in the oral or nasal cavities according to the position of
the mouth's organs. One of the most important models in speech processing treats
the airflow at the vocal cords as a source e[n] and the resonance in the mouth as a
filter h[n]:
x[n] = e[n] ∗ h[n].
The separation between source and filter is a big challenge in speech recognition
and in good feature extraction. However, it is necessary, because some characteristics
(like phoneme classification) depend mostly on the filter.
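The source-filter relation x[n] = e[n] ∗ h[n] can be sketched numerically. In this hypothetical Python/NumPy example, the source is an impulse train (a crude model of voiced glottal excitation) and the filter is a single decaying resonance standing in for the vocal tract; both are assumptions made for illustration:

```python
import numpy as np

fs = 8000                           # sampling rate (Hz), assumed
n = np.arange(200)

# Source e[n]: an impulse train at 100 Hz, a crude model of the
# glottal excitation for a voiced sound.
e = np.zeros_like(n, dtype=float)
e[::fs // 100] = 1.0

# Filter h[n]: a decaying resonance standing in for the vocal tract.
h = (0.9 ** np.arange(50)) * np.cos(2 * np.pi * 800 * np.arange(50) / fs)

# x[n] = e[n] * h[n]  (discrete convolution)
x = np.convolve(e, h)[:len(n)]

# Before the second impulse arrives, the output is just h[n] itself.
assert np.allclose(x[:50], h)
```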
Language Discrimination
Phonetic: Each language has its own set of phonemes, with a specific frequency
of occurrence for each. Even if two languages share the same phoneme, the rules
governing which phonemes may precede or follow it can differ. These rules
are called “phonotactic constraints”.
Prosody: This refers to the rhythm of a language. For some languages, prosodic
characteristics are useful in distinguishing words and meaning. These can be the
stress, the duration of phonemes or the pitch. For example, some languages, like
Mandarin, are tonal, so pitch is highly relevant in this case.
Vocabulary: Obviously, each language has its own words, but not all languages
share the same roots, or at least the construction of words differs.
Syntax: Each language has its own word order, placing words in a sentence
according to their function.
Difficulties
Each speaker has his or her own voice and therefore speaks the language
differently. Indeed, a speaker's vocal anatomy is unique, so the harmonics
produced during speaking are not the same.
Furthermore, two speakers may not pronounce a phoneme in the same way because
of their accents. That is why a language cannot be reduced to one speaker, and LID
systems must be trained on a broad range of speakers. Moreover, the pitch of a male
voice may be very different from the pitch of a female or child's voice, so LID
systems generally distinguish male from female voices. A study of the role of pitch can be
found in chapter six of the book [Huang et al., 2001].
The worst inconvenience for an LID system is environmental noise. Indeed,
it may interfere with the speaker's voice and add irrelevant information, which may
then lead the LID system to make mistakes, as, for example, with background music or
street noise.
In the case of a periodic signal x_T[n] with period T, we can apply the Discrete
Fourier Transform (DFT). It transforms the signal into a sum of T harmonic sinusoids
and represents the signal by T coefficients. The DFT of a periodic signal x_T[n] is defined
as:
X[k] = Σ_{n=0}^{T−1} x_T[n] e^{−j2πkn/T}, k = 0, ..., T−1. (3.3)
Figure 3.2 shows an example of DFT absolute coefficients applied to a speech signal.
The DFT of a periodic signal normally requires T² operations. However, a faster
algorithm called the Fast Fourier Transform (FFT) needs only T log T operations. These
algorithms and their explanations can be found in the book [Brigham, 1988].
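The equivalence between the direct DFT sum and the FFT can be checked numerically. This small Python/NumPy sketch evaluates the T²-operation direct sum on a synthetic two-tone signal and compares it with np.fft.fft:

```python
import numpy as np

# One period of a signal with T samples: two sinusoids at bins 5 and 12.
T = 64
n = np.arange(T)
x = np.sin(2 * np.pi * 5 * n / T) + 0.5 * np.sin(2 * np.pi * 12 * n / T)

# Direct DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/T), T^2 multiply-adds.
k = n.reshape(-1, 1)
dft = (x * np.exp(-2j * np.pi * k * n / T)).sum(axis=1)

# FFT: the same coefficients in O(T log T) operations.
fft = np.fft.fft(x)

assert np.allclose(dft, fft)
```

The magnitude spectrum peaks at bin 5, the stronger of the two components, which is what the DFT plot of a speech frame would show for its dominant frequency.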
Transfer function
The z-transform of a discrete signal x[n] is defined as:
X(z) = Σ_{n=−∞}^{∞} x[n] z^{−n}. (3.4)
The transfer function H(z) of a filter is the ratio of the transforms of its output and input signals:
H(z) = Y(z) / X(z). (3.5)
Moreover, it transpires that H(z) is the z-transform of the impulse response h[n].
Thus, the transform of the output signal can simply be computed by multiplying H(z) by the
transform of the input signal, X(z).
The Discrete Cosine Transform (DCT) is similar to the DFT; it transforms the signal
into a sum of T sinusoids. More on the DCT can be found in the book [Rao and
Yip, 1990]. There are several definitions of the DCT; however, it is most commonly
defined as:
C[k] = Σ_{n=0}^{T−1} x[n] cos(πk(2n+1) / (2T)), k = 0, ..., T−1. (3.6)
The advantage of the DCT is its data compression capacity. Indeed, as a speech
signal is formed by a majority of low frequencies, most of the signal information is
concentrated in the first DCT coefficients. So the signal can be approximated with fewer
coefficients. On this point the DCT performs better than the DFT, because the DFT tries to
simulate a periodic signal which ends at the same amplitude as it begins (this is in
general unlikely for an ordinary signal).
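This compaction property can be illustrated numerically. The Python/NumPy sketch below applies the DCT-II definition above to a smooth, lowpass-like test signal (a decaying exponential, chosen as a stand-in for the low-frequency character of speech) and measures how much energy the first few coefficients capture:

```python
import numpy as np

T = 64
n = np.arange(T)
# A lowpass-like signal: speech energy is concentrated at low frequencies.
x = np.exp(-n / 20.0)

# DCT-II: C[k] = sum_n x[n] * cos(pi * k * (2n + 1) / (2T))
k = n.reshape(-1, 1)
dct = (x * np.cos(np.pi * k * (2 * n + 1) / (2 * T))).sum(axis=1)

# Fraction of the total energy captured by the first 8 of 64 coefficients.
energy = dct ** 2
compaction = energy[:8].sum() / energy.sum()
assert compaction > 0.95
```

For this signal nearly all of the energy falls in the first eight coefficients, so truncating the rest gives a good approximation with an eighth of the data.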
Figure 3.3: Plot of a speech waveform (above) and its DFT (below).
Impulse response
A time-invariant filter is fully represented by its impulse response h[n]. This
means that h[n] is the output signal produced by the filter when the input signal is the
impulse signal δ[n], where δ[n] is expressed as:
δ[n] = 1 if n = 0, and δ[n] = 0 otherwise. (3.8)
The output of the filter for any input x[n] is then given by the convolution sum:
y[n] = Σ_{k=−∞}^{∞} x[k] h[n−k]. (3.9)
CHAPTER 4
SPEECH ENHANCEMENT
Nowadays, with the hugely increasing use of mobile phones with
embedded microphones, an LID system must be prepared to deal with all kinds of
noise that reduce its performance. Indeed, white noise from a telephone channel,
street noise or music may merge with the speech and disturb the information contained in
the speech sample. Noise may have to be removed by pre-processing the signal in
order to obtain good performance in a recognition task. This chapter describes some
speech enhancement techniques.
The signal to noise ratio (SNR) is an indicator that measures the quality of a
signal according to the level of background noise. For an LID task, the SNR is more
commonly called the speech to noise ratio. If the SNR is low, the strength of the noise
may be too high compared to the strength of the speech; in such a case, the speech may
be corrupted and unintelligible, whereas a high SNR means good listenability of the
speech. The SNR is computed as follows:
SNR = 10 log10(P_speech / P_noise) dB, (4.1)
where P_speech and P_noise are the powers of the speech and noise signals respectively.
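Equation (4.1), with the power of a signal taken as its mean squared amplitude, can be sketched as follows. In this Python/NumPy example the "speech" is just a synthetic tone and the noise is white, both assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def snr_db(speech, noise):
    """SNR in dB as in equation (4.1): 10*log10(P_speech / P_noise),
    where the power of x[n] is the mean of x[n]^2."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

# A synthetic "speech" tone and white noise ten times weaker in power.
t = np.arange(16000) / 8000.0
speech = np.sin(2 * np.pi * 440 * t)                  # P_speech = 0.5
noise = rng.normal(scale=np.sqrt(0.05), size=t.shape) # P_noise ~ 0.05

snr = snr_db(speech, noise)
# A power ratio of 10 corresponds to roughly 10 dB.
assert abs(snr - 10.0) < 0.5
```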
(4.2)
(4.3)
where X[k] is the Discrete Fourier Transform of a discrete signal x[n]. Thus the
expression for the power of a signal x[n] becomes
(4.4)
(4.5)
added. White noise is a signal with a flat spectrum, which means that each frequency
has the same magnitude. Figure 4.1 shows the waveform (in the time domain) of
these clean and noisy speech samples. We can see that the mean is unaltered, but the
variance is slightly increased. Figure 4.2 shows the same clean and noisy speech
samples, but this time in the frequency domain, plotting the magnitude of the signals
against frequency. We can see that this time the noise has increased the mean of
the magnitude, and the variance is altered too.
Figure 4.1: Plots of speech signals in time domain. Above, a clean speech signal. Below, same
speech signal but corrupted by uncorrelated white noise.
(4.6)
which can be derived in the frequency domain by computing the Discrete Fourier Transform:
(4.7)
Figure 4.2: Plots of magnitude spectra of speech (in frequency domain). Above, a clean speech
signal. Below, same speech signal but corrupted by uncorrelated white noise.
where the exponent b can be equal to one or two, according to the applied
method: b = 1 for magnitude spectral subtraction and b = 2 for power spectral subtraction.
An estimate of the noise spectrum may be obtained during frames without speech
(as determined by a voice activity detector), so the mean is computed over a subset
of these voiceless frames. Moreover, this noise estimate can be smoothed in
frequency according to the kind of noise [Berouti et al., 1979]. For example, the estimate
for typical white noise may be set to be flat in frequency.
Because the first assumption is not completely true, the estimated magnitude of the
clean speech spectrum may turn out negative, in which case it is set to zero. The clean
speech spectrum estimate becomes:
|Ŝ(k)|^b = |Y(k)|^b − |N̂(k)|^b, (4.8)
|Ŝ(k)|^b = max(|Y(k)|^b − |N̂(k)|^b, 0), (4.9)
where Y(k) is the noisy speech spectrum and N̂(k) the noise spectrum estimate.
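A minimal sketch of the method in Python/NumPy follows: a single frame, no oversubtraction or smoothing, with the noise magnitude estimate taken from a separate speech-free frame, as a voice activity detector would provide. The clean frame here is a synthetic sine, an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def spectral_subtraction(noisy, noise_mag_est, b=1.0):
    """Magnitude (b=1) or power (b=2) spectral subtraction on one frame.
    Negative magnitudes are floored to zero, as in equation (4.9)."""
    spec = np.fft.fft(noisy)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag_b = np.maximum(mag ** b - noise_mag_est ** b, 0.0)
    clean_mag = clean_mag_b ** (1.0 / b)
    # The noisy phase is kept: spectral subtraction only alters magnitudes.
    return np.real(np.fft.ifft(clean_mag * np.exp(1j * phase)))

frame = np.sin(2 * np.pi * np.arange(256) * 8 / 256)  # clean "speech" frame
noise = rng.normal(scale=0.1, size=256)
noisy = frame + noise

# Noise magnitude spectrum estimated from a speech-free frame.
noise_only = rng.normal(scale=0.1, size=256)
noise_est = np.abs(np.fft.fft(noise_only))

enhanced = spectral_subtraction(noisy, noise_est)
# The enhanced frame should be closer to the clean frame than the noisy one.
assert np.mean((enhanced - frame) ** 2) < np.mean((noisy - frame) ** 2)
```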
This new kind of noise is called musical noise, because the frequency bands
responsible for it are very narrow and spaced apart, so they may be perceived as
tones. Figure 4.3 (adapted from [Westall et al., 1998]) illustrates this problem.
Figure 4.3: Illustration of musical noise created by spectral subtraction (adapted from [Westall
et al., 1998]).
(4.10)
Besides that, the method remains the same. The method is described by the block
diagram in Figure 4.4, adapted from [Berouti et al., 1979]. Finally, figure 4.5 shows the
result of spectral subtraction applied to the noise-corrupted speech plotted at the
bottom of figure 4.1. We can see that the signal is cleaner, but not as good as the original
noiseless signal at the top of figure 4.1.
Figure 4.4: Block diagram of the spectral noise subtraction method (adapted from [Berouti et al., 1979]).
Figure 4.5: Plots of speech signals in time domain. Above: a speech signal
corrupted by uncorrelated white noise. Below: result of spectral subtraction
applied to the noisy speech above.
(4.11)
(4.12)
and the valleys are frames without voice (and so contain only noise). To obtain a noise
power estimate, the idea is to track minima over a sliding window. Ideally, the size of
this sliding window should be as wide as the number of frames between two valleys.
Figure 4.6: Plots of a periodogram (above) and its smooth version (below).
Figure 4.7 illustrates this noise power estimation method; the sliding window has been
made small for the purposes of the example. However, this method suffers from several
issues, as [Martin, 2001] describes. First, wide peaks in the smoothed signal power may
be longer than the sliding window, so the estimate may incorporate speech parts. Second,
the estimate is biased toward lower values: a single frame with very low power is
enough for the estimate to take this value over the whole length of the sliding window.
Finally, if the noise increases suddenly, the noise power estimate will lag by the length
of the sliding window. [Martin, 2001] also gives a more realistic
Figure 4.7: Plots of a smooth periodogram in blue and its noise power estimation by minima
tracking in red.
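The minima-tracking idea can be sketched as follows in Python/NumPy. This is a simplified version on synthetic frame powers, without the bias compensation that [Martin, 2001] discusses:

```python
import numpy as np

def min_tracking_noise_estimate(smoothed_power, window):
    """Noise power estimate obtained by tracking the minimum of the
    smoothed frame-power sequence over a sliding window (simplified:
    no bias compensation)."""
    est = np.empty_like(smoothed_power)
    for t in range(len(smoothed_power)):
        start = max(0, t - window + 1)
        est[t] = smoothed_power[start:t + 1].min()
    return est

# Frame powers: a constant noise floor of 1.0 with "speech peaks" on top.
power = np.ones(100)
power[20:35] += 5.0     # a speech burst, 15 frames long
power[60:80] += 3.0     # another burst, 20 frames long

est = min_tracking_noise_estimate(power, window=25)
# As long as each burst is shorter than the window, the estimate stays
# at the noise floor instead of following the speech peaks.
assert np.allclose(est, 1.0)
```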
(4.13)
[Sohn et al., 1999] then assumes that the real and imaginary parts of an FFT
coefficient can be statistically represented as Gaussian random variables. Under this
statistical assumption, the probability densities for each frequency bin k
become:
(4.14)
(4.15)
λ_N may be estimated using noise power estimation methods like the one in
section 4.3, so we only need to estimate the parameters ξ_k. This may be done by
maximising the likelihood: we take the derivative of the log-likelihood, set it to zero, and obtain:
(4.16)
this bias; moreover, it presents a more realistic VAD which considers previously made
decisions for each new decision.
Figure 4.8 shows the result of this VAD applied to a speech signal. The top
spectrogram is from the original speech signal and the bottom one is obtained after
removing the frames without speech, as detected by the VAD. Above each spectrogram,
the waveform of the signal is plotted. We can see that the blanks in the speech are
removed without altering the frequencies or the energy of each frame. More blanks
could be removed; however, this might alter the speech more. For LID, a VAD can be
useful in reducing the size of the data by removing irrelevant information, so that more
data could be used
CHAPTER 5
LANGUAGE IDENTIFICATION SYSTEMS
Now that the formation of human speech and the methods of speech enhancement
have been reviewed, the functioning of LID systems can be examined.
5.1 Overview
The next sections present the acoustic features, the phonotactics-based system
and finally how complementary systems can be fused to improve their performance.
However, only models based on acoustic features were implemented in this project,
because of resource and time limitations. Indeed, labelled data for a phonotactics-based
system were not available, and training a model takes several days on the computers
that were at my disposal, whereas multiple models are needed to run relevant
experiments on a fused system.
Figure 5.1: Five levels of LID features (adapted from [Tong et al., 2006]).
However, a better set of features derived from the MFCCs has been discovered;
they are called the shifted delta cepstral coefficients (SDC). To prove their efficiency,
these coefficients have been used with a GMM of 2048 Gaussian components
[Torres-Carrasquillo et al., 2002]. They have also been used with an SVM classifier
with a Generalised Linear Discriminant Sequence (GLDS) kernel [Campbell et al., 2006].
These two classification methods are described in chapter 6. According to
these two studies, produced by the same team, the GMM achieved a slightly better score
than the SVM at the NIST Language Recognition Evaluation, whose evaluation procedure
can be found in [A. Martin, 2005]. Given the good results of these algorithms,
the prototype implemented for this project is based on these two previous studies.
The definition and the computation algorithm of the SDCs are presented below,
beginning with the definition and computation algorithm of the MFCCs.
The computation of the MFCCs is illustrated by the block diagram in figure 5.3.
Once the input signal has been windowed, each frame is projected into the
frequency domain by the DFT, as described in section 2.2.2. Then a Mel filter-bank
is applied to the projected frame. This filter-bank rescales the frame on the Mel
scale, which is presented in figure 5.4. The Mel scale represents the frequency a
human ear actually perceives for a given sound frequency.
Figure 5.5 shows an example of a Mel filter-bank; each filter (triangle) leads
to one coefficient.
Then the logarithm is applied to get coefficients in decibels. Finally, the DCT is
applied to concentrate the information in the first coefficients, as described in section
2.2.3.
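The pipeline just described (window, DFT, Mel filter-bank, log, DCT) can be sketched for a single frame. This Python/NumPy sketch makes several simplifying assumptions not fixed by the text: a Hamming window, 24 triangular filters, 13 retained coefficients, and no liftering:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=24, n_ceps=13):
    """MFCCs of one frame: window -> DFT -> Mel filter-bank -> log -> DCT,
    keeping the first n_ceps coefficients."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft)))

    # Triangular filters with centres equally spaced on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Log of the filter-bank energies (small floor avoids log of zero).
    log_energies = np.log(fbank @ spectrum + 1e-10)

    # DCT to compact the information in the first coefficients.
    k = np.arange(n_ceps).reshape(-1, 1)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct @ log_energies

fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 500 * t)    # a synthetic 500 Hz frame
ceps = mfcc_frame(frame, fs)
assert ceps.shape == (13,)
```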
Figure 5.5: Plot of a Mel filter-bank of 24 filters (crosses indicate centre frequency of each
filter).
The SDCs are computed from the MFCCs c_i. They have four parameters N-d-P-k,
“where N is the number of cepstral coefficients computed at each frame, d represents
the time advance and delay for the delta computation, k is the number of blocks whose
delta coefficients are concatenated to form the final feature vector, and P is the time
shift between consecutive blocks” [Torres-Carrasquillo et al., 2002], where
Δc(t, i) = c(t + iP + d) − c(t + iP − d). Figure 5.6 shows an illustration of this computation.
Figure 5.6: Computation of the SDC feature vector at frame t for parameters N-d-P-k (adapted
from [Torres-Carrasquillo et al., 2002]).
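Given the definition Δc(t, i) = c(t + iP + d) − c(t + iP − d), the SDC vector for frame t concatenates k such delta blocks. A Python/NumPy sketch follows, using the common 7-1-3-7 configuration from [Torres-Carrasquillo et al., 2002] as the default; the random MFCC matrix is just placeholder input:

```python
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted delta cepstra with parameters N-d-P-k.
    cepstra: array of shape (num_frames, >=N) of MFCCs.
    For frame t, the feature vector stacks, for i = 0..k-1:
        delta_c(t, i) = c(t + i*P + d) - c(t + i*P - d)
    Frames too close to either end are dropped."""
    c = cepstra[:, :N]
    T = len(c)
    last = T - ((k - 1) * P + d)    # last frame with a full SDC vector
    feats = []
    for t in range(d, last):
        blocks = [c[t + i * P + d] - c[t + i * P - d] for i in range(k)]
        feats.append(np.concatenate(blocks))
    return np.array(feats)

mfccs = np.random.default_rng(3).normal(size=(100, 13))  # placeholder MFCCs
feats = sdc(mfccs)
# 7-1-3-7 stacks k = 7 blocks of N = 7 deltas: 49-dimensional vectors.
assert feats.shape == (80, 49)
```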
However, phoneme recognisers in different languages can be used in parallel, with each
phoneme recogniser followed by all the language models. A parallel Phoneme
Recognition followed by Language Modelling (PPRLM) system provides state-of-the-
art language recognition performance [Zissman, 1996]. Figure 4.8 shows an example
of a PPRLM system. In this case the scores from the different models should be fused
before making a decision, because the system produces several scores for the same language.
The fusion techniques are presented in the next section.
Given a speech utterance, a score vector is computed by each recogniser; this
vector contains the scores of all the target languages. To get the final score for each
language, one needs to fuse the scores of all the recognisers. For a system with K
recognisers and L languages, the score for an input utterance X and a language l is
computed from the likelihoods.
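One simple fusion rule, shown here as an assumption rather than the report's exact formula, is to average the log-likelihood scores of the K recognisers for each language (which corresponds to a geometric mean of the likelihoods) and then pick the language with the highest fused score:

```python
import numpy as np

# Hypothetical log-likelihood score vectors from K = 3 recognisers
# for L = 4 target languages (rows: recognisers, columns: languages).
scores = np.array([
    [-10.2, -12.5, -9.8, -14.0],
    [-11.0, -12.0, -9.5, -13.2],
    [-10.5, -13.1, -9.9, -13.8],
])

# Fuse by averaging the log-likelihoods over the K recognisers.
fused = scores.mean(axis=0)
decision = int(np.argmax(fused))
# Language index 2 has the highest score under every recogniser here.
assert decision == 2
```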
CHAPTER 6
6.1 Overview
Two famous classifiers are widely used in LID because of their high
performance and easy computation: the Support Vector Machine and the Gaussian
Mixture Model. The Support Vector Machine (SVM) is a data classification technique. An
SVM produces a model from training data which predicts the class of test
examples. The principle is to “find a linear separating hyperplane with the maximal
margin in this space” [Hsu et al., 2003]. This hyperplane then determines the
class of a test sample according to which side of the boundary it falls on
(for binary classification). However, the input data may not be linearly separable; SVMs
therefore allow the training data to be mapped into a higher-dimensional space, in which
the optimal separating hyperplane can be computed. The derivation of the SVM
equations is explained in detail in the book [Vapnik, 1998]. The Gaussian Mixture
Model (GMM) is an unsupervised classification technique which, applied to LID, tries
to model the phonetic sounds of a language by approximating their probability
distribution. It is the modelling method retained for this project because it offers a large
number of degrees of freedom, is easy to train and, as we saw in the previous chapter,
gave better results than the SVM. The GMM is described in more depth in the next section.
(6.1)
distribution:
p(x | Θ) = Σ_{i=1}^{N} π_i N(x; µ_i, Σ_i). (6.2)
Figure 6.1 shows three plots of the estimation of a dataset's distribution in one
dimension by a GMM with two, three and five components. The dataset was
created by merging three normally distributed datasets. The plots show that two
components are not enough to estimate this distribution accurately. However, three
components seem fine, which is logical given the dataset's composition. We can
conjecture that the more components a GMM has, the more precisely the data
distribution is estimated.
Training a GMM consists of finding the parameters which best estimate the
probability distribution of the training data X = (x_1, ..., x_T). These
parameters are the weight π_i, the mean µ_i and the covariance matrix Σ_i of each
Gaussian distribution (or component) N_i. Let Θ denote the parameters of an
N-component GMM, that is:
Θ = {π_i, µ_i, Σ_i}, i = 1, ..., N. (6.3)
There are several ways to train a GMM, some methods can be found in
[McLachlan and Peel, 2000].
Figure 6.1: Plots of the estimation (in red) by a GMM of a data distribution in one
dimension (in blue) with 2, 3 and 5 components. The components of each GMM are
also plotted in green.
The most common training method is to maximise the likelihood by estimating the
Gaussians' parameters according to the Expectation-Maximisation (EM) algorithm:
L(X | Θ) = Π_{i=1}^{T} p(x_i | Θ), (6.4)
log L(X | Θ) = Σ_{i=1}^{T} log Σ_{j=1}^{N} π_j N(x_i; µ_j, Σ_j). (6.5)
However, this maximisation is non-linear and very complex; that is why the
EM algorithm is used. The EM algorithm is an iterative process which builds an
estimate of the maximum from the estimate of the previous step. The idea is
to estimate which component each sample x_i belongs to, in order to make the
maximisation problem simple. New variables z_ij are therefore introduced, where z_ij
is equal to 1 if the sample x_i belongs to component j, and 0 otherwise. The complete
log-likelihood then becomes:
log L(Θ) = Σ_{i=1}^{T} Σ_{j=1}^{N} z_ij [log π_j + log N(x_i; µ_j, Σ_j)]. (6.6)
Expectation step: compute the expected value of z_ij given the current parameters,
i.e. the responsibility of component j for sample x_i:
γ_ij = π_j N(x_i; µ_j, Σ_j) / Σ_{k=1}^{N} π_k N(x_i; µ_k, Σ_k). (6.7)
Maximisation step: re-estimate the parameters from the responsibilities:
π_j = (1/T) Σ_{i=1}^{T} γ_ij, (6.8)
µ_j = Σ_{i=1}^{T} γ_ij x_i / Σ_{i=1}^{T} γ_ij, (6.9)
Σ_j = Σ_{i=1}^{T} γ_ij (x_i − µ_j)(x_i − µ_j)^T / Σ_{i=1}^{T} γ_ij. (6.10)
These two steps are repeated until convergence. Convergence is obtained when the
log-likelihood stops increasing significantly:
|log L(Θ^(t)) − log L(Θ^(t−1))| < ε. (6.11)
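The E and M steps above can be sketched for a one-dimensional GMM with two components. This Python/NumPy sketch uses synthetic data, scalar variances and a fixed iteration cap, all assumptions made to keep the illustration short (the Figure 6.1 experiment uses three components):

```python
import numpy as np

rng = np.random.default_rng(4)

# 1-D data drawn from two well-separated Gaussians.
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 300)])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Initial parameters Theta = {pi_i, mu_i, var_i}.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

prev_ll = -np.inf
for _ in range(200):
    # Expectation step (6.7): responsibilities gamma_ij.
    dens = pi * normal_pdf(x[:, None], mu, var)       # shape (T, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # Maximisation step (6.8)-(6.10): weights, means, variances.
    nk = gamma.sum(axis=0)
    pi = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Convergence test (6.11) on the log-likelihood.
    ll = np.log(dens.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

# The estimated means should recover the true component means.
assert abs(min(mu) - (-2.0)) < 0.2 and abs(max(mu) - 3.0) < 0.2
```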
CHAPTER 7
7.1 CONCLUSION
This dissertation was about automatic spoken language identification. It presented
the technical background and the steps necessary for understanding and building an LID
system, including relevant features, classification methods and speech enhancement
techniques. For the purpose of this project, an LID system was created by implementing
some of the techniques described in this report. Because of time limitations, model
fusion was not experimented with, as model training can indeed take several days.
Moreover, because of resource limitations, the path of phoneme recognition (which can
lead to speech recognition) was not explored, as labelled data were not
available for this purpose. That is why this work has concentrated on simple models
with acoustic features. The implementation was tested on twelve languages. It gave
satisfying results on audiobooks (above 80% on utterances of ten seconds and longer) and
can be used to detect the language of a speaker through a microphone thanks to a GUI.
The Indian Institute of Technology Kharagpur Multilingual Indian Language Speech
Corpus (IITKGP-MLILSC) is used during the course of the present study. A minimum
of ten speakers, including both male and female, are present in each language. From each
speaker, 5–10 minutes of data are recorded at a 16 kHz sampling rate with 16 bits per
sample, such that a minimum of one hour of data is available for each language. In this
study, spectral features extracted from the speech data are used for the task of language
identification. The conventional Gaussian mixture modelling technique is used to develop
language models for language identification. The accuracy of the LID system depends
not only on the feature vectors but also on the parameters of the GMM, such as the
dimension of the feature vectors, the number of feature vectors and the number of mixture
components. In this work, the performance of the language identification system is analysed in
the speaker-independent case only, i.e., data from different speakers are used for training and
testing the language models. From the entire available data set, speech data from two
speakers (1 male and 1 female) are omitted during the process of developing the language
models, and the data from these two speakers (who are not involved in the training process)
are used to test the LID system. For analysing the influence of the length of the test
speech sample on the performance of LID, the performance is computed for test speech
samples of various lengths, such as 3 sec, 5 sec and 10 sec. Multiple LID systems are
developed by varying the number of mixture components from 8 to 64, to analyse the
influence of the number of mixture components on the performance of LID. The
performance of LID is computed for 50 different test cases from the testing data set, and
the average over all the test cases is reported.
REFERENCES:
1. Ambikairajah, E., Li, H., Wang, L., Yin, B., and Sethu, V.: Language
identification: a tutorial. Circuits and Systems Magazine, IEEE, 11(2),
2011, pp. 82–108.
2. Chelba, C., Silva, J., and Acero, A.: Soft indexing of speech content for search
in spoken documents. Computer Speech & Language, 21(3), 2007,
pp. 458–478.
3. Cimarusti, D., and Ives, R.B.: Development of an Automatic Identification
System of Spoken Languages: Phase 1. Proc. ICASSP82, Vol. 7, May 1982, pp.
1661–1664.
4. Navrátil, J.: Spoken Language Recognition: A Step Toward Multilinguality in
Speech Processing. IEEE Trans. Speech Audio Processing, Vol. 9, September
2001, pp. 678–685.
5. Foil, J.T.: Language Identification Using Noisy Speech, Proc. ICASSP86,
April 1986, pp. 861– 864.
6. Thyme-Gobbel, A.E., Hutchins, S.E.: On using prosodic cues in automatic
language identification, International Conference on Spoken Language
Processing, Vol. 3, 1996, pp. 1768–1772.
7. Bhaskararao, P.: Salient phonetic features of Indian languages in speech
technology. Sadhana, 36(5), 2011, pp. 587–599.
8. Schultz, T., Rogina, I., and Waibel, A.: LVCSR- Based Language
Identification. Proc. ICASSP96, May 1996, pp. 781–784.
9. Kadambe, S., Hieronymus, J.: Language identification with phonological and
lexical models. Proc. ICASSP95, Vol. 5, 1995, pp. 3507-3511.
10. Thomas, H.L., Parris, E.S., and Write, J.H.: Re- current substrings and data
fusion for language recognition. International Conference on Spoken Language
Processing, Vol. 2, 1998, pp. 169- 173.
11. Quatieri, T. F.: Discrete-Time Speech Signal Processing: Principles and
Practice. Engle- wood Cliffs, NJ, USA: Prentice-Hall, 2002.
12. Zissman, M.A.: Comparison of four approaches to automatic language
identification of telephone speech. IEEE Transactions on Speech and Audio
Processing, 4(1), 1996, pp. 31–44.
APPENDIX A
SOFTWARE REQUIREMENT
OVERVIEW OF MATLAB
MATLAB is a high-performance language for technical computing. It integrates
computation, visualization and programming in an easy-to-use environment. MATLAB
stands for "matrix laboratory". It was written originally to provide easy access to matrix
software developed by the LINPACK (linear system package) and EISPACK (eigensystem
package) projects. MATLAB is therefore built on a foundation of sophisticated
matrix software in which the basic element is a matrix that does not require
pre-dimensioning.
MATLAB is a programming package specifically designed for quick and easy
scientific calculations and I/O. It has literally hundreds of built in functions for a wide
variety of computations and many toolboxes designed for specific research disciplines,
including statistics, optimization, solution of partial differential equations, data
analysis.
The MATLAB help function and help browser can be used to find any additional
features one may need or want to use. MATLAB is a high-performance language for
technical computing that integrates computation, visualization, and a programming
environment. MATLAB is a modern programming language environment: it has
sophisticated data structures, contains built-in editing and debugging tools, and supports
object-oriented programming. These factors make MATLAB an excellent tool for
teaching and research.
MATLAB has many advantages compared to conventional computer languages
for solving technical problems. MATLAB is an interactive system whose basic data
element is an array that does not require dimensioning. It also has easy-to-use graphics
commands that make the visualization of results immediately available. Specific
applications are collected in packages referred to as toolboxes. There are toolboxes for
signal processing, symbolic computation, control theory, simulation, optimization, and
several other fields of applied science and engineering.
Typical uses of MATLAB
The typical usage areas of MATLAB are
Features of MATLAB:
A. Program the GUI: GUIDE automatically generates an M-file that controls how the GUI operates. The M-file initializes the GUI and contains a framework for all the GUI callbacks, i.e. the commands that are executed when a user clicks a GUI component. Using the M-file editor, we can add code to the callbacks to perform the required functions.
B. GUIDE stores a GUI in two files, which are generated the first time we save or run the GUI:
C. A FIG-file, with extension .fig, which contains a complete description of the GUI layout and the components of the GUI: push buttons, menus, axes, and so on.
D. An M-file, with extension .m, which contains the code that controls the GUI, including the callbacks for its components.
E. These two files correspond to the tasks of laying out and programming the GUI. When we lay out the GUI in the Layout Editor, our work is stored in the FIG-file. When we program the GUI, our work is stored in the M-file.
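A callback added to the GUIDE-generated M-file might look like the following sketch. The component names (`pushbutton1`, `edit1`, `text1`) are hypothetical; GUIDE generates the function stub and we fill in the body:

```matlab
% Hypothetical GUIDE callback: executed when the user clicks pushbutton1.
% GUIDE generates this function stub in the GUI's M-file; the body is ours.
function pushbutton1_Callback(hObject, eventdata, handles)
% hObject - handle of the component that triggered the callback
% handles - structure with handles to all components of the GUI
x = str2double(get(handles.edit1, 'String'));   % read a number typed by the user
set(handles.text1, 'String', num2str(x^2));     % display its square in a text field
```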
The MATLAB Application Program Interface (API)
This is a library that allows you to write C and FORTRAN programs that interact
with MATLAB. It includes facilities for calling routines from MATLAB (dynamic
linking), calling MATLAB as a computational engine, and for reading and writing
MAT-files.
MATLAB Working Environment
MATLAB Desktop
The MATLAB Desktop is the main MATLAB application window. The desktop contains five sub-windows: the Command Window, the Workspace Browser, the Current Directory window, the Command History window, and one or more Figure Windows, which are shown only when the user displays a graphic.
All files supplied with MATLAB and MathWorks toolboxes are included in the search path. The easiest way to see which directories are on the search path, or to add or modify a search path, is to select Set Path from the File menu on the desktop and then use the Set Path dialog box. It is good practice to add any commonly used directories to the search path to avoid repeatedly having to change the current directory.
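The search path can also be inspected and changed from the command line, as in the following sketch (the directory name is illustrative):

```matlab
% Inspect and modify the MATLAB search path from the Command Window
path                           % list every directory currently on the search path
addpath('C:\mywork\mfiles')    % prepend a directory (illustrative path)
which fullfile                 % show which file on the path defines a function
savepath                       % make the change persist across sessions
```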
The Command History window contains a record of the commands a user has entered in the Command Window, including both current and previous MATLAB sessions. Previously entered MATLAB commands can be selected and re-executed from the Command History window by right-clicking on a command or sequence of commands; this launches a menu offering various options in addition to executing the commands. This is a useful feature when experimenting with various commands in a work session.
Using the MATLAB Editor to create M-files
The MATLAB editor is both a text editor specialized for creating M-files and a graphical MATLAB debugger. The editor can appear in a window by itself, or as a sub-window in the desktop. M-files are denoted by the extension .m. The MATLAB editor window has numerous pull-down menus for tasks such as saving, viewing, and debugging files. Because it performs some simple checks and also uses colour to differentiate between various elements of code, this text editor is recommended as the tool of choice for writing and editing M-functions. Typing edit filename at the prompt opens the M-file filename.m in an editor window, ready for editing. As noted earlier, the file must be in the current directory or in a directory on the search path.
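A minimal M-file function created in the editor might look as follows; the file name and function are illustrative, and the file would be saved as addtwo.m in the current directory or on the search path:

```matlab
% Minimal M-file function, saved as addtwo.m (illustrative example).
function s = addtwo(a, b)
%ADDTWO Return the sum of two inputs.
%   The first comment block becomes the function's help text.
s = a + b;
```

Calling `addtwo(2, 3)` at the prompt then returns 5, and `help addtwo` displays the comment block.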
Getting Help
The principal way to get help online is to use the MATLAB Help Browser, opened as a separate window either by clicking the question-mark symbol (?) on the desktop toolbar or by typing helpbrowser at the prompt in the Command Window. The Help Browser is a web browser integrated into the MATLAB desktop that displays Hypertext Markup Language (HTML) documents. The Help Browser consists of two panes: the Help Navigator pane, used to find information, and the display pane, used to view the information. Self-explanatory tabs in the Help Navigator pane are used to perform a search.
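Help is also available directly from the Command Window, as this short sketch shows:

```matlab
% Getting help from the Command Window
help mean        % print the short help text for a function
doc mean         % open the same topic in the Help Browser
lookfor average  % search the first help line of all M-files for a keyword
```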
APPENDIX B
SOURCE CODE
% Extract the MFCC features of the training data after removing silence
clear all;
clc;
x1=0;
a=dir('trainwavdata/*');                % one sub-directory per language
for i=3:length(a)                       % skip the '.' and '..' entries
    allwav=dir(fullfile('trainwavdata',a(i).name,'*.wav'));
    for j=1:length(allwav)
        fname=fullfile('trainwavdata',a(i).name,allwav(j).name);
        [y,fs]=wavread(fname);
        clear fname;
        sig=y.*y;                       % sample-wise energy
        E=mean(sig);                    % average energy of the utterance
        Threshold=0.05*E;               % frames below 5% of mean energy are treated as silence
        k=1;
        for b=1:100:(length(sig)-100)
            if (sum(sig(b:b+99))/100 > Threshold)   % keep only high-energy frames
                dest(k:k+99)=y(b:b+99);
                k=k+100;
            end;
        end;
        clear fs Threshold E sig y;
        dest=dest';
        if j==1
            x1=dest;
        else
            x1=vertcat(x1,dest);        % concatenate speech from all files of this language
        end;
        clear dest;
    end;
    % 39-dimensional MFCC (with RASTA and delta) features for this language
    y1=mfcc_rasta_delta_pkm_v1(x1,8000,13,26,20,10,1,1,2);
    save(fullfile('mfcc_train',a(i).name),'y1');
    clear y1 x1;
end;
% Extract the MFCC features of the testing data after removing silence
clear all;
clc;
a=dir('testwavdata/*');
for i=3:length(a)
    allwav=dir(fullfile('testwavdata',a(i).name,'*.wav'));
    for j=1:length(allwav)
        fname=fullfile('testwavdata',a(i).name,allwav(j).name);
        [y,fs]=wavread(fname);
        sig=y.*y;
        E=mean(sig);
        Threshold=0.05*E;
        k=1;
        for b=1:100:(length(sig)-100)
            if (sum(sig(b:b+99))/100 > Threshold)
                dest(k:k+99)=y(b:b+99);
                k=k+100;
            end;
        end;
        dest=dest';
        clear fs Threshold E sig y;
        % The remainder of this script was truncated in the original; the lines
        % below are reconstructed to mirror the training loop, saving one
        % feature file per test utterance (as the evaluation script expects).
        y1=mfcc_rasta_delta_pkm_v1(dest,8000,13,26,20,10,1,1,2);
        save(fullfile('mfcc_test',a(i).name,allwav(j).name),'y1');
        clear dest y1;
    end;
end;
% Script 1: train a 16-mixture GMM per language (Netlab toolbox)
clear all;
clc;
a=dir('mfcc_train');
for i=3:length(a)
    dim=39;                          % feature dimension (13 MFCCs + deltas + delta-deltas)
    centres=16;                      % number of Gaussian mixture components
    MIX=gmm(dim,centres,'diag');     % Netlab GMM with diagonal covariances
    load(fullfile('mfcc_train',a(i).name));   % loads the feature matrix y1
    foptions(14)=1;                  % one iteration of k-means for initialization
    MIX=gmminit(MIX,y1,foptions);
    %MIX.priors
    OPTIONS(1)=1;                    % display error values during training
    OPTIONS(14)=75;                  % at most 75 EM iterations
    [MIX,OPTIONS,ERRLOG]=gmmem(MIX,y1,OPTIONS);
    save(fullfile('allcleanmodels_16',a(i).name),'MIX');
    clear y1 MIX;
end;
% Script 2: train a 32-mixture GMM per language (identical to Script 1 except for the number of centres)
clear all;
clc;
a=dir('mfcc_train');
for i=3:length(a)
    dim=39;
    centres=32;                      % 32 mixture components this time
    MIX=gmm(dim,centres,'diag');
    load(fullfile('mfcc_train',a(i).name));
    foptions(14)=1;
    MIX=gmminit(MIX,y1,foptions);
    %MIX.priors
    OPTIONS(1)=1;
    OPTIONS(14)=75;
    [MIX,OPTIONS,ERRLOG]=gmmem(MIX,y1,OPTIONS);
    save(fullfile('allcleanmodels_32',a(i).name),'MIX');
    clear y1 MIX;
end;
% Evaluation: score each test utterance against every 16-mixture language model
a=dir('allcleanmodels_16/*.mat');
b=dir('mfcc_test/*');
confmat=zeros(length(a));            % confusion matrix: rows = true language, columns = recognized language
for i=3:length(b)                    % skip the '.' and '..' entries
    c=dir(fullfile('mfcc_test',b(i).name,'*.mat'));
    for j=1:length(c)
        load(fullfile('mfcc_test',b(i).name,c(j).name));   % loads y1
        d=zeros(length(a),1);
        for k=1:length(a)
            load(fullfile('allcleanmodels_16',a(k).name)); % loads MIX
            d(k)=mean(log(gmmprob(MIX,y1)));               % average log-likelihood under this model
            clear MIX;
        end;
        d=d';
        ak=find(d==max(d));          % the model with the highest likelihood wins
        confmat(i-2,ak)=confmat(i-2,ak)+1;
        clear d y1;
    end;
end;
correct=0;                           % count the diagonal (avoid shadowing the built-in sum)
for g=1:length(a)
    correct=correct+confmat(g,g);
end;
% accuracy, assuming two test utterances per language
percentage=(correct/(2*length(a)))*100;