
A crash course on Automatic Speech Recognition and using CMU Sphinx to build small ASRs
By
Shyam.k
Swathanthra Malayalam Computing

Introduction
Why Speech?
Most effective and natural form of human communication
Systems can be more user-friendly and more people will be able to access the technology with ease
The applications are enormous and I leave them to your imagination (language translators and tutors, IVRS, indexing audio recordings, etc.)
ASR is still an unsolved problem
More research is needed to realize its full potential
But it is mature enough to use under controlled conditions

2006-06-01

Introduction-2
Why is Speech Recognition hard?
Tremendous range of variability in speech, large vocabulary...
Disciplines in Speech Technology
Physiology of speech production and hearing
Signal processing
Linear algebra
Probability theory, statistical estimation and modeling
Information theory
Linguistics


Speech production and the acoustic-phonetic approach
The acoustic-phonetic approach draws an analogy between speech production and an electrical network, and uses it to model the vocal tract
But speech is not that deterministic
Statistical approaches showed much better results than the acoustic-phonetic approach, so these methods lost their importance

Statistical pattern recognition approach
Speech recognition is a type of pattern recognition problem
Raw sample streams of audio are not well suited for matching
Instead, features are formulated: mathematical expressions which, when applied to an audio stream, represent the changes in speech
So extraction of such features is the very first step
Spectral analysis... but not enough
Cepstral analysis... aha! But not yet there
Mel Frequency Cepstral Coefficients (MFCC)... WOW!
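As a small illustration of the "mel" part of MFCC, the sketch below converts between Hz and mels and spaces filter centre frequencies evenly on the mel scale. It assumes the commonly used 2595 * log10(1 + f/700) form of the mel scale; the function names and the choice of 10 filters between 0 and 8000 Hz are hypothetical, and a full MFCC front end (framing, windowing, FFT, filterbank, log, DCT) is not shown.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (assumed 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(num_filters, low_hz, high_hz):
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Illustrative values only: 10 filters between 0 Hz and 8 kHz.
centers = mel_filter_centers(10, 0.0, 8000.0)
for centre in centers:
    print(round(centre, 1))
```

The centres are packed closely at low frequencies and spread out at high frequencies, which roughly mirrors how human hearing resolves pitch.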


Template matching
A simple idea in statistical pattern recognition is to pre-record each word to be recognized, compute its feature vectors, and compare them with the feature vectors of the input to find the most closely matching word.
The same idea is extended to whole ASR systems
The concept of matching templates is generalized using the famous Dynamic Programming (DP) algorithm
Simple templates are replaced by models which are trained beforehand
There are different types of models: acoustic, pronunciation and language models.
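A minimal sketch of template matching, assuming each word is stored as a short list of feature vectors. The slides do not name a specific algorithm at this point; the sketch uses dynamic time warping (DTW), a standard DP technique for comparing sequences of different lengths. The two-dimensional "feature vectors" and the words "yes" and "no" are made-up toy data, not real MFCCs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, observed):
    """Cost of the cheapest monotonic alignment of two vector sequences."""
    n, m = len(template), len(observed)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(template[i - 1], observed[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(templates, observed):
    """Return the word whose template is closest to the observed input."""
    return min(templates, key=lambda word: dtw_distance(templates[word], observed))


# Toy templates (hypothetical data).
templates = {
    "yes": [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]],
    "no": [[0.0, 3.0], [0.0, 2.0], [1.0, 0.0]],
}
observed = [[1.0, 0.1], [1.1, 0.0], [2.1, 1.0], [3.0, 0.9]]
print(recognize(templates, observed))
```

The observed input has four frames while the templates have three; the warping lets one template frame absorb two input frames, which is what makes the comparison tolerant of speaking rate.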


Hidden Markov Models
Speech can be considered invariant (quasi-stationary) over a very small interval of time
An HMM models speech as a sequence of such small intervals, each segment being invariant within itself
Each such segment can be modeled by an HMM state
Each state has a model built on a probability distribution that describes the feature vector variation (i.e. the speech variation) over that segment
Thus, from these probability distributions, we get the probability that a given speech segment belongs to a particular HMM state
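The idea of states with their own probability distributions can be sketched as follows. This is a toy, not how Sphinx implements it: it assumes a three-state left-to-right HMM, one-dimensional features, a single Gaussian per state, and Viterbi decoding to assign each frame to a state. All names and numbers are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian: how well a state explains a frame."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(observations, means, variances, trans, start):
    """Return (best state sequence, its log score) for a Gaussian HMM."""
    states = range(len(means))
    score = [safe_log(start[s]) + log_gaussian(observations[0], means[s], variances[s])
             for s in states]
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for s in states:
            prev = max(states, key=lambda p: score[p] + safe_log(trans[p][s]))
            pointers.append(prev)
            new_score.append(score[prev] + safe_log(trans[prev][s])
                             + log_gaussian(x, means[s], variances[s]))
        score = new_score
        backpointers.append(pointers)
    state = max(states, key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best


# Left-to-right topology: each state may repeat or move to the next one.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
start = [1.0, 0.0, 0.0]
means = [0.0, 5.0, 10.0]
variances = [1.0, 1.0, 1.0]
frames = [0.1, -0.2, 0.0, 4.9, 5.2, 9.8, 10.1]

path, log_score = viterbi(frames, means, variances, trans, start)
print(path, round(log_score, 2))
```

Each frame ends up assigned to the state whose distribution explains it best, subject to the allowed transitions, which is the segmentation the slide describes.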

HMM example
Figure: a simple HMM with three states


Dynamic Programming
The match between two strings can be calculated using a 2-D trellis (shown in the slide's figure), by giving different scores to the possible operations (insertion, substitution and deletion) as we move along the trellis
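A minimal sketch of such a trellis, assuming a cost of 1 for each insertion, deletion and substitution and 0 for a match (the classic edit distance). The function name and default costs are hypothetical choices for illustration.

```python
def edit_trellis(reference, hypothesis, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill the DP trellis; cell [i][j] is the cheapest way to turn
    the first i symbols of reference into the first j of hypothesis."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    trellis = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        trellis[i][0] = i * del_cost          # delete everything so far
    for j in range(1, cols):
        trellis[0][j] = j * ins_cost          # insert everything so far
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0 if reference[i - 1] == hypothesis[j - 1] else sub_cost
            trellis[i][j] = min(trellis[i - 1][j] + del_cost,
                                trellis[i][j - 1] + ins_cost,
                                trellis[i - 1][j - 1] + match)
    return trellis


trellis = edit_trellis("kitten", "sitting")
for row in trellis:
    print(row)
print("score:", trellis[-1][-1])
```

The bottom-right cell holds the total cost of the best path through the trellis.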


Application of DP string matching
Applying DP, we search for a path which aligns the given word with the reference
The best score at the end of the trellis represents the best alignment path between the two words
Comparing the best scores obtained against different references enables us to recognize the speech!
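The recognition step can be sketched as follows, assuming the input has already been turned into a sequence of phone symbols and that every operation costs 1. The reference pronunciations and phone names are illustrative only; `align` and `recognize` are hypothetical helper names.

```python
def align(reference, observed):
    """Return (cost, operations) for the cheapest alignment of two sequences."""
    n, m = len(reference), len(observed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = cost[i - 1][j - 1] + (reference[i - 1] != observed[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the end of the trellis to recover the alignment path.
    operations = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != observed[j - 1])):
            same = reference[i - 1] == observed[j - 1]
            operations.append("match" if same else "substitute")
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            operations.append("delete")
            i -= 1
        else:
            operations.append("insert")
            j -= 1
    operations.reverse()
    return cost[n][m], operations


def recognize(references, observed):
    """Pick the reference word with the lowest alignment cost."""
    return min(references, key=lambda word: align(references[word], observed)[0])


# Hypothetical reference pronunciations and a slightly wrong input.
references = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "nine": ["N", "AY", "N"],
}
observed = ["W", "AA", "N"]
print(recognize(references, observed), align(references["one"], observed))
```

Even though the middle phone is wrong, "one" still wins because its best alignment needs only a single substitution.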


QUICK FINISH
Using the same DP we can train, i.e. build, the probability models for the HMMs
With many such HMMs for several words we can form a sentence HMM, which makes continuous recognition of words possible
The grammar of a language can be applied to these sentence HMMs as transition probabilities between words (known as language weights)
This means that, finally, we can define a language with these models!
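One simple way to obtain transition probabilities between words is to count word pairs in example sentences. The sketch below assumes an unsmoothed bigram estimate, which is only one possible choice and is not stated in the slides; the tiny corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Estimate P(next word | word) from counts, with sentence markers."""
    histories, pairs = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(words[:-1])
        pairs.update(zip(words[:-1], words[1:]))
    return {pair: count / histories[pair[0]] for pair, count in pairs.items()}


def sentence_log_prob(sentence, probs):
    """Sum of log transition probabilities; -inf if a transition was never seen."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for pair in zip(words[:-1], words[1:]):
        if pair not in probs:
            return float("-inf")
        total += math.log(probs[pair])
    return total


# Hypothetical command-style corpus.
corpus = ["call home", "call office", "open mail", "call home"]
probs = train_bigrams(corpus)
print(probs[("call", "home")])
print(sentence_log_prob("call home", probs))
print(sentence_log_prob("home call", probs))
```

Word sequences that the grammar never produces get a probability of zero here, so the decoder would never follow those transitions; practical systems usually smooth these estimates instead.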


CMU SPHINX
Sphinx is a world-class ASR system developed and maintained by Carnegie Mellon University (CMU)
It comes in different versions, namely Sphinx2, Sphinx3, Sphinx4 and PocketSphinx, together with the trainer SphinxTrain.
Sphinx4 is a new ASR system written entirely in Java
Sphinx2 is a fast speech recognition system that uses semi-continuous HMMs
PocketSphinx is the fastest recognition system, though it is not as accurate as Sphinx2 or Sphinx3
Sphinx3 uses continuous HMMs


CMU SPHINX - Training
SphinxTrain is the training package
It requires the following files
Phone list: specifies the phones used in our particular application, with each phone on a separate line
Dictionary: specifies how each and every word in our vocabulary is made up of the phones in the above list
Filler dictionary: specifies the special non-speech words such as silence, breath, cough, etc.
Transcripts: specify the content of each audio file in the database, using the words in the dictionary
And, obviously, the speech files, with the same file names as given in the transcripts
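Before training it helps to check that these files agree with each other. The sketch below is a hypothetical consistency checker, not part of SphinxTrain. The file layouts it assumes (one phone per line; "WORD PHONE PHONE ..." dictionary lines; transcript lines ending in a file id in parentheses) follow common SphinxTrain conventions but should be checked against your own version, and the sample contents are made up.

```python
def parse_dictionary(text):
    """Map each word to its list of phones ('WORD PH1 PH2 ...' per line)."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            entries[parts[0]] = parts[1:]
    return entries


def parse_transcripts(text):
    """Map each file id to its words ('words ... (file_id)' per line)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        words, _, file_id = line.rpartition("(")
        result[file_id.rstrip(")").strip()] = words.split()
    return result


def check_training_files(phone_text, dict_text, filler_text, transcript_text):
    """Return a list of human-readable inconsistencies (empty if none)."""
    phones = set(phone_text.split())
    vocabulary = parse_dictionary(dict_text)
    vocabulary.update(parse_dictionary(filler_text))
    problems = []
    for word, pronunciation in vocabulary.items():
        for phone in pronunciation:
            if phone not in phones:
                problems.append(f"phone {phone} in {word} is not in the phone list")
    for file_id, words in parse_transcripts(transcript_text).items():
        for word in words:
            if word not in vocabulary:
                problems.append(f"word {word} in {file_id} is not in any dictionary")
    return problems


# Made-up sample contents (assumed formats).
phone_list = "AH\nN\nT\nUW\nW\nSIL\n"
dictionary = "ONE W AH N\nTWO T UW\n"
filler = "<s> SIL\n</s> SIL\n<sil> SIL\n"
transcripts = ("<s> ONE TWO </s> (speaker1/utt01)\n"
               "<s> TWO THREE </s> (speaker1/utt02)\n")

for problem in check_training_files(phone_list, dictionary, filler, transcripts):
    print(problem)
```

Here the second transcript uses a word that has no dictionary entry, which is the kind of mismatch that otherwise only surfaces as an error partway through training.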


CMU SPHINX - Training: minimal usage
Perl scripts
Perl scripts are provided for easy usage of SphinxTrain: once we have the files ready at the appropriate locations, we just need to run the Perl scripts to get the acoustic models ready. So let's go the easy way!
Files such as the phone list, dictionary and transcripts are kept in the project/etc directory
Speech files are kept in the project/wav directory, in a separate directory for each user
Once we have these ready, we can set up the project by calling sphinx_tutorial.pl and thereafter make_feats.pl (if necessary)
Then we train the model either by just calling the RunAll.pl script in the scripts_pl directory or by calling each component program of training manually.

CMU SPHINX - Decoding
There are different decoders, namely Sphinx2, Sphinx3, Sphinx4 and PocketSphinx
We can select one of these decoders according to the type of application we have
Once we have the trained models, we just need to call the decoder, providing those model files and the other data necessary for decoding
Sphinx has a nice API set which enables us developers to integrate Sphinx into our own applications.
