
Assamese Numeral Corpus for Speech Recognition using ANN

A Report Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

Submitted By Krishna Dutta and Mousmita Sarma

to the

DEPARTMENT OF ELECTRONICS AND COMMUNICATION TECHNOLOGY


Gauhati University, Guwahati-781014
June, 2010

CERTIFICATE

This is to certify that the project work entitled "Assamese Numeral Corpus for Speech Recognition using ANN", being submitted by Krishna Dutta (Roll No. 172, G.U. Registration No. 009952 of 2005-06) and Mousmita Sarma (Roll No. 173, G.U. Registration No. 009969 of 2005-06), in partial fulfillment of the requirements for the award of the degree of M.Sc. in Electronics and Communication Technology, Department of Electronics and Communication Technology, Gauhati University, Guwahati-781014, Assam, is a record of the students' own work carried out under my supervision and guidance.

Mr. Kandarpa Kumar Sarma
Lecturer, Department of Electronics and Communication Technology
Gauhati University, Guwahati-781014
June, 2010

CERTIFICATE

This is to certify that the project work entitled "Assamese Character Segmentation using Neural Networks", being submitted by Kaustubh Bhattacharyya (Roll No. 01, G.U. Registration No. 036192 of 2001-02), in partial fulfillment of the requirements for the award of the degree of M.Phil. in Electronics Science and Technology, Department of Electronics Science, Gauhati University, Guwahati-781014, Assam, is a record of the student's own work carried out under the supervision and guidance of Mr. Kandarpa Kumar Sarma, Lecturer, Department of Electronics Science, Gauhati University, Guwahati-781014, Assam.

Dr. Pranayee Dutta
Professor and HOD,
Department of Electronics Science,
Gauhati University, Guwahati-781014
January, 2009

External Examiner

Internal Examiner

Acknowledgements
At the very beginning we would like to convey our sincere gratitude to Mr. Kandarpa Kumar Sarma, Lecturer, Department of Electronics and Communication Technology, Gauhati University, who has provided his valuable suggestions and opinions while we carried out this work and prepared the report. We are also thankful to Prof. Pranayee Dutta, Head of the Department, Department of Electronics and Communication Technology, who allowed us to carry out this project work and provided us all the laboratory facilities. Last but not least, we would like to express our heartiest thanks to our classmates for their constant inspiration and support during the project work.

Department of Electronics and Communication Technology, Gauhati University
January, 2010

Mousmita Sarma
Krishna Dutta

Abstract

A speech corpus is one of the major components of a speech processing system, where one of the primary requirements is to recognize an input sample. The quality and detail captured in the speech corpus directly affect the precision of recognition. This project work proposes a platform for speech corpus generation using an adaptive LMS filter and LPC cepstrum, as part of an ANN-based speech recognition system which is exclusively designed to recognize isolated numerals of the Assamese language, a major language in the north-eastern part of India. The work focuses on designing an adaptive filter configured as a pre-emphasis block for optimal feature extraction, so that the performance of an ANN-based speech recognition system can be improved. The ANN type used here is a recurrent neural network (RNN), which, due to its ability to deal with time-varying signals, is inherently capable of handling tasks like speech recognition and synthesis.

Contents

1 Introduction
1.1 What is Speech?
1.2 What is Speech Processing?
1.3 Speech Recognition: An Introduction
1.3.1 Certain Terms
1.3.2 Speaker Dependent vs. Speaker Independent Speech Recognition System
1.3.3 Conclusion
1.4 Various Stages of Speech Recognition
1.5 Pre-Processing
1.6 Digital Filters: An Overview
1.6.1 Role of Digital Filters in Speech Recognition
1.6.2 What is a Digital Filter?
1.6.3 Finite Impulse Response Filters
1.6.4 Transfer Function
1.6.5 Properties
1.6.6 Design Techniques of FIR Filters
1.6.7 FIR Filter Structures
1.6.8 Basic Structure of IIR Filters
1.6.9 Direct Form
1.6.10 Cascade Form
1.6.11 Parallel Form
1.6.12 Lattice Form
1.7 Design Techniques of IIR Filters
1.7.1 Approximation of Derivatives Method
1.7.2 Impulse Invariant Method
1.7.3 Bilinear Transformation Method
1.8 Vector Quantization Technique of Speech Recognition
1.8.1 Vector Quantization
1.8.2 The LBG Algorithm
1.8.3 How Does VQ Work?
1.8.4 How Does the Search Engine Work?
1.8.5 Vector Quantization in Speech Recognition Systems
1.8.6 Feature Vectors and Vector Space
1.8.7 Linear Predictive Coding
1.8.8 Introduction to the Graphical User Interface (GUI) in MATLAB 7.1

2 Literature Survey
2.1 Digital Filters: Previous Work
2.2 Speech Recognition Using Neural Networks: Previous Work
2.3 Assamese Speech Processing: A Few Case Studies

3 Problem Definition
3.1 Filter Structures to be Designed
3.2 VQ Codebook Design
3.3 Diagrammatic Representation of Present Work
3.4 Description of Different Blocks
3.4.1 Signal Capture
3.4.2 Signal Preprocessing
3.4.3 Digital Filter Bank
3.4.4 Vector Quantization Block
3.5 Graphic Window

4 Future Direction and Conclusion

List of Figures

1.1 Speech Recognition System
1.2 Speech Production
1.3 Direct Form FIR Filter
1.4 Cascade Form FIR Filter
1.5 Lattice Form FIR Filter
1.6 Direct Form 1 IIR Filter
1.7 Direct Form 2 IIR Filter
1.8 Cascade Realization of IIR Filter
1.9 Parallel Realization of IIR Filter
1.10 Lattice Realization of IIR Filter
1.11 One-dimensional Vector Quantization
1.12 Two-dimensional Vector Quantization
1.13 The Encoder and Decoder in a Vector Quantizer
1.14 Vector Quantization Based Speech Recognition System
3.1 Digital Filter Scheme
3.2 System Block Diagram
3.3 Sub-blocks of the Vector Quantization Block
3.4 View of the Graphic Window

List of Tables


Chapter 1 Introduction
1.1 What is Speech?

Speech is the vocalized form of human communication. It is based upon the syntactic combination of lexical items and names that are drawn from very large (usually > 10,000 different words) vocabularies. Each spoken word is created out of the phonetic combination of a limited set of vowel and consonant speech sound units. These vocabularies, the syntax which structures them, and their sets of speech sound units differ, creating the existence of many thousands of different types of mutually unintelligible human languages. While one produces speech sounds, the air flow from the lungs first passes the glottis and then the throat and mouth. Depending on which speech sound one articulates, the speech signal can be excited in three possible ways:

Voiced excitation: The glottis is closed. The air pressure forces the glottis to open and close periodically, thus generating a periodic pulse train (triangle-shaped). This fundamental frequency usually lies in the range from 80 Hz to 350 Hz.

Unvoiced excitation: The glottis is open and the air passes a narrow passage in the throat or mouth. This results in a turbulence which generates a noise signal. The spectral shape of the noise is determined by the location of the narrowness.

Transient excitation: A closure in the throat or mouth raises the air pressure. By suddenly opening the closure, the air pressure drops down immediately (plosive burst).

With some speech sounds these three kinds of excitation occur in combination. The spectral shape of the speech signal is determined by the shape of the vocal tract (the pipe formed by the throat, tongue, teeth and lips). The characteristics of speech signals are:

Bandwidth: Although the bandwidth of the speech signal is much higher than 4 kHz, within a bandwidth of 4 kHz the speech signal contains all the information necessary to understand a human voice.

Fundamental frequency: The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz. Voiced excitation of a speech sound results in a pulse train at this so-called fundamental frequency. Voiced excitation is used when articulating vowels and some of the consonants.

Peaks in the spectrum: There are peaks in the spectral distribution of energy at

$$(2n - 1) \cdot 500\,\text{Hz}, \quad n = 1, 2, 3, \ldots$$

After passing the glottis, the vocal tract gives a characteristic spectral shape to

the speech signal. If one simplifies the vocal tract to a straight pipe (of length about 17 cm), one can see that the pipe shows resonances at the frequencies given by the above equation.

The envelope of the power spectrum: The envelope of the power spectrum of the signal shows a decrease of 6 dB per octave with increasing frequency. The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by 12 dB per octave. The emission characteristics of the lips show a high-pass characteristic of +6 dB per octave. This results in an overall decrease of 6 dB per octave.

1.2 What is Speech Processing?

Speech processing is the study of speech signals and the processing methods of these signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. It is also closely tied to natural language processing (NLP), as its input can come from, and its output can go to, NLP applications: for example, text-to-speech synthesis may use a syntactic parser on its input text, and speech recognition's output may be used by information extraction techniques. Speech processing can be divided into the following categories:

Speech recognition: It deals with the analysis of the linguistic content of a speech signal.

Speaker recognition: Here the aim is to recognize the identity of the speaker.

Enhancement of speech signals: Here the aim is to reduce audio noise.

Speech coding: It is a specialized form of data compression important in the telecommunication area.

Voice analysis: It is used for medical purposes, for analysis of vocal loading and dysfunction of the vocal cords.

Speech synthesis: The artificial synthesis of speech, which usually means computer-generated speech.

Speech enhancement: Here the aim is to enhance the perceptual quality of a speech signal by removing the destructive effects of noise, limited-capacity recording equipment, impairments, etc.

1.3 Speech Recognition: An Introduction

Speech recognition is the process of converting spoken input to text, and is therefore sometimes referred to as speech-to-text. Speech recognition allows us to provide input to an application with our voice. Just like clicking with a mouse, typing on a keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows providing input by talking. In the desktop world, we need a microphone to be able to do this. The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it. The application may take one of two forms:

1. The speech signal may be recognized as a command for the application. This is called a command and control application.

2. If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, the system simply returns the speech in text form.

1.3.1 Certain Terms

Following are a few of the basic terms and concepts that are fundamental to speech recognition.

Utterances: When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed. Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works: when the engine detects audio input, in other words a lack of silence, the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs. An utterance can be a single word, or it can contain multiple words (a phrase or a sentence). For example, "checking", "checking account", or "I'd like to know the balance of my checking account please" are all examples of possible utterances, things that a caller might say to a banking application. Whether these words and phrases are valid at a particular point in a dialog is determined by which grammars are active.

Pronunciations: The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information

that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like. Words can have multiple pronunciations associated with them. For example, the word "the" has at least two pronunciations in U.S. English: "thee" and "thuh".

Grammars: In a speech recognition application one must specify the words and phrases that the user can say to the application. These words and phrases are defined to the speech recognition engine and are used in the recognition process. The valid words and phrases can be specified in a number of different ways; a grammar uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural-language capability. Grammars define the domain, or context, within which the recognition engine works. The engine compares the current utterance against the words and phrases in the active grammars. If the user says something that is not in the grammar, the speech engine will not be able to decipher it correctly.

Accuracy: The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design.

Acceptance and rejection: When the recognition engine processes an utterance, it returns a result. The result can be either of two states: acceptance or rejection. An

accepted utterance is one in which the engine returns recognized text. Not all utterances that are processed by the speech engine are accepted. Acceptance or rejection is flagged by the engine with each processed utterance.

1.3.2 Speaker Dependent vs. Speaker Independent Speech Recognition System

There are two types of speech recognition: one is called speaker-dependent and the other speaker-independent. Speaker-dependent software works by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first train the software by speaking to it, so the computer can analyze how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software. Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This means it is the only real option for applications such as interactive voice response systems, where businesses can't ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software. Speech recognition engines that are speaker-independent generally deal with this fact by limiting the grammars they use. By using a smaller list of recognized words, the speech engine is more likely to correctly recognize what a speaker said. Speaker-dependent software is commonly used for dictation software, while speaker-independent software is more commonly found in telephone applications.

Figure 1.1: Speech Recognition System

1.3.3 Conclusion

Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses. These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access and, at the same time, leverage a business's existing Web investments. Speech recognition and VoiceXML clearly represent the next wave of the Web.

1.4 Various Stages of Speech Recognition

The various stages of speech recognition, in sequence, are shown in the block diagram of figure 1.1.

Voice input: It is the audio input coming into the recognition engine, which the engine has to process.

Analog-to-digital conversion: The voice signal is always an analog continuous-time signal, hence it is necessary to convert it to a discrete form. The analog-to-digital converter involves the following steps:

1. Sampling: the process of converting a continuous signal into a discrete signal.
2. Quantization: the process of approximating a continuous range of values.
3. Encoding: converting the quantized output levels into a binary bit stream.
4. Filtering: a filter is used to measure energy levels at various points on the frequency spectrum. Knowing the relative importance of different frequency bands (for speech) makes this process more efficient. High-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale).

Acoustic model: An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech.

Language model: Language modeling is used in many natural language processing applications; in speech recognition it tries to capture the properties of a language and to predict the next word in a speech sequence.

Speech engine: Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about, which are provided by the language model, along with its knowledge of the environment in which it is operating, which is provided by the acoustic model.


Figure 1.2: Speech Production

Once it identifies the most likely match for what was said, it returns what it recognized as a text string. Speech engines try very hard to find a match, and always return their best guess for what was said.

Feedback: Feedback can be provided in order to reduce system error and to improve system performance.

1.5 Pre-Processing

The production of speech can be separated into two parts: producing the excitation signal and forming the spectral shape. Thus, a simplified model of speech production can be drawn as shown in figure 1.2. This model works as follows: voiced excitation is modeled by a pulse generator which generates a pulse train (of


triangle-shaped pulses) with its spectrum given by P(f). The unvoiced excitation is modeled by a white noise generator with spectrum N(f). To mix voiced and unvoiced excitation, one can adjust the signal amplitudes of the impulse generator (v) and the noise generator (u). The output of both generators is then added and fed into the box modeling the vocal tract, which performs the spectral shaping with the transmission function H(f). The emission characteristics of the lips are modeled by R(f). Hence, the spectrum S(f) of the speech signal is given as:

$$S(f) = \left(v \cdot P(f) + u \cdot N(f)\right) \cdot H(f) \cdot R(f) = X(f) \cdot H(f) \cdot R(f) \tag{1.5.1}$$

To influence the speech sound, we can adjust the following parameters in our speech production model:

The mixture between voiced and unvoiced excitation (determined by v and u)
The fundamental frequency (determined by P(f))
The spectral shaping (determined by H(f))
The signal amplitude (depending on v and u)

These are the technical parameters describing a speech signal. To perform speech recognition, the parameters given above have to be computed from the time signal. This is called acoustic preprocessing. They are then forwarded to the speech recognizer. For the speech recognizer, the most valuable information is contained in the way the spectral shape of the speech signal changes in time. To reflect these dynamic changes, the spectral shape is determined in short intervals of time, e.g., every 10 ms. By directly computing the spectrum of the speech signal,


the fundamental frequency would be implicitly contained in the measured spectrum. This results in unwanted ripples in the spectrum.
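To make this short-time analysis concrete, the following Python sketch splits a signal into 10 ms frames. It is only an illustration: the 8 kHz sampling rate, the test tone, and the non-overlapping framing are assumptions made for the example, not values fixed by this report.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=10):
    """Split signal x into consecutive non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(fs * frame_ms / 1000)      # samples per frame
    n_frames = len(x) // frame_len             # drop any trailing partial frame
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

# One second of a synthetic 120 Hz tone at an assumed 8 kHz sampling rate
fs = 8000
x = np.sin(2 * np.pi * 120 * np.arange(fs) / fs)
frames = frame_signal(x, fs)
print(frames.shape)   # (100, 80): 100 frames of 10 ms each
```

Each row of the resulting array would then be analyzed separately (e.g., by computing its spectral shape), which is exactly the per-interval analysis described above.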

1.6 Digital Filters: An Overview

1.6.1 Role of Digital Filters in Speech Recognition

In signal processing, a filter removes unwanted parts of the signal, such as random noise, or extracts the useful parts of the signal, such as the components lying within a certain frequency range. There are many instances in which an input signal to a system contains extra unnecessary content or additional noise which can degrade the quality of the desired portion. In such cases we may remove or filter out the useless samples. For example, in the case of the telephone system, there is no reason to transmit very high frequencies since most speech falls within the band of 400 to 3,400 Hz. Therefore, in this case, all frequencies above and below that band are filtered out. The frequency band between 400 and 3,400 Hz, which isn't filtered out, is known as the pass band, and the frequency band that is blocked out is known as the stop band. In speech recognition, digital filters play an important role in the spectral analysis of the speech signal. The bandwidth of the speech signal is much higher than 4 kHz; however, within a bandwidth of 4 kHz the speech signal contains all the information necessary to understand a human voice. Hence a low-pass filter initially removes the frequencies above 4 kHz from the speech signal. A bank of digital filters then extracts the actual information-carrying portion from that 4 kHz bandwidth signal.
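As an illustration of such a pre-filtering stage, the Python sketch below designs a 4 kHz low-pass FIR filter with SciPy. The 16 kHz sampling rate and the 101-tap length are assumptions chosen for the example; they are not the design choices of this work.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000      # assumed sampling rate in Hz (not fixed at this point in the report)
cutoff = 4000   # keep only the 0-4 kHz band that carries the speech information

# 101-tap FIR low-pass filter designed by the window method (Hamming window)
b = firwin(numtaps=101, cutoff=cutoff, fs=fs)

# Test signal: a 1 kHz tone (pass band) plus a 6 kHz tone (stop band)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)
y = lfilter(b, [1.0], x)   # y retains the 1 kHz component and suppresses the 6 kHz one
```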


1.6.2 What is a Digital Filter?

A digital filter is a system that performs mathematical operations on a sampled, discrete-time signal to reduce or enhance certain aspects of that signal. Simply put, a digital filter is a discrete-time, discrete-amplitude convolver. Fourier transform theory reveals that the linear convolution of two sequences in the time domain is the same as the multiplication of the two corresponding spectral sequences in the frequency domain. Filtering is in essence the multiplication of the signal spectrum by the frequency-domain impulse response of the filter. An analog signal may be processed by a digital filter by first being digitized and represented as a sequence of numbers, then manipulated mathematically, and then reconstructed as a new analog signal. A digital filter is characterized by its transfer function or, equivalently, its difference equation. The transfer function for a linear, time-invariant digital filter can be expressed in the Z-domain; if it is causal, then it has the form:

$$H(z) = \frac{B(z)}{A(z)} = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2} + \cdots + b_N z^{-N}}{1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_M z^{-M}} \tag{1.6.1}$$

Digital filters are mainly discrete time-invariant systems, which may be Finite Impulse Response (FIR) or Infinite Impulse Response (IIR).

1.6.3 Finite Impulse Response Filters

An FIR filter is a type of digital filter which has an impulse response sequence of finite duration, i.e. it has a finite number of non-zero terms. FIR filters are usually implemented by non-recursive (all-zero) structures; they have no feedback. The impulse response of an Nth-order FIR filter lasts for N + 1 samples and then dies to zero.


The difference equation that defines the output of an FIR filter in terms of its input is:

$$y[n] = b_0 x[n] + b_1 x[n-1] + \cdots + b_N x[n-N] \tag{1.6.2}$$

where x[n] is the input signal, y[n] is the output signal, the $b_i$ are the filter coefficients, and N is the filter order; an Nth-order filter has (N + 1) terms on the right-hand side, commonly referred to as taps. This equation can also be expressed as a convolution of the coefficient sequence $b_i$ with the input signal:

$$y[n] = \sum_{i=0}^{N} b_i\, x[n-i] \tag{1.6.3}$$

That is, the filter output is a weighted sum of the current and a finite number of previous values of the input.
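A minimal Python sketch of equation (1.6.3) follows; the coefficient and input values are arbitrary, chosen only to make the impulse-response behaviour visible.

```python
import numpy as np

def fir_filter(b, x):
    """Direct evaluation of equation (1.6.3): y[n] = sum_i b[i] * x[n - i]."""
    N = len(b) - 1                          # filter order
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(min(n, N) + 1):      # samples before x[0] are taken as zero
            y[n] += b[i] * x[n - i]
    return y

b = [0.25, 0.5, 0.25]                       # example 2nd-order smoothing filter
x = np.array([1.0, 0.0, 0.0, 0.0])          # unit impulse
print(fir_filter(b, x))                     # [0.25 0.5 0.25 0.]: impulse response equals b
print(np.convolve(x, b)[:len(x)])           # identical result via convolution
```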

1.6.4 Transfer Function

The impulse response h[n] can be calculated if we set x[n] = δ[n] in the above relation, where δ[n] is the delta impulse. The impulse response of an FIR filter is then simply the set of coefficients $b_n$:

$$h[n] = \sum_{i=0}^{N} b_i\, \delta[n-i] = b_n, \quad n = 0, \ldots, N \tag{1.6.4}$$

The Z-transform of the impulse response yields the transfer function of the FIR filter:

$$H(z) = \mathcal{Z}\{h[n]\} = \sum_{n=0}^{N} b_n\, z^{-n} \tag{1.6.5}$$


FIR filters are clearly bounded-input bounded-output (BIBO) stable, since the output is a sum of a finite number of finite multiples of the input values.

1.6.5 Properties

1. FIR filters have an exact linear phase.
2. They are always stable.
3. The design methods are generally linear.
4. They can be realized efficiently in hardware.
5. The filter start-up transients have finite duration.

1.6.6 Design Techniques of FIR Filters

The sinusoidal steady-state transfer function of a digital filter is periodic in the sampling frequency and can be expanded in a Fourier series. Thus

$$H(e^{j\omega}) = \sum_{n=-\infty}^{\infty} h[n]\, e^{-j\omega n T} \tag{1.6.6}$$

where h[n] represents the terms of the unit impulse response. Separating real and imaginary parts,

$$H(e^{j\omega}) = \sum_{n=-\infty}^{\infty} h[n]\cos(\omega n T) - j\sum_{n=-\infty}^{\infty} h[n]\sin(\omega n T) = H_r(f) + jH_i(f) \tag{1.6.7}$$

where $H_r(f)$ is an even function of frequency and $H_i(f)$ is an odd function of frequency. The design procedure is given below:

1. For filter operation, set $H_i(f) = 0$.


2. Expand $H_r(f)$ in a Fourier series.

3. The unit pulse response is determined from the Fourier coefficients using the following equations:

$$h(0) = a_0, \qquad h(n) = \frac{1}{2}a_n, \qquad h(-n) = \frac{1}{2}a_n$$

There are two problems involved in this technique. The first is that the transfer function $H(e^{j\omega})$ represents a non-causal digital filter of infinite duration. Hence, to obtain a finite-duration impulse response, truncating and delaying are done. This modification does not affect the amplitude response of the filter; however, it results in oscillations in the pass band and stop band due to the slow convergence of the Fourier series near a point of discontinuity. This effect is known as the Gibbs phenomenon. The limitations arising due to the Gibbs phenomenon can be reduced by modifying the Fourier coefficients using a set of time-limited weighting functions, w(n), referred to as window functions. The process involves combining h(n) with w(n), which results in an FIR approximation of finite length. There are various window functions (a sketch of this window-design procedure follows the list below):

Rectangular window
Hamming window
Hanning window
Blackman window
Bartlett window
Kaiser window
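The Python sketch below illustrates the window-design procedure just described: the truncated, delayed ideal low-pass response is multiplied by a window, and the Hamming-windowed design shows much smaller Gibbs ripple than the rectangular one. The cutoff, tap count, and sampling rate are example values, not design parameters from this work.

```python
import numpy as np

def windowed_lowpass(num_taps, fc, fs, window=np.hamming):
    """Window-method FIR design: truncate/delay the ideal low-pass response, then window it."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0        # delay makes the filter causal
    h_ideal = 2.0 * fc / fs * np.sinc(2.0 * fc / fs * n)  # truncated ideal impulse response
    return h_ideal * window(num_taps)

num_taps, fc, fs = 51, 1000, 8000
h_rect = windowed_lowpass(num_taps, fc, fs, window=np.ones)  # rectangular: strong Gibbs ripple
h_hamm = windowed_lowpass(num_taps, fc, fs)                  # Hamming: ripple much reduced

# Compare the stop-band behaviour of the two designs via their magnitude responses
H_rect = 20 * np.log10(np.abs(np.fft.rfft(h_rect, 1024)) + 1e-12)
H_hamm = 20 * np.log10(np.abs(np.fft.rfft(h_hamm, 1024)) + 1e-12)
```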

1.6.7 FIR Filter Structures

Finite Impulse Response (FIR) filter structures are mainly of three types:

1. Direct Form
2. Cascade Form
3. Lattice Form

Direct Form: The filter structure in which the multiplier coefficients are precisely the coefficients of the transfer function is called a direct form structure. A direct form realization of an FIR filter can be readily developed from the convolution sum description, as shown for N = 4 in figure 1.3.

Figure 1.3: Direct Form FIR Filter


Cascade Form: A higher-order transfer function can also be realized as a cascade of second-order FIR sections and possibly a first-order section. A cascade realization is shown in figure 1.4.

Figure 1.4: Cascade Form FIR Filter

Lattice Structure: It is also possible to implement FIR filters in a lattice structure; this is sometimes used in adaptive filtering, as shown in figure 1.5.

1.6.8 Basic Structure of IIR Filters

Infinite impulse response (IIR) is a property of signal processing systems. Filters with this property are known as IIR filters. IIR systems have an impulse response function that is non-zero over an infinite length of time. This is in contrast to finite impulse response (FIR) filters, which have finite-duration impulse responses.

IIR filter systems are characterized by the constant-coefficient difference equation (1.6.8).

Figure 1.5: Lattice Form FIR Filter

$$y(n) = -\sum_{k=1}^{N} a_k\, y(n-k) + \sum_{k=0}^{M} b_k\, x(n-k) \tag{1.6.8}$$

where the $a_k$ and $b_k$ are constant coefficients and $M \le N$. From the above equation we see that an IIR system involves a recursive computational algorithm (a direct implementation of this recursion is sketched after the following list). The most important IIR structures are:

Direct Form 1
Direct Form 2
Cascade
Parallel
Lattice
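As a sketch of the recursion in equation (1.6.8), the Python fragment below evaluates it directly (and inefficiently); the one-pole smoothing filter used in the example is an arbitrary illustration, with coefficients normalized so that the leading denominator coefficient is 1.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct evaluation of equation (1.6.8), with a[0] taken as 1:
    y(n) = -sum_{k=1}^{N} a[k] y(n-k) + sum_{k=0}^{M} b[k] x(n-k)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y[n] = acc
    return y

# One-pole smoothing filter: y(n) = 0.9 y(n-1) + 0.1 x(n)
x = np.ones(8)
print(iir_filter(b=[0.1], a=[1.0, -0.9], x=x))
# The output rises towards 1.0 but never settles exactly: the response is infinite.
```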


1.6.9 Direct Form

Equation (1.6.8) is the standard form of the system transfer function, and a direct form realization can be drawn directly from it. Direct Form 1 is the cascade representation of two systems, where the first contains only the zeros and the second contains only the poles. The Direct Form 1 diagram is drawn considering M = N, as shown in figure 1.6.

Figure 1.6: Direct Form 1 IIR Filter

Since we are dealing with a linear system, Direct Form 2 can be realized by simply exchanging the locations of the all-pole and all-zero systems. The advantage of this realization is that it makes the system design simpler. This system is shown in figure 1.7.

Figure 1.7: Direct Form 2 IIR Filter

1.6.10 Cascade Form

In a cascade realization the transfer function is divided into a number of transfer functions $H_1(z), H_2(z), \ldots, H_k(z)$. The resulting realization is given in figure 1.8.

Figure 1.8: Cascade Realization of IIR Filter

The advantage of the cascade form is that the overall transfer function of the filter is the product of the component transfer functions, so it has the same zeros and poles as the individual components.
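In practice, libraries expose the cascade form as "second-order sections", which are numerically better behaved than a single high-order direct form. The SciPy sketch below is illustrative only; the Butterworth prototype and cutoff are example choices, not part of this work's filter bank.

```python
import numpy as np
from scipy.signal import butter, sosfilt, tf2sos

# A 6th-order Butterworth low-pass filter designed directly as second-order sections
sos = butter(6, 0.2, btype="low", output="sos")   # cutoff at 0.2 x Nyquist

# An existing transfer function (b, a) can likewise be factored into cascaded sections
b, a = butter(6, 0.2, btype="low")
sos_from_tf = tf2sos(b, a)

x = np.random.randn(1000)
y = sosfilt(sos, x)   # each second-order section filters the output of the previous one
```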

1.6.11 Parallel Form

By using the partial fraction expansion, the transfer function of an IIR filter can be realized in a parallel form. The parallel realization is useful in high-speed filtering applications, since the filter operation is performed in parallel. A parallel realization is shown in figure 1.9.

1.6.12 Lattice Form

The lattice realization of a second-order IIR filter is given by the following equation:

$$y(n) = -a_1(1)\, y(n-1) - a_2(2)\, y(n-2) + x(n) \tag{1.6.9}$$

The diagrammatic representation of the lattice realization is given in figure 1.10.

Figure 1.9: Parallel Realization of IIR Filter

1.7 Design Techniques of IIR Filters

There are several techniques to implement an Infinite Impulse Response (IIR) digital filter. The commonly used implementation techniques accomplish the filter design by simply converting an analog filter into its digital form. There are three main methods of IIR filter design:

1. Approximation of Derivatives Method
2. Impulse Invariant Method
3. Bilinear Transformation Method

1.7.1 Approximation of Derivatives Method

In this method the analog-to-digital filter conversion is implemented by simply putting

$$s = \frac{1 - z^{-1}}{T} \tag{1.7.1}$$

For example, let us consider the transfer function of an analog filter given by

$$H(s) = \frac{1}{s + 2} \tag{1.7.2}$$

Figure 1.10: Lattice Realization of IIR Filter

Substituting the value of s and taking T = 1 s, we get

$$H(z) = \frac{1}{3 - z^{-1}} \tag{1.7.3}$$

1.7.2 Impulse Invariant Method

In the impulse invariant method the conversion is implemented by putting

$$\frac{1}{s - p_i} = \frac{1}{1 - e^{p_i T} z^{-1}} \tag{1.7.4}$$

The above two methods are suitable for low-pass and band-pass filters whose resonant frequencies are low. For high-pass and band-reject filters these two methods are not applicable. To overcome this problem the bilinear transformation is used.


1.7.3 Bilinear Transformation Method

In the bilinear transformation method the analog-to-digital filter conversion is done by equating

$$s = \frac{2}{T} \cdot \frac{z - 1}{z + 1} \tag{1.7.5}$$
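For illustration, SciPy's bilinear routine can be applied to the same analog prototype H(s) = 1/(s + 2) used in the earlier worked example, again with T = 1 s; this is a sketch of the transformation, not a design step taken from this report.

```python
from scipy.signal import bilinear

# Analog prototype H(s) = 1 / (s + 2), as in the earlier worked example
b_analog, a_analog = [1.0], [1.0, 2.0]

# Bilinear transformation with fs = 1/T = 1 Hz, i.e. s = 2 (z - 1) / (z + 1)
b_digital, a_digital = bilinear(b_analog, a_analog, fs=1.0)
print(b_digital, a_digital)   # ~[0.25 0.25], [1. 0.]  ->  H(z) = 0.25 (1 + z^-1)
```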

1.8 Vector Quantization Technique of Speech Recognition

1.8.1 Vector Quantization

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding. It is a fixed-to-fixed length algorithm. In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence. The use of a training sequence bypasses the need for multi-dimensional integration. A VQ that is designed using this algorithm is referred to in the literature as an LBG-VQ. A VQ is nothing more than an approximator. The idea is similar to that of rounding off (say, to the nearest integer). An example of a 1-dimensional VQ is shown in figure 1.11: every number less than -2 is approximated by -3, every number between -2 and 0 by -1, every number between 0 and 2 by +1, and every number greater than 2 by +3. Note that the approximate values are uniquely represented by 2 bits. This is a 1-dimensional, 2-bit VQ, with a rate of 2 bits/dimension.

Figure 1.11: One-dimensional Vector Quantization

An example of a 2-dimensional VQ is shown in figure 1.12. Here, every pair of numbers falling in a particular region is approximated by a red star associated with that region. Note that there are 16 regions and 16 red stars, each of which can be uniquely represented by 4 bits. Thus, this is a 2-dimensional, 4-bit VQ. Its rate is also 2 bits/dimension.
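A minimal sketch of the 1-dimensional, 2-bit quantizer described above, with the codebook and decision boundaries exactly as in the example:

```python
import numpy as np

codebook = np.array([-3.0, -1.0, 1.0, 3.0])      # the four 2-bit reproduction values

def quantize_1d(x):
    """1-dimensional, 2-bit VQ with region boundaries at -2, 0, +2."""
    indices = np.digitize(x, bins=[-2.0, 0.0, 2.0])
    return indices, codebook[indices]

idx, approx = quantize_1d(np.array([-2.7, -0.3, 1.4, 5.0]))
print(idx)      # [0 1 2 3]  -- each index fits in exactly 2 bits
print(approx)   # [-3. -1.  1.  3.]
```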

1.8.2 The LBG Algorithm

The LBG VQ design algorithm is an iterative algorithm which alternately satisfies the two optimality criteria (the nearest-neighbour and centroid conditions). The algorithm requires an initial codebook C(0), which is obtained by the splitting method. In this method, an initial code vector is set as the average of the entire training sequence. This code vector is then split into two, and the iterative algorithm is run with these two vectors as the initial codebook. The final two code vectors are split into four, and the process is repeated until the desired number of code vectors is obtained. The algorithm is summarized below.

1. Fix $\epsilon > 0$ to be a small number.


Figure 1.12: Two-dimensional Vector Quantization

2. Let N = 1 and set

$$C_1 = \frac{1}{M} \sum_{m=1}^{M} X_m \tag{1.8.1}$$

Calculate

$$D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \lVert X_m - C_1 \rVert^2 \tag{1.8.2}$$

3. Splitting: for $i = 1, 2, \ldots, N$, set

$$C_i(0) = (1 + \epsilon)\, C_i \tag{1.8.3}$$
$$C_{N+i}(0) = (1 - \epsilon)\, C_i \tag{1.8.4}$$

and set N = 2N.


4. Iteration: let $D_{ave}(0) = D_{ave}$ and set the iteration index i = 0.

(a) For $m = 1, 2, \ldots, M$, find the minimum value of $\lVert X_m - C_n(i) \rVert^2$ over all $n = 1, 2, \ldots, N$. Let $n^*$ be the index which achieves the minimum, and set $Q(X_m) = C_{n^*}(i)$.

(b) For $n = 1, 2, \ldots, N$, update the code vector (as the centroid of the training vectors assigned to it).

(c) Set i = i + 1.

(d) Calculate

$$D_{ave}(i) = \frac{1}{Mk} \sum_{m=1}^{M} \lVert X_m - Q(X_m) \rVert^2$$

(e) If $\dfrac{D_{ave}(i-1) - D_{ave}(i)}{D_{ave}(i-1)} > \epsilon$, go back to step (a).

(f) Set $D_{ave} = D_{ave}(i)$. For $n = 1, 2, \ldots, N$, set $C_n = C_n(i)$ as the final code vectors.

5. Repeat steps 3 and 4 until the desired number of code vectors is obtained.
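A compact Python sketch of the LBG procedure above is given below. It simplifies step 4 slightly (the distortion is measured just before each centroid update) and keeps a codeword unchanged if its cell becomes empty; both are implementation choices of this sketch, not part of the algorithm statement above.

```python
import numpy as np

def lbg(training, n_codewords, eps=0.01):
    """LBG codebook design by repeated splitting.
    training: (M, k) array of training vectors."""
    codebook = training.mean(axis=0, keepdims=True)           # step 2: single centroid
    while len(codebook) < n_codewords:
        codebook = np.vstack([(1 + eps) * codebook,           # step 3: split every codeword
                              (1 - eps) * codebook])
        prev_dist = None
        while True:                                           # step 4: iterate to convergence
            d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)                        # step 4(a): nearest codeword
            for n in range(len(codebook)):                    # step 4(b): centroid update
                members = training[nearest == n]
                if len(members) > 0:
                    codebook[n] = members.mean(axis=0)
            dist = d.min(axis=1).mean()                       # average distortion, step 4(d)
            if prev_dist is not None and (prev_dist - dist) / prev_dist <= eps:
                break                                         # stopping rule, step 4(e)
            prev_dist = dist
    return codebook

# Example: a 4-codeword codebook for 500 two-dimensional training vectors
rng = np.random.default_rng(0)
print(lbg(rng.normal(size=(500, 2)), n_codewords=4))
```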

1.8.3 How Does VQ Work?

A vector quantizer is composed of two operations: the encoder and the decoder. The encoder takes an input vector and outputs the index of the codeword that offers the lowest distortion. In this case the lowest distortion is found by evaluating the Euclidean distance between the input vector and each codeword in the codebook. Once the closest codeword is found, the index of that codeword is sent through a channel (the channel could be computer storage, a communications channel, and so on). When the decoder receives the index of the codeword, it replaces the index with the associated codeword. Figure 1.13 shows a block diagram of the operation of the encoder and decoder.


Figure 1.13: The Encoder and Decoder in a Vector Quantizer

1.8.4 How Does the Search Engine Work?

Although VQ offers more compression for the same distortion rate than scalar quantization and PCM, it is not as widely implemented. This is due to two things: the first is the time it takes to generate the codebook, and the second is the speed of the search. Many algorithms have been proposed to increase the speed of the search. Some of them reduce the arithmetic used to determine the codeword that offers the minimum distortion; other algorithms preprocess the codewords and exploit the underlying structure. The simplest search method, which is also the slowest, is full search. In full search an input vector is compared with every codeword in the codebook. If there are M input vectors, N codewords, and each vector is in k dimensions, then the number of multiplications becomes kMN, the number of additions and subtractions becomes MN((k - 1) + k) = MN(2k - 1), and the number of comparisons becomes MN(k - 1). This makes full search an expensive method.
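A vectorized full-search encoder/decoder pair can be sketched as follows; the codebook and test vectors are arbitrary example values chosen for illustration.

```python
import numpy as np

def encode(vectors, codebook):
    """Full-search encoder: index of the minimum-distortion codeword for each input
    vector (this is the k*M*N multiplication cost discussed above)."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def decode(indices, codebook):
    """Decoder: replace each received index with its associated codeword."""
    return codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]])
x = np.array([[0.9, 1.2], [-0.1, -0.8]])
idx = encode(x, codebook)
print(idx)                    # [1 3]
print(decode(idx, codebook))  # the two nearest codewords
```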

1.8.5 Vector Quantization in Speech Recognition Systems

Speech recognition based on vector quantization has proved its usefulness for isolated-word speech recognition for small vocabularies (100-200 words) and one speaker, due to its easy implementation and its fast calculation. One of the main objectives of using digital signal processing (DSP) techniques in speech recognition is to take advantage of the redundancy that exists in speech, and thus reduce the amount of data and operations necessary to process it. During these operations a good set of features is obtained, which will be used to implement the recognition. For speech recognition using vector quantization, one codebook is used for each word of the recognition vocabulary. Each codebook is created from a training sequence containing repetitions of one vocabulary word. For example, a codebook for the word "one" would be designed by running the vector quantizer design algorithm on a training sequence of several repetitions of the word "one". In order to recognize a spoken word using vector quantization, it is passed through each vector quantizer of the recognition vocabulary, and each quantizer gives a set of distortions $d_1(t), d_2(t), \ldots, d_M(t)$. These distortions correspond to the distance of the vector quantizer that best fits the frame t of the voice signal, for each quantizer $1, \ldots, M$. The decision on what word was spoken is made by comparing the global distortions $D_1, D_2, \ldots, D_M$ from each quantizer, and by selecting the word of the vocabulary whose quantizer gives the smallest global distortion over $t = 1, \ldots, T$, as shown in figure 1.14.

Figure 1.14: Vector Quantization Based Speech Recognition System

1.8.6 Feature Vectors and Vector Space

If we have a set of numbers representing certain features of an object that we want to describe, it is useful for further processing to construct a vector out of these numbers by assigning each measured value to one component of the vector. In order to define feature vectors and vector space, let us take the example of an air-conditioning system which measures the temperature and relative humidity in an office. If we measure those parameters every second or so, and we put the temperature into the first component and the humidity into the second component of a vector, we get a series of two-dimensional vectors describing how the air in the office changes in time. Since these so-called feature vectors have two components, we can interpret the vectors as points in a two-dimensional vector space. Thus we can draw a two-dimensional map of our measurements, as sketched in the figure. Each point in our map represents the temperature and humidity in the office at a given time. As we know, there are certain values of temperature and humidity which we find more comfortable than other values. In the map the comfortable value pairs are shown as points labeled "+" and the less comfortable ones are shown as "-". We can see that they form regions of convenience and inconvenience respectively.

In our case the speech signal is represented by a series of feature vectors which are computed every 10 ms. A whole word will comprise dozens of those vectors, and we know that the number of vectors (the duration) of a word will depend on how fast a person is speaking. We have to perform vector classification; in speech recognition, we have to classify not only single vectors, but sequences of vectors. Let's assume we want to recognize a few command words or digits. For an utterance of a word w which is $T_X$ vectors long, we get a sequence of vectors $X = \{X_0, X_1, \ldots, X_{T_X - 1}\}$ from the acoustic preprocessing stage. Then we need a way to compute a distance between this unknown sequence of vectors X and known sequences of vectors $W_k = \{W_{k0}, W_{k1}, \ldots\}$ which are prototypes for the words we want to recognize. Let our vocabulary (here: the set of classes) contain V different words $(w_0, w_1, \ldots, w_{V-1})$. On the basis of nearest-neighbour classification, we allow a word $w_v$ to be represented by a set of prototypes $W_{k,v}$, $k = 0, 1, \ldots, (K_v - 1)$, to reflect all the variations possible due to different pronunciations or even different speakers.

1.8.7 Linear Predictive Coding

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model. It is one of the most powerful speech analysis techniques and one of the most useful methods for encoding good-quality speech at a low bit rate; it provides extremely accurate estimates of speech parameters. Linear prediction is a mathematical operation where future values of a discrete-time signal are estimated as a linear function of previous samples.
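As an illustration of how LPC coefficients can be computed, the sketch below uses the autocorrelation method with the Levinson-Durbin recursion. This is one standard formulation, not necessarily the exact variant used in this work, and the frame content and model order are example values.

```python
import numpy as np

def lpc(x, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.

    Returns coefficients a (with a[0] = 1) of the prediction-error filter
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    # autocorrelation lags 0..order
    r = np.array([np.dot(x[:len(x) - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                                # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]  # Levinson update, a_new[j] = a[j] + k a[i-j]
        err *= (1.0 - k * k)                          # prediction-error energy shrinks
    return a

# Example: a 10th-order LPC model for one 10 ms frame at an assumed 8 kHz rate
fs = 8000
frame = np.sin(2 * np.pi * 500 * np.arange(80) / fs)  # stand-in for a real speech frame
coeffs = lpc(frame * np.hamming(80), order=10)
```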

1.8.8 Introduction to the Graphical User Interface (GUI) in MATLAB 7.1

A graphical user interface (GUI) is a pictorial interface to a program. A good GUI can make programs easier to use by providing them with a consistent appearance and with intuitive controls like pushbuttons, list boxes, sliders, menus, and so forth. The GUI should behave in an understandable and predictable manner, so that a user knows what to expect when he or she performs an action. For example, when a mouse click occurs on a pushbutton, the GUI should initiate the action described on the label of the button.

Chapter 2 Literature Survey


2.1 Digital Filters: Previous Work

2.2 Speech Recognition Using Neural Networks: Previous Work

1. A work titled "Forming a Corpus of Voice Queries for Music Information Retrieval: A Pilot Study", developed by David Bainbridge, John R. McPherson and Sally Jo Cunningham of the Department of Computer Science, University of Waikato, Hamilton, New Zealand, which focuses on a pilot study that field-tests a procedure for collecting audio queries, uses it to refine the collection methodology, and makes a final set of queries freely available to MIR researchers.

2. A work titled "Learning Vector Quantization And Neural Predictive Coding For Nonlinear Speech Feature Extraction", developed by M. Chetouani, B. Gas and J. L. Zarader of the Laboratoire des Instruments et Systemes d'Ile-De-France, University


Paris VI, where they present a new nonlinear feature extraction method based on Learning Vector Quantization (LVQ) and Neural Predictive Coding (NPC).

3. "Analysis of Speech Recognition Techniques for use in a Non-Speech Sound Recognition System" is another work, proposed by Michael Cowling, Member, IEEE, and Renate Sitte, Member, IEEE, of Griffith University, Gold Coast, Qld, Australia 9726, which discusses the use of speech recognition techniques in non-speech sound recognition.

4. G. Rigoll of the Faculty of Electrical Engineering, Department of Computer Science, Gerhard-Mercator-University Duisburg, developed a work which presents a new neural network paradigm and its application to the recognition of speech patterns. The novel NN paradigm is a multilayer version of the well-known LVQ algorithm from Kohonen. The work is titled "Speech Recognition Experiments With A New Multilayer LVQ Network (MLVQ)".

5. "Feature Generation Based On Maximum Classification Probability For Improved Speech Recognition" is another work, which focuses on a feature generation process based on linear transformation of the original log-spectral representation. The work was developed by Xiang Li and Richard M. Stern, Department of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.

6. A work titled "Linear Predictive Coding And Cepstrum Coefficients For Mining Time Variant Information From Software Repositories", developed by Giuliano Antoniol, Vincenzo Fabio Rollo and Gabriele Venturi, RCOST, University of


Sannio, Italy, where an approach to recover time-variant information from software repositories is proposed.

7. "Speech Coding and Phoneme Classification Using MATLAB and NeuralWorks" is a work developed by Brett A. St. George, Ellen C. Wooten and Louiza Sellami, Department of Electrical Engineering, U.S. Naval Academy, Annapolis, MD, where applications involving speech coding and phonetic classification are introduced as educational tools for reinforcing signal processing concepts.

8. "Speech Recognition Using an Enhanced FVQ Based on a Codeword Dependent Distribution Normalization and Codeword Weighting by Fuzzy Objective Function", developed by Hwan Jin Choi and Yung Hwan Oh of the Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-gu, Taejon, Korea, which presents a new variant of parameter estimation methods for discrete hidden Markov models (HMMs) in speech recognition.

9. A work titled "Recognition System Based On Phonemes Using Neural Networks", developed by N. Uma Maheswari, A. P. Kabilan and R. Venkatesh, which focuses on implementing a two-module speaker-independent speech recognition system for all-British English speech. The first module performs phoneme recognition using two-level neural networks. The second module executes word recognition from the string of phonemes employing a Hidden Markov Model.

10. Another work titled "Recurrent Neural Network with Backpropagation through Time for Speech Recognition", carried out by Abdul Manan Ahmad, Saliza Ismail and Den Fairol Samaon, proposes a fully connected hidden layer between the input


and state nodes and the output. Besides that, they also investigated and showed that this hidden layer makes the learning of complex classification tasks more efficient. The work also investigated the difference between LPCC and MFCC in the feature extraction process.

11. "Speech Recognition Based on Artificial Neural Networks" is another work, developed by Veera Ala-Keturi of Helsinki University of Technology, where hidden Markov models (HMMs) and neural networks (NNs) are used together in speech recognition.

12. Mikael Boden of the School of Information Science, Computer and Electrical Engineering, Halmstad University, published a paper titled "A guide to recurrent neural networks and backpropagation", where he provided guidance on some of the concepts surrounding recurrent neural networks.

13. A work titled "Speech Recognition with Missing Data using Recurrent Neural Nets", developed by S. Parveen and P. D. Green of the Speech and Hearing Research Group, Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK, where they proposed a missing-data approach to improving the robustness of automatic speech recognition to added noise, in which an initial process identifies spectral-temporal regions which are dominated by the speech source.

14. Another work titled "Isolated Word Speech Recognition Using Vector Quantization Techniques and Artificial Neural Networks", developed by Jesus Savage, Carlos Rivera and Vanessa Aguilar of the Facultad de Ingenieria, Departamento de Ingenieria en Computacion, University of Mexico (UNAM), where they showed how to combine speech recognition techniques based on Vector Quantization (VQ)


together with Artificial Neural Networks (ANN).

15. "Application of the Elman Network and Synthetically Relational Analysis to Fault Diagnosis for Suction Fan" is another work, developed by Hong Rao, Information Engineering School, Nanchang University, Nanchang, China, Mingfu Fu, Nanchang University, Nanchang, China, and Mingxiang Xie, College of Mechanical and Energy Engineering, Zhejiang University, Hangzhou, China, where they introduce the theory of the Elman network and grey relational analysis, and a method based on synthetical relational analysis and the Elman network is presented and applied to fault diagnosis of suction fans.

16. Brad A. Hawickhorst and Stephen A. Zahorian of the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529, developed a work titled "A Comparison Of Three Neural Network Architectures For Automatic Speech Recognition", where they compare three neural network architectures, two-layer feedforward perceptrons trained with back propagation, Radial Basis Function (RBF) networks, and Learning Vector Quantization (LVQ) networks, as classifiers in automatic speech recognition.


2.3 Assamese Speech Processing: A Few Case Studies

1. A work titled "Emotion recognition from Assamese speeches using MFCC features and GMM classifier" by Aditya Bihar Kandali, Student Member, IEEE, Aurobinda Routray, Member, IEEE, and Tapan Kumar Basu, which presents a method based on a Gaussian mixture model (GMM) classifier and Mel-frequency cepstral coefficients (MFCCs) as features for emotion recognition from Assamese speeches. For training and testing of the method, data collection was carried out in Jorhat (Assam, India), which consisted of acted speeches of one short emotionally biased sentence repeated 5 times with different styles by 27 speakers (14 male and 13 female) for training, and one long emotional speech by each speaker for testing. The experiments were performed for the cases of (a) text-independent but speaker-dependent and (b) text-independent and speaker-independent recognition.

2. Another work titled "Parsing of part-of-speech tagged Assamese Texts", carried out by Mirzanur Rahman, Sufal Das and Utpal Sharma of the Department of Information Technology, Sikkim Manipal Institute of Technology, Rangpo, Sikkim-737136, India, and the Department of Computer Science and Engineering, Tezpur University, Tezpur, Assam-784028, India. They present a technique to check the grammatical structures of the sentences in Assamese text. They have made grammar rules by analyzing the structures of Assamese sentences. Their parsing program finds the grammatical errors, if any, in an Assamese sentence. If there is no error, the program will generate the parse tree for the Assamese sentence.

Chapter 3 Problem Definition

The work is related to speech recognition of isolated Assamese numerals. The work involves collection of samples, preprocessing, feature extraction, design and training of the VQ network, and testing. The primary objective is to develop a speech recognition system which has language-specific applications. Assamese is a major language in the north-eastern part of India. No specific works are reported which aim to develop a system exclusively for Assamese speech recognition. The present work provides the basic framework of a model which can be extended to achieve that objective.

3.1 Filter Structures to be Designed

The different digital filter structures to be implemented in the filter bank of our system are shown in figure 3.1. The filter bank will be used to select a particular filter which can extract the portion of the speech signal that actually carries the information.



Figure 3.1: Digital Filter Scheme

3.2 VQ Codebook Design

The primary steps involved are as below:

1. Determine the number of codewords, N, or the size of the codebook.

2. Select N codewords at random, and let that be the initial codebook. The initial codewords can be randomly chosen from the set of input vectors.

3. Using the Euclidean distance measure, cluster the vectors around each codeword. This is done by taking each input vector and finding the Euclidean distance between it and each codeword. The input vector belongs to the cluster of the codeword that yields the minimum distance.

4. Compute the new set of codewords. This is done by obtaining the average of each cluster: add the components of each vector and divide by the number of vectors in the cluster,

$$y_i = \frac{1}{m} \sum_{j=1}^{m} x_{ij} \tag{3.2.1}$$

where i is the component index of each vector (x, y, z, ... directions) and m is the number of vectors in the cluster.

5. Repeat steps 3 and 4 until either the codewords don't change or the change in the codewords is small.

3.3 Diagrammatic Representation of Present Work

Figure 3.2 shows the block diagram of the system we are going to implement. The input of the system will be the continuous-time speech signal of the Assamese numerals. The output of the system will be the text form of the speech signal.

3.4 Description of Different Blocks

3.4.1 Signal Capture

Speech samples of a person uttering the numerals under different conditions will be recorded and used as the input of the system, as shown in figure 3.3.


Figure 3.2: System Block Diagram

3.4.2 Signal Preprocessing

The signal preprocessing subsystem conditions the raw speech signal and prepares it for subsequent manipulation and analysis. This subsystem performs analog-to-digital conversion and any signal conditioning necessary. The main steps will cover computation of pre-processing parameters and sampling of the incoming continuous-time speech signal. A low-pass filter with a cut-off frequency of 4 kHz will be used to remove the redundant frequencies from the speech signal.


3.4.3 Digital Filter Bank

The digital filter bank will contain the different structures and various orders of FIR and IIR filters designed for spectral analysis. The filter which can pass the speech signal with a proper bandwidth, removing the unnecessary frequencies, will be selected, and the set of acoustic vectors will be directed towards the vector quantizer block.

3.4.4 Vector Quantization Block

The vector quantization block will contain several sub-blocks, as shown in figure 3.3. The sub-blocks are explained below.

Feature data extraction: The input of this block is the digital speech signal. It will use linear predictive coding (LPC) to generate the feature vectors. The output will be a set of acoustic vectors.

Codebook generation: Using the acoustic speech vectors as input, an LBG-algorithm-based system will design the codebook.

Distortion calculation: The acoustic vectors generated by the testing speech signal will be individually compared to the codebook. The codeword closest to each test vector will be found based on Euclidean distance. This minimum Euclidean distance, or distortion factor, will be stored until the distortion factor for each test vector has been calculated. Then the average distortion factor will be found and normalized.

Threshold generation block: This block is used to set the sensitivity level of the system. The sensitivity level is called the threshold. The average distortion factor will be multiplied by a scaling factor to generate the threshold, which is saved to file. The file to which the threshold is saved will be named after the username.

Decision making block: The threshold value will be read from disk from a file named after the user name. This threshold value will then be compared to the average distortion factor. If the average distortion factor is greater than the threshold, nothing will be recognized. If the average distortion factor is less than the threshold, the codeword having the smallest distance will be assigned to the speech signal.
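A minimal sketch of this decision logic follows. It assumes one average distortion score and one threshold per vocabulary word, whereas the report stores the threshold per user; the numeral labels and the threshold values are hypothetical placeholders.

```python
import numpy as np

def recognize(avg_distortions, thresholds, vocabulary):
    """Decision block sketch: accept only word models whose average distortion falls
    below their stored threshold; return the best accepted word, or None (rejection)."""
    avg_distortions = np.asarray(avg_distortions)
    accepted = avg_distortions < np.asarray(thresholds)
    if not accepted.any():
        return None                        # nothing recognized
    best = np.where(accepted, avg_distortions, np.inf).argmin()
    return vocabulary[best]

vocabulary = ["ek", "dui", "tini"]         # hypothetical Assamese numeral labels
print(recognize([0.42, 0.17, 0.55], thresholds=[0.30, 0.30, 0.30],
                vocabulary=vocabulary))    # -> "dui"
```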

3.5 Graphic Window

Figure 3.4 shows the diagrammatic view of the graphic window that will be used to implement the system.


Figure 3.3: Sub-blocks of the Vector quantization block


Figure 3.4: View of the graphic window

Chapter 4 Future Direction and Conclusion


The work will try to focus on speech recognition of isolated Assamese numerals spoken under varied conditions. The work will initially focus on the speaker-dependent aspect to simplify the design and formulate a framework for further improvement, which can include speaker independence and mood variations. For that the following is to be considered:

Collection of samples: Here speech samples of a person uttering the numerals under different conditions will be recorded.

Pre-processing: This step will make the recorded signal samples suitable for the subsequent stages. A host of digital filter banks will be designed which will make the signal suitable for the design of the system and application for testing.

Generation of features: Here LPC-based features will be generated.

Design of the VQ-based codebook: The VQ-based training network will be formed by using the LBG algorithm. A codebook will be generated and held as a database record. During testing, the input samples shall be passed similarly through the pre-processing and feature extraction stages. The features captured


from the input samples will be passed through the VQ network and a search will be made for the best match with the codebook records. This will form the basis of a decision device which will trigger the output.
