
SCRIPT TO SPEECH SYNTHESIS FOR MARATHI LANGUAGE

1. Prof. N.B. Pasalkar


Director of Technical Education, Maharashtra State, Mumbai.
director@dte.org.in
2. Prof. (Mrs.) C.V.Joshi
Professor in Electronics, Pune Institute of Engineering and Technology, Pune.
joshi_chhaya@hotmail.com
3. Ms. Ashwinee S. Jagtap
Student, Pune Institute of Engineering and Technology, Pune.
shwini2003@yahoo.co.in

ABSTRACT
This paper introduces Digital Signal Processing (DSP) approaches to script-to-speech synthesis for the Indian language Marathi, with the aim of improving performance. Hybrid speech synthesis systems that use both Neural Network and DSP models can perform better than systems that use either model alone. Speech processing is often difficult because of the large amount of data needed to represent speech. We have used a speech compression algorithm that removes redundancy from the data in a way that still allows the speech to be reconstructed.
1. INTRODUCTION
The benefits of the information technology revolution can be enjoyed by the masses in India only when human-oriented interfaces to computers are developed and deployed. Spoken language is still the primary means of communication among humans, so it is natural for people to expect speech interfaces to computers. Spoken dialogue with machines requires significant advances in integrating speech input/output technologies with natural language processing technologies. More than 300 million people use the Devnagarii script, yet research on optical character recognition and speech synthesis for it is still at an early stage.
Natural Language (NL) research has largely been pursued in computer science and linguistics departments, where the goal is to model language understanding, motivated by a desire to understand cognitive processes. Speech processing research, on the other hand, has largely been practiced in engineering departments with practical applications in mind. There, techniques motivated by knowledge of human processes have been less important than techniques that can be developed or tuned automatically, and broad coverage of a representative sample is more important than coverage of any particular phenomenon. Integrating speech processing and NL therefore means overcoming not only technical challenges but also the differences between the two groups in motivation, interests, theoretical underpinnings, techniques, tools, and criteria for success.
2. NATURE OF THE MARATHI ALPHABET
Marathi is the official language of the state of Maharashtra; towards the end of the last century, around 72 million people used Marathi. Marathi is written in the Devnagarii script, which is phonetic in nature [2,10]. Because speech processing and NL must work together, it is important, when developing a script-to-speech system for Marathi, to understand the nature of the Marathi alphabet.
2.1. The Consonants Marathi consonants have an implicit vowel (a) included in them. They are categorized according to their phonetic properties into 5 Vargs (groups) and the non-Varg consonants. Each Varg contains 5 consonants, the last of which is a nasal. The first four consonants of each Varg form two pairs, a primary pair and a secondary pair. The second consonant of each pair is the aspirated counterpart of the first (it has an additional h sound).
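The Varg layout maps directly onto the Unicode Devanagari block, where each Varg occupies five consecutive code points. The following Python sketch assumes Unicode-encoded text (the paper does not state which encoding its system uses); the helper names are hypothetical. It finds a consonant's Varg, its aspirated partner, and the nasal that closes the Varg.

```python
# Unicode code point of the first consonant of each Varg; each Varg spans
# 5 consecutive code points. U+0929 (NNNA) lies between the dental and labial
# Vargs, so the labial Varg starts at U+092A.
VARG_STARTS = {
    "velar": 0x0915,      # ka kha ga gha nga
    "palatal": 0x091A,    # ca cha ja jha nya
    "retroflex": 0x091F,  # Ta Tha Da Dha Na
    "dental": 0x0924,     # ta tha da dha na
    "labial": 0x092A,     # pa pha ba bha ma
}


def varg_position(ch):
    """Return (varg_name, index 0-4) for a Varg consonant, or None otherwise."""
    cp = ord(ch)
    for name, start in VARG_STARTS.items():
        if start <= cp < start + 5:
            return name, cp - start
    return None


def aspirated_counterpart(ch):
    """Return the aspirated partner of an unaspirated Varg consonant, else None."""
    pos = varg_position(ch)
    # Positions 0 and 2 are the first members of the primary and secondary pairs.
    if pos is None or pos[1] not in (0, 2):
        return None
    return chr(ord(ch) + 1)


def varg_nasal(ch):
    """Return the nasal (5th member) of the Varg that ch belongs to, else None."""
    pos = varg_position(ch)
    if pos is None:
        return None
    return chr(VARG_STARTS[pos[0]] + 4)


if __name__ == "__main__":
    for consonant in "\u0915\u0917\u0924\u092a":  # ka, ga, ta, pa
        print(consonant, aspirated_counterpart(consonant), varg_nasal(consonant))
```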
2.2. Anuswar ( ं ) The Anuswar indicates a nasal consonant sound. When an Anuswar comes before a consonant belonging to one of the 5 Vargs, it represents the nasal consonant of that Varg. Before a non-Varg consonant, however, the Anuswar represents a different nasal sound.
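The Anuswar rule is a small lookup once the Vargs are known. A minimal sketch, assuming Unicode text; since the paper does not specify the nasal used before non-Varg consonants, that case is returned as the hypothetical label `NON_VARG_NASAL`, and a word-final Anuswar (not covered by the rule) is returned as `None`.

```python
# First code point of each Varg (velar, palatal, retroflex, dental, labial);
# each Varg spans 5 consecutive code points, the last being its nasal.
VARG_STARTS = (0x0915, 0x091A, 0x091F, 0x0924, 0x092A)
ANUSWAR = "\u0902"
NON_VARG_NASAL = "NON_VARG_NASAL"  # hypothetical placeholder label


def nasal_for(next_char):
    """Nasal represented by an Anuswar that precedes next_char."""
    cp = ord(next_char)
    for start in VARG_STARTS:
        if start <= cp < start + 5:
            return chr(start + 4)  # the nasal that closes this Varg
    # Anything outside the Vargs is treated as the non-Varg case.
    return NON_VARG_NASAL


def resolve_anuswars(word):
    """Return, for each Anuswar in word, the nasal sound it stands for."""
    sounds = []
    for i, ch in enumerate(word):
        if ch == ANUSWAR:
            # A word-final Anuswar has no following consonant; return None for it.
            sounds.append(nasal_for(word[i + 1]) if i + 1 < len(word) else None)
    return sounds


if __name__ == "__main__":
    print(resolve_anuswars("\u0905\u0902\u0915"))  # Anuswar before ka (velar Varg)
```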

2.3. Visarg ( : ) The Visarg comes after a vowel sound and represents a sound similar to h.
2.4. Vowels & Vowel signs (Matras) There are separate symbols for all the vowels in the Devnagarii script, used when a vowel is pronounced independently (either at the beginning of a word or after a vowel sound). The consonants in the Devnagarii script themselves have an implicit vowel (a). To indicate a vowel sound other than the implicit one, a vowel sign (Matra) is attached to the consonant. Thus there are equivalent Matras for all the vowels except the implicit vowel a itself.
2.5. Vowel Omission Sign: Halant ( ् ) In the Devnagarii script, consonants are assumed to carry an implicit vowel a unless an explicit Matra (vowel sign) is attached. A special sign, the Halant, is therefore needed to indicate that a consonant does not carry the implicit vowel.
2.6. Conjuncts The Devnagarii script contains numerous conjuncts, which are essentially clusters of up to four consonants without intervening implicit vowels. The shape of a conjunct can differ from the shapes of its constituent consonants.
2.7. Diacritic Mark: Nukta ( . ) The Nukta is used for ( ) and ( ) characters.
2.8. Punctuation All punctuation marks used in the Devnagarii script are borrowed from English.
2.9. Numerals The Devnagarii script has its own numerals.
3. OPTICAL CHARACTER RECOGNITION
The script-to-speech conversion aims to read out scanned images of Devnagarii text. The Optical Character Recognition (OCR) system converts scanned images into text files. For recognition we have used a back-propagation neural network: a multilayer feed-forward network with an input layer, one or more middle (hidden) layers, and an output layer. In a feed-forward network the connections go only from a node to a node with a higher number, so signals flow from the input layer towards the output layer. Back-propagation computes the gradient vector of a fitting criterion (the network error) with respect to the weights by propagating the output errors backwards through the layers; the weights are then adjusted in the direction that reduces the error.
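The paper does not give its network size, input features, activation function, or learning rate, so the following is only a minimal illustration of back-propagation: a one-hidden-layer sigmoid network in plain Python, trained by per-sample gradient descent. `TinyMLP`, `total_error`, and the toy OR dataset standing in for character feature vectors are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Hypothetical input -> hidden -> output network with sigmoid units."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row ends with a bias weight (its input is fixed at 1.0).
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        """Feed-forward pass: signals move from the input layer towards the output layer."""
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = h + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return xb, hb, y

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update for the squared error 0.5 * sum((y - t)^2)."""
        xb, hb, y = self.forward(x)
        # Error terms at the output units (error derivative w.r.t. each unit's net input).
        d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
        # Propagate error terms backwards to the hidden units, using the current weights.
        d_hid = [hb[j] * (1.0 - hb[j]) * sum(d_out[k] * self.w_out[k][j] for k in range(len(d_out)))
                 for j in range(len(self.w_hidden))]
        # Gradient descent: move each weight against its error gradient.
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] -= lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] -= lr * d_hid[j] * xb[i]


def total_error(net, data):
    return sum(0.5 * sum((yk - tk) ** 2 for yk, tk in zip(net.forward(x)[2], t)) for x, t in data)


if __name__ == "__main__":
    # Hypothetical toy data (logical OR) standing in for character feature vectors.
    data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (1,))]
    net = TinyMLP(2, 3, 1, seed=1)
    print("error before:", total_error(net, data))
    for _ in range(2000):
        for x, t in data:
            net.train_step(x, t)
    print("error after:", total_error(net, data))
```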
To read a paragraph, we split it into lines using a horizontal histogram [6], and then split the lines into words using a vertical histogram [6]. The words are separated into characters, again using a vertical histogram. These characters are recognized by the neural network and then spoken by the speech synthesis module.
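A minimal sketch of projection-histogram segmentation, assuming a binarized page stored as a list of rows with 1 for ink and 0 for background. The word-gap threshold `word_gap` and all function names are hypothetical; the paper does not specify them. One practical caveat: in Devnagarii text the headline (shirorekha) joins the characters of a word, so column gaps between characters generally appear only after the headline rows are removed, a step this sketch omits.

```python
def runs(profile):
    """(start, end) pairs, end exclusive, of maximal runs of non-zero entries."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


def merge_close(segments, min_gap):
    """Merge neighbouring segments separated by fewer than min_gap empty positions."""
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def horizontal_histogram(img):
    """Number of ink pixels in each row."""
    return [sum(row) for row in img]


def vertical_histogram(img, top, bottom):
    """Number of ink pixels in each column, restricted to rows top..bottom-1."""
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(len(img[0]))]


def segment_page(img, word_gap=3):
    """Return [((top, bottom), [[char column spans of word 1], ...]), ...] per text line."""
    page = []
    for top, bottom in runs(horizontal_histogram(img)):      # lines
        char_spans = runs(vertical_histogram(img, top, bottom))  # characters
        words = merge_close(char_spans, word_gap)            # wide gaps separate words
        line = [[c for c in char_spans if wl <= c[0] and c[1] <= wr] for wl, wr in words]
        page.append(((top, bottom), line))
    return page
```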
4. SPEECH SYNTHESIS
We store the consonants and vowels separately and then recognize them, exploiting the purely phonetic nature of the Marathi language. The signal for each Marathi alphabetic character is analysed into phonemes (pure consonants and vowels). Parameters such as amplitude, pitch, frequency, and time period of each signal are analysed and related to one another after the noise in the signal has been reduced.
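Because consonants and vowels are stored separately, recognized text has to be decomposed into a sequence of pure consonants and vowels using the rules of Section 2 (implicit vowel a, Matras, Halant). A minimal sketch, assuming Unicode input and a hypothetical function `to_phonemes`; it covers only the common Marathi Matras and ignores schwa deletion (in everyday Marathi pronunciation some implicit a vowels, such as many word-final ones, are not spoken), so it produces a strictly letter-by-letter reading.

```python
VIRAMA = "\u094d"      # Halant: suppresses the implicit vowel
NUKTA = "\u093c"
IMPLICIT_A = "\u0905"  # independent vowel a
# Matra -> corresponding independent vowel (common Marathi Matras only).
MATRA_TO_VOWEL = {
    "\u093e": "\u0906",  # aa
    "\u093f": "\u0907",  # i
    "\u0940": "\u0908",  # ii
    "\u0941": "\u0909",  # u
    "\u0942": "\u090a",  # uu
    "\u0943": "\u090b",  # vocalic r
    "\u0945": "\u090d",  # candra e
    "\u0947": "\u090f",  # e
    "\u0948": "\u0910",  # ai
    "\u0949": "\u0911",  # candra o
    "\u094b": "\u0913",  # o
    "\u094c": "\u0914",  # au
}


def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"


def is_vowel(ch):
    return "\u0905" <= ch <= "\u0914"


def to_phonemes(text):
    """Split Devnagarii text into ('C', consonant) and ('V', vowel) units, plus markers."""
    units, i = [], 0
    while i < len(text):
        ch = text[i]
        if is_consonant(ch):
            cons = ch
            i += 1
            if i < len(text) and text[i] == NUKTA:   # keep the Nukta with its consonant
                cons += NUKTA
                i += 1
            units.append(("C", cons))
            if i < len(text) and text[i] in MATRA_TO_VOWEL:
                units.append(("V", MATRA_TO_VOWEL[text[i]]))  # explicit vowel sign
                i += 1
            elif i < len(text) and text[i] == VIRAMA:
                i += 1                                # Halant: no vowel follows
            else:
                units.append(("V", IMPLICIT_A))       # implicit vowel a
            continue
        if is_vowel(ch):
            units.append(("V", ch))
        elif ch == "\u0902":
            units.append(("ANUSWAR", ch))
        elif ch == "\u0903":
            units.append(("VISARG", ch))
        elif ch.isspace():
            units.append(("PAUSE", " "))
        # Other characters (digits, punctuation) are ignored in this sketch.
        i += 1
    return units
```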
The pronunciation of punctuation marks differs markedly depending on where they occur, and their effect is determined by context. For example, a period at the end of a sentence has a major effect on sentence prosody, but a period at the end of an abbreviation, in a decimal number, or after a middle initial in a name has a very different significance and must not be pronounced as a sentence termination. All of these phenomena can be handled reasonably successfully, but they require an extensive, non-algorithmic computer program. Such a program captures facts about normal written Marathi that almost every literate person knows implicitly. It probably does not provide a good model of our mental processing, but it does model human knowledge. We handle punctuation marks by changing the sampling rate of the sentence.
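The period cases listed above can be separated by a few context rules before prosody is applied. A minimal heuristic sketch, not the paper's program: `period_role` and the small `ABBREVIATIONS` set are hypothetical, and a real system would need a much larger, Marathi-specific rule set. Only periods classified as `sentence_end` would then receive sentence-final treatment.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Dr.", "Prof.", "Mr.", "Mrs.", "Ms.", "etc."}


def period_role(text, i):
    """Classify the period at text[i] as 'decimal', 'abbreviation', 'initial' or 'sentence_end'."""
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    if prev_ch.isdigit() and next_ch.isdigit():
        return "decimal"       # e.g. 3.14 (str.isdigit also accepts Devanagari digits)
    token = text[text.rfind(" ", 0, i) + 1 : i + 1]  # the word ending at this period
    if token in ABBREVIATIONS:
        return "abbreviation"  # e.g. Prof.
    if len(token) == 2 and token[0].isalpha() and next_ch == " ":
        return "initial"       # e.g. the N. in "N. B. Pasalkar"
    return "sentence_end"
```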

Speech processing is often very difficult because of the large amount of data used to represent speech. Speech compression algorithms aim to remove redundancy from the data in a way that still makes speech reconstruction possible [7,9]. This is called information-preserving compression. The algorithm we used for data compression and speech reconstruction is shown in Figure 1.

The first step removes the information redundancy caused by the high correlation of speech data; we used the Discrete Cosine Transform (DCT) for this compression step. The second step is coding of the data. The compressed data are archived, and are later decoded and reconstructed. We have taken care that no non-redundant speech data are lost in the compression process.

Figure 1. Data compression & speech reconstruction: digitized speech → data redundancy reduction → coding → archiving → decoding → reconstruction → digitized speech


Discrete Cosine Transform - The DCT of a speech signal has much of its magnitude concentrated at lower frequencies, including the dc (direct current) term, with little or no contribution from high frequencies, because most speech energy is contained in the lower frequencies [4,5]. This is the significance of the DCT frequency representation: high-frequency components can either be eliminated or be represented with fewer bits.
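The two-step scheme of Figure 1 can be sketched in plain Python using the orthonormal DCT defined in this section. This is a minimal illustration, not the authors' coder: the number of retained coefficients `keep` and the quantization step `step` are hypothetical parameters the paper does not give, and the "archive" is simply a list of integer codes. Because high-frequency coefficients are dropped and the rest quantized, this sketch is lossy; its reconstruction error is the energy in the discarded coefficients plus a small quantization term.

```python
import math

N = 128  # frame length (the paper uses a 128-point DCT)


def _c(k, n_pts):
    """Normalization factor C[k] of the orthonormal DCT."""
    return math.sqrt(1.0 / n_pts) if k == 0 else math.sqrt(2.0 / n_pts)


def dct(frame):
    n_pts = len(frame)
    return [_c(k, n_pts) * sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * n_pts))
                               for n, x in enumerate(frame))
            for k in range(n_pts)]


def idct(coeffs):
    n_pts = len(coeffs)
    return [sum(_c(k, n_pts) * v * math.cos(math.pi * (2 * n + 1) * k / (2 * n_pts))
                for k, v in enumerate(coeffs))
            for n in range(n_pts)]


def compress(samples, keep=32, step=1e-3):
    """Redundancy reduction (keep the `keep` lowest DCT coefficients per frame),
    then coding (quantize them to integers with step size `step`)."""
    archive = []
    for start in range(0, len(samples), N):
        frame = list(samples[start:start + N])
        frame += [0.0] * (N - len(frame))  # zero-pad the final frame
        low = dct(frame)[:keep]
        archive.append([round(c / step) for c in low])
    return archive


def decompress(archive, length, step=1e-3):
    """Decoding (dequantize, restore zeroed high frequencies) and reconstruction (inverse DCT)."""
    out = []
    for codes in archive:
        coeffs = [q * step for q in codes] + [0.0] * (N - len(codes))
        out.extend(idct(coeffs))
    return out[:length]


if __name__ == "__main__":
    signal = [math.sin(2 * math.pi * 3 * n / N) for n in range(2 * N)]
    rec = decompress(compress(signal), len(signal))
    print("max error:", max(abs(a - b) for a, b in zip(signal, rec)))
```

Keeping more coefficients or using a finer step trades a larger archive for a more faithful reconstruction.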
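The one-dimensional DCT [1,5] of a sequence $\{U[n],\ 0 \le n \le N-1\}$ is defined as

$$V[k] = C[k] \sum_{n=0}^{N-1} U[n] \cos\left[\frac{\pi (2n+1) k}{2N}\right], \qquad 0 \le k \le N-1,$$

where

$$C[0] = \sqrt{\frac{1}{N}}, \qquad C[k] = \sqrt{\frac{2}{N}} \quad \text{for } 1 \le k \le N-1.$$

The inverse transformation is given by

$$U[n] = \sum_{k=0}^{N-1} C[k]\, V[k] \cos\left[\frac{\pi (2n+1) k}{2N}\right], \qquad 0 \le n \le N-1.$$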
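The forward and inverse formulas can be checked numerically by transcribing them directly. A minimal sketch in plain Python (function names are hypothetical); it is an O(N²) transcription for clarity, not the fast recursive structures of [1]. A random sequence stands in for speech samples.

```python
import math
import random


def C(k, N):
    """Normalization factor C[k] from the definition above."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def forward_dct(U):
    """V[k] = C[k] * sum_n U[n] cos(pi (2n+1) k / 2N), for 0 <= k <= N-1."""
    N = len(U)
    return [C(k, N) * sum(U[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
            for k in range(N)]


def inverse_dct(V):
    """U[n] = sum_k C[k] V[k] cos(pi (2n+1) k / 2N), for 0 <= n <= N-1."""
    N = len(V)
    return [sum(C(k, N) * V[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for k in range(N))
            for n in range(N)]


if __name__ == "__main__":
    rng = random.Random(0)
    U = [rng.uniform(-1.0, 1.0) for _ in range(128)]  # stand-in for 128 speech samples
    U_back = inverse_dct(forward_dct(U))
    print(max(abs(a - b) for a, b in zip(U, U_back)))  # round-trip error, close to 0
```

For a constant input, all energy lands in the dc coefficient V[0], which illustrates the energy concentration discussed above.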

We have used a 128-point DCT, i.e. N = 128 in our case. Thus U[n] can be represented as a weighted sum of 128 cosine functions, the basis cosine functions. Each of these discrete functions is simply a continuous cosine function sampled at 128 points. These functions have several important properties:
Complete: A weighted sum of the 128 functions can be found for any 128 sample values.
Minimal: None of the 128 waveforms can be represented by any weighted combination of the others, and all 128 are required for completeness.
Unique: No set of cosine waveforms other than scaled versions of these 128 waveforms can be used to represent all possible sequences of 128 sample values.
These properties apply in the general case of using N cosine functions to represent a one-dimensional (mono) WAV file of N samples. That is, any discrete sequence $\{U[n],\ 0 \le n \le N-1\}$ can be represented by the set of discrete cosine functions $\{\cos[\pi(2n+1)k/(2N)],\ 0 \le k \le N-1\}$.
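Orthonormality underlies the completeness and minimality properties: N mutually orthonormal vectors of length N are linearly independent, so none is a combination of the others, and they therefore span every sequence of N samples, with the weights given uniquely by inner products. A minimal numerical check in plain Python; the function names are hypothetical.

```python
import math


def basis(N):
    """Rows are the N scaled cosine basis functions C[k] cos(pi (2n+1) k / 2N), n = 0..N-1."""
    return [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
             * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)]
            for k in range(N)]


def max_gram_deviation(N):
    """Largest |<b_j, b_k> - delta_jk| over all pairs of basis functions."""
    B = basis(N)
    worst = 0.0
    for j in range(N):
        for k in range(j, N):
            dot = sum(a * b for a, b in zip(B[j], B[k]))
            worst = max(worst, abs(dot - (1.0 if j == k else 0.0)))
    return worst


def weights(U):
    """Weights expressing U as a sum of the basis functions (inner products with each)."""
    B = basis(len(U))
    return [sum(u * b for u, b in zip(U, row)) for row in B]


if __name__ == "__main__":
    print("N = 128, max deviation from orthonormality:", max_gram_deviation(128))
```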
5. CONCLUSION
Implementation of both the OCR and the speech systems is under way. We hope that integrating the two systems will help, first and foremost, those who need it most, such as illiterate and blind persons, and, in the longer term, government offices in Maharashtra working towards office automation and e-governance.

6. REFERENCES

1. Chen C.-H., Liu B.-D., Yang J.-F., Wang J.-L., Efficient Recursive Structures for Forward and Inverse Discrete Cosine Transform, IEEE Transactions on Signal Processing, Vol. 52, No. 9, September 2004.
2. Indian Script Code for Information Interchange (ISCII), Bureau of Indian Standards, New Delhi.
3. Oppenheim A. V., Schafer R. W., Digital Signal Processing, Prentice Hall of India Pvt. Ltd., New Delhi.
4. Proakis J. G., Rader C. M., Ling F., Nikias C. L., Advanced Digital Signal Processing, Macmillan Publishing Company, U.S.
5. Sayood K., Introduction to Data Compression, Harcourt India Pvt. Ltd., New Delhi.
6. Sonka M., Hlavac V., Boyle R., Image Processing, Analysis, and Machine Vision, Thomson Asia Pvt. Ltd., Singapore.
7. www.data-compression.com/speech.shtml
8. www.speech.cs.cmu.edu/
9. www.datacompression.info/speech/
10. http://www.ncb.ernet.in/matrubhasha/
