
Phonetically Rich Hindi Sentence Corpus for Creation of Speech Database

Vishal Chourasia
School of Computers, IPS Academy, Indore, INDIA 452012
chourasiavishal@yahoo.com

Samudravijaya K
Tata Institute of Fundamental Research, Mumbai, INDIA 400005
chief@tifr.res.in

Manohar Chandwani
IET, Devi Ahilya Vishwavidyalaya, Indore, INDIA 452011
mc.iet@dauniv.ac.in

Abstract

This paper reports on the methodology used to generate a phonetically rich Hindi text corpus. The corpus will be used as a resource for the creation of a continuous-speech, multi-speaker, large-vocabulary speech database for the Hindi language. The larger goal is to facilitate the recognition of large-vocabulary, continuous speech spoken fluently by any native speaker. This paper describes the design, structure and phonetic analysis of the text corpus for Hindi. An analysis of the phonetic richness of the sentences designed by this method is provided.

1 Introduction

Many applications require high-accuracy continuous speech recognition of spoken sentences. Statistical models used for recognition of the speech signal need to be trained with a large amount of speech data corresponding to sentences containing all phonemes of the language in all valid phonetic contexts. Thus, it is necessary to have databases that comprise appropriate sentences spoken by typical users in a realistic acoustic environment. Speech databases can be divided into two groups: (i) a database of speech normally spoken in a specific task domain; in this case, a small amount of speech is sufficient to achieve acceptable recognition accuracy; (ii) a general-purpose speech database that is not tuned to a particular task domain but consists of general text and hence can be used for recognition of any sentence in that language. To construct this kind of speech database, a phonetically rich set of sentences should be extracted from a large text database. A speech database of this kind was previously developed for Hindi [1]. However, the size and scope of that database are limited. Thus, there is a need for a larger database. The problem with most speech recognition systems is insufficient training data containing the speaking variation (spontaneous speech) caused by speaker variability (covering a large number of speakers). To overcome this problem, a large-vocabulary speech database is required to achieve robust recognition. The purpose of selecting phonetically rich sentences is to provide good coverage of pairs of phones in the sentences. The current work paves the way for the generation of a database of Hindi speech that will facilitate acoustic-phonetic studies and the training and testing of automatic speech recognition systems. It is hoped that the availability of this speech corpus will also stimulate basic research in Hindi acoustic-phonetics and phonology.

The rest of the paper is organized as follows. An outline of the design of phonetically rich sentences is given in section 2. The grapheme to phoneme conversion and the process of designing phonetically rich sentences are described in section 3. An analysis of the phonetic richness of the sentences in the text corpus is given in section 4. Some conclusions are drawn in section 5.

2 Creation of sentence corpus

Speech recognition is a special case of supervised pattern recognition. So, the models for speech recognition have to be trained with speech data that are tagged with their phonetic identity. For speaker-independent speech recognition, speech data need to be collected from a large number of speakers. It is not practical to ask speakers to read a large number of sentences that contain all phonemes (in various phonetic contexts) of the language. Hence, it is desirable to construct sets of sentences that are phonetically rich. This construction is a laborious task and is best automated. For such an automatic or semi-automatic process of designing sentences, an adequate corpus of text in the target language is a primary requirement. This section describes the methodology adopted to generate such a text corpus for the Hindi language.

2.1 An outline of the design process

Phonetically rich sentences can be selected from a large set of text. Traditional sources of text data are books, magazines and periodicals. However, we need textual data in electronic form so that it can be processed by a computer. So we have two
choices: (a) Manually type in the printed data from articles, periodicals, magazines etc., and store it in electronic form. (b) Use available online sources of text data. Examples of such sources are articles on the web and online newspapers. There are several online newspapers that provide content in the Hindi language using the Devanagari script. Examples of such electronic newspapers/magazines are webdunia, prabhasakshi, samachar, BBC, haribhoomi, amarujala, hindimilap, navbharat times etc. Each online source uses its own grapheme encoding scheme to display Hindi text using Devanagari fonts. To collect a large amount of text data we have used the Hindi news archives available at [2], and a few Hindi articles on the internet. We have collected 3 years of textual data of the newspaper to make a big text corpus. The online textual data available in the archive uses a Devanagari font named "sudipto". In order to compute the phonetic richness of sentences and select sentences, the Hindi text in the sudipto font (and the corresponding phoneme sequence) has to be represented using Roman symbols. However, a grapheme to phoneme conversion program for the sudipto font was not available. Also, information about the coding scheme was not readily available. Hence a grapheme-to-phoneme (G2P) program was written that takes care of most conventions of the sudipto font. After G2P conversion, short sentences (with 4-9 words) are filtered for possible inclusion in the text corpus. Sentence selection software was used to pick those short sentences from the big corpus that satisfy the design constraints. A software package ("devnag") [4] was used to generate a PostScript file that displays the selected Hindi sentences in Devanagari script. Then, these sentences are manually validated and edited to ensure grammatical and spelling correctness.

3 Grapheme to phoneme conversion

A grapheme is the smallest unit of written language. The set of graphemes consists of all of the letters and letter combinations that represent phonemes of the language. Hindi uses the character-based Devanagari script; a Devanagari character represents either a standalone vowel or a vowel in combination with one or more consonants. The sudipto font employs the UTF-8 encoding scheme. The G2P conversion involves generating Roman symbols representing Hindi phonemes from the UTF-8 code.

A grapheme in UTF-8 is represented as a sequence of 1 to 6 bytes in the original definition of the encoding (at most 4 bytes since RFC 3629). In the current case, all ASCII characters are represented as a single byte.

A Devanagari grapheme is represented by a code that is either 2 or 3 bytes long. All vowel modifiers (मात्रा) and most pure consonants are encoded using 2 bytes. The two-byte codes cover most of the Hindi graphemes.

Three-byte sequences represent the standalone vowels [अ, आ, ए, ऐ, उ, ऊ], some consonants [छ, ज, क्ष, च, ट, ड] and the "danda viram" (full stop) [।].

Unlike the Roman script, the Devanagari script is not linear. Ligatures represent consonant clusters. Moreover, the script is not causal: the order of display of graphemes does not strictly represent the order of phonemes of the language. Thus, special care has to be taken while writing the G2P program. Such special features of the Devanagari script are illustrated below.

Byte sequence   Symbol   Grapheme    Description
195 151         i        ि           Small "i" matra (मात्रा)
194 174         n        न्           Consonant half "n"
195 150         a        ा           Vertical bar (vowel /a/), completion of character "na"
194 164         d        द           Consonant /d/
195 188         a        -           Schwa for completion of character "da"
195 150         a        ा           Vertical bar (part of the grapheme cluster representing /r/ and /o/)
195 172         ro       ो + reph    Rapha with "o" matra (मात्रा)
195 130         S        ष्           Consonant half "sh"
195 150         a        ा           Vertical bar for completion of character "sha"

Table 1 G2P conversion of the word निर्दोष

Consider the process of grapheme to phoneme conversion of the word nirdoSa (निर्दोष, meaning: innocent). In the case of the sudipto font, the
input byte sequence of the word corresponds to the following decimal representation.

195 151 194 174 195 150 194 164 195 188 195 150 195 172 195 130 195 150

Table 1 illustrates the grapheme to phoneme conversion according to the byte sequence generated for the word 'nirdoSa' (निर्दोष).

According to this table, a simple-minded conversion of the byte sequence to a symbol sequence will generate 'i n a d a a r o S a', because this corresponds to the sequence of graphemes as normally written. When 'nirdoSa' (निर्दोष) is written in Hindi, the 'i' matra (ि) is written first because it is displayed before the consonant /n/ although, in the spoken language, the vowel follows the consonant. Next come the consonant 'n' (न्) and the vertical bar 'a' (ा), which together construct the character 'na' (न). Then come 'd' and 'a' (द), followed by the matra 'ro', that is, the 'o' matra with rapha 'r' (ो with reph), and finally 'Sa' (ष), which completes the word. The G2P conversion program has to manage this violation of causality and generate the correct sequence of phonemes of the text.

In the above example, notice that the two-byte sequences each represent a single grapheme; the vowel 'a' with byte sequence 195 188 comes after 'd' (194 164) for the completion of the character /da/. The consonant 'S' (195 130) is likewise followed by 'a' (195 150) for its completion as 'Sa'. In this way Devanagari characters are constructed as combinations of graphemes.

There are words where a grapheme is represented by a 3-byte sequence. For example, when a word begins with a vowel (a 'standalone' vowel), the vowel is represented by a special grapheme. These graphemes are encoded by 3-byte codes.

3.1 Phoneme with multiple graphemes

In Hindi, some phonemes are associated with multiple graphemes. Such a phoneme is represented differently in the written language, but its pronunciation is the same. The different graphemes are used in distinct contexts, such as in combination with anuswar, consonants or vowels. A notable example of such phonemes is /r/; there are many graphemes representing /r/ in different phonemic contexts. Each such grapheme is allocated a unique 2-byte UTF-8 code. For example, the phoneme /r/ in each of the following words is associated with a different grapheme: कर [k a r (tax)], नेत्र [n e t r a (eye)], राष्ट्र [r aa S t r a (nation)], रुमाल [r u m aa l (handkerchief)], संपर्क [s a m p a r k (contact)], पार्किंग [p aa r k I ng (parking)], पर्दे [p a r d e (curtain)], दृष्टी [d r I s t I (eye-sight)], क्रम [k r a m (sequence)], ऋण [R N (loan)]. The phoneme sequence as well as the English word corresponding to a word is shown inside the square brackets after the script.

3.2 Criteria for selection of sentences

We have adopted certain criteria to select the sentences for the dictionary creation. Only short sentences (with a minimum of 4 and a maximum of 10 words) are picked from the large text corpus. The sentences are manually inspected to check that each one does not sound artificial, is meaningful, and does not contain any offensive or sensitive words. Only those sentences which fulfil all the above criteria are retained.

3.3 Phonetic Richness of Sentences

Phonetically rich sentences are needed for robust estimation of the parameters of statistical models of context-sensitive phonemes. A phoneme is context sensitive if it is associated with several models depending on the identity of the preceding and/or following phoneme. A triphone is characterized by both left and right context. If there are M phonemes in a language, there can be M^n triphones, where n = 3. A language may not permit all M^3 triphones, though.

A set of sentences is considered to be phonetically rich if it contains all permissible triphones of the language in sufficient quantity. While some triphones occur abundantly in natural sentences, quite a few triphones are rare. Thus one has to design/select sentences which are rich in such rare triphones.

For developing a speaker-independent recognition system, it is necessary to collect speech data from a large number of speakers. However, speakers are generally reluctant to read a large set of sentences. Hence the primary goal of this work is to generate small sets of sentences, each of which can be conveniently read by one person. It is not possible to cover all triphones in one small sentence set. Hence in this work, we consider a sentence set to be phonetically rich if it contains most, if not all, phonemes of the language. Thus, special effort has to be made to include as many words with rare phonemes as possible.
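
As an illustration only (this is not the authors' program), the length limits of Section 3.2 and the richness notions of Section 3.3 can be sketched in Python. The function names and the toy phoneme inventory are assumptions made for this sketch. Triphones are counted inside a single phoneme sequence only, which ignores cross-word contexts.

```python
def within_length_limits(words, min_words=4, max_words=10):
    """True if a sentence has between min_words and max_words words (Section 3.2)."""
    return min_words <= len(words) <= max_words


def triphone_upper_bound(num_phonemes):
    """Upper bound M**3 on the number of triphones for M phonemes (Section 3.3)."""
    return num_phonemes ** 3


def triphones(phonemes):
    """Set of (left, centre, right) triples found in one phoneme sequence."""
    return set(zip(phonemes, phonemes[1:], phonemes[2:]))


def phoneme_coverage(sentence_set, inventory):
    """Fraction of the phoneme inventory that occurs in a set of sentences.

    Each sentence is a list of phoneme symbols; a value close to 1.0 means the
    set is phonetically rich in the relaxed sense used in Section 3.3.
    """
    seen = set()
    for sentence in sentence_set:
        seen.update(sentence)
    return len(seen & set(inventory)) / len(inventory)


if __name__ == "__main__":
    # Hypothetical 7-symbol inventory, chosen only to keep the example small.
    toy_inventory = ["k", "a", "r", "m", "n", "e", "t"]
    toy_set = [["k", "a", "r"], ["k", "r", "a", "m"]]

    print(within_length_limits("one two three four five".split()))  # True
    print(triphone_upper_bound(41))                                  # 68921
    print(triphones(["k", "r", "a", "m"]))
    print(phoneme_coverage(toy_set, toy_inventory))                  # 4 of 7 symbols seen
```

With the 41 phonemes used in the analysis of this paper, the upper bound is 41^3 = 68,921 triphones, which shows why a small set of sentences is assessed on phoneme coverage rather than on full triphone coverage.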
3.4 Transliteration convention

The collected set of sentences is first converted to the Devanagari convention. We converted the Roman symbols representing Hindi phonemes into the corresponding Devanagari script according to the Velthuis transliteration scheme [4]. This helped us to visualize the text in Devanagari script and validate the text. In addition, speakers find it convenient to read Hindi sentences in Devanagari script.

3.5 Selection of sentences

In order to create a large phonetically rich speech database, collection of a large text corpus is a prerequisite. Here, we took sentences from online news and articles from the web. This helped us to collect around 350,000 sentences, each containing between 4 and 10 words. A sentence should contain a minimum of 4 words so that it forms a standalone phrase or sentence. Similarly, the maximum number of words is restricted to 10 because long sentences tend to be complex and it is difficult for a reader to read such sentences fluently and naturally.

3.5.1 Sentence selection tool

After collection of a large number of sentences, there is a need to select those sentences which are phonetically rich in the sense that all the phonemes must be present at least once in each sentence set. In order to achieve this, we used the corpus selection program "corpusCrt" [3]. Given a text corpus, this software tool generates sets of phonetically rich sentences that satisfy (as far as possible) the phonetic context criteria specified by the user. In our case, we have collected 50,000 phonetically rich sentences (5000 sets of sentences; each set contains 10 sentences). In other words, on average, one out of seven sentences is selected by the program. Each set of 10 sentences contains almost all the phonemes of the Hindi language. After applying the corpusCrt tool to a corpus of sentences, statistics of the selected sentences are gathered. These are the number of distinct units (phones) in each sentence set, the frequency of occurrence of each phone, the frequently occurring phones as well as the rarely occurring phones, and the total number of units (phones) in a set.

4 Corpus Analysis

From a set of 350,000 sentences, the corpusCrt sentence selection tool selected 5000 sets of phonetically rich sentences; each set contains 10 sentences. So we have a collection of 50,000 phonetically rich sentences covering the 41 phonemes of the Hindi language. The collected sentences are said to be phonetically rich because most (preferably all) of the 41 phonemes are present in each sentence set.

4.1 Phonetic richness of sentence sets

Statistics are gathered about the number of distinct phonemes in each sentence set. There are 41 distinct phonemes in the present analysis. It may be noted that there are 10 aspirated plosives in Hindi (5 voiced and 5 unvoiced). The voiced aspirated plosives occur rarely. Figure 1 shows the phonetic richness of the sentence sets in the form of a histogram. It shows the frequency count of
sentence sets (out of 5000) that contain different numbers of distinct phonemes. The number of distinct phonemes (in a sentence set) is shown on the x-axis, and the percentage of sentence sets containing that many distinct phonemes is on the y-axis of the histogram. It is found that about 4% of the sentence sets contain all the 41 phonemes. Only one phoneme is missing in 16% of the sets. The percentages of sentence sets containing 38 and 39 distinct phonemes are 25% and 27% respectively. The histogram shows that almost 72% of the sets contain at least 37 distinct phonemes. Thus the sentence sets derived in this work are phonetically rich. It may be noted that every sentence set contains a minimum of 32 distinct phonemes.

4.2 Distribution of rare phonemes

After gathering the statistics of distinct phonemes in the sentence sets, we identified rare phonemes, whose frequency of occurrence in the sentence sets is small. Figure 2 shows the percentage of sentence sets in which rare phonemes occurred. The x-axis represents the rare phonemes; the y-axis represents the percentage of sentence sets. Only those phonemes which occur in less than 90% of the sentence sets are designated as rare phonemes. The figure shows that the voiced, aspirated, retroflex plosive /Dh/ occurred in only 40% of the sentence sets; the aspirated, unvoiced, dental plosive /th/ occurred in about 80% of the sentence sets. So these two phonemes are considered the rarest phonemes in the collected sentence sets. The phonemes /S/, /N/, /O/, /kh/, /dh/, /ph/, /U/, /D/ are also considered rare phonemes as they are present in less than 90% of the sentence sets. Analysis of the sentence sets with this criterion yields 10 rare phonemes out of the 41 phonemes of the Hindi language.

5 Conclusion

Training of statistical models for automatic speech recognition requires a large amount of speech data that is rich in phonetic context. It is possible to design small sets of sentences that are not only phonetically rich but also convenient for speakers to read. In this work, a nearly automatic method has been employed to design 50,000 such sentences, starting from online texts that are coded for display in Devanagari script. The methodology is scalable to the generation of a larger corpus. Collection of speech data using these sentences would lead to the development of better-performing Hindi speech recognition systems.

References

[1] Samudravijaya K, P.V.S. Rao and S.S. Agrawal, "Hindi Speech Database", Proc. Int. Conf. Spoken Language Processing, ICSLP00, Beijing, October 2000, CDROM:00192.pdf (http://speech.tifr.res.in/publications.html#sd).
[2] http://navbharattimes.indiatimes.com/Archives
[3] http://gps.tsc.upc.es/veu/personal/sesma/sesma/CorpusCrt.php3
[4] ftp://ftp.tex.ac.uk/tex.archive/language/devnagari/velthuis
[5] UTF-8 http://www.cl.cam.ac.uk/~mgk25/unicode.html
