Copyright: Attribution Non-Commercial (BY-NC)
Phonetically Rich Hindi Sentence Corpus for Creation of Speech Database
Vishal Chourasia, School Of Computers, IPS Academy, Indore, INDIA 452012, chourasiavishal@yahoo.com
Samudravijaya K, Tata Institute of Fundamental Research, Mumbai, INDIA 400005, chief@tifr.res.in
Manohar Chandwani, IET, Devi Ahilya Vishwavidyalaya, Indore, INDIA 452011, mc.iet@dauniv.ac.in
Abstract

This paper reports on the methodology used to generate a phonetically rich Hindi text corpus. The corpus will be used as a resource for creating a continuous-speech, multi-speaker, large-vocabulary speech database for the Hindi language. The larger goal is to facilitate recognition of large-vocabulary continuous speech spoken fluently by any native speaker. This paper describes the design, structure and phonetic analysis of the text corpus for Hindi. An analysis of the phonetic richness of the sentences designed by this method is provided.

1 Introduction

There are many applications that require high-accuracy continuous speech recognition of spoken sentences. Statistical models used for recognition of the speech signal need to be trained with a large amount of speech data corresponding to sentences containing all phonemes of the language in all valid phonetic contexts. Thus, it is necessary to have databases that comprise appropriate sentences spoken by typical users in a realistic acoustic environment. Speech databases can be divided into two groups: (i) a database of speech normally spoken in a specific task domain; in this case, a small amount of speech is sufficient to achieve acceptable recognition accuracy; (ii) a general-purpose speech database that is not tuned to a particular task domain but consists of general text, and hence can be used for recognition of any sentence in that language. To construct the second kind of speech database, a phonetically rich set of sentences should be extracted from a large text database. Such a speech database was previously developed for Hindi [1]. However, the size and scope of that database are limited, so there is a need for a larger database. The problem with most speech recognition systems is insufficient training data covering the speaking variation (spontaneous speech) caused by speaker variability across a large number of speakers. To overcome this problem, a large-vocabulary speech database is required to achieve robust recognition. The purpose of selecting phonetically rich sentences is to provide good coverage of pairs of phones in the sentences. The current work paves the way for generating a database of Hindi speech that will facilitate acoustic-phonetic studies as well as training and testing of automatic speech recognition systems. It is hoped that the availability of this speech corpus will also stimulate basic research in Hindi acoustic-phonetics and phonology.

The rest of the paper is organized as follows. An outline of the design of phonetically rich sentences is given in Section 2. The grapheme-to-phoneme conversion and the process of designing phonetically rich sentences are described in Section 3. An analysis of the phonetic richness of the sentences in the text corpus is given in Section 4. Some conclusions are drawn in Section 5.

2 Creation of sentence corpus

Speech recognition is a special case of supervised pattern recognition, so the models used for speech recognition have to be trained with speech data that are tagged with their phonetic identity. For speaker-independent speech recognition, speech data need to be collected from a large number of speakers. It is not practical to ask speakers to read a large number of sentences that contain all phonemes of the language in various phonetic contexts. Hence, it is desirable to construct sets of sentences that are phonetically rich. This construction is a laborious task and is best automated. For such an automatic or semi-automatic process of designing sentences, an adequate corpus of text in the target language is a primary requirement. This section describes the methodology adopted to generate such a text corpus for the Hindi language.

2.1 An outline of the design process

Phonetically rich sentences can be selected from a large set of text. Traditional sources of text data are books, magazines and periodicals. However, we need textual data in electronic form so that it can be processed by a computer. So we have two choices: (a) manually type in the printed data from articles, periodicals, magazines etc., and store it in electronic form; (b) use available online sources of text data. Examples of such sources are articles on the web and online newspapers. There are several online newspapers that provide content in the Hindi language using the Devanagari script. Examples of such electronic newspapers and magazines are webdunia, prabhasakshi, samachar, BBC, haribhoomi, amarujala, hindimilap, navbharat times etc. Each online source uses its own grapheme encoding scheme to display Hindi text using Devanagari fonts. To collect a large amount of text data we used the Hindi news archives available at [2], and a few Hindi articles on the internet. We collected 3 years of textual data from the newspaper to make a big text corpus.

The online textual data available in the archive uses a Devanagari font named "sudipto". In order to compute the phonetic richness of sentences and select sentences, the Hindi text in the sudipto font (and the corresponding phoneme sequence) has to be represented using Roman symbols. However, a grapheme-to-phoneme conversion program for the sudipto font was not available. Also, information about the coding scheme was not readily available. Hence a grapheme-to-phoneme (G2P) program was written that takes care of most conventions of the sudipto font. After G2P conversion, short sentences (with 4-9 words) were filtered for possible inclusion in the text corpus. Sentence selection software was used to pick those short sentences from the big corpus that satisfy the design constraints. The "devnag" software [4] was used to generate a PostScript file that displays the selected Hindi sentences in Devanagari script. These sentences were then manually validated and edited to ensure grammatical and spelling correctness.

3 Grapheme to phoneme conversion

A grapheme is the smallest unit of written language. The set of graphemes consists of all of the letters and letter combinations that represent phonemes of the language. Hindi uses the character-based Devanagari script; a Devanagari character represents either a standalone vowel or a vowel in combination with one or more consonants. The sudipto font employs the UTF-8 encoding scheme. G2P conversion involves generating Roman symbols representing Hindi phonemes from the UTF-8 code.

A grapheme in UTF-8 is represented as a sequence of 1 to 6 bytes. In the current case, all ASCII characters are represented as a single byte. A Devanagari grapheme is represented by a code that is either 2 or 3 bytes long. All vowel modifiers (matra, मात्रा) and most pure consonants are encoded using 2 bytes. The two-byte codes cover most of the Hindi graphemes.

Three-byte sequences represent the standalone vowels [अ, आ, ए, ऐ, उ, ऊ], some consonants [छ, ज, क्ष, च, ट, ड] and the "danda viram" (full stop) [।].

Unlike the Roman script, the Devanagari script is not linear. Ligatures represent consonant clusters. Moreover, the script is not causal: the order in which graphemes are displayed does not strictly represent the order of the phonemes of the language. Thus, special care has to be taken while writing the G2P program. Such special features of the Devanagari script are illustrated below.

Consider the process of grapheme-to-phoneme conversion of the word nirdoSa (निर्दोष, meaning: innocent). In the case of the sudipto font, the input byte sequence of the word corresponds to the following decimal representation.

195 151 194 174 195 150 194 164 195 188 195 150 195 172 195 130 195 150

Table 1 illustrates the grapheme-to-phoneme conversion according to the byte sequence generated for the word nirdoSa (निर्दोष).

Table 1: G2P conversion of the word निर्दोष (nirdoSa)

Byte sequence | Symbol | Grapheme | Description
195 151 | i | ि | Small "i" spelling matra (मात्रा)
194 174 | n | न् | Consonant half "n"
195 150 | a | ा | Vertical bar (vowel /a/), completion of character "na"
194 164 | d | द | Consonant /d/
195 188 | a | | Schwa for completion of character "da"
195 150 | a | ा | Vertical bar (part of grapheme cluster to represent /r/ and /o/)
195 172 | ro | ो with rapha | Rapha with spelling matra (मात्रा) "o"
195 130 | S | ष् | Consonant half "sh"
195 150 | a | ा | Vertical bar for completion of character "sha"

According to this table, a simple-minded conversion of the byte sequence to a symbol sequence would generate 'i n a d a a r o S a', because this corresponds to the sequence of graphemes as normally written. When writing nirdoSa (निर्दोष) in Hindi, the 'i' matra (ि) is written first because it precedes the consonant /n/ in the script although, in the spoken language, the vowel follows the consonant. Then come the consonant 'n' (न्) and 'a' (ा), which together construct the character न. Then come 'da' and 'a' (द), followed by the matra 'ro' with rapha 'r' (ो with rapha), and the word is completed by 'Sa' (ष). The G2P conversion program has to manage this violation of causality and generate the correct sequence of phonemes of the text.

In the above example, notice that each two-byte sequence represents a single phoneme; the vowel 'a' with byte sequence 195 188 comes after 'd' (194 164) to complete the character /da/. The consonant 'S' (195 130) is likewise followed by 'a' (195 150) to complete 'Sa'. In this way Devanagari characters are constructed as combinations of graphemes.

There are words where a grapheme is represented by a 3-byte sequence. For example, when a word begins with a vowel (a "standalone" vowel), the vowel is represented by a special grapheme. These graphemes are encoded by 3-byte codes.

3.1 Phoneme with multiple graphemes

In the Hindi language, some phonemes are associated with multiple graphemes. Such a phoneme is represented differently in the written language, but its pronunciation is the same. The different graphemes are used in distinct contexts, such as in combination with anuswar, consonants or vowels. A notable example of such phonemes is /r/; there are many graphemes representing /r/ in different phonemic contexts. Each such grapheme is allocated a unique 2-byte UTF-8 code. For example, the phoneme /r/ in each of the following words is associated with a different grapheme: कर [k a r (tax)], नेत्र [n e t r a (eye)], राष्ट्र [r aa S t r a (nation)], रुमाल [r u m aa l (handkerchief)], संपर्क [s a m p a r k (contact)], पार्किंग [p aa r k I ng (parking)], पर्दे [p a r d e (curtain)], दृष्टी [d r I s t I (eye-sight)], क्रम [k r a m (sequence)], ऋण [R N (loan)]. The phoneme sequence as well as the English word corresponding to each word is shown inside the square brackets after the script.

3.2 Criteria for selection of sentences

We adopted certain criteria to select the sentences for the dictionary creation. Only short sentences (with a minimum of 4 and a maximum of 10 words) are picked from the large text corpus. Each sentence is manually inspected to check that it does not sound artificial, that it is meaningful, and that it does not contain any offensive or sensitive words. Only those sentences which fulfill all the above criteria are retained.

3.3 Phonetic Richness of Sentences

Phonetically rich sentences are needed for robust estimation of the parameters of statistical models of context-sensitive phonemes. A phoneme is context sensitive if it is associated with several models depending on the identity of the preceding and/or following phoneme. A triphone is characterized by both left and right context. If there are M phonemes in a language, there can be up to M^3 triphones, although a language may not permit all M^3 of them.

A set of sentences is considered to be phonetically rich if it contains all permissible triphones of the language in sufficient quantity. While some triphones occur abundantly in natural sentences, quite a few triphones are rare. Thus one has to design or select sentences which are rich in such rare triphones.

For developing a speaker-independent recognition system, it is necessary to collect speech data from a large number of speakers. However, speakers are generally reluctant to read a large set of sentences. Hence the primary goal of this work is to generate small sets of sentences, each of which can be conveniently read by one person. It is not possible to cover all triphones in one small sentence set. Hence, in this work, we consider a sentence set to be phonetically rich if it contains most, if not all, phonemes of the language. Thus, special effort has to be taken to include as many words with rare phonemes as possible.

3.4 Transliteration convention

The collected set of sentences is first converted to the Devanagari convention. We converted the Roman symbols representing Hindi phonemes into the corresponding Devanagari script according to the Velthuis transliteration scheme [4]. This helped us to visualize the text in Devanagari script and validate it. In addition, speakers find it convenient to read Hindi sentences in Devanagari script.

3.5 Selection of sentences

In order to create a large phonetically rich speech database, collection of a large text corpus is a prerequisite. Here, we took sentences from online news and articles from the web. This gave us around 350,000 sentences, each containing between 4 and 10 words. A sentence should contain a minimum of 4 words so that it forms a standalone phrase or sentence. Similarly, the maximum number of words is restricted to 10 because long sentences tend to be complex, and it is difficult for a reader to read such sentences fluently and naturally.

3.5.1 Sentence selection tool

After collecting a large number of sentences, there is a need to select sentences that are phonetically rich in the sense that all the phonemes must be present at least once in each sentence set. In order to achieve this, we used the corpus selection program "corpusCrt" [3]. Given a text corpus, this software tool generates sets of phonetically rich sentences that satisfy (as far as possible) the phonetic context criteria specified by the user. In our case, we collected 50,000 phonetically rich sentences (5000 sets of sentences; each set contains 10 sentences). In other words, on average, one out of seven sentences is selected by the program. Each set of 10 sentences contains almost all the phonemes of the Hindi language. After applying the corpusCrt tool to a corpus of sentences, statistics of the selected sentences are gathered. These are the number of distinct units (phones) in each sentence set, the frequency of occurrence of each phone, the frequently occurring as well as the rarely occurring phones, and the total number of units (phones) in a set.

4 Corpus Analysis

From a set of 350,000 sentences, the corpusCrt sentence selection tool selected 5000 sets of phonetically rich sentences; each set contains 10 sentences. So we have a collection of 50,000 phonetically rich sentences covering the 41 phonemes of the Hindi language. The collected sentences are said to be phonetically rich because most (preferably all) of the 41 phonemes are present in every sentence set.

4.1 Phonetic richness of sentence sets

Statistics are gathered about the number of distinct phonemes in each sentence set. There are 41 distinct phonemes in the present analysis. It may be noted that there are 10 aspirated plosives in Hindi (5 voiced and 5 unvoiced). The voiced aspirated plosives occur rarely. Figure 1 shows the phonetic richness of the sentence sets in the form of a histogram. It shows the frequency count of sentence sets (out of 5000) that contain different numbers of distinct phonemes. The number of distinct phonemes (in a sentence set) is shown on the x-axis, and the percentage of sentence sets containing that many distinct phonemes is on the y-axis of the histogram. It is found that about 4% of the sentence sets contain all the 41 phonemes. Only one phoneme is missing in 16% of the sets. The percentages of sentence sets containing 38 and 39 distinct phonemes are 25% and 27% respectively. The histogram shows that almost 72% of the sets contain at least 37 distinct phonemes. Thus the sentence sets derived in this work are phonetically rich. It may be noted that every sentence set contains a minimum of 32 distinct phonemes.

4.2 Distribution of rare phonemes

After gathering the statistics of distinct phonemes in the sentence sets, we identified rare phonemes, whose frequency of occurrence in the sentence sets is small. Figure 2 shows the percentage of sentence sets in which rare phonemes occurred. The x-axis represents the rare phonemes; the y-axis represents the percentage of sentence sets. Only those phonemes which occur in less than 90% of the sentence sets are designated as rare phonemes. Figure 2 shows that the voiced, aspirated, retroflex plosive /Dh/ occurred in only 40% of the sentence sets; the aspirated, unvoiced, dental plosive /th/ occurred in about 80% of the sentence sets. So these two phonemes are considered the rarest phonemes in the collected sentence sets. The phonemes /S/, /N/, /O/, /kh/, /dh/, /ph/, /U/, /D/ are also considered rare phonemes, as they are present in less than 90% of the sentence sets. Analysis of the sentence sets with this criterion yields 10 rare phonemes out of the 41 phonemes of the Hindi language.

5 Conclusion

Training of statistical models for automatic speech recognition requires a large amount of speech data that is rich in phonetic context. It is possible to design small sets of sentences that are not only phonetically rich but also convenient for speakers to read. In this work, a nearly automatic method has been employed to design 50,000 such sentences, starting from online texts that are coded for display in Devanagari script. The methodology is scalable to the generation of a larger corpus. Collection of speech data using these sentences would lead to the development of better-performing Hindi speech recognition systems.

References

[1] Samudravijaya K, P.V.S. Rao and S.S. Agrawal, "Hindi Speech Database", Proc. Int. Conf. Spoken Language Processing, ICSLP00, Beijing, October 2000, CDROM:00192.pdf (http://speech.tifr.res.in/publications.html#sd).
[2] http://navbharattimes.indiatimes.com/Archives
[3] http://gps.tsc.upc.es/veu/personal/sesma/sesma/CorpusCrt.php3
[4] ftp://ftp.tex.ac.uk/tex.archive/language/devnagari/velthuis
[5] UTF-8, http://www.cl.cam.ac.uk/~mgk25/unicode.html
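The statistics reported in Section 4 (the number of distinct phonemes per sentence set, and the phonemes present in fewer than 90% of the sets) can be illustrated with a minimal sketch. This is not the corpusCrt tool or the authors' analysis code. The function names, the representation of a sentence as a list of Roman phoneme symbols, and the four-symbol toy inventory are all assumptions made for illustration only.

```python
from collections import Counter


def distinct_phonemes(sentence_set):
    """Return the distinct phoneme symbols found in one sentence set.

    Assumption: a sentence set is a list of sentences, and each sentence
    is a list of Roman phoneme symbols (for example, the output of G2P).
    """
    return {ph for sentence in sentence_set for ph in sentence}


def richness_histogram(sentence_sets):
    """Map 'number of distinct phonemes' to the percentage of sentence sets."""
    counts = Counter(len(distinct_phonemes(s)) for s in sentence_sets)
    total = len(sentence_sets)
    return {n: 100.0 * counts[n] / total for n in sorted(counts)}


def rare_phonemes(sentence_sets, inventory, threshold=90.0):
    """Return phonemes present in fewer than `threshold` percent of the sets."""
    total = len(sentence_sets)
    presence = Counter()
    for s in sentence_sets:
        # Each phoneme is counted at most once per sentence set.
        presence.update(distinct_phonemes(s))
    coverage = {ph: 100.0 * presence[ph] / total for ph in inventory}
    return {ph: pct for ph, pct in coverage.items() if pct < threshold}


# Hypothetical toy data: a 4-phoneme inventory and two tiny sentence sets.
TOY_INVENTORY = ["k", "a", "r", "Dh"]
TOY_SETS = [
    [["k", "a", "r"], ["a", "r"]],
    [["k", "a"], ["Dh", "a", "r"]],
]
```

With the real corpus, `inventory` would hold the 41 Hindi phoneme symbols and `sentence_sets` the 5000 selected sets; the 90% threshold is the rarity criterion stated in Section 4.2.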