31/3/04 8:15 am
Abstract
Empirical analysis of any natural language needs to be substantiated with
statistical findings, because without adequate statistical knowledge any
linguistic study can fall into the quicksand of mistaken data handling and false
observation. The recent introduction of various sub-disciplines (computational
linguistics, corpus linguistics, forensic linguistics, applied linguistics, lexicology, stylometrics, lexicography, and language teaching, etc.) requires various
statistical results of language properties to understand the language as well as to
design sophisticated tools and software for language technology. Keeping this in
mind, we present here some simple frequency counts of characters found in the
Bangla text corpus. Also, we empirically evaluate their functional behaviours in
the language with close reference to the corpus. Here we verify previously made
observations, as well as make some new observations required for various works
of language technology in Bangla.
1 Introduction
Correspondence:
Niladri Sekhar Dash,
Computer Vision and Pattern
Recognition Unit,
Indian Statistical Institute.
203 B T Road, Kolkata, West Bengal,
India 700 108.
E-mail:
niladrisekhar@hotmail.com
The advent of corpus has rejuvenated the use of frequency statistics in language study because a corpus containing a large collection of empirical
data with numerous variations of use of linguistic properties is a reliable
resource for both quantitative and qualitative analysis. Though each type
of analysis is different from the other, the combination of both can give a
new dimension to the whole process of language study in general.
Quantitative analysis classifies various linguistic properties by some
predefined parameters, counts their frequency of use, and makes some
observations based on frequency data. Statistical techniques are then
used to generalize the findings to a larger population and to compare
language properties. They help to determine which phenomena genuinely
reflect a language or variety, and which are merely chance occurrences. From these, one can gauge the frequency and rarity of
various language properties as well as their relative normality or abnormality in the language. Qualitative analysis, on the other hand, provides a
Literary and Linguistic Computing, Vol. 19, No. 2 ALLC 2004; all rights reserved
The list of characters considered here contains vowels, vowel allographs, consonants, consonant graphic variants, consonant clusters,
and other symbols that constitute the body of nearly 300 unique
characters in the language.
(ii) Most of the characters are made with a headline over their head
though there are some characters without this. This information is
handy for identification of characters as well as for counting their
frequency in the corpus.
(iii) For uniformity in computing, all the punctuation marks are
separated from words before the characters are put to statistical
counting.
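The counting step implied by these points can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the procedure actually used on the corpus: it assumes each word has already been split into romanized character units, the names `is_punctuation` and `char_frequencies` and the toy sample are hypothetical, and punctuation is identified by Unicode general category P.

```python
from collections import Counter
import unicodedata


def is_punctuation(unit: str) -> bool:
    """True if every code point in the unit is Unicode punctuation (category P*)."""
    return bool(unit) and all(
        unicodedata.category(ch).startswith("P") for ch in unit
    )


def char_frequencies(words):
    """Return {unit: percentage of all counted units}, punctuation excluded.

    `words` is an iterable of words, each already split into character
    units (a hypothetical pre-tokenized input format).
    """
    counts = Counter(
        unit for word in words for unit in word if not is_punctuation(unit)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {unit: 100.0 * n / total for unit, n in counts.items()}


# Hypothetical toy sample, not corpus data.
sample = [["k", "A", "l", ","], ["k", "A", "j"], ["A", "m", "."]]
freqs = char_frequencies(sample)
for unit, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{unit}\t{pct:.2f}")
```

Units such as `s.` or `.n` are kept because they are not made up of punctuation alone; only free-standing marks such as `,` and `.` are dropped before counting.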
The following sections present a quantitative-qualitative analysis on the
global occurrence of characters, vowels and their allographs, words with
particular character at first position, consonant clusters, consonant
graphic variants, clusters in words, etc. as found in the corpus.
Char  Frequency   Char  Frequency   Char  Frequency   Char  Frequency
A     11.965      y     2.143       th    0.805       Th    0.241
e      9.793      d     2.127       bh    0.801       gh    0.181
r      8.633      o     2.027       kh    0.781       ~n    0.140
i      7.745      h     1.494       dh    0.775       ai    0.124
n      5.033      T     1.283       a     0.730       au    0.085
k      4.898      g     1.279       D     0.682       jh    0.081
t      4.312      j     1.244       n.    0.672       t.    0.061
b      3.800      sh    1.215       ~     0.447       h.    0.048
s      2.942      I     1.201       m.    0.311       v     0.042
l      2.866      Y     1.051       U     0.302       Dh    0.026
m      2.826      ch    1.018       r.    0.299       R     0.002
p      2.562      c     0.931       ph    0.263       Rh    0.001
u      2.379      s.    0.844       .n    0.244
<i> [i]. Among vowels, <A> [] comes first, followed by <e> [e], <i> [i],
<u> [u], and <o> [o], while among consonants <r> [r] is the most frequent,
followed by <n> [n], <k> [k], <t> [t], <b> [b], <s> [s], <l> [l],
<m> [m], <p> [p], <y> [j], and <d> [d]. The use of <r> [r] is higher
because of its two graphic variants (raphalA [rphol] and reph [reph]),
while the use of <t> [t] is higher due to the presence of its graphic variant
(khandata [khnoto] 'half-t') in the language.
For the first time this study shows that the use of consonant graphic
variants has a strong impact on the frequency of characters in Bangla. It is
observed that among the first ten most frequently used characters, six are
consonants while the remaining four are vowels, all of which are easier in
articulation than others present in the language. The high frequency of
use of these vowels and consonants denotes their recurrent presence in
words. Our findings differ from those of Bhattacharya (1965), who noted
that among vowels <a> [] (15%) has the highest occurrence, followed by
<A> (11%) and <e> (9%). It is unclear how the occurrence of <a> can
surpass that of <A> in the language, particularly when <a> exists only in
its single vowel form while <A> has both its vowel and allographic forms
with a high degree of frequency in the language. The findings of Chatterji
(1926, pp. 271–2) can be referred to here to identify the frequency of
sounds represented by the characters.
Character type   Bhattacharya (1965)   Present study
Vowel            36 %                  39.63 %
Consonant        64 %                  52.76 %
Cluster          -                      7.61 %
(OCR) system for Bangla script (Pal and Chaudhuri, 1995). It is also
observed that the occurrence of clusters is higher in the older/chaste version
than in the newer/colloquial version of texts, which reflects the gradual replacement of clusters by single consonants in writing.
Allograph  %        Allograph  %
A          34.20    I          3.71
e          29.44    U          1.20
i          18.75    R          0.95
u           6.64    ai         0.34
o           4.50    au         0.27
basic and allographic forms, constitute the major part (37.91%) of the total
occurrence of characters.
From this study it is observed that Bengali people, in both speech
(Chatterji 1926, pp. 271–2) and writing, prefer the central low [],
front high [i], and front mid-high [e] vowels to others. Why they prefer
these vowels remains an open question.
Vowel   Graphic use   Allographic use
A       1.47 %        11.88 %
e       1.42 %        11.43 %
i       1.40 %         5.89 %
o       0.88 %         2.24 %
a       0.62 %         0.00 %
u       0.38 %         2.87 %
I       0.01 %         0.80 %
U       0.01 %         0.11 %
r.      0.01 %         0.09 %
ai      0.01 %         0.04 %
au      0.01 %         0.04 %
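As a small worked illustration of the figures listed above, the share of each vowel's total use that is allographic can be computed directly from the two columns. This is a sketch only: it assumes that the first eleven values are graphic use and the second eleven allographic use (which agrees with <a> having no allograph), the function name `allographic_share` is hypothetical, and because the inputs are rounded percentages the results for the rarest vowels are rough.

```python
# Percentages copied from the graphic/allographic figures above.
graphic = {
    "A": 1.47, "e": 1.42, "i": 1.40, "o": 0.88, "a": 0.62, "u": 0.38,
    "I": 0.01, "U": 0.01, "r.": 0.01, "ai": 0.01, "au": 0.01,
}
allographic = {
    "A": 11.88, "e": 11.43, "i": 5.89, "o": 2.24, "a": 0.00, "u": 2.87,
    "I": 0.80, "U": 0.11, "r.": 0.09, "ai": 0.04, "au": 0.04,
}


def allographic_share(vowel: str) -> float:
    """Percentage of a vowel's combined use that is allographic."""
    total = graphic[vowel] + allographic[vowel]
    return 100.0 * allographic[vowel] / total if total else 0.0


for vowel in graphic:
    print(f"{vowel}\t{allographic_share(vowel):.1f}")
```

On these figures <A> is written as an allograph in roughly nine cases out of ten, while <a> never is.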
Char  Frequency   Char  Frequency   Char  Frequency   Char  Frequency
k     9.81        j     2.44        i     0.76        au    0.03
p     8.68        g     2.29        dh    0.72        U     0.02
b     8.58        sh    2.19        gh    0.66        n.    0.01
s     8.24        r     2.13        T     0.63        .n    0.00
e     5.43        c     1.99        D     0.43        ~n    0.00
A     4.85        ch    1.99        Th    0.23        v     0.00
n     4.64        bh    1.91        ai    0.18        R     0.00
m     4.63        u     1.70        jh    0.13        Rh    0.00
t     4.50        o     1.52        y     0.13        ~     0.00
d     4.47        l     1.23        Dh    0.12        h.    0.00
h     4.45        th    1.21        I     0.05        m.    0.00
a     3.56        kh    0.99        s.    0.05        t.    0.00
Y     3.11        ph    0.89        r.    0.04
Cluster  Frequency   Cluster  Frequency
pr       8.16        ny       1.99
ks.      3.94        sth      1.98
nt       3.51        gr       1.87
.ng      2.50        by       1.73
tr       2.48        j~n      1.58
nd       2.46        cch      1.52
kt       2.33        rth      1.48
st       2.27        nn       1.48
sv       2.06        ddh      1.47
s.T      2.01        shy      1.31
Variant   %-age
raphalA   37.94
yaphalA   26.66
reph      22.70
others    12.70
Clusters in word       %-age
0 cluster              81.57
1 cluster              14.35
2 clusters              2.72
3 clusters              1.10
4 and more clusters     0.26
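A distribution of this kind could be computed along the following lines. This is a hedged sketch rather than the method used for the corpus: it assumes words are already split into romanized units, that "()" marks an inherent vowel as in the examples in the text, that a cluster is a maximal run of two or more consonant units with no vowel between them, and that every non-vowel unit counts as a consonant (a simplification that ignores diacritics). The names `count_clusters` and `cluster_distribution` are hypothetical.

```python
from collections import Counter

# Vowel units in the romanization used here; "()" marks an inherent vowel.
VOWELS = {"a", "A", "i", "I", "u", "U", "r.", "e", "ai", "o", "au", "()"}


def count_clusters(units):
    """Count maximal runs of two or more consecutive consonant units."""
    clusters = 0
    run = 0
    for unit in units:
        if unit in VOWELS:
            if run >= 2:
                clusters += 1
            run = 0
        else:
            run += 1
    if run >= 2:  # a cluster may close the word
        clusters += 1
    return clusters


def cluster_distribution(words):
    """Percentage of words with 0, 1, 2, 3, and 4 or more clusters."""
    buckets = Counter(min(count_clusters(word), 4) for word in words)
    total = sum(buckets.values())
    return {k: (100.0 * buckets[k] / total if total else 0.0) for k in range(5)}


# Hypothetical pre-tokenized words, modelled on examples in the text.
words = [
    ["k", "()", "m", "()", "l", "()"],   # k()m()l(): no cluster
    ["b", "A", "l", "b", "()"],          # bAlb(): one cluster (lb)
    ["u", "d", "b", "e", "g", "()"],     # udbeg(): one cluster (db)
    ["b", "i", "l", "b", "v", "()"],     # bilbv(): one cluster (lbv)
]
dist = cluster_distribution(words)
for k, pct in dist.items():
    print(f"{k}{'+' if k == 4 else ''} clusters\t{pct:.2f}")
```

Capping the count at four mirrors the "4 and more clusters" row; a real implementation would also need a reliable way to segment multi-letter units such as `kh` or `s.`.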
4.1 Vowels
Bangla has eight vowel symbols to represent seven discrete vowel sounds:
(i) the vowel <a> represents /ɔ/ and /o/, (ii) the vowel <A> denotes /a/,
(iii) the vowels <i> and <I> (both short and long) represent /i/, (iv) the
vowel <e> represents /e/ and /æ/, (v) the vowel <o> represents /o/, and
(vi) the vowels <u> and <U> (both short and long) represent /u/.
Among these, some vowels behave differently when used within words.
That means, their contextual occurrence interferes with their functions
noted in isolation. While some vowels are restricted in their positional
use, others are modified in shape when used within words. For instance,
vowels in their original forms are mostly restricted to the word-initial
position, but some vowels (e.g. <i>, <e>, <u>, and <o>) are allowed to
occur at the word-intermediate and final positions, among which <i>
and <o> can mostly occur to perform the role of emphatic particles.
The use of the vowel <A> [] at the word-medial position is a recent
phenomenon, which was unknown to us before this corpus was analysed.
This has been observed in some transliterated words (e.g. jAnuAri
[jnuri] 'January', and oATAr [otr] 'water' etc.) which are borrowed
mostly from English. However, the replacement of <A> by its allograph
(e.g. doAb > doyAb [dob] 'river basin', and beAdab > beyAdab [bedb]
'obstinate' etc.) in some Persian and Arabic words has been a regular
norm for changing surface forms of words. Such replacement occurs
because (a) the vowel <A> in its original form is not usually allowed to
occur at the word-medial position where <yA> can easily occur, and (b)
the similarity in utterance between the two sets triggers a smooth
transition at the orthographic level.
In pronunciation, problems arise from variation in the utterance
of <a> and <e> at the word-initial position. At this position the vowel
<a> is uttered either as /ɔ/ or /o/, while the vowel <e> is uttered either as
/e/ or /æ/. Such duality often creates problems in the pronunciation of some words. One has to identify the contexts in which each of the two
utterances is to be used. Certain context-based conditions
require orthographic, lexical, grammatical, and semantic
information to determine the actual pronunciations of words (Dash,
2001).
4.2 Consonants
The number of consonant graphemes (35) used in Bangla is larger than
the number of sounds (30) they denote. Some consonants are similar in
sound (e.g. <j> and <y> //; <sh> and <s.> //; <n.> and <n> /n/; <t> and
<t.> /t/) though they differ in form, while others differ in both form
and sound. By default all consonants are vocalic in isolation but non-vocalic when used with vowel allographs. However, <h> // and <Rh>
// (except in As.ARh [] 'rainy season') are always vocalic while
<~n> // and <t.> /t/ are always non-vocalic irrespective of their contextual occurrence within words. Some consonants (e.g. <.n> //, <~n>
//, <n.> //, <R> //, <Rh> //, and <y> /e/) do not occur at the
word-initial position because they are difficult to utter at this position.
The consonant <t.> changes into <t> when used with a case marker or
suffix (e.g. mahat. > mahater 'of great').
There are nearly 10,000 words which are formed with the combination of consonants only (e.g. gh()n() [no] 'thick', m()t() [mto]
'like', s()r()b() [srob] 'vocal', and k()m()l() [kmol] 'lotus' etc.).
This has been possible because all non-allographed consonants and
clusters are vocalic with inherent // or /o/. So, it is difficult to determine
the actual utterances of these words since it is not known which consonant or cluster is vocalic in which way. One needs lexical as well as
semantic knowledge of the language to determine actual utterance of
these words. A few general observations, however, can be made for such
non-allographed words:
(i)
Bhattacharya, 2000, pp. 36–8; Pal, 2001, pp. 178–85) provide a partial account
of their overall behaviours in the language. In utterance, the
majority of clusters follow the normal sequence of characters. A few
clusters, however, deviate from the standard norm and show peculiarities
(e.g. deletion, addition, displacement of sound etc.) in utterance due to
contextual interventions, as discussed below:
(i)
(ii)
(iii)
(iv)
(v)
(vi)
(vii)
(viii)
Due to similarity in form and occurrence, the bilabial <b> and labio-velar
<v> create problems in their articulation (e.g. bAlb() [blb] 'bulb' and
bilbv() [billo] 'wood-apple', udbeg() [udbe] 'anxiety' and bidvAn()
[biddn] 'learned' etc.) because while bilabial <b> is articulated distinctly, the labio-velar <v> is entirely silent. Language-specific lexico-semantic information is necessary to determine the actual utterances of
these words as well as to develop systems for text-to-speech conversion
and language teaching.
(i) Both reph and raphalA are used with consonants for cluster
formation. Consonants with reph can occur only at the word-medial and final positions, while consonants with raphalA can occur
at all positions. They cause two utterance variations: (a) consonants
with reph and raphalA at the word-medial and final positions are
mostly doubled in utterance, and (b) non-allographed consonants
with raphalA at the word-initial position are uttered with /o/.
(ii) The variant yaphalA can occur with all characters at any position
within words. At the word-initial position it has three variations: (a)
with consonants attached to the allograph of <A> it is uttered as
//, (b) with non-allographed consonants it is uttered as /e/ if the
consonant is followed by <i>, and (c) in other cases it has no utterance. At the word-medial and final positions all non-allographed
consonants with this variant are doubled in utterance. With the
consonant <h> at the word-medial and final positions it is pronounced as /jjh/ irrespective of any contextual variations.
Among other consonant graphic variants candrabindu and visarga are
always vocalic while anusvAra is non-vocalic. Their functional importance is primarily context bound because if detached from contexts they
lose their independent entities.
5 Conclusion
Development of language has an impact on the evolution of thought
process as well as on the enhancement of the linguistic ability of a speech
community. Script, a form of knowledge representation, uses alphabets
and other relevant signs to encode and decode knowledge as well as to
Acknowledgements
This is a modified version of the paper presented at the National Conference on Language and Linguistics at the Central Institute of Indian Languages,
Mysore, 28–30 January 2002. The Ministry of Information Technology,
Govt. of India deserves thanks for providing the Bangla corpus. The
views and suggestions of the anonymous reviewers are acknowledged with
thanks.
Notes
For easy understanding the orthographic symbols used in the Bangla
script are represented in Roman characters with extra notation (<I> [i:]
and <U> [u:] represent two long vowels; <r.> [ri] represents syllabic r;
<.n> [] stands for velar nasal; <~n> [] represents alveolar nasal; <T>
[!] and <D> [] stand for two retroflex stops; <y> [e] stands for semivowel, <sh> [] symbolizes palatal s; <s.> stands for retroflex s; <m.> []
is a nasal diacritic; <h.> [] represents aspiration; <~> [ ] stands
for a nasal variant; <t.> [t] represents half-t, and <R> [] symbolizes an
alveolar flapped stop). The Bangla words which are furnished as examples
are substantiated with IPA and meaning for proper comprehension.
References
Bhattacharya, N. (1965). Some Statistical Studies of the Bangla Language. Doctoral
Dissertation. Kolkata: Indian Statistical Institute (MS).
Bhattacharya, S. (1992). Bangla Ucchaaran Abhidhan (Bangla Pronunciation
Dictionary). Kolkata: Sahitya Sansad.
Bhattacharya, S. (2000). Bangalir Bhasa (The Language of Bengali). Kolkata:
Ananda Publishers.
Biber, D., Conrad, S., and Reppen, R. (1998). Corpus Linguistics: Investigating
Language Structure and Use. Cambridge: Cambridge University Press.
Botley, S. P., McEnery, A. M., and Wilson, A. (eds) (2000). Multilingual Corpora
in Teaching and Research. Amsterdam, Atlanta, GA: Rodopi.
Chatterji, S. K. (1926/1993). The Origin and Development of the Bengali Language.
Kolkata: Calcutta University Press. (Reprinted by Rupa Publications, Kolkata
in 1993.)
Chatterji, S. K. (1974). Bangala Bhasatattver Bhumika (An Introduction to
Bangla Linguistics). Kolkata: Calcutta University Press.
Dash, N. S. and Chaudhuri, B. B. (2000). The process of designing a multidisciplinary monolingual sample corpus. International Journal of Corpus
Linguistics, 5(2): 179–97.
Dash, N. S. (2001). A Corpus-based Computational Analysis of the Bangla
Language. Doctoral Dissertation. Kolkata, University of Calcutta (MS).
Dewey, G. (1950). Relative Frequency of English Speech Sounds. Cambridge, MA:
Harvard University Press.
Good, I. J. (1957). Distribution of word frequencies. Nature, 179: 595.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. New York: Addison-Wesley Longman Inc.
Kettemann, C. B. and Marko, G. (eds) (2002). Teaching and Learning by Doing
Corpus Analysis. Amsterdam, Atlanta, GA: Rodopi.
Kirk, J. M. (ed.) (2000). Corpora Galore: Analyses and Techniques in Describing
English. Amsterdam, Atlanta, GA: Rodopi.
McEnery, T. and Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh
University Press.
Mair, C. and Hundt, M. (eds) (2000). Corpus Linguistics and Linguistics Theory.
Amsterdam, Atlanta, GA: Rodopi.
Mallik, B. P. and Nara, T. (eds) (1994). Gitanjali: Linguistic Statistical Analysis.
Kolkata: Indian Statistical Institute.
Mallik, B. P. and Nara, T. (eds) (1996). Sabhyatar Sankat: Linguistic Statistical
Analysis. Kolkata: Rabindra Bharati University Press.
Mallick, B. P. (ed.) (2000). Sheslekha: Linguistic Statistical Analysis. Kolkata:
Bangla Academy.
Meyer, C. F. A. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.