Bangla Text Corpus - Frequency and Function

LitLin 19_2 145-159 fqh010 FIN
31/3/04 8:15 am
Page 145
Frequency and Function of Characters Used in the Bangla Text Corpus
Niladri Sekhar Dash
Indian Statistical Institute, India
Abstract
Empirical analysis of any natural language needs to be substantiated with
statistical findings, because without adequate statistical knowledge a
linguistic study can fall into the quicksand of mistaken data handling and false
observation. The recent introduction of various sub-disciplines (computational
linguistics, corpus linguistics, forensic linguistics, applied linguistics,
lexicology, stylometrics, lexicography, language teaching, etc.) requires
statistical results on language properties, both to understand the language and
to design sophisticated tools and software for language technology. Keeping this
in mind, we present here some simple frequency counts of characters found in the
Bangla text corpus. We also empirically evaluate their functional behaviours in
the language with close reference to the corpus. We verify previously made
observations and make some new observations required for various works
of language technology in Bangla.
Correspondence: Niladri Sekhar Dash, Computer Vision and Pattern Recognition
Unit, Indian Statistical Institute, 203 B T Road, Kolkata, West Bengal,
India 700 108. E-mail: niladrisekhar@hotmail.com
1 Introduction
The advent of corpus has rejuvenated the use of frequency statistics in language study because a corpus containing a large collection of empirical
data with numerous variations of use of linguistic properties is a reliable
resource for both quantitative and qualitative analysis. Though each type
of analysis is different from the other, the combination of both can give a
new dimension to the whole process of language study in general.
Quantitative analysis classifies various linguistic properties by some
predefined parameters, counts their frequency of use, and makes some
observations based on frequency data. Complex statistical techniques are
used to generalize a larger population for making comparisons among
language properties. They help to find which phenomenon is a genuine
reflection of a language or variety, and which is merely a chance occurrence. Looking at these, one can get some idea of frequency and rarity of
various language properties as well as their relative normality or abnormality in the language. Qualitative analysis, on the other hand, provides a
Literary and Linguistic Computing, Vol. 19, No. 2 ALLC 2004; all rights reserved
detailed description of the observed phenomena obtained by quantitative
analysis. Here, no attempt is made to assign frequency values to the
linguistic features, since rare phenomena receive the same attention as
the more frequent ones. However, distinctions are drawn among the data
to fit them into finite sets of classes. Without qualitative analysis, the
characteristic features observed within the target population specified by
quantitative counts cannot be measured, since it is qualitative analysis
that enables the evaluation of all relevant aspects of the phenomena
in order to draw universal or particular conclusions from them.
Corpora can contribute to both quantitative and qualitative analysis
of a language because results obtained from corpora by quantitative
analysis become richer with qualitative introspection. We have used this
dual-focus approach to study the Bangla corpus designed with the texts
of different types and genres (Dash and Chaudhuri, 2000). Though it is
not very large in size for various large-scale statistical studies (Oakes,
1998), it is large enough to make some simple observations on Bangla at
descriptive level. We present here results of some character-level frequency counts along with some qualitative analyses. We also direct our
focus to the functional behaviour of characters in order to explore their
role in language comprehension and application, which helps us to
evaluate those observations made by earlier scholars.
2 Examining the Past

Among the languages of the world, English has been used for various
frequency counts, long before the birth of electronic corpus (Williams,
1940; Dewey, 1950; Good, 1957; Miller, Newman, and Friedman, 1958).
The availability of electronic corpus has enabled various quantitative
analyses with many new results (Svartvik, 1992; McEnery and Wilson,
1996; Ooi, 1997; Biber et al., 1998; Oakes, 1998; Kennedy, 1998; Botley
et al., 2000; Kirk, 2000; Mair and Hundt, 2000; Tognini-Bonelli, 2001;
Kettemann and Marko, 2002; Meyer, 2002; Peters et al., 2002) that have
not only enriched our understanding about the language, but have also
enabled us to trace many new findings hitherto unknown to us.
Bangla, a language with the status of the fifth most popular language
in the world, has never been put to any kind of quantitative analysis with
the support of a good corpus. However, decades ago Chatterji (1926)
made a quantitative study on the basis of a Bangla dictionary, and on
a few selected texts of the Old Bengali literature. After a gap of four
decades, Bhattacharya (1965) did some frequency study on a small collection of writings of some famous literary figures of Bengal. The last on
this track are Mallik and Nara (1994, 1996) and Mallick (2000), who made
some quantitative studies on the works of Rabindranath Tagore, the
Nobel Laureate. These object-oriented studies are based on individually
collected small language samples, which cannot be claimed as a corpus
because the samples are skewed, lacking in the features of largeness,
balance, and representativeness. In this regard our study is far more
reliable and authentic as it is based on a text corpus that contains four
million words of texts collected from various genres, disciplines, and
subjects published between 1981 and 1995 (Dash, 2001).
3 Issues Related with Bangla Characters

Before any kind of frequency study is initiated, the following issues are
dealt with to avoid unwanted mistakes in statistical counting, wrong
observation, and subsequent erroneous analysis.
(i) The list of characters considered here contains vowels, vowel allographs,
consonants, consonant graphic variants, consonant clusters, and other symbols
that constitute the body of nearly 300 unique characters in the language.
(ii) Most of the characters are written with a headline over them, though
there are some characters without one. This information is handy for
identifying characters as well as for counting their frequency in the corpus.
(iii) For uniformity in computing, all punctuation marks are separated from
words before the characters are put to statistical counting.
The following sections present a quantitative-qualitative analysis on the
global occurrence of characters, vowels and their allographs, words with
particular character at first position, consonant clusters, consonant
graphic variants, clusters in words, etc. as found in the corpus.
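The preprocessing described in (iii) and the counting that follows it can be sketched as below. This is only an illustration under stated assumptions: the text is taken to be Unicode-encoded, the function name `char_percentages` and the romanized sample string are hypothetical, and nothing here reproduces the tools actually used to process the corpus.

```python
import unicodedata
from collections import Counter


def char_percentages(text):
    """Return each character's share (in %) of all characters that are
    neither whitespace nor punctuation."""
    kept = [
        ch for ch in text
        # Drop whitespace and anything Unicode classifies as punctuation (P*).
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in counts.items()}


# Romanized toy input, used only to show the shape of the result.
shares = char_percentages("kamal, kamal!")
```

With the toy input, ten characters survive the filter, so `shares["a"]` is 40.0 and the punctuation marks do not appear in the result at all.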
3.1 Global character occurrence

A simple frequency study on the global occurrence of characters supplies
much information to evaluate the language from various perspectives.
Here, the percentage of each vowel is obtained by adding the use of its original
shape and that of its allographic variation. Similarly, the percentage of each
consonant is obtained by adding its occurrence in its basic shape and in
its orthographic variants (where applicable). Table 1 shows that
the vowel <A> [] is maximum in use followed by <e> [e], <r> [r], and
<i> [i]. Among vowels, <A> [] comes first followed by <e> [e], <i> [i],
<u> [u], and <o> [o] while among consonants, <r> [r] is maximum in
use followed by <n> [n], <k> [k], <t> [t], <b> [b], <s> [s], <l> [l],
<m> [m], <p> [p], <y> [j], and <d> [d]. Use of <r> [r] is higher
because of its two graphic variants (raphalA [rphol], and reph [reph])
while the use of <t> [t] is higher due to the presence of its graphic variant
(khandata [khnoto], 'half-t') in the language.
Table 1 The global percentage of characters in Bangla corpus
Char  %        Char  %       Char  %       Char  %
A     11.965   y     2.143   th    0.805   Th    0.241
e      9.793   d     2.127   bh    0.801   gh    0.181
r      8.633   o     2.027   kh    0.781   ~n    0.140
i      7.745   h     1.494   dh    0.775   ai    0.124
n      5.033   T     1.283   a     0.730   au    0.085
k      4.898   g     1.279   D     0.682   jh    0.081
t      4.312   j     1.244   n.    0.672   t.    0.061
b      3.800   sh    1.215   ~     0.447   h.    0.048
s      2.942   I     1.201   m.    0.311   v     0.042
l      2.866   Y     1.051   U     0.302   Dh    0.026
m      2.826   ch    1.018   r.    0.299   R     0.002
p      2.562   c     0.931   ph    0.263   Rh    0.001
u      2.379   s.    0.844   .n    0.244
For the first time this study shows that the use of consonant graphic
variants has a strong impact on the frequency of characters in Bangla. It is
observed that among the first ten most frequently used characters, seven are
consonants while the remaining three are vowels, all of which are easier in
articulation than others present in the language. The high frequency of
use of these vowels and consonants denotes their recurrent presence in
words. Our findings differ from those of Bhattacharya (1965), who noted
that among vowels <a> [] (15%) has the highest occurrence followed by
<A> (11%), and <e> (9%). We wonder how the occurrence of <a> can
surpass that of <A> in the language, particularly when <a> exists only in
its single vowel form while <A> has both its vowel and allographic forms,
each with a high degree of frequency in the language. The findings of Chatterji
(1926, pp. 271-2) can be referred to here to identify the frequency of
sounds represented by the characters.
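The way a vowel's total is built from its original shape and its allograph can be sketched as follows. This is a minimal illustration that assumes the text is Unicode-encoded and normalized so that each vowel sign is a single code point; the labels follow the transliteration used in this paper, while the names `INDEPENDENT`, `SIGN`, and `vowel_usage` are hypothetical and do not describe the encoding or software used for the corpus itself.

```python
from collections import Counter

# Independent vowel -> transliteration label. In the Unicode Bengali block the
# matching dependent vowel sign sits 0x38 code points above the independent vowel.
INDEPENDENT = {
    "\u0986": "A", "\u0987": "i", "\u0988": "I", "\u0989": "u", "\u098A": "U",
    "\u098B": "r.", "\u098F": "e", "\u0990": "ai", "\u0993": "o", "\u0994": "au",
}
SIGN = {chr(ord(v) + 0x38): label for v, label in INDEPENDENT.items()}
# <a> is added after SIGN is built because it has no allograph of its own.
INDEPENDENT["\u0985"] = "a"


def vowel_usage(text):
    """Count each vowel in its original ('graphic') and allographic form."""
    graphic, allographic = Counter(), Counter()
    for ch in text:
        if ch in INDEPENDENT:
            graphic[INDEPENDENT[ch]] += 1
        elif ch in SIGN:
            allographic[SIGN[ch]] += 1
    return graphic, allographic


def vowel_totals(text):
    """Total per vowel = original shape + allograph, as described above."""
    graphic, allographic = vowel_usage(text)
    return graphic + allographic
```

Keeping the two counters separate until the last step lets the same pass feed both a combined figure and a per-form breakdown.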
3.2 Basic characters

Use of three basic character types (vowel, consonant, and consonant
cluster) is taken into separate consideration because they constitute a
major part of the total occurrence of characters in the corpus. Probably,
information about their respective occurrence in the corpus will help us
to get an overall pattern of their use in the language. Table 2 presents their
overall percentage of use alongside the figures of Bhattacharya (1965). It
shows that vowels and consonants together account for 92.39% of the total
occurrence, and that the use of consonants is much higher than that of vowels.
Bhattacharya (1965) also observed this predominance of consonants (64%) over
vowels (36%) in his database but did not consider the use of clusters. In
English, too, vowels and diphthongs (38%) are less common than consonants (62%)
(Miller et al., 1958).
From this study it is known for the first time that consonant clusters
are quite frequent in the language. A sample study finds that a regular page of
a book contains nearly 1,000 clusters, which therefore require proper analysis
when designing an Optical Character Recognition (OCR) system for Bangla script
(Pal and Chaudhuri, 1995). It is also observed that the occurrence of clusters
is higher in the older/chaste version than in the newer/colloquial version of
texts, which reflects the gradual replacement of clusters by single consonants
in writing.
Table 2 The percentage of three basic character types
                  Percentage of use
Character type    Bhattacharya (1965)    ISI corpus (2001)
Vowel             36 %                   39.63 %
Consonant         64 %                   52.76 %
Cluster           -                       7.61 %
3.3 Vowel allographs

Percentage of vowel allographs (Table 3) represents an overall pattern for
selecting particular allographs in word formation in Bangla. Among all
allographs, use of the allograph of <A> is highest followed by the allographs
of <e>, <i>, and <o>, consecutively.
In Hindi also, the allograph of <A> is the highest in use followed by the
allographs of other vowels (Tripathi, 1971). This high percentage of <A>
in the language is a clue for examining the normal speech and writing
patterns of Bangla speakers. The occurrence of the allograph of the vowel
<a> cannot be counted, as the vowel has no allographic representation in the
language. Interestingly, this is also true for some other Indian
languages like Hindi, Assamese, and Oriya. Why this is not available
when all other vowels have at least one allograph is a difficult question to
answer. Probably its answer lies in the intricate Sanskrit alphabet system
that has been borrowed into Bangla and many other Indian languages.
The inherent presence of // (orthographically represented by <a>) in all
non-allographed consonants determines its uniqueness in respect to
other vowels and allographs in the language. When allographs of other
vowels are orthographically absent with a consonant in certain contexts
within some words, the default inherent sound of the vowel <a> (either
[] or [o]) confirms its presence in utterance with that particular
non-allographed consonant. Thus, the word k()r() (where each consonant
carries the inherent vowel sound of <a> with it) is pronounced as
[kr], [kr], or [kro] in Bangla depending upon the contexts of use
of the word in a piece of text, as well as on its meaning implied in that
particular context.
3.4 Vowels and allographs

Occurrence of vowels in respect to their allographs also reveals how
language users incline towards using allographs in writing words rather
than original vowels. Table 4 shows that for each vowel (barring
<a>) the percentage of use of the allograph is much higher than that of its
original form. It also shows that the vowel <A>, both in its original and
allographic form, is the highest in use followed by <e>, <i>, and
<o>, respectively. In fact, <A>, <e>, <i>, and <o>, including both
basic and allographic forms, constitute the major part (37.91%) of total
occurrence of characters.
From this study it is observed that Bengali people, in both speech
(Chatterji 1926, pp. 271-2) and writing, prefer using central low [],
front high [i] and front mid-high [e] vowels to others. Why they prefer
these vowels remains an open question.
Table 3 The percentage of vowel allographs
Allograph  %        Allograph  %
A          34.20    I          3.71
e          29.44    U          1.20
i          18.75    R          0.95
u           6.64    ai         0.34
o           4.50    au         0.27
3.5 Word initial characters

The characters with which words begin directly suggest the behavioural
preferences of the users as well as the morphological restrictions exercised
in word formation. It is found (Table 5) that words starting with the
consonant <k> are most frequent, followed by the words starting with <p>,
<b>, <s>, <e>, <A>, <n>, <m>, <t>, <d>, and <h>, consecutively. This implies
that other characters are less preferred in word formation due to
language-specific restrictions at that position. Probably, the placement of
<k> at the word-initial position makes the language users more comfortable in
pronunciation than its placement at other positions in words. The number
of words starting with consonants (81.52%) easily outclasses the number
of words starting with vowels (18.48%), which is probably true in other
languages also. Characters with zero percentage indicate the absence of
words starting with that particular character in the language.
Table 4 The use of vowels and their allographs
Vowel   Graphic use   Allographic use
A       1.47 %        11.88 %
e       1.42 %        11.43 %
i       1.40 %         5.89 %
o       0.88 %         2.24 %
a       0.62 %         0.00 %
u       0.38 %         2.87 %
I       0.01 %         0.80 %
U       0.01 %         0.11 %
r.      0.01 %         0.09 %
ai      0.01 %         0.04 %
au      0.01 %         0.04 %
Table 5 Words with particular character at first position
Char  %       Char  %       Char  %       Char  %
k     9.81    j     2.44    i     0.76    au    0.03
p     8.68    g     2.29    dh    0.72    U     0.02
b     8.58    sh    2.19    gh    0.66    n.    0.01
s     8.24    r     2.13    T     0.63    .n    0.00
e     5.43    c     1.99    D     0.43    ~n    0.00
A     4.85    ch    1.99    Th    0.23    v     0.00
n     4.64    bh    1.91    ai    0.18    R     0.00
m     4.63    u     1.70    jh    0.13    Rh    0.00
t     4.50    o     1.52    y     0.13    ~     0.00
d     4.47    l     1.23    Dh    0.12    h.    0.00
h     4.45    th    1.21    I     0.05    m.    0.00
a     3.56    kh    0.99    s.    0.05    t.    0.00
Y     3.11    ph    0.89    r.    0.04
By simple observation it can be argued that the presence of a large
number of consonants in the language is instrumental in the formation
of more words with consonants than with vowels. Strikingly, most words
start with velar, labial, dental, and sibilant consonants, as these are much
easier to begin with than others. Interestingly, 50% among the first
twenty most frequent characters fall within the class of sonants to substantiate the general assumption that Bangla, like French and Spanish, is
one of those soft and sweet languages in the world.
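A count of word-initial characters of the kind reported in Table 5 can be sketched as follows. This is a minimal illustration: whether the figures in Table 5 are based on word types or word tokens is not stated, so the sketch simply counts whatever list of words it is given, and it assumes one symbol per character. The function name `initial_char_percentages` and the romanized sample words are hypothetical.

```python
from collections import Counter


def initial_char_percentages(words):
    """Percentage of words beginning with each character, most frequent first."""
    firsts = Counter(w[0] for w in words if w)  # skip empty strings
    total = sum(firsts.values())
    if total == 0:
        return {}
    return {ch: 100.0 * n / total for ch, n in firsts.most_common()}


# Romanized toy word list, used only to show the shape of the result.
sample = ["kamal", "kar", "pata", "am"]
initials = initial_char_percentages(sample)
```

For the four sample words, half begin with `k` and a quarter each with `p` and `a`.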
3.6 Consonant clusters

Recurrent use of consonant clusters in the texts exhibits their patterns of
use as well as their functional relevance in the language. Table 6 presents
a list of the first twenty of the most frequently used consonant clusters,
among which <pr>, <ks.>, <tr>, <st>, <sv>, <ny>, <sth>, <gr>, <by>,
<j~n> and <shy> can occur at any position of a word while the remaining can occur only at intermediate and word-final position. Among
these, <pr> is maximum in use because both the consonant and its
graphic variant are quite frequent in use in the language. However,
contrary to the general popular assumption, the occurrence of <pr> at
word-initial position is less than other positions in words.
In some earlier versions of Bangla primers the cluster <ks.> was treated as
a consonant. Probably, its high frequency of use in the language
motivated the authors of these primers to treat it as a
consonant rather than a cluster. The same could have happened for <pr>,
because it registers the highest percentage of use, but unlike <ks.> the
second consonant of the cluster <pr> is extremely likely to be used with
most other consonants. Probably, for this reason <ks.> enjoyed a
kind of privilege which <pr> did not. However, in the recent versions of
the primers, the cluster <ks.> is removed from the list of consonants and
put where it actually belongs.
3.7 Consonant graphic variants

Among the consonant graphic variants the occurrence of raphalA,
yaphalA and reph is much more common in the language (Table 7). Their total
percentage of use (87.3%) contrasts markedly with the percentage of use
(12.7%) of all other graphic variants (e.g. vaphalA, maphalA,
naphalA, etc.) put together.
Table 6 The percentage of use of consonant clusters
Cluster  %       Cluster  %
pr       8.16    ny       1.99
ks.      3.94    sth      1.98
nt       3.51    gr       1.87
.ng      2.50    by       1.73
tr       2.48    j~n      1.58
nd       2.46    cch      1.52
kt       2.33    rth      1.48
st       2.27    nn       1.48
sv       2.06    ddh      1.47
s.T      2.01    shy      1.31
There is a debate over the relevance of these graphic variants in the
language because their presence has made the language script to a great
extent untidy, and language learning an extremely complicated and
troublesome task. Language learners have to learn the patterns of their
formation and use, which tests the patience of learners in the initial
phase. But an analysis of form and function of these variants (sub-section
4.4) shows that these are designed for the purpose of consonant cluster
formation in a very short and efficient way. Because of their recurrent use
in language script, they are simplified in form for easy use in writing.
3.8 Clusters in words

The relative decrease of clusters in words (Table 8) reveals how the
language is passing through the processes of orthographic simplification
in writing over time. In the present corpus the number of words without
a cluster (81.57%) far exceeds the number of words with one (14.35%),
two (2.72%), and three (1.10%) clusters, which is definitely a boon for
language learners.
A sample survey of texts taken from fiction, short stories, essays,
diaries, and mass media reveals the marginal presence of clusters in many
pages. This, however, is not true of texts belonging to social science,
natural science, and others, which possess a large number of clusters in
words. In another pilot study two versions of the same text were examined,
showing that texts written in the chaste or archaic version contain a much
larger number of clusters (nearly double) than texts written in
colloquial style. Words which have more than three clusters are fewer in
number, and are mostly Tatsama words derived from Sanskrit with a
strong flavour of archaism. Only a few words are found where all
word-forming characters are clusters (e.g. br()hm() [bromo] 'Brahma', and
pr()cch()nn() [procchonno] 'disguised', etc.). This study shows that
the use of clusters in words is gradually decreasing; we speculate that it
will decrease further over time through the replacement of
clusters by simple consonants.
Table 7 The percentage of consonant graphic variants
Consonant graphic variants   %-age
raphalA                      37.94
yaphalA                      26.66
reph                         22.70
others                       12.70
Table 8 The relative percentage of clusters in words
No. of clusters in words   %
0 cluster                  81.57
1 cluster                  14.35
2 clusters                  2.72
3 clusters                  1.10
4 and more clusters         0.26
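The grouping of words by number of clusters can be sketched as follows. This is only an illustration and rests on assumptions not taken from the study: the text is Unicode-encoded, a cluster is written as a consonant followed by one or more hasanta-plus-consonant links (so reph and the phalA variants count as clusters as well), and letters encoded with a separate nukta sign are not handled. The names `CLUSTER`, `clusters_in_word`, and `cluster_distribution` are hypothetical.

```python
import re
from collections import Counter

VIRAMA = "\u09CD"  # hasanta, which joins consonants in Unicode Bengali text
# Consonant letters of the Bengali block, plus the precomposed RRA, RHA and YYA.
CONSONANT = "[\u0995-\u09B9\u09DC\u09DD\u09DF]"
# A cluster: one consonant followed by one or more (hasanta + consonant) links.
CLUSTER = re.compile(f"{CONSONANT}(?:{VIRAMA}{CONSONANT})+")


def clusters_in_word(word):
    """Number of consonant clusters in a single word."""
    return len(CLUSTER.findall(word))


def cluster_distribution(words):
    """Percentage of words with 0, 1, 2, 3 and '4 and more' clusters (key 4)."""
    buckets = Counter(min(clusters_in_word(w), 4) for w in words)
    total = sum(buckets.values())
    if total == 0:
        return {}
    return {k: 100.0 * buckets[k] / total for k in range(5)}


# The two sample words are the Unicode spellings of pr()cch()nn() and k()m()l().
pracchanna = "\u09AA\u09CD\u09B0\u099A\u09CD\u099B\u09A8\u09CD\u09A8"
kamal = "\u0995\u09AE\u09B2"
```

Here `clusters_in_word(pracchanna)` finds three clusters and `clusters_in_word(kamal)` finds none, which matches the description of the two words in the text.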
4 Functional Behaviours of Characters

In this section we discuss the functional behaviours of characters at
various positions within words. We also evaluate their roles in changing
form and pronunciation of words where they are used.
4.1 Vowels
Bangla has eight vowel symbols to represent seven discrete vowel sounds:
(i) the vowel <a> represents // and /o/, (ii) the vowel <A> denotes //,
(iii) the vowels <i> and <I> (both short and long) represent /i/, (iv) the
vowel <e> represents /e/ and //, (v) the vowel <o> represents /o/, and
(vi) the vowels <u> and <U> (both short and long) represent /u/.
Among these, some vowels behave differently when used within words.
That is, their contextual occurrence interferes with the functions
noted in isolation. While some vowels are restricted in their positional
use, others are modified in shape when used within words. For instance,
vowels in their original forms are mostly restricted to the word-initial
position, but some vowels (e.g. <i>, <e>, <u>, and <o>) are allowed to
occur at the word-intermediate and final positions, among which <i>
and <o> mostly occur in the role of emphatic particles.
The use of the vowel <A> [] at the word-medial position is a recent
phenomenon, which was unknown to us before this corpus was analysed.
This has been observed in some transliterated words (e.g. jAnuAri
[jnuri] January, and oATAr [otr] water etc.) which are borrowed
mostly from English. However, the replacement of <A> by its allograph
(e.g. doAb > doyAb [dob] river basin, and beAdab > beyAdab [bedb]
obstinate etc.) in some Persian and Arabic words has been a regular
norm for changing surface forms of words. Such replacement occurs
because (a) the vowel <A> in its original form is not usually allowed to
occur at the word-medial position where <yA> can easily occur, and (b)
the similarity in utterance between the two sets triggers a smooth
transition at the orthographic level.
In case of pronunciation, problems arise due to variation of utterance
of <a>, and <e> at the word-initial position. At this position the vowel
<a> is uttered either as // or /o/, while the vowel <e> is uttered either as
/e/ or //. Such duality in utterance often creates problems in pronunciation of some words. One has to identify the contexts where one of the two
utterances is to be used. There are certain context-based conditions
which require enough orthographic, lexical, grammatical, and semantic
information to determine the actual pronunciations of words (Dash,
2001).
4.2 Consonants
The number of consonant (35) graphemes used in Bangla is larger than
the number of sounds (30) they denote. Some consonants are similar in
sound (e.g. <j> and <y> //; <sh> and <s.> //; <n.> and <n> /n/; <t> and
<t.> /t/) though they differ in form, while others are different both in form
and sound. By default all consonants are vocalic in isolation but
non-vocalic when used with vowel allographs. However, <h> // and <Rh>
// (except in As.ARh [] 'rainy season') are always vocalic while
<~n> // and <t.> /t/ are always non-vocalic irrespective of their
contextual occurrence within words. Some consonants (e.g. <.n> //, <~n>
//, <n.> //, <R> //, <Rh> //, and <y> /e/) do not occur at the
word-initial position because they are difficult to utter at this position.
The consonant <t.> changes into <t> when used with a case marker or
suffix (e.g. mahat. > mahater 'of great').
There are nearly 10,000 words which are formed with the combination of consonants only (e.g. gh()n() [no] thick, m()t() [mto]
like, s()r()b() [srob] vocal, and k()m()l() [kmol] lotus etc.).
This has been possible because all non-allographed consonants and
clusters are vocalic with inherent // or /o/. So, it is difficult to determine
the actual utterances of these words since it is not known which consonant or cluster is vocalic in which way. One needs lexical as well as
semantic knowledge of the language to determine actual utterance of
these words. A few general observations, however, can be made for such
non-allographed words:
(i) At the word-initial position non-allographed consonants are vocalic
with // or /o/;
(ii) At the word-final position non-allographed consonants are mostly
non-vocalic;
(iii) Normal vocality pattern is: /v~n~v~n/ (v denotes vocality and n
denotes non-vocality), and
(iv) Regular vocalisation is: / o o o/ or vice versa.
4.3 Consonant clusters

Existence of consonant clusters in Bangla can be traced back to the Brahmi
script; the mother of most Indian scripts (Chatterji, 1974, p. 56). It has a
large set of clusters formed by joining two or more consonants, which
however does not happen for the vowels. The reason behind the formation
of eye clusters (orthographic cluster) can be traced to ear clusters
(phonetic cluster), which occur quite frequently in Bangla speech (Sarkar,
1993, p. 25). In regular speech sequence two or more consonant sounds are
joined to produce ear clusters, which has made eye clusters an integrated
element of the Bangla script. This observation carries utmost importance
in language learning, spell-checker designing, optical character recognition
system development, and text-to-speech conversion, etc. It also throws
some light on the linguistic behaviour of the Bangla speakers.
Studies on the utterance pattern of clusters by scholars (Bhattacharya,
1992, pp. 1113; Sarkar, 1993, pp. 2345; Ray, 1997, pp. 1416;
Bhattacharya, 2000, pp. 368; Pal, 2001, pp. 17885) provide a partial account
of their overall behaviours in the language. In case of utterance, the
majority of clusters follow the normal sequence of characters. A few
clusters, however, deviate from the standard norm to show peculiarities
(e.g. deletion, addition, displacement of sound etc.) in utterance due to
contextual interventions, as discussed below:
(i) At the word-final position all non-allographed clusters are vocalic
with /o/ except in some borrowed foreign words where one can
identify the impact of foreign pronunciation (e.g. English, Persian,
Arabic, and French etc.) on Bangla speech.
(ii) In the case of the cluster <ks> the normal sequence of consonants
(C1C2) is sometimes transpositioned (C2C1) in utterance that
causes /k to be pronounced as /k/ due to anaptyxis, a common
practice in Bangla speech. In language teaching and text-to-speech
conversion this peculiarity needs to be treated with due importance.
(iii) In the case of the cluster <ks.> the second character <s.> loses its
utterance to change the utterance of the first character <k> from
/k/ to /kh/. At the word-middle and final positions the first
character is again geminated to produce /kkh/ due to phonetic
assimilation, a regular phenomenon in standard Bangla speech.
(iv) The cluster <j~n> causes two utterance variations: // at the
word-initial position, and // at other positions. In both cases C1 loses
its utterance while C2 is nasalized.
(v) In the case of the clusters <~nc>, <~nch>, and <~nj>, C1 (a velar nasal
consonant) is mostly pronounced as /n/, but in their reverse
sequence (<c~n>, <j~n>) it loses utterance to nasalize its preceding
character.
(vi) In the case of the clusters <shm> and <sm> at the word-initial
position, C2 loses its utterance to nasalize preceding characters (a few
exceptions are found where C2 retains its utterance). In the clusters
<tm>, <dm>, <sm>, and <shm> at word-medial and final positions, C2 loses
its utterance to nasalize and double the utterance of
the preceding character. In the cluster <kshm>, C2 is entirely lost in
utterance.
(vii) In the clusters <hn.>, <hn>, <hm>, and <hl>, the actual orthographic
sequence (C1C2) of characters is reversed in utterance
(C2C1).
(viii) The labio-velar <v> as a cluster-final member can modify the
utterance of a cluster in three ways: (a) at the word-initial position
it loses its utterance entirely, (b) at the word-medial and final
positions it loses its utterance to double the utterance of its preceding
character, and (c) in the cluster of <hv> at the word-middle
and final positions, it produces /b/ (Sarkar 1993, p. 43).
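Two of the observations listed above, (iii) and (viii a, b), are regular enough to be written as position-dependent rules of the kind a text-to-speech front end might use. The sketch below is purely illustrative: the function name `cluster_utterance`, the tuple representation of clusters, and the romanized output are assumptions made here, the doubling is naive string repetition, and the remaining observations (including <hv> in (viii c)) are deliberately left uncovered.

```python
def cluster_utterance(cluster, position):
    """Approximate utterance of a cluster under observations (iii) and (viii a, b).

    cluster  -- tuple of romanized consonants, e.g. ("k", "s.") or ("d", "v")
    position -- "initial", "medial" or "final" (place of the cluster in the word)
    Returns a romanized string, or None when these rules do not cover the case.
    """
    if position not in ("initial", "medial", "final"):
        raise ValueError("position must be 'initial', 'medial' or 'final'")

    # (iii): <ks.> is uttered /kh/ word-initially and geminated /kkh/ elsewhere.
    if cluster == ("k", "s."):
        return "kh" if position == "initial" else "kkh"

    # (viii a, b): a cluster-final <v> is silent word-initially and doubles
    # the preceding consonant elsewhere; <hv> is excluded on purpose.
    if len(cluster) == 2 and cluster[1] == "v" and cluster[0] != "h":
        first = cluster[0]
        return first if position == "initial" else first + first

    return None  # not covered by this sketch
```

For instance, `cluster_utterance(("d", "v"), "medial")` gives `"dd"`, in line with the doubled consonant in bidvAn() cited in this section, while any cluster outside the two rules returns `None` so that a caller can fall back on lexical information.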
Due to similarity in form and occurrence, the bilabial <b> and labio-velar
<v> create problems in their articulation (e.g. bAlb() [blb] 'bulb' and
bilbv() [billo] 'wood-apple', udbeg() [udbe] 'anxiety' and bidvAn()
[biddn] 'learned', etc.) because while the bilabial <b> is articulated
distinctly, the labio-velar <v> is entirely silent. Language-specific
lexico-semantic information is necessary to determine the actual utterances of
these words as well as to develop systems for text-to-speech conversion
and language teaching.
4.4 Consonant graphic variants

The list of graphic variants includes raphalA and reph, two graphic
variants of <r>; yaphalA [phol], a variant of <Y>; maphalA
[mphol], a variant of <m>; anusvAr(a) [onur], candrabindu
[cndrobindu], visarga [biro], and similar orthographic
symbols used in Bangla.
These consonant graphic variants take part only in cluster formation.
Among the two variants of <r>, reph occurs at the upper tier of a
consonant without causing any structural change while raphalA occurs at the
lower tier causing change in the original shape of some consonants.
The functional behaviours of these characters are always context
bound, and are dictated by the presence or absence of other characters in
the contexts, as discussed below:
(i) Both reph and raphalA are used with consonants for cluster
formation. Consonants with reph can occur only at the word-medial
and final positions, while consonants with raphalA can occur
at all positions. They cause two utterance variations: (a) consonants
with reph and raphalA at the word-middle and final positions are
mostly doubled in utterance, and (b) non-allographed consonants
with raphalA at the word-initial position are uttered with /o/.
(ii) The variant yaphalA can occur with all characters at any position
within words. At the word-initial position it has three variations: (a)
with consonants attached to the allograph of <A> it is uttered as
//, (b) with non-allographed consonants it is uttered as /e/ if the
consonant is followed by <i>, and (c) in other cases it has no utterance.
At the word-medial and final positions all non-allographed
consonants with this variant are doubled in utterance. With the
consonant <h> at the word-medial and final positions it is pronounced
as /jjh/ irrespective of any contextual variations.
Among other consonant graphic variants candrabindu and visarga are
always vocalic while anusvAra is non-vocalic. Their functional importance
is primarily context bound because, if detached from contexts, they
lose their independent entities.
5 Conclusion
Development of language has an impact on the evolution of thought
process as well as on the enhancement of the linguistic ability of a speech
community. Script, a form of knowledge representation, uses alphabets
and other relevant signs to encode and decode knowledge as well as to
convert auditory sounds into visual symbols. So the study of script is
useful not only for understanding the linguistic behaviour of the people of a
language community but also for exploring their linguistic-cognitive
interface that empowers them to express events and concepts, knowledge
and information, and ideas and imagination through an accepted set of
linguistic symbols and characters intelligible to them.
Frequency count is the most straightforward way to work with
language data. Here items are classified according to a particular scheme,
and an arithmetical count is made on the number of items within texts
which belong to each class in the scheme. It is definitely useful, but it has
certain limitations when one data set is compared with another. It only
gives the number of occurrences of each type, but does not indicate the
prevalence of a type as a proportion of the total number of tokens
in the texts. This is not a problem when the corpora being compared are of the
same size. But when they differ in size, raw frequency counts must be
interpreted with caution. Even where disparity of size is not an issue, it
is better to use proportional statistics to present frequencies, since proportions are
easier to interpret than raw counts out of unusual totals.
However, proportional data may not be appropriate for various types
of significance test.
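The contrast between raw counts and proportional figures can be illustrated with a short sketch. It is not the procedure used for the Bangla corpus: the function name `char_frequencies` and the two toy strings are assumptions made for illustration, and the code counts Unicode code points, which need not coincide with the characters, allographs, and clusters distinguished in this study.

```python
from collections import Counter


def char_frequencies(text):
    """Map each non-space character to (raw count, percentage of total)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: (n, 100.0 * n / total) for ch, n in counts.items()}


# Two toy "corpora" of very different size (illustrative data only).
small = "aab"
large = "aab" * 50 + "b" * 50

for name, corpus in (("small", small), ("large", large)):
    count, percent = char_frequencies(corpus)["a"]
    print("%s: count=%d, share=%.1f%%" % (name, count, percent))
```

In this toy case the raw count of "a" is far higher in the larger sample, while its share of all characters is lower, which is the kind of distortion that proportional figures are meant to expose. The raw counts are still returned next to the percentages, because significance tests usually operate on counts.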
The simple frequency count of characters and the qualitative analysis
of their functional behaviour presented here are useful for designing tools
and systems for optical character recognition, natural language processing, cryptography, spell-checking, machine-readable dictionaries, and
machine translation. These are also valuable in language education
where learners can be properly informed about how characters occur and
behave in the written texts of the language.
Acknowledgements
This is a modified version of the paper presented in the National Conference on Language and Linguistics at Central Institute of Indian Languages,
Mysore, 28–30 January 2002. The Ministry of Information Technology,
Govt. of India deserves thanks for providing the Bangla corpus. The
views and suggestions of the anonymous reviewers are acknowledged with
thanks.
Notes
For easy understanding the orthographic symbols used in the Bangla
script are represented in Roman characters with extra notation ( [i:]
and [u:] represent two long vowels; <r.> [ri] represents syllabic r;
<.n> [] stands for velar nasal; <~n> [] represents alveolar nasal; <T>
[!] and <D> [] stand for two retroflex stops; <y> [e] stands for semivowel, <sh> [] symbolizes palatal s; <s.> stands for retroflex s; <m.> []
is a nasal diacritic; <h.> [] represents aspiration; <~> [ ] stands
for a nasal variant; <t.> [t] represents half-t, and <R> [] symbolizes an
alveolar flapped stop). The Bangla words which are furnished as examples
are substantiated with IPA and meaning for proper comprehension.
References
Bhattacharya, N. (1965). Some Statistical Studies of the Bangla Language. Doctoral
Dissertation. Kolkata: Indian Statistical Institute (MS).
Bhattacharya, S. (1992). Bangla Ucchaaran Abhidhan (Bangla Pronunciation
Dictionary). Kolkata: Sahitya Sansad.
Bhattacharya, S. (2000). Bangalir Bhasa (The Language of Bengali). Kolkata:
Ananda Publishers.
Biber, D., Conrad, S., and Reppen, R. (1998). Corpus Linguistics: Investigating
Language Structure and Use. Cambridge: Cambridge University Press.
Botley, S. P., McEnery, A. M., and Wilson, A. (eds) (2000). Multilingual Corpora
in Teaching and Research. Amsterdam, Atlanta, GA: Rodopi.
Chatterji, S. K. (1926/1993). The Origin and Development of the Bengali Language.
Kolkata: Calcutta University Press. (Reprinted by Rupa Publications, Kolkata
in 1993.)
Chatterji, S. K. (1974). Bangala Bhasatattver Bhumika (An Introduction to
Bangla Linguistics). Kolkata: Calcutta University Press.
Dash, N. S. and Chaudhuri, B. B. (2000). The process of designing a multidisciplinary monolingual sample corpus. International Journal of Corpus
Linguistics, 5(2): 179–97.
Dash, N. S. (2001). A Corpus-based Computational Analysis of the Bangla
Language. Doctoral Dissertation. Kolkata, University of Calcutta (MS).
Dewey, G. (1950). Relative Frequency of English Speech Sounds. Cambridge, MA:
Harvard University Press.
Good, I. J. (1957). Distribution of word frequencies. Nature, 179: 595.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. New York: Addison-Wesley Longman Inc.
Kettemann, C. B. and Marko, G. (eds) (2002). Teaching and Learning by Doing
Corpus Analysis. Amsterdam, Atlanta, GA: Rodopi.
Kirk, J. M. (ed.) (2000). Corpora Galore: Analyses and Techniques in Describing
English. Amsterdam, Atlanta, GA: Rodopi.
McEnery, T. and Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh
University Press.
Mair, C. and Hundt, M. (eds) (2000). Corpus Linguistics and Linguistic Theory.
Amsterdam, Atlanta, GA: Rodopi.
Mallik, B. P. and Nara, T. (eds) (1994). Gitanjali: Linguistic Statistical Analysis.
Kolkata: Indian Statistical Institute.
Mallik, B. P. and Nara, T. (eds) (1996). Sabhyatar Sankat: Linguistic Statistical
Analysis. Kolkata: Rabindra Bharati University Press.
Mallick, B. P. (ed.) (2000). Sheslekha: Linguistic Statistical Analysis. Kolkata:
Bangla Academy.
Meyer, C. F. A. (2002). English Corpus Linguistics. Cambridge: Cambridge University Press.
Miller, G. A., Newman, E. B., and Friedman, E. A. (1958). Length-frequency
statistics for written English. Information and Control, 1: 370–89.
Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh
University Press.
Ooi, V. B. Y. (1997). Computer Corpus Lexicography. Edinburgh: Edinburgh University Press.
Pal, U. and Chaudhuri, B. B. (1995). Computer recognition of printed Bangla
script. International Journal of Systems Science, 26(3): 2107–23.
Pal, P. B. (2001). Dhvanimala Barnamala (The Sounds and the Alphabets).
Kolkata: Papyrus.
Peters, B. P., Collins, P., and Smith, A. (eds) (2002). New Frontiers of Corpus
Research. Language and Computers. Amsterdam, Atlanta, GA: Rodopi.
Ray, P. S. (1997). Bengali Language Handbook. Kolkata: Bangla Akademy.
Sarkar, P. (1993). Bangla bhasar yuktabyanjan (Consonant clusters in Bangla),
Bhasa, 1: 23–45.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics: Proceedings of Nobel
Symposium 82. Berlin: Mouton de Gruyter.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John
Benjamins.
Tripathi, J. N. (1971). A statistical analysis of Devnagari (Hindi) text graphemes.
Journal of IETE, 17(1): 25–7.
Williams, C. B. (1940). A note on the statistical analysis of sentence length as a
criterion of literary style. Biometrika, 31: 356–61.

Literary and Linguistic Computing, Vol. 19, No. 2, 2004
