Phoneme-Based English-Amharic Statistical Machine Translation

Mulu Gebreegziabher Teshome (PhD), IT Doctoral Program, Addis Ababa University, Addis Ababa, Ethiopia. Email: mulu@ethiotelecom.et
Laurent Besacier (Prof.), University Joseph Fourier, Grenoble, France. Email: laurent.besacier@imag.fr
Girma Taye (PhD) and Dereje Teferi (PhD), Addis Ababa University, Addis Ababa, Ethiopia
978-1-4799-7492-4/15/$31.00 © 2015 IEEE

Abstract: This research considers the application of statistical methods to automatic Machine Translation (MT) from English to Amharic. The research focuses on improving translation quality by applying phonemic transcription on the target side, which is Amharic. The phoneme-based English-Amharic Statistical Machine Translation (EASMT) system achieves a BLEU score of 37.53%, a gain of 2.21 BLEU points over a baseline phrase-based EASMT system with a BLEU score of 35.32%. This shows that phoneme-based translation outperforms the baseline system.

I. INTRODUCTION

MT is the application of computers to translate text from one natural language to another using computer-generated instructions. The MT experiment is conducted between the Amharic and English languages. The two languages represent a wide range of linguistic diversity and make the MT task challenging. English, which belongs to the Indo-European language family, is the foreign language used in Ethiopia as a medium of instruction and communication in governmental, non-governmental, regional and international organizations. Amharic, on the other hand, belongs to the Semitic branch of the Afro-Asiatic language family and is spoken in Ethiopia. It is the official language of the Federal Democratic Republic of Ethiopia.

Amharic is the second most widely spoken Semitic language in the world after Arabic, as noted by [1]. According to [1], [2], Amharic is spoken as a first language by more than 27 million people in the country. Amharic is also spoken as a second language by many Ethiopians outside the Amhara Region and the capital city of Addis Ababa. [1] also estimates that there are several million Ethiopian immigrants, especially in North America and Israel, who are native speakers of Amharic. Studies show that English and Amharic are disparate languages, which makes translation very difficult. The differences between English and Amharic are: Latin vs. Geez script; consonant-vowel vs. syllabic characters; very few vs. many orthographic characters; one-to-one vs. duplicate phonemic characters; simple vs. complex morphology, including concatenative vs. non-concatenative word formation and stem vs. root morpheme; word boundaries delimited with white-space vs. no clear boundary between word and sentence; Latin-based punctuation vs. Geez-based punctuation; Indo-Arabian vs. Geez numbers; Date-Month-Year difference; and SVO vs. SOV syntax.

Some of these variations, such as the syllabic characters, duplicate phonemic characters and complex morphology found in Amharic, bring about data sparseness and out-of-vocabulary (OOV) problems. The phonemic variations in Amharic that challenge a statistical machine translation (SMT) system are those phonemes that can be represented by more than one series of symbols. For example, there are four distinct letter forms or graphemes representing the /h/ sound (ሀ, ሐ, ኀ and ኸ). These letters all sound the same and are not separate phonemes. This brings about data sparsity or OOV problems, as a single phoneme can be represented by two or more symbols that are used as options to write a given word.

Allophonic variation in Amharic is another problem for SMT. The allophones that pose problems to SMT are those called free variants. Free variant allophones in Amharic occur in identical positions and are dialectal variants that do not bring about a difference in meaning. For example, [t] in [təbəl] and [s] in [səbəl] are allophones that can be used interchangeably. This is an indication that phonemes can have many allophone variants. Such allophone variation, like phoneme variation, contributes to the data sparseness that poses a problem to SMT.

Therefore, determining whether we have different phonemes or allophones of a single phoneme, the position of sounds within words, the distribution of sounds and phonemic or allophonic variations are some of the challenges to the SMT system. The purpose of this research is to address these problems.

The phoneme-based experiment, combined with normalization and segmentation using Morfessor [10], reduces the data sparseness and OOV problems. The phoneme-based experiment has been conducted in order to address the syllabic character variation on the target side, which is Amharic. Since the experiment is made between the Amharic and English languages, the Amharic syllables have to be converted to consonant and vowel combinations similar to those of the English language. Therefore, a phoneme-based EASMT system is built by converting the Amharic syllables to phoneme-based characters. The conversion methodology is grapheme-to-phoneme conversion, applied to all Amharic corpora, including the Language Model (LM), training, tuning and reference corpora.

II. LITERATURE REVIEW

A phoneme is the basic and smallest distinctive unit sound, which may bring about a change of meaning, in the sound system of a specific language, as defined by [4], [6], [7]. A given sound is called a unit if a different word cannot be formed by replacing part of the sound with another different sound. For example, the word "cat" in the English language
can be divided into three phonemes corresponding to the letters /c/, /a/ and /t/. The three are unit sounds and none of them can be further divided into parts. Two sounds are called distinctive if interchanging them changes the meaning of a given word. Changing a single phoneme in the word "cat" is sufficient to make a word that is recognizably different to a speaker of English. Replacing the phoneme /c/ in [cat] with /b/ brings about a change in meaning, making it a different word: [bat] and [cat] are recognizably different words to an English speaker. Similarly, the paired Amharic words ሐር [har] for "silk" vs. ሳር [sar] for "grass" are examples of distinctive unit sounds in Amharic.

Amharic has thirty-nine phonemes, of which seven are vowel phonemes and thirty-two are consonant phonemes. According to [5], it was [9] who noted that there is no agreement among scholars about the number of phonemes in Amharic. The total number of phonemes is put in the range of 22 to 34 in much of the literature. The debate is whether or not to consider the labialized and palatalised consonants as phonemes, as shown in Table I.

Labial Consonants    kʼʷ, hʷ, kʷ, gʷ
Palatal Consonants   ʃ, […], ʒ, […], […]
Table I. AMHARIC LABIAL AND PALATAL CONSONANTS

For reasons discussed shortly, the total number of consonant phonemes used in our experiment, including the labialized and palatalised consonants, is 32. The symbols that represent consonant phonemes can be divided into two groups. The first group comprises the twenty-eight basic consonant phonemes, each of which has seven separate symbols depending on the combined vowel. These consonant phonemes are listed in the traditional order in Table II. The second group comprises the four labialized phonemes listed in Table I, which are formed by lip-rounding and releasing. Each labialized consonant has five separate symbols, combined with only five of the seven regular vowels, as shown in the row labelled Vowels (Labial) in Table II.

Consonants        /h/, /l/, /m/, /s/, /r/, /ʃ/, /kʼ/, /b/, /t/, /tʃ/, /n/, /ɲ/, /k/, /w/, /ʔ/, /z/, /ʒ/, /y/, /d/, /dʒ/, /g/, /tʼ/, /tʃʼ/, /pʼ/, /sʼ/, /f/, /p/, /v/
Regular Vowels    /ə/, /u/, /i/, /a/, /e/, /ɨ/, /o/
                  አ, ኡ, ኢ, ኣ, ኤ, እ, ኦ
Vowels (Labial)   /ə/, /i/, /a/, /e/, /ɨ/
Table II. AMHARIC CONSONANT AND VOWEL PHONEMES

The symbols that represent the regular Amharic vowels, again in the traditional order, along with their Geez equivalents, are shown in Table II. Since እ /ɨ/ is an epenthetic vowel, it can be added to make pronunciation easier or it can be omitted.

In most of the literature, with the exception of [3], the consonant phoneme /v/, which is found in words of foreign origin, is either not counted among the Amharic phonemes, as in [1], or is omitted altogether, as in [4]. The justification given is that this phoneme is used in foreign loanwords, as pointed out by [1]. But this is not the only phoneme used in loanwords. Two other phonemes, /pʼ/ and /p/, are used mostly in loanwords adopted into Amharic, such as ጴጥሮስ [Petros] for "Peter" and ጳውሎስ [Pawlos] for "Paulus". The consonant phoneme /v/ is likewise used in words of foreign origin, such as ቪ /vi/ in ቴሌቪዥን [televiʒn] for "television" or in [ʔe.ʔaye.vi] for "HIV", and ቫ /va/ in ቫዮሊን [vayolin] for "violin". The phoneme /v/ is not lexically poor; it is even richer than the phoneme /pʼ/, for which entries are harder to find in the dictionary. Therefore, /v/ should be included with the rest of the Amharic phonemes rather than omitted or treated as an additional phoneme.

The distinguishing feature of consonants in Amharic is the presence of ejective and labialized sounds. The ejective consonant phonemes, which do not occur in English, are /kʼ/, /tʼ/, /tʃʼ/, /pʼ/ and /sʼ/, produced with the root of the tongue retracted. Most phonemes, with the exception of /w/, /ʔ/, /y/, /j/ and /p/, have at least one labialized sound formed by /ʷa/. The phoneme /ʔ/ has one labialized sound formed by /ʷə/. These are allophones of their corresponding consonant phonemes, realized when they are followed by two consecutive vowel phonemes. In most cases, the consonant phoneme is realized when followed by the vowel [u] followed by another vowel like [a]. Some of these allophones are formed by the vowel sequence [ua], as in ሏ [lʷa] for lua, ሟ [mʷa] for mua, ሷ [sʷa] for sua, ሯ [rʷa] for rua, and so on.

However, the phonemes /kʼʷ/, /hʷ/, /kʷ/ and /gʷ/ each have about five labialized variants and are treated as one-unit phonemes, not as allophones. Three of them, /kʼʷ/, /kʷ/ and /gʷ/, are considered one-unit phonemes by [4], while [3] takes all four, including /hʷ/, as separate phonemes. Unlike the other labialized characters, these labialized phonemes appear in roots and occur word-initially, where sequences of consonants are absent in Amharic according to [3]. These phonemes have more characters and are relatively more common than the rest of the labialized characters, which [3] gives as a further reason for treating them as phonemes.

III. EXPERIMENT

SMT systems require training on bilingual corpora. The English-Amharic parallel corpus from parliamentary documents, which was used in the previous phrase-based baseline preliminary experiment [8], is used for the phoneme-based EASMT experiment as well.

A. Size of the parallel corpus

A total of 18,432 English sentences aligned with Amharic sentences have been used for this experiment. Of the collected data, roughly 90% (16,432 randomly selected sentence pairs) has been used for training, while the remaining roughly 10% (2,000 sentence pairs) is divided equally between tuning and testing. Thus, the experiment is developed using a total of 18,432 English-Amharic parallel sentences and 254,649 monolingual sentences. The monolingual corpus is used for the LM. The LM corpus contains parliamentary data that are not included in the bilingual corpus, together with news corpora collected from the Ethiopian News Agency. Our previous preliminary experiment in [8] provides the detailed counts in the
data sets at the sentence, token and vocabulary levels used in the experiment. More information about data acquisition, document alignment, tokenization, sentence splitting and sentence alignment is given in our previous experiment [8].

B. Phoneme-based segmentation

Phoneme-based segmentation is able to segment at word boundaries in the Amharic orthography. The syllable symbols change at the boundary because of the attached inflectional markers, such as gender, number, possessive, person, tense, definiteness and other markers. The solution is to use phoneme segmentation by converting each syllable to a consonant-vowel character sequence, so that all phoneme symbols at the word boundary become similar. This is just like transliterating the Geez-based Amharic script into English characters of consonant-vowel combination. The difference between the two is that they do not share the same script: Amharic uses the Geez script while English uses the Latin script.

To illustrate boundary-level segmentation, Table III shows how phoneme-based segmentation helps more with data sparseness and OOV than segmentation on orthographic characters. The vocabulary items under the first column heading, Orthography, in Table III are selected from the LM word list used in the experiment, and those under the second column heading, Phoneme, are taken after the LM is converted to a phoneme-based LM. As can be seen from the first column of the table, all vocabulary items have four syllables in common, and this can be extended to five. The fifth syllable differs across the items because it appears at the boundary, which changes due to number, gender, conjunction, possessive and other phoneme additions. It is possible to reduce all words to [halafinət + morpheme], as shown in the last column, headed Final Output. But this is only possible if a grapheme-to-phoneme transcription is first applied to all words, the segmentation is then conducted, and finally the reverse phoneme-to-grapheme transcription is applied to return to syllables and obtain the final output shown in the last column.

The phoneme-to-grapheme transcription gives the segmenter more options, so that rare words that can be segmented in the range from the 3rd up to the 4th position can be extended to the 5th position as well. It is highly likely that more rare words are segmented at the fifth position, as in the final column of Table III, than at the 4th position, as in the first column. This shows that phoneme-based segmentation of morphemes helps to segment rare words. Each entry listed under the Final Output column represents morphemes of the Amharic language: a stem plus segmented morphemes for possession, plural and pronoun.

Thus, phoneme-based segmentation not only gives a segmentation that resembles the Amharic morphemes but also provides more options for segmenting Amharic words. Table III shows the possibility of reducing all five vocabulary items to one, provided each appears only once or twice in the list. However, if each vocabulary item is frequent, Morfessor ignores it, considering such items to be highly frequent words that are not candidates for segmentation.

C. Phonemic transcription

When a linguist records words as sequences of phonemes, as shown in Table IV, the result is termed a phonemic transcription, as defined by [6]. The English phonemic transcription in Table IV follows [6], while the Amharic one follows [4]. This is to be distinguished from orthographic transcription, which uses the customary spelling system of the language. Table IV shows the same sentence in orthographic and phonemic transcription. The first two lines are the orthographic and phonemic transcriptions for English, followed by the two corresponding lines for Amharic.

D. Orthographic transcription

Orthographic transcription means that words are written down using the customary spelling system, or orthography, of the language. The Amharic writing system uses the Geez syllabary or Geez alphabet called ፊደል [fidəl], meaning "letter", which was adapted from Geez, the extinct classical language of Ethiopia. It is a syllabic writing system in which consonants and vowels do not exist independently of each other. In other words, there are syllabic symbols for every Amharic sound, and each symbol represents a whole syllable in the Amharic orthography. The Amharic syllabic script is written horizontally from left to right.

The total number of consonants in the Amharic orthography is 34. Each consonant has seven orders (shapes or forms) depending on the vowel added to the consonant. Thus, the Amharic orthography has 34 × 7 = 238 major characters. There are also 41 labialized variants for most of the consonants. [1] puts the number lower, at 37. However, the labialized symbols actually found in the Amharic corpora indicate that the total number of such symbols is 41. This puts the complete list of characters in the Amharic writing system at 238 + 41 = 279, excluding punctuation, numbers and any other symbols.

The Amharic orthography also uses Geez number symbols, although the Hindu-Arabic numerals are more frequently used, as noted by [1]. However, in the official publications of parliamentary documents, Geez numbers are still used for writing the Amharic content and Hindu-Arabic numbers for writing the English content. Table V shows all the Geez numbers used in Amharic writing.

The other distinguishing feature of Amharic orthography is that its punctuation characters differ from those of Latin-script English writing, although there is no difference in meaning between the two writing systems. Amharic punctuation is taken from Geez punctuation, as shown in Table VI.

Latin     ,  ;  :  .  :-  ?  !
Amharic   ፣  ፤  ፥  ።  ፦  ?  !
Table VI. PUNCTUATIONS IN AMHARIC

The orthographic transcription that is used in our day-to-day written communication is far simpler than a phonemic transcription. However, both transcriptions convey the same
Orthography   Phoneme    Possible Segmentation   Final Output
+ tn          hlfntn     hlfnt + n               t + n
+             hlfntn     hlfnt + n               t +
+ hw          hlfntmw    hlfnt + mw              t + hw
+ n           hlfntn     hlfnt + n               t + n
+ m           hlfntm     hlfnt + m               t + m
Table III. SAMPLE AMHARIC SYLLABLE, PHONEME AND POSSIBLE PHONEME SEGMENTATION
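The effect summarised in Table III can be illustrated with a small Python sketch. It does not use the rows of Table III; instead it uses the house/houses pair discussed in Section III-E, and the romanised syllable and phoneme sequences below are illustrative assumptions, not corpus data. Comparing the shared prefix at both levels suggests why a segmenter can find a longer common stem once syllables are split into consonant and vowel symbols.

```python
def common_prefix(a, b):
    """Return the longest shared prefix of two sequences as a list."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


# Hypothetical romanised forms of "house" and "houses" (assumed for illustration).
# Syllable level: the boundary syllable changes from "t" (6th order) to "to".
house_syllables = ["be", "t"]
houses_syllables = ["be", "to", "ch"]

# Phoneme level: each syllable is split into consonant (+ vowel), and the
# 6th-order syllable stays a bare consonant.
house_phonemes = ["b", "e", "t"]
houses_phonemes = ["b", "e", "t", "o", "ch"]

syllable_stem = common_prefix(house_syllables, houses_syllables)
phoneme_stem = common_prefix(house_phonemes, houses_phonemes)

print("shared syllables:", syllable_stem)  # ['be']
print("shared phonemes: ", phoneme_stem)   # ['b', 'e', 't']
```

At the syllable level the two forms share only the first syllable, whereas at the phoneme level the full stem is shared and the plural material is left as a separable suffix.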
Transcription                              Description
This is an orthographic transcription.     This is an orthographic transcription.
/ðɪs ɪz ə foʊˈnimɪk trænˈskrɪpʃən/         This is a phonemic transcription.
yh { l f w                                 This is an orthographic transcription.
yh fnmk f nw                               This is a phonemic transcription.
Table IV. PHONEMIC VS. ORTHOGRAPHIC TRANSCRIPTIONS
Hindu-Arabic   1  2  3  4  5  6  7  8  9  10  20  30  40  50  60  70  80  90  100  10000
Amharic        ፩  ፪  ፫  ፬  ፭  ፮  ፯  ፰  ፱  ፲   ፳   ፴   ፵   ፶   ፷   ፸   ፹   ፺   ፻    ፼
Table V. NUMBERS IN AMHARIC
information provided one knows the rules. The transcription rules of a language clarify how to interpret a given word. For instance, the phonemic and orthographic transcriptions may sometimes look the same. The first words in both Amharic lines of the example above look the same. This is because every syllable of the word is in the 6th order. The vowel sound of the 6th order, as mentioned earlier, is epenthetic and can be omitted during the phonemic transcription of Amharic words.

E. Amharic grapheme-to-phoneme conversion

The phoneme-based experiment is conducted by converting each Amharic syllable to phoneme characters. These characters, including the input syllables and the output phonemes, are all transcribed using the Geez script. Let S be the set of all syllables in the Amharic orthography, as defined in Table VII. The converter recognizes S as any syllable in the Amharic language, to be converted to the pattern CV, where C is any Amharic consonant and V is any Amharic vowel. Let C be the set of all consonants in the Amharic phonemic transcription and V be the set of all vowels in the Amharic phonemic transcription.

S = any Amharic syllable, such as:
    [ሀ ሁ ሂ ሃ ሄ ህ ሆ ለ ሉ ...]
    [hə hu hi ha he h ho lə lu ...]
C = C ∈ S, any 6th-order Amharic consonant, such as:
    [ህ ል ም ስ ር ...]
    [h l m s r ...]
V = any Amharic vowel, such as:
    [አ ኡ ኢ ኣ ኤ እ ኦ]
    [ʔə ʔu ʔi ʔa ʔe ʔ ʔo]
Table VII. AMHARIC PHONEME TRANSCRIPTION

The conversion of the Amharic syllables of any given file is then performed with the help of Algorithm 1. The following are some of the points taken into consideration before, during and after the conversion process:

1) The labialized characters [ሏ ሟ ሯ ሷ ሿ ቋ ...] for [lʷa mʷa rʷa sʷa ʃʷa qʷa ...], respectively, have been left unchanged, since they do not exhibit any change at the boundary due to number, gender and other phoneme additions. For example, the labialized syllable ቧ [bʷa] in ቧንቧ [bʷanbʷa] for "faucet" does not change in the plural, as in ቧንቧዎች [bʷanbʷawotʃ] for "faucets", but the regular syllable ት [tɨ] in ቤት [betɨ] for "house" is changed to ቶ [to], as in ቤቶች [betotʃ] for "houses";
2) All syllables S ∈ C have been left unchanged. The vowel that follows the consonant C in the 6th order is epenthetic and is therefore omitted;
3) All syllables S ∈ V have been left unchanged. If they appear in the text, they are treated just like any other consonant, although we have defined them as vowels;
4) The graphemes ሐ and ኀ /h/ and their families have been changed to the corresponding grapheme ሀ /h/; likewise ሠ /s/ is converted to ሰ /s/ and ፀ /sʼ/ to ጸ /sʼ/.

Therefore, given the following Amharic syllable sentence as an example:

w bt Mzb s

the converter would deliver the following Amharic phoneme sequence:

ywh hbt yhzb slmhn.

Here, each syllable in the first sentence has been converted to a sequence of consonant and vowel phonemes, with the exception of the 6th-order syllables, which are left unchanged. Thus, the grapheme-to-phoneme matching is between one grapheme and one or two phonemes. For example, Figure 1 shows the grapheme-to-phoneme association for the first Amharic word of the above example, የውሃ [yəwha], which is converted to /ywh/. The first and third syllables are each matched to two phonemes, while the one in the middle is matched to one phoneme.

Similarly, all Amharic corpora mentioned earlier that are
Input:      he shall be punished with simple imprisonment not exceeding three months and fine
Output-P:   syqrb qt k sst wr b mybl qll esrt en ne.
Output-S:   rb q st r yl l est ne.
Output-M:   rb q st r yl l est ne.
Reference:  st r yl l est yl
Table VIII. AN EXAMPLE SHOWING PHONEME-TRANSLATION
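The syllable-to-phoneme conversion of Section III-E (Algorithm 1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: Algorithm 1 uses a syllable-phoneme conversion pair list, whereas the sketch derives the pairs from the regular eight-slot row layout of the Unicode Ethiopic block. The row list, the normalisation map and the function names are all introduced here for illustration, and the vowel set V is assumed to be the አ series.

```python
# Sketch only: relies on the 8-slot row layout of the Unicode Ethiopic block.
VOWEL_ROW = 0x12A0   # አ ኡ ኢ ኣ ኤ እ ኦ, assumed here to be the vowel set V
SIXTH_ORDER = 5      # zero-based offset of the 6th order within a row

# Start code points of consonant rows with the regular seven-order layout
# (assumed list; labiovelar rows and the vowel rows are deliberately absent).
REGULAR_ROWS = frozenset([
    0x1200, 0x1208, 0x1218, 0x1228, 0x1230, 0x1238, 0x1240, 0x1260,
    0x1268, 0x1270, 0x1278, 0x1290, 0x1298, 0x12A8, 0x12B8, 0x12C8,
    0x12D8, 0x12E0, 0x12E8, 0x12F0, 0x1300, 0x1308, 0x1320, 0x1328,
    0x1330, 0x1338, 0x1348, 0x1350,
])

# Duplicate-grapheme rows folded onto one row, following point 4 of the list
# in Section III-E (the exact direction of each mapping is an assumption).
NORMALISE_ROWS = {
    0x1210: 0x1200,  # ሐ family -> ሀ family
    0x1280: 0x1200,  # ኀ family -> ሀ family
    0x1220: 0x1230,  # ሠ family -> ሰ family
    0x1340: 0x1338,  # ፀ family -> ጸ family
}


def syllable_to_phonemes(ch):
    """Map one character to 6th-order consonant (+ vowel), or leave it as is."""
    cp = ord(ch)
    order = cp % 8                      # position within the row (0..7)
    row = cp - order
    row = NORMALISE_ROWS.get(row, row)
    if row not in REGULAR_ROWS or order == 7:
        return ch                       # vowels, labialized forms, other text
    consonant = chr(row + SIXTH_ORDER)
    if order == SIXTH_ORDER:
        return consonant                # 6th order: epenthetic vowel omitted
    return consonant + chr(VOWEL_ROW + order)


def convert(text):
    """Convert every syllable of a text to the CV pattern."""
    return "".join(syllable_to_phonemes(ch) for ch in text)


# The first word of the running example, የውሃ: the first and third syllables
# each yield two symbols, the middle 6th-order syllable yields one.
print(convert("\u12e8\u12cd\u1203"))
```

Deriving the pairs from code-point arithmetic keeps the sketch short; an explicit pair list, as in Algorithm 1, would be easier to audit and to extend to the irregular rows that this sketch leaves untouched.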
Algorithm 1: Amharic syllable to Amharic phoneme conversion algorithm
Require: An Amharic text transcribed using the Geez script
Ensure: An Amharic text converted to Amharic phonemes of pattern CV
  Accept an Amharic text file f
  while file f is defined/valid do
    Open f
    Read f
    while line l is not EOF do
      Read l
      Replace each syllable S by phoneme pattern CV using the syllable-phoneme conversion pair list
      for all lines ln ∈ S do
        print ln
        print newline
      end for
    end while
  end while

[Figure 1. Example for grapheme-to-phoneme matching]

used to build the phoneme-based EASMT system have been converted by taking all the above-mentioned options into consideration.

IV. DISCUSSION

Finally, the phoneme-based EASMT system is built and the output Amharic translated corpus has been evaluated against the reference corpus. Table VIII shows an example translation from English to Amharic, with the candidate and reference translations of the phoneme-based EASMT system. Input is the source English sentence to be translated, and Output-P is the candidate output, i.e. the phoneme-based translated target Amharic sentence. Output-S is the transcription of Output-P back to syllables to make it easier to comprehend. Although the output sentence indicated in the table as Output-S is not exactly the same as the manually translated sentence in Reference, some of the matching words are segmented in the phoneme-based output translation.

For example, the word [kəsost] for "up to three" is segmented to [kə sost], [bəmaybəlt] for "not exceeding" is segmented to [bə maybəlt], and [ʔsratna] for "imprisonment and" is segmented to [ʔsrat ʔna]. The last word in the phonemic translation contains the conjunction "and", and this suffix is also successfully segmented. Judging by this translation example, Output-P has more segmented words than the syllable-based morpheme segmentation, which is the row indicated by Output-M. The segmented word in the syllable-based segmentation Output-M is missing the conjunction "and", which indicates that segmentation based on phonemes produces better results than segmentation based on syllables.

V. CONCLUSION

Although developing a statistical machine translation system from English to Amharic is difficult due to the differences that exist between the two languages, our experiment using Amharic grapheme-to-phoneme conversion has been successful in improving translation by reducing the data sparseness and OOV problems.

The BLEU score for the output of the phoneme-based system is 37.53%, a gain of 2.21 BLEU points over the 35.32% obtained in [8] with the baseline phrase-based system using orthographic transcription. This is a relative increase of 6.26%.

REFERENCES

[1] Imed Zitouni, Natural language processing of Semitic languages, Heidelberg, New York: Springer, 2014.
[2] Central Statistical Agency, Population projection for Ethiopia: 2007-2037, Addis Ababa, Ethiopia: Central Statistical Agency, 2013.
[3] Anbessa Teferra and Grover Hudson, Essentials of Amharic, Köln: Rüdiger Köppe Verlag, 2007.
[4] Baye Yimam, Yamarigna Sewasiw (Amharic Grammar), 2nd ed. Addis Ababa, Ethiopia: CASE, 2008.
[5] Bezza Tesfaw Ayalew, The submorphemic structure of Amharic: toward a phonosemantic analysis, Urbana-Champaign, USA: University of Illinois, 2013.
[6] Bruce Hayes, Introductory Phonology, Chichester: John Wiley & Sons, 2009.
[7] Bezza Tesfaw Ayalew, Natural language processing and Applications, Birmingham: University of Birmingham, 2007.
[8] Mulu Gebreegziabher Teshome and Laurent Besacier (Prof.), A preliminary experiment on English-Amharic Statistical Machine Translation (EASMT), In: Proceedings of the 3rd International Workshop on Spoken Languages Technologies for Under-resourced Languages (SLTU). ISBN: 978-1-86822-615-3. 7-9 May 2012, Cape Town, South Africa. pp. 36-41.
[9] Baruch Podolsky, Historical Phonetics of Amharic, Tel Aviv, Israel: Tel Aviv University, 1991.
[10] M. Creutz and K. Lagus, Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report A81, Publications in Computer and Information Science, Helsinki, Finland: Helsinki University of Technology, 2005.
