Michael Gasser
Indiana University
Bloomington, Indiana, USA
gasser@cs.indiana.edu
3.1 System architecture

The full morphology processing system (see Figure 1) consists of analysis and generation FSTs for each language. For Amharic and Tigrinya there are separate lexical and “guesser” FSTs for each processing direction. Verbs (all three languages) and nouns (Amharic and Oromo only) are handled by separate FSTs. Amharic and Oromo have separate FSTs for verb segmentation, as opposed to grammatical analysis, and Amharic has a separate FST for orthography-to-phonology conversion.

1 Amharic adjectives, which behave much like nouns, are grouped with nouns in the system.

Each of these FSTs in turn results from the composition of a cascade of simpler FSTs, each responsible for some aspect of morphology. The most complicated cases are Amharic and Tigrinya verbs, which we discuss in more detail in what follows and illustrate in Figure 2. At the most abstract (lexical) end is the heart of the system, the morphotactic FST. Most of the complexity is within the FST responsible for the stem. The stem FST is composed from a sequence of five simpler FSTs representing stem-specific alternation rules and the legal sequences of consonants and vowels making up the stem. For the lexical, but not the guesser, FSTs, a further FST containing the lexicon of known roots is part of the stem cascade.

Prefix and suffix FSTs are concatenated onto the stem FST to create the full verb morphotactic FST. The remaining FSTs (15 of them for Amharic verbs, 17 for Tigrinya verbs) implement alternation rules that apply to the word as a whole, including allomorphic rules and general phonological or orthographic rules.

Figure 2 shows the analysis of the Amharic verb የማትሰደበውም yəmmatɨssəddəbəwm ‘who (she) is also not insulted’. The word is input in the Ge’ez orthography used for Amharic and Tigrinya. This writing system fails to indicate consonant gemination as well as the epenthetic vowel ɨ, which is introduced to break up some consonant clusters in both languages. Gemination is extremely important for natural speech synthesis in Amharic and Tigrinya, so it is crucial to be able to restore it in text-to-speech applications. There is in fact relatively little ambiguity with respect to gemination, but gemination is so tied up with the morphology that a relatively complete morphological analyzer is necessary to perform the restoration. HornMorpho has this capacity.

The word is first romanized to yematsedebewm.2 At this stage, none of the consonants is geminated, and the epenthetic vowel is missing in the romanized form. Processing is then handled by the single analysis FST, but to understand what goes on, it is still convenient to think of the process in terms of the equivalent cascade of simpler FSTs operating in sequence. The first FST in the cascade performs orthographic-to-phonological conversion, resulting in all possible pronunciations of the input string, including the correct one with the appropriate consonants geminated. This form and the other surviving strings are processed by the intervening phonological FSTs, each responsible for an alternation rule. Among the strings that survive as far as the morphotactic FST is the correct string, yemmatIsseddebewm, which is analyzable as the pair of prefixes yemm-at, the stem sseddeb, and the pair of suffixes ew-m.

2 HornMorpho uses an ASCII romanization scheme developed by Firdyiwek and Yaqob (1997).

The stem is processed by the stem FST, which extracts the root, sdb, and various grammatical properties, including the fact that this form is passive. The lexical analyzer includes all of the verb roots known to the system within the stem FST, whereas the guesser analyzer includes only information about sequences of consonants making up possible roots in the language. For example, if the consonant sequence s, d, b in the original word were replaced by a fictitious root mbz, the guesser analyzer (but not the lexical analyzer) would posit this as the root of the word. The final output analyses are shown at the top of Figure 2. The three possibilities correspond to three different HornMorpho functions, discussed in the next section.

3.2 Functions

Each of the functions for morphological analysis has two versions, one for analyzing single words, the other for analyzing all of the words in a file. The functions anal_word and anal_file take input words and output a root or stem and a grammatical analysis. In Figure 2, the output of anal_word is outlined in green. The input word has the root sdb ‘insult’ and the citation form °˜Ô¤; has a third person singular feminine subject; is in the imperfective tense/aspect; is relativized, negative, and definite; and has the conjunctive suffix -m.

For Amharic and Oromo, there are two additional analysis functions, seg_word and seg_file, which segment input verbs (also nouns for Oromo) into sequences of morphemes. In the example in Figure 2, the output of seg_word is shown in red. The constituent morphemes are separated by hyphens, and the stem is enclosed in brackets. The root and template for the stem are separated by a plus sign. The template notation 11e22e3 indicates that the first and second root consonants are geminated and the vowel e is inserted between the first and second and between the second and third root consonants.

For Amharic only, there are further functions, phon_word and phon_file, which convert the input orthographic form to a phonetic form as would be required for text-to-speech applications. In the figure, the output of phon_word is outlined in blue. Three of the consonants are geminated, and the epenthetic vowel (romanized as I) has been inserted to break up the cluster tss.

Below are more examples of the analysis functions, as one would call them from the Python interpreter. Note that all of the HornMorpho functions take a first argument that indicates the language. Note also that when a wordform is ambiguous (Example 3), the analysis functions return all possible analyses.

Example 1 anal_word (Tigrinya)
>>> anal_word('ti', 'bÈݵŒ¹')
Word: bÈݵŒ¹
POS:verb, root:<gTm>, cit:ƒÝ°Œ
subject: 3, sing, masc
object: 1, plur
grammar: imperf, recip, trans, rel
preposition: bI

Example 2 seg_word (Oromo)
>>> seg_word('om', 'dhukkubdi')
dhukkubdi: dhukkub-t-i

Example 3 phon_word (Amharic)
>>> phon_word('am', 'yŒ³‡')
yImetallu yImmettallu

There is a single function for generation, gen, which takes a stem or root and a set of grammatical features. For each part of speech, there is a default set of features, and the features provided in the function call modify these. In order to use gen, the user needs to be familiar with the HornMorpho conventions for specifying grammatical features; these are described in the program documentation.

With no grammatical features specified, gen returns the canonical form of the root or stem, as in the Oromo example 4 (sirbe is the third person singular masculine past form of the verb). Example 5 is another Oromo example, with additional features specified: the subject is feminine, and the tense/mood is present rather than past.

Example 4 gen (Oromo 1)
>>> gen('om', 'sirb')
sirbe

Example 5 gen (Oromo 2)
>>> gen('om', 'sirb', '[sb=[+fem],tm=prs]')
sirbiti

4 Evaluation

Evaluating HornMorpho is painstaking because someone familiar with the languages must carefully check the program’s output. A useful resource for evaluating the Amharic and Tigrinya analyzers is the set of word lists compiled by Biniam Gebremichael’s web crawler, available on the Internet at http://www.cs.ru.nl/~biniam/geez/crawl.php. The crawler extracted 227,984 unique Tigrinya wordforms and 397,352 unique Amharic wordforms.

To evaluate the Amharic and Tigrinya analyzers in HornMorpho, words were selected randomly from each word list, until 200 Tigrinya verbs, 200 Amharic verbs, and 200 Amharic nouns and adjectives had been chosen. The anal_word function was run on these words, and the results were evaluated by a human reader familiar with the languages. An output was considered correct only if it found all legal combinations of roots and grammatical structure for a given wordform and included no incorrect roots or structures. The program made 8 errors on the Tigrinya verbs (96% accuracy), 2 errors on the Amharic verbs (99% accuracy), and 9 errors on the Amharic nouns and adjectives (95.5% accuracy).

To test the morphological generator, the gen function was run on known roots belonging to all of the major verb root classes.3 For each of these classes, the program was asked to generate 10 to 25 verbs depending on the range of forms possible in the class, with randomly selected values for all of the different dimensions, a total of 330 tests. For Amharic, the program succeeded on 100% of the tests; for Tigrinya it succeeded on 93%.

3 The Amharic noun generator has not yet been evaluated.

In all cases, the errors were the result of missing roots in the lexicon or bugs in the implementation of specific phonological rules. These deficiencies have been fixed in the most recent version of the program.

Although more testing is called for, this evaluation suggests excellent coverage of Amharic and Tigrinya verbs for which the roots are known. Verbs are the source of most of the morphological complexity in these languages. Nouns and adjectives, the only other words calling for morphological analysis, are considerably simpler. Because the plural of Tigrinya nouns is usually not predictable, and we have access to only limited lexical resources for the language, we have not yet incorporated noun analysis and generation for that language. For Amharic, however, the system is apparently able to at least analyze the great majority of nouns and adjectives. We treat all Amharic words other than verbs, nouns, and adjectives as unanalyzed lexemes.

For Oromo, the newest language handled by HornMorpho, we have not yet conducted a comparable evaluation. Any evaluation of Oromo is complicated by the great variation in the use of double consonants and vowels by Oromo writers. We have two alternatives for evaluation: either we make the analyzer more lenient so that it accepts both single and double vowels and consonants in particular contexts or we restrict the evaluation to texts that have been verified to conform to particular orthographic standards.

5 Conclusions and ongoing work

For languages with complex morphology, such as Amharic, Tigrinya, and Oromo, almost all computational work depends on the existence of tools for morphological processing. HornMorpho is a
first step in this direction. The goal is software that serves the needs of developers, and it is expected that the system will evolve as it is used for different purposes. Indeed, some features of the Amharic component of the system have been added in response to requests from users.

One weakness of the present system results from the limited number of available roots and stems, especially in the case of Tigrinya. When a root is not known, the Tigrinya verb guesser analyzer produces as many as 15 different analyses, when in many cases only one of these contains a root that actually exists in the language. However, the guesser analyzer itself is a useful tool for extending the lexicon; when an unfamiliar root is found in multiple wordforms and in multiple morphological environments, it can be safely added to the root lexicon. We have explored this idea elsewhere (Gasser, 2010).

A more significant weakness of the analyzers for all three languages is the handling of ambiguity. Even when a root or stem is known, there are often multiple analyses, and the program provides no information about which analyses are more likely than others. We are currently working on extending the weighted FST framework to accommodate probabilities as well as feature structures on transitions so that analyses can be ranked for their likelihood.

Although Amharic and Tigrinya have very similar verb morphology, they are handled by completely separate FSTs in the current implementation. In future work we will be addressing the question of how to share components of the system across related languages and how to build on existing resources to extend the system to handle related Semitic (e.g., Tigre, Silt’e) and Cushitic (e.g., Somali, Sidama) languages of the region.

Finally, HornMorpho is designed with developers in mind, people who are likely to be comfortable interacting with the program through the Python interpreter. However, morphological analysis and generation could also be of interest to the general public, including those who are learning the languages as second languages. We are currently experimenting with more user-friendly interfaces. As an initial step, we have created a web application for analyzing and generating Tigrinya verbs, which is available here: http://www.cs.indiana.edu/cgi-pub/gasser/L3/morpho/Ti/v/anal/.

References

Aklilu, A. (1987). Amharic-English Dictionary. Kuraz Printing Press, Addis Ababa.

Amsalu, S. and Demeke, G. A. (2006). Non-concatenative finite-state morphotactics of Amharic simple verbs. ELRC Working Papers, 2(3).

Amtrup, J. (2003). Morphology in machine translation systems: Efficient integration of finite state transducers and feature structure descriptions. Machine Translation, 18:213–235.

Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications, Stanford, CA, USA.

Bitima, T. (2000). A dictionary of Oromo technical terms. Rüdiger Köppe Verlag, Köln.

Firdyiwek, Y. and Yaqob, D. (1997). The system for Ethiopic representation in ASCII. URL: citeseer.ist.psu.edu/56365.html.

Gasser, M. (2009). Semitic morphological analysis and generation using finite state transducers with feature structures. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 309–317, Athens, Greece.

Gasser, M. (2010). Expanding the lexicon for a resource-poor language using a morphological analyzer and a web crawler. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.

Gragg, G. (1982). Oromo dictionary. Michigan State University Press, East Lansing, MI, USA.

Kaplan, R. M. and Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20:331–378.

Karttunen, L., Kaplan, R. M., and Zaenen, A. (1992). Two-level morphology with composition. In Proceedings of the International Conference on Computational Linguistics, volume 14, pages 141–148.

Koskenniemi, K. (1983). Two-level morphology: a general computational model for word-form recognition and production. Technical Report Publication No. 11, Department of General Linguistics, University of Helsinki.

Zacarias, E. (2009). Memhir.org dictionaries (English-Tigrinya, Hebrew-Tigrinya dictionaries). Available at http://www.memhr.org/dic/.
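
Supplementary sketch. The root-and-template notation described in Section 3.2 (root sdb, template 11e22e3) can be illustrated with a few lines of Python. This is not HornMorpho code: the function name expand_template is hypothetical, and the reading of the notation (each digit selects the root consonant at that 1-based position, a repeated digit marks gemination, any other character is a literal vowel) is an assumption based only on the description in the text. It also assumes each root consonant is written as a single character.

```python
def expand_template(root, template):
    """Fill a stem template such as '11e22e3' with the consonants of `root`.

    Hypothetical helper for illustration only (not part of HornMorpho).
    Assumed reading of the notation: an ASCII digit d stands for the d-th
    root consonant (1-based), a doubled digit yields a geminated consonant,
    and every other character is copied through unchanged.
    """
    pieces = []
    for ch in template:
        if ch.isascii() and ch.isdigit():
            index = int(ch) - 1  # template positions are 1-based
            if not 0 <= index < len(root):
                raise ValueError(f"template position {ch} has no root consonant")
            pieces.append(root[index])
        else:
            pieces.append(ch)  # literal vowel taken from the template
    return "".join(pieces)


if __name__ == "__main__":
    # Root and template taken from the worked example in the text.
    print(expand_template("sdb", "11e22e3"))
```

Under these assumptions, the root sdb and the template 11e22e3 yield sseddeb, the stem given in Section 3.1 for the example word.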