



A morphological analyzer is one of the most important basic tools in the automatic
processing of any human language. It analyses the naturally occurring word forms in a
sentence and identifies the root word and its features. In spite of their significance,
no morphological analyzers for Dravidian languages are available in the public
domain. The absence of such a tool for research severely impedes the development of
language technologies and applications such as natural language interfaces and machine
translation in these languages. In this thesis, a Tamil morphological analyzer is used
for preprocessing Tamil sentences.

6.1.1 Morphology in Language

The grammar of any language can be broadly divided into morphology and syntax. The
term morphology was coined by August Schleicher in 1859. Morphology deals with
words and their construction; syntax deals with how to put words together in some
order to make meaningful sentences. Morphology is the field within linguistics that
studies the internal structure of words. While words are generally accepted as being the
smallest units of syntax, it is clear that in most languages words can be related to other
words by rules. Morphology attempts to formulate rules that model the knowledge of
the speakers of those languages. Morphemes are the smaller elements of which words
are built. Two broad classes of morphemes are stems and affixes; affixes are morphemes
added to a base to denote relations between words. Morphemes can either be free
(they can stand alone, i.e. they can be words in their own right), e.g. dog, or
bound (they must occur as part of a word), e.g. the plural suffix -s in dogs.

6.1.2 Computational Morphology

Computational morphology deals with developing theories and techniques for the
computational analysis and synthesis of word forms. By computational analysis of
morphology, one can extract any information encoded in a word and bring it out so that
later layers of processing can make use of it.

6.1.3 Morphological Analyzer

Morphological analysis segments words into a lemma and morpho-lexical
information. It is a primary step for various types of text analysis in any language.
Morphological analyzers are used in search engines for retrieving documents from
keywords [169], where they increase recall. They are also used in speech recognition,
lemmatization, information retrieval and extraction, summarization, spell and grammar
checking, and machine translation.

In written text, a word is defined as a sequence of characters delimited by spaces,
punctuation marks, etc. There is no difficulty in identifying words in written text
entered into the computer, because one simply has to look for the delimiters.
A word can be of two types: simple and compound. A simple word consists of a root or
stem together with suffixes or prefixes. A compound word (also called a conjoined
word) can be broken up into two or more independent words. Each of the constituent
words in a compound word is either a compound word or a simple word and may be
used independently as a word. On the other hand, the root and the affixes, which are
the constituents of a simple word, are not all independent words and cannot occur as
separate words in the text.

The constituents of a simple word are called morphemes or meaning units. The
overall meaning of a simple word comes from the morphemes and their relationships.
Similarly, the meaning of a compound word follows from its constituent words
and their inter-relationships. Note that a pragmatic position is taken here
regarding words: anything that is identifiable using the delimiters is a word. This is a
convenient position to take from the processing viewpoint. The same holds for the
definition of compound words.

With the above definition, an analyzer of words in a sentence does not have to
do much work in identifying a word. It simply has to look for the delimiters. Having
identified the word, it must determine whether it is a compound word or a simple word.
If it is a compound word, it must first be broken up into its constituent simple words
before they are analyzed. The component that does the former is called a sandhi
analyzer and the one that does the latter is the morphological analyzer; both are
important parts of a word analyzer.

The detailed linguistic analysis of a word can be useful for NLP. However, most
NLP researchers have concentrated on other aspects, such as grammatical analysis and
semantic interpretation. As a result, NLP systems use rather simple morphological
analyzers. A generator does the reverse of an analyzer: given a root and its features (or
affixes), a morphological generator generates a word. Similarly, a sandhi generator can
take the output of a morphological generator and group simple words into compound
words, where possible.

Examples of morphological analysis:

books = book+Noun+Plural (or) book+Verb+Pres+3SG
stopping = stop+Verb+Cont
happiest = happy+Adj+Superlative
went = go+Verb+Past

Examples of morphological generation:

book+Noun+Plural = books
stop+Verb+Cont = stopping
happy+Adj+Superlative = happiest
go+Verb+Past = went
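
The analyzer/generator relationship in the examples above can be illustrated with a very small sketch. It is only a lookup table built from the four English examples, not a real analyzer; the names ANALYSES, analyze and generate are hypothetical and chosen for illustration.

```python
# Toy lexicon copied from the English examples above.
# A real analyzer derives these analyses from rules or a trained model.
ANALYSES = {
    "books": ["book+Noun+Plural", "book+Verb+Pres+3SG"],
    "stopping": ["stop+Verb+Cont"],
    "happiest": ["happy+Adj+Superlative"],
    "went": ["go+Verb+Past"],
}

# The generator is the inverse mapping: analysis string -> surface form.
GENERATION = {
    analysis: surface
    for surface, analyses in ANALYSES.items()
    for analysis in analyses
}


def analyze(word):
    """Return every known analysis of a surface form (empty list if unknown)."""
    return ANALYSES.get(word, [])


def generate(analysis):
    """Return the surface form for a root+features string (None if unknown)."""
    return GENERATION.get(analysis)
```

Note that analysis may be ambiguous (books has two readings), while generation from a fully specified feature string is not.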

6.1.4 Role of Morphological Analyzer in NLP

A morphological analyzer plays an important role in the field of Natural Language
Processing. Figure 6.1 shows the role of the morphological analyzer in NLP. Some of the
applications are:

Spell checker
Search Engines
Information extraction and retrieval
Machine Translation system
Grammar checker
Content analysis
Question Answering system

Automatic sentence analyzer
Dialogue system
Knowledge representation in learning
Language teaching
Language-based educational exercises

[Figure: Morphological Analyzer linked to Text Processing, Speech, Machine Translation and Search Engines]

Figure 6.1 Role of Morphological Analyzer in NLP


6.2.1 Tamil Morphology and Language

Tamil morphology is very rich. Tamil is an agglutinative language, like the other
Dravidian languages. Tamil words are made up of lexical roots followed by one or more
affixes. The lexical roots and the affixes are the smallest meaningful units and are called
morphemes. Tamil words are therefore made up of morphemes concatenated to one
another in a series. The first one in the construction is always a lexical morpheme
(lexical root). This may or may not be followed by other functional or grammatical
morphemes. For instance, the Tamil word puththakangkaL can be meaningfully divided
into puththakam and kaL.

In this example, puththakam is the lexical root, representing a real-world entity, and
kaL is the plural feature marker (suffix). kaL is a grammatical morpheme that is bound
to the lexical root to add plurality to it. Unlike English, Tamil words can have a long
sequence of morphemes. For example,

puththakangkaLai = puththakam (book) + kaL (s) + ai (ACC. case marker).

Tamil nouns can take case suffixes after the plural marker. They can also have
postpositions after that. Most Tamil affixes are suffixes. Tamil suffixes can be
derivational suffixes, which change either the part of speech of the word or its meaning,
or inflectional suffixes, which mark categories such as person, number, mood, tense, etc.
Words can be analyzed like the one above by identifying the constituent
morphemes and their features.

6.2.2 Syntax of Tamil Morphology

Tamil is a consistently head-final language. The verb comes at the end of the clause,
with the typical word order Subject Object Verb (SOV). Tamil is also a relatively free
word-order language: the noun phrase arguments before a final verb can appear in any
permutation, and the sentence still conveys the same sense. Tamil has postpositions
rather than prepositions. Demonstratives and modifiers precede the noun within the
noun phrase. Subordinate clauses precede the verb of the matrix clause.

Tamil is a null subject language. Not all Tamil sentences have subjects, verbs
and objects. It is possible to construct valid sentences that have only a verb, such as
mudi-wth-athu ("Completed"), or only a subject and object, without a verb,
such as athu en viitu ("That, my house"). Tamil does not have a
copula (a linking verb equivalent to the word is). The word is included in
translations only to convey the meaning more easily.

6.2.3 Word Formation Rules (WFR) in Tamil

Any new word created by a Word Formation Rule (WFR) must be a member of a major
lexical category. The WFR determines the category of its output. In Tamil,
the grammatical category may or may not change after the operation of a WFR.
The following is the list of inputs and outputs of the different kinds of WFRs in the
derivation of simple words in Tamil [170].

1. Noun → Noun

[ [vElai]N + [kAran]suf ]N 'servant'
[ [thozil]N + [ALi]suf ]N 'laborer'

2. Verb → Noun

[ [padi]V + [ppu]suf ]N 'education'
[ [ezuthu-]V + [thu]suf ]N 'letter'
[ [kEL]V + [vi]suf ]N 'question'

3. Adjective → Noun

[ [walla]adj + [thanam]suf ]N 'good quality'
[ [periya]adj + [tu]suf ]N 'big'

4. Noun → Verb

[ [uyir]N + [ppi]suf ]V 'to give life'

5. Adjective → Verb

[ [veLLai]adj + [aakku]suf ]V 'to make (something) white'
[ [karuppu]adj + [aakku]suf ]V 'to make (something) black'

6. Verb → Verb

[ [cey]V + [vi]suf ]V 'cause to do'
[ [wada]V + [ththu]suf ]V 'cause to walk'
[ [vidu]V + [vi]suf ]V 'to liberate'

7. Noun → Adjective

[ [uyaram]N + [Ana]suf ]adj 'high'
[ [azaku]N + [Ana]suf ]adj 'beautiful'
[ [nErmai]N + [Ana]suf ]adj 'honest'

8. Verb → Adverb

[ [cey]V + [tu]suf ]adv 'having done'
[ [ezuth-]V + [i]suf ]adv 'having written'
[ [padi]V + [ththu]suf ]adv 'having read'
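
One way to picture the word formation rules listed above is as a table from (base category, suffix) to the possible output categories. The sketch below is an illustration only: the rule list is a subset copied from the examples in this section, and the names WFR_EXAMPLES and output_categories are hypothetical. It ignores the Sandhi changes that occur when the suffix is attached.

```python
from collections import defaultdict

# (base category, suffix, output category), copied from the examples above.
WFR_EXAMPLES = [
    ("N", "kAran", "N"), ("N", "ALi", "N"),
    ("V", "ppu", "N"), ("V", "thu", "N"), ("V", "vi", "N"),
    ("adj", "thanam", "N"), ("adj", "tu", "N"),
    ("N", "ppi", "V"),
    ("adj", "aakku", "V"),
    ("V", "vi", "V"),
    ("N", "Ana", "adj"),
    ("V", "tu", "adv"), ("V", "i", "adv"), ("V", "ththu", "adv"),
]

# Index the rules so that the output category can be looked up quickly.
_RULES = defaultdict(set)
for base_cat, suffix, out_cat in WFR_EXAMPLES:
    _RULES[(base_cat, suffix)].add(out_cat)


def output_categories(base_cat, suffix):
    """Return the sorted list of categories a WFR can produce (may be empty)."""
    return sorted(_RULES.get((base_cat, suffix), set()))
```

For instance, output_categories("V", "vi") yields both N (kEL + vi, 'question') and V (cey + vi, 'cause to do'), which shows that the suffix alone does not always determine the output category.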

Compound Word forms

Table 6.1 shows the possible combinations for compound word formation.

{ [kalvi]N # [kUdam]N # }N 'educational institution'
{ [paNi]N # [puri]V # }V '(perform) work'
{ [ezuth-]V # [kOl]N # }N 'writing instrument'
{ [periya]adj # [wakaram]N # }N 'big city'
{ [wERRu]adv # [iravu]N # }N 'last night'

Table 6.1 Compound Word-forms Formation

No.   Surface form   Inflection   Compound form
1     Noun           Noun         Noun
2     Noun           Verb         Verb
3     Verb           Noun         Noun
4     Adjective      Noun         Noun
5     Adverb         Verb         Verb

6.2.4 Tamil Verb Morphology

Tamil verbs are inflected by means of suffixes. Tamil verbs can have finite or non-finite
forms. Finite verb forms occur in the main clause of the sentence, and non-finite forms
occur as the predicate of subordinate or embedded clauses. Morphologically, finite verb
forms are inflected for tense, mood, aspect, person, number and gender.

The simple finite verb forms are given in Table 6.2. The first column gives the
PNG (Person-Number-Gender) tag and the remaining columns give the present, past and
future tense forms respectively, using the simple finite inflections of the word padi
(study) with tense markers and PNG markers.

PNG_suffix is a portmanteau morpheme that encodes person, number and
gender all in one. Finite verbs take the form

Verb_stem + Tense + person_number_gender

Tamil recognizes four kinds of non-finite verbs: infinitive, verbal participle,
adjectival participle and conditional. They take the following forms:

Verb_stem + infinitive_suffix (infinitive)

Verb_stem + vp_suffix (verbal participle)

Verb_stem + tense + rp_suffix (adjectival participle)

Verb_stem + conditional_suffix (conditional)

Modal verbs can be defective, in that they cannot take any further inflectional
suffixes, or they can be regular verbs that are inflected with tense and PNG suffixes.

Verb_stem + infinitive_suffix modal_verb

Verb_stem + infinitive_suffix modal_stem + tense + png_suffix

Table 6.2 Simple Verb Finite Forms

PNG Root-Pres-PNG Root-Past-PNG Root-Fut-PNG

3SE padi-kinR-Ar padi-thth-Ar padi-pp-Ar
3SM padi-kinR-An padi-thth-An padi-pp-An
3SF padi-kinR-AL padi-thth-AL padi-pp-AL
2S padi-kinR-Ay padi-thth-Ay padi-pp-Ay
1P padi-kinR-Om padi-thth-Om padi-pp-Om
1S padi-kinR-En padi-thth-En padi-pp-En
2SE padi-kinR-Ir padi-thth-Ir padi-pp-Ir
3SN padi-kinR-athu padi-thth-athu padi-pp-athu
2PE padi-kinR-IrkaL padi-thth-IrkaL padi-pp-IrkaL
3PE padi-kinR-ArkaL padi-thth-ArkaL padi-pp-ArkaL
3PN padi-kinR-ana padi-thth-ana padi-pp-ana
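
The pattern Verb_stem + Tense + PNG behind Table 6.2 can be sketched in a few lines. The tense markers and PNG suffixes below are copied from the table; they hold for padi and are not claimed to apply to verbs of other paradigms. The function name finite_form is hypothetical, and the output keeps the hyphenated, pre-Sandhi notation of the table.

```python
# Tense markers as they appear for the verb padi in Table 6.2.
TENSE_MARKERS = {"present": "kinR", "past": "thth", "future": "pp"}

# PNG (person-number-gender) suffixes copied from Table 6.2.
PNG_SUFFIXES = {
    "3SE": "Ar", "3SM": "An", "3SF": "AL", "2S": "Ay",
    "1P": "Om", "1S": "En", "2SE": "Ir", "3SN": "athu",
    "2PE": "IrkaL", "3PE": "ArkaL", "3PN": "ana",
}


def finite_form(stem, tense, png):
    """Build the hyphenated finite form stem-tense-PNG (no Sandhi applied)."""
    return "-".join([stem, TENSE_MARKERS[tense], PNG_SUFFIXES[png]])


def finite_table(stem):
    """Return {PNG tag: (present, past, future)} in the layout of Table 6.2."""
    return {
        png: tuple(finite_form(stem, t, png) for t in ("present", "past", "future"))
        for png in PNG_SUFFIXES
    }
```

Calling finite_table("padi") reproduces the 11 rows of Table 6.2.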

6.2.5 Tamil Noun Morphology

Tamil nouns (and pronouns) are classified into two super-classes, the "rational" and the
"irrational", which include a total of five classes. Humans and deities are classified as
"rational", and all other nouns (animals, objects, abstract nouns) are classified as
"irrational". The "rational" nouns and pronouns belong to one of three classes: masculine
singular, feminine singular, and rational plural. The "irrational" nouns and pronouns
belong to one of two classes: irrational singular and irrational plural. The plural form
for rational nouns may be used as an honorific, gender-neutral, singular form [132].

Suffixes are used to perform the functions of cases or postpositions. Traditional
grammarians tried to group the various suffixes into eight cases corresponding to the
cases used in Sanskrit: nominative, accusative, dative, sociative,
genitive, instrumental, locative, and ablative. The various noun forms are given in
Table 6.3, which lists the singular and plural forms of the word eli (rat)
with the case markers.

Table 6.3 Noun Case Markers

Case Singular Plural

Nominative eli eli-kaL
Accusative eli-ai eli-kaL-ai
Dative eli-uku eli-kaL-uku
Benefactive eli-ukk-Aka eli-kaL-ukk-Aka
Instrumental eli-Al eli-kaL-Al
Sociative-Odu eli-Odu eli-kaL-Odu
Sociative-udan eli-udan eli-kaL-udan
Locative eli-il eli-kaL-il
Ablative eli-il-iruwthu eli-kaL-il-iruwthu
Genitive eli-in-athu eli-kaL-in-athu

A noun form without any inflection is called a noun stem. Nouns in their stem
forms are singular.

aaciriyarkaL = aaciriyar (teacher) + kaL (pl. marker)

peenaakkaL = peenaa (pen) + kaL (pl. marker)

puththakangkaL = puththakam (book) + kaL (pl. marker)

The examples shown above are a few instances of plural inflection. Creating the
plural form of a noun isn't simply a matter of concatenating kaL. In peenaakkaL, an
extra k appears between the stem (peenaa) and kaL. Similarly, in
puththakangkaL, the stem (puththakam) is inflected to puththakangkaL (am in the
stem is replaced by ang, followed by kaL). These differences are due to the Sandhi
changes that take place when the noun stem is concatenated with the kaL morpheme.
Tamil uses case suffixes and postpositions for case marking instead of prepositions.
Case markers indicate the relationship between the noun phrases and the verb phrase,
that is, the semantic role of the noun phrases in the sentence. The genitive case
expresses the relationship between noun phrases; it is marked by the morpheme in.
Case suffixes are concatenated to nouns in their stem form, or after the plural
morpheme if the noun is plural.

noun_stem + {kaL} + case_suffix

e.g. kaththiyAl (with a knife) = kaththi (knife) + Al (with)
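
The noun morphotactics noun_stem + {kaL} + case_suffix and the plural Sandhi seen in the three plural examples can be sketched as follows. This is a minimal illustration under stated assumptions: the case suffixes are copied from Table 6.3, the two Sandhi rules are generalized from only the examples in this section (puththakam and peenaa) and are not a complete rule set, and the function names are hypothetical.

```python
# Case suffix sequences copied from Table 6.3 (hyphen-separated morphemes).
CASE_SUFFIXES = {
    "Nominative": [],
    "Accusative": ["ai"],
    "Dative": ["uku"],
    "Benefactive": ["ukk", "Aka"],
    "Instrumental": ["Al"],
    "Sociative-Odu": ["Odu"],
    "Sociative-udan": ["udan"],
    "Locative": ["il"],
    "Ablative": ["il", "iruwthu"],
    "Genitive": ["in", "athu"],
}


def noun_morphemes(stem, plural=False, case="Nominative"):
    """Hyphenated morpheme sequence noun_stem + {kaL} + case_suffix (no Sandhi)."""
    parts = [stem]
    if plural:
        parts.append("kaL")
    parts.extend(CASE_SUFFIXES[case])
    return "-".join(parts)


def plural_surface(stem):
    """Surface plural with two toy Sandhi rules inferred from the examples."""
    if stem.endswith("am"):
        # puththakam -> puththakang + kaL (am is replaced by ang)
        return stem[:-2] + "ang" + "kaL"
    if stem.endswith("aa"):
        # peenaa -> peenaa + k + kaL (assumption: k doubles after final aa)
        return stem + "k" + "kaL"
    return stem + "kaL"
```

noun_morphemes reproduces the hyphenated entries of Table 6.3, while plural_surface shows why plain concatenation is not enough for surface forms.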


Postpositions are of two kinds: bound and free. Bound postpositions
occur with their respective governing case suffixes. In such a case, the
morphotactics would be,

noun_stem + {kaL} + case_suffix + bound_post_position

e.g. vIddiliruwthu (from the house) = vIdu (house) + il + iruwthu (from)

Sometimes the postposition follows a blank space after the case suffix, as
a separate word. Free postpositions follow noun stems without any case suffixes;
they are written as separate words and do not concatenate with the noun.
Basically, there are eight cases in Tamil.

Verbs can take the form of nouns when followed by nominal suffixes. Nominalized
verb forms are an example of derivational morphology in Tamil. They occur in the
following format.

Verb_stem + tense + nominal_suffix

e.g. ceythavar (one who did) = cey (do) + th (past) + avar (3rd person singular honorific)

6.2.6 Tamil Morphological Analyzer

Tamil is morphologically rich and agglutinative. Such a language needs deep analysis
at the word level to capture the meaning of a word from its morphemes and their
categories. Each root is affixed with several morphemes to generate a word. In general,
Tamil inflects by suffixation: inflections are added after the root word. A single root
word can take more than ten thousand inflected word forms. Tamil has both lexical and
inflectional morphology. Lexical morphology changes the meaning of the word and its
class by adding derivational and compounding morphemes to the root. Inflectional
morphology changes the form of the word and adds additional information to it by
adding inflectional morphemes to the root.

6.2.7 Challenges in Tamil morphological Analyzer

The morphological structure of Tamil is quite complex, since verbs inflect for person,
gender and number and also combine with auxiliaries that indicate aspect,
mood, causation, attitude, etc. A single verb root can yield more than ten
thousand word forms including auxiliaries. A noun root inflects with plural, oblique,
case, postposition and clitic markers; a single noun root can yield more than five
hundred word forms including postpositions. The root and morphemes have to be
identified and tagged for further language processing at the word level.

The structure of the verbal complex is unique, and capturing this complexity in a
machine-analyzable and generatable format is a challenging job. The formation of the
verbal complex involves the arrangement of the verbal units and the interpretation of
their combinatory meaning. Phonology also plays its part in the formation of the verbal
complex, in terms of morphophonemic or Sandhi rules which account for the shape
changes due to inflection.

Understanding the verbal complex involves identifying the structure of simple
finite verbs and compound verbs. By understanding the nature of this complexity,
it is possible to evolve a methodology to recognize it. In order to
analyze verbal forms in which the inflection varies from one set of verbs to another, a
classification of Tamil verbs based on tense markers was developed. The inflections
include finite, infinitive, adjectival, adverbial and conditional forms of verbs.

Compared to verb morphological analysis, noun morphological analysis is
relatively easy. A noun can occur on its own or with plural, oblique, case, postposition
and clitic suffixes. A corpus was developed with all the morphological feature
information, so that the machine itself captures all the morphological rules, including
Sandhi and morphotactic rules.


The morphological analyzer is the second stage of pre-processing Tamil language
sentences. In the first stage, Tamil sentences are tagged by the Tamil POS Tagger tool.
The system developed for the Tamil morphological analyzer consists of five modules
(Figure 6.3).

[Figure: Minimized POS Tagger; Noun/Verb Analyzer, Pronoun Analyzer, Proper Noun Analyzer, Other Word Class Analyzers; Morphologically Annotated output]

Figure 6.3 General Framework for Morphological Analyzer System

The five modules are,

1. Minimized POS Tagger

2. Noun/Verb Analyzer

3. Pronoun Analyzer

4. Proper Noun Analyzer

5. Other Word Class Analyzers

The input to the morphological system is a POS tagged sentence. In the first
module, the POS tagged sentence is refined according to the simplified POS tagset given
in Table 6.4. The refined POS tagged sentence is then split according to the simplified
POS tags. The second module morphologically analyzes the Noun (<N>) and Verb (<V>)
forms. The third and fourth modules morphologically analyze Pronouns (P) and Proper
nouns (PN). Other word classes are handled by the fifth module, which treats
the POS tag itself as the morphological feature.
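
The routing performed by the five modules can be pictured as a dispatch on the simplified POS tag. The sketch below is illustrative only: the tags <N>, <V>, P and PN come from the text above, but the analyzer functions are hypothetical stubs that merely record which module would handle a word, and the tag ADV in the usage note is an assumed example of an "other" class.

```python
def noun_verb_analyzer(word, tag):
    # Stub for module 2: a real analyzer would return the root and morpheme tags.
    return {"word": word, "module": "noun/verb", "features": [tag]}


def pronoun_analyzer(word, tag):
    # Stub for module 3.
    return {"word": word, "module": "pronoun", "features": [tag]}


def proper_noun_analyzer(word, tag):
    # Stub for module 4.
    return {"word": word, "module": "proper noun", "features": [tag]}


def other_class_analyzer(word, tag):
    # Module 5 as described above: the POS tag itself is the morphological feature.
    return {"word": word, "module": "other", "features": [tag]}


# Simplified POS tag -> analyzer module.
ROUTES = {
    "<N>": noun_verb_analyzer,
    "<V>": noun_verb_analyzer,
    "P": pronoun_analyzer,
    "PN": proper_noun_analyzer,
}


def analyze_sentence(tagged_words):
    """Send each (word, simplified_tag) pair to the module responsible for it."""
    return [ROUTES.get(tag, other_class_analyzer)(word, tag) for word, tag in tagged_words]
```

For example, analyze_sentence([("padiththAn", "<V>"), ("wERRu", "ADV")]) sends the first word to the noun/verb module and the second to the fallback module.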

Table 6.4 Minimized POS Tagset



6.4.1 Morphological Analyzer using Machine Learning

The morphological analyzer identifies the root and suffixes of a word. Generally,
rule-based approaches are used for morphological analysis; they rely on a set of rules
and a dictionary that contains root words and morphemes. In a rule-based approach, a
word is given as input to the morphological analyzer, and if the
corresponding morphemes or root word are missing from the dictionary, the
system fails. Moreover, each rule depends on the previous rule, so if one rule fails, it
affects the entire set of rules that follow.

Recently, machine learning approaches have come to dominate the Natural
Language Processing field. Machine learning is a branch of Artificial Intelligence (AI)
concerned with the design of algorithms that learn from examples. Machine
learning algorithms can be supervised or unsupervised. Supervised learning uses both
the input and the corresponding output data; unsupervised learning uses only input
samples. The goal of the machine learning approach is to use the given examples
to find generalization and classification rules automatically. All the rules,
including complex spelling rules, can be handled by this method. A morphological
analyzer based on machine learning does not require any hand-coded
morphological rules; it only needs morphologically segmented corpora. H. Poon
(2009) [189] reported the first log-linear model for unsupervised morphological
segmentation. For Arabic and Hebrew, it outperforms the state-of-the-art
systems by a large margin.

Sequence labeling is a significant generalization of the supervised classification
problem: a single label is assigned to each input element in a sequence. The labels
are typically parts of speech or syntactic chunk labels [171]. Many tasks are formalized
as sequence labeling problems in various fields such as natural language processing and
bioinformatics. There are two types of sequence labeling approaches [171].

Raw labeling.

Joint segmentation and labeling.

In raw labeling, each element gets a single tag, whereas in joint segmentation
and labeling, whole segments get a single label. In a morphological analyzer, the
sequence is usually a word and an element is a character. As mentioned earlier, the
input to a morphological analyzer is a word and the output is the root and inflections.
The input word is denoted by $W$, and the root word and inflections are denoted by $R$
and $I$ respectively.

$[W]_{\text{Noun/Verb}} = [R]_{\text{Noun/Verb}} + [I]_{\text{Noun/Verb}}$

In turn, $I$ can be expressed as $i_1 + i_2 + \dots + i_n$, where $n$ refers to the
number of inflections or morphemes. Further, $W$ is converted into a sequence of
characters. The morphological analyzer accepts a sequence of characters as input and
generates a sequence of characters as output. Let $X$ be the finite set of input
characters and $Y$ the finite set of output characters. An input string $x$ is segmented
as $x_1 x_2 \dots x_n$, where each $x_i \in X$. Similarly, an output string $y$ is
segmented as $y_1 y_2 \dots y_n$, where each $y_i \in Y$ and $n$ is the number of segments.

Inputs: $x = (x_1, x_2, x_3, \dots, x_n)$

Labels: $y = (y_1, y_2, y_3, \dots, y_n)$

The main objective of sequence labeling approach is predicting y from the given
x. In training data, the input sequence x is mapped with output sequence y. Now
the morphological analyzer problem is transformed into a sequence labeling problem.
The information about the training data is explained in the following sub sections.
Finally the morphological analysis is redefined as a classification task which is solved
by using sequence labeling methodology.
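
To make the reduction from sequence labeling to classification concrete, the sketch below turns one aligned pair (x, y) into per-segment training instances, each consisting of a small context window and the label for the middle segment. The window size and the feature names are assumptions made for illustration; they are not the feature set used in this thesis.

```python
PAD = "_"  # placeholder for positions outside the word


def window_features(segments, i, size=2):
    """Features for position i: the segment itself and `size` neighbours on each side."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"seg[{offset:+d}]"] = segments[j] if 0 <= j < len(segments) else PAD
    return feats


def to_instances(x, y, size=2):
    """Pair each input segment's window features with its output label."""
    if len(x) != len(y):
        raise ValueError("input and output sequences must have the same length")
    return [(window_features(x, i, size), y[i]) for i in range(len(x))]


# Aligned sequences for the verb OdinAn, in the notation used later in this chapter.
x = ["O", "d-C", "i-V", "n-C", "A-V", "n"]
y = ["O", "du*", "i", "n*", "A", "n"]
instances = to_instances(x, y)
```

Each element of instances is an ordinary classification example, so any multi-class classifier can be trained on it.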

6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer

Data formulation plays the key role in supervised machine learning approaches. The
first step in corpus development for the morphological analyzer is classifying
paradigms for verbs and nouns. The classification of Tamil verbs and nouns is based
on tense markers and case markers respectively. All roots in a paradigm inflect with the
same set of inflections. The second step is to collect the list of root words for all
paradigms.

Paradigm Classification

A paradigm provides information about all the possible word forms of a root word in a
particular word class. Tamil noun and verb paradigm classification is based on
case and tense markers respectively. The number of paradigms for each word class
(noun/verb) is defined. For the sake of computational data modeling, Tamil verbs were
classified into 32 paradigms [13]. Nouns are classified into 25 paradigms to resolve the
challenges in noun morphological analysis. Root words are grouped according to their
paradigm. Table 6.5 shows the number of paradigms and inflections of
verbs and nouns handled in the system; Total represents the total number of
inflected forms handled by the analyzer. The noun and verb paradigm lists are
shown in Tables 6.6 and 6.7.

Table 6.5 Number of Paradigms and Inflections

        Paradigms   Inflections   Auxiliaries   Postpositions   Total (word forms)
Verb    32          164           67            --              10988
Noun    25          30            --            290             320
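
The totals in Table 6.5 are numerically consistent with a simple reading: the verb total equals inflections times auxiliaries, and the noun total equals inflections plus postpositions. The text does not state how the totals were derived, so this reading is an assumption; the short check below only verifies the arithmetic.

```python
# Figures copied from Table 6.5.
TABLE_6_5 = {
    "Verb": {"paradigms": 32, "inflections": 164, "auxiliaries": 67, "total": 10988},
    "Noun": {"paradigms": 25, "inflections": 30, "postpositions": 290, "total": 320},
}


def verb_total(row):
    # Assumed reading: every inflection can combine with every auxiliary.
    return row["inflections"] * row["auxiliaries"]


def noun_total(row):
    # Assumed reading: inflected forms and postpositional forms are counted separately.
    return row["inflections"] + row["postpositions"]
```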

Table 6.6 Noun Paradigms

Word-forms

Noun word forms

The morphological system for nouns handles more than three hundred word
forms including postpositions. Traditional grammarians group the various suffixes into
eight cases corresponding to the cases used in Sanskrit: nominative,
accusative, dative, sociative, genitive, instrumental, locative, and ablative. The
sample word forms used in this thesis are shown in Table 6.8. The remaining
word forms are included in Appendix B.

Table 6.7 Verb Paradigms

Table 6.8 Noun Word Forms

(narampinai) (narampathu)
(narampai) (narampinathu)
(narampinathai) (narampnkaN)
(narampOdu) (narampathukaN)
(narampinOdu) (narampukkAka)
(narampinaththOdu) (narampudaiya)
(narampAl) (narampinudaiya)
(narampukku) (narampil)
(narampiRku) (narampinil)
(narampin) (narampudan)

Verb word forms

Verbs can also be morphologically defective, i.e. some verbs do not take all the suffixes
meant for verbs. A verb is an obligatory part of a sentence, except in copula sentences.
Verbs can be classified into different types based on morphological, syntactic and
semantic characteristics. Based on the tense suffixes, verbs can be classified into weak,
strong and medium verbs. Based on form and function, verbs can be
classified into finite verbs (e.g. va-ndt-aan 'come_PAST_he') and non-finite verbs
(e.g. va-ndt-a 'come_PAST_RP' and va-ndt-u 'come_PAST_VPAR'). Depending on
whether the non-finite form occurs before a noun or a verb, it is classified as an
adjectival or relative participle form (e.g. vandta paiyan 'the boy who came') or an
adverbial or verbal participle form (e.g. vandtu poonaan 'having come he went'). The
morphological system for verbs handles more than ten thousand word forms including
auxiliaries and clitics. The sample verb forms used in this research are shown in
Table 6.9. The remaining word forms are given in Appendix B.

Table 6.9 Verb Word Forms

(padi) (padikkinREn)
(padiththAn) (padikkinROm)
(padiththAL) (padippAn)
(padiththAr) (padippAL)
(padiththArkaL) (padippAr)
(padiththathu) (padippArkaL)
(padiththana) (padippathu)
(padiththAy) (padippana)
(padiththIr) (padippAy)
(padiththIrkaL) (padippIr)
(padiththEn) (padippIrkaL)
(padiththOm) (padippEn)
(padikkiRAn) (padippOm)
(padikkiRAL) (padikkum)
(padikkiRAr) (padiththa)
(padikkiRArkaL) (padikkinRa)
(padikkinRathu) (padikkAtha)
(padikkinRana) (padiththavan)
(padikkinRAy) (padiththavaL)
(padikkinRIr) (padiththavar)
(padikkinRIrkaL) (padiththathu)

Morphemes

Noun morphemes

The morphological analyzer system for nouns handles 92 morphemes in the
morpho-lexical tagging (Phase II). The morphemes used in this thesis are given in
Table 6.10.

Table 6.10 Noun Morphemes

Verb morphemes

The morphological system for verbs handles 170 morphemes in the
morpho-lexical tagging (Phase II). The morphemes used in this analyzer are shown in
Table 6.11.

Table 6.11 Verb Morphemes

Ambiguous Morphemes of Noun and Verb

A morpheme may have more than one morpho-syntactic category. This leads to
ambiguity in morphemes. The ambiguous morphemes of nouns and verbs are shown
in Table 6.12.

Table 6.12 Verb/Noun Ambiguous Morphemes

Ambiguous Tags
<NOM_kkal> <VERB_ROOT>
<PPO> <Noun_ROOT>
<Benefactive> <ADV_Suffix>
<Sandhi> <Oblique>

Data Creation for Noun/Verb Morphological Analyzer

The data creation for the first phase of Noun/Verb Morphological analyzer system is
done by the following stages.





Preprocessing is an important step in data creation. It is involved in the training
stage as well as the decoding stage. Figure 6.4 shows the preprocessing steps involved in
the development of the corpora. The morphological corpus used for machine learning
is developed through the following steps: Romanization, Segmentation and Alignment.


The input word forms are converted to Romanized forms using the Unicode to
Roman mapping. Romanization is done for easy computational processing. In Tamil,
a syllable (compound character) exists as a single character, in which the vowel and
the consonant cannot be separated. To make this separation possible, Tamil graphemes
are converted into Roman forms. The Tamil-Roman mapping is given in Appendix A.


After Romanization, each word in the corpora is segmented based on
the Tamil graphemes, and each syllable in the word is further
segmented into its consonant and vowel. Within a segmented syllable, the suffixes -C
and -V are attached to the consonant and the vowel respectively. This is called the
C-V (Consonant-Vowel) representation. The C-V representation is used only for input
data. In the output data, morpheme boundaries are indicated by the * symbol.
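
The C-V segmentation described above can be sketched as follows. The unit inventory is a hypothetical stand-in for the Tamil-Roman mapping of Appendix A (which is not reproduced here), so the VOWELS and CONSONANTS sets and both function names are assumptions. The tagging rule is inferred from the sample sequences in this chapter: a consonant followed by a vowel is tagged -C and its vowel -V, while a bare consonant or a standalone vowel stays untagged.

```python
# Hypothetical romanized unit inventory (stand-in for the Appendix A mapping).
VOWELS = {"a", "A", "i", "I", "u", "U", "e", "E", "o", "O", "ai", "au"}
CONSONANTS = {"k", "ng", "c", "d", "N", "th", "w", "p", "m",
              "y", "r", "l", "v", "z", "L", "R", "n"}
UNITS = sorted(VOWELS | CONSONANTS, key=len, reverse=True)  # longest match first


def tokenize(word):
    """Split a romanized word into vowel/consonant units by greedy longest match."""
    units, pos = [], 0
    while pos < len(word):
        for unit in UNITS:
            if word.startswith(unit, pos):
                units.append(unit)
                pos += len(unit)
                break
        else:
            raise ValueError(f"unknown symbol at position {pos} in {word!r}")
    return units


def cv_segments(word):
    """Return the C-V representation of a romanized word as a list of segments."""
    units = tokenize(word)
    segments, i = [], 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if unit in CONSONANTS and nxt in VOWELS:
            # Consonant + vowel of one syllable: tag both halves.
            segments += [unit + "-C", nxt + "-V"]
            i += 2
        else:
            # Pure consonant or standalone vowel: left untagged.
            segments.append(unit)
            i += 1
    return segments
```

With this inventory, cv_segments("OdinAn") gives the six input segments O, d-C, i-V, n-C, A-V, n.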


The segmented words are aligned vertically as segments using the gap between

Figure 6.4 Preprocessing Steps

Mapping and Bootstrapping

The aligned input segments are then mapped to output segments in
the mapping stage. Bootstrapping is done to increase the size of the training data. A
sample data format for the word padiththAn is given in Table 6.13. The first column
represents the input data and the second the output data; * indicates a
morpheme boundary.

Table 6.13 Sample Data Format

p-C     p
a-V     a
d-C     d
i-V     i*
th      th
th-C    th*
A-V     A
n       n*

Issues in Data Creation

Mapping mismatch segments

Mismatching is the main problem that occurs in mapping the input characters
to the output characters. It occurs in two cases: the input sequence has either
more or fewer units than the output sequence. The mismatch is resolved by
inserting a null symbol $ or by combining two units based on the morpho-phonemic
rules, after which the input segments are mapped to the output segments. After mapping,
a machine learning tool is used for training on the data.

In case 1, the input sequence has more segments (14) than the
output sequence (13). The Tamil verb padikkayiyalum has 14
segments in the input sequence, but only 13 segments are present in the output. The
first occurrence of y (the 8th segment) in the input sequence becomes null due to a
morpho-phonemic rule, so there is no output segment to map this y to. For
this reason, in training, the input segment y is mapped to the $ symbol ($ indicates
null) in the output sequence. The input and output sequences then have the same length.

Case 1:
Input Sequence:
p-C | a-V | d-C | i-V | k | k-C | a-V | y-C | i-V | y-C | a-V | l-C | u-V | m (14 segments)
Mismatched Output Sequence:
p | a | d | i* | k | k | a* | i | y | a | l* | u | m* (13 segments)
Corrected Output Sequence:
p | a | d | i* | k | k | a* | $ | i | y | a | l* | u | m* (14 segments)

In case 2, the input sequence has fewer segments than the output
sequence. The Tamil verb OdinAn has 6 segments in the input sequence, but the output
has 7 segments. By a morpho-phonemic rule, the segment d-C (the 2nd segment) in the
input sequence corresponds to two segments, d and u* (the 2nd and 3rd segments), in
the output sequence. For this reason, in training, d-C is mapped to du*. The input and
output sequences then have the same length, and the sequence mismatch is resolved.
Case 2:
Input Sequence:
O | d-C | i-V | n-C | A-V | n (6 segments)
Mismatched Output Sequence:
O | d | u* | i | n* | A | n (7 segments)
Corrected Output Sequence:
O | du* | i | n* | A | n (6 segments)
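
The two corrections can be sketched in Python as follows. This is a minimal illustration and not the thesis implementation: the helper names `insert_null`, `merge_with_next` and `pair_up` are hypothetical, and the position at which each correction is applied is passed in by hand, whereas in the described system it follows from the morpho-phonemic rules.

```python
NULL = "$"  # null symbol used in the text for an input segment with no output


def insert_null(output, index):
    """Case 1: the output is shorter, so insert the null symbol at `index`."""
    return output[:index] + [NULL] + output[index:]


def merge_with_next(output, index):
    """Case 2: the output is longer, so combine output[index] and output[index + 1]."""
    merged = output[index] + output[index + 1]
    return output[:index] + [merged] + output[index + 2:]


def pair_up(inputs, outputs):
    """Map each input segment to its output segment once the lengths agree."""
    if len(inputs) != len(outputs):
        raise ValueError("input and output sequences are still mismatched")
    return list(zip(inputs, outputs))


# Case 1 (padikkayiyalum): the 8th input segment y-C has no output segment.
case1_in = "p-C a-V d-C i-V k k-C a-V y-C i-V y-C a-V l-C u-V m".split()
case1_out = "p a d i* k k a* i y a l* u m*".split()
case1_fixed = insert_null(case1_out, 7)

# Case 2 (OdinAn): the input segment d-C corresponds to the outputs d and u*.
case2_in = "O d-C i-V n-C A-V n".split()
case2_out = "O d u* i n* A n".split()
case2_fixed = merge_with_next(case2_out, 1)

training_pairs = pair_up(case1_in, case1_fixed) + pair_up(case2_in, case2_fixed)
```

Each resulting pair, such as `("y-C", "$")` or `("d-C", "du*")`, corresponds to one row of the two-column training format of Table 6.13.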

6.4.3 Morphological Tagging Framework using SVMTool

Support Vector Machine (SVM)

Support Vector Machine (SVM) approaches have been around since the mid 1990s,
initially as a binary classification technique, with later extensions to regression and
multi-class classification. Here, the morphological analysis problem is converted into a
classification problem, which can be solved with supervised machine learning
algorithms [12]. The Support Vector Machine is a machine learning algorithm for
binary classification that has been successfully applied to a number of practical
problems, including NLP [12]. Let $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ be the set of $N$
training examples, where each instance $x_i$ is a vector in $\mathbb{R}^N$ and
$y_i \in \{-1, +1\}$ is the class label.

SVM is attractive because it has an extremely well developed statistical learning
theory. SVM is based on strong mathematical foundations and results in simple, yet
very powerful, algorithms. SVMs are learning systems that use a hypothesis space of
linear functions in a high dimensional feature space, trained with a learning algorithm
from optimization theory that implements a learning bias derived from statistical
learning theory.

SVMTool

SVMTool is an open source generator of sequential taggers based on Support
Vector Machines [12]. It was originally developed for POS tagging, but here it is
used for morphological analysis. The SVMTool software package consists of
three main components: the model learner (SVMTlearn), the tagger
(SVMTagger) and the evaluator (SVMTeval). SVM models (weight vectors and biases)
are learned from a training corpus using SVMTlearn.
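
To make "weight vectors and biases" concrete, the sketch below scores one segment against one linear model per label and picks the best scoring label. It is an illustration only: the one-model-per-label arrangement, the feature names, the weights and the biases are all invented for this example and are not taken from SVMTool or from the models trained in this chapter.

```python
def score(weights, bias, features):
    """Linear decision value w.x + b for a set of active binary features."""
    return sum(weights.get(name, 0.0) for name in features) + bias


def predict(models, features):
    """Return the label whose linear model gives the highest decision value."""
    return max(models, key=lambda label: score(*models[label], features))


# Hypothetical models: label -> (weight vector stored as a dict, bias).
models = {
    "i*": ({"cur=i-V": 1.2, "next=k": 0.8}, -0.5),
    "i": ({"cur=i-V": 0.9, "next=k": -0.4}, -0.1),
}

# Hypothetical context features for the segment i-V in padikkayiyalum.
features = {"cur=i-V", "next=k", "prev=d-C"}
best = predict(models, features)
```

Features without a weight in a model (here `prev=d-C`) simply contribute nothing to that model's score.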

Different models are learned for the different strategies. Given a training set of
annotated examples, SVMTlearn is responsible for training a set of SVM classifiers.
To do so, it makes use of SVMlight, an implementation of Vapnik's SVMs in C
developed by Thorsten Joachims (2002). Given a text corpus (one token per line) and
the path to a previously learned SVM model (including the automatically generated
dictionary), SVMTagger tags a sequence of characters. Finally, given a correctly
annotated corpus and the corresponding SVMTool predicted annotation, the
SVMTeval component displays the tagging results. SVMTeval evaluates the performance
in terms of accuracy.

Implementation of Morphological Analyzer System

Existing Tamil morphological analyzers are explained in Chapter 2. Jan Hajic
(1998) [190] developed morphological tagging for inflectional languages using an
exponential probabilistic model based on automatically selected features. The
parameters of the model are computed using simple estimates.
The morphological analyzer for Tamil is developed using the machine learning
approach. Separate engines are developed for nouns and verbs. The noun morphological
analyzer handles inflected noun forms and postpositionally inflected nouns. The
verb analyzer handles all verb forms, including finite, non-finite and auxiliary forms.
Morphological analysis is redefined as a classification task, and the classification
problem is solved using SVM. In this machine learning approach, two models are
trained for morphological analysis: Model-I (segmentation model) and Model-II
(morpho-syntactic tagging model). Model-I is trained using sequences of input
characters and their corresponding output labels, and is used for predicting the
morpheme boundaries. Model-II is trained using sequences of morphemes and their
grammatical categories, and is used for assigning a grammatical class to each
morpheme. Figure 6.5 illustrates the three phases involved in the morphological
analyzer.


i. Preprocessing.
ii. Morpheme segmentation.
iii. Morpho-syntactic tagging.

Preprocessing

The word to be morphologically analyzed is given as input to the pre-processing
phase. The word first undergoes Romanization. The romanized word is then
segmented based on Tamil graphemes. A Tamil grapheme is a vowel, a consonant or a
syllable. Syllables are broken into their consonant and vowel, and these are
suffixed with C and V respectively.
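
A minimal sketch of this segmentation step is given below. It assumes that a pure consonant or a standalone vowel stays untagged while a consonant followed by a vowel is treated as a syllable and split into a C part and a V part, which is the pattern seen in the input sequences of this section. The romanization inventory (`VOWELS`, `CONSONANTS`) is a partial, assumed one, and the function names are hypothetical.

```python
# Assumed, partial romanization inventory (not the full scheme of the thesis).
VOWELS = {"a", "A", "i", "I", "u", "U", "e", "E", "o", "O", "ai", "au"}
CONSONANTS = {"k", "g", "ng", "c", "s", "nj", "t", "d", "N", "th", "n",
              "p", "b", "m", "y", "r", "l", "v", "zh", "L", "R"}
UNITS = VOWELS | CONSONANTS


def tokenize(word):
    """Split a romanized word into vowel and consonant units (longest match)."""
    units, i = [], 0
    while i < len(word):
        for size in (2, 1):
            piece = word[i:i + size]
            if len(piece) == size and piece in UNITS:
                units.append(piece)
                i += size
                break
        else:
            raise ValueError(f"unknown character at position {i}: {word[i]!r}")
    return units


def to_cv_segments(word):
    """Tag the two halves of each syllable with -C and -V; leave other units bare."""
    units = tokenize(word)
    segments, i = [], 0
    while i < len(units):
        following = units[i + 1] if i + 1 < len(units) else None
        if units[i] in CONSONANTS and following in VOWELS:
            segments += [units[i] + "-C", following + "-V"]
            i += 2
        else:
            segments.append(units[i])
            i += 1
    return segments


segments = to_cv_segments("padikkayiyalum")
```

For padikkayiyalum this yields the 14 input segments used in Case 1 of the data creation issues, and for OdinAn the 6 input segments of Case 2.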

Segmentation of morpheme

In the morpheme segmentation process, words are segmented into morphemes at
their morpheme boundaries. The input sequence is given to the trained Model-I,
which predicts a label for each input segment. This output sequence is then
aligned into morpheme segments using an alignment program.
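
The alignment step can be pictured with the short sketch below. It assumes the label conventions used in the training data of this chapter (a trailing * marks a morpheme boundary and $ is the null symbol); the function name `align_morphemes` is hypothetical and the actual alignment program may differ.

```python
NULL = "$"  # null label: the input segment contributes nothing to the output


def align_morphemes(labels):
    """Join predicted output labels into morphemes, cutting after each '*'."""
    morphemes, current = [], ""
    for label in labels:
        at_boundary = label.endswith("*")
        unit = label.rstrip("*")
        if unit != NULL:
            current += unit
        if at_boundary and current:
            morphemes.append(current)
            current = ""
    if current:  # a final morpheme without an explicit boundary marker
        morphemes.append(current)
    return morphemes


verb1 = align_morphemes("p a d i* k k a* $ i y a l* u m*".split())
verb2 = align_morphemes("O du* i n* A n".split())
```

With the corrected output sequences of Case 1 and Case 2, this gives the segments padi, kka, iyal, um and Odu, in, An respectively.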


Figure 6.5 Implementation of Noun/Verb Morph Analyzer (labels: Morph Analyzer, Morpheme Alignment)

Morpho-syntactic tagging

The segmented morpheme sequence is given to the trained Model-II, which predicts
the grammatical category of each segment (morpheme) in the sequence.

The morphological analyzer for Tamil pronouns is developed using a pattern matching
approach. Personal pronouns play an important role in the Tamil language, so they
need special attention in both generation and analysis. Figure 6.7 shows the
implementation of the pronoun morphological analyzer. Morphological processing of
Tamil pronoun word forms is handled independently by the pronoun morphological
analysis system, which is based on pattern matching and the pronoun root word. The
structure of the pronoun word form is used for creating a pattern file. The pronoun
word structure is divided into four stages. They are,

iii. PPO CL

These stages are explained in Figure 6.6.


Figure 6.6 Structure of Pronoun Word-form (labels: Case, Clitic)

Example for the Structure of Pronoun

The pronoun class is a closed word class, so it is easy to collect all pronoun root
words. In the pronoun morphological system the word form is processed from left to
right. Morphological analysis systems generally process the word from right to left,
but here the limited vocabulary of pronouns makes a left-to-right system feasible.
The pronoun word form is romanized using a Unicode to Roman mapping, and the
romanized word is first compared with the pronoun stem file, which consists of all
the stems and roots of pronoun words. If the romanized form matches an entry in
the pronoun stem file, the matched part is replaced with the value of the
corresponding entry. The remaining part is then compared with three different
suffix files, and each matched part is replaced with the corresponding value of the
suffix element. Finally, the root word is converted into Unicode form. Figure 6.7
shows the implementation of the Pronoun Morph Analyzer.
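
The left-to-right matching can be sketched as below. This is a simplified illustration under stated assumptions: the entries in `PRONOUN_STEMS` and `CASE_SUFFIXES` are a handful of invented examples, a single suffix table stands in for the three suffix files, the tag ACC is assumed, and Romanization and conversion back to Unicode are left out.

```python
# Hypothetical stand-ins for the pronoun stem file and one suffix file.
PRONOUN_STEMS = {"en": "nAn", "avan": "avan"}      # stem or root -> root
CASE_SUFFIXES = {"akku": "DAT", "ukku": "DAT", "ai": "ACC"}


def longest_prefix(word, table):
    """Return the longest prefix of `word` that is a key of `table`, if any."""
    for end in range(len(word), 0, -1):
        if word[:end] in table:
            return word[:end]
    return None


def analyze_pronoun(word):
    """Match the stem first, then consume the remaining suffixes left to right."""
    stem = longest_prefix(word, PRONOUN_STEMS)
    if stem is None:
        return None
    root, rest, tags = PRONOUN_STEMS[stem], word[len(stem):], []
    while rest:
        suffix = longest_prefix(rest, CASE_SUFFIXES)
        if suffix is None:
            return None          # unmatched remainder: no analysis
        tags.append(CASE_SUFFIXES[suffix])
        rest = rest[len(suffix):]
    return root, tags


analysis = analyze_pronoun("enakku")
```

Because the stem table maps an oblique stem such as en to its root, the matched part is replaced by the root exactly as described for the stem file.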

Pronoun Pattern Pronoun + MLI


Figure 6.7 Implementation of Pronoun Morph Analyzer


Step 1: Check whether the input word is present in the dictionary.

Step 2: If present, go to Step 3; else go to Step 4.
Step 3: Retrieve the root and morpho-lexical information (MLI), then go to Step 5.
Step 4: Assign the input word as the root word and null as the MLI.
Step 5: The final output is the combination of the root word and the MLI.
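
These steps amount to a dictionary lookup with a fallback, as in the sketch below. The dictionary content and the function name `lookup` are hypothetical, and Python's `None` stands in for the null MLI.

```python
def lookup(word, dictionary):
    """Return (root, MLI) for `word`, falling back to (word, None)."""
    if word in dictionary:               # Steps 1 and 2
        root, mli = dictionary[word]     # Step 3
    else:
        root, mli = word, None           # Step 4
    return root, mli                     # Step 5


# Hypothetical dictionary entry: word form -> (root, morpho-lexical information).
pronoun_dictionary = {"enakku": ("nAn", "DAT")}

known = lookup("enakku", pronoun_dictionary)
unknown = lookup("padi", pronoun_dictionary)
```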

The morphological analyzer for proper nouns is developed using suffixes. Figure
6.8 shows the implementation of the Proper Noun Morph Analyzer. A proper noun
word form, taken from the minimized POS tagged sentence and identified by the POS
tag <PN>, is the input for proper noun morphological analysis. The word form is
first converted into Roman form for easy computation. This conversion uses a
simple key pair mapping of Roman and Tamil characters: the mapping program
recognizes each Tamil character unit and replaces it with the corresponding Roman
character.

The Roman form is given to the proper noun analyzer system. The system
compares the word with the predefined suffixes. It first identifies the suffix and
replaces it with the corresponding information from the proper noun suffix data set.
The suffix data set is created from various proper noun inflections and their end
characters. For example, in Table 6.14 the word sithamparam ends with m and the
word pANdisEri ends with ri; the possible inflections of both words are given in the
table. The morphological changes of a proper noun differ according to its end
characters, so the end characters are used in creating the rules. From the various
inflections of the word form the suffix is identified, and the remaining part is the
stem. The suffix is mapped to its morphological information: the algorithm replaces
the encountered suffix with the morphological information listed in the suffix table.


Step 1: The suffix of the input word is identified using the suffix table.

Step 2: The identified suffix is stripped from the word.
Step 3: Suffix stripping also gives the stem of the word.
Step 4: Based on the suffix, the stem is converted into the root word.
Step 5: The morpho-lexical information is identified for the suffix.
Step 6: The final output is the combination of the root word and the morpho-lexical information.
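
A minimal sketch of these steps follows. The two suffix table entries are invented for illustration (a locative inflection for each of the two example words), the tag LOC is assumed, and returning the word unchanged with a null MLI when no suffix matches is an assumption rather than something stated in the steps.

```python
# Hypothetical suffix table: suffix -> (ending that restores the root, MLI).
SUFFIX_TABLE = {
    "ththil": ("m", "LOC"),   # e.g. sithamparaththil -> sithamparam
    "yil": ("", "LOC"),       # e.g. pANdisEriyil -> pANdisEri
}


def analyze_proper_noun(word, table=SUFFIX_TABLE):
    """Strip the longest listed suffix and rebuild the root from the stem."""
    for suffix in sorted(table, key=len, reverse=True):      # Step 1
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]                       # Steps 2 and 3
            ending, mli = table[suffix]                      # Step 5
            return stem + ending, mli                        # Steps 4 and 6
    return word, None  # assumed fallback when no suffix is found


result = analyze_proper_noun("sithamparaththil")
```

Storing the root ending next to each suffix is one simple way to make the stem-to-root conversion depend on the suffix, which in turn reflects the end characters of the root.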

[Flow: Input Word; split into Stem and Suffix; Stem converted into Lemma; Suffix converted into Morpho-Lexical Information; output Lemma + MLI]

Figure 6.8 Implementation of Proper Noun Morph Analyzer

Table 6.14 Example for Proper Noun Inflections



The efficiency of the system is evaluated in this subsection. Various machine learning
tools are also compared using the same morphologically annotated data. The system
accuracy is estimated at various levels, which are briefly discussed below.

Training Data Vs Accuracy

In Figure 6.9, the X axis represents training data and the Y axis represents accuracy.
From the graph, it is found that Morphological Analyzer accuracy increases with
increase in the volume of training data. Accuracies are calculated from 10k to 300k
training corpus size.


[Plot: accuracy (Y axis) against training data size (X axis: 10k, 25k, 40k, 50k, 75k, 100k, 125k, 150k, 175k, 200k, 300k)]

Figure 6.9 Training Data Vs Accuracy


Tagged and Untagged Accuracies

In the sequence based morphological system, output is obtained in two different
stages using the trained models. The first stage takes a sequence of characters as
input and gives untagged morphemes as output using the trained Model-I. This stage is
also referred to as morpheme identification. In the second stage, these morphemes are
tagged using the trained Model-II. Accuracies of the untagged and tagged morphemes
for verbs and nouns are shown in Table 6.15.

Table 6.15 Tagged Vs Untagged Accuracies

Accuracy              Verb     Noun
Untagged (Model-I)    93.56%   94.34%
Tagged (Model-II)     91.73%   92.22%

Word level and character level accuracies

Accuracies are compared at the word level as well as the character level. 2300
verbs and 1750 nouns are taken randomly from the POS tagged corpus for testing the
system. Table 6.16 shows the number of words and characters in the whole testing
data set, together with the efficiency of prediction.

Table 6.16 Number of Words and Characters and level of Efficiencies

                      Verbs                 Nouns
                      Words   Characters    Words   Characters
Testing data          2300    20627         1750    10534
Predicted correctly   2071    19089         1639    9645
Efficiency            90.4%   92.5%         91.5%   93.6%

The percentages of word level efficiency and character level efficiency are
calculated by the following formulae.

$$\text{Word level efficiency} = \frac{\text{Number of words split correctly}}{\text{Total number of words in testing set}} \times 100$$

$$\text{Character level efficiency} = \frac{\text{Number of characters tagged correctly}}{\text{Total number of characters in testing set}} \times 100$$

Sentence level accuracies

The POS tagged sentences are given to the Morphological analyzer tool.
Therefore, the accuracy of POS tagging affects the performance of the analyzer. Here,
1200 POS tagged sentences consisting of 8358 words were taken for testing the
Morphological system. Table 6.17 shows the Sentence level accuracy of Morphological
analyzer system. For other categories of simplified POS tags, part-of-speech
information is considered as morphological information.

Table 6.17 Sentence Level Accuracies

Category           Input   Correct Output   Percentage
N                  2821    2642             93.65
V                  2794    2543             91.00
P (Pronoun)        562     543              96.61
PN                 279     258              92.47
O (Others)         1902    1817             95.53
Overall Accuracy                            93.86
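
The short sketch below recomputes the percentages of Table 6.17 from its counts. Treating the overall figure as the unweighted mean of the five category percentages is an assumption made here, chosen because it reproduces 93.86; the variable names are illustrative only.

```python
# category: (input words, correctly analyzed words), copied from Table 6.17
counts = {
    "N": (2821, 2642),
    "V": (2794, 2543),
    "P": (562, 543),
    "PN": (279, 258),
    "O": (1902, 1817),
}

# Per-category accuracy in percent.
per_category = {cat: 100.0 * ok / n for cat, (n, ok) in counts.items()}

# Unweighted mean of the category percentages (assumed reading of "overall").
macro_average = sum(per_category.values()) / len(per_category)

# Count-weighted accuracy over all test words, shown for comparison.
total_words = sum(n for n, _ in counts.values())
total_correct = sum(ok for _, ok in counts.values())
pooled_accuracy = 100.0 * total_correct / total_words
```

The category totals add up to the 8358 test words mentioned in the text. The pooled, count-weighted value comes out lower than the unweighted mean because the two largest categories, N and V, carry more weight in it.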


English language sentences are preprocessed using an existing parser and the developed
rules (Chapter-4). For Tamil, the POS tagger (Chapter-5) and the morphological
analyzers (Chapter-6) are used to preprocess the sentences. Preprocessing of Tamil
sentences is the same as factorization of Tamil sentences. Table 6.18 shows an
example of a preprocessed English and Tamil sentence.

Table 6.18 Preprocessed English and Tamil Sentence

Preprocessed English Sentence Preprocessed Tamil Sentence
I | i | PN | prn_i | |P| null
my | my | PN | PRP$ ||P| poss
home | home | N |NN_to | |N| DAT
vegetables | vegetable | N | NNS ||N|PL
bought | buy | V | VBD_1S. ||V|PAST_1S


This chapter explains the development of a morphological analyzer for the Tamil
language using a machine learning approach. Capturing the agglutinative structure of
Tamil words with an automatic system is a challenging job. Generally, rule based
approaches are used for building morphological analyzers. Here, the Tamil
morphological analyzer for nouns and verbs is developed using a state of the art
machine learning approach, in which morphological analysis is redefined as a
classification problem. This approach is based on sequence labeling and training by
kernel methods, which capture the non-linear relationships of the morphological
features from training data samples in a better and simpler way. An SVM based tool is
used for training the system on 6 lakh (600,000) morphologically tagged verbs and
nouns. The same methodology is implemented for other Dravidian languages such as
Malayalam, Telugu and Kannada. Tamil pronouns and proper nouns are handled using
separate analyzer systems. Other word classes need not be analyzed further for
morphological features, so their POS tag information is considered as the
morphological information.