The grammar of any language can be broadly divided into morphology and syntax. The
term morphology was coined by August Schleicher in 1859. Morphology deals with
words and their construction. Syntax deals with how to put words together in some
order to make meaningful sentences. Morphology is the field within linguistics that
studies the internal structure of words. While words are generally accepted as being the
smallest units of syntax, it is clear that in most languages words can be related to other
words by rules. Morphology attempts to formulate rules that model the knowledge of
the speakers of those languages. Morphemes are the smaller elements of which words
are built. Two broad classes of morphemes are stems and affixes. Affixes, which are added
to the base to denote relations between words, are morphemes. Morphemes can either be
free (they can stand alone, i.e. they can be words in their own right), e.g. dog, or
bound (they must occur as part of a word), e.g. the plural suffix s in dogs.
6.1.3 Morphological Analyzer
With the above definition, an analyzer of words in a sentence does not have to
do much work to identify a word. It simply has to look for the delimiters. Having
identified the word, it must determine whether it is a compound word or a simple word.
If it is a compound word, it must first be broken up into its constituent simple words
before they are analyzed. The component that does the former is called a sandhi analyzer
and the one that does the latter a morphological analyzer; both are important parts of a
word analyzer.
The detailed linguistic analysis of a word can be useful for NLP. However, most
NLP researchers have concentrated on other aspects, like grammatical analysis,
semantic interpretation etc. As a result, NLP systems use rather simple morphological
analyzers. A generator does the reverse of an analyzer. Given a root and its features (or
affixes), a morphological generator generates a word. Similarly, a sandhi generator can
take the output of a morphological generator, and group simple words into compound
words, where possible.
books = book+Noun+Plural (or) book+Verb+Pres+3SG
stopping = stop+Verb+Cont
happiest = happy+Adj+Superlative
went = go+Verb+Past
book+Noun+Plural = books
stop+Verb+Cont = stopping
happy+Adj+Superlative = happiest
go+Verb+Past = went
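The analysis and generation examples above are inverse mappings of each other. A minimal sketch of that relationship, assuming a toy lookup table built only from the four examples listed (the names `ANALYSES`, `analyze` and `generate` are illustrative and not part of any described system):

```python
# Toy lexicon built only from the examples above; a real analyzer would
# derive these pairs from rules or a trained model, not a lookup table.
ANALYSES = {
    "books": ["book+Noun+Plural", "book+Verb+Pres+3SG"],
    "stopping": ["stop+Verb+Cont"],
    "happiest": ["happy+Adj+Superlative"],
    "went": ["go+Verb+Past"],
}

def analyze(word):
    """Return every known analysis of a surface word (empty list if unknown)."""
    return ANALYSES.get(word, [])

# Invert the table so that generation is the reverse mapping.
GENERATIONS = {a: w for w, analyses in ANALYSES.items() for a in analyses}

def generate(analysis):
    """Return the surface word for a root+features string, or None if unknown."""
    return GENERATIONS.get(analysis)
```

Note that analysis can be ambiguous (books has two analyses), while each root+features string here generates exactly one word.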
Spell checker
Search Engines
Information extraction and retrieval
Machine Translation system
Grammar checker
Content analysis
Question Answering system
Automatic sentence analyzer
Dialogue system
Knowledge representation in learning
Language Teaching
Language based educational exercises
Text Processing
Tamil Morphology is very rich. It is an agglutinative language, like the other Dravidian
languages. Tamil words are made up of lexical roots followed by one or more affixes.
The lexical roots and the affixes are the smallest meaningful units and are called
morphemes. Tamil words are therefore made up of morphemes concatenated to one
another in a series. The first one in the construction is always a lexical morpheme
(lexical root). This may or may not be followed by other functional or grammatical
morphemes. For instance, the Tamil word puththakangkaL can be analyzed as the
lexical root puththakam, which refers to a real-world entity, and kaL, the plural
feature marker (suffix). kaL is a grammatical morpheme that is bound to the lexical
root to add plurality to it. Unlike English, Tamil words can have a long sequence of
morphemes. For instance,
Tamil nouns can take case suffixes after the plural marker. They can also take
postpositions after that. Tamil words consist of a lexical root to which one or more
affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes can be derivational
suffixes, which change either the part of speech of the word or its meaning, or
inflectional suffixes, which mark categories such as person, number, mood, tense, etc.
Words can be analyzed like the one above by identifying their constituent
morphemes and the features of each morpheme.
Tamil is a consistently head-final language. The verb comes at the end of the clause,
with a typical word order of Subject Object Verb (SOV). Tamil is also a relatively free
word-order language: the noun phrase arguments before the final verb can appear in any
permutation and the sentence still conveys the same sense. Tamil has postpositions
rather than prepositions. Demonstratives and modifiers precede the noun within the noun
phrase. Subordinate clauses precede the verb of the matrix clause.
Tamil is a null subject language. Not all Tamil sentences have subjects, verbs
and objects. It is possible to construct valid sentences that have only a verb, or that
lack a copula (a linking verb equivalent to the word "is"). In the latter case the word
"is" is included in the translations only to convey the meaning more easily.
6.2.3 Word Formation Rules (WFR) in Tamil
Any new word created by Word Formation Rules (WFR) must be a member of a major
lexical category. The WFR determines the category of the output of the rule. In Tamil,
the grammatical category may change or may not change after the operation of WFR.
The following is the list of inputs and outputs of different kinds of WFR's in the
derivation of simple words in Tamil [170].
1. Noun → Noun
[ [ ]N + [ ]suf ]N
2. Verb → Noun
[ [ ]V + [ ]suf ]N
3. Adjective → Noun
[ [ ]adj + [ ]suf ]N
4. Noun → Verb
[ [ ]N + [ ]suf ]V
5. Adjective → Verb
[ [ ]adj + [ ]suf ]V
6. Verb → Verb
[ [ ]V + [ ]suf ]V
7. Noun → Adjective
[ [ ]N + [ ]suf ]adj
[ [nErmai]N + [Ana]suf ]adj 'honest'
8. Verb → Adverb
[ [ ]V + [ ]suf ]adv
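The eight rule types above can be read as a table of licensed (input category, output category) pairs. The following is a minimal sketch under that reading; the function name `apply_wfr` and the decision to return only a labelled bracketing (without modelling sandhi at the morpheme boundary) are assumptions made for illustration.

```python
# Category pairs licensed by the eight rule types listed above.
LICENSED = {
    ("N", "N"), ("V", "N"), ("adj", "N"), ("N", "V"),
    ("adj", "V"), ("V", "V"), ("N", "adj"), ("V", "adv"),
}

def apply_wfr(base, base_cat, suffix, out_cat):
    """Return the labelled bracketing of a derived word.

    Only the structure is built; sandhi changes at the morpheme
    boundary are deliberately not modelled in this sketch.
    """
    if (base_cat, out_cat) not in LICENSED:
        raise ValueError(f"no rule maps {base_cat} to {out_cat}")
    return f"[ [{base}]{base_cat} + [{suffix}]suf ]{out_cat}"

# The one example preserved in the list: nErmai (N) + Ana gives an adjective.
print(apply_wfr("nErmai", "N", "Ana", "adj"))
```

The output category is supplied by the caller here; in a fuller model it would be a property of the suffix itself.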
Table 6.1 shows the possible combinations for compound word formation.
Examples:
{ [ ]N # [ ]V # }V
Table 6.1 Compound Word-forms Formation
Tamil verbs are inflected by means of suffixes. Tamil verbs can be finite or non-finite
forms. Finite verb forms occur in the main clause of the sentence and non-finite forms
occur as the predicate of subordinate or embedded clauses. Morphologically, finite verb
forms are inflected for tense, mood, aspect, person, number and gender.
The simple finite verb forms are given in Table 6.2. The first column presents the
PNG (Person-Number-Gender) tag and the remaining columns present the present, past and
future tense forms respectively. For the word padi (study), the various simple finite
inflected forms with tense markers and PNG markers are given in Table 6.2.
Modal verbs can be defective, in that they cannot take any further inflectional
suffixes, or they can be regular verbs that are inflected with tense and PNG suffixes.
Tamil nouns (and pronouns) are classified into two super-classes, the "rational" and the
"irrational", which include a total of five classes. Humans and deities are classified as
"rational", and all other nouns (animals, objects, abstract nouns) are classified as
"irrational". The "rational" nouns and pronouns belong to one of three classes: masculine
singular, feminine singular, and rational plural. The "irrational" nouns and pronouns
belong to one of two classes: irrational singular and irrational plural. The plural form
for rational nouns may be used as an honorific, gender-neutral, singular form [132].
The cases are nominative, accusative, dative, sociative, genitive, instrumental,
locative, and ablative. The various noun forms are given in Table 6.3. The table
presents the singular and plural forms of the word eli (rat) with the case markers.
A noun form without any inflection is called a noun stem. Nouns in their stem
form are singular.
= +
= +
The examples shown above are a few instances of plural inflection. Creating the
plural form of a noun is not simply a matter of concatenating kaL. Similarly, in
puththakangkaL, the stem (puththakam) is inflected to puththakangkaL (am in the
stem is replaced by ang, followed by kaL). These differences are due to the sandhi
changes that take place when the noun stem is concatenated with the kaL morpheme.
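As a rough illustration of why plain concatenation is not enough, the sketch below models only the single sandhi change described above (stem-final am becoming ang before kaL). The function name `pluralize` is hypothetical, and all other Tamil sandhi rules are left out.

```python
PLURAL = "kaL"

def pluralize(stem):
    """Attach the plural marker kaL to a romanized noun stem.

    Only the one sandhi change described above is modelled: a final
    'am' becomes 'ang' before kaL. Other sandhi rules are not covered.
    """
    if stem.endswith("am"):
        return stem[:-2] + "ang" + PLURAL
    return stem + PLURAL

print(pluralize("puththakam"))  # puththakangkaL
```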
Tamil uses case suffixes and postpositions for case marking instead of prepositions.
Case markers indicate the relationship between the noun phrases and the verb phrase,
that is, the semantic role of each noun phrase in the sentence. The genitive case
expresses the relationship between noun phrases; it is marked by the morpheme in. Case
suffixes are concatenated to nouns in their stem form, or after the plural morpheme if
the noun is plural.
Postpositions are of two kinds: bound and free. Bound postpositions occur with
their respective governing case suffixes. In such a case, the
morphotactics would be,
Sometimes the postposition follows a blank space after the case suffix, as
a separate word. Free postpositions follow noun stems without any case suffix;
they are written as separate words and do not concatenate with the noun.
Basically, there are eight cases in Tamil. Verbs can take the form of nouns when
followed by nominal suffixes. Nominalized verb forms are an example of derivational
morphology in Tamil. They occur in the following format.
6.2.6 Tamil Morphological Analyzer
Compared to verb morphological analysis, noun morphological analysis is
relatively easy. A noun can occur on its own or with plural, oblique, case, postposition
and clitic suffixes. A corpus was developed with all the morphological feature
information, so that the machine by itself captures all morphological rules, including
sandhi and morphotactic rules.
[Figure: architecture of the morphological analyzer system. Input: POS Tagged Sentence. Modules shown: 2. Noun/Verb Analyzer, 3. Pronoun Analyzer, 4. Proper Noun Analyzer. Output: Morphologically Annotated Sentence.]
The input to the morphological system is a POS tagged sentence. In the first
module, the POS tagged sentence is refined according to the simplified POS tagset given
in Table 6.4. The refined POS tagged sentence is split according to the simplified POS
tags. The second module morphologically analyzes the noun (<N>) and verb (<V>)
forms. The third and fourth modules morphologically analyze pronouns (<P>) and proper
nouns (<PN>). Other word classes are handled in the fifth module, which considers
the POS tag as morphological features.
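The routing described above can be sketched as a dispatch on the simplified POS tag. This is only an outline: the three analyzer functions are placeholders returning a label, the tag names N, V, P, PN and O follow the simplified tags used in this chapter, and everything else is an assumption.

```python
def noun_verb_analyzer(word, tag):
    # Placeholder for the machine-learning noun/verb analyzer.
    return (word, tag, "noun/verb analyzer")

def pronoun_analyzer(word, tag):
    # Placeholder for the pattern-matching pronoun analyzer.
    return (word, tag, "pronoun analyzer")

def proper_noun_analyzer(word, tag):
    # Placeholder for the suffix-based proper noun analyzer.
    return (word, tag, "proper noun analyzer")

ROUTES = {
    "N": noun_verb_analyzer,
    "V": noun_verb_analyzer,
    "P": pronoun_analyzer,
    "PN": proper_noun_analyzer,
}

def analyze_sentence(tagged_words):
    """Route each (word, simplified POS tag) pair to its analyzer."""
    results = []
    for word, tag in tagged_words:
        handler = ROUTES.get(tag)
        if handler is None:
            # Other word classes: the POS tag itself is the morphological feature.
            results.append((word, tag, "POS tag used as morphological feature"))
        else:
            results.append(handler(word, tag))
    return results
```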
The morphological analyzer identifies the root and suffixes of a word. Generally, rule
based approaches are used for morphological analysis; these rely on a set of rules
and a dictionary that contains root words and morphemes. In a rule based approach, a
word is given as input to the morphological analyzer, and if the
corresponding morphemes or root word are missing from the dictionary, the rule based
system fails. Moreover, each rule depends on the previous rule, so if one rule fails, it
affects the entire set of rules that follow.
Recently, machine learning approaches have come to dominate the Natural
Language Processing field. Machine learning is a branch of Artificial Intelligence (AI)
concerned with the design of algorithms that learn from examples. Machine
learning algorithms can be supervised or unsupervised. Supervised learning uses both the
input and the corresponding output data; unsupervised learning uses only input
samples. The goal of a machine learning approach is to use the given examples
to find generalization and classification rules automatically. All the rules,
including complex spelling rules, can be handled by this method. A morphological
analyzer based on machine learning does not require any hand coded
morphological rules. It only needs morphologically segmented corpora. H. Poon et al.
(2009) [189] reported the first log-linear model for unsupervised morphological
segmentation. For Arabic and Hebrew, it outperforms the state-of-the-art
systems by a large margin. Sequence labeling is a significant generalization of the
supervised classification problem: a single label is assigned to each input element
in a sequence. The labels assigned are typically parts of speech or
syntactic chunk labels [171]. Many tasks are formalized as sequence labeling problems
in various fields such as natural language processing and bioinformatics. There are two
types of sequence labeling approaches [171]:
Raw labeling.
Joint segmentation and labeling.
In raw labeling, each element gets a single tag, whereas in joint segmentation
and labeling, whole segments get a single label. In a morphological analyzer, the
sequence is usually a word and an element is a character. As mentioned earlier, in a
morphological analyzer the input is a word and the output is its root and inflections.
The input word is denoted by W, and the root word and inflections are denoted by R and I
respectively.
The main objective of sequence labeling approach is predicting y from the given
x. In training data, the input sequence x is mapped with output sequence y. Now
the morphological analyzer problem is transformed into a sequence labeling problem.
The information about the training data is explained in the following sub sections.
Finally the morphological analysis is redefined as a classification task which is solved
by using sequence labeling methodology.
Data formulation plays the key role in supervised machine learning approaches. The
first step in developing the corpora for the morphological analyzer is classifying
paradigms for verbs and nouns. The classification of Tamil verbs and nouns is based
on tense markers and case markers respectively. All the words in a paradigm inflect with
the same set of inflections. The second step is to collect the list of root words for all
paradigms.
A paradigm provides information about all the possible word forms of a root word in a
particular word class. Tamil noun and verb paradigm classification is based on
case and tense markers respectively. The number of paradigms for each word class
(noun/verb) is defined. For the sake of computational data modeling, Tamil verbs were
classified into 32 paradigms [13]. Nouns are classified into 25 paradigms to resolve the
challenges in noun morphological analysis. The root words are
grouped by paradigm. Table 6.5 shows the number of paradigms and inflections of
verbs and nouns that are handled in the system. Total represents the total number of
inflections that are handled in this analyzer system. The noun and verb paradigm lists
are shown in Tables 6.6 and 6.7.
Table 6.5 Number of Paradigms and Inflections
Word forms   Paradigms   Inflections   Auxiliaries   Postpositions   Total
Verb         32          164           67            --              10988
Noun         25          30            --            290             320
6.4.2.2 Word-forms
The morphological system for nouns handles more than three hundred word
forms including postpositions. Traditional grammarians group the various suffixes into
8 cases corresponding to the cases used in Sanskrit. These are the nominative,
accusative, dative, sociative, genitive, instrumental, locative, and ablative. The
sample word forms which are used in this thesis are shown in Table 6.8. The remaining
word forms are included in Appendix B.
Table 6.7 Verb Paradigms
(narampinai) (narampathu)
(narampai) (narampinathu)
(narampinathai) (narampnkaN)
(narampOdu) (narampathukaN)
(narampinOdu) (narampukkAka)
(narampAlAna)
(narampinaththOdu) (narampudaiya)
(narampinAl)
(narampAl) (narampinudaiya)
(narampukku) (narampil)
(narampiRku) (narampinil)
(narampin) (narampudan)
Verb word forms
Verbs can also be morphologically deficient, i.e. some verbs do not take all the suffixes
meant for verbs. The verb is an obligatory part of a sentence, except in copula sentences.
Verbs can be classified into different types based on morphological, syntactic and
semantic characteristics. Based on the tense suffixes, verbs can be classified into weak
verbs, strong verbs and medium verbs. Based on form and function, verbs can be
classified into finite verbs (e.g. va-ndt-aan 'come_PAST_he') and non-finite verbs (e.g.
va-ndt-a 'come_PAST_RP' and va-ndt-u 'come_PAST_VPAR'). Depending on
whether the non-finite form occurs before a noun or a verb, it can be classified as an
adjectival or relative participle form (e.g. vandta paiyan 'the boy who came') or an
adverbial or verbal participle form (e.g. vandtu poonaan 'having come he went'). The
morphological system for verbs handles more than ten thousand word forms including
auxiliaries and clitics. The sample verb forms which are used in this research are shown
in Table 6.9. The remaining word forms are given in Appendix B.
(padi) (padikkinREn)
(padiththAn) (padikkinROm)
(padiththAL) (padippAn)
(padiththAr) (padippAL)
(padiththArkaL) (padippAr)
(padiththathu) (padippArkaL)
(padiththana) (padippathu)
(padiththAy) (padippana)
(padiththIr) (padippAy)
(padiththIrkaL) (padippIr)
(padiththEn) (padippIrkaL)
(padiththOm) (padippEn)
(padikkiRAn) (padippOm)
(padikkiRAL) (padikkum)
(padikkiRAr) (padiththa)
(padikkiRArkaL) (padikkinRa)
(padikkinRathu) (padikkAtha)
(padikkinRana) (padiththavan)
(padikkinRAy) (padiththavaL)
(padikkinRIr) (padiththavar)
(padikkinRIrkaL) (padiththathu)
6.4.2.3 Morphemes
Noun morphemes
The morphological analyzer system for nouns handles 92 morphemes in the morpho-lexical
tagging (Phase II). The morphemes which are used in this thesis are given in
Table 6.10.
Table 6.10 Noun Morphemes
Verb morphemes
The morphological system for verbs handles 170 morphemes in the morpho-lexical
tagging (Phase II). The morphemes which are used in this analyzer are shown in
Table 6.11.
Ambiguous Morphemes of Noun and Verb
6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer
The data creation for the first phase of Noun/Verb Morphological analyzer system is
done by the following stages.
Preprocessing
Mapping
Bootstrapping
Preprocessing
Romanization
The input word forms are converted to Romanized forms using the Unicode to
Roman mapping. Romanization is done for easy computational processing. In Tamil, a
syllable (compound character) exists as a single character, in which the vowel and the
consonant cannot be separated. So, to make this separation possible, Tamil graphemes are
converted into Roman forms. The Tamil-Roman mapping is given in Appendix A.
Segmentation
After Romanization, every word in the corpora is segmented based on
the Tamil graphemes, and each syllable in the word is further
segmented into its consonant and vowel. In each segmented syllable, -C is suffixed to
the consonant and -V to the vowel. This is called the C-V representation, i.e. the
Consonant-Vowel representation. The C-V representation is given only for the input data.
In the output data, morpheme boundaries are indicated by the * symbol.
Alignment
The segmented words are aligned vertically as segments using the gap between
them.
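The segmentation step can be sketched as follows. The grapheme inventory (`VOWELS`, `MULTI`) is a small hypothetical stand-in for the Appendix A mapping, which may differ; the rule implemented is the one visible in the chapter's samples, where a consonant followed by a vowel is split into a -C unit and a -V unit and any other unit is left unmarked.

```python
# Hypothetical romanization inventory; the thesis's actual mapping is in
# its Appendix A and may differ.
VOWELS = {"a", "A", "i", "I", "u", "U", "e", "E", "ai", "o", "O", "au"}
MULTI = ["th", "ng", "nj", "zh", "ai", "au"]  # multi-letter graphemes

def graphemes(word):
    """Split a romanized word into grapheme units (greedy two-letter match)."""
    units, i = [], 0
    while i < len(word):
        two = word[i:i + 2]
        if two in MULTI:
            units.append(two)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units

def cv_segments(word):
    """Return the C-V representation of a romanized word."""
    units = graphemes(word)
    out, i = [], 0
    while i < len(units):
        unit = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if unit not in VOWELS and nxt in VOWELS:
            # A consonant+vowel syllable is split into two marked units.
            out.extend([unit + "-C", nxt + "-V"])
            i += 2
        else:
            # A bare consonant or a standalone vowel stays unmarked.
            out.append(unit)
            i += 1
    return out

print(" | ".join(cv_segments("padiththAn")))
```

Writing the resulting units one per line gives the vertically aligned form used for training.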
Figure 6.4 Preprocessing Steps
Mapping and Bootstrapping
The aligned input segments are consequently mapped with output segments in
the mapping stage. Bootstrapping is done to increase the training data size. Sample data
format for the word padiththAn is given in Table 6.13. First column
represents the input data and the second one represents output data. * indicates the
morpheme boundaries.
I/P     O/P
p-C     p
a-V     a
d-C     d
i-V     i*
th      th
th-C    th*
A-V     A
n       n*
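Reading the output column back into morphemes is a matter of grouping units up to each * mark. A minimal sketch, assuming the output labels are available as a list of strings (the function name is illustrative, and the `$` null symbol used elsewhere in this chapter for unmatched units is simply skipped):

```python
NULL = "$"      # null output symbol for an input unit with no counterpart
BOUNDARY = "*"  # marks the last unit of a morpheme

def morphemes_from_labels(labels):
    """Group output labels into morphemes at each boundary mark."""
    morphemes, current = [], ""
    for label in labels:
        if label == NULL:
            continue
        if label.endswith(BOUNDARY):
            morphemes.append(current + label[:-1])
            current = ""
        else:
            current += label
    if current:  # trailing units without a closing boundary mark
        morphemes.append(current)
    return morphemes

print(morphemes_from_labels(["p", "a", "d", "i*", "th", "th*", "A", "n*"]))
# ['padi', 'thth', 'An']
```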
6.4.2.5 Issues in Data Creation
Mismatching is the main problem that occurs when mapping the input characters
to the output characters. It occurs in two cases: the input sequence has either
more or fewer units than the output sequence. The mismatch is resolved by
inserting a null symbol $ or by combining two units, based on the morpho-phonemic
rules, after which the input segments are mapped to the output segments. After mapping,
a machine learning tool is used for training on the data.
In case 1, the input sequence has more segments (14) than the
output sequence (13). The Tamil verb padikkayiyalum has 14
segments in the input sequence but only 13 in the output. The first
occurrence of y (8th segment) in the input sequence becomes null due to a
morpho-phonemic rule, so there is no output segment to map this y to. For
this reason, in training, the input segment y is mapped to the $ symbol ($ indicates
null) in the output sequence. The input and output sequences then have the same
number of segments.
Case 1:
Input Sequence:
p-C | a-V | d-C | i-V | k | k-C | a-V | y-C | i-V | y-C | a-V | l-C | u-V | m
(14 segments)
Mismatched Output Sequence:
p | a | d | i* | k | k | a* | i | y | a | l* | u | m*
(13 segments)
Corrected Output Sequence:
p | a | d | i* | k | k | a* | $ | i | y | a | l* | u | m*
(14 segments)
In case 2, the input sequence has fewer segments than the output
sequence. The Tamil verb OdinAn has 6 segments in the input sequence but 7 in the
output. By a morpho-phonemic rule, the segment d-C (2nd segment) in the input
sequence corresponds to two segments, d and u* (2nd and 3rd segments), in the output
sequence. For this reason, in training, d-C is mapped to du*. The input and
output sequences are thus equalized and the problem of sequence mismatching is
solved.
Case 2:
Input Sequence:
O | d-C | i-V | n-C | A-V | n (6 segments)
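The two repairs can be sketched as list operations on the output sequence. In this sketch the position to repair is passed in by hand; in the approach described above it follows from the morpho-phonemic rules, which are not modelled here. The helper names are hypothetical.

```python
NULL = "$"

def insert_null(output, index):
    """Case 1: the input has an extra unit, so pad the output at `index`."""
    return output[:index] + [NULL] + output[index:]

def merge_units(output, index):
    """Case 2: the output has an extra unit, so fuse units index and index+1."""
    merged = output[index] + output[index + 1]
    return output[:index] + [merged] + output[index + 2:]

# Case 1: the 8th input segment (index 7) has no output counterpart.
mismatched = "p a d i* k k a* i y a l* u m*".split()
corrected = insert_null(mismatched, 7)
print(len(mismatched), len(corrected))  # 13 14

# Case 2: the output units d and u* are fused so that d-C maps to du*.
print(merge_units(["d", "u*"], 0))  # ['du*']
```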
Support Vector Machine (SVM) approaches have been around since the mid 1990s,
initially as a binary classification technique, with later extensions to regression and
multi-class classification. Here, the morphological analyzer problem is converted into a
classification problem, which can be solved with supervised machine
learning algorithms [12]. The Support Vector Machine is a machine learning algorithm for
binary classification which has been successfully applied to a number of practical
problems. It is trained on a set of examples, where each instance $x_i$ is a vector in
$\mathbb{R}^N$ and $y_i \in \{-1, +1\}$ is the class label.
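For reference, the standard soft-margin formulation of a binary SVM can be written as below. This is textbook material rather than a formula taken from this chapter; the symbols $w$, $b$, $\xi_i$, $C$ and the example count $m$ are introduced here only for illustration, with instances $x_i$ and labels $y_i \in \{-1, +1\}$.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b) \ge 1-\xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,m

f(x) = \operatorname{sign}(w\cdot x + b)
```

A new instance $x$ is assigned the class given by $f(x)$.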
6.4.3.2 SVMTool
SVMTool consists of three components. SVMTlearn learns different models for the
different strategies: given a training set of annotated examples, it is responsible for
training a set of SVM classifiers. To do so, it makes use of SVMlight, an implementation
of Vapnik's SVMs in C developed by Thorsten Joachims (2002). SVMTagger, given a text
corpus (one token per line) and the path to a previously learned SVM model (including
the automatically generated dictionary), performs the tagging, here of a sequence of
characters. Finally, given a correctly annotated corpus and the corresponding SVMTool
predicted annotation, the SVMTeval component displays the tagging results. SVMTeval
evaluates the performance in terms of accuracy.
Existing Tamil morphological analyzers are explained in Chapter 2. Jan Hajic et al.
(1998) [190] developed morphological tagging for inflectional languages using an
exponential probabilistic model based on automatically selected features. The
parameters of the model are computed using simple estimates.
The morphological analyzer for Tamil is developed using the machine learning approach.
Separate engines are developed for nouns and verbs. The noun morphological
analyzer can handle inflected noun forms and postpositionally inflected nouns. The
verb analyzer handles all verb forms, such as finite and non-finite forms and auxiliaries.
Morphological analysis is redefined as a classification task, and the classification
problem is solved using the SVM. In this machine learning approach, two models are
trained for morphological analysis: Model-I
(segmentation model) and Model-II (morpho-syntactic tagging model). Model-I
is trained using sequences of input characters and their corresponding
output labels. The trained Model-I is used for predicting the morpheme boundaries.
Model-II is trained using sequences of morphemes and their
grammatical categories. The trained Model-II is used for assigning grammatical classes
to each morpheme. Figure 6.5 illustrates the three phases involved in the process of
morphological analysis:
Pre-processing.
Morpheme Segmentation.
Morpho-syntactic Tagging.
Preprocessing
The word that has to be morphologically analyzed is given as the input to the
pre-processing phase. The word primarily undergoes Romanization process. The
romanized word is segmented based on Tamil graphemes. Tamil grapheme consists of
vowel, consonant and syllable. The syllables are broken into vowel and consonant. To
these consonant and vowel, C and V are suffixed.
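The flow from preprocessed units through the two trained models can be outlined as below. The two `toy_model_*` functions are hard-coded stand-ins for the trained SVM models, not real classifiers, and the function names are assumptions; only the order of the stages reflects the description in this section.

```python
def split_morphemes(labels):
    """Group boundary-marked labels ('*' ends a morpheme) into morphemes."""
    morphemes, current = [], ""
    for label in labels:
        if label.endswith("*"):
            morphemes.append(current + label[:-1])
            current = ""
        else:
            current += label
    if current:
        morphemes.append(current)
    return morphemes

def analyze(units, model_one, model_two):
    labels = model_one(units)            # Model-I: one output label per input unit
    morphemes = split_morphemes(labels)  # cut at the predicted boundaries
    tags = model_two(morphemes)          # Model-II: one tag per morpheme
    return list(zip(morphemes, tags))

# Stand-ins for the trained models, covering a single word only.
def toy_model_one(units):
    known = {
        ("p-C", "a-V", "d-C", "i-V", "th", "th-C", "A-V", "n"):
            ["p", "a", "d", "i*", "th", "th*", "A", "n*"],
    }
    return known[tuple(units)]

def toy_model_two(morphemes):
    known = {("padi", "thth", "An"): ["<ROOT>", "<PAST>", "<3SM>"]}
    return known[tuple(morphemes)]

units = ["p-C", "a-V", "d-C", "i-V", "th", "th-C", "A-V", "n"]
print(analyze(units, toy_model_one, toy_model_two))
```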
[Figure 6.5: Morphological analyzer. Blocks shown: Input, Preprocessing, Segmentation of morpheme (Trained Model-I), Morph Analyzer, Morpheme Alignment, Morpho-syntactic tagging (Trained Model-II), Postprocessing, Output: <ROOT> <PAST> <3SM>.]
6.5 MORPHOLOGICAL ANALYZER FOR PRONOUN USING PATTERN MATCHING
The morphological analyzer for Tamil pronouns is developed using a pattern matching
approach. Personal pronouns play an important role in the Tamil language and therefore
need special attention in both generation and analysis. Figure 6.6
shows the implementation of the pronoun morphological analyzer. Tamil pronoun word
forms are processed independently by the pronoun
morphological analysis system. The analysis is based on
pattern matching and on pronoun root words. The structure of the pronoun word form is
used for creating a pattern file. The pronoun word structure is divided into four stages:
i. PRN ROOT
ii. CASE CLITIC
iii. PPO CL
iv. CLITIC
[Figure: Example for the Structure of Pronoun. Labels shown: Pronoun, Case, PP, Clitic.]
The pronoun word class is a closed class, so it is easy to collect all the root words of
pronouns. In the pronoun morphological system the word form is processed from left to
right. Morphological analysis systems generally process a word from right to left, but
the limited vocabulary of pronouns makes a left-to-right system feasible here. The
pronoun word form is Romanized using the Unicode to Roman mapping, and the Romanized
word is first compared with the pronoun stem file, which consists of all
the stems and roots of pronoun words. If the Roman form matches any entry in
the pronoun stem file, the matched part is replaced with the
value of the corresponding entry. The remaining part is then compared
with three different suffix files, and each matched part is replaced with the
corresponding value of the suffix element. Finally, the root word is converted back into
Unicode form. Figure 6.7 shows the implementation steps of the pronoun morph analyzer.
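The left-to-right matching can be sketched as a longest-prefix lookup in a stem table followed by lookups in three suffix tables. The table entries and tag names below are hypothetical stand-ins for the stem file and suffix files, chosen only to make the sketch run; Romanization and the final conversion back to Unicode are omitted.

```python
# Hypothetical, tiny stand-ins for the stem file and the three suffix files.
STEMS = {"avan": "avan<PRN>", "avaL": "avaL<PRN>"}
SUFFIX_FILES = [
    {"ai": "<ACC>", "ukku": "<DAT>"},  # case suffixes
    {"udan": "<PPO>"},                 # postpositions
    {"um": "<UM>"},                    # clitics
]

def longest_prefix(text, table):
    """Return the longest prefix of `text` that is a key of `table`, or None."""
    for end in range(len(text), 0, -1):
        if text[:end] in table:
            return text[:end]
    return None

def analyze_pronoun(word):
    """Analyze a romanized pronoun form from left to right."""
    stem = longest_prefix(word, STEMS)
    if stem is None:
        return None
    parts = [STEMS[stem]]
    rest = word[len(stem):]
    for table in SUFFIX_FILES:
        match = longest_prefix(rest, table)
        if match:
            parts.append(table[match])
            rest = rest[len(match):]
    # Any unmatched remainder means the form is not covered by the tables.
    return "+".join(parts) if not rest else None

print(analyze_pronoun("avanai"))  # avan<PRN>+<ACC>
```

Because the pronoun vocabulary is closed, the stem table can be exhaustive, which is what makes starting from the left workable.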
6.6 MORPHOLOGICAL ANALYZER FOR PROPER NOUN USING SUFFIXES
The morphological analyzer for proper nouns is developed using suffixes. Figure
6.8 shows the implementation of the proper noun morph analyzer. The input to
proper noun morphological analysis is a proper noun word form
taken from the minimized POS tagged sentence, where it is identified by the POS tag <PN>.
The proper noun word form is first converted into Roman form for easy computation.
This conversion uses a simple key-pair mapping between Roman and Tamil
characters: the mapping program recognizes each Tamil character unit and replaces
it with the corresponding Roman character.
The Roman form is given to the proper noun analyzer system, which
compares the word with a predefined set of suffixes. It first identifies the suffix and
replaces it with the corresponding information from the proper noun suffix data set. The
suffix data set is created from various proper noun inflections and their end characters.
For example, in Table 6.14, the word sithamparam ends with
m and the word pANdisEri ends with ri; the
possible inflections of both words are given in the table. The morphological changes
that a proper noun undergoes differ according to its end characters, so end characters
are used in creating the rules. From the various inflections of the word form the suffix
is identified, and the remaining part is the stem. The suffix is mapped to the original
morphological information: the algorithm replaces the encountered suffix with the
morphological information from the suffix table.
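A minimal sketch of the suffix replacement follows. The suffix table is hypothetical: its four entries are illustrative guesses at how forms of sithamparam and pANdisEri might be keyed by end characters, not the contents of Table 6.14, and the function name is an assumption.

```python
# Hypothetical suffix data set: surface ending -> (stem ending, information).
SUFFIXES = {
    "aththai": ("am", "Root+ACC"),
    "aththil": ("am", "Root+LOC"),
    "riyai": ("ri", "Root+ACC"),
    "riyil": ("ri", "Root+LOC"),
}

def analyze_proper_noun(word):
    """Return (stem, morphological information) for a romanized proper noun."""
    # Try longer suffixes first so the most specific entry wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            ending, info = SUFFIXES[suffix]
            return word[:-len(suffix)] + ending, info
    # No known suffix: treat the whole word as the root.
    return word, "Root"

print(analyze_proper_noun("sithamparaththai"))  # ('sithamparam', 'Root+ACC')
```

Keying each entry on the end characters lets the same lookup both strip the suffix and restore the stem ending.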
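[Figure 6.8: Implementation steps of the proper noun morph analyzer. Blocks shown: Input Word, Suffix? (No: Word; Yes: Stem, Suffix), Convert into Lemma and Morpho-Lexical Information, Lemma + MLI, Morphological Output. Output labels shown: Root, Root+ACC, Root+LOC, Root+DAT, Root+ABL, Root+UM.]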
Efficiency of the system is compared in this sub section. Various machine learning
tools are also compared using the same morphologically annotated data. The system
accuracy is estimated at various levels, which are briefly discussed below.
Training Data Vs Accuracy
In Figure 6.9, the X axis represents training data and the Y axis represents accuracy.
From the graph, it is found that the morphological analyzer accuracy increases with
increase in the volume of training data. Accuracies are calculated for training corpus
sizes from 10k to 300k.
[Figure 6.9: Training Data vs Accuracy. X axis (TrainingData): 10k, 25k, 40k, 50k, 75k, 100k, 125k, 150k, 175k, 200k, 300k. Y axis (Accuracy): 30 to 100.]
In the sequence based morphological system, output is obtained in two different
stages using the trained models. The first stage takes a sequence of characters as input
and gives untagged morphemes as output using the trained Model-I. This stage is also
referred to as morpheme identification. In the second stage, these morphemes are tagged
using the trained Model-II. Accuracies of the untagged and tagged morphemes for verbs and
nouns are shown in Table 6.15.
Table 6.15 Tagged Vs Untagged Accuracies
Word level and character level accuracies
Accuracies are compared at the word level as well as the character level. Two
thousand three hundred verbs and one thousand seven hundred and fifty nouns
are taken randomly from the POS tagged corpus for testing the system. Table 6.16 shows
the number of words and characters in the whole testing data set, together with the
efficiency of prediction.
Category              VERB                   NOUN
                      Words    Characters    Words    Characters
Testing data          2300     20627         1750     10534
Predicted correctly   2071     19089         1639     9645
Efficiency            90.4%    92.5%         91.5%    93.6%
The percentage of Word level efficiencies and Character level efficiencies are
calculated by the following formulae.
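Assuming each efficiency is simply the share of test items predicted correctly, the two measures can be written as:

```latex
\text{Word level efficiency} =
  \frac{\text{number of words predicted correctly}}{\text{total number of words tested}} \times 100

\text{Character level efficiency} =
  \frac{\text{number of characters predicted correctly}}{\text{total number of characters tested}} \times 100
```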
The POS tagged sentences are given to the Morphological analyzer tool.
Therefore, the accuracy of POS tagging affects the performance of the analyzer. Here,
1200 POS tagged sentences consisting of 8358 words were taken for testing the
Morphological system. Table 6.17 shows the Sentence level accuracy of Morphological
analyzer system. For other categories of simplified POS tags, part-of-speech
information is considered as morphological information.
Table 6.17 Sentence Level Accuracies
Categories         Input (words)   Correct Output (words)   Percentage
N                  2821            2642                     93.65
V                  2794            2543                     91.00
P (Pronoun)        562             543                      96.61
PN                 279             258                      92.47
O (Others)         1902            1817                     95.53
Overall Accuracy                                            93.86
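The table does not state how the overall figure is aggregated. The sketch below recomputes the per-category percentages from the counts in Table 6.17 and two common aggregates; the variable names are illustrative. With these counts, the unweighted mean of the category percentages comes to about 93.86, while pooling all 8358 words gives about 93.36.

```python
# (input words, correctly analyzed words) per category, from Table 6.17.
COUNTS = {
    "N": (2821, 2642),
    "V": (2794, 2543),
    "P": (562, 543),
    "PN": (279, 258),
    "O": (1902, 1817),
}

def percentage(correct, total):
    return 100.0 * correct / total

per_category = {cat: percentage(ok, n) for cat, (n, ok) in COUNTS.items()}

# Unweighted mean of the category percentages (macro average).
macro = sum(per_category.values()) / len(per_category)

# Pooled accuracy over all words (micro average).
total_words = sum(n for n, _ in COUNTS.values())
total_correct = sum(ok for _, ok in COUNTS.values())
micro = percentage(total_correct, total_words)

print(round(macro, 2), round(micro, 2))
```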
6.9 SUMMARY
This chapter explains the development of a morphological analyzer for the Tamil language
using a machine learning approach. Capturing the agglutinative structure of Tamil words
with an automatic system is a challenging job. Generally, rule based approaches are used
for building morphological analyzers. The Tamil morphological analyzer for nouns and
verbs is developed here using a state of the art machine learning approach.
The morphological analysis problem is redefined as a classification problem. The approach
is based on sequence labeling and training by kernel methods, which capture the
non-linear relationships of the morphological features from training data samples in a
better and simpler way. An SVM based tool is used for training the system on 6 lakh
morphologically tagged verbs and nouns. The same methodology is implemented for
other Dravidian languages like Malayalam, Telugu, and Kannada. Tamil pronouns and
proper nouns are handled using separate analyzer systems. Other word classes need not
be further analyzed for morphological features, so the POS tag information is
considered as the morphological information.