Adaptor Grammars
Author 1
Affiliation 1
email1@domain1.com
Abstract
In this paper we learn the complex agglutinative morphology of Indian languages using adaptor grammars and linguistic rules of morphology. Adaptor grammars are a compositional Bayesian framework for grammatical inference: we define a morphological grammar for agglutinative languages, and morphological boundaries are inferred from a corpus of plain text. Once the model produces a morphological segmentation, regular expressions encoding sandhi and orthographic rules are applied to obtain the final segmentation. We test our algorithm on three morphologically complex languages of the Dravidian family and evaluate the results against other state-of-the-art unsupervised morphology learning systems.
Introduction
Existing unsupervised systems produce poor results on these languages because they lack knowledge of the orthography and of morphological complexities such as sandhi, a morpho-phonemic change that occurs at word or morpheme boundaries during concatenation. In Section 1.1 we briefly discuss the morphological properties and orthography of major Dravidian languages that make unsupervised learning difficult. There have been some efforts to test Dravidian languages on state-of-the-art systems, such as vasudevanlittle and (Bhat, 2012), but they report poor results. These studies suggest that a combined rule-based and statistical model could work well on these languages.
Recent research in morphology learning has shifted toward semi-supervised learning, which produces better results than fully unsupervised learning, as in (Kohonen et al., 2010a) and (Kohonen et al., 2010b). Inspired by these works, we propose a semi-supervised morphological processing system based on adaptor grammars and linguistic rules to deal with the complex orthography of these languages. Our system combines statistical and rule-based methods.
Adaptor grammars are Bayesian non-parametric models that can express linguistic structural formalisms. They are a non-parametric version of the Probabilistic Context-Free Grammar (PCFG), designed for unsupervised structure learning, and have been used successfully in various natural language processing applications. In Section 1.2 we give an informal definition of adaptor grammars and their inference procedure.

We use adaptor grammars to learn a model of morphology; once the model produces output, we use regular expressions created from morphological rules to refine the results. The key idea is that, because these languages are agglutinative, suffixes are stacked together to form long words: the longer a word, the more morphemes it is likely to contain.
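The refinement step can be sketched in Python. The regex rules below are hypothetical illustrations of the kind of boundary adjustment involved; they are not the actual sandhi or orthographic rules of Kannada, Malayalam, or Tamil.

```python
import re

# Hypothetical sandhi rules as (pattern, replacement) regex pairs.
# Each rule operates on a segmentation string in which "+" marks a
# proposed morpheme boundary. These rules are illustrative only.
SANDHI_RULES = [
    # undo gemination introduced at a boundary: "k+k" -> "+k"
    (re.compile(r"([kctp])\+\1"), r"+\1"),
    # undo a glide inserted between vowels at a boundary: "ay+i" -> "a+i"
    (re.compile(r"([aeiou])y\+([aeiou])"), r"\1+\2"),
]

def refine(segmentation: str) -> str:
    """Apply each sandhi regex in order to a '+'-separated segmentation."""
    for pattern, repl in SANDHI_RULES:
        segmentation = pattern.sub(repl, segmentation)
    return segmentation
```

For example, `refine("pook+kal")` removes the boundary gemination targeted by the first rule and returns `"poo+kal"`.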
Adaptor Grammars
An adaptor grammar is a 7-tuple (N, W, R, S, θ, A, C), where (N, W, R, S, θ) is a PCFG. In this PCFG, N is the set of non-terminals and W the set of terminals, S ∈ N is the start symbol, R is the rule set, and θ is the vector of rule probabilities, with θ_r the probability of rule r ∈ R. A ⊆ N is the set of adapted non-terminals, and C is a vector of adaptors indexed by the elements of A, such that C_X is the adaptor for adapted non-terminal X ∈ A.
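As an illustration (this is a toy sketch, not the paper's actual grammar), a morphological grammar of this kind might look as follows, with Word as the adapted non-terminal so that whole stem-plus-suffix analyses are cached and reused:

```
Word     → Stem Suffixes        (Word ∈ A: adapted, whole analyses cached)
Stem     → Chars
Suffixes → Suffix Suffixes | Suffix
Suffix   → Chars
Chars    → Char Chars | Char
Char     → a | b | ... | z
```

Because Word is adapted, frequently seen segmentations of whole words are stored as units rather than re-derived from the PCFG rules each time.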
The adaptor C_X transforms the base distribution H_X, whose support is T_X, the set of subtrees rooted in X ∈ N. In an adaptor grammar, H_X is determined by the PCFG rules expanding X and the probability distribution θ; see (Johnson et al., 2006) for details. Various non-parametric stochastic processes can be used as adaptors, such as the Dirichlet Process; Johnson uses a Dirichlet Process adaptor for word segmentation of Sesotho (Johnson, 2008).
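The behavior of a Dirichlet Process adaptor can be sketched with its predictive distribution, the Chinese Restaurant Process: a previously generated value is reused with probability proportional to its count, and a fresh value is drawn from the base distribution with probability proportional to a concentration parameter. This is a minimal sketch of that sampling scheme, not the paper's inference procedure.

```python
import random

def crp_adaptor(base_sample, alpha, n_draws, seed=0):
    """Draw n_draws values from a Dirichlet Process via the Chinese
    Restaurant Process. With n values cached, an existing value is
    reused with probability n / (n + alpha), chosen in proportion to
    its count; otherwise a new value is drawn from base_sample."""
    rng = random.Random(seed)
    cache = []  # all previously generated values
    out = []
    for _ in range(n_draws):
        if cache and rng.random() < len(cache) / (len(cache) + alpha):
            # reuse: uniform choice over the cache (with duplicates)
            # is equivalent to sampling in proportion to counts
            out.append(rng.choice(cache))
        else:
            out.append(base_sample(rng))  # fresh draw from the base
        cache.append(out[-1])
    return out
```

The caching produces the rich-get-richer effect that lets an adapted non-terminal memorize frequent subtrees, such as whole word analyses, instead of regenerating them rule by rule.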
Adaptor grammars have been applied to various NLP tasks, such as word segmentation (Johnson, 2008), named entity recognition (Elsner et al., 2009), and machine transliteration (Wong et al., 2012).
For testing our method, we extracted a corpus of five million words for each language from Wikipedia and newspaper websites. The scripts of the languages were converted to 8-bit Extended ASCII to deal with the complex orthography; the conversion can be illustrated with the Malayalam word .tarcal.
[Table: segmentation scores for Kannada, Malayalam, and Tamil under the Morfessor baseline, NPY, Morfessor-CAP, undivided words, and the proposed adaptor grammar and rule system.]
We have presented a semi-supervised morphology learning technique that combines statistical measures and linguistic rules. The proposed method outperforms other state-of-the-art unsupervised morphology learning techniques. Another important aspect of our experiments is that we tested adaptor grammars on real-world data from highly agglutinative and complex languages. We also used a large amount of data to train the model, whereas previous experiments were carried out on toy corpora.
References
Kenneth R. Beesley. 1998. Arabic morphology using only finite-state operations. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 50-57. Association for Computational Linguistics.

Suma Bhat. 2012. Morpheme segmentation for Kannada standing on the shoulder of giants. In 24th International Conference on Computational Linguistics, page 79.

Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology.

Micha Elsner, Eugene Charniak, and Mark Johnson. 2009. Structured generative models for unsupervised named-entity clustering. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 164-172. Association for Computational Linguistics.

John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198.

Sharon Goldwater, Mark Johnson, and Thomas L. Griffiths. 2005. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, pages 459-466.

Harald Hammarström and Lars Borin. 2011. Unsupervised learning of morphology. Computational Linguistics, 37(2):309-350.

Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2006. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, pages 641-648.

Mark Johnson. 2008. Unsupervised word segmentation for Sesotho using adaptor grammars. In Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 20-27. Association for Computational Linguistics.