
Minimally Supervised Morphological Analysis
by Multimodal Alignment

David Yarowsky and Richard Wicentowski
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
Email: {yarowsky,richardw}@cs.jhu.edu

Abstract

This paper presents a corpus-based algorithm capable of inducing inflectional morphological analyses of both regular and highly irregular forms (such as brought→bring) from distributional patterns in large monolingual text with no direct supervision. The algorithm combines four original alignment models based on relative corpus frequency, contextual similarity, weighted string similarity and incrementally retrained inflectional transduction probabilities. Starting with no paired <inflection,root> examples for training and no prior seeding of legal morphological transformations, accuracy of the induced analyses of 3888 past-tense test cases in English exceeds 99.2% for the set, with currently over 80% accuracy on the most highly irregular forms and 99.7% accuracy on forms exhibiting non-concatenative suffixation.
1 Task Definition

This paper presents an original and successful algorithm for the nearly unsupervised induction of inflectional morphological analyzers, with a focus on highly irregular forms not typically handled by other morphology induction algorithms. It is useful to consider this task as three separate steps:

1) Estimate a probabilistic alignment between inflected forms and root forms in a given language.

2) Train a supervised morphological analysis learner on a weighted subset of these aligned pairs.

3) Use the result of Step 2 as either a stand-alone analyzer or a probabilistic scoring component to iteratively refine the alignment in Step 1.

Thus, while this paper will discuss our implementation of a stand-alone probabilistic analyzer and retraining process in Steps 2 and 3, the challenge of large-coverage inflection-root alignment expressed in Step 1 is the core of this work.

The target output of Step 1 is an inflection-root mapping such as shown in Table 1, with optional columns giving the hypothesized stem change and suffix analysis and part of speech.

  root    stem change   suffix   inflection   pos
  take    ake → ook     +        took         vbd
  take    e →           +ing     taking       vbg
  take    →             +s       takes        vbz
  take    e →           +en      taken        vbn
  skip    → p           +ed      skipped      vbd
  defy    y → i         +ed      defied       vbd
  defy    y → ie        +s       defies       vbz
  defy    →             +ing     defying      vbg
  jugar   gar → eg      +a       juega        vpi3s
  jugar   gar → eg      +an      juegan       vpi3p
  jugar   ar →          +amos    jugamos      vpi1p
  tener   ener → ien    +en      tienen       vpi3p

                   Table 1

This transformational model is not, as given, sufficient for languages with prefixal, infixal and reduplicative morphologies. But it is remarkably productive across Indo-European languages in its current form and can be extended to other affixational schema when appropriate.
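To make the transformational model of Table 1 concrete, the following minimal sketch (our own illustration, not the authors' implementation; the class name Analysis is hypothetical) shows how a <root, stem change, suffix> triple deterministically generates its surface inflection:

from dataclasses import dataclass

@dataclass
class Analysis:
    root: str
    alpha: str    # root-final substring to rewrite ("" for pure suffixation)
    beta: str     # replacement for alpha
    suffix: str   # canonical suffix, written without the leading "+"
    pos: str

    def inflection(self) -> str:
        # apply the stem change alpha -> beta at the end of the root,
        # then concatenate the canonical suffix
        assert self.root.endswith(self.alpha)
        stem = self.root[:len(self.root) - len(self.alpha)]
        return stem + self.beta + self.suffix

# rows of Table 1:
print(Analysis("take", "ake", "ook", "", "vbd").inflection())      # took
print(Analysis("skip", "", "p", "ed", "vbd").inflection())         # skipped
print(Analysis("defy", "y", "i", "ed", "vbd").inflection())        # defied
print(Analysis("jugar", "gar", "eg", "an", "vpi3p").inflection())  # juegan

Note that the mapping is deterministic in this direction only; recovering the analysis from an observed inflection is the alignment problem addressed below.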
For many applications, once the vocabulary list achieves sufficiently broad coverage this alignment table effectively becomes a morphological analyzer simply by table lookup (independent of necessary contextual ambiguity resolution). While the generative or analytical system in Step 2 remains useful for previously unseen words, such words are typically quite regular, and most of the difficult substance of the lemmatization problem can often be captured by a large inflection/POS/root mapping table and a simple transducer to handle residual forms. This is of course not the case for agglutinative languages such as Turkish or Finnish, or for very highly inflected languages such as Czech, where sparse data becomes an issue. But for many languages, and to a quite practical degree, inflectional morphological analysis and generation can be viewed as an alignment task in a broad-coverage wordlist.
1.1 Required and Optional Resources

In further clarification of the task description, the morphological induction described in this paper assumes and is based on only the following limited set of (often optional) available resources:

(a) A table (such as Table 2) of the inflectional parts of speech of the given language, along with a list of the canonical suffixes for each part of speech. These suffixes not only serve as mnemonic tags for the POS labels, but they can also be used to obtain a noisy set of candidate examples for each part of speech. [1]

(b) A large unannotated text corpus.

(c) A list of the candidate noun, verb and adjective roots of the language (typically obtainable from a dictionary), and any rough mechanism for identifying the candidate parts of speech of the remaining vocabulary based on aggregate models of context or tag sequence, not morphological analysis. Our concurrent work (Cucerzan and Yarowsky, 2000) focuses on the problem of bootstrapping approximate tag probability distributions by modelling relative word-form occurrence probabilities across indicative lexical contexts (e.g. "the <noun> are" and "been <vbg> the"), among other predictive variables, with the goal of co-training with the models presented here. It is not necessary to select the part of speech of a word in any given context, only to provide an estimate of the candidate tag distributions across a full corpus. The source of these candidate tag estimates is unimportant, however, and the lists can be quite noisy. Their major function is to partially limit the potential alignment space from unrestricted word-to-word alignments across the entire vocabulary.

(d) The current implementation assumes a list of the consonants and vowels of the language.

(e) While not essential to the execution of the algorithm, a list of common function words of the given language is useful to the extraction of context similarity features.

(f) If available, the various distance/similarity tables generated by this algorithm on previously studied languages can be useful as seed information, especially if these languages are closely related (e.g. Spanish and Italian).

[1] The lists need not be exhaustive, and any missing irregular suffixes (e.g. +t) can be captured via a stem change and null suffix (e.g. send: d→t + ⇒ sent, similar to the representation of take: ake→ook + ⇒ took).

English:
  Part of Speech        VB        VBD        VBZ        VBG         VBN
  Canonical Suffixes    +         +ed (+t)   +s         +ing        +ed (+t) +en
  Examples              jump      jumped     jumps      jumping     jumped
  (not used in          announce  announced  announces  announcing  announced
  training)             take      took       takes      taking      taken

Spanish:
  Part of Speech        VRoot   VPI1s   VPI2s   VPI3s   VPI1p   VPI2p   VPI3p
  Canonical             +ar     +o      +as     +a      +amos   +áis    +an
  Suffixes              +er             +es     +e      +emos   +éis    +en
                        +ir                             +imos   +ís

                   Table 2
2 Related Work

There is a rich tradition of supervised and unsupervised learning in the domain of morphology. Rumelhart and McClelland (1986), Pinker and Prince (1988) and Sproat and Egedi (1991) performed early studies on the fully supervised learning of the English past tense from paired training data, using approximate models of phonological distance.

Brent (1993, 1999), de Marcken (1995), Kazakov (1997) and Goldsmith (2000) have each focused on the problem of unsupervised learning of morphological systems as essentially a segmentation task, yielding a morphologically plausible and statistically motivated partition of stems and affixes. Brent and de Marcken both have used a minimum description length framework, with the primary goal of inducing lexemes from boundaryless speech-like streams. Goldsmith specifically sought to induce suffix paradigm classes (e.g. NULL.ed.ing vs. e.ed.ing vs. e.ed.es.ing vs. ted.tion) from raw text. However, handling of irregular words was largely excluded from this work, as Goldsmith assumed a strictly concatenative morphology without models for stem changes.

Morphology induction in agglutinative languages such as Turkish and Finnish presents a problem similar to parsing or segmenting a sentence, given the long strings of affixations allowed and the relatively free affix order. Voutilainen (1995) has approached this problem in a finite-state framework, and Hakkani-Tür et al. (2000) have done so using a trigram tagger, with the assumption of a concatenative affixation model.

The two-level model of morphology (Koskenniemi, 1983) has been extremely successful in manually capturing the morphological processes of the world's languages. The context-sensitive stem-change models used in this current paper have been partially inspired by this framework. For example, a two-level equivalent capturing happy + er = happier is y:i ⇔ p:p ___, quite similar in spirit and function to our probabilistic model P(y→i | ...app, +er). Theron and Cloete (1997) sought to learn a two-level rule set for English, Xhosa and Afrikaans by supervision from O(4000) aligned inflection-root pairs extracted from dictionaries. Single-character insertions and deletions were allowed, and the learned rules supported both prefixation and suffixation. Their supervised learning approach could be applied directly to the aligned pairs induced in this paper.

Finally, Oflazer and Nirenburg (1999) have developed a framework to learn two-level morphological analyzers from interactive supervision in an Elicit-Build-Test loop under the Boas project. Humans provide as-needed feedback regarding errors and omissions. Recently applied to Polish, the model also assumes concatenative morphology and treats non-concatenative irregular forms through table lookup.

Thus there is a notable gap in the research literature for induction of analyzers for irregular morphological processes, including significant stem changing. The algorithm described below directly addresses this gap, while successfully inducing more regular analyses without supervision as well.
3 Lemma Alignment by Frequency Similarity

The motivating dilemma behind our approach to morphological alignment is the question of how one determines that the past tense of sing is sang and not singed. The pairing sing→singed requires only simple concatenation with the canonical suffix, +ed, and singed is indeed a legal word in our vocabulary (the past tense of singe). And while few irregular verbs have a true word occupying the slot that would be generated by a regular morphological rule, a large corpus is filled with many spelling mistakes or dysfluencies such as taked (observed with a frequency of 1), and such errors can wreak havoc in naïve alignment-based methods.

How can we overcome this problem? Relative corpus frequency is one useful evidence source. Observe in Table 3 that in an 80 million word collection of newswire text the relative frequency distribution of sang/sing is 1427/1204 (or 1.19/1), which indicates a reasonably close frequency match, while the singed/sing ratio is 0.007/1, a substantial disparity.

  VBD/VB pair    Freq(VBD)/Freq(VB)   VBD/VB   ln(VBD/VB)
  sang/sing      1427/1204              1.19      0.17
  singed/sing    9/1204                 0.007    -4.90
  singed/singe   9/2                    4.5       1.50
  sang/singe     1427/9               158.5       5.06
  All VBD/VB                             .85     -0.16

                   Table 3

However, simply looking for close relative frequencies between an inflection and its candidate root is inappropriate, given that some inflections are relatively rare and expected to occur much less frequently than the root form. Thus in order to be able to rank the sang/sing and singed/sing candidates effectively, it is necessary to be able to quantify how well each fits (or deviates from) expected frequency distributions. To do so, we use simple non-parametric statistics to calculate the probability of a particular VBD/VB ratio by examining how frequently other such ratios in a similar range have been seen in the corpus. Figure 1 illustrates such a histogram (based on the log of the ratios to focus more attention on the extrema). The histogram is then smoothed and normalized as an approximation of the probability density function for this estimator (log(VBD/VB)), which we can then use to quantify to what extent a given candidate log(VBD/VB), such as log(sang/sing) = .17, fits our empirically motivated expectations. The relative position of the candidate pairings on the graph suggests that this estimator is indeed informative given the task of ranking potential root-inflection pairings.

[Figure 1: Using the log(VBD/VB) estimator to rank potential vbd/vb pairs in English. The histogram of log(VBD/VB) ratios peaks near 0; labeled candidate pairings include took/take (-0.35), sang/sing (0.17), singed/singe (1.5), singed/sing (-4.9), sang/singe (5.1) and taked/take (-10.5).]
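This non-parametric scoring can be sketched as follows (our own simplification: fixed-width bins and add-one smoothing stand in for the paper's unspecified smoothing procedure):

import math
from collections import Counter

def log_ratio_density(aligned_pairs, freq, nbins=40, lo=-10.0, hi=10.0):
    # histogram the log frequency ratios of pairs assumed aligned so far
    width = (hi - lo) / nbins
    counts = Counter()
    for infl, root in aligned_pairs:
        r = math.log(freq[infl] / freq[root])
        counts[min(nbins - 1, max(0, int((r - lo) / width)))] += 1
    total = sum(counts.values()) + nbins          # add-one smoothing
    return [(counts[b] + 1) / (total * width) for b in range(nbins)], lo, width

def ratio_score(infl, root, freq, density, lo, width):
    r = math.log(freq[infl] / freq[root])
    return density[min(len(density) - 1, max(0, int((r - lo) / width)))]

freq = {"sang": 1427, "singed": 9, "sing": 1204}
density, lo, width = log_ratio_density([("sang", "sing")], freq)
# ratio_score("singed", "sing", ...) is low: log(9/1204) = -4.9 falls in
# the far tail of the empirically observed ratios.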
However, estimating these distributions presents a problem given that the true alignments (and hence frequency ratios) between inflections and roots are not assumed to be known in advance. Thus to approximate this distribution automatically, we make the simplifying assumption that the frequency ratios between inflections and roots (largely an issue of tense and usage) are not significantly different between regular and irregular morphological processes.

Table 4 and Figure 2 illustrate that this simplifying assumption is supported empirically. Despite large lemma frequency differences between regular and irregular English verbs, their relative tense ratios for both VBD/VB and VBG/VB are quite similar in terms of their means and density functions.

  VerbType    VBD/VB   VBG/VB   Avg. Lemma Freq
  Regular     .847     .746       861
  Irregular   .842     .761     17406

                   Table 4

[Figure 2: Distributional similarity between regular and irregular forms for vbd/vb: the density functions of ln(VBD/VB) for regular and irregular verbs largely coincide.]

Thus we initially approximate the VBD/VB ratios from an automatically extracted (and noisy) set of verb pairs exhibiting simple and uncontested suffixation with the canonical +ed suffix. This distribution is re-estimated as alignments improve, but a single function continues to predict frequency ratios of unaligned (largely irregular) word pairs from the observed frequency of previously aligned (and largely regular) ones.

Furthermore, we are not just limited to using the ratio POSi/VB to predict the expected frequency of POSi in the corpus. The expected frequency of a viable past-tense candidate for sing should also be estimable from the frequency of any of the other inflectional variants. Assuming that earlier iterations of the algorithm had filled the sing lemma slots for vbg and vbz in Table 5 with regular inflections, VBD/VBG and VBD/VBZ could also be used as estimators. Figure 3 shows the histogram for the estimator log(VBD/VBG). [3]

  Lemma: SING    VB     VBD   VBG       VBZ     VBN
  Word           sing   ?     singing   sings   ?
  Freq           1204   ?     1381      344     ?

                   Table 5

[Figure 3: Using the log(VBD/VBG) estimator to rank potential vbd-vbg matches in English; labeled candidate pairings include took/taking (0.6), sang/singing (1.0), singed/singeing (2.2), singed/singing (-5), sang/singeing (7.3) and taked/taking (-9.5).]

[3] Using this estimate, we predict a frequency E(VBD) = 1567, which is an overestimate relative to the true 1427. In contrast, the distribution for VBD/VBZ is considerably more noisy, given the problems with VBZ forms being confused with plural nouns. This latter measure yields an underestimate of 1184.
We are also not limited to using only a single estimator. In fact, there are considerable robustness advantages to be gained by taking the average of estimators, especially for highly inflected languages where the observed frequency counts may be relatively small. To accomplish this in a general framework, we first estimate the hidden variable of total lemma frequency (LF̂) via a confidence-weighted average of the observed POSi frequency and a globally estimated POSi/LF̂ model. Then all subsequent POSi frequency estimations can be made relative to POSi/LF̂, or a somewhat advantageous variant, log(POSi/(LF̂ − POSi)), with this distribution illustrated in Figure 4. Another advantage of this consensus approach is that it only requires T rather than T² estimators, especially important as the inflectional tagset T grows quite large in some languages.

[Figure 4: Using the log(VBD/(LF − VBD)) estimator to rank potential vbd-lemma matches in English.]
spelling mistakes or dysfluencies such as taked (ob- ranking potential root-inflection pairings. prove, but a single function continues to predict
3
served with a frequency of 1), and such errors can However, estimating these distributions frequency ratios of unaligned (largely irregular) Using this estimate, we predict a frequency
wreak havoc in naı̈ve alignment-based methods. E(VBD)=1567, which is an overestimate relative to
presents a problem given that the true alignments word pairs from the observed frequency of previ-
the true 1427. In contrast, the distribution for VV BD
How can we overcome this problem? Rel- (and hence frequency ratios) between inflections ously aligned (and largely regular) ones. is considerably more noisy, given the problems with
BZ

ative corpus frequency is one useful evidence are not assumed to be known in advance. Thus Furthermore, we are not just limited to using VBZ forms being confused with plural nouns. This
source. Observe in Table 3, that in an 80 mil- to approximate this distribution automatically, the ratio P OSi /V B to predict the expected fre- latter measure yields a underestimate of 1184.
4 Lemma Alignment by Context Similarity

A second powerful measure for ranking the potential alignments between morphologically related forms is based on the contextual similarity of the candidate forms. For this measure, we computed traditional cosine similarity between vectors of weighted and filtered context features. While this measure also gives relatively high similarity to semantically similar words such as sip and drink, it is rare even for synonyms to exhibit more similar and idiosyncratic argument distributions and selectional preferences than inflectional variants of the same word (e.g. sipped, sipping and sip). A primary goal in clustering inflectional variants of verbs is to give predominant vector weight to the head-noun objects and subjects of these verbs.
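The cosine measure itself is standard; a minimal sketch over bag-of-context-word vectors (our own toy illustration, omitting the paper's feature weighting, filtering and positional modeling):

import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# toy context-feature counts:
ctx_sang = Counter({"anthem": 4, "choir": 3, "song": 5})
ctx_sing = Counter({"anthem": 3, "song": 6, "choir": 2})
ctx_singed = Counter({"flame": 2, "hair": 3})
assert cosine(ctx_sang, ctx_sing) > cosine(ctx_sang, ctx_singed)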
However, to make this work maximally language independent we approximated these positions by a small set of extremely simple regular expressions over parts of speech, initially including closed-class POS's and residual content words (CW), e.g.:

  CWsubj (AUX|NEG)* Vkeyword DET? CW* CWobj

These expressions will clearly both extract significant noise and fail to match many legitimate contexts, but as they can be applied to a potentially unlimited monolingual corpus, the signal-to-noise ratio is tolerable. Ideally, one would also aim to identify automatically which set of patterns is appropriate for a given language, but this could be accomplished in subsequent iterations of the algorithm by taking previously extracted <inflection,root> pairs and testing which set of extraction patterns is most effective in maximizing the mean context-similarity of the <inflection,root> pairs relative to non-pairs. Similar techniques can be used to weight the relative importance of contextual positions. [5]

For similar reasons, it is useful in subsequent iterations of the algorithm to apply the current analysis modules towards lemmatizing the contextual feature sets. This has the effect of both condensing the contextual signal and removing potentially distracting correlations with inflectional forms in context.

[5] Another important concept in context similarity measures for morphology that differs from other word clustering measures is the need to downweight or eliminate context words such as subject pronouns that strongly correlate with only one or a few inflectional forms. Giving such words too much weight can cause different verbs of the same person/number to appear more similar to each other than do the different inflections of the same verb. Filtering based on high cross-lemma distributional entropy for a given context word can help eliminate these counter-productive features.

5 Lemma Alignment by Weighted Levenshtein Distance

The third alignment similarity function considers overall stem edit distance using a weighted Levenshtein measure. In morphological systems worldwide, vowels and vowel clusters are relatively mutable through morphological processes, while consonants in general tend to have a lower probability of change during inflection. Rather than treating all string edits as equal, a cost matrix of the form shown in Table 6 is utilized, with initial distance costs δ1 = v+v, δ2 = vclust+vclust, δ3 = c+c and δ4 = c+v, initially set to (0.5, 0.6, 1.0, 0.98), a relatively arbitrary assignment reflecting this tendency. However, as subsequent algorithm iterations proceed, this matrix is re-estimated with empirically observed character-to-character stem-change probabilities from the algorithm's current best weighted alignments.

        a    o    ue   m    n    ...
  a     0    δ1   δ2   δ4   δ4   ...
  o     δ1   0    δ2   δ4   δ4   ...
  ue    δ2   δ2   0    δ4   δ4   ...
  m     δ4   δ4   δ4   0    δ3   ...
  n     δ4   δ4   δ4   δ3   0    ...
  ...   ...  ...  ...  ...  ...  ...

                   Table 6
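A minimal sketch of such a class-weighted edit distance (our own illustration: the full system uses a per-character-pair matrix re-estimated across iterations and treats vowel clusters such as ue as units (δ2), both omitted here; the unit insertion/deletion cost is our assumption):

VOWELS = set("aeiou")
DELTA = {("v", "v"): 0.5, ("c", "c"): 1.0, ("c", "v"): 0.98, ("v", "c"): 0.98}

def sub_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    cls = lambda ch: "v" if ch in VOWELS else "c"
    return DELTA[(cls(x), cls(y))]

def weighted_levenshtein(a: str, b: str, indel: float = 1.0) -> float:
    # standard dynamic program with class-sensitive substitution costs
    d = [[j * indel for j in range(len(b) + 1)]]
    for i, x in enumerate(a, 1):
        row = [i * indel]
        for j, y in enumerate(b, 1):
            row.append(min(d[i - 1][j] + indel,               # delete x
                           row[j - 1] + indel,                # insert y
                           d[i - 1][j - 1] + sub_cost(x, y))) # substitute
        d.append(row)
    return d[-1][-1]

# the vowel alternation in sang/sing costs less than the consonant
# change in sang/sand:
assert weighted_levenshtein("sang", "sing") < weighted_levenshtein("sang", "sand")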
More optimally, the initial state of this matrix could be seeded with values partially borrowed from previously trained matrices from other related languages. Alternately, the initial distances could be set partially sensitive to phonological similarities, with dist(/d/,/t/) < dist(/d/,/f/) for example, although this particular distinction emerges readily via iterative re-estimation from the baseline model.

6 Lemma Alignment by Morphological Transformation Probabilities

The goal of this research is not only to extract an accurate table of inflection-root alignments, but also to generalize this mapping function via a generative probabilistic model. The following section describes the creation of this model, as well as how the context-sensitive probability of each morphological transformation can be used as the fourth alignment similarity measure.

At each iteration of the algorithm, this probabilistic mapping function is trained on the table output of the previous iteration, equivalent to the information in Table 1 (e.g. <root,inflection> pairs with optional part-of-speech tags, confidence scores and stemchange+suffix analysis). [6] From this output, we cluster the observed stem changes by the variable-length root context in which they were applied, as illustrated in Table 7.

  Root      Stem                Matching
  Context   Change    Suffix    Count     Examples
  ..ray     →         +ed          5      spray, stray,...
  ...ay     →         +ed         13      play, spray,...
  ...oy     →         +ed          3      annoy, enjoy,...
  ...ey     →         +ed          5      obey, key,...
  ...fy     y→i       +ed         21      beautify,...
  ...ry     y→i       +ed          7      carry,...
  ...dy     y→i       +ed          4      bloody,...
  ...y      y→i       +ed         43      carry,...
  ...y      →         +ed         21      spray,...
  ...y      →         +ing        83      carry, spray,...
  ...e      e→        +ed        728      dance,...
  ...e      e→        +ing       783      dance, take,...
  ...e      →         +ing         1      singe
  ...ke     ake→ook   +            3      take, shake,...
  ...ke     ake→oke   +            1      wake
  ...ke     ke→de     +            1      make
  ...ay     y→id      +            2      lay, pay
  ...y      y→id      +            2      lay, pay

                   Table 7

[6] If only the pairs are given, with no stemchange+suffix analysis, this analysis can be generated deterministically by removing the longest matching canonical suffix from the inflection and generating the minimal α → β + σ transformation capturing the remaining stem difference.

First note that because the triple of <root> + <stemchange> + <suffix> uniquely determines a resulting inflection, one can effectively compute P(inflection | root, suffix, POS) by P(stemchange | root, suffix, POS), i.e. for any root = γα, suffix = +σ and inflection = γβσ, P(γβσ | γα, +σ, POS) = P(α→β | γα, +σ, POS). Using statistics such as shown in Table 7, it is thus possible to compute the generation (or alignment) probability for an inflection given root and suffix using the simple interpolated backoff model in (1), where λi is a function of the relative sample size of the conditioning event, and lastk(root) indicates the final k characters of the root.

  P(inflection | root, suffix, POS)
    = P(α→β | root, suffix, POS)
    ≈ λ1 P(α→β | last3(root), suffix, POS)
      + (1 − λ1)(λ2 P(α→β | last2(root), suffix, POS)
      + (1 − λ2)(λ3 P(α→β | last1(root), suffix, POS)
      + (1 − λ3)(λ4 P(α→β | suffix, POS)
      + (1 − λ4) P(α→β))))                               (1)
We only back off to the extent necessary. Furthermore, note that for English (and most inflections in Spanish), the stem changes observed when adding suffixes are independent of part of speech (i.e. +s behaves the same on suffixation for both nouns and verbs), so these probabilities can often be further simplified by deleting the conditioning variable POS, as illustrated in (2).

  P(solidified | solidify, +ed, VBD)
    = P(y→i | solidify, +ed, VBD)
    ≈ P(y→i | solidify, +ed)
    ≈ λ1 P(y→i | ify, +ed)
      + (1 − λ1)(λ2 P(y→i | fy, +ed)
      + (1 − λ2)(λ3 P(y→i | y, +ed)
      + (1 − λ3)(λ4 P(y→i | +ed)
      + (1 − λ4) P(y→i))))                               (2)

We have further generalized these variable-length context models via a full hierarchically-smoothed trie architecture, allowing robust specialization to very long root contexts if sample sizes are sufficient.
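The interpolated backoff in (1)-(2) can be sketched as follows (our own reconstruction; the λ = n/(n + K) weighting by conditioning-event sample size, with constant K, is an assumed functional form):

from collections import defaultdict

K = 5.0
# counts[(context, suffix)][change] = weighted number of aligned training
# pairs, e.g. counts[("ify", "ed")]["y->i"] (cf. Table 7)
counts = defaultdict(lambda: defaultdict(float))

def p_change(change: str, root: str, suffix: str, p_uncond: float) -> float:
    prob = p_uncond  # global P(alpha -> beta) is the final backoff level
    # fold in progressively more specific conditioning events:
    # suffix only, then last1(root), last2(root), last3(root)
    for k in (0, 1, 2, 3):
        ctx = root[-k:] if k else ""
        dist = counts[(ctx, suffix)]
        n = sum(dist.values())
        lam = n / (n + K)
        mle = dist[change] / n if n else 0.0
        prob = lam * mle + (1 - lam) * prob
    return prob

counts[("y", "ed")]["y->i"] = 43.0
counts[("y", "ed")][""] = 21.0       # no stem change, e.g. spray + ed
counts[("ify", "ed")]["y->i"] = 21.0
print(p_change("y->i", "solidify", "ed", p_uncond=0.01))  # ~0.93

Each level pulls the estimate toward its maximum-likelihood value only to the extent that its sample size warrants, which is the "back off only to the extent necessary" behavior described above.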
6.1 Baseline Model for Morphological Transformation Probabilities

On the first iteration, no inflection/root pairs are available for estimating the above models. As prior knowledge is not available regarding α → β stem-change probabilities, an assumption is made that the cost of each is proportional to the previously described Levenshtein distance between α and β, with the cost of a change increasing geometrically as the distance from the end of the root increases. The rate of this cost increase ultimately depends on the tendency of the language to allow word-internal spelling changes (as in Spanish or Arabic), or strongly favor changes at the point of affixation (as in English).

6.2 Model Improvement by Iterative Re-estimation

The primary goal of iterative retraining is to refine the core morphological transformation model, which not only serves as one of the four similarity models, but is also a primary deliverable of the learning process.

As subsequent iterations proceed, the stem-change probability models are retrained on the output of the prior iteration, weighting each training example with its alignment confidence, and filtering out α → β changes without a minimum level of support to help reduce noise. The final stem-change probabilities then are an interpolation between the trained model Pj and the initial baseline model (P0) described in Section 6.1:

  P(α→β | root, suffix, POS)
    = λj P0(α→β | suffix)
    + (1 − λj) Pj(α→β | root, suffix, POS)

The Levenshtein distance models are re-estimated as observed in Section 5, while the context similarity model can be improved through better self-learned lemmatization of the modelled context words.
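A minimal sketch of this interpolation (our own illustration; the 1/(j+1) schedule for λj is a hypothetical choice, as the paper states only that the trained model gains relative weight as it becomes better trained):

def interpolated_stem_change_prob(p0: float, pj: float, iteration: int) -> float:
    # mix the fixed baseline P0 with the retrained model Pj, shifting
    # weight toward Pj as iterations proceed
    lam_j = 1.0 / (iteration + 1)   # iteration 0 -> all baseline
    return lam_j * p0 + (1.0 - lam_j) * pj

# e.g. P(ake -> ook | ..., +, vbd) rises across iterations as confident
# alignments like took/take and shook/shake enter the training table:
for j in range(4):
    print(j, interpolated_stem_change_prob(p0=0.001, pj=0.46, iteration=j))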
7 Lemma Alignment by Model Combination and the Pigeonhole Principle

As shown empirically below, no one model is sufficiently effective on its own. We applied traditional classifier combination techniques to merge the four models' scores, scaling each to achieve compatible dynamic range. The Frequency, Levenshtein and Context similarity models retain equal relative weight as training proceeds, while the Morphological Transformation similarity model increases in relative weight as it becomes better trained.

Table 8 demonstrates the combined measures in action, showing the relative rankings of candidate roots for the inflections took, shook and juegan by the four similarity models after the first iteration (in columns 2-5). The overall consensus similarity measure at the end of Iteration 1 is shown in column 1. [7]

Note that even though only one of the four estimators independently ranked shake as the most likely root of shook, after only the first iteration the consensus choice is correct. The final column of Table 8 shows the retrained MorphTrans similarity measure after convergence. Based on training evidence from the confidently aligned pairs took/take, shook/shake and undertook/undertake from previous iterations, the probability of ake→ook has increased significantly, further increasing the confidence in the overall alignments at convergence (not shown), but not changing the previously correct ranking in these cases.

[7] In addition to the consensus similarity score in subcolumn 2, subcolumn 3 shows the average of the ranks of the candidate root given the inflection and the ranks of the candidate inflection given the root. This bidirectional average ranking score favors cases where attraction between root and inflection is mutual, and disfavors cases where higher-ranked competition exists for a root's attentions, effectively capturing a weak form of the pigeonhole principle. Thus it was used as the primary ranking criterion (over raw similarity score).

Candidate roots for the English inflection TOOK (1st iteration):

  Overall Similarity      Context         Frequency       Levenshtein     MorphTrans        MorphTrans
  (Iteration 1)           Similarity      Similarity      Similarity      Similarity (1)    Similarity (C)
  take  .00162  3.8  1    take   .849     take   .072     toot  .333      toot  .002593     take  .465578
  turn  .00081  8.7  2    turn   .546     tell   .028     tool  .333      tool  .002593     toot  .001296
  tell  .00063 15.9  3    toke   .482     turn   .016     toe   .310      tong  .000096     tool  .001296
  test  .00041 19.6  4    tide   .357     talk   .014     take  .290      tone  .000096     tong  .000048
  talk  .00051 21.0  5    tower  .332     test   .001     top   .236      ...   ...         tone  .000048
  tie   .00044 26.7  6    touch  .324     teach  .001     toil  .236      take  .000006     tout  .000048

Candidate roots for the English inflection SHOOK (1st iteration):

  Overall Similarity      Context         Frequency       Levenshtein     MorphTrans        MorphTrans
  (Iteration 1)           Similarity      Similarity      Similarity      Similarity (1)    Similarity (C)
  shake   .00149  5.5  1  shake   .854    share  .073     shoo   .500     shoot  .002593    shake  .465578
  shoot   .00126  9.3  2  shave   .323    ship   .068     shoot  .333     shoo   .002593    shoot  .001296
  ship    .00104 16.3  3  shape   .210    shift  .062     shoe   .310     shock  .000096    shoo   .001296
  shatter .00061 18.9  4  shore   .194    shop   .060     shake  .290     short  .000096    shock  .000048
  shop    .00094 19.8  5  shower  .184    shake  .058     shop   .236     shout  .000095    short  .000048
  shut    .00081 20.6  6  shoot   .162    shut   .052     shout  .236     ...    ...        shove  .000048
  shun    .00039 20.7  7  shock   .154    shoot  .051     show   .236     shake  .000003    shore  .000048

Candidate roots for the Spanish inflection JUEGAN (1st iteration):

  Overall Similarity      Context            Frequency       Levenshtein     MorphTrans
  (Iteration 1)           Similarity         Similarity      Similarity      Similarity (1)
  jugar   .0024  1        jugar      .88     jugar   .063    jugar   .50     jugar   .00129
  juzgar  .0006  2        juntar     .38     juzgar  .015    juzgar  .29     jogar   .00129
  jurar   .0002  4        jurar      .26     jogar   .009    juntar  .25     juntar  .00004
  jogar   .0000  5        justificar .22     juntar  .004    jurar   .18     juzgar  .00004

                   Table 8

The final alignment constraint that we pursued was based on the pigeonhole principle. This principle suggests that for a given part of speech, a root should not have more than one inflection, nor should multiple inflections in the same part of speech share the same root. There are, of course, exceptions to this tendency, such as travelled/traveled and dreamed/dreamt, which are observed as variant forms of their respective roots. The extent to which such overlaps should be penalized depends of course on the likelihood of variant forms in the morphology, but for Spanish and English the probability of seeing variant forms is relatively small.

We exploited the pigeonhole principle in two ways simultaneously. The first is a greedy algorithm, in which candidates are aligned in order of decreasing score, and when the first-choice root for a given inflection has already been taken by another inflection of the same part of speech, the algorithm continues until a free slot is found. The exception is when the highest-ranking free form scores several orders of magnitude lower than the first choice; here the first-choice alignment is assumed to be correct, but a variant form.
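The greedy use of the pigeonhole principle can be sketched as follows (our own reconstruction; the variant_ratio threshold standing in for "several orders of magnitude" is a hypothetical parameter):

def greedy_align(ranked, variant_ratio=1000.0):
    """ranked: {inflection: [(root, score), ...] sorted by decreasing score},
    all of one part of speech."""
    taken, alignment = set(), {}
    # inflections with stronger first choices claim roots first
    for infl in sorted(ranked, key=lambda w: -ranked[w][0][1]):
        first_root, first_score = ranked[infl][0]
        free = next(((r, s) for r, s in ranked[infl] if r not in taken), None)
        if free is None or first_score / max(free[1], 1e-12) > variant_ratio:
            alignment[infl] = (first_root, "variant")  # keep the taken root
        else:
            alignment[infl] = (free[0], "aligned")
            taken.add(free[0])
    return alignment

ranked = {"dreamed": [("dream", 0.9), ("drum", 0.00001)],
          "dreamt":  [("dream", 0.5), ("drum", 0.00001)]}
print(greedy_align(ranked))
# dreamed claims dream; dreamt's best free root scores orders of magnitude
# lower, so dreamt is kept on dream as a variant form.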
8 Empirical Evaluation

Current empirical evaluation of this work focuses on its accuracy in analyzing the often highly irregular past tense of English verbs. Consistent with prior empirical studies in this field, evaluation was performed on a test set of 3888 inflected words, including 128 highly irregular inflections, 1877 cases where the past tense was formed by simple concatenative suffixation, and 1883 inflections exhibiting a non-concatenative stem change such as gemination or elision. In execution, for each test inflected form, the analysis algorithm was free to consider alignment to any word in the corpus which had been identified as a potential root verb by the part-of-speech tagging process or occurrence in a dictionary-derived rootlist, not just those roots in the test set. It is thus a more challenging evaluation than testing simple alignment accuracy between two clean and extraneous-entry-free wordlists.

Table 9 shows the performance of several of the investigated similarity measures. Frequency similarity (FS), enhanced Levenshtein (LS), and Context similarity (CS) alone achieve only 10%, 31% and 28% overall accuracy respectively. However, the hypothesis that these measures model independent and complementary evidence sources is supported by the roughly additive combined accuracy of 71.6%. [8]

  Combination               Iter-      All Words   Highly Irregular   Simple Concat.   Non-Concat.
  of Similarity Models      ations     (3888)      (128)              (1877)           (1883)
  FS (Frequency Sim)        (Iter 1)     9.8        18.6                8.8              10.1
  LS (Levenshtein Sim)      (Iter 1)    31.3        19.6               20.0              34.4
  CS (Context Sim)          (Iter 1)    28.0        32.8               30.0              25.8
  CS+FS                     (Iter 1)    32.5        64.8               32.0              30.7
  CS+FS+LS                  (Iter 1)    71.6        76.5               71.1              71.9
  CS+FS+LS+MS               (Iter 1)    96.5        74.0               97.3              97.4
  CS+FS+LS+MS               (Convg)     99.2        80.4               99.9              99.7

                   Table 9

[8] In fact, in many cases the consensus ranking choice is correct when each independent model's first choice is wrong, actually yielding a small synergistic super-additivity.

The final performance of the full converged CS+FS+LS+MS model, at 99.2% accuracy on the full test set and 99.7% accuracy on inflections requiring analysis beyond simple concatenative suffixation, is quite remarkable given that the algorithm had absolutely no <inflection,root> examples as training data, and had no prior inventory of stem changes available, with only a slight statistical bias in favor of shorter stem changes with smaller Levenshtein distance, and with the minimal search-simplifying assumption in all the models that candidate alignments must begin with the same V*C* prefix.

Given a starting point where all single-character X→Y changes at the point of suffixation are equally likely, by the end of the first iteration the processes of elision (e→), gemination (e.g. →d in the context of d), and the y→i shift (in the context of a preceding consonant, not vowel) all emerge with high probability in their appropriate contexts, and low probability elsewhere.

Inducing analyses of the highly irregular inflections is of course much more difficult, but even the context and frequency-based models (CS+FS) alone actually perform relatively well here, given that the typically higher frequency of irregular forms causes the context signatures to be better estimated and the observed frequency discrepancies to be typically more significant. [9]

[9] Yet despite these relatively lower performance scores, this essentially unsupervised learning outperforms the fully supervised Rumelhart and McClelland past-tense learning algorithm, which achieved only 63% accuracy on irregular forms when trained on 506 verb pairs.

Table 10 shows how each of the models performs on a randomly-selected 30% of the highly irregular forms, with correctly selected roots identified in bold. The residual errors are primarily of three types. Two inflections, went and ate, were not alignable with their correct roots due to a different first character. The largest class of errors is due to the pigeonhole principle strongly disfavoring two inflections from sharing the same root. [10] The remaining errors typically are due to sparse statistics for the lower-frequency irregular forms. Mappings such as slew↔slay are particularly difficult because, with a corpus frequency of only 4, there is too little data to estimate a good context profile or an effectively discriminatory frequency profile. Enlarging the raw corpus size should improve performance in both of these cases.

  Word     True     CS+FS+LS+MS      CS+FS+LS+MS   CS+FS+LS    CS+FS       LS only
           Root     (Convg)   Score  (Itr 1)       (Itr 1)     (Itr 1)     (Itr 1)
  got      get      go         1.30  go            go          go          gut
  knew     know     know       1.35  know          know        know        know
  took     take     take       1.50  take          take        take        toot
  blew     blow     blow       1.80  blow          blow        blow        blow
  became   become   become     2.35  become        become      become      become
  made     make     make       2.40  make          make        make        mate
  clung    cling    cling      2.55  cling         cling       cling       cling
  drew     draw     draw       2.65  draw          draw        draw        draw
  swore    swear    swear      2.80  swear         swear       swear       store
  wore     wear     wear       3.10  wear          wear        wear        wire
  came     come     come       3.55  come          come        come        come
  thought  think    think      3.60  think         think       think       think
  flung    fling    fling      4.60  fling         fling       fling       fling
  brought  bring    bring      5.35  bring         bring       bring       brighten
  strove   strive   strive     5.85  strive        strive      straddle    strive
  stuck    stick    stick      6.00  stick         stick       stabilize   stock
  swept    sweep    sweep      6.20  sweep         sweep       sweep       swap
  shone    shine    shine      6.55  shine         shine       shine       shine
  woke     wake     wake       6.95  wake          wake        wind        wake
  clove    cleave   cleave     7.35  cleave        cleave      cleave      close
  bore     bear     bear       7.75  bear          bar         bear        bare
  meant    mean     mean       8.20  mean          mean        manage      mount
  lent     lend     lend       9.25  lend          lend        lend        lend
  slew     slay     slit      10.06  slit          slight      slight      slow
  struck   strike   strike    11.60  strike        strike      strike      strut
  bought   buy      buy       12.20  buy           buy         buy         bound
  bit      bite     bite      13.60  bite          bite        betray      bet
  dove     dive     dive      17.25  dive          dive        dash        dive
  burnt    burn     burp      17.30  burp          burp        burp        burn
  went     go       want      18.29  want          want        want        want
  caught   catch    catch     18.35  catch         cut         catch       cough
  dealt    deal     deal      21.45  deal          deal        disagree    deal

                   Table 10

[10] This was previously noted in the case of dream ↔ dreamed and dreamt, or burn ↔ burned and burnt, with the higher-probability analysis typically occupying the root slot and the lower-probability form typically forced to seek alignment elsewhere. Indeed, the pigeonhole principle is the most problematic of all the alignment principles used, as it creates nearly as many problems as it fixes. The overall performance advantage is slightly in its favor (with 59 misalignments avoided for 50 problems created), but the cost of this approach is borne heavily by the irregular verbs, and a probabilistic model of when variant forms should be expected/allowed is necessary to fix these cases while preserving the advantages of the principle in downweighting clashing analyses in the more regular verbs.

9 Conclusion

This paper has presented an original algorithm capable of inducing the accurate morphological analysis of even highly irregular verbs starting with no paired <inflection,root> examples for training and no prior seeding of legal morphological transformations. It does so by treating morphological analysis predominantly as an alignment task in a large corpus, performing the effective collaboration of four original similarity measures based on expected frequency distributions, context, morphologically-weighted Levenshtein similarity and an iteratively bootstrapped model of affixation and stem-change probabilities. This constitutes a significant achievement in that previous approaches to morphology acquisition have either focused on unsupervised learning of quasi-regular concatenative morphology, or added coverage of irregular forms that required fully supervised learning. Thus this essentially unsupervised learning of highly irregular forms is quite novel. Collectively, the algorithm achieves over 80% accuracy on the most highly irregular forms, and 99.7% accuracy on analyses requiring some stem change, outperforming a prior fully supervised learning algorithm on both measures.

References

M.R. Brent, 1993. Minimal generative models: A middle ground between neurons and triggers. Proceedings of the 15th Annual Conference of the Cognitive Science Society, pages 28-36.

M.R. Brent, 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34, pages 71-106.

S. Cucerzan and D. Yarowsky, 2000. Language independent minimally supervised induction of lexical probabilities. Proceedings of ACL'00, Hong Kong.

C. de Marcken, 1995. Unsupervised Language Acquisition. PhD dissertation, MIT.

J. Goldsmith, 2000. Unsupervised Learning of the Morphology of a Natural Language. http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000/Paper/paper.html.

D.Z. Hakkani-Tür, K. Oflazer and G. Tür, 2000. Statistical morphological disambiguation for agglutinative languages. In Proceedings of COLING 2000.

L. Karttunen, 1993. Finite state constraints. In John Goldsmith (ed.), The Last Phonological Rule, pages 173-194. Chicago: University of Chicago Press.

D. Kazakov, 1997. Unsupervised learning of naive morphology with genetic algorithms. In W. Daelemans, A. van den Bosch, and A. Weijters (eds.), Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of NLP Tasks.

K. Koskenniemi, 1983. A general computational model for word-form recognition and production. Publication No. 11, Dept. of General Linguistics. Helsinki: University of Helsinki.

K. Oflazer and S. Nirenburg, 1999. Practical bootstrapping of morphological analyzers. Proceedings of the Conference on Natural Language Learning, Bergen, Norway.

S. Pinker and A. Prince, 1988. On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. In S. Pinker and J. Mehler (eds.), Connections and Symbols. MIT Press.

D. Rumelhart and J. McClelland, 1986. On learning the past tense of English verbs. In J. McClelland, D. Rumelhart, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2. MIT Press.

P. Theron and I. Cloete, 1997. Automatic acquisition of two-level morphological rules. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, pages 103-110.

A. Voutilainen, 1995. Morphological disambiguation. In F. Karlsson, A. Voutilainen, J. Heikkilä, and A. Anttila (eds.), Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text.