Introducing Minerva: A New System for Machine Translation

Jonathan Levin
Minerva@hisown.com

Abstract

This paper introduces a novel system for machine translation, code-named "Minerva", and briefly discusses its design. The source language is currently Latin, with several target language options. Our approach handles morphology and grammar alike, and also attempts semantic disambiguation and context determination. While few systems exist to translate Latin, Minerva already displays noteworthy results that surpass those of the leading one. We further claim that our system can, upon maturity, be adapted to handle other languages as well.

1 Introduction

Machine Translation (MT) is one of the most challenging areas in NLP research. Many systems exist, all vying to build a better "mousetrap" and to push accuracy toward a much-desired asymptotic 100%. While most of these systems focus on contemporary spoken languages, we have decided to tackle the challenge of translating Latin (as a source language) into, theoretically, any target language. Our system, code-named "Minerva", is the result.

At the time of writing, the leading system for Latin MT is "Perseus", developed by Tufts University and heavily used by classicists worldwide. This system, however, falls short of its vast promise: it focuses solely on the morphological aspects, often correctly determining the various possible senses, but making no real attempt to disambiguate them [1]. Consequently, it is of little use for word combinations or full sentences, and offers only single-word lookups.

[1] That said, the newest version (4.0) also offers a corpus-based "guesstimate" of the most applicable sense, but based solely on statistics input by human users.

Minerva offers significant advancements compared to Perseus. With an improved morphological engine, it correctly deduces more forms than its "sibling". Additionally, it contains an elaborate rule-based system to tackle Latin grammar, providing correct grammatical disambiguation of senses whenever possible.

Future work will make a foray into semantics by integrating Minerva with Princeton University's "WordNet", thereby attempting to leap over the most challenging hurdle yet: full word sense disambiguation. It is important to emphasize that, while the semantic module is at the time of writing far from complete, it will not be a statistical, corpus-based approach, but will attempt true context determination by mimicking the associative process of the human brain.

2 A brief primer on Latin

A full discussion of Latin morphology and grammar is well beyond the scope of this short paper. However, we next explain some important attributes of the language, as well as the reasoning behind its choice.

2.1 Why Latin?

Latin is traditionally treated as a "dead" language. Even though its terms and idioms permeate popular domains such as medicine and law, and indeed even everyday English, its use in the 21st century is limited to specific circles, most notably classicists.

We prefer to think of Latin as a "frozen" language, rather than a "dead" one: "frozen", as it has, in effect, developed little in the past 1,500 years. As opposed to many of its descendant Romance languages, Latin vocabulary remains small, with a relatively straightforward grammar, predictable exceptions and (what we find to be) a reasonable degree of semantic ambiguity. In that,
it proves itself to be a worthwhile candidate for automated translation.

Minerva was thus designed with Latin as the source language in mind. However, the system is modular by design, so the input module may later be replaced with a different one, possibly targeting English, French, or even non-Romance languages such as Hebrew or Japanese.

2.2 Latin-specific challenges

a. Word order and declensions

Latin, like Russian, relies heavily on declensions. As a result, the traditional word order rules of other languages (Subject-Verb-Object (SVO) in English/Romance, SOV in most others) do not apply. While most sentences loosely adopt SOV, sentences can be constructed in any order, so there is no real difference between:

    Puer amat puellam
    Puer puellam amat
    Puellam puer amat

    Puer: puer (boy), n., nominative
    Puellam: puella (girl), n., accusative
    Amat: amare (love), v., 3rd person sg., present, indicative/active

All three mean "The boy loves the girl", due to the use of declensions. The loose word order is often used in Latin to emphasize a particular part of the sentence, especially in poetry.

This proves to be both a blessing and a curse. On the one hand, declensions can be used in the process of grammatical disambiguation. On the other, some declined forms are identical. These include very commonly encountered forms [2], and in those cases Latin provides an even greater challenge than languages with a rigid word order. Because of declensions, Latin does not have any determiners and uses prepositions sparsely, thus making common grammatical tasks exceptionally challenging.

[2] E.g. nominative (subject) and accusative (direct object) for neuter, dative (indirect object) and ablative (means) for feminine, and many others.

b. Grammatical agreement

Latin relies heavily on agreement between words in gender, case and number. This makes it easy to perform grammatical disambiguation in many cases, especially where adjectives are separated from their corresponding nouns.

c. Clauses

Latin uses punctuation sparsely. Commas cannot be relied upon in determining clauses, which can also be introduced by myriad prepositions and/or pronouns. Examples are the protasis and apodosis of conditionals, e.g.:

    si vis pacem para bellum
    ("if you want peace, prepare for war")

Even without punctuation (between 'pacem' and 'para'), the use of the conditional ('si') and the verb ('para') enables a clear separation into two clauses ("si vis pacem" and "para bellum").

d. Idiomatic expressions

In many cases, well-known idioms can be used to perform both grammatical and semantic disambiguation. For example:

    quibuscum continenter bellum gerunt
    (Caesar, "Gallic War", 1.1)

wherein "bellum" by itself could be either an adjective ("beautiful") or a noun ("war"), and "gerunt", while not grammatically ambiguous, is a verb with many potential interpretations ("to bear about, bear, carry, wear, have, hold, sustain"). The combination of both, however, is uniquely determined to be "wage war".

3 System Design

Minerva is highly modular, and treats MT as a sequence of independent stages, each implemented separately:

    #   Stage                        Language Dependent
    1   Morphological Analysis       Yes
    2   Grammatical Analysis         Yes
    3   Semantic Analysis            No
    4   Target Language Generation   Yes

It is important to emphasize that, while the language-dependent modules are currently implemented for Latin (1, 2) and English/French (4), we
surmise there is no insurmountable challenge in adapting them to other languages as well. In this sense, our claim can be expanded to say that the semantic analysis and disambiguation is language agnostic enough that the system should ideally be able to deduce senses and context irrespective of the languages involved.

3.1 Morphological Analysis

Minerva performs morphological analysis by following an XML tree that describes the Latin syntax. The tree is organized according to endings, and its hierarchical structure leads directly to the possible meanings of each ending. Rather than follow a greedy approach, all possible endings are considered (including partial ones), and from each a base form candidate is proposed. A dictionary lookup ensues, and, if the candidate is found, it is added to the possible meanings of the word.

Additionally, Minerva maintains separate lists of exceptional verb and noun forms, to handle the numerous (yet finite) cases wherein words are declined in non-standard or alternate forms. This stage is fully debugged and tested, and is thus on par with "Perseus", and even exceeds it, as it correctly recognizes many forms the former does not.

The output of this phase is an array of all the possible meanings of a given word. Meanings are disambiguated at the grammatical level only, e.g.

    si vis pacem, para bellum
    (if you want peace, prepare for war)

    vis: vis, n/sg/f/gen (force, vigor, etc.)
    vis: vis, n/sg/f/nom (force, vigor, etc.)
    vis: velle, v/2nd/sg/pres/ind/act (to want)
    ..
    bellum: bellum, n/sg/n/nom (war)
    bellum: bellum, n/sg/n/acc (war)
    bellum: bellus, adj/sg/m/acc (pretty)
    bellum: bellus, adj/sg/n/acc (pretty)
    ..

This output obviously includes wrong meanings, and defers semantic ambiguities (in the above case, "vis" as a noun can specifically mean any of force, vigor, power, or energy) to the semantic analysis stage.

3.2 Grammatical Analysis

Latin has a surprisingly rich and diverse grammar, but it mostly falls into well-followed rules. The Grammatical Analysis module thus implements a classical rule-based system.

Grammar rules are defined in a simple yet effective language that makes use of conditionals and word attributes to form pattern-matching expressions. Ambiguities in word senses are handled by reducing multiple senses to a simple character representation, upon which a regular expression may be applied and tested.

Rules are defined in one of several classes (mandatory, common, unusual), and evaluated in order. Additionally, when a given meaning of a word is discarded due to the application of a rule, the system keeps track of its decision.

A "mandatory" rule class is one wherein the system will try to enforce the rule, eliminating those possible senses of the word that fail to match it. In a way, it mimics human expectation. Much as in English one would expect certain parts of speech to follow others (say, nouns/adjectives after determiners), so too does Minerva anticipate certain declensions to follow prepositions, and so on. If a mismatched mandatory rule results in the elimination of the last possible sense of a given word, Minerva rejects the sentence as ungrammatical.

A "common" rule class is one wherein the system expects a common, yet not necessarily strict, pattern to occur in a sentence. These rules are "nice-to-have", yet can be violated.

An "unusual" rule class groups grammatical constructs that are not found in common prose or speech, but do occur in rare cases, most often in poetry. These rules are thus tagged to allow their consideration only in special cases, wherein previous processing has proven insufficient, or the text source is known to be poetry.

Minerva processes rules repeatedly, and in order, according to a simple algorithm:

    For ruleClass in Classes
        TryAgain = True
        While (TryAgain)
            TryAgain = False
            For rule in rules[ruleClass]
                If isApplicable(rule, sentence)
                    apply(rule, sentence)
                    TryAgain = True
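
The rule-processing loop of Section 3.2 (visit the rule classes in order, and re-run each class until none of its rules applies any more) might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Minerva's actual Java implementation: the `Rule` type, the `process` function, the `(base form, tags)` sense encoding, the tags given for "si", "pacem" and "para", and the sample rule `verb_after_si` are all hypothetical. The paper's `isApplicable`/`apply` pair is folded into a single `transform` call whose result is compared with its input.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Sense = Tuple[str, str]        # (base form, grammatical tags) -- assumed encoding
Sentence = List[List[Sense]]   # one list of candidate senses per word


@dataclass
class Rule:
    name: str
    # Hypothetical interface: returns a reduced copy of the sentence,
    # or an equal copy when the rule does not apply.
    transform: Callable[[Sentence], Sentence]


def verb_after_si(sentence: Sentence) -> Sentence:
    """Hypothetical rule: a word right after the conjunction 'si'
    keeps only its verb senses, provided it has at least one."""
    result = [list(senses) for senses in sentence]
    for i in range(1, len(result)):
        follows_si = result[i - 1] == [("si", "conj")]
        verbs = [s for s in result[i] if s[1].startswith("v/")]
        if follows_si and verbs:
            result[i] = verbs
    return result


def process(sentence: Sentence,
            rules_by_class: Dict[str, List[Rule]],
            classes=("mandatory", "common", "unusual")) -> Sentence:
    """Apply each rule class in order until it reaches a fixed point."""
    current = [list(senses) for senses in sentence]
    for rule_class in classes:
        try_again = True
        while try_again:
            try_again = False
            for rule in rules_by_class.get(rule_class, []):
                reduced = rule.transform(current)
                if reduced != current:   # the rule was applicable
                    current = reduced
                    try_again = True
    return current


# Candidate senses for "si vis pacem para bellum"; the entries for
# "vis" and "bellum" follow the paper, the others are assumed.
SENTENCE: Sentence = [
    [("si", "conj")],
    [("vis", "n/sg/f/gen"), ("vis", "n/sg/f/nom"),
     ("velle", "v/2nd/sg/pres/ind/act")],
    [("pax", "n/sg/f/acc")],
    [("parare", "v/2nd/sg/pres/imp/act")],
    [("bellum", "n/sg/n/nom"), ("bellum", "n/sg/n/acc"),
     ("bellus", "adj/sg/m/acc"), ("bellus", "adj/sg/n/acc")],
]
RULES = {"mandatory": [Rule("verb-after-si", verb_after_si)]}
result = process(SENTENCE, RULES)
```

With this single hypothetical rule, only the verb reading of "vis" survives, while all four readings of "bellum" remain for a later stage to resolve. The loop terminates because a rule counts as applied only when it actually removes a sense.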
The application of the current rule set on our example sentence yields:

    vis: vis, n/sg/f/gen (force, vigor, etc.)
    vis: vis, n/sg/f/nom (force, vigor, etc.)
    vis: velle, v/2nd/sg/pres/ind/act (to want)
    ..
    bellum: bellum, n/sg/n/nom (war)
    bellum: bellum, n/sg/n/acc (war)
    bellum: bellus, adj/sg/m/acc (pretty)
    bellum: bellus, adj/sg/n/acc (pretty)

Note that, at this stage, while we have eliminated some of the semantic ambiguity (namely, of "vis"), we still cannot determine "bellum" to be "war" and not "pretty".

3.3 Semantic Analysis

Semantic analysis is, at this time, not fully implemented. Dictionary meanings, however, are already tied to "WordNet", allowing for the consideration of specific meanings and for future disambiguation. This process, however, is currently out of scope of the present implementation and, thus, of this paper. This paper is proposed as a short talk submission, which will focus on the semantic challenges foreseen. In our example sentence, semantic disambiguation will enable Minerva to deduce that "war" is a more likely meaning for "bellum" than "pretty" is, since the former is an antonym of "peace", and the latter is an adjective used substantively.

3.4 Target Language Generation

While the default target language is English, Minerva maintains a "sense table" as a proof of concept, mapping English senses into corresponding ones in Hebrew, French and Japanese (the choice of languages being half-arbitrary, as they are the ones known by the authors, but they also represent three distinct language classes).

Many multi-lingual Machine Translation systems use an intermediate interlingua stage. Minerva does not do this directly, but rather relies on the unique senses of words, as determined by the previous stage of semantic analysis. Under the assumption that the ambiguous senses of the word can be eliminated, leaving one unique sense, we claim that there exist mappings of that sense to any target language (different mappings for different languages). We further claim that this mapping is injective (possibly surjective, i.e. a bijection) onto the target language space. For example, Latin "malus" would translate as:

    Latin:   malus
    English: apple#1 (apple%1:13:00::)
    French:  pomme
    Hebrew:  תפוח
    Etc.

Once the target language noun, verb or other part-of-speech base form is determined, the language generation module conjugates or declines it. Finally, word ordering is enforced. Results leave something to be desired, but are quite satisfactory.

4 Implementation Details

Minerva is implemented with the following technologies/platforms:

    Platform   Use
    Java       Main System Logic
    MySQL      Database, dictionaries
    PHP        Web 2.0 Front End

The system output is natively in XML format, allowing for use and integration with other systems as a web service. Human-readable output is produced by means of XSL rendering, which converts the output into standards-conformant X/HTML 1.1.

5 Final Notes

Minerva is still very much in a nascent stage, and is a work in progress. It has, however, reached a degree of maturity in the sense that it could already benefit Latin speakers and classicists. It is the aim of this paper, and of the proposed presentation, to spread the word of its existence to as wide an audience as possible, so as to boost its usage as well as its evolution.

The system's main drawback, at present, is its rather limited dictionary. A simple web 2.0 interface allows it to query its users (effectively "asking for help") to input base forms of new words encountered in translation. It is the authors' hope that, with more use, Minerva's dictionary will be expanded organically by its users, increasing in turn its efficiency and appeal to new users.

References

Tufts, Perseus - http://www.perseus.tufts.edu/
Princeton, WordNet - http://wordnet.princeton.edu/
