Arthur Buliva
arthurbuliva@gmail.com
Abstract: Current popular machine translation toolkits offer few resources, if any, for African languages. This shortfall could largely be because the data needed to train such toolkits is not readily available for African languages. Wikipedia is an online resource that can provide a viable source of language data for building machine translation resources. We created a proof-of-concept model for Swahili. Further work is needed to improve it, remove misplaced tags, and scale the approach to any language available on Wikipedia.
Keywords: Machine Translation, Parallel Corpus, Machine Learning, Natural Language Processing
2 Related Work

There are approximately 7,000 languages actively spoken in the world today (Ethnologue, 2018). Of these, excellent digital resources exist for English, French, Italian, German, Spanish and a few others, with a few dozen languages in the mid ranges, leaving many other languages under-resourced (Benjamin & Radetzky, 2014). Swahili is a language widely spoken in East and Central Africa, with approximately 100 million speakers (Ethnologue, 2018). English is an official language in …

3 Case Study

The general objective of this research was to build a named-entity model for Apache OpenNLP¹. The research made use of freely available parallel data from Wikipedia, together with open tools, to develop the model. The motivation is the lack of NER models for African languages, one consequence of which is the low translation accuracy for Swahili.

¹ Apache OpenNLP is a trademark of The Apache Software Foundation.

The work began by extracting Swahili-language articles from Wikipedia that have a corresponding English-language article, as sketched below.
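As an illustration of this pairing step, the MediaWiki API exposes inter-language links through its langlinks property, so the English counterpart of a Swahili article can be looked up directly. The following Java sketch is only an assumption about how this could be done, not the code used in this research; the class name, the example article title, and the raw JSON printout are all illustrative.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class LangLinkFetcher {
        public static void main(String[] args) throws Exception {
            // Hypothetical example title from the Swahili Wikipedia.
            String title = "Nairobi";
            // prop=langlinks with lllang=en requests the equivalent
            // English-language article title, if one exists.
            String url = "https://sw.wikipedia.org/w/api.php"
                    + "?action=query&prop=langlinks&lllang=en&format=json&titles="
                    + URLEncoder.encode(title, StandardCharsets.UTF_8);
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            // The English title appears under query.pages.<id>.langlinks[0]."*";
            // the raw JSON is printed here rather than parsed, for brevity.
            System.out.println(response.body());
        }
    }

Articles whose query returns no English langlink would simply be skipped when building the parallel data.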
The articles were then run through Apache OpenNLP and Stanford NER² to extract named entities. The named entities were then translated using Wikipedia as well as a Swahili–English dictionary freely provided by the Kamusi Project. Using the same online collaboration concepts as Wikipedia, the Kamusi Project uses volunteers on the Internet to generate content (The Kamusi Project, 2018).

² Stanford NER is a trademark of Stanford University.

Figure 1: Sample run against existing Apache OpenNLP models
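A minimal sketch of this extraction step with the Apache OpenNLP name finder follows. The model file name refers to a pre-trained English person-name model distributed by the OpenNLP project, and the sample sentence is an illustrative assumption; Stanford NER offers an analogous interface through its CRFClassifier class.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.SimpleTokenizer;
    import opennlp.tools.util.Span;

    public class EntityExtractor {
        public static void main(String[] args) throws Exception {
            // Load a pre-trained OpenNLP name-finder model.
            try (InputStream in = new FileInputStream("en-ner-person.bin")) {
                NameFinderME finder =
                        new NameFinderME(new TokenNameFinderModel(in));
                // Illustrative English sentence from a Wikipedia article.
                String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                        "Wangari Maathai was born in Nyeri in 1940 .");
                // Each Span marks the token range of one detected entity.
                for (Span span : finder.find(tokens)) {
                    StringBuilder name = new StringBuilder();
                    for (int i = span.getStart(); i < span.getEnd(); i++) {
                        name.append(tokens[i]).append(' ');
                    }
                    System.out.println(span.getType() + ": "
                            + name.toString().trim());
                }
            }
        }
    }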
With the list of named entities in English together with their equivalent Swahili translations, the next task was to annotate the Swahili Wikipedia sentences. These annotated texts were then used to train Apache OpenNLP to create the desired model.
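A sketch of the annotation format and the training call is given below. Apache OpenNLP expects one sentence per line, with entities wrapped in <START:type> ... <END> tags; the file names and the Swahili example sentence are assumptions for illustration, not the actual training data.

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class SwahiliNerTrainer {
        public static void main(String[] args) throws Exception {
            // Annotated training sentences, one per line, e.g.:
            // <START:person> Wangari Maathai <END> alizaliwa Nyeri mwaka 1940 .
            ObjectStream<NameSample> samples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(
                                    new File("sw-ner-train.txt")),
                            StandardCharsets.UTF_8));
            // Train a Swahili person-name model with default parameters.
            TokenNameFinderModel model = NameFinderME.train(
                    "sw", "person", samples,
                    TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory());
            // Serialize the model for later use.
            try (BufferedOutputStream out = new BufferedOutputStream(
                    new FileOutputStream("sw-ner-person.bin"))) {
                model.serialize(out);
            }
        }
    }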
Figure 2: Sample run against existing Apache OpenNLP models
The model was then tested by running it over sample non-Wikipedia Swahili texts to determine its accuracy.
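One way to quantify such a test, sketched below under the assumption that the held-out texts are annotated in the same <START:type> ... <END> format, is OpenNLP's TokenNameFinderEvaluator, which reports precision, recall, and F1; the file names are illustrative.

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderEvaluator;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class SwahiliNerEvaluator {
        public static void main(String[] args) throws Exception {
            // Load the trained Swahili model (file name assumed).
            TokenNameFinderModel model =
                    new TokenNameFinderModel(new File("sw-ner-person.bin"));
            // Annotated, non-Wikipedia test sentences.
            ObjectStream<NameSample> testSamples = new NameSampleDataStream(
                    new PlainTextByLineStream(
                            new MarkableFileInputStreamFactory(
                                    new File("sw-ner-test.txt")),
                            StandardCharsets.UTF_8));
            TokenNameFinderEvaluator evaluator =
                    new TokenNameFinderEvaluator(new NameFinderME(model));
            evaluator.evaluate(testSamples);
            // FMeasure's toString prints precision, recall and F1.
            System.out.println(evaluator.getFMeasure());
        }
    }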
4 Technology Descriptions
References

Benjamin, M., & Radetzky, P. (2014). Multilingual Lexicography with a Focus on Less-Resourced Languages: Data Mining, Expert Input, Crowdsourcing, and Gamification. 9th edition of the Language Resources and Evaluation Conference (LREC 2014).