
Named Entity Recognition Using Wikipedia as a Parallel Corpus

Arthur Buliva
arthurbuliva@gmail.com

Abstract: Current popular machine translation toolkits offer few resources, if any, when it comes to African languages. This
shortfall could largely be because the data needed to train them is not readily available for African languages. Wikipedia is an
online resource that can provide a viable source of language data for machine translation resources. We created a proof-of-concept
model for Swahili. Further work needs to be done in order to improve it, remove misplaced tags and scale it for any language
available on Wikipedia.

Keywords: Machine Translation, Parallel Corpus, Machine Learning, Natural Language Processing

1 Introduction

Named Entity Recognition (NER) is the process of extracting entity information, such as persons, locations, times, and organizations, from a given text (Shah, Bo, Gershman, & Frederking, 2010). Humans can do this quite well, but computers need large amounts of data with contextual information in order to achieve it. "Jet will jet out of Jet in a jet" is easy enough for a human brain: we know which "jet" is the name of a person, which is the name of a place, which is a verb, and which is an object. Contextual data sources, called annotated texts, are common for languages with readily available resources on the internet, such as English or Spanish. Less-resourced languages, such as Swahili, face a much bigger challenge in obtaining contextual information.

This paper looks at how to make use of freely available online data from Wikipedia in order to extract contextual information for Swahili. The approach could be used to annotate text in any language with a substantial set of Wikipedia articles. We describe the challenges faced and demonstrate the data in a practical NER use case.

2 Related Work

There are approximately 7,000 languages actively spoken in the world today (Ethnologue, 2018). Of these, excellent digital resources exist for English, French, Italian, German, Spanish and a few others, with a few dozen languages in the mid ranges, leaving many other languages under-resourced (Benjamin & Radetzky, 2014). Swahili is a language widely spoken in East and Central Africa, with approximately 100 million speakers (Ethnologue, 2018). English is an official language in Kenya, but despite this, only an estimated 19% of Kenyans speak it to any degree (Crystal, 2003), and thus information dissemination is neither efficient nor effective (Translators Without Borders, 2015).

Machine translation is defined as the use of computer software to translate text or speech from one natural language to another (Abiola, Adetunmbi, & Oguntimilehin, 2015). The effort involved in machine translation depends on various aspects such as the roots of words, misspellings, synonyms, names of places and the like. The ability to distinguish real-world objects that have fixed labels is what is called Named Entity Recognition.

Machine translation engines also make use of collections of texts that may be aligned such that the text in one language matches that of another language. Such texts are known as parallel corpora or aligned parallel corpora (Shah, Bo, Gershman, & Frederking, 2010). Example sources of such corpus data include the United Nations, where all official documents are published in the six official languages of Arabic, Chinese, English, French, Russian and Spanish (United Nations, 2018); the Bible (The Wycliffe Global Alliance, 2017); and Wikipedia (Buliva, 2016).

3 Case Study

The general objective of this research was to make a named-entity model for Apache OpenNLP¹. The research made use of freely available parallel data from Wikipedia as well as open tools to develop the model. The motivation behind this is the lack of NER models for African languages, with low translation accuracy for Swahili as an example consequence.

The work began by extracting Swahili-language articles from Wikipedia that have a corresponding English-language article. The articles were then run through Apache OpenNLP and Stanford NER² in order to extract named entities. The named entities were then translated using Wikipedia as well as a Swahili–English dictionary freely provided by the Kamusi Project. Using the same online collaboration concepts as Wikipedia, the Kamusi Project uses volunteers on the Internet to generate content (The Kamusi Project, 2018).

With the list of named entities in English together with their equivalent Swahili translations, the next task was to annotate the Swahili Wikipedia sentences. These annotated texts were used to train Apache OpenNLP in order to create the desired model.

The model was then tested by running it over sample non-Wikipedia Swahili texts in order to determine its level of accuracy.

Figure 1: Sample run against existing Apache OpenNLP models

Figure 2: Sample run against existing Apache OpenNLP models

¹ Apache OpenNLP is a trademark of The Apache Software Foundation.
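The annotation step described above can be illustrated with a short sketch. Apache OpenNLP's name-finder trainer expects sentences with each entity wrapped in `<START:type> … <END>` markup, one sentence per line; the dictionary entries and the Swahili sentence below are hypothetical examples for illustration, not taken from the actual Wikipedia-derived entity list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of annotating a sentence with OpenNLP-style span tags,
// given a dictionary mapping entity text to its type.
public class EntityAnnotator {

    public static String annotate(String sentence, Map<String, String> entities) {
        String annotated = sentence;
        for (Map.Entry<String, String> e : entities.entrySet()) {
            // Wrap every occurrence of the entity in <START:type> ... <END>.
            // A real implementation would also check token boundaries and
            // handle entities that are substrings of other entities.
            annotated = annotated.replace(
                e.getKey(),
                "<START:" + e.getValue() + "> " + e.getKey() + " <END>");
        }
        return annotated;
    }

    public static void main(String[] args) {
        Map<String, String> entities = new LinkedHashMap<>();
        entities.put("Nairobi", "location");       // assumed dictionary entries
        entities.put("Jomo Kenyatta", "person");
        System.out.println(
            annotate("Jomo Kenyatta alizaliwa karibu na Nairobi.", entities));
        // -> <START:person> Jomo Kenyatta <END> alizaliwa karibu na <START:location> Nairobi <END>.
    }
}
```

Lines annotated this way can then be fed directly to OpenNLP's TokenNameFinder trainer to produce the custom model.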

4 Technology Descriptions

Using an open-source tool called Wikipedia Parallel Titles (Dyer, 2014), we were able to store parallel Swahili and English sentences in a MySQL database. We then developed a Java application that retrieves the Wikipedia data from the MySQL database and extracts named entities using serial runs through the Stanford NER and then the Apache OpenNLP NER extraction algorithms. The application was deployed on a Windows 10 platform. All the source code for the work is available online at https://github.com/arthurbuliva/swahiliner

5 Results

5.1 Status of Swahili NER

Apache OpenNLP does not have NER models for any African language (Apache OpenNLP, 2018). Therefore, using English models to try to extract named entities is quite futile. But that is exactly what we did, in order to create a control for our prototype. We obtained some sample news articles from the BBC, against which we then attempted to extract named entities. Figures 1, 2 and 3 depict the results of these sample runs.

Figure 3: Sample run against existing Apache OpenNLP models

From these runs, it was observed that not all of the named entities were detected out of the box by Apache OpenNLP.

5.2 Accuracy of the custom NER model

We ran the same sample phrases through Apache OpenNLP, this time making use of the model that was created from Wikipedia. Figures 4, 5 and 6 show the respective outcomes of the run.

Figure 4: Sample run against the custom Apache OpenNLP model
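The comparison between the control runs and the custom-model runs can be quantified. A minimal sketch, under the assumption that an entity counts as correct only when its predicted label matches a hand-made gold reference; the entities and labels below are illustrative, not the actual test data:

```java
import java.util.Map;

// Sketch of scoring a NER run: an entity is "detected" if its text span
// appears among the predictions at all, and "correctly labelled" only if
// the predicted label also matches the gold label.
public class ModelEvaluation {

    // Fraction of gold entities whose predicted label matches exactly.
    public static double labelAccuracy(Map<String, String> gold,
                                       Map<String, String> predicted) {
        int correct = 0;
        for (Map.Entry<String, String> e : gold.entrySet()) {
            if (e.getValue().equals(predicted.get(e.getKey()))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Gold labels for two entities from a hypothetical test sentence.
        Map<String, String> gold = Map.of(
            "Chama cha Demokrasia na Maendeleo", "organization",
            "Nairobi", "location");
        // The model detects both spans but mislabels the party as a person.
        Map<String, String> predicted = Map.of(
            "Chama cha Demokrasia na Maendeleo", "person",
            "Nairobi", "location");
        System.out.println(labelAccuracy(gold, predicted)); // prints 0.5
    }
}
```

A detection rate can be computed the same way by ignoring the labels and checking only whether each gold span appears among the predictions.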

² Stanford NER is a trademark of Stanford University.
Figure 5: Sample run against the custom Apache OpenNLP model

Figure 6: Sample run against the custom Apache OpenNLP model

A closer examination of the results from the custom model shows that we are now able to detect all the named entities, even though some of them are incorrectly labelled. For instance, a long entity such as Chama cha Demokrasia na Maendeleo (a Tanzanian political party) is correctly detected, although it is incorrectly labelled as a person instead of an organization.

6 Conclusions

Current popular machine translation toolkits offer few resources, if any, when it comes to African languages. This shortfall could largely be because the data needed to train them is not readily available for African languages. Wikipedia is an online resource that, as we have shown, can provide a viable source of language data for machine translation resources.

The model we created is only a proof of concept; further work needs to be done in order to improve it, remove misplaced tags and scale it to any language available on Wikipedia.

References

Abiola, O., Adetunmbi, A., & Oguntimilehin, A. (2015). A Review of the Various Approaches for Text to Text Machine Translations. International Journal of Computer Applications.

Apache OpenNLP. (2018). OpenNLP Tools Models. Retrieved from http://opennlp.sourceforge.net/models-1.5/

Benjamin, M., & Radetzky, P. (2014). Multilingual Lexicography with a Focus on Less-Resourced Languages: Data Mining, Expert Input, Crowdsourcing, and Gamification. 9th edition of the Language Resources and Evaluation Conference.

Buliva, A. (2016). Machine Natural Language Translation Using Wikipedia as a Parallel Corpus - A Focus on Swahili. United States International University - Africa.

Contributors, W. (2016). Named-entity recognition. Retrieved from Wikipedia, The Free Encyclopedia: https://en.wikipedia.org/w/index.php?title=Named-entity_recognition&oldid=840369097

Crystal, D. (2003). The Cambridge Encyclopedia of the English Language (Second ed.). Cambridge, UK: Cambridge University Press.

Dyer, C. (2014). Wikipedia Parallel Titles. Retrieved from GitHub: https://github.com/clab/wikipedia-parallel-titles

Ethnologue. (2018). Languages of the World. Retrieved from SIL International Publications.

Shah, R., Bo, L., Gershman, A., & Frederking, R. E. (2010). SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation.

The Kamusi Project. (2018). Retrieved from http://www.kamusi.org

The Wycliffe Global Alliance. (2017). Scripture & Language Statistics 2017. Retrieved from The Wycliffe Global Alliance: http://www.wycliffe.net/statistics

Translators Without Borders. (2015). Does Translated Health-Related Information Lead to Higher Comprehension? A Study of Rural and Urban Kenyans.

United Nations. (2018). Official Languages. Retrieved from United Nations: http://www.un.org/en/sections/about-un/official-languages/