Towards Enhancing Retrieval Effectiveness of Search Engines For Diacritisized Arabic Documents

Inf Retrieval
DOI 10.1007/s10791-008-9081-9
Bassam H. Hammo
Received: 11 May 2008 / Accepted: 24 November 2008

© Springer Science+Business Media, LLC 2008
Abstract The majority of Arabic text available on the web is written without short
vowels (diacritics). Diacritics are commonly used in religious scripts such as the holy
Quran (the book of Islam), Al-Hadith (the teachings of Prophet Mohammad (PBUH)),
children’s literature, and in some words where ambiguity of articulation might arise.
Arabic Internet users may fail to retrieve credible sources of Arabic text if their queries do
not match the diacritical marks attached to the words in the collection. However,
typing the diacritical marks is tedious and time consuming. The alternative
is to ignore these marks and face the problem of ambiguity. Previous work suggested
pre-processing Arabic text to remove the diacritical marks before indexing. Consequently,
there are noticeable discrepancies when searching the web for Arabic text using
international search engines such as Google and Yahoo. In this article, we propose a
framework to enhance the retrieval effectiveness of search engines to search for diacritic
and diacritic-less Arabic text through query expansion techniques. We used a rule-based
stemmer and a semantic relational database compiled in an experimental thesaurus to do
the expansion. We tested our approach on the scripts of the Quran. We found that query
expansion for searching Arabic text is promising and it is likely that the efficiency can be
further improved by advanced natural language processing tools.
Keywords Arabic information retrieval · Diacritisized scripts · Query expansion · Arabic stemming · Thesaurus
Abbreviations
AIR Arabic information retrieval
AWN Arabic WordNet
PBUH Peace be upon him
CLIR Cross language information retrieval
IE Information extraction
IR Information retrieval
B. H. Hammo (&)
King Abdullah II School for Information Technology, University of Jordan, Amman 11942, Jordan
e-mail: b.hammo@ju.edu.jo
LSI Latent semantic indexing
MSA Modern standard Arabic
MT Machine translation
NLP Natural language processing
POST Part of speech tagging
QA Question answering
QARAB Question answering system for Arabic
RDBMS Relational data base management system
QE Query expansion
SQL Structured query language
TS Text summarization
VR Verses retrieved
VRQ Verses relevant to query
VRS Verses retrieved using Stemmer
VRT Verses retrieved using Thesaurus
VRW Verses retrieved using words
VSM Vector space model
WSD Word sense disambiguation
1 Introduction
Arabic belongs to the family of Semitic languages. It differs from Latin languages
morphologically, syntactically and semantically. The writing system of Arabic has 25
consonants and three long vowels that are written from right to left and change shape
according to their position in the word. In addition, Arabic has short vowels (diacritics)
written above and under a consonant to give it its desired sound and hence give a word a
desired meaning. The common diacritics used in the Arabic language are listed in Table 1.
In Arabic text, diacritics are not generally indicated in writing. Diacritic-less text (i.e.,
text without diacritic vowels) is commonly used by the Arabic community for everyday
written and printed material such as books, magazines, newspapers and letters. However,
diacritics are heavily used in religious texts that demand strict obedience to pronunciation
rules, such as the Quran (the book of Muslims; followers of Islam) and some scripts of
Al-Hadith (teachings of Prophet Mohammad (PBUH)).
In addition, it is very common to use diacritics (fully, partially, or even randomly) in
classical poetry, children’s literature and ordinary text that would otherwise be ambiguous
to read. For instance, a word in Arabic consisting of three consonants like (ك ت ب ktb) “to write” can
have many interpretations in the presence of diacritics (Kirchhoff and Vergyri 2005).
Table 1 Common short vowels (diacritics) used in Arabic text
Diacritic | Shape description | Example | Sound
fatha | A small diagonal line appears above a letter | دَ | da
kasra | A small diagonal line appears below a letter | دِ | di
damma | A small “comma-like” diacritic placed above a letter | دُ | du
tanwin | A double vowel diacritic appears only at the end | دًا دٍ دٌ | dan, din, dun
sukun | A small circle shape above the letter indicating that the consonant is not followed by a vowel | دْ | d
shadda | A small diacritic (ّ) indicating a doubled consonant | دّ | dd
madda | A diacritic appears on top of alif indicating a long alif | آ | aa
Table 2 Different interpretations of the Arabic word كتب (ktb) in the presence of diacritics
Arabic word | Transliteration | Part of speech | English meaning
كَتَبَ | kataba | 3PSNG (verb) | Wrote
كُتُب | kutub | Noun | Books
كُتِبَ | kutiba | Passive (verb) | Written
كَتَّبَ | kattaba | Verb | Make someone to write
For Arabic language speakers, the only way to disambiguate the diacritic-less words is
to locate them within the context. Analysis of 23,000 Arabic scripts showed that there is an
average of 11.6 possible ways to assign diacritics for every diacritic-less word (Debili et al.
2002). Examples of the different meanings associated with the word “ktb” in the presence
of short vowel diacritics are listed in Table 2.
Unfortunately, the bulk of the diacritic-less Arabic scripts available on the Internet
prevents at least two groups of people from accessing their contents. The first group
includes visually impaired people, while the second group includes people with learning
disabilities. Both groups rely on text synthesis and voice recognition applications.
The success of Arabic voice applications is highly dependent on the
presence of diacritisized text, which enables the systems to pronounce the words correctly.
Many research projects have been carried out to restore diacritics automatically to help
such applications (El-Sadany and Hashish 1988; Gal 2002; Zitouni et al. 2006).
Another group who might be affected by the absence of diacritics includes people trying
to read and understand the teachings of the Quran and Al-Hadith. The meanings of the
Quran verses and Al-Hadith are heavily dependent on the presence of the diacritic vowels.
Most research in the field of Arabic Information Retrieval (AIR) has not paid much
attention to the problem of searching and retrieving diacritisized text. Most published works
even suggested removing the diacritics at the preprocessing step to unify the content of the
inverted list (Buckwalter 2007). This may have been reasonable in the early days, when Arabic text on the
web was very limited. Additionally, in the absence of tools and applications that can handle
this problem seriously, an important new lexical resource for modern standard Arabic (MSA)
known as Arabic WordNet (AWN) may end up being rejected (Black and Elkateb 2004; Black
et al. 2006). In addition, ignoring diacritical text may delay progress in some research
areas such as the semantic web (SW) (Abdelali et al. 2003; El-Helw and Aly 2004; Zaidi and
Laskri 2005) and ontology-based information retrieval (Elkateb and Black 2004).
Current international search engines (such as Google,1 Yahoo,2 MSN,3 etc.) and Arabic
search engines (such as Ayna,4 etc.) are not yet mature enough to handle the complexity of
the Arabic language. Accordingly, there is little motivation to upload more Arabic scripts to the
web if the content cannot be retrieved. It has become necessary either to extend the search
capabilities of these search engines or to develop a (full) Arabic search engine.
In this paper, we propose a framework to enhance the retrieval effectiveness of search
engines to search for diacritic and diacritic-less Arabic text through query expansion
techniques using a rule-based stemmer and an experimental thesaurus. We carried out our
experiments on the scripts of the Quran, a free and open resource of classical Arabic.
1 http://www.google.com
2 http://www.yahoo.com
3 http://www.msn.com
4 http://www.ayna.com
2 State of the art
2.1 Arabic information retrieval
A study of the world market, commissioned by the Miniwatts Marketing Group,5 shows
that the number of Arab Internet users in the Middle East and Africa could jump to
32 million in 2008 from 2.5 million in the year 2000. In addition, the growth of Arab
Internet users in the Middle East region (for the same period) is expected to reach about
924% compared to the growth of the world Internet users (see footnote 5 for details). The
conducted research pointed out that 65% of the Internet Arabic speaking users could not
read English pages, which account for 70% of the material available on the Internet.
Unfortunately, efforts to build new search engines to serve the increasing number of
Arabic-speaking users are still very modest. This is mainly due to the
complexity of the Arabic language. Another obstacle facing developers of AIR systems is the
lack of adequate resources (i.e., corpora, lexicons, morphological analyzers, part-of-speech
taggers, etc.) that could help in scaling, testing and evaluating the performance of their
implemented systems in the real world.
Generally speaking, AIR engines can be designed based on one of two categories
(Abdelali et al. 2004):
1. Full-form based IR, which has been adopted by most of the commercial engines such
as Google, Yahoo and Ayna.
2. Morphology-based IR. Systems developed in this context are experimental systems to
improve the performance of AIR. Different approaches to improve the performance of
AIR have also been addressed. These approaches include: using light stemmers
(Larkey et al. 2002; Semmar et al. 2006), part of speech taggers (Khoja 2001; Diab
2007), using thesauri (Xu et al. 2002) and using ontology (Abdelali et al. 2003).
In this work, we used the rule based morphological analyzer developed by Khoja (1999)
to extract Arabic stems. We also compiled an experimental thesaurus based on semantic
relations extracted from the Quran.
2.2 Cross language information retrieval
As of today, the most dominant language of the web is English. The majority of the
Internet users around the world (80% of the world’s population) cannot read English pages
and accordingly they are prevented from benefiting from the credible sources of information
available for free on the web. The approach of Cross-Language Information
Retrieval (CLIR) allows a user to formulate a query in his own language and retrieve
documents in one or several other languages.
5 http://www.internetworldstats.com
CLIR has appeared in the literature since the last decade (Grefenstette 1996; Oard 2000).
However, the development of CLIR systems is very limited because of the high development
cost and complexity. Most CLIR systems rely on thesauri (which are very
expensive to implement), stemming, word boundary identification, and lists of stop-words
defined for the target language (Larkey and Connell 2005). Several multilingual IR systems
have already been studied for many spoken languages other than English. Examples of these
languages include Arabic, French, Chinese, Japanese and Spanish. The following is a quick
survey of the different approaches that have been tried to tackle the CLIR problem:
1. Machine translation and dictionary-based technique. In this technique, queries are
translated using a dictionary into a language in which a document may be found (Hull
and Grefenstette 1996; Oard 1998; Pirkola et al. 2001; Hedlund et al. 2004). An
example of this approach is a multilingual search engine, called TITAN (Hayashi et al.
1997), where an electronic bilingual (English-Japanese) dictionary helped Japanese
users to search the web using their own language.
2. Controlled vocabulary technique. This traditional technique is based on indexing all
documents of the collection using fixed terms (descriptors) which are also used for
queries. In multilingual IR, these descriptors are translated and mapped to each other
in thesauri (French et al. 2001; Kampas 2004).
3. Transliteration/Transcription technique. Transliteration is the process of converting
the characters of an alphabetical or syllabic script to the characters of a conversion
alphabet. For example, documents written in non-Roman scripts such as Arabic,
Chinese, Hebrew, Japanese, etc. are transliterated or transcribed into Roman
characters. This approach has been mainly applied to identify proper names in
different languages (Gey et al. 2002; Virga and Khudanpur 2003).
4. Corpus-based technique. This technique is based on analyzing large collections of text
(corpora) and automatically extracting the information needed on which the translation
will be based (Talvensaari et al. 2007).
5. Latent semantic indexing (LSI) technique. LSI is based on using automatic statistical
algorithms to improve the retrieval of relevant documents. LSI allows a user to retrieve
documents by concepts and meaning even when they do not share any words with the
query (Landauer and Littman 1990; Dumais et al. 1996).
2.3 Arabic cross language information retrieval
Arabic CLIR has been attempted by many research groups. The following is a short survey
of some of the research related to Arabic CLIR. Aljlayl et al. (2002) built an Arabic–
English IR system based on a machine translation approach. AbdulJaleel and Larkey
(2003) proposed a statistical transliteration approach for Arabic–English IR. Grefenstette
et al. (2005) described the changes required to integrate the Arabic language into their
cross-language IR system, which had been designed for European languages. Abdelali
et al. (2006) described how precision can be improved in query expansion using LSI.
Finally, Semmar and Fluhr (2007) presented a new approach to align Arabic–French
sentences retrieved from a parallel corpus based on a cross-language IR system. This
approach is based on building a database of sentences of the target text and
considering each sentence of the source text as a query to that database.
3 Experimenting with current international search engines
Despite attempts to increase the existing repositories of Arabic content available on the
Internet, the content is still close to 0.2% of the total worldwide Internet content.
According to the latest statistics by the Internet Society6 regarding the distribution of
languages on the Internet, there are only 100 million Arabic web pages covering topics in
business, science, politics, religion, and email messages. It is thus clear why most
developers design their tools and applications to deal with English scripts, while few
attempts have been made to process Arabic scripts.
6 http://www.isoc.org
Google and Yahoo are the most dominant international search engines used to search and
retrieve documents in multiple languages. Attempts to tackle the problems of international web
retrieval systems in handling non-English natural languages have also appeared in the
literature. As an example, Lazarinis (2007) created a methodology for identifying some of the
deficiencies of searching the web using the Greek language. Other research articles explored
the inefficiencies of international search engines in searching for Russian, French, Hungarian
and Hebrew queries (Bar-Ilan and Gutman 2005), Chinese queries (Moukdad and Cui 2005),
Arabic queries (Moukdad 2004; Al-Maskari et al. 2007) and Polish queries (Sroka 2000).
In this experiment we show the discrepancy between the two engines when searching
for diacritic and diacritic-less Arabic documents. First, we searched Google and
Yahoo for verses from the Quran. For each query listed in Table 3, we carried out a full-form
search using words with diacritics (exactly as they are typed in the Quran) and a second run
using the diacritic-less form of the same query.
Every time we ran an experiment, we used the Advanced Search Preferences Language
Tools (available in each search engine), to change the language setting. The settings that
we applied include: searching the web, Arabic pages, and English pages. The results of this
experiment are listed in Table 3.
For the examined search preferences, Google clearly retrieved
different results for diacritic and diacritic-less queries. In most cases the returned
documents were a mixture of diacritic and diacritic-less results. Many good documents were
easily missed because of the absence of diacritics. The case with Yahoo was
different: it simply ignored the diacritics in most cases and returned “almost” the same
results for diacritic and diacritic-less queries.
Next, we tried Google and Yahoo to search for Arabic words where the diacritics reflect
some semantics. A list of the words that we have tested is given in Table 4. Again we did a
full-form match and we fixed the search preferences to searching the web. Both engines
failed to distinguish the different meanings of the word “ktb”, for example, in the presence
of diacritics.
Table 3 Results from Google and Yahoo search engines for searching diacritic and diacritic-less verses from the Quran using different preference language settings
Q# | Query | Google® Web | Google® Arabic | Google® English | Yahoo® Web | Yahoo® Arabic | Yahoo® English
1 | | 4,050,000 | 4,950,000 | 198,000 | 10,400,000 | 9,320,000 | 276,000
2 | | 2,170,000 | 4,950,000 | 141,000 | 10,400,000 | 9,320,000 | 276,000
3 | | 410,000 | 311,000 | 5,410 | 409,000 | 373,000 | 13,100
4 | للمتقين | 284,000 | 359,000 | 1,980 | 409,000 | 373,000 | 13,100
5 | | 1,150,000 | 1,110,000 | 16,800 | 1,170,000 | 1,120,000 | 33,800
6 | | 249,000 | 125,000 | 726 | 118,000 | 100,000 | 579
7 | | 11,700 | 10,100 | 666 | 23,600 | 20,400 | 384
8 | ازواج مطهره | 1,340 | 3,000 | 4 | 23,700 | 20,400 | 362
9 | | 30,500 | 22,800 | 1,590 | 272,000 | 250,000 | 2,020
10 | المومنات | 10,800 | 5,860 | 217 | 272,000 | 250,000 | 2,020
Table 4 Results from Google and Yahoo search engines for searching diacritisized Arabic words
Query | Meaning | Google® web pages | Yahoo® web pages
كَتَبَ | He wrote | 113,000 | 24,700,000
كُتْب | Books | 617 |
كتب | To write | 14,200,000 |
حَدّاد | Steel man | 119 | 2,120,000
حِداد | Mourning | 2,100 |
التجارة الحرة | Free trade | 111,000 | 283,000
التجارةُ الحرةُ | Free trade | 110,000 |
الاحتباس الحراري | Greenhouse effect | 131,000 | 304,000
الاحتباسُ الحراريُ | | 134,000 |
هبوط سعر صرف الدولار | Falling U.S. dollar exchange rate | 1,040 | 373
هبوطُ سِعرِ صَرفِ الدولارِ | | 103 |
حقوق الانسان | Human rights | 2,130,000 | 5,280,000
حُقوقُ الإنسانِ | | 1,980,000 |
We found that Google returned mixed results, while Yahoo ignored the diacritics. After
all, it is obvious that both search engines need to handle this problem more efficiently.
However, it is preferable at the end to allow users to enter Arabic words without diacritics
while at the same time allowing the retrieval of those words with vowel diacritics for the
purposes of disambiguation. In the remaining sections we present our framework and
discuss the results of our experiments to solve this problem.
4 The proposed experimental search engine
In previous work, we designed and implemented an experimental search engine to
experiment with AIR (Hammo et al. 2002). Later, this engine was modified to support
passage retrieval for an open-domain Arabic question answering system, called QARAB
(Hammo et al. 2004).
Fig. 1 Processing modules and data flow of the extended information retrieval engine
In this work, we extended the capabilities of our experimental system to tackle the problem
of retrieving diacritic and diacritic-less Arabic documents. Like most experimental
search engines, our system is based on the well-known vector space model (VSM) (Salton and
McGill 1983; Salton 1989). It takes a query in the Arabic language and attempts to provide a
ranked list of documents, based on a similarity measure (cosine measure), that are close
enough to the user’s query. The main components of the extended model are the Tokenizer,
Stemmer and Thesaurus modules. The data flow of our system is depicted in Fig. 1.
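The cosine-based ranking described above can be illustrated with a minimal sketch. It assumes raw term counts as vector weights (the weighting scheme actually used by the system is not stated here), and the function names `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter

def cosine(query_tokens, doc_tokens):
    """Cosine similarity between two bags of words (raw term counts assumed)."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def rank(query_tokens, docs):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    scored = [(doc_id, cosine(query_tokens, toks)) for doc_id, toks in docs.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Here `docs` maps a verse identifier to its list of tokens; verses sharing no term with the query are dropped from the ranked list.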
5 Experimental setup
5.1 The study domain
To test the performance of our extended model for retrieving diacritisized Arabic text, we
have chosen the Quran. The scripts of the Quran are diacritisized to preserve the
pronunciation and the meaning of its words, which can be totally changed in the absence of
diacritics. Our approach is based on searching the text of the Quran without
having to type the diacritics. We improved the search process through automatic
query expansion using a rule-based stemmer and a thesaurus. Users need only type in the
words they want to search for, while their queries are automatically augmented with all
morphological variations of the query’s words to expand the search. Next, we investigated
the effectiveness of applying a thesaurus of semantic classes to expand the search. The
obtained results are promising and open directions for enhancing the capabilities of search
engines and other applications, such as question answering and information extraction from
the Quran and other Arabic resources available on the Internet.
5.2 Data acquisition
The Quran chapters (suras) are split into verses (ayat). The Tokenizer tokenizes the verses
into words (tokens), while a rule-based stemmer peels the common affixes (prefixes, infixes
and suffixes) from the word to reduce it to a root form. In our extended model we provide
four types of indexes:
1. The diacritic index: contains the original words of the Quran.
2. The diacritic-less index: contains the words of the Quran after removing their
diacritics.
3. The stem index: contains the roots obtained automatically from the rule-based
stemmer.
4. The thesaurus index: contains semantic word classes to expand the query during the
search.
The Quran consists of (114) chapters (suras), where each surah is generally known by a
name. Also the Quran contains a total of (6,236) verses (ayat), a total of (77,845) words
and a total of (1,767) distinct roots.
5.3 The data model of the extended search engine
An IR system for the English language was implemented and tested using a relational database
management system (RDBMS) (Lundquist et al. 1997). It has been argued that designing
an IR system based on an RDBMS retains the benefits of providing fast and sophisticated
retrieval, being portable across different platforms, and being able to use the security
and integrity features built into the RDBMS itself (Lundquist et al. 1997). In this work, we
adapted the idea of integrating the relational database model with an IR system to store and
manipulate the Quran scripts. The data model of our extended engine is depicted in Fig. 2.
The IR system was coded in Java, while the database was designed using SQL
Server. The system was tested on a Pentium IV machine running Windows XP.
Fig. 2 Extended relational database model to experiment with our search engine
5.4 Processing the scripts of the Quran
5.4.1 Preprocessing the scripts
The scripts of the Quran undergo a preprocessing phase before the words are extracted and
the indexes are built. The system automatically triggers a tokenizer module to split the
chapters into verses at the verse Unicode boundary marker (۝), which marks the end of
each verse in the Quran. Other marks such as (۞ ۩), which are used to organize the Quran into
parts and sections, also have to be removed. Words in the Quran are bounded by white
spaces, which makes it very easy to identify words and verify their correctness. Finally, a
set of letters and marks such as (ۖ ۗ ۘ ۙ ۚ ۜ), which are used for reciting the Quran, must
also be removed before the tokenization process can take place. At the end of the
tokenization process the words are ready to be stemmed and indexed.
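The preprocessing steps above can be sketched as follows. This is an illustration only: it assumes the input is a plain Unicode string in which each verse ends with the end-of-ayah sign (U+06DD), and the names `split_verses` and `tokenize` are hypothetical rather than taken from the system.

```python
import re

END_OF_AYAH = "\u06DD"  # verse boundary marker

# Section marks (U+06DE, U+06E9) and the recitation marks listed in the text
# (U+06D6, U+06D7, U+06D8, U+06D9, U+06DA, U+06DC).
REMOVABLE = re.compile("[\u06D6\u06D7\u06D8\u06D9\u06DA\u06DC\u06DE\u06E9]")

def split_verses(sura_text):
    """Drop section/recitation marks, then split a chapter into verses."""
    cleaned = REMOVABLE.sub("", sura_text)
    return [v.strip() for v in cleaned.split(END_OF_AYAH) if v.strip()]

def tokenize(verse):
    """Words are bounded by white space, so a plain split is enough."""
    return verse.split()
```

Digital editions that place a verse number next to the boundary marker would need an extra cleaning step, which this sketch does not cover.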
5.4.2 Building the inverted lists (indexes)
5.4.2.1 The diacritic index This index contains the diacritisized full-form words of the
Quran (and their frequencies) as they were obtained from the tokenizer without any further
processing (i.e., keeping their diacritics intact). This index is not used directly for
searching, because typing the diacritics is not easy and a single missing diacritic vowel
leads to an unsuccessful match most of the time. Instead, the words of this index that share
the same roots as the query’s bag-of-words are automatically added to the original query to
expand the search during query expansion (QE).
5.4.2.2 The diacritic-less index This index has the distinct words of the Quran after they
have been processed by removing their diacritical marks and unifying the alef-hamza
character to alef (i.e., converting آ, أ and إ to ا). The diacritic-less index is about
22.6% smaller than the diacritic index. This index is the primary index used by our
search engine to answer users’ queries. At least one morphological form of each word of
the Quran is located in this index. The other forms are obtained from the diacritic index
during the QE process.
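A minimal sketch of how the diacritic-less index could be derived from the diacritic one is given below. The set of marks removed (U+064B to U+0652, i.e., tanwin, fatha, damma, kasra, shadda and sukun) is an assumption; Quran-specific signs outside that range are not handled, and the function names are hypothetical.

```python
import re
from collections import Counter

# Tanwin, short vowels, shadda and sukun (assumed range U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Alef with madda, hamza above or hamza below -> bare alef.
ALEF_VARIANTS = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})

def normalize(word):
    """Strip diacritics and unify the alef-hamza forms to alef."""
    return DIACRITICS.sub("", word).translate(ALEF_VARIANTS)

def build_indexes(tokens):
    """Return (diacritic index, diacritic-less index) as word -> frequency maps."""
    diacritic = Counter(tokens)
    diacritic_less = Counter(normalize(t) for t in tokens)
    return diacritic, diacritic_less
```

Because several diacritisized forms collapse onto one normalized string, the second index has fewer distinct entries than the first, in line with the size reduction reported above.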
5.4.2.3 The stem index Words obtained from the tokenizer are automatically stemmed
before they are added to the stem index. We used the rule-based stemmer of Khoja (1999),
written in Java, with the author’s permission. A total of 1,767 distinct roots (including
proper names appearing in the Quran) were identified. Each stem in the index has
a 1-to-m relationship with entries in the diacritic index, the diacritic-less index and the
thesaurus index. The stem index is about 88.2% smaller than the
diacritic-less index. An example of this relationship is given in Fig. 3.
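The 1-to-m relationship between a stem and the entries of the other indexes can be sketched with a small relational schema. The sketch uses Python's built-in `sqlite3` as a stand-in for the SQL Server database mentioned earlier; the table names, column names and the transliterated placeholder words are all hypothetical.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a stem table and two word tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE stem (stem_id INTEGER PRIMARY KEY, root TEXT UNIQUE);
        CREATE TABLE diacritic_less (word TEXT, stem_id INTEGER REFERENCES stem(stem_id));
        CREATE TABLE diacritic (word TEXT, stem_id INTEGER REFERENCES stem(stem_id));
    """)
    # Placeholder data in Latin transliteration, for illustration only.
    con.execute("INSERT INTO stem VALUES (1, 'ktb')")
    con.executemany("INSERT INTO diacritic_less VALUES (?, 1)", [("ktb",), ("ktab",)])
    con.executemany("INSERT INTO diacritic VALUES (?, 1)",
                    [("kataba",), ("kutub",), ("kitaab",)])
    return con

def expand(con, query_word):
    """Find all diacritisized words sharing a stem with a diacritic-less query word."""
    rows = con.execute("""
        SELECT d.word FROM diacritic_less AS q
        JOIN diacritic AS d ON d.stem_id = q.stem_id
        WHERE q.word = ?
        ORDER BY d.word
    """, (query_word,)).fetchall()
    return [r[0] for r in rows]
```

A single join on the shared stem identifier is enough to move from a typed, diacritic-less word to every diacritisized form linked to the same root.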
5.4.2.4 The thesaurus index A thesaurus is a structure that manages the complexities of
terminology in language and provides semantic relationships (such as synonymy) between
terms. Building a thesaurus can be done manually or automatically by collecting key words
of documents and classifying and organizing them into a thesaurus. In our work, we benefited
from the ongoing work of Kubaisi (2006) to construct a thesaurus for the Quran. In his
work, he grouped words that carry similar meanings into semantic classes to help people
understand the Quran better. We have compiled these semantic classes into an
experimental thesaurus. So far, the index contains 100 semantic groups, where each
group is made of 3–6 synonyms. In total, the thesaurus index contains
around 500 words. Experiments show that using thesauri increases recall, sometimes
at the expense of precision (Xu et al. 2002).
5.4.3 Using the stemmer
Stemming is the task of mapping several words onto one base form. It has a relatively
low processing cost and uses morphological heuristics to remove affixes from words before
indexing. It reduces the index size, and usually it improves the results slightly
(Strzalkowski and Vauthey 1992). This makes stemming very attractive for many natural
language processing (NLP) applications such as IR, information extraction (IE), question
answering (QA), machine translation (MT), text summarization (TS), etc.
Arabic stemming is more complicated than English stemming. Most words of the
Arabic language are constructed from three-consonant roots by following fixed
patterns. Patterns include prefixes, infixes and suffixes to indicate number, gender and
tense. Arabic stemming is the process of removing all affixes from a word to extract its
root. A stemmer for Arabic, for example, should reduce the strings kateb كاتب (writer),
ketab كتاب (book), maktabah مكتبه (library) and maktab مكتب (office) to one base form, ktb كتب
(he wrote).
Fig. 3 Diagram to explain the association between the entries of the three indexes (diacritic-less index, diacritic index and thesaurus) with the root (stem) index
In our research, we used a rule-based stemmer, developed by Khoja (1999), to
experiment with Arabic passage retrieval and QA (Hammo et al. 2004). In this experiment, the
stemmer performed reasonably well, with accuracy close to 95%. We observed that most of
the failing cases were due to stemming proper names such as the names of Prophets, angels,
ancient cities, places and people, numerals, as well as words with doubled characters
(represented using the diacritic shadda). To verify the correctness of the stems, we compared
the generated stems with a list of manually extracted and verified roots of Al-Quran (Khadir
2002). We corrected the mistaken ones and added the missing ones. Finally, the diacritic
index, the diacritic-less index and the thesaurus index were linked to the stem index using
their stem-id fields. An example of the association between indexes is given in Fig. 3.
In the above diagram, the root جوع (jawaa, make someone hungry) is associated with
the words الجوع (hunger), تجوع (getting hungry), جوع (make someone hungry) and والجوع (and
hunger). It is also associated with the diacritisized forms of the same words
available in the diacritic index. Finally, the root has an association with the semantic
class الجوع (hunger), which contains the synonyms مخمصة, مسغبة and خصاصة. In the next
section, we explain how this association enhanced the effectiveness and the performance
of our search engine.
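The association just described can be sketched as a small query-expansion routine. The three lookup tables below are hand-filled with the hunger example from the text and stand in for the stemmer output and the database indexes; their names and the function `expand_query` are hypothetical.

```python
# Hand-filled stand-ins for the stem, diacritic-less and thesaurus indexes.
WORD_TO_ROOT = {"جوع": "جوع", "الجوع": "جوع", "تجوع": "جوع", "والجوع": "جوع"}
ROOT_TO_WORDS = {"جوع": ["الجوع", "تجوع", "جوع", "والجوع"]}
ROOT_TO_SYNONYMS = {"جوع": ["مخمصة", "مسغبة", "خصاصة"]}

def expand_query(tokens, use_thesaurus=True):
    """Add morphological variants (and optionally synonyms) to each query token."""
    expanded = []
    for tok in tokens:
        candidates = [tok]
        root = WORD_TO_ROOT.get(tok)
        if root is not None:
            candidates += ROOT_TO_WORDS.get(root, [])
            if use_thesaurus:
                candidates += ROOT_TO_SYNONYMS.get(root, [])
        for c in candidates:
            if c not in expanded:  # keep first occurrence, preserve order
                expanded.append(c)
    return expanded
```

With this sketch, a query containing only جوع grows to four morphological forms through the stem link, and to seven terms once the semantic class is included; a token with no known root is passed through unchanged.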
6 Experiments and results
6.1 Data set and test collection
In this paper, we used the scripts of the Quran and a collection of 40 diacritic-less queries
obtained from native Arabic speakers. Each person was asked to provide 4 queries that
he could remember from the Quran. The list (without modifications) is given in Table 5.
6.2 Experimental results
6.2.1 Searching the diacritic-less index using full-form words
In this experiment, we tested our system using the full-form words (i.e., as they appear in the
queries without modifications). Table 6 shows the results of searching the diacritic-less index
for the queries listed in Table 5. The results of running this experiment are given in Fig. 4.
The system failed in the cases where an exact match was not satisfied. A sample of the words
actually found in the diacritic-less index that could not satisfy some of the queries is
given in Table 7. It is obvious that failures, in most cases, were due to missing either the
diacritics or the affixes (i.e., prefixes, infixes, suffixes) that are attached to the original
words.
6.2.1.1 Discussion and findings In most cases, the system failed to find results that satisfy
the full-form of the query’s bag-of-words. For instance, (Q# 4, 5, 13–14, 17, 18, 20, 22,
24–26, 29–31, 35–37 and 39) failed to find any match in the diacritic-less index. The
following two examples explain the results of the search engine during this experiment:
Table 5 List of test data collected from Arabic native speakers
Q# | Query
1 | قبس
2 | زينة
3 | اثار
4 | اثاث
5 | اسف
6 | برهان
7 | الخاشعين
8 | لعب
9 | حديث
10 | اضحك
11 | زوج
12 | حجاب
13 | مستور
14 | ظهور
15 | جوع
16 | الصيام
17 | انصت
18 | ترتيل
19 | الصحف
20 | كنود
21 | عجوز
22 | السيد
23 | الطيب
24 | مسافر
25 | يتوه
26 | تجاوز
27 | طريق
28 | السجن
29 | قلع
30 | هزيمة
31 | الخضوع
32 | العزم
33 | روضة
34 | النصر
35 | ثمن
36 | جدار
37 | حصاد
38 | ملح
39 | تفكير
40 | القوة
Example 1 Query # 13: ‫( ﻣﺴﺘﻮﺭ‬covered).

The system could not find an exact match for this query. However, the diacritic-less
index includes the words: ‫( ﺳﺘﺮﺍ‬shield) and the word ‫( ﻣﺴﺘﻮﺭﺍ‬covered) as shown in
123
Inf Retrieval
Table 6 Results obtained from

Q# Query VR*
searching the Quran for the data
set using full-form queries
1 ‫ﻗﺒﺲ‬ 1
2 ‫ﺯﻳﻨﺔ‬ 6
3 ‫ﺍﺛﺎﺭ‬ 1
4 ‫ﺍﺛﺎﺙ‬ 0
5 ‫ﺍﺳﻒ‬ 0
6 ‫ﺑﺮﻫﺎﻥ‬ 4
7 ‫ﺍﻟﺨﺎﺷﻌﻴﻦ‬ 1
8 ‫ﻟﻌﺐ‬ 3
9 ‫ﺣﺪﻳﺚ‬ 10
10 ‫ﺍﺿﺤﻚ‬ 1
11 ‫ﺯﻭﺝ‬ 5
12 ‫ﺣﺠﺎﺏ‬ 4
13 ‫ﻣﺴﺘﻮﺭ‬ 0
14 ‫ﻇﻬﻮﺭ‬ 0
15 ‫ﺟﻮﻉ‬ 2
16 ‫ﺍﻟﺼﻴﺎﻡ‬ 3
17 ‫ﺍﻧﺼﺖ‬ 0
18 ‫ﺗﺮﺗﻴﻞ‬ 0
19 ‫ﺍﻟﺼﺤﻒ‬ 3
20 ‫ﻛﻨﻮﺩ‬ 0
21 ‫ﻋﺠﻮﺯ‬ 4
22 ‫ﺍﻟﺴﻴﺪ‬ 0
23 ‫ﺍﻟﻄﻴﺐ‬ 5
24 ‫ﻣﺴﺎﻓﺮ‬ 0
25 ‫ﻳﺘﻮﻩ‬ 0
26 ‫ﺗﺠﺎﻭﺯ‬ 0
27 ‫ﻃﺮﻳﻖ‬ 2
28 ‫ﺍﻟﺴﺠﻦ‬ 6
29 ‫ﻗﻠﻊ‬ 0
30 ‫ﻫﺰﻳﻤﺔ‬ 0
31 ‫ﺍﻟﺨﻀﻮﻉ‬ 0
32 ‫ﺍﻟﻌﺰﻡ‬ 1
33 ‫ﺭﻭﺿﺔ‬ 1
34 ‫ﺍﻟﻨﺼﺮ‬ 3
35 ‫ﺛﻤﻦ‬ 0
36 ‫ﺟﺪﺍﺭ‬ 0
37 ‫ﺣﺼﺎﺩ‬ 0
38 ‫ﻣﻠﺢ‬ 2
39 ‫ﺗﻔﻜﻴﺮ‬ 0
40 ‫ﺍﻟﻘﻮﺓ‬ 3
VR* Verses retrieved
Table 7. Although the verses containing these two words are relevant to the query,
the system failed to return them because it could not match these words with the word(s) of
the query.
Fig. 4 Results of searching the Quran scripts using full-form queries
Table 7 Sample of words in the diacritic-less index

Word Words in diacritic index
‫ﺍﺛﺎﺙ‬ ‫ﺍﺛﺎﺛﺎ‬
‫ﻣﺴﺘﻮﺭ‬ ‫ ﻣﺴﺘﻮﺭﺍ‬،‫ﺳﺘﺮﺍ‬
‫ﻇﻬﻮﺭ‬ ‫ﻇﻬﻮﺭﻫﺎ‬
‫ﺗﺮﺗﻴﻞ‬ ‫ﺗﺮﺗﻴﻼ‬
‫ﻣﺴﺎﻓﺮ‬ ‫ﺳﻔﺮ‬
‫ﺗﺠﺎﻭﺯ‬ ‫ﻭﻧﺘﺠﺎﻭﺯ‬
‫ﺛﻤﻦ‬ ‫ ﺛﻤﻨﺎ‬،‫ﺑﺜﻤﻦ‬
‫ﺣﺼﺎﺩ‬ ‫ﺣﺼﺎﺩﻩ‬
Example 2 Query # 15: ﺟﻮﻉ (hunger).

The system found two exact matches for this query: two verses containing the word
ﺟﻮﻉ (hunger) were retrieved. The diacritic-less index also contains three
morphological forms of this word: ﺗﺠﻮﻉ (to be hungry), ﺍﻟﺠﻮﻉ (starvation) and ﻭﺍﻟﺠﻮﻉ (and
starvation). However, the system failed to retrieve the relevant verses containing these
forms because the forms do not match the query’s bag-of-words. In the following experiments,
we show how our system addresses this problem and retrieves these verses through query
expansion using a stemmer and a thesaurus, respectively.
6.2.2 The effectiveness of QE
QE can be defined as the process of reformulating the query’s bag-of-words to overcome
the problem of mismatching potential documents and to improve the performance of a
search engine by retrieving documents that are more relevant (of better quality), or at
least equally relevant (Qiu and Frei 1993; Vectomova and Wang 2006). Without query
expansion, documents that are potentially relevant to the user’s query may
not be retrieved at all. Many QE techniques have been investigated in the IR literature.
They include:
● QE through synonymy. This is performed through finding words in a thesaurus that are
synonymous to the words in the query.
● QE through stemming. This is performed by augmenting the query’s bag-of-words with
their morphological variations that share the same stems.
● QE through word sensing. This is performed by determining the sense of each word, using
a specialized database such as WordNet, to resolve ambiguity.
● QE through fixing spelling errors. This is performed through fixing spelling errors and
automatically searching for the corrected form of the words.
● QE through paraphrasing. This is performed by rewriting the terms of the original
query.
Some QE techniques, such as synonymy and stemming, have been criticized for
increasing recall at the expense of precision. Other techniques, such as
word sense disambiguation (WSD), tend to increase precision. Nevertheless,
augmenting the user’s query with synonyms and morphological
variations, and ranking by the occurrences of the query’s words, causes documents that
contain more of the approximately matching terms to migrate toward the top of the ranked
list, which leads to better performance. In the next sections we discuss our experimental
results for QE using a stemmer and a thesaurus.
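The recall/precision trade-off mentioned above can be made concrete with a small worked example. The sketch below uses the standard set-based definitions of the two measures; the verse identifiers and the relevance judgments are hypothetical and were chosen only to illustrate the effect.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical verse ids, chosen only to illustrate the trade-off.
relevant = {"v1", "v2", "v3", "v4"}
full_form = {"v1", "v2"}                      # exact matches only
expanded = {"v1", "v2", "v3", "v4", "v9"}     # expansion also pulls in v9

print(precision(full_form, relevant), recall(full_form, relevant))  # 1.0 0.5
print(precision(expanded, relevant), recall(expanded, relevant))    # 0.8 1.0
```

In this toy case the expanded query finds every relevant verse (recall rises from 0.5 to 1.0) but also returns one irrelevant verse, so precision drops from 1.0 to 0.8.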
6.2.3 QE through stemming
In many cases, searching with the full-form query bag-of-words may not give good
results, and in some cases gives no results at all (some examples were given in Table 7).
This is because of the variation in morphological structure between the words in the corpus
and the query’s bag-of-words, which most of the time results in no match. Therefore, in our
modified search engine QE is done automatically to find all verses that have words
related to the roots of the query’s bag-of-words. The process starts by running the
stemmer against the query’s bag-of-words. For each root obtained from stemming the
query, we search the root index for all its associations within the diacritic and diacritic-less
indexes. All words satisfying this condition are added to the original query’s bag-of-words.
The expanded query is then submitted to search for all occurrences of
documents (in our case, verses) that contain these words.
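The expansion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: `TOY_ROOTS` is a hypothetical lookup table standing in for a real root-extraction stemmer, the index holds only the four diacritic-less forms of query # 15 mentioned in Example 2, and the verse ids are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for a root-extraction stemmer: word -> root.
TOY_ROOTS = {"جوع": "جوع", "تجوع": "جوع", "الجوع": "جوع", "والجوع": "جوع"}


def stem(word):
    """Return the root of a word, or None when the toy table does not know it."""
    return TOY_ROOTS.get(word)


def build_root_index(vocabulary):
    """Group the indexed words by their root."""
    root_index = defaultdict(set)
    for word in vocabulary:
        root = stem(word)
        if root is not None:
            root_index[root].add(word)
    return root_index


def expand_query(query_words, root_index):
    """Add to the query every indexed word that shares a root with a query word."""
    expanded = list(query_words)
    for word in query_words:
        for variant in sorted(root_index.get(stem(word), ())):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def search(word_index, words):
    """Return the union of the postings of all given words."""
    result = set()
    for word in words:
        result |= word_index.get(word, set())
    return result


# Toy diacritic-less index with invented verse ids.
word_index = {"جوع": {"v1", "v2"}, "تجوع": {"v3"}, "الجوع": {"v4"}, "والجوع": {"v5"}}
root_index = build_root_index(word_index)

query = ["جوع"]
print(len(search(word_index, query)))                            # full-form search
print(len(search(word_index, expand_query(query, root_index))))  # after expansion
```

With these toy postings the full-form search returns two verses and the expanded query returns five, mirroring the counts reported for query # 15.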
Running the experiment after expanding the search through stemming was effective
and satisfactory. The results of QE through stemming are listed in Table 8. QE
through stemming outperforms the full-form technique, as clearly
indicated by Fig. 5. Generally speaking, stemming improves recall as well as
precision. The work by Larkey et al. (2002) suggests that light
stemmers can perform better than root-extraction stemmers. The choice between root
extraction and light stemming depends on the source of the collection. Nevertheless, the
results obtained in this experiment indicate that both this technique and light stemming
are practical, especially for QA systems (Hammo et al. 2004).
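As a rough check on the figures in Table 8, per-query precision can be computed from its two count columns. The sketch below assumes that VRQ counts the retrieved verses judged relevant, so that precision is VRQ divided by VR; this reading of the columns is an assumption, and only three rows are copied here.

```python
# (VR, VRQ) pairs copied from Table 8 for queries 2, 15 and 21.
table8_sample = {2: (43, 21), 15: (5, 5), 21: (25, 8)}

# Assumed reading: precision = verses relevant to query / verses retrieved.
precision_by_query = {q: vrq / vr for q, (vr, vrq) in table8_sample.items()}

for q, p in precision_by_query.items():
    print(f"Q{q}: precision = {p:.2f}")
```

Recall cannot be derived from the table alone, since it would also require the total number of relevant verses in the collection for each query.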
6.2.3.1 Discussion and findings: Example 3 The result of running query # 15: ‫ﺟﻮﻉ‬
(hunger) using a stemmer.
Query # 15 can be rewritten through the QE process using a stemmer, as explained in Fig. 6.
The aim of the expansion is to recover the verses missed in the previous
experiment (as discussed in Example 2). In this experiment the query undergoes stemming
to identify all its roots. For each valid root in the query, the system automatically
searches the root index and adds to the original query all words (from the diacritic index and
‫‪Table 8 Results from search expansion using the stemmer‬‬

‫‪Q#‬‬ ‫‪Query‬‬ ‫‪Root‬‬ ‫*‪VR‬‬ ‫*‪VRQ‬‬
‫‪1‬‬ ‫ﻗﺒﺲ‬ ‫ﻗﺒﺲ‬ ‫‪3‬‬ ‫‪2‬‬

‫‪2‬‬ ‫ﺯﻳﻨﺔ‬ ‫ﺯﻳﻦ‬ ‫‪43‬‬ ‫‪21‬‬
‫‪3‬‬ ‫ﺍﺛﺎﺭ‬ ‫ﺍﺛﺮ‬ ‫‪21‬‬ ‫‪13‬‬
‫‪4‬‬ ‫ﺍﺛﺎﺙ‬ ‫ﺃﺛﺚ‬ ‫‪2‬‬ ‫‪2‬‬
‫‪5‬‬ ‫ﺍﺳﻒ‬ ‫ﺍﺳﻒ‬ ‫‪5‬‬ ‫‪4‬‬
‫‪6‬‬ ‫ﺑﺮﻫﺎﻥ‬ ‫ﺑﺮﻫﻦ‬ ‫‪8‬‬ ‫‪8‬‬
‫‪7‬‬ ‫ﺍﻟﺨﺎﺷﻌﻴﻦ‬ ‫ﺧﺸﻊ‬ ‫‪16‬‬ ‫‪16‬‬
‫‪8‬‬ ‫ﻟﻌﺐ‬ ‫ﻟﻌﺐ‬ ‫‪20‬‬ ‫‪20‬‬
‫‪9‬‬ ‫ﺣﺪﻳﺚ‬ ‫ﺣﺪﺙ‬ ‫‪36‬‬ ‫‪25‬‬
‫‪10‬‬ ‫ﺍﺿﺤﻚ‬ ‫ﺿﺤﻚ‬ ‫‪10‬‬ ‫‪10‬‬
‫‪11‬‬ ‫ﺯﻭﺝ‬ ‫ﺯﻭﺝ‬ ‫‪72‬‬ ‫‪56‬‬
‫‪12‬‬ ‫ﺣﺠﺎﺏ‬ ‫ﺣﺠﺐ‬ ‫‪8‬‬ ‫‪7‬‬
‫‪13‬‬ ‫ﻣﺴﺘﻮﺭ‬ ‫ﺳﺘﺮ‬ ‫‪3‬‬ ‫‪3‬‬
‫‪14‬‬ ‫ﻇﻬﻮﺭ‬ ‫ﻇﻬﺮ‬ ‫‪57‬‬ ‫‪27‬‬
‫‪15‬‬ ‫ﺟﻮﻉ‬ ‫ﺟﻮﻉ‬ ‫‪5‬‬ ‫‪5‬‬
‫‪16‬‬ ‫ﺍﻟﺼﻴﺎﻡ‬ ‫ﺻﻮﻡ‬ ‫‪11‬‬ ‫‪11‬‬
‫‪17‬‬ ‫ﺍﻧﺼﺖ‬ ‫ﻧﺼﺖ‬ ‫‪2‬‬ ‫‪2‬‬
‫‪18‬‬ ‫ﺗﺮﺗﻴﻞ‬ ‫ﺭﺗﻞ‬ ‫‪2‬‬ ‫‪2‬‬
‫‪19‬‬ ‫ﺍﻟﺼﺤﻒ‬ ‫ﺻﺤﻒ‬ ‫‪9‬‬ ‫‪8‬‬
‫‪20‬‬ ‫ﻛﻨﻮﺩ‬ ‫ﻛ ﻨﺪ‬ ‫‪1‬‬ ‫‪1‬‬
‫‪21‬‬ ‫ﻋﺠﻮﺯ‬ ‫ﻋﺠﺰ‬ ‫‪25‬‬ ‫‪8‬‬
‫‪22‬‬ ‫ﺍﻟﺴﻴﺪ‬ ‫ﺳﻮﺩ‬ ‫‪9‬‬ ‫‪3‬‬
‫‪23‬‬ ‫ﺍﻟﻄﻴﺐ‬ ‫ﻃﻴﺐ‬ ‫‪46‬‬ ‫‪21‬‬
‫‪24‬‬ ‫ﻣﺴﺎﻓﺮ‬ ‫ﺳﻔﺮ‬ ‫‪12‬‬ ‫‪8‬‬
‫‪25‬‬ ‫ﻳﺘﻮﻩ‬ ‫ﺗ ﻴﻪ‬ ‫‪1‬‬ ‫‪1‬‬
‫‪26‬‬ ‫ﺗﺠﺎﻭﺯ‬ ‫ﺟﻮﺯ‬ ‫‪5‬‬ ‫‪5‬‬
‫‪27‬‬ ‫ﻃﺮﻳﻖ‬ ‫ﻃﺮﻕ‬ ‫‪11‬‬ ‫‪4‬‬
‫‪28‬‬ ‫ﺍﻟﺴﺠﻦ‬ ‫ﺳﺠﻦ‬ ‫‪12‬‬ ‫‪10‬‬
‫‪29‬‬ ‫ﻗ ﻠﻊ‬ ‫ﻗ ﻠﻊ‬ ‫‪1‬‬ ‫‪1‬‬
‫‪30‬‬ ‫ﻫﺰﻳﻤﺔ‬ ‫ﻫﺰﻡ‬ ‫‪3‬‬ ‫‪3‬‬
‫‪31‬‬ ‫ﺍﻟﺨﻀﻮﻉ‬ ‫ﺧ ﻀﻊ‬ ‫‪2‬‬ ‫‪1‬‬
‫‪32‬‬ ‫ﺍﻟﻌﺰﻡ‬ ‫ﻋﺰﻡ‬ ‫‪9‬‬ ‫‪7‬‬
‫‪33‬‬ ‫ﺭﻭﺿﺔ‬ ‫ﺭﻭﺽ‬ ‫‪2‬‬ ‫‪2‬‬
‫‪34‬‬ ‫ﺍﻟﻨﺼﺮ‬ ‫ﻧﺼﺮ‬ ‫‪124‬‬ ‫‪124‬‬
‫‪35‬‬ ‫ﺛﻤﻦ‬ ‫ﺛﻤﻦ‬ ‫‪19‬‬ ‫‪11‬‬
‫‪36‬‬ ‫ﺟﺪﺍﺭ‬ ‫ﺟﺪﺭ‬ ‫‪4‬‬ ‫‪3‬‬
‫‪37‬‬ ‫ﺣﺼﺎﺩ‬ ‫ﺣﺼﺪ‬ ‫‪6‬‬ ‫‪3‬‬
‫‪38‬‬ ‫ﻣﻠﺢ‬ ‫ﻣﻠﺢ‬ ‫‪2‬‬ ‫‪2‬‬
‫‪39‬‬ ‫ﺗﻔﻜﻴﺮ‬ ‫ﻓﻜﺮ‬ ‫‪18‬‬ ‫‪18‬‬
‫‪40‬‬ ‫ﺍﻟﻘﻮﺓ‬ ‫ﻗﻮﻱ‬ ‫‪39‬‬ ‫‪38‬‬
‫‪VR* Verses retrieved, VRQ* Verses relevant to query‬‬
Fig. 5 Results of searching the Quran using a stemmer compared with the full-form word search
Search method Verses retrieved
Full-form word 2
Query expansion (stemmer) 3
Total retrieved 5
Fig. 6 Results of running query # 15 (Hunger/ﺟﻮﻉ) after being expanded using a stemmer
diacritic-less index) that have an association with this root. The new query is then used to
search the indexes for all potential occurrences of the new bag-of-words. Expanding query
# 15 adds 7 more words to the query and returns 5 verses instead of the 2 retrieved in
Example 2. All 5 retrieved verses are relevant to the query. The expanded query is
shown in Fig. 6.
6.2.4 QE through thesaurus
In this experiment, we benefited from the ongoing work of Kubaisi (2006) to construct a
thesaurus for the Quran. We compiled sets of semantic classes into an experimental
thesaurus. So far, the thesaurus index contains 100 semantic groups, where each group is
made up of 3–6 synonyms. The index contains around 500 words.
The association between the root index and the thesaurus index makes QE using the
thesaurus very straightforward. Words from the thesaurus that are related to the query’s
bag-of-words are added automatically to the original query to expand the search. A
comparison between QE using a stemmer and QE using a stemmer and a thesaurus is
depicted in Fig. 7.
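The thesaurus step can be sketched as a lookup in a list of semantic groups. This is a simplified illustration: the single group below is taken from the Table 9 row for query # 15, the function name is hypothetical, and the sketch matches surface words directly rather than going through the root index as the system does.

```python
# One semantic group, taken from the Table 9 row for query # 15 (hunger).
SEMANTIC_GROUPS = [
    {"جوع", "خصاصة", "مخمصة", "مسغبة"},
]


def thesaurus_expand(words, groups=SEMANTIC_GROUPS):
    """Append to the query every synonym that shares a semantic group with a query word."""
    expanded = list(words)
    for word in words:
        for group in groups:
            if word not in group:
                continue
            for synonym in sorted(group):
                if synonym not in expanded:
                    expanded.append(synonym)
    return expanded


expanded_query = thesaurus_expand(["جوع"])
print(expanded_query)
```

In the combined run, the words produced this way would be added on top of the stem-based variants before the indexes are searched.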
Fig. 7 Comparison between QE using a stemmer and QE using a thesaurus along with a stemmer
The results obtained from using the thesaurus outperformed the results obtained from
the stemmer alone. In Table 9 we give the semantic groups related to the data set used for
testing and the results obtained from this experiment.
6.2.4.1 Discussion and findings: Example 4 The result of running query # 15: ‫ﺟﻮﻉ‬
(hunger) using a stemmer and a thesaurus.
Again, query # 15 can be rewritten through the QE process using a thesaurus, as in Fig. 8.
The aim of the expansion is to extend the search using words that are related in meaning to
the query’s bag-of-words. By doing the expansion we hope to retrieve more documents that
are relevant to the original query, even though the original query does not contain these words.
In this experiment the query undergoes stemming to identify all its roots. For each valid
root, the system automatically searches the root index and adds to the original query all
words (from the diacritic index, the diacritic-less index and the thesaurus index) that are
associated with this root. The new query is then used to search the indexes for potential
occurrences of the new bag-of-words. The expansion of query # 15 adds 10 more words to
the query and returns 8 verses, all of which are relevant to the query. The expanded query
is given in Fig. 8.
6.3 Comparing the results of the experiments
A comparison of the different techniques applied to the data set over the three runs is
shown in Fig. 9. As the chart indicates, applying QE techniques improves the results
dramatically: using a stemmer and a thesaurus outperformed the original search
using the full form of the words. The improvement in recall and precision for the QE process is
given in Fig. 10. Again, it indicates that using a stemmer and a thesaurus improves AIR
search engines.
7 Conclusion and future work
In this article, we explained the problem of searching diacritisized text using current
international search engines such as Google and Yahoo. We provide a framework solution
‫‪Table 9 Results of the search expansion using a thesaurus of semantic groups from the Quran‬‬
‫‪Q#‬‬ ‫‪Query‬‬ ‫‪Semantic classes‬‬ ‫*‪VRW‬‬ ‫*‪VRS‬‬ ‫*‪VRT‬‬
‫‪1‬‬ ‫ﻗﺒﺲ‬ ‫ﺟﺬﻭﺓ‬ ‫ﻟﻬﺐ‬ ‫ﺷﻬﺎﺏ‬ ‫ﺷﺮﺭ‬ ‫ﺷﻮﺍﻅ‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪41‬‬
‫‪2‬‬ ‫ﺯﻳﻨﺔ‬ ‫ﺟﻤﺎﻝ‬ ‫ﺟﻤﻴﻞ‬ ‫ﺣﺴﻦ‬ ‫ﻧﻀﺎﺭﺓ‬ ‫‪6‬‬ ‫‪21‬‬ ‫‪37‬‬
‫‪3‬‬ ‫ﺍﺛﺎﺭ‬ ‫ﺃﻃﻼﻝ‬ ‫ﻋﻼﻣﺎﺕ‬ ‫ﺁﻳﺔ‬ ‫‪1‬‬ ‫‪13‬‬ ‫‪93‬‬
‫‪4‬‬ ‫ﺍﺛﺎﺙ‬ ‫ﻣﺘﺎﻉ‬ ‫‪0‬‬ ‫‪2‬‬ ‫‪8‬‬
‫‪5‬‬ ‫ﺍﺳﻒ‬ ‫ﺗﺎﺏ‬ ‫ﺃﻧﺎﺏ‬ ‫ﺁﺏ‬ ‫ﺭﺟﻊ‬ ‫ﺍﻧﺘﻬﻰ‬ ‫ﻓﺎﺀ‬ ‫‪0‬‬ ‫‪4‬‬ ‫‪26‬‬
‫‪6‬‬ ‫ﺑﺮﻫﺎﻥ‬ ‫ﺑّﻴﻨﺔ‬ ‫ﺁﻳﺔ‬ ‫ﺁﻻﺀ‬ ‫ﺣ ّﺠﺔ‬ ‫ﺑﺼﺎﺋﺮ‬ ‫‪4‬‬ ‫‪8‬‬ ‫‪119‬‬
‫‪7‬‬ ‫ﺍﻟﺨﺎﺷﻌﻴﻦ‬ ‫ﺍﺧﺒﺎﺕ‬ ‫ﺗﻀﺮﻉ‬ ‫ﺍﻃﻤﺌﻨﺎﻥ‬ ‫ﻟﻴﻦ‬ ‫ﺧﻀﻮﻉ‬ ‫‪1‬‬ ‫‪16‬‬ ‫‪41‬‬
‫‪8‬‬ ‫ﻟﻌﺐ‬ ‫ﺍﻟﻌﺒﺚ‬ ‫ﺍﻟﻠﻬﻮ‬ ‫ﺍﻟﻠﻐﻮ‬ ‫ﺍﻟﺨﻮﺽ‬ ‫‪3‬‬ ‫‪20‬‬ ‫‪52‬‬
‫‪9‬‬ ‫ﺣﺪﻳﺚ‬ ‫ﺻﻮﺕ‬ ‫ﻟﻔﻆ‬ ‫ﻧﻄﻖ‬ ‫ﻛﻼﻡ‬ ‫ﻗﻮﻝ‬ ‫‪10‬‬ ‫‪25‬‬ ‫‪98‬‬
‫‪10‬‬ ‫ﺍﺿﺤﻚ‬ ‫ﺍﻹﺑﺘﺴﺎﻡ‬ ‫ﺍﻹﺳﺘﻬﺰﺍﺀ‬ ‫ﺍﻟﺴﺨﺮﻳﺔ‬ ‫ﺍﻹﺯﺩﺭﺍﺀ‬ ‫ﺍﻹﺳﺘﺨﻔﺎﻑ‬ ‫‪1‬‬ ‫‪10‬‬ ‫‪58‬‬
‫‪11‬‬ ‫ﺯﻭﺝ‬ ‫ﺑﻌﻞ‬ ‫ﺳّﻴﺪ‬ ‫ﺻﺎﺣﺐ‬ ‫‪5‬‬ ‫‪56‬‬ ‫‪64‬‬
‫‪12‬‬ ‫ﺣﺠﺎﺏ‬ ‫ﻏﻄﺎﺀ‬ ‫ﻏﺸﺎﺀ‬ ‫ﺧﻤﺎﺭ‬ ‫ﺳﺘﺎﺭ‬ ‫‪4‬‬ ‫‪7‬‬ ‫‪23‬‬
‫‪13‬‬ ‫ﻣﺴﺘﻮﺭ‬ ‫ﺃﺧﻔﻰ‬ ‫ﻳﻌﺰﺏ‬ ‫ﺃﺳ ّﺮ‬ ‫ﻛﺘﻢ‬ ‫ﺣﺠﺐ‬ ‫‪0‬‬ ‫‪3‬‬ ‫‪45‬‬
‫‪14‬‬ ‫ﻇﻬﻮﺭ‬ ‫ﺑﺪ ﺍ‬ ‫ﻃﻠﻊ‬ ‫ﺑﺰﻍ‬ ‫ﺑﺮﺯ‬ ‫‪0‬‬ ‫‪27‬‬ ‫‪47‬‬
‫‪15‬‬ ‫ﺟﻮﻉ‬ ‫ﺧﺼﺎﺻﺔ‬ ‫ﻣﺨﻤﺼﺔ‬ ‫ﻣﺴﻐﺒﺔ‬ ‫‪2‬‬ ‫‪5‬‬ ‫‪9‬‬
‫‪16‬‬ ‫ﺍﻟﺼﻴﺎﻡ‬ ‫ﺃﻣﺴﻚ‬ ‫ﻃﻮﻯ‬ ‫ﻭﺻﻞ‬ ‫ﺣﺼﺮ‬ ‫‪3‬‬ ‫‪11‬‬ ‫‪24‬‬
‫‪17‬‬ ‫ﺍﻧﺼﺖ‬ ‫ﺇﺳﺘﻤﻊ‬ ‫ﺳﻤﻊ‬ ‫‪0‬‬ ‫‪2‬‬ ‫‪28‬‬
‫‪18‬‬ ‫ﺗﺮﺗﻴﻞ‬ ‫ﺗﻼ‬ ‫ﻗﺮﺃ‬ ‫‪0‬‬ ‫‪2‬‬ ‫‪74‬‬
‫‪19‬‬ ‫ﻛﻨﻮﺩ‬ ‫ﺟﺤﻮﺩ‬ ‫ﻛﻔﺮﺍﻥ‬ ‫ﻧﻜﺮﺍﻥ‬ ‫ﺧﺪﺍﻉ‬ ‫‪3‬‬ ‫‪8‬‬ ‫‪26‬‬
‫‪20‬‬ ‫ﺍﻟﺼﺤﻒ‬ ‫ِﺫﻛﺮ‬ ‫ﻗﺮﺁﻥ‬ ‫ﻛﺘﺎﺏ‬ ‫ﻓﺮﻗﺎﻥ‬ ‫‪0‬‬ ‫‪1‬‬ ‫‪245‬‬
‫‪21‬‬ ‫ﻋﺠﻮﺯ‬ ‫ﺍﻟﻀﻌﻒ‬ ‫ﺍﻟﻮﻫﻦ‬ ‫ﺍﻟﻮﻫﻲ‬ ‫ﺍﻟﻔﺘﻮﺭ‬ ‫ﺍﻟﻜﺴﻞ‬ ‫ﺍﻟﺘﺜﺎﻗﻞ‬ ‫‪4‬‬ ‫‪8‬‬ ‫‪50‬‬
‫‪22‬‬ ‫ﺍﻟﺴﻴﺪ‬ ‫ﺍﻟﺠﻠﻴﻞ‬ ‫ﺍﻟﻜﺒﻴﺮ‬ ‫ﺍﻟﻌﻈﻴﻢ‬ ‫ﺍﻟﻮﺟﻴﻪ‬ ‫‪0‬‬ ‫‪3‬‬ ‫‪51‬‬
‫‪23‬‬ ‫ﺍﻟﻄﻴﺐ‬ ‫ﺍﻟﺰﺍﻛﻲ‬ ‫ﺍﻟﻄﺎﻫﺮ‬ ‫ﺍﻟﺼﺎﻓﻲ‬ ‫ﺍﻟﻤﺼﻨﻮﻉ‬ ‫‪5‬‬ ‫‪21‬‬ ‫‪73‬‬
‫‪24‬‬ ‫ﻣﺴﺎﻓﺮ‬ ‫ﻇﻌﻦ‬ ‫ﺧﺮﺝ‬ ‫ﺭﺣﻞ‬ ‫ﺳﺎﺡ‬ ‫ﺳﺎﺭ‬ ‫ﺳﺮﻯ‬ ‫ﻫﺎﺟﺮ‬ ‫‪0‬‬ ‫‪8‬‬ ‫‪98‬‬
‫‪25‬‬ ‫ﻳﺘﻮﻩ‬ ‫ﺫﺑﺬﺏ‬ ‫ﺗﺤّﻴﺮ‬ ‫ﺗﺮﺩﺩ‬ ‫‪0‬‬ ‫‪1‬‬ ‫‪4‬‬
‫‪26‬‬ ‫ﺗﺠﺎﻭﺯ‬ ‫ﻋﺒﺮ‬ ‫ﻗﻄﻊ‬ ‫ﺳﺒﻖ‬ ‫ﺳﺎﺭﻉ‬ ‫‪0‬‬ ‫‪5‬‬ ‫‪21‬‬
‫‪27‬‬ ‫ﻃﺮﻳﻖ‬ ‫ﺇﻣﺎﻡ‬ ‫ﺻﺮﺍﻁ‬ ‫ﺳﺒﻴﻞ‬ ‫ﻧﻬﺞ‬ ‫ﻓﺞ‬ ‫ﺟﺪﺩ‬ ‫ﻧﻔﻖ‬ ‫‪2‬‬ ‫‪4‬‬ ‫‪227‬‬
‫‪28‬‬ ‫ﺍﻟﺴﺠﻦ‬ ‫ﺣﺒﺲ‬ ‫ﺍﻣﺴﺎﻙ‬ ‫ﺗﻮﻗﻴﻒ‬ ‫ﺍﺛﺒﺎﺕ‬ ‫ﺣﺠﺮ‬ ‫‪6‬‬ ‫‪10‬‬ ‫‪26‬‬
‫‪29‬‬ ‫ﻗﻠﻊ‬ ‫ﻗﻄﻊ‬ ‫ﻧﺰﻉ‬ ‫ﺻﺮﻡ‬ ‫‪0‬‬ ‫‪1‬‬ ‫‪19‬‬
‫‪30‬‬ ‫ﻫﺰﻳﻤﺔ‬ ‫ﺩﺣﺾ‬ ‫ﺩﺣﺮ‬ ‫ﺩﺧﺮ‬ ‫ﺩﻓﻊ‬ ‫ﺃﺟﻠﻰ‬ ‫ﻃﺮﺩ‬ ‫ﺷ ّﺮﺩ‬ ‫‪0‬‬ ‫‪3‬‬ ‫‪25‬‬
‫‪31‬‬ ‫ﺍﻟﺨﻀﻮﻉ‬ ‫ﺍﻹﺫﻋﺎﻥ‬ ‫ﺍﻹﺳﺘﺴﻼﻡ‬ ‫ﺍﻹﺳﺘﻜﺎﻧﺔ‬ ‫ﺍﻟﻄﺎﻋﺔ‬ ‫ﺍﻻﺳﺘﺠﺎﺑﺔ‬ ‫‪0‬‬ ‫‪1‬‬ ‫‪16‬‬
‫‪32‬‬ ‫ﺍﻟﻌﺰﻡ‬ ‫ﺍﻟﺤﻤﻴﺔ‬ ‫ﺍﻟﺼﻤﻮﺩ‬ ‫ﺍﻟﺼﺒﺮ‬ ‫ﺍﻟﻌﺮﺍﻡ‬ ‫ﺍﻟﻌﺰﺓ‬ ‫‪1‬‬ ‫‪7‬‬ ‫‪20‬‬
‫‪33‬‬ ‫ﺭﻭﺿﺔ‬ ‫ﺣﺪﻳﻘﺔ‬ ‫ﺣﺮﺙ‬ ‫ﺯﺭﻉ‬ ‫ﻧﺒﺎﺕ‬ ‫ﺑﺴﺘﺎﻥ‬ ‫ﺣﻘﻞ‬ ‫‪1‬‬ ‫‪2‬‬ ‫‪28‬‬
‫‪34‬‬ ‫ﺍﻟﻨﺼﺮ‬ ‫ﺛّﺒﺖ‬ ‫ﺃّﻳﺪ‬ ‫ﺃﻣ ّﺪ‬ ‫ﺃﻋﺎﻥ‬ ‫ﻣ ّﻜﻦ‬ ‫ﺁﺯﺭ‬ ‫‪3‬‬ ‫‪124‬‬ ‫‪153‬‬
‫‪35‬‬ ‫ﺛﻤﻦ‬ ‫ﺃﺟﺮ‬ ‫ﻋﻄﺎﺀ‬ ‫ﺟﺰﺍﺀ‬ ‫ﺛﻮﺍﺏ‬ ‫ﻭﻓﺎﺀ‬ ‫ﻓﻀﻞ‬ ‫‪0‬‬ ‫‪11‬‬ ‫‪184‬‬
‫‪36‬‬ ‫ﺟﺪﺍﺭ‬ ‫ﺣﺎﺋﻂ‬ ‫ﺳﻮﺭ‬ ‫ﺳﺪ‬ ‫ﺭﺩﻡ‬ ‫ﺑﺮﺯﺥ‬ ‫ﺣﺎﺟﺰ‬ ‫ﺣﺠﺎﺏ‬ ‫‪0‬‬ ‫‪3‬‬ ‫‪17‬‬
‫‪37‬‬ ‫ﺣﺼﺎﺩ‬ ‫ﺟ ﻨﻰ‬ ‫ﺧﻀﺪ‬ ‫ﻗﻄﻒ‬ ‫‪0‬‬ ‫‪3‬‬ ‫‪8‬‬
‫‪38‬‬ ‫ﻣﻠﺢ‬ ‫ﺃُﺟﺎﺝ‬ ‫‪2‬‬ ‫‪2‬‬ ‫‪5‬‬
‫‪39‬‬ ‫ﺗﻔﻜﻴﺮ‬ ‫ﺍﻟﻌﻘﻞ‬ ‫ﺍﻟﻤﺘﺪﺑﺮ‬ ‫ﺍﻟﺮﺷﻴﺪ‬ ‫‪0‬‬ ‫‪18‬‬ ‫‪83‬‬
‫‪40‬‬ ‫ﺍﻟﻘﻮﺓ‬ ‫ﺍﻟﺸﺪﻳﺪ‬ ‫ﺍﻟﻐﻠﻴﻆ‬ ‫ﺍﻟﺜﻘﻴﻞ‬ ‫ﺍﻟﻤﺘﻴﻦ‬ ‫ﺍﻟﺼﻠﺪ‬ ‫ﺍﻟﺼﺮﺻﺮ‬ ‫‪3‬‬ ‫‪38‬‬ ‫‪61‬‬
‫‪VRW* Verses retrieved using query’s bag-of-words, VRS* Verses retrieved using stemmer, VRT* Verses retrieved using a stemmer & a thesaurus‬‬
Search method Verses retrieved
Full-form word 2
Query expansion (stemmer) 3
Query expansion (thesaurus) 3
Total retrieved 8
Fig. 8 Results of expanding query # 15 (Hunger/ﺟﻮﻉ) using a stemmer & a thesaurus
Fig. 9 Comparison of the three different searching techniques
Fig. 10 Recall/precision chart for the data set using query expansion techniques
for the searching problem through indexing. We investigated the use of QE for searching
the Quran scripts in the absence of diacritics, where queries are automatically augmented
with related terms extracted from diacritic and diacritic-less indexes by applying a
stemmer and a thesaurus of semantic classes. We conducted a set of experiments to test our
system on a data set of 40 queries and the scripts of the holy Quran. We found that QE for
searching Arabic text is promising and that its efficiency can likely be further
improved.
Applications such as IE, QA, TS, and MT are a few examples of NLP applications that
rely extensively on extracting concepts from web documents. This process requires the
analysis of the document content, whether morphological, syntactic or semantic;
therefore, new search engines equipped with tools to integrate and derive new meaning
from Arabic repositories need to be investigated. Our system could be improved by adding
a more sophisticated morphological analyzer, a part-of-speech tagger (POST), and an Arabic ontology.
Acknowledgment We would like to thank Shereen Khoja for providing her stemmer, Prof. Nadim Obeid
for his valuable suggestions to improve this work and Mahmoud El-Hajj for helping with the construction of
the thesaurus and the database implementation.
References
Abdelali, A., Cowie, J., Farwell, D., Ogden, W., & Helmreich, S. (2003). Cross-language information
retrieval using ontology. In Proceedings of TALN 2003, Batz-sur-Mer, France.
Abdelali, A., Cowie, J., & Soliman, H. (2004). Arabic information retrieval perspectives. In Proceedings of
JEP-TALN 2004 Arabic Language Processing.
Abdelali, A., Cowie, J., & Soliman, H. (2006). Improving query expansion precision using latent semantic
analysis: Application on Arabic retrieval. Journées d’Etudes sur le Traitement Automatique de la Langue
Arabe (JETALA), Rabat, Morocco.
AbdulJaleel, N., & Larkey, L. (2003). Statistical transliteration for English-Arabic cross language
information retrieval. In Proceedings of the Twelfth International Conference on Information and Knowledge
Management, pp. 139–146.
Aljlayl, M., Frieder, O., & Grossman, D. (2002). On Arabic-English cross-language information retrieval: A
machine translation approach. In Proceedings of the Third International Conference on Information
Technology, pp. 2–7.
Al-Maskari, A., Sanderson, M., & Clough, P. (2007). Arabic users’ satisfaction with the online information
as obtained from Google. In Proceedings of Sixth International Conference on Conceptions of Library
and Information Science (CoLIS).
Bar-Ilan, J., & Gutman, T. (2005). How do search engines respond to some non-English queries? Journal of
Information Science, 31(1), 13–28.
Black, W., & Elkateb, S. (2004). A prototype English-Arabic dictionary based on WordNet. In Proceedings
of 2nd Global WordNet Conference, (GWC 2004), pp. 67–74.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., et al. (2006). Introducing the
Arabic wordnet project. In Proceedings of the Third International WordNet Conference, (GWC 2006),
pp. 295–299.
Buckwalter, T. (2007). Issues in Arabic morphological analysis. In A. Soudi, A. Van den Bosch, &
G. Neumann (Eds.), Arabic computational morphology (pp. 23–41). Netherlands: Springer.
ISBN 978-1-4020-6045-8.
Debili, F., Achour, H., & Souissi, E. (2002). De l’étiquetage grammatical à la voyellation automatique de
l’arabe. Correspondances (Vol. 71, pp. 10–28). Tunis: Institut de Recherche sur le Maghreb
Contemporain.
Diab, M. (2007). Improved Arabic base phrase chunking with a new enriched POS tag set. In Proceedings of
the 5th Workshop on Important Unresolved Matters, pp. 89–96.
Dumais, S., Landauer, T., & Littman, M. (1996). Automatic cross-linguistic information retrieval using
latent semantic indexing. In SIGIR’96-Workshop on Cross-Linguistic Information Retrieval, pp. 16–23.
Elkateb, S., & Black, W. (2004). A bilingual dictionary with enriched lexical information. In Proceedings of
NEMLAR Cairo, Egypt 2004 Arabic Language Tools and Resources, pp. 79–84.
El-Helw, A., & Aly, H. (2004). An intelligent database application for the semantic web. In Proceedings of
CSITeA-04 Conference, Cairo, Egypt.
El-Sadany, T., & Hashish, M. (1988). Semi-automatic vowelization of Arabic verbs. In 10th NC Conference,
Jeddah, Saudi Arabia.
French, J., Powell, A., Gey, F., & Perelman, N. (2001). Exploiting a controlled vocabulary to improve
collection selection and retrieval effectiveness. In Proceedings of the Tenth International Conference on
Information and Knowledge Management, pp. 199–206.
Gal, Y. (2002). An HMM approach to vowel restoration in Arabic and Hebrew. In Proceedings of ACL-02
Workshop on Computational Approaches to Semitic Languages, pp. 27–33.
Gey, F., Kando, N., & Peters, C. (2002). Cross language information retrieval: A research roadmap. ACM
SIGIR Forum, 36(2), 72–80.
Grefenstette, G. (1996). Cross-linguistic information retrieval workshop. In Proceedings of the 19th Annual
International ACM SIGIR Conference on Research and Development in IR, p. 344.
Grefenstette, G., Semmar, N., & Elkateb-Gara, F. (2005). Modifying a natural language processing system
for European languages to treat Arabic in information processing and information retrieval applications.
In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pp. 31–38.
Hammo, B., Abu-Salem, H., Lytinen, S., & Evens, M. (2002). QARAB: A question answering system to
support the Arabic language. In Proceedings of ACL-02 Workshop on Computational Approaches to
Semitic Languages, pp. 55–65.
Hammo, B., Abuleil, S., Lytinen, S., & Evens, M. (2004). Experimenting with a question answering system
for the Arabic language. Computers and the Humanities, 38(4), 379–415.
Hayashi, Y., Kikui, G., & Susaki, S. (1997). TITAN: A cross-linguistic search engine for the WWW. In
Working Notes of AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pp. 58–65.
Hedlund, T., Airio, E., Keskustalo, H., Lehtokangas, R., Pirkola, A., & Järvelin, K. (2004). Dictionary based
cross-language information retrieval: Learning experiences from CLEF 2000–2002. Information
Retrieval, 7(1), 99–119.
Hull, D., & Grefenstette, G. (1996). Querying across languages: A dictionary-based approach to multilingual
information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 49–57.
Kamps, J. (2004). Improving retrieval effectiveness by reranking documents based on controlled
vocabulary. Lecture Notes in Computer Science, 2997, 283–295.
Khadir, M. (2002). Quran lexicon. Retrieved April 10, 2008 from
http://www.al-mishkat.com/words/book.htm.
Khoja, S. (1999). Stemming Arabic text. Retrieved June 20, 2007 from
http://zeus.cs.pacificu.edu/shereen/research.htm.
Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop at NAACL
2001, pp. 20–25.
Kirchhoff, K., & Vergyri, D. (2005). Cross-dialectal data sharing for acoustic modeling in Arabic speech
recognition. Speech Communication, 46(1), 37–51.
Kubaisi, A. (2006). Quran words. Retrieved April 10, 2008 from http://www.islamiyyat.com/kalema.htm.
Landauer, T., & Littman, M. (1990). Fully automatic cross-language document retrieval using latent
semantic indexing. In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford
English Dictionary and Text Research, pp. 31–38.
Larkey, L., Ballesteros, L., & Connell, M. (2002). Improving stemming for Arabic information retrieval:
Light stemming and co-occurrence analysis. In Proceedings of the 25th Annual International ACM
SIGIR Conference on Research & Development in IR, pp. 275–282.
Larkey, L., & Connell, M. (2005). Structured queries, language modeling, and relevance modeling in
cross-language information retrieval. Information Processing and Management: An International Journal,
41(3), 457–473.
Lazarinis, F. (2007). Web retrieval systems and the Greek language: Do they have an understanding?
Journal of Information Science, 33(5), 622–636.
Lundquist, C., Frieder, O., Holmes, D., & Grossman, D. (1997). A parallel relational database management
system approach to relevance feedback in information retrieval. Journal of the American Society of
Information Science (JASIS), 50(5), 413–426.
Moukdad, H. (2004). Lost in cyberspace: How do search engines handle Arabic queries? In Proceedings
of the 32nd Annual Conference of the Canadian Association for Information Science. Retrieved
October 1, 2008 from www.cais-acsi.ca/proceedings/2004/moukdad_2004.pdf.
Moukdad, H., & Cui, H. (2005). How do search engines handle Chinese queries? Webology, 2(3). Retrieved
October 1, 2008 from www.Webology.ir/2005/v2n3/a17.html.
Oard, D. (1998). A comparative study of query and document translation for cross-language information
retrieval. In Proceedings of the 3rd Conference of the Association for Machine Translation in the
Americas, pp. 472–483.
Oard, D. (2000). Evaluating interactive cross-language information retrieval: Document selection. In Cross-
Language Information Retrieval and Evaluation, Workshop of Cross-Language Evaluation Forum,
CLEF 2000, pp. 57–71.
Pirkola, A., Hedlund, T., Keskustalo, H., & Järvelin, K. (2001). Dictionary-based cross-language
information retrieval: Problems, methods, and research findings. Information Retrieval, 4(3–4), 209–230.
Qiu, Y., & Frei, H. (1993). Concept based query expansion. In Proceedings of the 16th ACM SIGIR
International Conference on Research and Development in IR, pp. 160–169.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York: McGraw-Hill
Book Company.
Salton, G. (1989). Automatic text processing—the transformation analysis and retrieval of information by
computer. MA: Addison Wesley.
Semmar, N., & Fluhr, C. (2007). Arabic to French sentence alignment: Exploration of a cross-language
information retrieval approach. In Proceedings of the 5th Workshop on Important Unresolved Matters,
pp. 73–80.
Semmar, N., Laib, M., & Fluhr, Ch. (2006). Using stemming in morphological analyzer to improve Arabic
information retrieval. In Proceedings of TALN 2006, pp. 317–327.
Sroka, M. (2000). Web search engines for Polish information retrieval: Questions of search capabilities and
retrieval performance. International Information & Library Research, 32(2), 87–98.
Strzalkowski, T., & Vauthey, B. (1992). Information retrieval using robust natural language processing. In
Proceedings of ACL-92, pp. 104–111.
Talvensaari, T., Juhola, M., Laurikkala, J., & Järvelin, K. (2007). Corpus-based cross-language information
retrieval in retrieval of highly relevant documents: Research articles. Journal of the American Society
for Information Science and Technology, 58(3), 322–334.
Vectomova, O., & Wang, Y. (2006). A study of the effect of term proximity on query expansion. Journal of
Information Science, 32(4), 324–333.
Virga, P., & Khudanpur, S. (2003). Transliteration of proper names in cross-lingual information retrieval. In
Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity
Recognition, Vol. 15, pp. 57–64.
Xu, J., Fraser, A., & Weischedel, R. (2002). Empirical studies in strategies for Arabic retrieval. In
Proceedings of the 25th Annual International ACM SIGIR Conference on Research & Development in
Information Retrieval, pp. 269–274.
Zaidi, S., & Laskri, M. (2005). A cross-language information retrieval based on an Arabic ontology in the
legal domain. In Proceedings of the International Conference on Signal-Image Technology and Internet-
Based Systems (SITIS’05), pp. 86–91.
Zitouni, I., Sorensen, J., & Sarikaya, R. (2006). Maximum entropy based restoration of Arabic diacritics. In
Proceedings of the 21st International Conference on Computational Linguistics, pp. 577–584.