
International Review on Computers and Software (I.RE.CO.S.), Vol. 5, n. 4

A Computer-Aided Lexicography Tool for Making Dictionaries on Historical Principles
Saad Harous
harous@uaeu.ac.ae
College of Information Technology
United Arab Emirates University
Sane M Yagi
saneyagi@yahoo.com
Linguistics Department
University of Jordan, Amman 11942, Jordan
Jim Yaghi
yaghi1@gmail.com
Leptobyte Innovations Ltd.
Auckland, New Zealand
P.O. Box 205, Shafa Badran, Amman 11934, Jordan

Abstract: Despite the long-standing tradition of lexicography that Arabic prides itself
on, the language does not have a dictionary that states the origin of words and that traces their
development across time. Several attempts have been made at such a dictionary recently, but they failed,
resulting in frustration and in the conclusion that the task is daunting. The main reasons for failure are
the sheer volume of work required and the lack of computer tools to help build such a dictionary. In this
paper, we present a computational tool that would facilitate the compilation of an Arabic dictionary on
historical principles. There are no openly available tools for Arabic dictionary making; if such tools do exist,
they are jealously guarded for their commercial value and are therefore unavailable to scholars who
might want to take part in the grand endeavor of building an etymological Arabic dictionary. This
research will make its tool available to the open source community to encourage further development
and refinement. The computational tool can also be used in the development of computer-assisted
language learning software. Concordances, for example, are by-products of this research,
yet they are invaluable to the teaching of grammar and morphology; they encourage learning by
discovery.

Keywords: Arabic, Dictionary, Lexicography, Morphology, Computational Tools, CALL.

Introduction
At the outset of this paper, we will discuss the importance of dictionaries on historical principles,
outline the efforts that lexicographers have made towards compiling such a dictionary for
Arabic, and list some resources that could be utilized to trace the history of Arabic words. We will
then briefly state the various components and specifications of the software that we developed to help
lexicographers compile this dictionary, before moving on to a detailed discussion of the major
technical difficulties that we encountered when developing it. We will describe the difficulties
one by one, discuss the alternative solutions considered, and state the adopted solutions and the
justification for them. At the end, we will describe the current status of the software and the features being
implemented.

Lexis is a set of tools that we developed for Arabic computational lexicography. It is a Web 2.0-based
dictionary maker with special facilities for chronicling the etymology of Arabic words.


Arabic's 2,000-year documented history has given each word in its vocabulary a history and a
fascinating story. By learning how the meaning of a word was altered over time, we can index changes
in Arab intellect and culture.

Furthermore, the study of word origins can help students remember meanings and grasp the relevance
of words beyond their definitions (Readence, Bean, & Baldwin, 2004). As Ludwig Wittgenstein
stated, "the meaning of a word is its use in the language", and indeed that is the only 'true' meaning
of a word: it is what current users of the language denote with it.

When interpreting classical texts, etymology makes it easier for the reader to unravel what the texts
mean by indicating how a word was understood by the author's contemporaries several centuries
in the past.

To create an etymological dictionary, it is necessary to compile thousands of gigabytes of texts of
different genres and ages. Added to this are the arduous tasks of classifying the texts chronologically,
indexing all the words in them, and determining what each word meant in its textual context.

This is a major undertaking that has daunted many a scholar for years. With the blessing of
Web 2.0 social platforms, hope of compiling an etymological dictionary has been rekindled: a
community or network of specialists can pool their contributions through the lexicography engine
presented here.

Lexis would facilitate combining the efforts of specialists from around the world in a platform-
independent setting for the purpose of creating and publishing a comprehensive archive of the historical
"stories" of the majority of words ever used in the Arabic language.

The reader is invited to access Lexis at http://www.arabiclexis.com/ to experience Arabic
computational lexicography at work for themselves.

The Arabic Dictionary on Historical Principles


Haywood attributes to Arabs a central position in the history of lexicography: "They were not, by any
means, the first people to compile dictionaries of merit: but al-Khalîl may well have been the first man
to attempt to register the complete vocabulary content of any language" (Haywood, 1965). For the
development of Arabic lexicography and its historical roots, see (Gätje, 1985). For a tracing of modern
Arabic lexicography efforts, see (Sawaie, 1990) and (Haywood, 1965), who talk at length about the
activities of modern lexicographers at the end of the 19th century and the turn of the 20th century,
particularly al-Shidyaq and al-Bustani.

Arabic lexicography had a lasting impact on the dictionary-making of Persian, Turkish, Malay,
Hebrew, Syriac, and several European languages; yet, as (Omar, 1998) points out, it has lately fossilized
and become incapable of serving the diverse needs of the language's speakers and learners. (Hamzaoui, 1989),
(Haywood, 1965), and (Omar, 1998), to name but a few, stress the need for an Arabic dictionary on
historical principles. The Cairo Arabic Language Academy's 1932 charter designated as one of its
primary goals the task of compiling a dictionary on historical principles (Academy, 1972). This
dictionary has not seen the light after more than 78 years of the language academy's existence.

Several attempts have been made at compiling an Arabic dictionary on historical principles. Ibrahim
Madkour (Academy, 1972), in his introduction to the 1960 first edition of Al-Mu'jam al-Waseet,
outlines the involvement of the Academy in this regard. But the Academy then settled for a less
ambitious dictionary that evaded some of the stringent requirements of a dictionary on historical
principles. They named it al-Mu'jam al-Kabeer, the comprehensive dictionary. It had started
out as representative of a wide spectrum of Arabic history, from the pre-Islamic era to the present, but
soon abandoned this course, finding the task daunting (Dhayf, 1984). Their website
(http://www2.sis.gov.eg/En/Arts&Culture/ala/073500000000000001.htm) asserts that five volumes of
this dictionary have been published, but without the historical dimension (Academy, 2003).

August Fischer was not the only orientalist who worked on this. (Dozy, 1968) wrote the Dictionnaire
détaillé des noms des vêtements chez les Arabes and a two-volume Supplément aux dictionnaires
arabes.

Etymology is often pursued whenever it is desired to establish a connection between the literal
meaning of a term and its technical meaning. The Encyclopedia of Islam (Donzel, 1997) is replete with
the etymology of terms; take, for example, Gabrieli's article on adab 'literature' and Haywood's on
kamus 'dictionary'. So it is with encyclopedias and dictionaries of jurisprudence, mysticism, and
criticism, examples of which are (Al-Asfahani, 1997), (Al-Sharqawi, 1987), and (Matloub, 2001).

In addition, some literary epochs, such as Jahiliya poetry, the Quranic text, and Omayyad and Abbassid
literature, have received extensive interest over the centuries. As a result, there are many
commentaries on classical books that delineate the lexical meanings of terms and the
socio-cultural contexts associated with them (e.g., (Ibn Hillizah, 1969) and (Al-Hutay'a, 1987)). These
studies would be invaluable to any etymological investigation of Classical Arabic.

Manuscripts are excellent sources for commentaries on lexical meanings; editors of old Arabic
manuscripts quite often explain the text by translating what terms used to mean in the historical
context of the manuscript (e.g., (Ibn Muqbil, 1962) and (Al-Zoabi, 1999)).

The exegesis authority Mujahid (d. 722), who was a disciple of the Prophet's cousin and companion
Ibn Abbas, attributed Quranic words such as Toor to Syriac, sijeel to Persian or Nabataean, and qisTas
to Greek. Well-known books that explore such words in the Quran are (Al-Suyouti, 1998) and (Al-
Jawaleeqi, 1998). Several books have investigated loan words in Arabic (Hebbo, 1970; Jeffery,
1938); a few focused on words from Persian (Asbaghi, 1988; Shir, 1990), Turkish (Al-Samarraí,
1997), Syriac (Barsoum, n.d.), Aramaic (Fränkel, 1962), and Latin (Koningsveld, 1976; Simonet,
1888), among others.

Closer to Arab vernaculars are the numerous compilations that offer a synchronic description of the
language of the common speaker. Although most are prescriptivist in nature, they do
document semantic change that came about as a result of variation in geographic location, time period,
sociocultural status, or language contact. Dictionaries that index vernacular terms and expressions in
the history of the Arabic language abound: (Al-Hareeri, 1996; Al-Jawaleeqi, 1995; Al-Zubaidi, 1995;
Ibn Al-Sakeet, 1987; Ibn Qutayba, 2001; Ibn Sayed Al-Batliyousi, 1996).

The library abounds with primary references that comment on the culture and social life of Arabs at
different historical intervals; such sources can be used to account for vocabulary and terms in common
use at specific periods. (Ibn Al-Washa, 1990) describes Arab social and cultural life in the 10th
century; (Ibn Batutah, 1994) made valuable observations about a large array of 14th-century cultures
and ways of life; and (Al-Jabarti, 1997; Jabarti, et al., 1993) give a detailed account of the social life
of Egypt in the late 18th and early 19th centuries.

Etymology can also benefit from the dictionaries that the Arabs have been writing since the seventh
century. These span almost the entire history of Arabic authorship.

Lexis
Lexis is a socially updated Web 2.0 dictionary. It indexes the words of the Arabic language by their
meaning and in the context of their etymological development. It is a collaborative, platform-independent,
social medium where lexicographers and other linguistic experts can define words and add a time
dimension to meanings.


Contributions can be made by specialists simply by uploading new texts to their own profiles.

New texts are automatically tagged with title, author, and year of first publication. After some filtering
and splitting of the corpus data, Lexis very quickly indexes every unique word in the volume and
produces a concordance. The contributor can then choose example sentences. Lexis in turn copies the
example sentences to the lexicographer's scrapbook and tags them with the name of the author, the title
of the reference, and the year of publication. It then displays the compiled example sentences for further
scrutiny by the lexicographer, in order to stimulate their thinking on the various senses that the
word might have taken.

A dictionary editor utility displays the semantic features of the root of the target word, gives the
distinctive semantic features of the morphological pattern that the target word is cast into, proposes a
sentence frame for defining the word, and offers the lexicographer the chance to select illustrative
examples of the word's meanings and senses. The compiled definition will constitute an entry in the
lexicographer's scrapbook.

Lexicographers may choose to maintain a profile where they store their own searches, create
definitions of their own, and trade these definitions with other community members and experts.

At the Lexis administrator level, volunteer and paid moderators select, modify, and approve
lexicographer-contributed entries. Their role is to control the quality of the resulting online
dictionary.

The general public may request definitions using stem, exact, and diacritic-insensitive searches and
can choose a date range for the meaning of a desired word.

Lexis has the following features:


• Filtering tools for multiple text formats and multiple encodings, handling CP1256 (MS Windows
Arabic) and Unicode, HTML parsing, plain text, doc, etc.
• An additional downloadable utility for splitting large files for inclusion in the corpus
• Ultra-high-speed indexing of large corpora into a mySQL database, with a PHP-driven engine and a
DHTML, DOM Level 2, and JavaScript interface
• A multi-user web-based interface with Administrator, Moderator, Contributor, and Guest modes
• Sharing of corpus texts among lexicographers
• Lexicographer profiles, which allow experts to find each other and communicate about their
data and contributions
• A scrapbook, which allows lexicographers to create named groupings of words in context; each
sentence is automatically tagged with information about author, title, and year of publication
• Template-driven entry creation to standardize the tasks; content contributors can then save entries
into their profiles and allow "friends" to search them
• An entry submission service, which groups similar entries together and distributes them
automatically amongst moderators
• Entry editing with advanced presentation capabilities, which permits moderators to
select some entries and discard others, and to combine information from entries submitted
by two or more lexicographers
• A personal dictionary facility, which enables lexicographers to search, share, and combine entries
with other experts


Technical Difficulties
Text Format

Arabic text may be represented in any of a number of encodings, the most important of which are the
Microsoft Arabic Windows character set (cp1256) and Unicode (UTF8, UTF16, UTF32). In addition,
text is often stored in formats other than raw text files. For example, formatted texts may come in the
form of HTML, XML, Microsoft Word documents, Rich Text Format (RTF) files, PostScript (PS),
and Portable Document Format (PDF).

Our first major goal was to be able to deal with automatically collected Internet texts, which
come in a large array of encodings and formats such as those mentioned above. Our Lexis
system was intended to be a socially updatable Arabic dictionary, so it would be fitting for it to
accept as many types of text-bearing files as possible. Specifically, it ought to be able to
decode and extract text from any file that contains Arabic.

We developed a stand-alone tool (Lexis Converter) to process offline lexicographer-contributed texts. It takes
as input most common formats of Arabic files, including MS Office documents, raw text, HTML,
XML, RTF, and even PDF. Lexicographers download the tool to their machines, drag and
drop their files and folders onto it, hit "Convert", and wait for it to output special "lex" files.
These generated lex files can subsequently be added to the index through the web-based
portal that we developed for Lexis.

In our quest for a modular solution to file formats, we discovered Microsoft’s iFilter which was
designed for use by the Microsoft Indexing Service on their Windows operating system. Since
Microsoft uses the Indexing Service to facilitate fast desktop searching, their Indexer must have been
built to solve the exact same problem that we had.

iFilter turned out to be the ideal solution because, like the Indexing Service, we too needed fast access
to the raw text inside files of various formats. If a user wanted to read files of a new
format, they would simply register the appropriate "filter" on their system. Many filters are available
for download from third parties and software vendors. Some filters will even extract text from a TIFF
image file by performing OCR on it.

Since the lexicographer-contributed files could be of literally any size, passing information
from iFilter into Lexis Converter had to be scalable and efficient. Amongst the challenges that we had
to overcome was the handling of some COM threading issues that arose when filtering a large
file. Some of the issues were caused by bugs within non-compliant filters (e.g., the Adobe
PDF filter, which crashed the computer because it was not closed). In one of our previous
implementations of the iFilter interface with the Lexis Converter, we had to extract the entire contents
of the file to be indexed, store it in memory, and then attempt to process it. Of course, this would work
for small files or with large memory, but the cumulative effect of several files processed in a row was
that it overtaxed the .NET garbage collector. This was because, when entire files were
loaded into memory, .NET used the Large Object Heap, which soon got messy.

Eventually, we implemented a FilterReader that buffered the extracted data in memory in small
quantities at a time. This approach was more efficient and made it faster and more manageable
to retrieve the text input, instead of occupying the machine's entire memory with the content of a large
file.
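
The actual FilterReader is a .NET component inside Lexis Converter; purely to illustrate the buffering idea, and in PHP only for consistency with the server-side sketches later in this paper, chunked reading looks roughly like the following (the buffer size and callback are illustrative assumptions, not the converter's actual parameters):

    <?php
    // Illustration only: read extracted text in small buffers rather than loading
    // the whole file into memory (the real FilterReader is a .NET class).
    function process_in_chunks(string $path, callable $handleChunk, int $bufferSize = 65536): void {
        $fh = fopen($path, 'rb');
        if ($fh === false) {
            throw new RuntimeException("Cannot open $path");
        }
        while (!feof($fh)) {
            $chunk = fread($fh, $bufferSize);   // only ~64 KB resident at a time
            if ($chunk !== false && $chunk !== '') {
                $handleChunk($chunk);           // e.g., pass on to the word-breaker
            }
        }
        fclose($fh);
    }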

There were more problems with the filters themselves, though. Some filters were marked as COM
Single-Threaded Apartment (STA), others as Multithreaded Apartment (MTA), and yet another set
indicated that they were both. To avoid any problem that might arise, we chose a technique
that bypassed standard COM instantiation and implemented a custom instantiation of the filter.
Since we were not restricted by the existing set of rules surrounding COM, we were able to resolve the
PDF filter problem. If we had followed the normal practice of using STA or MTA instantiation of the
filter, the PDF filter's DLL would not have unloaded, thus causing the machine to crash.

Once we had overcome these challenges, we could finally use iFilter to read any file format, as long as
the format was registered on the client's system. The interface into filters allows us to extract raw text
in chunks and to prepare it for indexing.

Word-Breaking

Languages differ in how they terminate words. WordBreaker is a component of a Windows system
library that is used to read and index files in order to facilitate fast search on the user's desktop. It has a
modular approach that can be customized by third-party developers to give a new definition of a word,
so that the system knows when it has come across a character that terminates a word. For example,
English, like many other languages, uses white space to terminate words, but many punctuation marks,
such as commas, periods, and question marks, can also indicate the end of an English word.

WordBreaker also has interesting stemming features that are especially useful for full-text
searching.

Remember, when users search their Windows desktop for a file containing a particular word, they
do not necessarily want only words that match the search term exactly; they would also be interested in
finding documents that contain words with the same stem.

Since an interface for an Arabic WordBreaker and an Arabic stemmer was provided, we hoped that we
could use it as a shortcut to allow search-by-stem functionality in the concordance part of the
program. Unfortunately, much of this technology for Arabic is still at a beta stage and has been this
way for some time. After exploring it, we found that it was inadequate for the Lexis Converter
application.

Instead, we used the generic WordBreaker, which is normally used by default when the language of
the document is unrecognized or unhandled. The default WordBreaker provides an efficient way to
break input text at white space, punctuation, numbers, etc. This was good enough for our purposes
because we planned to use our own custom stemming routines later.

Along with iFilter, we developed a bridge to connect iFilter and WordBreaker so that the input would
be any file format that contained text, and the output would be a stream of single-word chunks. For
this, we developed yet another C# interface, which streamed word chunks directly into a cache ready
for output.

Finally, the Lexis Converter application writes these streams of words into special lex files encoded
in UTF-8.
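
The converter itself is a .NET application; as a rough sketch of the generic word-breaking step, written in PHP for consistency with the other examples in this paper (the delimiter set below is an assumption, not the actual WordBreaker rules):

    <?php
    // Sketch: split raw text at white space and common punctuation, then write one
    // UTF-8 word chunk per line into a "lex"-style file. Illustrative only.
    function break_into_words(string $text): array {
        $chunks = preg_split('/[\s,;:.!?()\[\]«»"]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        return $chunks === false ? [] : $chunks;
    }

    function write_lex_file(string $text, string $outPath): void {
        $words = break_into_words($text);
        file_put_contents($outPath, implode("\n", $words) . "\n");   // UTF-8 output
    }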

Indexing

When indexing, one would normally store a record for each search term within large
texts and provide references from each word record back to the original file. Specifically, one
position reference is stored for each word to facilitate searching. A search application would take a
search term, seek it in the index, pull out its record, use the file reference to find the original file
that the term occurred in, or use the position reference to locate the exact place where that term appeared in
the original file.


Search engines need only a file reference in the index, since they return only the set of documents
in which the search word was found.

In concordance applications, we need to extract the actual context that demonstrates where and how
a word was used in an original text. For such an application, both a file reference and a position
reference are needed. Each record allows the application to pinpoint the exact place where the word
appeared and in which file. This is necessary because the user wants to see the actual
context their search term appears in.

Cognizant of all that, we removed file dependency entirely by storing the text and the index on the
remote mySQL server in a compact and efficient way.

Once Lexis Converter's word-breaking has pre-processed a target text, we have a sequential stream
of individual words. We store a version of the original word, together with any punctuation that might
have been treated as part of its word chunk, and call this version of the word "ExactWord". This
becomes the first column of the Index table, after the identifier.

Because we wanted to enable a less strict search that requires no diacritics and that is tolerant of
common misspellings, we built, at the pre-processing phase, a version of the word that we labeled
"FilteredWord". This became the second column of the Index table. We use the FilteredWord
to create queries for basic searching later.

Inconsistency of spelling among Arabic users poses a formidable problem for text processing and
specifically for text indexing (Habash, et al., 2007). A word spelt in two or three different ways would
yield different entries in an index. Examples of such inconsistencies are (Buckwalter, 2007):
o Confusion between word-final 'ha' and 'ta-marboota'
o Confusion between word-final 'ya' and 'alef-maqsoora'
o Confusion between hamza and alef forms
o Confusion between hamza on 'waw' and 'waw' without hamza
o Confusion between hamza on 'ya' and 'ya' without hamza

We created the FilteredWord as a solution to this inconsistency in Arabic writing practices. It is a
version of the ExactWord, translated into a standardized spelling and stripped of diacritics. If Lexis
came across an ExactWord of the form 'taqwaY,', it would insert it into the FilteredWord column,
without diacritics or the trailing comma, as 'tqwY'.

Later, when a lexicographer searches and wants less strict matching, if they typed any of 'taqway',
'taqwaY', or 'tqwaY', their search word would first be standardized to 'tqwY', and it would then match
the FilteredWord column for the originally indexed ExactWord 'taqwaY'.

In a similar fashion, the word 'firqap' would be stored with the diacritics 'i' and 'a' removed
and with the ta-marboota (p) replaced by a word-final ha (h), as 'frqh'. The FilteredWord would then be
'frqh' and the ExactWord 'firqap'.
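
A sketch of how such a FilteredWord can be derived from an ExactWord, using the Buckwalter-style transliteration of the examples above; the exact normalization map that Lexis applies is not spelled out in this paper, so the mappings below are illustrative assumptions:

    <?php
    // Sketch: derive a FilteredWord from an ExactWord (Buckwalter-style
    // transliteration). The mappings are illustrative, not the actual Lexis rules.
    function filtered_word(string $exactWord): string {
        // Drop punctuation that travelled with the word chunk (e.g. 'taqwaY,')
        $w = str_replace([',', '.', ';', ':', '!', '?', '"', '(', ')'], '', $exactWord);
        // Remove short-vowel diacritics, sukun and shadda (a, i, u, o, ~)
        $w = str_replace(['a', 'i', 'u', 'o', '~'], '', $w);
        // Standardize commonly confused letters
        return strtr($w, [
            'p' => 'h',                          // ta-marboota -> word-final ha
            '|' => 'A', '>' => 'A', '<' => 'A',  // alef/hamza variants -> plain alef
            '&' => 'w',                          // hamza on waw -> waw
            '}' => 'y',                          // hamza on ya  -> ya
        ]);
    }

    // filtered_word('taqwaY,') === 'tqwY'
    // filtered_word('firqap')  === 'frqh'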

A third column in the mySQL table stores a serial reference to each word chunk; this column is called
Pos. In the next section, we explain in detail how this facilitates the concordance and helps avoid
storing the original text.

Each indexed term is also associated with a FileIdentifier column. This identifier refers to a File Index
table that holds details of the file that the search term was extracted from, including the book's title,
author, and year of publication.
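
Putting the columns described above together, the two tables can be sketched roughly as follows; only ExactWord, FilteredWord, Pos, FileIdentifier, and the file metadata come from the description, while the identifier columns, types, and sizes are assumptions:

    <?php
    // Sketch of the index tables implied by the description above. Column types,
    // sizes, and the identifier/index definitions are assumptions.
    $schema = "
    CREATE TABLE FileIndex (
        FileIdentifier INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        Title          VARCHAR(255),
        Author         VARCHAR(255),
        YearPublished  SMALLINT
    ) DEFAULT CHARSET=utf8;

    CREATE TABLE WordIndex (
        Id             BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        ExactWord      VARCHAR(64),
        FilteredWord   VARCHAR(64),
        Pos            INT UNSIGNED,    -- serial position of the word chunk
        FileIdentifier INT UNSIGNED,    -- refers to FileIndex
        INDEX (FilteredWord),
        INDEX (FileIdentifier, Pos)
    ) DEFAULT CHARSET=utf8;
    ";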


Speed

Since this is a web-based application, speed is critical. Most servers place size restrictions on
uploads. In a corpus-based dictionary, gigabytes of text are indexed and searched. How can that be
done through a web-based application and within the confines of file size restrictions?

As mentioned earlier, we built the stand-alone Lexis Converter. This application allows
lexicographers to pre-process their files for indexing. The tool takes a wide range of Arabic
formatted texts and converts them to a standard UTF-8 encoding.

File output is restricted to 2 MB segments, which are grouped into one specific folder. This means
that even a large book of 100 MB will be split into 50 files of 2 MB each, placed
together in a single directory that corresponds to the original file.
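
A sketch of this splitting step, assuming the converter works from the word stream produced earlier; the segment naming and the exact handling of the 2 MB cut-off are illustrative assumptions:

    <?php
    // Sketch: split a converted UTF-8 word stream into ~2 MB lex segments, one
    // directory per source file. Splitting happens only at word boundaries.
    function split_into_segments(array $words, string $outDir, int $maxBytes = 2 * 1024 * 1024): void {
        if (!is_dir($outDir)) {
            mkdir($outDir, 0777, true);
        }
        $segment = 1;
        $buffer  = '';
        foreach ($words as $word) {
            $line = $word . "\n";
            if ($buffer !== '' && strlen($buffer) + strlen($line) > $maxBytes) {
                file_put_contents(sprintf('%s/part%03d.lex', $outDir, $segment++), $buffer);
                $buffer = '';
            }
            $buffer .= $line;
        }
        if ($buffer !== '') {
            file_put_contents(sprintf('%s/part%03d.lex', $outDir, $segment), $buffer);
        }
    }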

The reason for this is to prevent lexicographers from uploading files larger than 2 MB to their personal
corpus. We did not want to burden the web server, which would potentially be shared amongst
thousands of contributors, nor did we want to test the patience of users with extended waiting times
when uploading massive files of 100 MB or so.

Users can upload a maximum of 10 files at a time, each of at most 2 MB. This is less of a strain on our
server. The input files are uploaded not to be stored but only to be read; once they have been consumed
and inserted into the Index database, they are deleted from the server.

An advantage of pre-processing the data is that the output files have their entries already formatted in
mySQL syntax, so the server does not need to do any extra work. The server-side indexing script simply
reads in each file and immediately inserts the index entries into the database without further processing.
Speed of indexing improves tremendously as a result and gives all users a positive experience.
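
A minimal sketch of such a server-side indexing script, assuming each uploaded lex segment holds one ready-made mySQL statement per line; the connection details, directory, and file layout are assumptions:

    <?php
    // Sketch: stream pre-formatted statements from each uploaded lex segment into
    // the database, then delete the segment (uploads are read, not stored).
    $db = new mysqli('localhost', 'lexis', 'secret', 'lexis');
    $db->set_charset('utf8');

    foreach (glob('/tmp/uploads/*.lex') as $segment) {
        foreach (file($segment, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $statement) {
            $db->query($statement);   // e.g. an INSERT INTO WordIndex (...) VALUES (...)
        }
        unlink($segment);
    }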

Searching

Contributors to Lexis might be based in many different locations across the globe. Lexis assigns each
of them an ID and a password with which they can make concordance searches of
the main database. They can then use the results to build their own dictionary entries.

Since the index will certainly become enormous, search efficiency is just as crucial as indexing
efficiency.

When a lexicographer searches the corpus, they have the choice of "standard" matching (as explained
earlier), exact matching, stem matching, or root matching.

Depending on which version of the search the lexicographer runs, Lexis converts the input word into
the appropriate form for the chosen type of search. This means that we also have to build the mySQL
query so that it searches the column that corresponds to the lexicographer's search preference.

For example, if the lexicographer wants to search by root, Lexis first finds the root of the input word
and then queries the index's "WordRoot" column to find all contexts in which that root is used in
the corpus.
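
A sketch of how the query might be built according to the chosen search mode; normalise(), stem_of(), and root_of() stand in for Lexis routines that this paper does not detail, and the WordStem column is an assumption (the text only names ExactWord, FilteredWord, and WordRoot):

    <?php
    // Sketch: choose the column and the converted search value by mode, then
    // build the concordance query. Placeholder stubs keep the sketch runnable.
    function normalise(string $w): string { return $w; }   // stand-in for FilteredWord rules
    function stem_of(string $w): string   { return $w; }   // stand-in for the stemmer
    function root_of(string $w): string   { return $w; }   // stand-in for the root extractor

    function build_query(mysqli $db, string $term, string $mode): string {
        switch ($mode) {
            case 'exact': $column = 'ExactWord';    $value = $term;            break;
            case 'stem':  $column = 'WordStem';     $value = stem_of($term);   break;
            case 'root':  $column = 'WordRoot';     $value = root_of($term);   break;
            default:      $column = 'FilteredWord'; $value = normalise($term); break;
        }
        return sprintf(
            "SELECT Id, Pos, FileIdentifier FROM WordIndex WHERE %s = '%s'",
            $column,
            $db->real_escape_string($value)
        );
    }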

But even filtering by word can return results too numerous to be usable. A lexicographer may
not want to sift through 10,000 examples of their search word in context. Instead, they may prefer to
filter the concordance results by identifying, for investigation, one or more files from their corpus
that they want to see a concordance for.

For this, we allow the lexicographer to click all the files they might want to search in; Lexis will then
only search files that match the chosen FileIdentifier(s). When there are still too
many results, we use a paging feature to break the returned concordance down to a manageable
size. A lexicographer may choose to browse only 20 results per screen, or 100 if they prefer.
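
Continuing the sketch, restricting the search to the ticked files and paging the result might look like this; the page-numbering convention and the default page size are assumptions:

    <?php
    // Sketch: limit a concordance query to selected files and one page of results.
    // $filtered is assumed to be already normalized and escaped.
    function paged_query(string $filtered, array $fileIds, int $page, int $perPage = 20): string {
        $ids    = implode(',', array_map('intval', $fileIds));
        $offset = ($page - 1) * $perPage;
        return "SELECT Id, Pos, FileIdentifier
                FROM WordIndex
                WHERE FilteredWord = '$filtered'
                  AND FileIdentifier IN ($ids)
                ORDER BY FilteredWord, FileIdentifier, Pos
                LIMIT $perPage OFFSET $offset";
    }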

The challenge, however, is not so much in returning the rows that contain the search term; a simple
mySQL query takes care of that. But how do we return the word in its original context? Remember, we
no longer have the original texts. The index consists of single-word entries and serial numbers that
indicate their respective sequence in the original file. Using a few carefully built mySQL queries, we
can reconstruct any word in its original context simply by stringing together the words from entries
with sequential serial numbers.

So we take the word whose context is in question and get its serial number from the Pos column. This
tells us its position in the original text relative to the other words. Let us call this position p0.

Lexicographers can indicate how many words before and after the word in question they
would like to retrieve from the original context. Suppose a lexicographer chooses 10 words. We retrieve the
context of the word by getting all entries in the same file that have the following positions:

p-10, p-9, p-8, p-7, p-6, p-5, p-4, p-3, p-2, p-1, p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10

We could use the same approach to retrieve the entire text of the original file if we so wished. We repeat
this for every entry matching the search criteria, in alphabetical order, in the tradition of
concordances. This strategy is the core of the concordance and is lightning fast in retrieving
word contexts from the corpus index. The index is also independent of files and file paths, and it is
scalable to tremendous amounts of data and numerous concurrent users and contributors.
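
A sketch of that reconstruction step, under the assumptions already stated for the index tables: fetch every word whose Pos lies within the chosen span of the hit, in the same file, and join the ExactWords back into a concordance line:

    <?php
    // Sketch: rebuild the context of a hit at position $p0 in file $fileId by
    // selecting the surrounding word chunks from the index alone.
    function context_of(mysqli $db, int $fileId, int $p0, int $span = 10): string {
        $stmt = $db->prepare(
            "SELECT ExactWord FROM WordIndex
             WHERE FileIdentifier = ? AND Pos BETWEEN ? AND ?
             ORDER BY Pos"
        );
        $lo = $p0 - $span;
        $hi = $p0 + $span;
        $stmt->bind_param('iii', $fileId, $lo, $hi);
        $stmt->execute();
        $result = $stmt->get_result();
        $words = [];
        while ($row = $result->fetch_assoc()) {
            $words[] = $row['ExactWord'];
        }
        return implode(' ', $words);   // the search word sits at position p0
    }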

Dictionary Building

We recognize the need for dictionary contributors to do a fair amount of research before producing
a dictionary entry, so we designed the scrapbook. The scrapbook is a kind of work area where a
contributor can gather their favorite illustrative examples of the entry from different contexts. The
lexicographer might start by reading through a concordance list of sentences that illustrate how
a word was used in a specific period of time (often one-hundred-year intervals). As they read
through this concordance list, they would tick interesting sentences for further scrutiny, and Lexis would
automatically and discreetly copy these selected sentences to the scrapbook. Scrapbook examples are
thus derived from concordance lines that the lexicographer thought they might want to include in their
definition.

Every time the contributing lexicographer sees a word context in the concordance that they
want to use in constructing definitions, they may click that row and Lexis will append it to the
scrapbook. Scrapbook information is stored in another mySQL table, which retains a copy of the chosen
context independently of the original index, in case that file is later deleted. Taking a copy of the
sentence and all other information into the scrapbook is crucial: in later versions, we
anticipate that lexicographers might want to delete unshared texts that they had previously added.

A scrapbook is essentially the property of the contributor, even if it has been derived from someone
else's shared text. The lexicographer may, for example, completely alter their extracted
sentences.


Each scrapbook item, however, also retains a copy of the title, the author, and the year of publication
of the book from which the illustrative example of the word's context was originally derived.

The contributor can also add research notes to the scrapbook. For these reasons, we decided that it would be
best to remove any dependency between scrapbooks and the original corpus texts and to allow them to
exist as separate entities.
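
A possible shape for that scrapbook table, inferred from the description above; all names and types here are assumptions:

    <?php
    // Sketch: each scrapbook item keeps its own copy of the context sentence plus
    // the source's title, author and year, so it survives deletion of the source.
    $scrapbookSchema = "
    CREATE TABLE Scrapbook (
        Id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        OwnerId       INT UNSIGNED,   -- the contributing lexicographer
        GroupName     VARCHAR(128),   -- named grouping of words in context
        Context       TEXT,           -- copied (and possibly edited) sentence
        SourceTitle   VARCHAR(255),
        SourceAuthor  VARCHAR(255),
        SourceYear    SMALLINT,
        Notes         TEXT            -- the contributor's research notes
    ) DEFAULT CHARSET=utf8;
    ";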

Scrapbooks are also brilliant collaboration tools. One lexicographer may share portions of their
scrapbook with another researcher and discuss their research together.

Once the lexicographer has gathered enough information in their scrapbook to begin writing a
dictionary entry or definition, they can simply insert into the entry any of the scrapbook contexts
they had originally collected and edited.

Guest visitors to the Lexis dictionary project can search for words and find high-quality, well-
researched, peer-defined dictionary entries.

Conclusion
In this paper, we presented a computer tool designed and implemented to help lexicographers. We
discussed some of the issues and challenges related to this tool and how they were handled.

We plan to implement far more features in the Lexis Dictionary Builder than are currently available. We
plan to store the semantic components of each root in the language, the semantic components
of every morphological pattern, and sentence frames for the definition of all grammatical
categories in all semantic domains in the language. We will then merge the two sets of semantic components
to match the morphological-pattern-molded root that the target word exhibits. Lexis would then
automatically generate sentence frames that the lexicographer can use to start the defining process.
These sentence frames would ensure that all the essential semantic content of the target word is
brought to the fore.

References

1. Academy, Arabic Language. (1972). Al-Mu'jam al-Waseet (The Middle Dictionary) (2nd ed.).
Cairo: Arabic Language Academy.

2. Academy, Arabic Language. (2003). Engazat "Accomplishments" [Electronic Version].


Retrieved 20/2/2006 from http://www.arabicacademy.org.eg/engazat.asp.

3. Al-Asfahani, Al-Raghib. (1997). M’ajm Mufrdat ’Alfaz Alquraan Alkrym (Glossary of the
Glorious Qur'an). Beirut: Dar Al-Kotob Al-'Ilmiyah.

4. Al-Hareeri, Abu Al-Qasim Ben Ali ben Mohammed Ben Othman Al-Basri. (1996). Durrat Al-
ghawas fi Awham Al-Khawas. Beirut: Dar Al-Jeel.

5. Al-Hutay'a, Jirwal ibn Aws ibn Malik al-'Absi (d.665). (1987). Diwan al-Hutay'a. Cairo:
Maktabat al-Khanji.

6. Al-Jabarti, Abd al-Rahman. (1997). 'Ajayb Al-Athar fi Al-Tarajim wa Al-Akhbar (Marvels of


Actions in Biographies and Bulletins). Beirut: Dar Al-Kutub Al-'Ilmiyah.


7. Al-Jawaleeqi, Abu Mansour Mawhoub Ben Ahmed (d.1145). (1995). Sharh Adab Al-Katib.
Kuwait: Kuwait University.

8. Al-Jawaleeqi, Abu Mansour Mohammed (d.1145). (1998). Al-Mu'rab min Al-Kalaam Al-
A'jami 'ala Hurouf Al-Mu'jam (A Dictionary-Index of Arabized Foreign Speech). Beirut: Dar
Al-Kotob Al-'Ilmiyah.

9. al-Samarraí, Ibrahim. (1997). Turkish and Persian Loanwords in Arabic and Vice Versa.
Beirut: Librairie du Liban.

10. Al-Sharqawi, Hassan. (1987). Mu'jam 'Alfaz Al-Sufiyah. Cairo: Muassasat Al-Mukhtar.

11. Al-Suyouti, Jalal Al-Deen (d. 1505). (1998). Al-Muhathab fima Waqa'a fi Al-Qur'an min Al-
Mu'arrab (Treatise in What Occurred in the Qur'an of Arabized Words). Beirut: Dar Al-Kotob
Al-'Ilmiyah.

12. Al-Zoabi, Mohammed. (1999). Mo'jam al-Abniya al-Hadhariya fi al-Shi'r al-Jahili. al-Lisan
al-'Arabi, 48(2), 105-138.

13. Al-Zubaidi, Abu Bakr Mohammed Ben Al-Hassan (d. 989). (1995). Al-Ziyadat 'Ala Kitab
Islah Lahn Al-'Ammah bi Al-Andalus. Dubai: Markaz Jum'a Al-Majid li Al-Thaqafati wa Al-
Turat.

14. Asbaghi, Asya. (1988). Persische Lehnwörter im Arabischen. Wiesbaden: O. Harrassowitz.

15. Barsoum, Patriarch Afram Al-Awal. (n.d.). Al-Alfaz Al-Siryaniyah fi Al-Ma'ajim
Al-'Arabiya (Syriac Words in Arabic Dictionaries). Damascus: Matba'at Al-Taraqqi.

16. Dhayf, Shawqi. (1984). Majma' al-Lugha al-Arabiya fi Khamseen Aman (The Arabic
Language Academy in Fifty Years). Cairo: Academy of the Arabic Language.

17. Donzel, E. van, B. Lewis, and Ch. Pellat (ed.). (1997). Encyclopedia of Islam, New Edition,
3rd Impression. Leiden: E.J. Brill.

18. Dozy, Reinhart Pieter Anne (d. 1883). (1968). Supplément aux dictionnaires arabes. Beyrouth:
Librairie du Liban.

19. Fränkel, Siegmund. (1962). Die aramäischen Fremdwörter im Arabischen. Hildesheim: G.


Olms.

20. Gätje, H. (1985). Arabische Lexikographie. Historiographia Linguistica, 12, 105-147.

21. Hamzaoui, Mohammed Rached. (1989). Introduction: A Historical Arabic Dictionary Issues
and Techniques. Revue de la Lexicologie, 5 & 6, 11-28.

22. Haywood, John A. (1965). Arabic Lexicography: Its History, and Its Place in the General
History of Lexicography. Leiden, the Netherlands: E.J. Brill.

23. Hebbo, Ahmed Irhayem. (1970). Die Fremdwörter in der arabischen Prophetenbiographie
des Ibn Hischam (gest. 218/834). Doctoral dissertation, University of Heidelberg.

24. Ibn Al-Sakeet, Abu Yusuf Yaqoub Ben Ishaq. (1987). Islah Al-Mantiq. Cairo: Dar Al-Ma'arif.


25. Ibn Al-Washa, Abuttayeb Mohammed Ibn Ishaq (d.936). (1990). Al-Muwasha: Al-Zurf wa Al-
Zurafaa' (Adorned: Wit and Witty). Beirut: Dar Sadir.

26. Ibn Batutah, Muhammad Ibn Abdallah (d.) and Sir Hamilton Gibb. (1994). The Travels of Ibn
Battuta, A.D. 1325-1354. London: Hakluyt Society.

27. Ibn Hillizah, Hārith (d.645). (1969). Diwan al-Harith ibn Hillizah. Baghdad: Matba'at al-
Irshad.

28. Ibn Muqbil, Tamīm ibn Ubayy (d.657). (1962). Diwan Ibn Muqbil. Damascus: Ministry of
Culture & National Guidance.

29. Ibn Qutayba, Abu Mohammed Abdullah (d. 889). (2001). Adab Al-Katib. Beirut: Dar Al-Jeel.

30. Ibn Sayed Al-Batliyousi, Abu Mohammed. (1996). Al-Iqtidhab fi Sharh Adab Al-Kitab. Cairo:
Dar Al-Kutub Al-Misriya.

31. Jabarti, 'Abd al-Rahman; Louis Antoine Fauvelet de Bourrienne; Edward W. Said. (1993).
Napoleon in Egypt: Al-Jabartî's Chronicle of the First Seven Months of the French Occupation,
1798. Princeton: M. Wiener.

32. Jeffery, Arthur. (1938). The Foreign Vocabulary of the Qur'an. Baroda: Oriental Institute.

33. Koningsveld, P.S. van. (1976). The Latin-Arabic Glossary of the Leiden University Library.
Leiden University Library.

34. Matloub, Ahmed. (2001). Mu'jam Mustalahat Al-Naqd Al-Arabi Al-Qadeem (Dictionary of
Classical Critical Arabic Terminology), 2nd ed. Beirut: Librairie du Liban.

35. Omar, Ahmed Mokhtar. (1998). Sinaát Al-Mu'jam Al-Hadith. Cairo: Alam Alkotob.

36. Sawaie, Mohammed (1990). An Aspect of 19th Century Arabic Lexicography: The
Modernizing Role and Contribution of Faris al-Shidyaq. In H.-J. Niederehe & K. Koerner (Eds.),
History and Historiography of Linguistics: Papers from the Fourth International Conference
on the History of the Language Sciences (pp. 157-171). Amsterdam: Benjamins.

37. Shir, Al-Sayyid Addi (d.). (1990). Mu'jam Al-Alfaz Al-Farisiyah Al-Mu'arrabah (A
Dictionary of Persian Loan Words in the Arabic Language). Lebanon: Librairie du Liban.

38. Simonet, Francisco Javier. (1888). Glosario de voces ibéricas y latinas usadas entre los
mozárabes. Madrid: Est. tip. de Fortanet.

39. Habash, Nizar, Abdelhadi Soudi, Timothy Buckwalter. (2007). On Arabic Transliteration. In
A. Soudi, A. van den Bosch and G. Neumann (Eds.), Arabic Computational Morphology, (pp.
15-22). The Netherlands: Springer.

40. Buckwalter, Timothy. (2007). Issues in Arabic Morphological Analysis. In A. Soudi, A. van
den Bosch and G. Neumann (Eds.), Arabic Computational Morphology, (pp. 23-41). The
Netherlands: Springer.

41. Readence, John E., Bean, Thomas W., and Baldwin, R. Scot. (2004). Content-Area Literacy:
An Integrated Approach, 8th ed. Dubuque, Iowa: Kendall/Hunt.


42. Wittgenstein, Ludwig. (2009). Philosophical Investigations, 4th Ed. (P.M.S. Hacker and
Joachim Schulte, Eds.). Chichester, Sussex: Wiley-Blackwell.


