Unicode - ARABIC SCRIPT TUTORIAL

ARABIC SCRIPT TUTORIAL
Thomas Milo - DecoType
1 INTRODUCTION
Like all multi-lingual computing, Arabic computing is now firmly in the domain of
Unicode. Unicode is an industrial protocol with the status of international agreement. It
is designed to encode the elements of all known script systems in such a way that they
become interchangeable between programs and operating systems. Its implementation is
well underway.
Unicode eliminates the need to tamper with fonts to get special characters, but it is not a
font. For legible text on screen and paper, Unicode depends on compatible fonts with the
required characters, where necessary with additional dedicated font technology.
2 THE ARABIC ALPHABET
a. the primary character inventory

The Arabic alphabet is related to the Latin alphabet, as can be seen from its historical sorting
order A/ALEF, B/BEH, C/JEEM, D/DAL:
Its modern sorting order is on the basis of similarity of the letters:
The modern morphological order can be broken down as follows:

Historical initial letter ALEF
Similar letters b t ṯ
Similar letters ǧ ḥ ḫ
Similar letters d ḏ r z
Similar letters s š ṣ ḍ ṭ ẓ
Similar letters ʿ ġ f q
Historical group k l m n
Rest h w y
b. derived primary characters

There are a number of letters, mostly skeleton-cum-mark combinations, that do not have
independent status in orthography or sorting order:
The hamza diacritic and its various supporting letters
Morphophonologic use of YEH
Morphophonologic use of HEH
29th Internationalization and Unicode Conference March 2006, San Francisco, CA

c. the secondary character inventory

Arabic spelling is not fully alphabetic: only short consonants and long vowels are written
with the primary character set. For elaborate spelling or casual disambiguation, a set of
secondary characters exists. They are written above or below a primary character, e.g.:
U+064E FATHA to mark the vowel /a/ ◌َ
U+064F DAMMA to mark the vowel /u/ ◌ُ
U+0650 KASRA to mark the vowel /i/ ◌ِ
e.g.: kitābi
Traditionally, a repetition of the vowel marks is used at the end of a word to indicate that
the indefinite article /-n/ is attached to the vowel:
FATHA - FATHA to indicate /a-n/ ◌ََ
DAMMA - DAMMA to indicate /u-n/ ◌ُُ
KASRA - KASRA to indicate /i-n/ ◌ِِ
e.g.: kitābi-n
Unicode encodes each repeated vowel mark as a separate character in its own right. This is a
legacy from the metal typesetting era, when it was impossible to compose such minute
superscript or subscript groups:
U+064B FATHATAN to mark the vowel /a/ + n ◌ً
U+064C DAMMATAN to mark the vowel /u/ + n ◌ٌ
U+064D KASRATAN to mark the vowel /i/ + n ◌ٍ
NOTA BENE: the ending –TAN, added to the original name, means “twice”.
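The relation between the single and the doubled marks can be inspected with a few lines of Python. This is only an illustrative sketch using the standard `unicodedata` module; the variable names are not taken from any existing tool.

```python
import unicodedata

# Single vowel marks and their "doubled" counterparts as encoded in Unicode.
SINGLE = ["\u064E", "\u064F", "\u0650"]   # FATHA, DAMMA, KASRA
DOUBLED = ["\u064B", "\u064C", "\u064D"]  # FATHATAN, DAMMATAN, KASRATAN

for single, doubled in zip(SINGLE, DOUBLED):
    print(f"U+{ord(single):04X} {unicodedata.name(single)}  ->  "
          f"U+{ord(doubled):04X} {unicodedata.name(doubled)}")

# Two FATHA marks typed in a row are different text from one FATHATAN,
# and canonical normalization does not merge them.
two_fathas = "\u064E\u064E"
fathatan = "\u064B"
merged_by_nfc = unicodedata.normalize("NFC", two_fathas) == fathatan
print("merged by NFC:", merged_by_nfc)
```

In other words, a text that spells the ending with two separate marks will not match a search for the single doubled character unless the application folds the two spellings itself.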

d. direction of writing
M → H → M → D
Arabic script runs from RIGHT to LEFT:
D ← M ← H ← M
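In encoded text the letters are stored in the order in which they are written and read, first letter first; the right-to-left arrangement is produced at display time from the directional class of each character. A small sketch with Python's standard `unicodedata` module (the variable names are illustrative only):

```python
import unicodedata

# MEEM, HAH, MEEM, DAL: the letters M-H-M-D in storage (typing) order.
word = "\u0645\u062D\u0645\u062F"

for index, letter in enumerate(word):
    # Index 0 is the first letter written, i.e. the rightmost one on screen.
    print(index, unicodedata.name(letter), unicodedata.bidirectional(letter))

# Every Arabic letter carries the bidirectional class "AL" (Arabic Letter),
# whereas a Latin letter such as "M" carries "L" (Left-to-Right).
bidi_classes = {unicodedata.bidirectional(letter) for letter in word}
print(bidi_classes, unicodedata.bidirectional("M"))
```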
e. Letter group formation

Efficient, streamlined connections assimilate letters into continuous groups to form
words. Assimilation frequently takes the form of mergers. The merger of some letter
groups can be so strong that letters lose their individual characteristics and instead
contribute a distinctive feature to a kind of ideograph. In other words, the writing system
becomes almost synthetic in nature, although it evolved from an analytic alphabetical
structure:
MHMD
(pronounced: muḥammad)
For technical and pedagogical reasons, there is a strong tendency to eliminate or simplify
the connectivity of Arabic script; still even the simplest fonts maintain a minimal degree
of connection between letters. This approach removes from Arabic script its synthetic,
ideographic quality and turns it back into the analytic alphabet from which it evolved:
MHMD

3 CONVENTIONAL ANALYSIS OF ARABIC SCRIPT
Most Arabic letters consist of a skeleton, e.g. a curve, and a marker:
Markers have a distinctly graphemic function. They combine with various skeletons to
form other letters, e.g. the dot-above is used by eight Arabic letters:
In the conventional analysis, some skeletons have no independent meaning, e.g.:
Other unmarked skeletons by themselves are already meaningful letters that differ from
the ones characterized by a marker, e.g.:
pros and cons of the conventional analysis

Pro: Treating the combination of skeleton and marker as a single letter has the advantage that:
- IT MEETS THE EXPECTATION OF USERS;
- IT CONFORMS TO CONVENTIONAL AND LEGACY ENCODING.
Con: For scholarly work, the merger of skeleton and marker denies the evolutionary
stages of the script, where the use of markers was casual, in a way similar to the use of
vowels. Therefore, modern industrial encoding as inherited by Unicode has the
disadvantage that:
- IT MISREPRESENTS HISTORICAL USAGE
- IT DISRUPTS INTERNET SEARCHES BY MISMATCHING IDENTICAL GRAPHEMES
In manuscripts and even in older prints, markers are often incomplete or unreliable:
- because markers were secondary, often redundant elements;
- because markers were added later to interpret or eliminate ambiguities;
- because double markers sometimes co-exist to maintain original ambivalence.

4 ARCHIGRAPHEMES
A complete and unambiguous element of script is called a grapheme. Without markers,
most skeletons become multi-interpretable, e.g. all these words share the same skeleton
elements:
transcription and meaning Shape
ʿabdu “servant”
ʿīd “feast”
ʿinda “by, near” (preposition)
ġayad “female tenderness”
In historical texts any one of them can look like this:
Transliteration Shape
EBD
(capitals are used to represent
indeterminate graphemes)
In this kind of spelling the skeletons are not “defective” graphemes, but valid
archigraphemes. An archigrapheme is the common element(s) between two or more
graphemes, minus the marker(s) that disambiguate them. The majority of historic texts
are written with archigraphemes.
Unicode does not – yet – have the data structure to deal with archigraphemes and
discrete markers as meaningful text elements.
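The reduction of graphemes to archigraphemes can be sketched in Python. The table below is a hypothetical, deliberately tiny mapping that covers only the letters of the four example words; it treats YEH and NOON as members of the BEH class, which holds in non-final position only. The capital letters follow the transliteration used above.

```python
import unicodedata

# Hypothetical skeleton classes, limited to the letters of the example words.
ARCHIGRAPHEME = {
    "\u0639": "E",  # AIN
    "\u063A": "E",  # GHAIN
    "\u0628": "B",  # BEH
    "\u064A": "B",  # YEH (non-final position only)
    "\u0646": "B",  # NOON (non-final position only)
    "\u062F": "D",  # DAL
    "\u0630": "D",  # THAL
}

def skeleton(word):
    """Drop combining marks, then replace each letter by its skeleton class."""
    letters = [ch for ch in word if not unicodedata.combining(ch)]
    return "".join(ARCHIGRAPHEME[ch] for ch in letters)

WORDS = {
    "ʿabdu": "\u0639\u0628\u062F",
    "ʿīd": "\u0639\u064A\u062F",
    "ʿinda": "\u0639\u0646\u062F",
    "ġayad": "\u063A\u064A\u062F",
}

for transcription, spelling in WORDS.items():
    print(transcription, "->", skeleton(spelling))
```

All four words reduce to the same string, EBD, which is the kind of many-to-one relation that a search over historical texts would need to take into account.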

5 GRAPHEMES
A grapheme is the smallest unambiguous unit in a writing system. Ideally graphemes
correspond to the plain text units of Unicode. In Arabic most of the accepted graphemes
correspond with a phoneme (the smallest unambiguous sound unit in speech):
y w h n m l k q f ġ ʿ ẓ ṭ ḍ ṣ š s z r ḏ d ḫ ḥ ǧ ṯ t b a
However, in a few cases this correspondence is not stable:

a. There can be more than one way to encode a single grapheme, e.g.:
the Arabic grapheme YEH WITH HAMZA ABOVE can have multiple encodings, which causes
inconsistent usage:
U+0626 YEH WITH HAMZA ABOVE
U+0649 ALEF MAKSURA + U+0654 HAMZA ABOVE
U+06CC FARSI YEH + U+0654 HAMZA ABOVE
b. There can be more than one grapheme for a code, e.g.:
U+06CC FARSI YEH
- shares its dotted non-final forms with U+064A YEH;
- shares its dotless final forms with U+0649 ALEF MAKSURA.
This inconsistency is not a feature of the Arabic writing system, but a consequence of the
legacy approach adopted by Unicode. Accepting all graphemic markers as independent
secondary characters with their own code points would make these cases unambiguous.
The template for this solution already exists: in the latest version of the Unicode Standard,
the combination of composition elements ALEF and HAMZA ABOVE has been declared
canonically equivalent to the legacy pre-composed grapheme ALEF WITH HAMZA ABOVE:
U+0627 ARABIC LETTER ALEF + U+0654 ARABIC HAMZA ABOVE
= U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
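This canonical equivalence can be checked with Python's standard `unicodedata` module. The sketch below also tests the YEH WITH HAMZA ABOVE spellings listed under (a); the pair U+064A YEH + U+0654 is added here for comparison because it is the canonical decomposition of U+0626. The dictionary labels are illustrative only.

```python
import unicodedata

ALEF, HAMZA_ABOVE = "\u0627", "\u0654"
ALEF_WITH_HAMZA_ABOVE = "\u0623"

# NFC composes ALEF + HAMZA ABOVE into the pre-composed letter.
alef_composes = unicodedata.normalize("NFC", ALEF + HAMZA_ABOVE) == ALEF_WITH_HAMZA_ABOVE
print("ALEF + HAMZA ABOVE composes:", alef_composes)

YEH_WITH_HAMZA_ABOVE = "\u0626"
CANDIDATES = {
    "YEH + HAMZA ABOVE": "\u064A\u0654",
    "ALEF MAKSURA + HAMZA ABOVE": "\u0649\u0654",
    "FARSI YEH + HAMZA ABOVE": "\u06CC\u0654",
}

# Only the first sequence is canonically equivalent to U+0626; the other two
# look the same in many fonts but remain distinct after normalization.
equivalent = {
    label: unicodedata.normalize("NFC", sequence) == YEH_WITH_HAMZA_ABOVE
    for label, sequence in CANDIDATES.items()
}
for label, result in equivalent.items():
    print(label, "->", result)
```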

ALLOGRAPHS AND LIGATURES
a. simplified support for graphic assimilation

In Arabic the abstract, nominal graphemes are represented by context-dependent
allographs. Simplified support for Arabic handles contextual allographs according to two
patterns, discontinuous and continuous assimilation:
DISCONTINUOUS: 2 allographs
final unconnected  د
final connected    ـد
medial             ـد
initial            د

CONTINUOUS: 4 allographs
final unconnected  ب
final connected    ـب
medial             ـبـ
initial            بـ
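The two patterns can be sketched in Python as a choice between four shape names. This is a simplified illustration, not a rendering engine: the set `DISCONTINUOUS` lists only six common letters, and the function name `allographs` is hypothetical.

```python
# Letters that connect only to the preceding letter (discontinuous pattern).
# Assumption: a reduced set of ALEF, DAL, THAL, REH, ZAIN and WAW.
DISCONTINUOUS = set("\u0627\u062F\u0630\u0631\u0632\u0648")

def allographs(word):
    """Return (letter, shape name) pairs for a word without vowel marks."""
    result = []
    previous_connects = False  # does the preceding letter connect to this one?
    for index, letter in enumerate(word):
        has_next = index + 1 < len(word)
        connects_onward = has_next and letter not in DISCONTINUOUS
        if previous_connects and connects_onward:
            shape = "medial"
        elif previous_connects:
            shape = "final connected"
        elif connects_onward:
            shape = "initial"
        else:
            shape = "final unconnected"
        result.append((letter, shape))
        previous_connects = connects_onward
    return result

# MEEM, HAH, MEEM, DAL
for letter, shape in allographs("\u0645\u062D\u0645\u062F"):
    print(f"U+{ord(letter):04X}", shape)
```

A discontinuous letter never receives the initial or medial shape name here, which corresponds to the two-allograph row of the table; a continuous letter can receive any of the four.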
b. full support for graphic assimilation

Graphic assimilation of Arabic letters is a sophisticated art – and the foundation of
Islamic calligraphy – which produces well-designed and pleasantly legible script images.
Without a thorough understanding it cannot be supported. E.g. in initial position, BEH
coverage can get quite elaborate in naskh:
In metal-based typography and nostalgic computer fonts, only an inconsistent number of
random ligatures remain of the original system:

6 WRITING ARABIC
Here are two additional aspects of Arabic script that have consequences for rendering
systems:
a. horizontal and vertical connections
The traditional connection is still reflected in a number of ligatures.
traditional assimilation modified assimilation
‫ﺣﺤﺢ‬
b. unstable spelling caused by changing font technology
Spelling and font technology have mutually influenced each other since the fast
emergence of computer technology for Arabic script. The fast development of font
technology has the unintentional result that different fonts may require different spellings
for the same printed image. For instance, most fonts cannot deal with al-lāhu, “God”:
Data structure: ALEF-FATHA, LAM-LAM-SHADDA-FATHA-HEH-DAMMA
Result: correct data structure, wrong image
Data structure: ALEF-FATHA, LAM-LAM-HEH-DAMMA
Result: wrong data structure, wrong vowel image
[rendered samples not recoverable from the extracted text]
For comparison, the correct image representing the above data structures:
complete vowels incomplete vowels
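The two data structures for al-lāhu can be written out as code point sequences. The following Python sketch only builds and compares the sequences named above; it does not say anything about how a particular font renders them, and the helper names `describe` and `strip_marks` are hypothetical.

```python
import unicodedata

ALEF, LAM, HEH = "\u0627", "\u0644", "\u0647"
FATHA, DAMMA, SHADDA = "\u064E", "\u064F", "\u0651"

# ALEF-FATHA, LAM-LAM-SHADDA-FATHA-HEH-DAMMA (the correct data structure)
full = ALEF + FATHA + LAM + LAM + SHADDA + FATHA + HEH + DAMMA
# ALEF-FATHA, LAM-LAM-HEH-DAMMA (the spelling some fonts require)
reduced = ALEF + FATHA + LAM + LAM + HEH + DAMMA

def describe(text):
    """List the character names without the ARABIC / LETTER prefixes."""
    return [unicodedata.name(ch).replace("ARABIC ", "").replace("LETTER ", "")
            for ch in text]

def strip_marks(text):
    """Remove all combining marks, leaving only the primary characters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(describe(full))
print(describe(reduced))
print("same text:", full == reduced)
print("same primary characters:", strip_marks(full) == strip_marks(reduced))
```

The two strings differ as stored text even though they are meant to produce the same printed image, so a comparison or search that does not ignore the marks treats them as different words.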
A related phenomenon occurs when older font technology cannot handle the
combination of ligatures and vowels, forcing the users into systematically misspelling
words, e.g., the word al-islāmu “Islam”:
correct data structure, wrong image
wrong data structure, approximate image
[rendered samples not recoverable from the extracted text]
For comparison, the correct image representing the above data structures:
complete vowels incomplete and misplaced vowels

7 RENDERING ARABIC SCRIPT
a. font technology
A font is an industrial product designed to enable handling Arabic with technology that is
not designed for Arabic. In the design process, Arabic is an object that can be adapted at
will: corners can be cut and rules can be broken. The resulting script can be seen as an
“innovation”.
[font samples not recoverable from the extracted text]
b. script analysis and synthesis
The term script synthesis describes the effort to analyze and synthesize traditional
calligraphic styles or high quality typesetting systems. In this approach Arabic is the
subject whose integrity needs to be preserved when it is reproduced in digital form. Here
the underlying technology is the innovation.

8 ENCODING ARABIC SCRIPT FOR THE ARABIC LANGUAGE
a. what to encode
Unicode uses a model resulting from earlier conferences about Middle Eastern
computing: contextual shapes of one and the same letter are all attributed to a single
nominal text code. This is the graphemic model:
GRAPHEME (character code) and its ALLOGRAPHS:

U+062F
final unconnected  د
final connected    ـد
medial             ـد
initial            د

U+0628
final unconnected  ب
final connected    ـب
medial             ـبـ
initial            بـ

There is a single logical representation regardless of the visual complexity of the assimilations, mergers or ligatures.
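One way to observe the graphemic model is to take the positional forms that Unicode retains for compatibility in its Arabic Presentation Forms-B block and fold them back to the nominal code. A minimal Python sketch, assuming the standard `unicodedata` module; the table name is illustrative:

```python
import unicodedata

# Nominal character code -> its compatibility presentation forms.
PRESENTATION_FORMS = {
    "\u062F": ["\uFEA9", "\uFEAA"],                      # DAL: isolated, final
    "\u0628": ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"],  # BEH: isolated, final, initial, medial
}

for nominal, forms in PRESENTATION_FORMS.items():
    for form in forms:
        # NFKC replaces a presentation form by the nominal letter it stands for.
        folded = unicodedata.normalize("NFKC", form)
        print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(folded):04X}")
```

Text is expected to be stored with the nominal codes only; selecting the allograph is left to the font and the rendering system.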
b. code page legacy

The original encoded Arabic character sets had external and internal limitations -
external in the sense that only a small number of characters could be accommodated and
internal in the sense that only simplified modern orthography for office use was
supported.
Today there is no limitation to the number of characters that can be handled
simultaneously by a computer system, while the original purely synchronic, limited scope
has changed into a diachronic and comprehensive ambition. Unicode is being extended
with additional characters to handle literary orthography, archaic orthography, as well as
contemporary Qur’anic orthography.
Historical Qur’anic orthography is fully archigraphemic and therefore not supported by
Unicode’s graphemic model. This serious defect is curiously matched in Arabic studies by
the absence of an authoritative critical text edition documenting the transmission
through the ages of this key historic text.

9 ENCODING ARABIC SCRIPT FOR OTHER LANGUAGES

a. extra characters
The Arabic character set has been expanded over time to cover speech sounds not used in
the Arabic language. Practically always the existing archigrapheme-cum-marker template
is used, e.g.:
‫ٮ‬ ‫ٹ‬ ‫ٺ‬ ‫ٻ‬ ‫ټ‬ ‫ٽ‬ ‫پ‬ ‫ٿ‬ ‫ڀ‬
‫ح‬ ‫ځ‬ ‫ڂ‬ ‫ڄ‬ ‫ڃ‬ ‫څ‬ ‫چ‬ ‫ڿ‬ ‫ڇ‬
b. regional calligraphic and typographic preferences
Various user communities of the Arabic script have specific calligraphic traditions that
result in preferences for certain fonts or script styles. For instance, the preferred way to
write Urdu is a special form of nastaliq script¹:
[nastaliq sample not recoverable from the extracted text]

The same text in simplified naskh would not be acceptable:
‫اگر اس طره پر پیچ وخم کا پیچ وخم نکلے‬ ‫بھرم کھل جاۓ ظالم تیرے قامت کی دراز ی کا‬
c. calligraphic preferences sometimes cause incompatible encoding

There are instances where one and the same Arabic letter received a different encoding
because a regional calligraphic style shaped it differently than the ubiquitous naskh.
A case in point is the Arabic letter KAF, which in nastaliq has an extra swash in the final
forms. Unicode now has an extra code U+06A9 KEHEH, causing identical letters to be
encoded with language dependent codes. As a result, two out of the three letters of the
place name MECCA are not interchangeable between various Arabic-scripted languages:
‫مكة‬ U+0645 MEEM U+0643 KAF U+0629 TEH MARBUTA
‫مکه‬ U+0645 MEEM U+06A9 KEHEH U+0647 HEH
‫مکه‬ U+0645 MEEM U+06A9 KEHEH U+06D5 AE
‫مکہ‬ U+0645 MEEM U+06A9 KEHEH U+06C1 HEH GOAL
‫مکۃ‬ U+0645 MEEM U+06A9 KEHEH U+06C3 TEH MARBUTA GOAL
(the GOAL variants of HEH and TEH MARBUTA are also calligraphy-based mismatches)
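The effect on searching can be sketched in Python. The folding table below is a hypothetical, application-level mapping chosen for this illustration; it is not defined by Unicode, and it deliberately keeps HEH and TEH MARBUTA apart.

```python
# The five spellings of MECCA listed above, as code point sequences.
SPELLINGS = [
    "\u0645\u0643\u0629",  # MEEM KAF TEH MARBUTA
    "\u0645\u06A9\u0647",  # MEEM KEHEH HEH
    "\u0645\u06A9\u06D5",  # MEEM KEHEH AE
    "\u0645\u06A9\u06C1",  # MEEM KEHEH HEH GOAL
    "\u0645\u06A9\u06C3",  # MEEM KEHEH TEH MARBUTA GOAL
]

# Hypothetical folding of the calligraphy-based variants onto one code each.
FOLD = str.maketrans({
    "\u06A9": "\u0643",  # KEHEH -> KAF
    "\u06D5": "\u0647",  # AE -> HEH
    "\u06C1": "\u0647",  # HEH GOAL -> HEH
    "\u06C3": "\u0629",  # TEH MARBUTA GOAL -> TEH MARBUTA
})

def fold(text):
    """Map language-dependent letter variants to a common code point."""
    return text.translate(FOLD)

distinct_before = len(set(SPELLINGS))
distinct_after = len({fold(spelling) for spelling in SPELLINGS})
print(distinct_before, "distinct spellings before folding,", distinct_after, "after")
```

Without such folding, a plain string comparison treats all five spellings as different words; with it, only the orthographic difference between a final HEH and a final TEH MARBUTA remains.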
1
bharam khul ǧāʾē ẓālim tērē qāmat kī darāzi kā - agar us tura ē pur pēč ū ḫam kā pēč u ḫam niklē
“O tyrant, the mistake about the tallness of your figure will be rectified - if the curls and twists of your hair
full of curls and twists are straightened out” (Ġālib, quoted in Finn Thiessen, A manual of Classical Persian
Prosody with chapters on Urdu, Karakhanidic and Ottoman prosody, Wiesbaden 1982, p.188)

10 BASIC LAY-OUT
There exist three distinct line-breaking patterns in Arabic-scripted languages:

a. Graphic: equidistant and equivalent spaces follow final forms and discontinuous letters²:
b. Graphemic: only word-separating spaces and final forms are valid line-breaking points:
c. Orthographic: in addition to word-separating spaces and final forms, hyphenation is
used for line-breaking, just like in Latin-based orthographies:
a: Historic Arabic (early archigraphemic Arabic)
b: Arabic, Persian, Urdu, etc. (semi-alphabetic modern Arabic)
c: Modern, non-Arabic (fully alphabetic Uyghur Turkic)
NOTA BENE: so far only pattern b is documented and supported by Unicode.
2
The sample (repeated in the text columns) illustrates the spelling evolution in Arabic, as well as the
complete phonologic, lexical and orthographic integration of Arabic words in Uyghur (spoken in China):
Arabic: muḥammad ʿabdu l-lāh nadīm ʿarab miṣrī;
Turkic: muhämmäd abdullah nadim äräb mısırlıq
(Mohammed, Abdallah, Nadeem [personal names], and “Arab”, “Egyptian” – from Arabic miṣr, “Egypt”)

11 LANGUAGES
Languages written with the Arabic script [millions of speakers]³
Arabic [221m]
Qurʾānic Arabic
Classical Arabic
Modern Standard Arabic
Colloquial Arabic dialects
Algerian [22m]
Baharna (Bahrain, Oman)
Chadian
Dhofari (Oman)
Egyptian [46m]
Hadrami (East Yemen, Oman)
Hassaniyya [2.6m] (Mauritania)
Hijazi (KSA)
Judeo-Iraqi (Israel)
Judeo-Moroccan
Judeo-Tripolitanian (Lebanon)
Judeo-Tunisian
Judeo-Yemeni (Yemen, Israel)
Libyan
Mesopotamian [14m] (Iraq, Iran, Syria)
Moroccan / Maghrebi [19.5m]
Najdi [10m] (Saudi Arabia, Iraq, Jordan, Syria)
North Levantine [15m] (Lebanon, Syria)
North Mesopotamian
Omani
Saidi [19m] (Egypt)
Sanaani (North Yemeni)
Shihhi (UAE)
South Levantine
Sudanese [19m] (Sudan)
Ta'izzi-Adeni (South Yemeni)
Tunisian
Indo-Aryan
Kurdish / Kurmanji / Northern Kurdish [26m]
Several of the Kurdish-specific letters in Unicode have no
corresponding positional forms in the PRESENTATION blocks
3
This is a rough compilation that does not distinguish between current and historical use of the Arabic
script; numbers of speakers have not been verified.
Sources: http://en.wikipedia.org; http://www.omniglot.com; http://www.travelphrases.info/fonts.html

Persian
Persian / Western Farsi (Persian of Iran) [70m]
Dari / Eastern Farsi (Persian of Afghanistan) [7m]
Tajiki (Persian of Tajikistan and Afghanistan) [4.4m]
Pashto / Afghan [27m]
alias: Pathan, Pushto, Pashtoe, Pashtu, and Pukhto
Western Balochi / Baluchi (Balochistan: Pakistan, Iran, and Afghanistan;
Turkmenistan, the Arab countries of the Gulf, and Kenya)
Urdu [104m]
Kalami (Pakistan)
Punjabi, Lahnda (Pakistan)
Sindhi [9m] (Pakistan, Sind province, India)
Parkari (Pakistan)
Kashmiri / Kashur [4.5m] (India, Pakistan, China, UK)
Saraiki / Multani / Derawali / Western Punjabi (Pakistan)
Pathwari (Pakistan)
Rajasthani (India)
Turkic
Uyghur [7.6m] (China)
Turkmen [6.4m] (Turkmenistan, Afghanistan, Germany, Iran, Iraq, Kazakhstan,
Kyrgyzstan, Pakistan, Russia, Tajikistan, Turkey, USA and Uzbekistan)
Kazakh [8m] (Kazakhstan, Russia and China)
Kyrgyz [1.5m] (Kyrgyzstan, China)
Turkish / Osmanli
Chagatai
Tatar [7m] (Russian Republic of Tatarstan, and also in Afghanistan, Azerbaijan,
Belarus, China, Estonia, Finland, Georgia, Kazakhstan, Kyrgyzstan, Latvia,
Lithuania, Moldova, Tajikistan, Turkey (Europe), Turkmenistan, Ukraine, USA
and Uzbekistan)
African
Hausa / Ajami [39m]
Swahili / Kiswahili (Zanzibar, Tanzania - official, Kenya - official, Malawi,
Mozambique, E. Congo, Uganda, Rwanda, Burundi, Somalia, S. Ethiopia)
Mandinka [1.2m] (Senegal, Gambia (main language), Guinea-Bissau)
Wolof [6.7m] (Senegal - main language, Gambia, Mauritania)
Comorian (Comoros Islands)
Maba [0.25m] (Africa)
SE Asia
Malay / Jawi [18m] (Brunei - co-official script, Malaysia, Indonesia, Singapore,
Thailand) Malay written in Arabic is called Jawi.

Caucasian
Dargwa [2.5m] (Russian Republic of Dagestan)
European
Morisco (Spanish)
Bosnian (Serbian)
Ukrainian
13 COUNTRIES AND AREAS WHERE ARABIC SCRIPT IS USED
Afghanistan, Algeria, Bahrain, Chad, China, Cyprus, Djibouti, Egypt, Eritrea, Iran, India,
Iraq, Israel, Jordan, Kenya, Kuwait, Lebanon, Libya, Mali, Mauritania, Morocco, Niger,
Oman, Palestinian West Bank & Gaza, Qatar, Saudi Arabia, Somalia, Sudan, Syria,
Tajikistan, Tanzania, Tunisia, Turkey, UAE, Uzbekistan and Yemen.
