CHAPTER 2

ELECTRONIC REPRESENTATION OF TEXT

Electronic processing of text in any language requires that characters (letters of
the alphabet along with special symbols) be represented through unique codes.
This is called encoding. Usually, the code also corresponds to the written shape
of the letter. A code is essentially a number associated with each letter, so that
computers can distinguish between letters by their codes. The set of all the
characters in a language is called a repertoire. Encodings can be further
classified into glyph-based encodings and character-based encodings [38].

A character encoding defines how sequences of numeric codes are represented
as sequences of octets. The code serves the important purpose of standardizing
the way text is handled on different computer systems [32]. All Indian languages
have a phonetic base built on a fixed number of vowels and consonants, yet
their writing systems permit many different shapes to be generated depending
on the syllables in the text. In this thesis, English and Tamil are taken as case
studies. ASCII, ISCII and Unicode are character-based encodings. A glyph is
the shape given to a symbol; TAM, TAB and TSCII are glyph-based encodings.

In typography, a glyph is a particular graphical representation of a grapheme, of
several graphemes in combination (a composed glyph), or of only a part of a
grapheme. In computing as well as typography, the term character refers to a
grapheme. A character or grapheme is a unit of text, whereas a glyph is a
graphical unit.

For example, the sequence ffi contains three characters, but can be represented
by one glyph, the three characters being combined into a single unit known as a
ligature. Conversely, some typewriters require the use of multiple glyphs to depict
a single character. Most typographic glyphs originate from the characters of a
typeface. In a typeface, each character typically corresponds to a single glyph,
but there are exceptions, such as a font used for a language with a large
alphabet or complex writing system, where one character may correspond to
several glyphs, or several characters to one glyph.

2.1 AMERICAN STANDARD CODE FOR INFORMATION INTERCHANGE

American Standard Code for Information Interchange (ASCII) is a character
encoding based on the English alphabet. ASCII codes represent text in
computers, communications equipment, and other devices that work with text.
Most modern character encodings, which support many more characters than
the original did, have a historical basis in ASCII. Work on ASCII began in 1960.
The first edition of the standard was published in 1963, with a major revision in
1967 and a further update in 1986. It currently defines codes for 128 characters,
of which 33 are non-printing and 94 are printable (excluding the space).

Like other character encodings, ASCII specifies a correspondence between
digital bit patterns and character symbols (i.e. graphemes and control
characters). This allows digital devices to communicate with each other and to
process, store, and exchange character-oriented information such as written
language. The ASCII character encoding is used on nearly all common
computers, especially personal computers and workstations. The preferred MIME
name for this encoding is "US-ASCII". ASCII does not define any mechanism for
describing the structure or appearance of text within a document.

ASCII is, strictly, a seven-bit code, meaning it uses patterns of seven binary
digits (a range of 0 to 127 decimal) to represent each character. When ASCII
was introduced, many computers used eight-bit bytes (groups of bits), also called
octets, as the native data type. In seven-bit ASCII encoding, the eighth bit was
commonly used as a parity bit for error checking on communication lines or for
other device-specific functions. Machines that did not use parity checking
typically set the eighth bit to 0.

2.1.1 ASCII Control Characters

ASCII reserves the first 32 codes (numbers 0-31 decimal) for control characters:
codes originally intended not to carry printable information, but rather to control
devices (such as printers) that make use of ASCII, or to provide meta-information
about data streams such as those stored on magnetic tape. For example,
character 10 represents the "line feed" function, character 8 represents
"backspace", and character 13 represents "carriage return".

2.1.2 ASCII Printable Characters

Code 32, the "space" character, denotes the space between words, as produced
by the space bar of a keyboard. Codes 33 to 126, known as the printable
characters, represent letters, digits, punctuation marks, and a few miscellaneous
symbols. Table 2.1 shows the ASCII character space, covering the control
characters (0-31, plus DEL at 127) and the printable characters (32-126).

22
Table 2.1 ASCII Character Space

      0    1    2    3    4    5    6    7    8    9
  0  NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT
 10  LF   VT   FF   CR   SO   SI   DLE  DC1  DC2  DC3
 20  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS
 30  RS   US   SP   !    "    #    $    %    &    '
 40  (    )    *    +    ,    -    .    /    0    1
 50  2    3    4    5    6    7    8    9    :    ;
 60  <    =    >    ?    @    A    B    C    D    E
 70  F    G    H    I    J    K    L    M    N    O
 80  P    Q    R    S    T    U    V    W    X    Y
 90  Z    [    \    ]    ^    _    `    a    b    c
100  d    e    f    g    h    i    j    k    l    m
110  n    o    p    q    r    s    t    u    v    w
120  x    y    z    {    |    }    ~    DEL
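The chart can be regenerated mechanically from the code values. A minimal
Python sketch, using the standard control-character mnemonics:

# Regenerate the ASCII chart of Table 2.1: control-character
# mnemonics for codes 0-31, SP and DEL, literal glyphs for 33-126.
CONTROL = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US",
]

def name(code):
    if code < 32:
        return CONTROL[code]
    if code == 32:
        return "SP"
    if code == 127:
        return "DEL"
    return chr(code)                 # printable characters 33-126

for row in range(0, 128, 10):        # ten columns per row, as in Table 2.1
    cells = [name(c) for c in range(row, min(row + 10, 128))]
    print(f"{row:3d}  " + " ".join(f"{cell:>4s}" for cell in cells))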

2.2 MULTI BYTE CHARACTER SET

A variable-width encoding is a type of character encoding scheme in which
codes of differing lengths are used to encode a character set for representation
in a computer. The most common variable-width encodings are multibyte
encodings, which use varying numbers of bytes (octets) to encode different
characters. Early variable-width encodings used less than a byte per character,
to pack English text into fewer bytes in adventure games for early
microcomputers.

A Multi Byte Character Set (MBCS) is usually the result of a need to increase the
number of characters which can be encoded without breaking backward
compatibility with an existing constraint. For example, with one byte (8 bits) per
character, one can encode 256 possible characters. In order to encode more
than 256 characters, the obvious choice would be to use two or more bytes per
encoding unit; two bytes (16 bits) would allow 65,536 possible characters, but
such a change would break compatibility with existing systems and therefore
might not be feasible at all.

CJK multibyte encodings

The first use of multibyte encodings was for the encoding of Chinese, Japanese
and Korean scripts, which have large character sets well in excess of 256
characters. To represent multiple character sets, registered ISO escape
sequences, each 3 characters long, are used. Using 1 byte, 94 printable
characters can be defined (in addition to 33 control characters and one space).
Using 2 bytes, it is possible to represent 94 x 94 = 8,836 characters. The stateful
nature of these encodings and the large overlap make them very awkward to
process.
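These ISO escape sequences can be observed directly with any ISO 2022
codec. A small Python sketch using the iso2022_jp codec (Japanese is taken
here purely as an example):

text = "JIS: 日本語"                  # mixed ASCII and Japanese
encoded = text.encode("iso2022_jp")
print(encoded)
# The output contains the 3-byte designating sequence ESC $ B
# (switch to the two-byte JIS X 0208 set) before the Japanese text
# and ESC ( B (switch back to ASCII) after it - the statefulness
# that makes such encodings awkward to process.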

2.2.1 Double Byte Character Set

DBCS stands for Double Byte Character Set. This term has two basic
meanings:

• In CJK (Chinese/Japanese/Korean) computing, the term "DBCS" means
a character set in which every graphic character not representable by an
accompanying SBCS (Single Byte Character Set) is encoded in two bytes.
Han characters generally comprise most of these two-byte characters.
• The term "DBCS" can also mean a character set in which all characters
(including all control characters) are encoded in two bytes.

A DBCS always has lead bytes with the most significant bit set (i.e., equal to 1),
and is always paired with a single-byte character set (SBCS). Furthermore,
for the practical reason of maintaining compatibility with unmodified, off-the-shelf
software, the SBCS is associated with half-width characters and the DBCS with
full-width characters.

DBCS Sort Order and String Comparison

In sorting and comparing DBCS text, the Option Compare Text statement (in
Visual Basic) has a special behavior. In English, "case-insensitive" means
ignoring the differences between uppercase and lowercase. In a DBCS
environment, this has additional implications.

For example, some DBCS character sets (including Japanese, Traditional
Chinese, and Korean) have two representations for the same character: a
narrow-width letter and a wide-width letter. For example, there is a single-byte
"A" and a double-byte "A". Although they are displayed with different character
widths, Option Compare Text treats them as the same character. There are
similar rules for each DBCS character set.
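Option Compare Text is a Visual Basic facility; in other environments the same
width-insensitive comparison is usually approximated with compatibility
normalization. A rough Python sketch:

import unicodedata

half = "A"      # U+0041, ordinary single-byte Latin letter
full = "Ａ"     # U+FF21, FULLWIDTH LATIN CAPITAL LETTER A

print(half == full)    # False: the code points differ
# NFKC compatibility normalization folds the full-width form onto
# the ordinary letter, mimicking a width-insensitive comparison.
print(unicodedata.normalize("NFKC", full) == half)    # True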

2.3 INDIAN SCRIPT CODE FOR INFORMATION INTERCHANGE

In the early eighties, the Dept. of Electronics of the Govt. of India set up an
expert committee to frame standards for information processing of Indic
languages. The Indian Script Code for Information Interchange (ISCII), first
launched in 1984, is the outcome of this exercise. ISCII is an 8-bit umbrella
standard, defined in such a way that all Indian languages can be treated using
one single character encoding scheme. ISCII is a bilingual character (not glyph)
encoding scheme. Roman characters and punctuation marks as defined in
standard lower ASCII take up the first half of the character set (the first 128
slots). Characters for Indic languages are allocated to the upper slots (128-255).
The Indian Standard ISCII-84 was subsequently revised in 1991 (ISCII-91) and
1997 (ISCII-97). Along with the character encoding scheme (ISCII), the Govt. of
India also defined a keyboard layout for input, called INSCRIPT. The research
and development wing of the DOE, Govt. of India, has developed software
packages based on these Indian standards. Multilingual and multimedia products
are based on Graphics and Intelligence-based Script Technology (GIST).

Commercial DTP packages based on ISCII are also available. ISCII has not been
widely used outside of certain government institutions and has now been
rendered largely obsolete by Unicode. While using a separate block for each
Indic writing system, Unicode largely preserves the ISCII layout within each
block. In a 7-bit environment, the control code SI can be used for invocation of
the ISCII code set, and the control code SO can be used for reselection of the
ASCII code set.

There are 15 officially recognized languages in India: Hindi, Marathi, Sanskrit,
Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam,
Tamil, Urdu, Sindhi and Kashmiri. Of these, Urdu, Sindhi and Kashmiri are
primarily written in Perso-Arabic scripts. Apart from the Perso-Arabic scripts, the
10 scripts used for the other Indian languages have evolved from the ancient
Brahmi script and share a common phonetic structure, making a common
character set possible.

The northern scripts are Devanagari, Punjabi, Gujarati, Oriya, Bengali and
Assamese, while the southern scripts are Telugu, Kannada, Malayalam and
Tamil. Hindi, the official language of India, is written in the Devanagari script.
Devanagari is also used for writing Marathi and Sanskrit, and is the official script
of Nepal. As the Perso-Arabic scripts have a different alphabet, a separate
standard is envisaged for them. An attribute mechanism has been provided for
selection of different Indian script fonts and display attributes. An extension
mechanism allows the use of more characters along with the ISCII code.

The ISCII code table is a super-set of all the characters required in the ten
Brahmi-based Indian scripts. The Arabic-based writing systems have
subsequently been encoded in the PASCII encoding. The ISCII standard
specifies a 7-bit code table which can be used in a 7-bit or 8-bit ISO-compatible
environment. It allows English and Indian script alphabets to be used
simultaneously.

ISCII Code Philosophy

There are manifold advantages in having a common code and keyboard for all
the Indian scripts. Any software which allows ISCII codes to be used can be
used in any Indian script, enhancing its commercial viability. Furthermore,
immediate transliteration between different Indian scripts becomes possible just
by changing the display mode. Simultaneous availability of multiple Indian
languages in the computer medium will accelerate their development and
facilitate national integration. The 8-bit ISCII code retains the standard ASCII
code, while the Indian script keyboard overlay is designed for the standard
English QWERTY keyboard. This ensures that English can co-exist with the
Indian scripts. This approach also makes it feasible to use Indian scripts with
existing English computers and software, so long as 8-bit character codes are
allowed.

The common INSCRIPT keyboard overlay allows typing of all the ten Indian
scripts. This overlay fits on any existing English keyboard. Alternating
between the English and INSCRIPT overlays is achieved through the CAPS
LOCK key. The INSCRIPT keyboard provides a logical and intuitive arrangement
of vowels and consonants, based both on the phonetic properties and on the
relative usage frequencies of the letters.

Not only does this make the keyboard much easier to learn, it also enables a
person to type in all the Indian scripts. The differences between scripts lie
primarily in their written forms, where different combination rules get used.

Properties of ISCII Code

• Phonetic Sequence
The ISCII characters, within a word, are kept in the same order as they
would get pronounced.

• No Direct Sorting
Since there are variations in ordering of a few consonants between
different Indian scripts, it is not possible to achieve perfect sorting in all
Indian scripts. Special routines would be required.

• Unique Spellings
By using only the basic characters in ISCII, there is only one unique way
of typing a word.

• Display Independence
A word in an Indian script can be displayed in a variety of styles
depending on the conjunct repertoire used. ISCII codes, however, allow a
complete delinking of the codes from the displayed fonts. The INSCRIPT
keyboard overlay has a one-to-one correspondence with the ISCII code.
This way, typing of a word does not depend upon its displayed form.

• Transliteration
The ISCII codes are rendered on the display device according to the
display composition methodology of the selected script. Transliteration to
another script can thus be obtained by merely redisplaying the same text
in a different script. Since the display rendering process can be very
flexible, it is possible to transliterate the Indian scripts to the Roman script
using diacritic marks. Similarly, it is possible to transliterate them to other
scripts such as Perso-Arabic. Transliteration involves a mere change of
the script, in a manner that does not affect pronunciation. This is not the
same as "translation", where the language itself changes. Table 2.2
shows the ISCII table for Hindi.
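Because Unicode largely preserved the common ISCII layout within each Indic
block, as noted above, the same redisplay idea can be sketched in Unicode as a
fixed code-point shift between blocks. The following Python fragment is a rough
illustration only; a real transliterator must handle the many characters that have
no counterpart in the target script:

DEVANAGARI, TAMIL = 0x0900, 0x0B80    # start of each Indic block

def dev_to_tamil(text):
    # Shift every Devanagari code point into the Tamil block; the
    # blocks inherit the ISCII layout, so corresponding letters sit
    # at the same relative offset.
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI <= cp < DEVANAGARI + 0x80:
            out.append(chr(cp - DEVANAGARI + TAMIL))
        else:
            out.append(ch)            # leave non-Devanagari text alone
    return "".join(out)

print(dev_to_tamil("कमल"))            # -> கமல (ka-ma-la in Tamil script)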

2.4 UNICODE

In computing, Unicode [26] is an industry standard allowing computers to
consistently represent and manipulate text expressed in most of the world's
writing systems. Developed in tandem with the Universal Character Set (UCS)
standard, Unicode consists of a repertoire of about 100,000 characters, a set of
code charts for visual reference, an encoding methodology and a set of standard
character encodings, an enumeration of character properties such as upper and
lower case, a set of reference data computer files, and a number of related
items, such as rules for text normalization, decomposition,
collation, rendering and bidirectional display order (for the correct display of text
containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right
scripts).

Table 2.2 ISCII Table for Hindi

Unicode standards are widely used by the industry for the development of
multilingual software. The basic input for evolving the Unicode standard for
Indian scripts was taken from the "Indian Script Code for Information
Interchange-1988 (ISCII-1988)" document. Some modifications were necessary
in the Unicode standard for adequate representation of Indian scripts. The
Department of Information Technology, Ministry of Communications & IT, is the
voting member of the Unicode Consortium. The Department finalized the
proposed changes in the Unicode standard in consultation with the respective
state governments, the Indian IT industry and linguists, and the proposal was
submitted to the Unicode Consortium. The Unicode Technical Committee (UTC)
accepted some of the proposed changes for inclusion in the Unicode standard,
and these changes were incorporated in Unicode Standard 4.0. The latest
version of the Unicode Standard is Unicode 5.0.

The Unicode Consortium, the non-profit organization that coordinates Unicode's
development, has the ambitious goal of eventually replacing existing character
encoding schemes with Unicode and its standard Unicode Transformation
Format (UTF) schemes, as many of the existing schemes are limited in size and
scope and are incompatible with multilingual environments. Unicode standards
are very useful for computer users who deal with multilingual text: business
people, linguists, researchers, scientists, mathematicians and technicians.
Unicode uses a 16-bit encoding that provides code points for more than 65,000
characters (65,536). The Unicode standard assigns each character a unique
numeric value and name. The standard has been implemented in many recent
technologies, including XML and the Java programming language.

Unicode policy for character encoding

The Unicode Consortium has laid down certain policies regarding character
encoding stability, under which no character deletion or change of character
name is possible; only annotation updates are permitted. That is, once a
character is encoded:

• It will not be moved or removed.

• Its character name will not be changed.

• Its canonical combining class and decomposition (either canonical or
compatibility) will not be changed in a way that would affect normalization.

• Its properties may still be changed, but not in such a way as to change the
fundamental identity of the character.

• The structure of certain property values in the Unicode character database
will not be changed.

Mapping and encodings

Several mechanisms have been specified for implementing Unicode. The one
implementors choose depends on available storage space, source code
compatibility, and interoperability with other systems.

Unicode defines two mapping methods:

• Unicode Transformation Format (UTF) encodings

• Universal Character Set (UCS) encodings

UTF encodings include:

• UTF-7, a relatively unpopular 7-bit encoding, often considered obsolete.
• UTF-8, an 8-bit, variable-width encoding, which maximizes compatibility
with ASCII.
• UTF-EBCDIC, an 8-bit, variable-width encoding, which maximizes
compatibility with EBCDIC.
• UTF-16, a 16-bit, variable-width encoding.
• UTF-32, a 32-bit, fixed-width encoding.

An encoding maps the range of Unicode code points to sequences of values in
some fixed-size range, termed code values. The numbers in the names of the
encodings indicate the number of bits in one code value (for UTF encodings) or
the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are
probably the most commonly used encodings. UCS-2 is an obsolete subset of
UTF-16; UCS-4 and UTF-32 are functionally equivalent.
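The difference between the obsolete fixed-width UCS-2 and the variable-width
UTF-16 shows up only outside the Basic Multilingual Plane. A small Python
check:

bmp = "A"       # U+0041, inside the BMP: one 16-bit code value
astral = "𐍈"    # U+10348 GOTHIC LETTER HWAIR: needs a surrogate pair

print(bmp.encode("utf-16-be").hex())       # 0041      (2 bytes)
print(astral.encode("utf-16-be").hex())    # d800df48  (4 bytes)
# UCS-2 simply cannot represent U+10348; UTF-16 spends two
# surrogate code values (D800, DF48) on it.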

An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a
0x00 byte in front of every ASCII byte. For a UCS-4 file, three 0x00 bytes have
to be inserted before every ASCII byte. Using UCS-2 (or UCS-4) under Unix
would lead to very severe problems: strings in these encodings can contain, as
parts of their 16-bit characters, bytes such as 0x00 or '/' which have a special
meaning in filenames and other C library function parameters. In addition, the
majority of UNIX tools expect ASCII files and cannot read 16-bit words as
characters without major modifications. For these reasons, UCS-2 is not a
suitable external encoding of Unicode in filenames, text files, environment
variables, etc.
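The zero-byte insertion described above, and the resulting embedded 0x00
bytes that upset C string handling, are easy to verify; a short sketch (big-endian
byte order assumed):

ascii_bytes = b"Hi"

# UCS-2 (big-endian) of pure ASCII is just a 0x00 before each byte.
ucs2 = b"".join(b"\x00" + bytes([b]) for b in ascii_bytes)
print(ucs2)                                # b'\x00H\x00i'
print(ucs2 == "Hi".encode("utf-16-be"))    # True
# Each character now contains a 0x00 byte - a string terminator to
# C library routines, which is why UCS-2 breaks Unix tools.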

The UTF-8 [18] encoding defined in ISO 10646-1:2000 does not have these
problems. It is clearly the way to go for using Unicode under Unix-style operating
systems.

UTF-8 has the following properties:

• UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes
0x00 to 0x7F. This means that files and strings which contain only 7-bit
ASCII characters have the same encoding under both ASCII and UTF-8.
• All UCS characters above U+007F are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore, no ASCII
byte (0x00-0x7F) can appear as part of any other character.
• The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD, and it indicates how many
bytes follow for this character. All further bytes in a multibyte sequence are
in the range 0x80 to 0xBF. This allows easy resynchronization and makes
the encoding stateless and robust against missing bytes.
• All possible 2³¹ UCS codes can be encoded.
• UTF-8 encoded characters may theoretically be up to six bytes long;
however, 16-bit BMP characters are only up to three bytes long.
• The sorting order of big-endian UCS-4 byte strings is preserved.
• The bytes 0xFE and 0xFF, which are used to represent the byte order
mark (BOM) in UTF-16, are never used in the UTF-8 encoding.

Table 2.3 shows the byte sequences that are used to represent a character; the
sequence to be used depends on the Unicode number of the character.

The xxx bit positions are filled with the bits of the character code number in
binary representation. The rightmost x bit is the least-significant bit. Only the
shortest possible multibyte sequence which can represent the code number of
the character may be used. In a multibyte sequence, the number of leading
1 bits in the first byte equals the total number of bytes in the sequence.

Table 2.3 UTF-8 representation of Unicode characters

From        To           Sequence of UTF-8 bytes
U-00000000  U-0000007F   0xxxxxxx
U-00000080  U-000007FF   110xxxxx 10xxxxxx
U-00000800  U-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
U-00010000  U-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000  U-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000  U-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Example 2.1

The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in
UTF-8 as:

11000010 10101001 = 0xC2 0xA9

Example 2.2

U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0
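Both examples can be checked mechanically against the bit patterns of
Table 2.3. A minimal Python sketch covering the one- to three-byte forms
(enough for the BMP):

def utf8_manual(cp):
    # Apply the bit patterns of Table 2.3 to a code point.
    if cp < 0x80:                            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([0xE0 | (cp >> 12),         # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_manual(0x00A9).hex())    # c2a9   (Example 2.1)
print(utf8_manual(0x2260).hex())    # e289a0 (Example 2.2)
print("©≠".encode("utf-8").hex())   # c2a9e289a0 - the built-in codec agrees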

For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that
are longer than necessary to encode a character. The code positions U+D800 to
U+DFFF (UTF-16 surrogates), as well as U+FFFE and U+FFFF, must not occur
in normal UTF-8 or UCS-4 data; such malformed or overlong sequences must be
rejected for safety reasons. UTF-8 uses one to four bytes per code point and,
being compact for Latin scripts and ASCII-compatible, provides the de facto
standard encoding for interchange of Unicode text. It is also used by most recent
Linux distributions as a direct replacement for legacy encodings in general text
handling.
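Conforming decoders enforce the overlong-sequence rule. A quick check in
Python:

# 0xC0 0xA9 would be an overlong two-byte encoding of U+0029 ')';
# a safe decoder must reject it rather than silently decode it.
try:
    bytes([0xC0, 0xA9]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)    # e.g. 'invalid start byte'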

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM)
for use at the beginning of text files, where it may be used for byte-order
(endianness) detection. Some software developers have adopted it for other
encodings, including UTF-8, which does not need an indication of byte order; in
that case it merely marks the file as containing Unicode text.
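A rough sketch of such BOM-based detection, using the BOM constants from
Python's codecs module (the UTF-32 marks must be tested before the UTF-16
ones, which they begin with):

import codecs

def detect_bom(data):
    # Order matters: BOM_UTF32_LE starts with BOM_UTF16_LE.
    for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8,     "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return name
    return "no BOM found"

print(detect_bom("hi".encode("utf-16")))   # utf-16-le on little-endian builds
print(detect_bom(b"plain ASCII"))          # no BOM found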

In UTF-32 and UCS-4, one 32-bit code value suffices to represent any
character's code point. UTF-32 is widely used as an internal representation of
text in programs (as opposed to stored or transmitted text); in particular, every
Unix operating system built with the gcc compilers uses it as the standard "wide
character" encoding.

Unicode in Operating systems

Unicode has become the dominant scheme for internal processing and storage
of text. The best known such system is Windows NT (and its descendants,
Windows 2000, Windows XP and Windows Vista), which uses Unicode as the
sole internal character encoding. The Java and .NET bytecode environments,
Mac OS X, and KDE also use it for internal representation. UTF-8 has become
the main storage encoding on most Unix-like operating systems because it is a
relatively easy replacement for traditional extended ASCII character sets.

Unicode criticisms

Han unification (the identification of forms in the three East Asian languages
which one can treat as stylistic variations of the same historical character) has
become one of the most controversial aspects of Unicode, despite the presence
of a majority of experts from all the three regions in the Ideographic Rapporteur
Group (IRG), which advises the consortium and ISO on additions to the
repertoire and on Han unification.

Unicode has been criticized for failing to allow for older and alternative forms of
Kanji which, critics argue, complicates the processing of ancient Japanese and

uncommon Japanese names, although it follows the recommendations of
Japanese language scholars and of the Japanese government and contains all of
the same characters as previous widely used encoding standards. The official
encoding of China, GB-18030, supports the full range of characters in Unicode.

Indic scripts

The Indic scripts of India are each allocated only 128 code points, matching the
ISCII standard. The correct rendering of Unicode Indic text requires transforming
the characters from their stored logical order into visual order, and forming
ligatures out of components.

Unicode contains some Arabic and other ligatures for backward-compatibility
purposes only. Encoding of any new ligatures in Unicode will not happen, in part
because the set of ligatures is font-dependent, and Unicode is an encoding
independent of font variations.

Difference between Unicode and ISCII

Unicode uses a 16-bit encoding that provides code points for more than 65,000
characters (65,536), and assigns each character a unique numeric value and
name. The Unicode standard provides the capacity to encode all of the
characters of the written languages of the world. ISCII uses an 8-bit code, an
extension of the 7-bit ASCII code, containing the basic alphabet required for
the 10 Indian scripts which have originated from the Brahmi script. There are 15
officially recognized languages in India. Apart from the Perso-Arabic scripts, the
10 scripts used for the other Indian languages have evolved from the ancient
Brahmi script and have a common phonetic structure, making a common
character set possible. The ISCII code table is a super-set of all the characters
required in the Brahmi-based Indian scripts. For convenience, the alphabet of
the official script, Devanagari, has been used in the standard.

2.5 TAMIL STANDARD CODE FOR INFORMATION INTERCHANGE

Tamil is one of the two classical languages of India, and has remained in
continuous use in the country for over two thousand years. TSCII uses visual
(written) order encoding for the Tamil language.

Need for the Proposed Standard for Tamil

Given that the ISCII and Unicode standards already exist for information
interchange of Indic languages (including Tamil), a natural question is why
another standard should be proposed for Tamil. Listed below are some of the
key arguments advanced in this context:

1. Being based on the "character-encoding" concept, both ISCII and Unicode
leave the screen rendering of the Tamil alphabet to software developers.
Implementations of these standards rely on additional hardware cards
(as with the GIST interface card of CDAC) or on dedicated software
that invokes advanced font-handling technologies such as glyph
substitution, available in still-evolving font specifications like TrueType
Open and TrueType GX.
2. Dravidian languages are notorious for their complex glyph structures. The
necessity to use advanced font-handling techniques such as glyph
substitution has the further disadvantage that applications (DTP, word
processing etc.) have to be developed from scratch for Tamil, without the
luxury of using off-the-shelf applications developed for English as-is.
3. Using the Devanagari script as the reference, ISCII defines a single
encoding scheme for all Indic languages. The phonology and the script
usage of the Dravidian languages are very different; there are many
characters in Tamil and Malayalam for which there are no Devanagari
equivalents. Compromises are made by allocating extra slots to
introduce these additional characters. By treating all Indian scripts under
one scheme, the ISCII philosophy does not take advantage of the fact that
Tamil can be encoded in a simple form that integrates seamlessly with
existing computing platforms without requiring specialized rendering
technologies.
4. ISCII and Unicode are not the only avenues open for Tamil information
interchange; it is worth pointing out that these are "evolving" standards.
Before their emergence, information processing and exchange in the
major languages of the world had been going on for decades via simple,
self-standing 7- and 8-bit fonts. The only problem with such Tamil fonts is
that no standard encoding scheme has been used, so the exchange of
Tamil text files is not simple and one needs converters to go from one
scheme to another.

There are several advantages in developing a Tamil standard for information
interchange that is based on simple, self-standing fonts:

• Once installed in the system, they can be used in practically all
applications directly, without any extra software or hardware intervention.
• Fonts corresponding to one encoding scheme can easily be ported to
other computer platforms (particularly between Windows, Macintosh and
Unix).
• World-wide, free distribution of a self-standing Tamil font will lead to very
rapid standardization of information interchange, as has been the case
with most of the European, Russian and Japanese languages.

Goals of TSCII

• Establish a consistent international Tamil character encoding standard
that in turn leads to a self-standing Tamil font usable on all widely used
computer platforms (PCs, Macintosh and Unix), particularly on earlier
models and operating systems.

o The TamilNadu government has recently embarked on an
ambitious plan to provide Internet-access booths all over the
state. This will certainly increase the awareness of computing
among lay Tamils, who will be interested in taking up Tamil
computing on whatever computer they can get access to. In
such a scenario, it is likely that early-generation computers
produced in the last decade will be put to use (e.g. AT/XT PCs
capable of running early versions of Windows). It would be a
great disappointment to lay Tamils if the standard required
expensive, state-of-the-art computer systems.
o A Tamil font defined very much like a roman font such as
Times or Helvetica, once installed in the system, can be used
in all software packages supported by the respective OS
without the need for additional software or hardware
intervention. It is likely that over 90% of Tamil computing is
simple word-processing of plain text.

The encoding standard must be readily implementable on the
most widely used computer platforms (UNIX, Windows and
Mac). Tamil material will be input on all three platforms, and on
the Internet an exchange may involve all three (the sender
could use a Windows PC, the recipient a Mac, and the
intermediate mail server a Unix-based computer).

Fortunately, procedures have been developed for producing
fonts with identical encoding schemes that work under these
different platforms. Information exchange via e-mail and the
WWW has also been perfected, so that no serious problems
are anticipated in rapid implementation of the proposed scheme
on all three OS.

The TamilNadu govt. has undertaken the task of producing one
such Tamil font and distributing it free on the Internet. Free
distribution of a handful of such fonts will not deprive the
software market of designing new fonts: there will always be a
need for specially designed fonts for professional usage, very
much the same way the font market still exists for roman fonts
(Adobe and others continue to make millions marketing roman
fonts!).

• The encoding should be glyph-based, at the 8-bit bilingual level, using a
unique set of glyphs and the usual lower-ASCII set. Roman letters with
standard punctuation marks occupy the first 128 slots and the Tamil
glyphs occupy the 'upper-ASCII' segment (slots 128-255).

> Almost all of the European languages currently employ such 8-bit
bilingual schemes, commonly known as the ISO 8859-X schemes.
Such 8-bit schemes are proven standards widely implemented on
all major computer platforms.

> An 8-bit scheme with the lower ASCII part in the first 128 slots
facilitates enormously the smooth flow of information across the
Internet in all of the commonly used protocols (SMTP, FTP,
HTTP, NNTP, POP, IMAP, ...). All non-Tamil-speaking personnel
entrusted with communication flow (postmasters, system
administrators, particularly those outside India and outside
TamilNadu) can easily follow the content, its originator,
destination etc., and ensure smooth exchange across the
platforms and communication protocols used on the Internet.

> TamilNadu, as a constituent state of India, works under a
bilingual scenario with both English and Tamil as the languages
for official communication. With a single font it is possible to
correspond in either or both of the languages. The ISCII standard
of the govt. of India is defined in a similar way.

> Tamil has far too many letters for each to be accommodated as a
single glyph in the 128 slots left. So, depending on the complexity
of the character (and its rendering), the scheme may use one, two
or three bytes to define a single letter. But the choices of glyphs
are such that each of the 250+ Tamil letters (uyir, mei and
uyirmei) is represented in one and only one way.

In the past, the Tamil language used alternative glyphs for some
of the Tamil letters (e.g. the forward kombu/kokki used to write
lai/Nai/nai, Raa, Naa and Naa, referred to as ORNL). A unique
definition scheme implies that there is no place for these old-style
characters in the encoding scheme. If the glyph encoding scheme
is unambiguous in defining the resulting character set, then it
does not really matter whether one chooses to encode glyphs or
characters. Defining a unique set of glyphs leading to a unique
definition of all of the 250+ Tamil characters makes the glyph
encoding scheme unambiguous. Defining glyphs also defines the
rendering of the characters.

It was pointed out earlier that defining characters alone and
leaving the rendering to software (as in Unicode and ISCII)
requires dedicated, expensive hardware and/or software.
Unicode fonts and the Apple multilingual package can be used
only on the latest generation of computers with PowerPC chips
and current OS software.

• The Tamil standard must be an open standard.

All the Tamil fonts and software currently in use world-wide are
the recent work of individuals and hence are subject to copyright
protection. The copyright protection available to authors is very
clear for DTP packages, but when it comes to fonts the scope is
very hazy and protection varies from country to country. So it is
desirable to develop a true international open standard. This
approach also avoids someone's existing font/software encoding
being picked up as the standard.

The TSCII Tamil "encoding" scheme and the associated "keyboard
input options" are open standards, i.e., no one needs to seek
permission or stake credit to implement the standard in any
application, including commercial, freeware and shareware
versions. But the "implemented" software may or may not be
copyrighted by the developer; this is entirely the developer's
discretion. Both ISCII and the associated INSCRIPT keyboard are
"proprietary" standards owned by the Govt. of India.

• The encoding scheme should be universal in scope. The Tamil standard must
include all characters that are likely to be used in everyday Tamil text
interchange.

□ For centuries, the Tamil language has grown with several
grantha characters added on. The usage of these grantha
characters along with pure Tamil ones is deep-rooted in the
day-to-day usage of Tamil by the common man. Hence, the
inclusion of these grantha characters becomes essential under
the above criterion. Both ISCII and Unicode recognize this
situation and have provided specific slots for a number of
grantha characters.
□ Unlike many of the Tamil fonts and software packages that
leave out rarely used Tamil letters (such as ngu, ngU, nyu,
nyuu), the present scheme ensures their presence. This has
been done so that multimedia and software for teaching Tamil
can display all of the Tamil alphabet without exception.

• The encoding standard must be Unicode and ISCII compatible.

The glyph choices are to be such that a one-to-one correspondence
mapping table between the alphabet/character definitions under the
present scheme and Unicode/ISCII can be established. Using such a
table, it will be possible to save a TSCII-based file in either format. Both
the Unicode and ISCII schemes include a number of Tamil numerals, so
the present scheme needs to include these Tamil numerals as well;
otherwise there cannot be a one-to-one correspondence with these
standards.

There are major advantages in ensuring this compatibility with the TSCII
standard.

❖ It is an undeniable fact that the world is heading towards
multilingualism. This is particularly true for a country like India,
where the migration of people among the constituent states is
very pronounced. The encoding standards for multilingualism,
namely Unicode and ISCII, are still evolving and are not fully
established. Keeping this in mind, TSCII was prepared as an
"interim standard", with a move to the multilingual standards
(Unicode or ISCII) foreseen at a later date. Clean compatibility
ensures that all Tamil material generated in TSCII format can be
made available in Unicode/ISCII format at all times, present and
future; none of the TSCII-based resources will be lost when
Unicode/ISCII become fully functional.
❖ Secondly, the present glyph encoding scheme can happily
co-exist with the more sophisticated Unicode/ISCII schemes and
can even smooth the transition to Unicode at a future date.

Table 2.4 shows the TSCII encoding table for Tamil.

Table 2.4 TSCII Table for Tamil

2.6 MONOLINGUAL ENCODING SCHEME FOR TAMIL

TAM is the official monolingual Tamil (8-bit) encoding scheme of the government
of TamilNadu, which has the largest Tamil-speaking population in the world. A
vast amount of Tamil textual information in digital libraries, online newspapers,
magazines etc. is available today in this encoding scheme. TAM, a monolingual
Tamil encoding scheme, is a superset of the TAB bilingual Tamil encoding
scheme. TAM is a glyph encoding scheme, while Unicode is a character
encoding scheme; hence there exists a one-to-one, one-to-many, many-to-one
or many-to-many relationship between the Tamil letters in TAM and those in
Unicode. A Tamil letter in TAM-encoded text may be made up of one, two or
three bytes. Table 2.5 shows the TAM encoding table for Tamil.

Table 2.5 TAM encoding for Tamil

2.7 BILINGUAL ENCODING SCHEME FOR TAMIL

TAB is the official bilingual Tamil (8-bit) encoding scheme of the government of
TamilNadu. A vast amount of Tamil textual information in digital libraries, online
newspapers, magazines etc. is available today in this encoding scheme too.
TAB encodes the roman script along with the Tamil script. The first 128 code
points of the TAB encoding scheme are identical to the ASCII character set; the
next 128 code points are a subset of the TAM monolingual Tamil encoding
scheme. TAB is also a glyph encoding scheme, and a Tamil letter in TAB-encoded
text may be made up of one, two or three code points. Table 2.6 shows the TAB
encoding table for Tamil. As with any glyph encoding, the TAM and TAB
encodings suffer from problems such as kerning and mis-scripting.

2.8 GLYPH ENCODING

2.8.1 Merits
It is very simple to use as far as Desktop Applications are concerned.

2.8.2 Demerits
Kerning Problem

This problem concerns the rendering of the glyphs and the visual appearance of
the resulting character.

Example 2.3

With glyph encoding, kerning can make a vowel sign render detached from, or
overlapping, its consonant, so the displayed shape differs from the intended
character. For instance, the pulli (the dot written above a consonant) should
appear centred over the letter, but due to the kerning problem it drifts to the right
end of the rendered character.

Table 2.6 TAB Encoding for Tamil
Example 2.4

Word Art rendered from glyphs can also be unreadable at times. If a Tamil word
is displayed vertically using glyphs, each glyph is stacked as a separate unit, so
the vowel signs are separated from their consonants and the word becomes
unreadable, as shown in Fig. 2.1(a). If the same word is displayed using a
character-oriented font like Muhil, it reads correctly and is free of the kerning
problem, as in Fig. 2.1(b).

(a) (b)

Fig. 2.1 Kerning Problem in Word Art

Mis-Scripting

Example 2.5

Mis-scripting produces meaningless characters: juxtaposing glyphs drawn from
two different syllables yields a shape that corresponds to no valid Tamil
character, being neither of the intended syllables.

Sorting is complex

Complex parsing is required in order to sort Tamil strings represented using
glyphs. That is, to decide whether one syllable should precede another, one has
to compute their respective ordinal numbers in the Tamil lexical order, which
cannot be read directly off the glyph codes.

Storage and transport requirements

A Tamil character may require anywhere between 1 and 3 bytes: as many as the
number of glyphs required to make up that character. For example, in TAB the
characters ka (க), ki (கி) and ko (கொ) need 1, 2 and 3 bytes respectively. This
is because க is represented by the single glyph 178, கி is made up of the glyphs
க (178) and the i-sign (164), and கொ is made up of the glyphs for the e-sign
(170), க (178) and the aa-sign (163).
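The variable storage cost is easy to tabulate once the glyph runs are written out
as byte sequences; a small Python sketch using the TAB glyph values quoted
above:

# TAB glyph values quoted above: ka = 178, the i-sign = 164,
# the e-sign = 170 and the aa-sign = 163. A Tamil character costs
# as many bytes as the glyphs that compose it.
tab_chars = {
    "ka": bytes([178]),             # single glyph
    "ki": bytes([178, 164]),        # base glyph + vowel sign
    "ko": bytes([170, 178, 163]),   # vowel part, base, vowel part
}

for name, seq in tab_chars.items():
    print(name, "->", len(seq), "byte(s):", list(seq))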

