
Content and Content Types

Shalini R. Urs, International School of Information Management, University of Mysore, Mysore
Shalini Urs Open Elective 2011

What is Document Genre?

Genre: the fusion of content, purpose and form of communicative actions.

Greek philosophers and orators recognized that the content of the message is not always its most important aspect; rather, the delivery, the context, and the rhetorical structure all play complementary roles in the subtle but profound act of one human being transferring information to another and thereby creating meaning.

Document Genre
a distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form


Content Types
It all began with MIME. Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email to support: text in character sets other than ASCII; non-text attachments; message bodies with multiple parts; and header information in non-ASCII character sets.

Media Types
MIME's use, however, has grown beyond describing the content of email to describing content types in general, including on the web. A media type is composed of a type, a subtype, and zero or more optional parameters. For example, subtypes of the text type have an optional charset parameter that can be included to indicate the character encoding.
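As a rough illustration of the type / subtype / parameter structure, the sketch below splits a media type string by hand. The function name parse_media_type is hypothetical, and the parsing is deliberately simplified: it assumes no semicolons appear inside quoted parameter values.

```python
def parse_media_type(value):
    """Split a media type string into (type, subtype, parameters)."""
    head, *rest = [part.strip() for part in value.split(";")]
    main_type, _, subtype = head.partition("/")
    params = {}
    for item in rest:
        if not item:
            continue  # tolerate a trailing semicolon
        name, _, param_value = item.partition("=")
        # Parameter names are case-insensitive; values keep their case.
        params[name.strip().lower()] = param_value.strip().strip('"')
    return main_type.lower(), subtype.lower(), params


print(parse_media_type("text/html; charset=UTF-8"))
```

Here "text" is the type, "html" the subtype, and charset the optional parameter naming the character encoding.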

Character Encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or the storage of text in computers.

What is a code?
In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort. In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information understandable by a receiver.

One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively. Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.

Character Encoding
A character encoding is a code that pairs a set of natural language characters (such as an alphabet or syllabary) with a set of something else, such as numbers or electrical pulses. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; and ASCII, which encodes letters, numerals, and other symbols as both integers and 7-bit binary versions of those integers.
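The ASCII pairing of a character with an integer and its 7-bit binary form can be seen with a few lines of Python. This is only a sketch; the helper name ascii_code is made up for illustration.

```python
def ascii_code(ch):
    """Return the ASCII integer and its 7-bit binary form for one character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code, format(code, "07b")


for ch in "A", "a", "0":
    print(ch, ascii_code(ch))
```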

What are UCS and ISO 10646?


The international standard ISO 10646 defines the Universal Character Set (UCS). UCS is a superset of all other character set standards. It guarantees round-trip compatibility with other character sets: no information will be lost if you convert any text string to UCS and then back to the original encoding.
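The round-trip guarantee can be checked informally in Python, whose str type holds Unicode code points. The sketch below decodes legacy bytes to Unicode and encodes them back; round_trips is a hypothetical helper, and the two codecs are simply examples of older encodings.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Decode legacy bytes to Unicode text, re-encode, and compare."""
    text = data.decode(encoding)          # legacy encoding -> Unicode
    return text.encode(encoding) == data  # Unicode -> legacy encoding


latin1_bytes = "Grüße".encode("iso-8859-1")
koi8_bytes = "Привет".encode("koi8-r")
print(round_trips(latin1_bytes, "iso-8859-1"), round_trips(koi8_bytes, "koi8-r"))
```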

UCS contains the characters required to represent practically all known languages. This includes not only the Latin, Greek, Cyrillic, Hebrew, Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese and Korean Han ideographs as well as scripts such as Hiragana, Katakana, Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic, Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar, Sinhala, Thaana, Yi, and others.

For scripts not yet covered, research on how best to encode them for computer usage is still going on, and they will be added eventually. This includes not only Cuneiform, Hieroglyphs and various Indo-European languages, but even some selected artistic scripts such as Tolkien's Tengwar and Cirth.

UCS also covers a large number of graphical, typographical, mathematical and scientific symbols, including those provided by TeX, PostScript, APL, the International Phonetic Alphabet (IPA), MS-DOS, MS-Windows, Macintosh, OCR fonts, as well as many word processing and publishing systems, and more are being added.

ISO 10646 formally defines a 31-bit character set. The most commonly used characters, including all those found in older encoding standards, have been placed in one of the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that were later added outside the 16-bit BMP are mostly for specialist applications such as historic scripts and scientific notation.

Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters. The ISO 10646-1 standard was first published in 1993 and defines the architecture of the character set and the content of the BMP. A second part, ISO 10646-2, was added in 2001 and defines characters encoded outside the BMP. New characters are still being added on a continuous basis, but the existing characters will not be changed.
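The arithmetic behind the BMP and the 21-bit code space is easy to verify. The following sketch assumes the usual convention that the plane number is the code point shifted right by 16 bits; the names plane and in_bmp are illustrative only.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # upper end of the 21-bit code space


def plane(code_point):
    """Return the plane number (0 = BMP) of a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the 0x000000 to 0x10FFFF code space")
    return code_point >> 16


def in_bmp(code_point):
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane(code_point) == 0


print(MAX_CODE_POINT + 1)                # number of positions in the code space
print(sys.maxunicode == MAX_CODE_POINT)  # Python uses the same upper limit
print(in_bmp(0x0041), plane(0x1F600))
```

The first line printed is 1114112, the "bit over one million" positions mentioned above.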

UCS assigns to each character not only a code number but also an official name. A hexadecimal number that represents a UCS or Unicode value is commonly preceded by "U+", as in U+0041 for the character "Latin capital letter A". The UCS characters U+0000 to U+007F are identical to those in US-ASCII (ISO 646 IRV), and the range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1). The range U+E000 to U+F8FF and also larger ranges outside the BMP are reserved for private use. UCS also defines several methods for encoding a string of characters as a sequence of bytes, such as UTF-8 and UTF-16.
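A short sketch of the "U+" notation and of the claim that the first 128 and 256 code points coincide with US-ASCII and Latin-1. The helper u_notation is a made-up name; the codecs are Python's built-in ones.

```python
def u_notation(ch):
    """Format a character's code point in the conventional U+XXXX style."""
    return f"U+{ord(ch):04X}"


# Every code point below 0x80 encodes to the same byte value in ASCII,
# and every code point below 0x100 to the same byte value in ISO 8859-1.
ascii_identical = all(chr(cp).encode("ascii") == bytes([cp]) for cp in range(0x80))
latin1_identical = all(chr(cp).encode("iso-8859-1") == bytes([cp]) for cp in range(0x100))

print(u_notation("A"), ascii_identical, latin1_identical)
```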

What are combining characters?

Some code points in UCS have been assigned to combining characters. These are similar to the non-spacing accent keys on a typewriter. A combining character is not a full character by itself. It is an accent or other diacritical mark that is added to the previous character. This way, it is possible to place any accent on any character. The most important accented characters, like those used in the orthographies of common languages, have codes of their own in UCS to ensure backwards compatibility with older character sets.

They are known as precomposed characters. Precomposed characters are available in UCS for backwards compatibility with older encodings that have no combining characters, such as ISO 8859. The combining-character mechanism allows one to add accents and other diacritical marks to any character. This is especially important for scientific notations such as mathematical formulae and the International Phonetic Alphabet, where any possible combination of a base character and one or several diacritical marks could be needed.

Combining characters follow the character which they modify. For example, the German umlaut character Ä ("Latin capital letter A with diaeresis") can either be represented by the precomposed UCS code U+00C4, or alternatively by the combination of a normal "Latin capital letter A" followed by a "combining diaeresis": U+0041 U+0308. Several combining characters can be applied when it is necessary to stack multiple accents or add combining marks both above and below the base character. The Thai script, for example, needs up to two combining characters on a single base character.
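The two representations of the umlaut example can be compared with Python's standard unicodedata module. This is a sketch only; same_text is a hypothetical helper that treats two strings as equal when their composed (NFC) forms match.

```python
import unicodedata

precomposed = "\u00C4"   # Latin capital letter A with diaeresis
combining = "A\u0308"    # Latin capital letter A + combining diaeresis


def same_text(a, b):
    """Compare two strings after composing them to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == combining)            # the raw code point sequences differ
print(len(precomposed), len(combining))    # one code point versus two
print(same_text(precomposed, combining))   # equal once normalized
```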

What are UCS implementation levels?

Not all systems can be expected to support all the advanced mechanisms of UCS, such as combining characters. Therefore, ISO 10646 specifies the following three implementation levels:

Level 1
Combining characters and Hangul Jamo characters are not supported. [Hangul Jamo are an alternative representation of precomposed modern Hangul syllables as a sequence of consonants and vowels. They are required to fully support the Korean script including Middle Korean.]

Level 2
Like level 1; however, in some scripts a fixed list of combining characters is now allowed (e.g., for Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai and Lao). These scripts cannot be represented adequately in UCS without support for at least certain combining characters.

Level 3
All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

What is Unicode?
In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO); the other was the Unicode Project, organized by a consortium of (initially mostly US) manufacturers of multilingual software.

Fortunately, the participants of both projects realized around 1991 that two different unified character sets are not exactly what the world needs. They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently; however, the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible, and they closely coordinate any further extensions.

Unicode 1.1 corresponded to ISO 10646-1:1993, Unicode 3.0 corresponded to ISO 10646-1:2000, Unicode 3.2 added ISO 10646-2:2001, and Unicode 4.0 corresponds to the forthcoming third version of ISO 10646. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future.

In computing, Unicode is the international standard whose goal is to specify a code matching every character needed by every written human language to a single unique integer number, called a code point. Despite technical problems and limitations, and criticism of them, Unicode has emerged as the dominant encoding scheme in the internationalization of software and in multilingual environments.

Microsoft Windows NT and its descendants Windows 2000 and Windows XP make extensive use of Unicode, more specifically UTF-16, as an internal representation of text. UNIX-like operating systems such as Linux, BSD and Mac OS X have adopted Unicode, more specifically UTF-8, as the basis of representation of multilingual text.

Unicode and ISO 10646: Differences

The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards. The Unicode Standard additionally defines much more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (say Arabic), handling of bidirectional texts that mix, for instance, Latin and Hebrew, algorithms for sorting and string comparison, and much more.

The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022. There are other closely related ISO standards, for instance ISO 14651 on sorting UCS strings. A nice feature of the ISO 10646-1 standard is that it provides CJK example glyphs in five different style variants, while the Unicode standard shows the CJK ideographs only in a Chinese variant.

What is UTF-8?
UCS and Unicode are first of all just code tables that assign integer numbers to characters. There are several alternatives for how a sequence of such characters, or their respective integer values, can be represented as a sequence of bytes. The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. The official terms for these encodings are UCS-2 and UCS-4, respectively.

Unless otherwise specified, the most significant byte comes first in these encodings (big-endian convention). An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte. If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte.
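The zero-byte padding described above can be sketched directly. The function names are hypothetical. For comparison the example uses Python's "utf-16-be" and "utf-32-be" codecs, which produce the same bytes as big-endian UCS-2 and UCS-4 for the ASCII and Latin-1 range.

```python
def to_ucs2_be(data: bytes) -> bytes:
    """Pad each ASCII/Latin-1 byte with one leading 0x00 byte."""
    return b"".join(b"\x00" + bytes([byte]) for byte in data)


def to_ucs4_be(data: bytes) -> bytes:
    """Pad each ASCII/Latin-1 byte with three leading 0x00 bytes."""
    return b"".join(b"\x00\x00\x00" + bytes([byte]) for byte in data)


ascii_bytes = b"Hi"
print(to_ucs2_be(ascii_bytes) == "Hi".encode("utf-16-be"))
print(to_ucs4_be(ascii_bytes) == "Hi".encode("utf-32-be"))
```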

Encoding Forms
Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits. The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e., in 8, 16 or 32 bits per code unit).

All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.


UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units. UTF-32 is popular where memory space is no concern, but fixed-width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32. All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
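To make the size trade-offs concrete, the sketch below measures how many bytes a single character occupies in each encoding form. encoded_sizes is an illustrative helper; the big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

ASCII characters take one byte in UTF-8, BMP characters take one 16-bit unit in UTF-16, and a character outside the BMP takes four bytes in all three forms.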

Defining Elements of Text


Written languages are represented by textual elements that are used to create words and sentences. These elements may be letters such as "w" or "M"; characters such as those used in Japanese Hiragana to represent syllables; or ideographs such as those used in Chinese to represent full words or concepts.

The definition of text elements often changes depending on the process handling the text. For example, in historic Spanish language sorting, "ll" counts as a single text element. However, when Spanish words are typed, "ll" is two separate text elements: "l" and "l".

To avoid deciding what is and is not a text element in different processes, the Unicode Standard defines code elements (commonly called "characters"). A code element is fundamental and useful for computer text processing. For the most part, code elements correspond to the most commonly used text elements. In the case of the Spanish "ll", the Unicode Standard defines each "l" as a separate code element. The task of combining two "l" together for alphabetic sorting is left to the software processing the text.
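How software might combine the two "l" code elements for sorting can be sketched with a custom sort key. This is a toy under stated assumptions: the function traditional_spanish_key is invented for illustration, it handles only "ll" (sorted after plain "l"), and it ignores accents and other digraphs.

```python
def traditional_spanish_key(word):
    """Build a sort key in which "ll" is one element that sorts after "l"."""
    key = []
    i = 0
    lowered = word.lower()
    while i < len(lowered):
        if lowered.startswith("ll", i):
            key.append(("l", 1))  # digraph: same letter slot, ranked after "l"
            i += 2
        else:
            key.append((lowered[i], 0))
            i += 1
    return key


words = ["luna", "llama", "lobo"]
print(sorted(words))                               # code element order
print(sorted(words, key=traditional_spanish_key))  # "ll" as one text element
```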

Text Processing
Computer text handling involves processing and encoding. Consider, for example, a word processor user typing text at a keyboard. The computer's system software receives a message that the user pressed a key combination for "T", which it encodes as U+0054.


The word processor stores the number in memory, and also passes it on to the display software responsible for putting the character on the screen. The display software, which may be a window manager or part of the word processor itself, uses the number as an index to find an image of a "T", which it draws on the monitor screen. The process continues as the user types in more characters.

The Unicode Standard directly addresses only the encoding and semantics of text. It addresses no other action performed on the text. For example, the word processor may check the typist's input as it is being entered, and display misspellings with a wavy underline. Or it may insert line breaks when it counts a certain number of characters entered since the last line break. An important principle of the Unicode Standard is that it does not specify how to carry out these processes as long as the character encoding and decoding is performed properly.

Interpreting Characters and Rendering Glyphs

The difference between identifying a code point and rendering it on screen or paper is crucial to understanding the Unicode Standard's role in text processing. The character identified by a Unicode code point is an abstract entity, such as "LATIN CAPITAL LETTER A" or "BENGALI DIGIT FIVE." The mark made on screen or paper -- called a glyph -- is a visual representation of the character.

The Unicode Standard does not define glyph images. The standard defines how characters are interpreted, not how glyphs are rendered. The software or hardware rendering engine of a computer is responsible for the appearance of the characters on the screen. The Unicode Standard does not specify the size, shape, or style of on-screen characters.

Character Sequences
Text elements are encoded as sequences of one or more characters. Certain of these sequences are called combining character sequences, made up of a base letter and one or more combining marks, which are rendered around the base letter (above it, below it, etc.). For example, a sequence of "a" followed by a combining circumflex "^" would be rendered as "â".

The Unicode Standard specifies the order of characters in a combining character sequence. The base character comes first, followed by one or more non-spacing marks. If there is more than one non-spacing mark, the order in which the non-spacing marks are stored isn't important if the marks don't interact typographically. If they do interact, then their order is important. The Unicode Standard specifies how successive nonspacing characters are applied to a base character, and when the order is significant.

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the nonspacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin-1, which includes many precomposed characters such as "ü" and "ñ".

Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the nonspacing character "¨". Once the character has been decomposed, it may be easier to work with because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters.
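A minimal sketch of decomposition for sorting, using Python's unicodedata module. The helper base_letters is hypothetical: it decomposes text (NFD) and drops the combining marks, which suits only languages where modifiers do not affect alphabetical order. Real collation is considerably more involved.

```python
import unicodedata


def base_letters(text):
    """Decompose text and drop combining marks, keeping the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["Mutter", "Müller", "Maler"]
print(unicodedata.decomposition("\u00FC"))  # code points of "u" + diaeresis
print(sorted(names))                        # raw code point order
print(sorted(names, key=base_letters))      # modifiers ignored
```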

The Unicode Standard was created by a team of computer professionals, linguists, and scholars to become a worldwide character standard, one easily used for text encoding everywhere. To that end, the Unicode Standard follows a set of fundamental principles:


Universal repertoire
Logical order
Efficiency
Unification
Characters, not glyphs
Dynamic composition
Semantics
Equivalent sequence
Plain text
Convertibility

The character sets of many existing international, national and corporate standards are incorporated within the Unicode Standard. For example, its first 256 characters are taken from the widely used Latin-1 character set.


Duplicate encoding of characters is avoided by unifying characters within scripts across languages; characters that are equivalent in form are given a single code. Chinese/Japanese/Korean (CJK) consolidation is achieved by assigning a single code for each ideograph that is common to more than one of these languages. This is instead of providing a separate code for the ideograph each time it appears in a different language. (These three languages share many thousands of identical characters because their ideograph sets evolved from the same source.)

The Unicode Standard specifies an algorithm for the presentation of text with bidirectional behavior, for example, Arabic and English. Characters are stored in logical order. The Unicode Standard includes characters to specify changes in direction when scripts of different directionality are mixed. For all scripts Unicode text is in logical order within the memory representation, corresponding to the order in which text is typed on the keyboard.
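The directionality that the bidirectional algorithm works from is stored as a property of each character. The sketch below only inspects that property with unicodedata.bidirectional; it does not implement the algorithm itself, and bidi_classes is a made-up helper name.

```python
import unicodedata


def bidi_classes(text):
    """Return the bidirectional class of each character, in logical order."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Hebrew letter alef, Arabic letter alef, stored in typing order.
mixed = "a\u05D0\u0627"
print(bidi_classes(mixed))
```

"L" marks a left-to-right character, "R" a right-to-left one, and "AL" a right-to-left Arabic letter.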


Assigning Character Codes


A single number is assigned to each code element defined by the Unicode Standard. Each of these numbers is called a code point and, when referred to in text, is listed in hexadecimal form following the prefix "U+". For example, the code point U+0041 is the hexadecimal number 0041 (equal to the decimal number 65). It represents the character "A" in the Unicode Standard.

Each character is also assigned a unique name that specifies it and no other. For example, U+0041 is assigned the character name "LATIN CAPITAL LETTER A." U+0A1B is assigned the character name "GURMUKHI LETTER CHA." These Unicode names are identical to the ISO/IEC 10646 names for the same characters.
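The names quoted above can be looked up in both directions with Python's unicodedata module, as this short sketch shows.

```python
import unicodedata

# Code point -> official character name.
print(unicodedata.name("\u0041"))
print(unicodedata.name("\u0A1B"))

# Official character name -> code point.
cha = unicodedata.lookup("GURMUKHI LETTER CHA")
print(f"U+{ord(cha):04X}")
```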

The Unicode Standard groups characters together by scripts in code blocks. A script is any system of related characters. The standard retains the order of characters in a source set where possible. When the characters of a script are traditionally arranged in a certain order -- alphabetic order, for example -- the Unicode Standard arranges them in its code space using the same order whenever possible. Code blocks vary greatly in size. For example, the Cyrillic code block does not exceed 256 code points, while the CJK code blocks contain many thousands of code points.

Codespace
Code elements are grouped logically throughout the range of code points, called the codespace. The coding starts at U+0000 with the standard ASCII characters, and continues with Greek, Cyrillic, Hebrew, Arabic, Indic and other scripts; then followed by symbols and punctuation. The code space continues with Hiragana, Katakana, and Bopomofo. The unified Han ideographs are followed by the complete set of modern Hangul.


A range of code points on the BMP and two very large ranges in the supplementary planes are reserved as private use areas. These code points have no universal meaning, and may be used for characters specific to a program or by a group of users for their own purposes. For example, a group of choreographers may design a set of characters for dance notation and encode the characters using code points in user space.


A set of page-layout programs may use the same code points as control codes to position text on the page. The main point of user space is that the Unicode Standard assigns no meaning to these code points, and reserves them as user space, promising never to assign them meaning in the future.
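A program can test whether a code point falls in user space with a simple range check. In this sketch the BMP range U+E000 to U+F8FF is the one the text names; the two supplementary ranges (planes 15 and 16) are filled in here as an assumption from the Unicode code charts, and is_private_use is a hypothetical helper.

```python
import unicodedata

# BMP private use area, then the two supplementary private use planes.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(code_point):
    """True if the code point lies in one of the private use ranges."""
    return any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES)


print(is_private_use(0xE000), is_private_use(0x0041))
print(unicodedata.category("\uE000"))  # general category of a private use character
```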

Conformance to the Unicode Standard

The Unicode Standard specifies unambiguous requirements for conformance in terms of the principles and encoding architecture it embodies. A conforming implementation has the following characteristics, as a minimum requirement:

characters are from the common repertoire; characters are encoded according to one of the encoding forms; characters are interpreted with Unicode semantics; unassigned codes are not used; and unknown characters are not corrupted.

Stability
The Unicode Standard has a lot of room to grow, and there are a considerable number of scripts that will be encoded in upcoming versions. This process is strictly additive, in other words, while characters may be added or new character properties may be defined, no characters will be removed -- or reinterpreted in incompatible ways.

These stability guarantees make it possible to encode data in Unicode and expect that future implementations conforming to a later version of the Unicode Standard will be able to interpret the data in the same way as implementations conforming to The Unicode Standard, Version 3.2.

The range of surrogate code points is reserved for use with UTF-16. Towards the end of the BMP is a range of code points reserved for private use, followed by a range of compatibility characters. The compatibility characters are character variants that are encoded only to enable transcoding to earlier standards and old implementations, which made use of them.
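How the surrogate range is used by UTF-16 can be sketched with a little arithmetic: a code point above the BMP is split into a high and a low surrogate. The function name to_surrogate_pair is illustrative; the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10400)
print(f"{high:04X} {low:04X}")
print(chr(0x10400).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```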


Unicode is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers. Unicode characters can be encoded using any of several schemes termed Unicode Transformation Formats (UTF).

The Unicode Consortium has as its ambitious goal the eventual replacement of existing character encoding schemes with Unicode, as many of the existing schemes are limited in size and scope, and are incompatible with multilingual environments. Its success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML and the Java programming language.

Terms such as character encoding, character set (charset), and sometimes character map or code page are often used almost interchangeably, although they now have related but distinct meanings. Common examples of character encoding systems include Morse code, the Baudot code, and the American Standard Code for Information Interchange (ASCII).

Other codes
Morse code was introduced in the 1840s and is used to encode each letter of the Latin alphabet and each Hindu-Arabic numeral as a series of long and short presses of a telegraph key. Representations of characters encoded using Morse code vary in length. The Baudot code was created by Émile Baudot in 1870.

ASCII and other codes


ASCII was introduced in 1963 and is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers. IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated EBCDIC) is an 8-bit encoding scheme developed in 1963.

Why Unicode?
The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding.

Encoding
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

Encoding
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
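The conflict is easy to reproduce: the same byte value stands for different characters in different legacy encodings. The sketch below decodes one byte under three of Python's built-in codecs, chosen only as examples; decode_in is a hypothetical helper.

```python
def decode_in(data: bytes, encodings):
    """Decode the same bytes under several encodings and collect the results."""
    return {name: data.decode(name) for name in encodings}


# One byte value, three different characters.
print(decode_in(b"\xe4", ["iso-8859-1", "cp1251", "iso-8859-7"]))

# Corruption in transit: text written as Latin-1 but read as Windows-1251.
print("Bär".encode("iso-8859-1").decode("cp1251"))
```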

Encoding
In communications, a code is a rule for converting a piece of information (for example, a letter, word, or phrase) into another form or representation, not necessarily of the same sort.


Encoding
In communications and information processing, encoding is the process by which a source (object) performs this conversion of information into data, which is then sent to a receiver (observer), such as a data processing system. Decoding is the reverse process of converting data, which has been sent by a source, into information understandable by a receiver.

Encoding
One reason for coding is to enable communication in places where ordinary spoken or written language is difficult or impossible. For example, a cable code replaces words (e.g., ship or invoice) with shorter words, allowing the same information to be sent with fewer characters, more quickly, and most important, less expensively.

Encoding
Another example is the use of semaphore flags, where the configuration of flags held by a signaller or the arms of a semaphore tower encodes parts of the message, typically individual letters and numbers. Another person standing a great distance away can interpret the flags and reproduce the words sent.
