Agenda
1. Introduction to Unicode
2. Unicode & SAP in General
3. Technology in Depth
4. Sizing Information for Unicode-based SAP Systems
3/31/2004
Introduction to Unicode
1. Introduction to Unicode
What is text?
History of character encoding
Problem of character encoding
From ASCII to Unicode
What is Unicode exactly?
The Unicode Standard
Where is Unicode used?
The Unicode Consortium
Unicode Encodings
What is text?
Code pages & encodings describe the handling of text and the way it is stored in
  Computers
  Files
  Data structures
Inside a computer program or data file, text is stored as a sequence of numbers, just like everything else
A character is a:
  Letter, Digit, Period, Hyphen, Punctuation or Math symbol
Computers were pretty slow, had fairly little memory and were very expensive
Up to the 1960s, I/O meant punching holes into paper tapes
Most character sets date back to the punch-card age and were designed with these cards in mind
In the early days of computers, every hardware manufacturer used proprietary technology (and encodings)
International data interchange was not an issue, so nothing needed to fit together
Which number is assigned to which character?
When an A is typed on the keyboard, the computer uses the character code as a basis for pulling the character shape of A from a font file listing with the same binary number, and displays or prints it
The character A may also have different integer values in different programs or data files (A might be in an Arabic font file)
In some instances no number is available for certain characters (e.g. ä)
All data is encoded in the form of binary numerical codes
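A minimal Python sketch of the problem described above: one stored number stands for different characters depending on the code page used to read it, and some code pages have no number at all for a given character. The byte value 0xE4 and the helper name `encodable` are chosen here purely for illustration; the codec names are Python's own.

```python
# The same stored byte means different characters under different code pages.
raw = bytes([0xE4])

for codec in ("iso-8859-1", "iso-8859-7", "iso-8859-5"):
    print(codec, "->", raw.decode(codec))


def encodable(char: str, codec: str) -> bool:
    """Return True if the code page assigns a number to this character."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(encodable("A", "ascii"))       # 7-bit ASCII has a number for A
print(encodable("\u00e4", "ascii"))  # ... but none for the letter a-umlaut
```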
Character repertoire
English: ~60 characters
Western European standard: ~300 characters for several languages
Korean: ~12.000 syllables
Chinese dictionaries: ~50.000 letters
Most character sets and encodings in the 70s/80s were modifications or extensions of ASCII
Hundreds of them used 8 bits with a subset of the 94 printable ASCII characters
Most common encodings nowadays use a single byte per character (SBCS)
They are all limited to 256 characters
Due to that, none of them can even cover the letters for the Western European languages
Many different 8-bit encodings were created to fulfill the needs of different user communities
For data interchange in a global networked information society and collaborative business world: a single character set for all languages in use
32 bits can encode 4.294.967.296 different characters, symbols and control characters
Solution
Unicode = universally encoded character set to store information from any language
  defines properties for each character
  standardizes script behavior
  provides a standard algorithm for bidirectional text
  defines cross-mappings for other standards
Unicode defines a unique code value for every character, regardless of platform, program or programming language used
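As a small illustration of "a unique code value plus properties for each character", the following Python sketch queries the Unicode character database that ships with the standard library. The four sample characters are arbitrary choices, not taken from the slides.

```python
import unicodedata

# Latin A, a with diaeresis, Hebrew alef, a CJK ideograph
for ch in ("A", "\u00e4", "\u05d0", "\u4e2d"):
    print(
        f"U+{ord(ch):04X}",             # the unique code value
        unicodedata.name(ch),           # the official character name
        unicodedata.category(ch),       # general category, e.g. Lu = uppercase letter
        unicodedata.bidirectional(ch),  # bidi class, e.g. L or R
    )
```

The bidirectional class is the per-character input to the standard bidirectional algorithm mentioned above.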
The Unicode Standard primarily encodes scripts rather than languages
Scripts comprise several languages that historically share the same set of symbols
In many cases a script may serve to write dozens of languages (e.g. the Latin script)
In other cases one script corresponds to one language (e.g. Hangul)
It also includes punctuation marks, diacritics, mathematical symbols, technical symbols, musical symbols, arrows, dingbats, etc.
In all, the Unicode Standard comprises >95.000 characters, ideograph sets and symbols (version 4.0)
The Unicode Standard has been adopted by many software and hardware vendors
Most OSs support Unicode
Unicode is required for international document and data interchange, the Internet and the WWW, and therefore by modern standards such as:
  Java, C#, Perl, Python
  Markup languages such as XML, HTML, XHTML, MathML, WML, etc.
  JavaScript
  LDAP
  CORBA
  etc.
Members of the Consortium include major computer corporations, software producers, database vendors, research institutions, international agencies, various user groups, and interested individuals
The Consortium cooperates with W3C and ISO and has liaison status "C" with ISO/IEC JTC 1/SC 2/WG 2, which is responsible for refining the specification and expanding the character set of ISO/IEC 10646
Unicode Encodings
UTF = Unicode Transformation Format
UCS = Universal Character Set
CESU = Compatibility Encoding Scheme
Conversion between different encodings is a simple, bit-wise operation (defined in the standard)
No performance-intensive conversion tables are necessary!
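To make the "simple, bit-wise operation" concrete, here is a sketch of a UTF-8 encoder that uses only shifts and masks and no lookup table. The function name `utf8_encode` is hypothetical, and the sketch omits validation (for example, it does not reject surrogate code points), so it is an illustration rather than a conformant implementation.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using shifts and masks only."""
    if code_point < 0x80:            # 7 bits  -> 1 byte
        return bytes([code_point])
    if code_point < 0x800:           # 11 bits -> 2 bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 16 bits -> 3 bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 21 bits -> 4 bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Cross-check against Python's built-in codec.
for ch in "A\u00e4\u20ac\U0001d11e":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```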
Unicode Encodings
UTF-8: Unicode Transformation based on 8-bit representation
CESU-8: Compatibility Encoding Scheme of UTF-16 on an 8-bit base
UTF-16: Unicode Transformation based on 16-bit representation
UCS-2: Universal Character Set, 2-byte variation (16-bit)
UTF-32: Unicode Transformation based on 32-bit representation
UCS-4: Universal Character Set, 4-byte variation (32-bit)
Unicode Encodings
Not all Unicode characters are 2 bytes long, so there is no doubling of hardware requirements in the first place
The Unicode encoding determines the length of a character
A character in a Unicode encoding can be longer than 1 byte; therefore Unicode characters can be longer than characters defined in a standard code page
UTF-8
It is a variable-width encoding and also a strict superset of 7-bit ASCII
Strict superset means that every character in 7-bit ASCII is available in UTF-8 with the same corresponding code point value
One character = 1 byte to 4 bytes in the encoding
Characters from European scripts: either 1 or 2 bytes
Asian scripts: 3 or 4 bytes
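The byte widths above can be checked with a few lines of Python. The sample characters and their labels are illustrative choices only.

```python
samples = {
    "A": "ASCII letter",
    "\u00e4": "European letter (a with diaeresis)",
    "\u65e5": "CJK ideograph",
    "\U0001d11e": "supplementary character (musical G clef)",
}

for ch, label in samples.items():
    print(f"{label}: {len(ch.encode('utf-8'))} byte(s) in UTF-8")

# Strict superset: pure ASCII text is byte-for-byte identical in UTF-8.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```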
UTF-8
UTF-8 is used for UNIX platforms, HTML and most Internet browsers
Main benefits of UTF-8:
  Compact storage requirements for European scripts: in general, European scripts will occupy less storage on disk and in memory
  Ease of migration: since 7-bit ASCII data remains the same in UTF-8, the data conversion effort between ASCII-based character sets and UTF-8 is reduced significantly
8-bit encodings are well suited for data transfer since all 7-bit ASCII and 8-bit ISO characters retain the same code points
Easier communication with legacy and non-Unicode systems
Downside: variable character length
UCS-2
UCS-2 has a fixed width of 16 bits (2 bytes)
UCS-2 is the Unicode encoding for Java & Win NT 4.0
Main benefits of UCS-2:
  More compact storage requirements for Asian scripts (each character represented with 2 bytes only)
  String processing will be faster because all characters are of the same width
  Good compatibility with Java and Microsoft clients
Downside: UCS-2 can support Unicode characters defined up to Unicode 3.0 only (max. 65.536)
UTF-16
One Unicode character can be 2 or 4 bytes in the encoding
Characters from European and most Asian scripts are represented in 2 bytes
Supplementary characters are represented in 4 bytes
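The 2-byte versus 4-byte split can be sketched as follows. The function name `utf16_units` is hypothetical; the arithmetic is the standard surrogate-pair calculation for code points above U+FFFF.

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units that represent one code point in UTF-16."""
    if code_point < 0x10000:
        return [code_point]             # one unit = 2 bytes
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 | (offset >> 10)      # upper 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # lower 10 bits -> low surrogate
    return [high, low]                  # two units = 4 bytes


for ch in ("A", "\u65e5", "\U0001d11e"):
    units = utf16_units(ord(ch))
    print(f"U+{ord(ch):04X}:", " ".join(f"{u:04X}" for u in units))
```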
UTF-16
Main benefits of UTF-16:
  More compact storage requirements for Asian scripts (2 bytes for commonly used characters)
  Ideal if European and Asian scripts are used together: UTF-16 will occupy less storage on disk and in memory than UTF-8 (3 bytes for the Asian part)
  Balance of efficient access to characters and economical use of storage
The above-mentioned points are the reason for the use of UTF-16 in SAP Web Application Server
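A quick way to see the storage trade-off between UTF-8 and UTF-16 is to measure the encoded length of sample strings. The helper `sizes` and the two sample strings are illustrative assumptions; real data volumes depend on the actual mix of scripts.

```python
def sizes(text: str) -> dict:
    """Bytes needed to store text in UTF-8 and UTF-16 (without byte order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


print(sizes("Berlin"))              # European text: UTF-8 is smaller
print(sizes("\u6771\u4eac\u90fd"))  # Japanese text: UTF-16 is smaller
```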
16-bit encodings offer a compromise between the pros and cons of the 8-bit and the 32-bit encodings
They do not need as much memory as 32-bit encodings, but offer a quasi-fixed character length
UCS-2 has a fixed character length, but it cannot define more than 2^16 (65.536) characters
UTF-32
32-bit
Fixed width (4 bytes)
Example #1
Character   UTF-8
A           41
c           63
Æ           C3 86
ö           C3 B6
𝄞           F0 9D 84 9E
[Further UTF-8 values listed without their characters: DA 64, E4 BA 75; the UCS-2 and UTF-16 columns are missing]
UTF-16
BIN   1010 1100 0000 0000
HEX   A    C    0    0
... and characters
Characters on Disk/Memory
Code Pages
SAP & Code Pages
Language Combinations before Unicode
Recommendations from SAP (w/o Unicode)
Unicode-compliant SAP products
When/why do customers need Unicode?
Characters on Disk/Memory
A code page defines the mapping between a byte sequence and a character
Code Pages
The code page determines which characters you can see and enter
It is also possible to specify a customer-specific language; this language must use one of the code pages that SAP supports; see Note 0112065
Blended Code Pages (Rel. 3.1D)
SAP proprietary code pages that contain characters from one or more standard code pages
  Increases the combinations of languages that can be used
  Functionally, a Blended Code Page system uses a single code page
  A Blended Code Page system is a single code page system
  Users can see and enter all characters contained in the code page, regardless of their log-in language
Supported Languages
The availability of SAP blended code pages is platform dependent, because SAP blended locales need to be created for each platform
Blended Locale Status (x = available = not available)
Multi-Display / Multi-Processing
Allows dynamic code page switching on the application server
Therefore permits any combination of standard code pages on one system
The log-on language determines the code page that is active for each user
An MDMP system is recommended if:
1. one or more additional code pages are required to add languages to your existing installation
2. a blended code page cannot support the combination of languages you need for a new installation
For example, an MDMP system with the code pages 1100 and 8000 allows German and Japanese users to log on to the same R/3 system in their respective languages
Each user can only access one code page at a time: a user who logs in as a Japanese user cannot enter German characters, and German characters in the database will not be displayed correctly
[Diagram: Front End, Application Server and DB; German user (Germany, code page 1100 ISO-1) and Japanese user (Japan)]
It is possible for a user to log on in German and then manipulate the character set and font settings so that they can enter what appear to be Japanese characters; these characters will not be stored correctly in the database, and the data will be corrupt
If a user wants to enter, for example, Japanese, he/she must log on in Japanese
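The corruption described above can be reproduced outside SAP with plain byte strings. In this sketch, the mapping of the code page numbers 1100 and 8000 to Python codec names, and the helpers `store` and `display`, are assumptions made for illustration; this is not how the R/3 kernel is implemented.

```python
# Assumed mapping from SAP code page number to a Python codec (illustration only).
CODE_PAGES = {1100: "iso-8859-1", 8000: "shift_jis"}


def store(text: str, code_page: int) -> bytes:
    """Bytes written by a user whose log-on language selects code_page."""
    return text.encode(CODE_PAGES[code_page])


def display(raw: bytes, code_page: int) -> str:
    """What a user sees when the same bytes are read with code_page."""
    return raw.decode(CODE_PAGES[code_page], errors="replace")


japanese = "\u65e5\u672c"
raw = store(japanese, 8000)

print(display(raw, 8000) == japanese)  # same code page on both sides
print(display(raw, 1100) == japanese)  # German log-on reads different characters
```

The stored bytes never change; only the active code page decides how they are interpreted, which is why the log-on language matters.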
To ensure that no data corruption occurs, the following restrictions must be observed:
  Global data must contain only 7-bit ASCII characters, which are in all code pages
  Users may use only the characters of their log-in language or 7-bit ASCII
  Batch processes must be assigned the correct user ID and language
  EBCDIC code pages are not supported
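The first restriction is easy to check mechanically. A minimal sketch, assuming global data is available as a plain string; the function name `is_7bit_ascii` and the sample values are hypothetical.

```python
def is_7bit_ascii(text: str) -> bool:
    """True if every character lies in the 7-bit ASCII range shared by all code pages."""
    return all(ord(ch) < 128 for ch in text)


for value in ("MATERIAL-0815", "M\u00fcller GmbH", "\u6771\u4eac"):
    status = "ok" if is_7bit_ascii(value) else "not safe as global data"
    print(repr(value), status)
```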
In general, using a single standard code page for new installations and upgrades is the optimal decision
If additional languages or language combinations are needed, SAP recommends Unambiguous Blended Code Pages for new installations and MDMP for existing installations
Unambiguous Blended Code Pages only support certain language combinations, and therefore an MDMP setup may be the only possibility for new installations as well
Unicode installations are currently planned only with written permission of SAP and carried out as customer projects together with SAP, except for new installations of R/3 Enterprise Extension Set 2.0
mySAP CRM
  The Unicode version of mySAP CRM 4.0 is available via Ramp-Up
mySAP SRM
  The Unicode version of mySAP SRM 4.0 is available via Ramp-Up
  Conversions (with or without MDMP) of existing SRM installations
mySAP BW
  The Unicode version of mySAP BW 3.5 is available via Ramp-Up
  The conversion of existing BW installations as customer project
  SAP Note 643813 has a collection of all relevant SAP notes concerning Unicode-based SAP BW installations
Businesses that require IT systems to support multilingual data without any restrictions, e.g. customers with one worldwide central SAP system
Web interfaces open the door to a global customer base, and IT systems must consequently be able to support multiple local languages simultaneously
With J2EE integration, mySAP components fully support web standards, and with Unicode they can now take full advantage of XML and Java
Only Unicode makes it possible to seamlessly integrate heterogeneous SAP and non-SAP system landscapes
Technology in Depth
3. Technology in Depth
Unicode & Operating Systems
Unicode & Databases
SAP Unicode-based Code Pages
How to Unicode-enable a program
Unicode-enabled ABAP
Migrating to Unicode-enabled ABAP
Unicode Conversion, IMIG Lab Test
SAP System-to-System communication
Printing & Output Management
Unicode locales in the HP-UX operating environment are based on the UTF-8 format
Each locale includes a base language in the UTF-8 code set and the regional data related to this base language
This includes local formatting rules, text messages, help messages, and other related files
Each locale also supports several other scripts for input, display, code conversion, and printing
Unicode support has been included in Microsoft Windows since Windows 95 and Windows NT 4
Windows 2000 and Windows XP/2003 are based on Unicode instead of the ANSI or WGL4 character sets
Before Win2K, your version of Windows may have used a different character set if you live in a country such as Egypt, Greece, Israel, Russia or Thailand that uses a non-Latin alphabet
The first 128 characters were the same as in ANSI, but many of the places in the second set of 128 were taken by characters from the Arabic, Greek, Hebrew, Cyrillic or Thai alphabets
This caused and still causes problems when moving documents between operating systems such as DOS, Windows, Mac OS and UNIX, or when electronically exchanging documents that were created on computers using different character sets
Before UTF-8 emerged, Linux users all over the world had to use various different language-specific extensions of ASCII
Most popular were ISO 8859-1 and ISO 8859-2 in Europe, ISO 8859-7 in Greece, KOI-8 / ISO 8859-5 / CP1251 in Russia, EUC and Shift-JIS in Japan, BIG5 in Taiwan, etc.
This made the exchange of files difficult, and application software had to worry about various small differences between these encodings
Because of these difficulties, major Linux distributors and application developers have now started to phase out these older legacy encodings in favor of UTF-8
UTF-8 support has improved dramatically over the last few years, and ever more people now use UTF-8 on a daily basis in
  text files (source code, HTML files, email messages, etc.)
  file names
  standard input and standard output, pipes
In UTF-8 mode, terminal emulators (such as xterm) transform every keystroke into the corresponding UTF-8 sequence and send it to the stdin of the foreground process
Similarly, any output of a process on stdout is sent to the terminal emulator, where it is processed with a UTF-8 decoder and then displayed using a 16-bit font
Before you start experimenting with UTF-8 under Linux, update your installation to a recent distribution with up-to-date UTF-8 support
This is particularly the case if you use an installation older than SuSE 8.1 or Red Hat 8.0
Before these, UTF-8 support was far too limited and experimental to be recommendable for daily use
UCS and Unicode are first of all just code tables that assign integer numbers to characters
There exist several alternatives for how a sequence of such characters or their respective integer values can be represented as a sequence of bytes
The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte sequences
The official terms for these encodings are UCS-2 and UCS-4, respectively
Unless otherwise specified, the most significant byte comes first in these (Big Endian convention)
An ASCII or Latin-1 file can be transformed into a UCS-2 file by simply inserting a 0x00 byte in front of every ASCII byte
If we want to have a UCS-4 file, we have to insert three 0x00 bytes instead before every ASCII byte
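The zero-byte insertion described above can be written out directly. In this sketch the function names are hypothetical, and Python's `utf-16-be` and `utf-32-be` codecs stand in for big-endian UCS-2 and UCS-4, which is valid here because Latin-1 text contains no characters above U+00FF.

```python
def latin1_to_ucs2_be(data: bytes) -> bytes:
    """Insert one 0x00 byte in front of every input byte (big-endian UCS-2)."""
    return b"".join(b"\x00" + bytes([b]) for b in data)


def latin1_to_ucs4_be(data: bytes) -> bytes:
    """Insert three 0x00 bytes in front of every input byte (big-endian UCS-4)."""
    return b"".join(b"\x00\x00\x00" + bytes([b]) for b in data)


text = "Gr\u00fc\u00dfe"
latin1 = text.encode("latin-1")

# The padded bytes match what the built-in codecs produce.
assert latin1_to_ucs2_be(latin1) == text.encode("utf-16-be")
assert latin1_to_ucs4_be(latin1) == text.encode("utf-32-be")
```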
Character   UTF-8 / CESU-8   16-bit [Little Endian]   16-bit [Big Endian]
A           41               41 00                    00 41
Ä           C3 84            C4 00                    00 C4
α           CE B1            B1 03                    03 B1
א           D7 90            D0 05                    05 D0
晓          E6 99 93         53 66                    66 53
[Support matrix: operating systems (Win2K, HP-UX, Solaris, AIX, OS/400, OS/390, Linux) against databases (SQL Server, Oracle, DB2, SAP DB); legend: -- = unsupported in general; individual cell values are not recoverable]
Version   Encodings
2000      UTF-16
7.2       UTF-8
8         UTF-8
9i        UTF-8 / UTF-16
10g       UTF-8 / UTF-16
[Database column only partially preserved: DB2, SAP DB]
With the Unicode enablement of mySAP.com components (see chapter 1), the old code page management had to be changed
Instead of using SAP character numbers, all code pages are now based on Unicode character IDs
5-digit SAP character numbers are no longer adequate
The connection between SAP character number and Unicode character ID is found in table TCP01
You can see the connection in the SPAD character section
NOTE: not every character has a corresponding Unicode character ID!
The migration of all SAP code pages from the old to the new format was done using report RSCP0126
The definition of code pages is still in TCP00
Customers must migrate their own code pages (9xxx) themselves using RSCP0126!
[Diagram: one ABAP source for Non-Unicode R/3 (types C, N, D, T, STRING) and Unicode R/3 (types C, N, D, T, STRING)]
No explicit Unicode data type in ABAP
Single ABAP source for Unicode and non-Unicode systems
Part of the ABAP coding is ready for Unicode without any changes
Part of the ABAP coding has to be adapted to comply with Unicode restrictions (e.g. syntactical restrictions)
Minimize costs for Unicode enabling of ABAP programs
Main Features
Non-Unicode system
Adapt all ABAP programs to Unicode syntax and runtime restrictions
Set attribute "Unicode enabled" for all programs
Do runtime tests in Unicode system
  Check for runtime errors
  Look for semantic errors
  Check ABAP list layout with former double-byte characters
Errors:
  Untyped field symbols
  Offset with variable length
  Generic access to database tables
Set Unicode program attribute using UCCHECK or SE38 / SE24 / ...
Do additional checks with SLIN (e.g. matching of actual and formal parameters in function modules)
Upgrade to Unicode
With Unicode, there are no limitations on users, and all languages in the ISO 639 standard can be used
Unicode is technically supported as of Basis Release 6.20; see Note 0379940 for more information
A single code page system (standard or Unambiguous Blended Code Page) can be upgraded to Unicode using the normal upgrade method
For systems that support multiple languages, special emphasis needs to be placed on cross-language handling during the test phase. Correction tools are provided by SAP, which can be used in the case that conversion did not run properly.
Additional Tool: SAP Data Management - reducing the database size and growth To keep your database costs in check, the SAP Data Management service frees up valuable database resources by showing you how to reduce the size and growth of your database by typically 25 % (see details).
Set up the Unicode Conversion Project
  Check prerequisites
  Data analysis for downtime minimization
  Special MDMP treatment
  Enabling of customer developments

Highly automated
  System will be down during database conversion
  Unload/reload process for small databases
  Minimum downtime tool for large databases

Unicode system is up and running
  Verification of data consistency
  Integration testing focused on language handling
[Diagram: R/3 3.1i, R/3 4.5b, R/3 4.6b, R/3 4.6c; R/3 Enterprise; Conversion; Unicode]
First upgrade, then conversion to Unicode
R/3 Enterprise Ramp-Up started 2002-07
Unicode availability follows a phase of restricted shipment with pilot customers
[Diagram: BW 2.0B, BW 3.0; BW 3.1; Conversion; Unicode]
Interfacing R/3 MDMP on a project basis only
Unicode BEXGUI restrictions apply
First upgrade, then conversion to Unicode
BW 3.1 Ramp-Up starting 2002-12
[Diagram: CRM 2.0C, CRM 3.0; CRM 3.1; Conversion; Unicode]
Selected scenarios only; cooperation with SAP GBU CRM required
First upgrade, then conversion to Unicode
CRM 3.1 Ramp-Up starting 2002-12
OSS Note 548016: Conversion from Unicode to non-Unicode is not possible. The Unicode conversion of MDMP and also Ambiguous Code Page systems (code page numbers 6100, 6200 and 6500) is only supported on a project basis with SAP involvement
OSS Note 543715: The Unicode conversion of a BW 3.1 system requires additional steps regarding the system copy
OSS Note 573044: If you are using HR functionality within R/3 Enterprise, additional steps are also mandatory
Whitepaper:
SAP R/3 incremental migration test
http://saphpcc.bbn.hp.com/Global/Compet/migration/migration.HTM
Only one source code exists for Unicode-based and non-Unicode-based systems, so new developments can be smoothly exchanged
The interfaces (e.g. RFC) have been extended, so that communication with other Unicode-based systems or non-Unicode-based systems is possible
Furthermore, SAP provides standard tools for the installation of (and conversion to) Unicode-based systems that can also be used for checking and Unicode-enabling customer developments
Solid lines: receiver can receive all characters
Dotted lines: receiver cannot receive characters that are not in its own code page
But as long as you restrict the character set, data can be sent from everywhere to everywhere
[Diagram: Unicode R/3, MDMP R/3, Non-Unicode R/3 and WWW, connected via http/RFC; code pages shown: Latin-1, SJIS]
The Unicode side converts from/to the code page of the non-Unicode side
MDMP is converted with a language key
System settings allow the configuration of error handling
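As an analogy for conversion on the Unicode side with configurable error handling, the sketch below converts text to a receiver's code page using Python's codec error modes. The function `send_to_non_unicode` and its parameters are hypothetical and do not reflect SAP's RFC settings; they only show the kinds of choices such a setting has to make.

```python
def send_to_non_unicode(text: str, receiver_codec: str, on_error: str = "strict") -> bytes:
    """Convert Unicode text to the receiver's code page before sending.

    on_error uses Python's codec error modes: "strict" raises an exception,
    "replace" substitutes "?", "ignore" drops characters the receiver lacks.
    """
    return text.encode(receiver_codec, errors=on_error)


message = "Gr\u00fc\u00dfe \u65e5\u672c"

print(send_to_non_unicode(message, "latin-1", "replace"))
print(send_to_non_unicode(message, "latin-1", "ignore"))

try:
    send_to_non_unicode(message, "latin-1")  # strict mode
except UnicodeEncodeError as exc:
    print("conversion rejected:", exc.reason)
```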
Device type: configuration file for the SAP printer driver that ensures proper functionality between the SAP data stream and the printer or output device to which the data is sent
In R/3, a distinction is made between "printer driver" and "device type"
A device type consists of a variety of attributes defined for an output device
One of these attributes is the printer driver to be used by SAPscript (the R/3 forms processor) for this particular printer
Device types cover aspects such as control commands for font selection, page size, character set selection, character set used, and so on
A device type must be specified for every new printer defined in the SAP environment to enable direct printing from the SAP applications
Device types are created by SAP for the entire HP LaserJet printer family on the basis of PCL5, PCL6 and PostScript
SAP develops, tests and supports device types for HP products, which can be found here:
http://h40045.www4.hp.com/printing_solutions/Device_Types.html
LEXMARK is going into HP accounts, claiming that only LEXMARK can support SAP UNICODE printing
In order to support UNICODE character sets on an HP printer, customers need a UNICODE-compliant printer and an SAP UNICODE device type
A UNICODE-compliant printer is defined by firmware support for UTF-8 and/or UTF-16 and by UNICODE fonts loaded on the printer
Today LEXMARK is the preferred vendor for SAP UNICODE printing
Background:
All OZ-based printers (LJ2300 and higher) support UNICODE UTF-16 fonts in PCL6 by default
The LJ2300, CLJ9500 and future products will support UTF-8 fonts in PCL5
A firmware roll is planned so that all current OZ-based printers (LJ4200/4300, LJ9000, CLJ4600, CLJ5500) also support UTF-8 in PCL5
Furthermore, the UNICODE fonts need to be loaded on the printer (e.g. stored on the internal hard disk)
Today we have a UNICODE prototype solution available to print from an SAP environment
For more information, contact Alan Cooke (U.S.) or Stephen Westberg (EMEA)
Encoding   Manufacturers
UTF-8      Oracle, SAP DB (8.0)
CESU-8     DB/2 (AIX)
UTF-16     SQL Server, DB/2 (AS400), SAP DB (7.0)
Network load (draft results): <7% for Latin-1, about 15% for Japanese, 25% for other Asian languages
[Chart: Non-Unicode vs. Unicode resource requirements; Non-Unicode baseline = 1 for CPU, Memory and Disk; Unicode values shown: +20% and +5% for CPU and Memory, +10% for Disk]