|
DB2 UDB for z/OS V8 AND UNICODE:
IT’S NO TOWER OF BABEL
Introduction.
DB2 UDB for z/OS Version 8 is the largest, most
profound, highest impact version in the product’s history. Consider:
-
There
are more new lines of code in DB2 V8 than there were total lines of code
in DB2 Version 1.1.
-
Some
individuals say migrating to DB2 V8 is a migration equivalent to DB2 V6
and V7 combined.
-
The
Quality Partnership Program – a beta program – lasted 14 months, the
longest ever for DB2.
-
There
are more changes to the catalog than any other release or version of DB2.
-
DB2
V8 requires 64-bit addressing. A
DB2 V8 address space can be up to 16 exobytes, 8 billion times the storage
available with V7 and prior.
-
DB2
V8 entities have been increased by magnitudes.
For example:
-
Maximum
table size has been increased from 16 terabytes to 128 terabytes.
-
Table
names can now be up to 128 bytes instead of 18 bytes.
-
The
number of partitions for a table has been increased from 254 to 4096.
-
SQL
statement length can now be up to 2 MB instead of 32KB.
-
Archive
logs have been increased to 10,000 from 1,000.
-
And
the list goes on and on.
-
The
installation methodology of DB2 has – optionally – been re-worked
through the implementation of msys for Setup as an installation technique.
Yet
for all these product changes and enhancements, one of the largest
modifications in DB2 V8 is the implementation of Unicode.
While earlier versions of DB2 had Unicode capability, DB2 V8 has
extensively integrated character conversion into nearly every aspect of
the product. For example:
Once
DB2 V8 is moved to New-Function Mode, DB2 catalog information is almost
exclusively in Unicode. This
may cause the catalog to increase in size.
-
The
DB2 parser operates exclusively in Unicode.
-
The
DB2 precompiler generates Unicode DBRMs in NFM.
-
Data
in user tables can optionally be stored in Unicode.
-
SQL
statements using multiple CCSIDs usually perform string value comparison
in Unicode.
-
SQL
predicates’ collating sequence may be in Unicode, a collating sequence
different than EBCDIC.
-
Especially
if you convert DB2 data bases to Unicode, storage requirements may
increase.
History
and Data Types.
The
Unicode data type and standard were created in the early 1990’s by 2
organizations – the International Organization for Standardization (ISO)
10646 and the Unicode project – to standardize and consolidate all of
the data types that had evolved as information technology evolved. Some of these disparate data types are:
An
8-bit EBCDIC data representation on S390 and AS400 platforms, and a 7-bit
ASCII representation on Intel and UNIX platforms.
There are many ASCII representations, such as PC ASCII, AIX ASCII,
and Windows ASCII. The
representation used on a particular platform is called the encoding
scheme.
-
A
Single Byte Character Set (SBCS) used for most traditional alphanumeric
data, a Double Byte Character Set used for graphic data and the Korean,
Chinese, and Japanese character sets, and Mixed Data that combines both
SBCS and DBCS (which is specified on the DB2 DSNTIPF installation panel).
-
National
languages, which use different special characters such as phonetic symbols
(like the German Umlaut), or different alphabets (like Russian).
-
Ideographic
and symbol data types in addition to alphanumeric data types.
-
Binary,
hexadecimal, packed decimal, and zoned decimal numeric data types.
Since
developing 2 sets of standards for Unicode would have created a cumbersome
situation, ISO and the Unicode project joined forces to create a single
standard. The current
standard, Version 4.0 (ISO 10646-2:2003), contains 96,248 graphic and
format characters, with 3 forms of the Unicode Transformation Standard:
Unicode
will continue to expand and evolve, especially as ancient and obsolete
languages are added to the standard.
However, no existing characters will be deleted from the current
standard.
Terms
and Concepts.
It
is necessary to understand some basic aspects of Character Conversion:
Code
Page
– A code page is a matrix whereby a character is assigned a unique value
– or code point – within a code page.
For example, in the U.S. EBCDIC Code Page 37, “A” maps to
x’C1’ (hexadecimal C1), “B” maps to x’C2’, “J” maps to
x’D1’, “0” maps to x’F0’, etc.
Here’s how it looks:
Code
Point
– A code point is a specific position within a code page matrix that
maps a character to a value.
CCSID
– CCSID stands for Coded Character Set Identifier, a unique value that
identifies a code page. For
example, CCSID 37 equates to U.S. EBCDIC, CCSID 437 equates to U.S. ASCII
PC, CCSID 819 equates to U.S. AIX ASCII, and CCSID 1208 equates to
Unicode. DB2 also uses a
System CCSID (SCCSID), specified on the DSNTIPF Installation Panel, and
stored in DSNHDECP. This is
the default CCSID for the DB2 system.
CCSID
Set
– In the case of the Chinese, Japanese, and Korean languages, both SBCS
and DBCS characters are required to represent the entire language due to
the range and complexity of their symbols.
This requires a combination of CCSIDs (one for SBCS, one for mixed
data, and one for DBCS) to define the entire language; the group of CCSIDs
is collectively referred to as a CCSID Set.
Encoding
Scheme
– As mentioned above, either 8-bit EBCDIC or 7-bit ASCII is used as the
encoding scheme for a particular platform.
The 2 data representations have different collating sequences:
EBCDIC collates in the sequence of space, lower case letters, capital
letters, and numbers, while ASCII collates in the sequence of space,
numbers, capital letters, and lower case letters.
Application
Encoding Scheme
– This is the default encoding scheme that determines how application
programs “see” the data. Terminal
emulators and programs must all use the same encoding scheme to compare
and manipulate data – unless CCSID is appropriately specified in SQL
statements – to be consistent.
Application
encoding schemes can be specified via the following facilities:
-
The
default DB2 application system encoding scheme (SCCSID) as specified in
DSNHDECP.
-
Force
all users to use the same CCSID on their terminal emulators
-
Specifying
the ENCODING BIND parameter when a program is bound.
-
Specifying
the CCSID in the SQLDA.
-
Specifying
CCSID when executing a DECLARE VARIABLE.
-
Set
the CURRENT APPLICATION CODING SCHEME special register.
Data
Storage Encoding Scheme
– The data storage encoding scheme determines what CCSID is used to
store the data, (US EBCDIC, German ASCII, Unicode, etc.).
Table spaces within a DB2 data base can use different CCSIDs, but
all tables within a table space must use the same CCSID.
Data
storage encoding schemes can be determined via the following facilities:
-
The
ENSCHEME value in DSNHDECP (default is EBCDIC). This is the default used in the CREATE DATABASE, CREATE
TABLESPACE, CREATE DISTINCT TYPE, CREATE FUNCTION, CREATE GLOBAL TEMPORARY
TABLE, and DECLARE GLOBAL TEMPORARY TABLE DDL statements.
-
Specifying
CCSID on the CREATE DATABASE DDL statement.
-
Specifying
CCSID on the CREATE TABLESPACE DDL statement.
-
Specifying
CCSID on the CREATE TABLE DDL statement.
-
Specifying
CCSID on the CREATE PROCEDURE DDL statement.
Character
Conversion Implementation.
Character
conversion is necessary when different data types are processed in a given
transaction, program, or stored procedure.
Data inputted via a terminal emulator may use a different code page
than the data in a table used for comparison or storage. Data in one table may use a different CCSID than data in
another table that it is processed against.
Data in one server may use a different code page than the server it
is transferred to. Through
the proper use of encoding schemes and CCSIDs using the above methods, DB2
has the information needed to convert data to a different type, in both
intermediate and final steps. The
number of scenarios are many, depending on the DB2 functions involved
(processing tables versus views, use of ORDER BY, CAST, BETWEEN, etc.) and
the number of different data types involved in a process.
In fact, the number of character conversions is so numerous that
it’s beyond the scope of this article.
A good discussion of this topic can be found in DB2
UDB for z/OS Version 8: Everything You Ever Wanted To Know…, And More
– Chapter 6: Unicode in DB2 for z/OS, at http://www.redbooks.ibm.com/redbooks/pdfs/sg246079.pdf.
There
are 2 different techniques for performing character conversion: use of
SYSIBM.SYSTRINGS, or z/OS Conversion Services.
SYSIBM.SYSSTRINGS is the original character conversion technique,
but starting with DB2 V7, z/OS Conversion Services is another option, and
is the recommended method. z/OS
Conversion Services must be set up for DB2, and APAR OA04069 provides a
default way to do this. OA04069
requires 9,862 fixed, real pages, or 39 Megabytes, to store 96,248
characters, because it provides for all languages, encoding schemes, etc.
Most customized conversion images take around 2 Megabytes, because
only a few languages and encoding schemes are usually needed, so use of
customized conversion images is recommended.
In addition, character conversion performance is significantly
improved with z/OS V1.4 – as opposed to z/OS V1.3 – so V1.4 is highly
recommended. Another
character conversion performance enhancement is the fact that DB2 UDB V8
will perform inline code conversion in some cases.
Lastly, the zSeries line of processors has new hardware
instructions that are used in character conversion, further improving
performance.
DASD
and real storage requirements are another consideration, because they will
increase by some amount when Unicode is used.
Converting EBCDIC or ASCII to Unicode will cause certain 1-byte
characters to require 2 bytes in Unicode.
This is called an expanding conversion (a contracting conversion is
where a 2 byte Unicode character is a single byte in ASCII or EBCDIC).
Any ASCII character over x’127’ (or EBCDIC equivalent) will
take 2 bytes in Unicode. Since
the space, 0-9, a-z, A-Z and most punctuation and symbols (., !, $, #,
etc.) are all represented in Unicode by a single byte, conversion of these
characters will not increase storage requirements.
Only special characters such as the cent, English pound, and
pronunciation characters such as the German Umlaut take 2 bytes. Languages
such as German and French are impacted to a greater degree than U.S.
English.
Application
Programming Considerations and Impact.
Collating
sequences are impacted when converting data to Unicode.
In the case of ASCII, this impact is minimal, because the first 127
code points are identical between ASCII and Unicode.
EBCDIC is another story, however.
While the character sequence in both Unicode and ASCII is space,
0-9, A-Z, then a-z, the character sequence in EBCDIC is space, a-z, A-Z,
and 0-9. As a consequence,
the collating sequence of an ORDER BY, BETWEEN, MIN/MAX, and other similar
SQL functions are different with Unicode than with EBCDIC.
However, this is only in the situation where the columns being
collated contain a mix of numeric, upper case and lower case alphabetic
characters (and possibly symbols). One
area that may be immediately impacted after you move to New-Function Mode
is queries against the DB2 catalog, because it is converted to Unicode in
the NFM migration step. Queries against existing data tables won’t be affected
until they are converted to Unicode.
A
technique you can use to cause Unicode data to be collated as EBCDIC data
is through the use of CAST, although doing so makes it impossible to use
an index to replace the ORDER BY in the following statement:
SELECT CAST(column_name AS CHAR(nn) CCSID EBCDIC) AS
new_column_name
FROM Unicode_table_name
WHERE column_name LIKE
‘xxxxx’
ORDER BY new_column_name
Here
are some other application considerations:
-
Special
characters, such as $, @, #, |, etc. vary by EBCDIC code page.
SQL statements containing these characters are interpreted
differently depending on the CCSID or encoding scheme that is specified.
-
Using
DECLARE VARIABLE with the correct CCSID is one way to avoid this
situation.
-
Processing
these characters in hexadecimal eliminates this problem.
-
Unicode
eliminates this problem.
-
In
all modes (CM, ENFM, NFM), the DB2 precompiler converts program source to
UTF-8, precompiles the source, then converts it back to its original
format. The resultant DBRM
may be in EBCDIC or Unicode.
-
The
code page of the source code is either the default (SCCSID) or the CCSID
precompiler option.
-
The
default DBRM data type in CM and ENFM mode is EBCDIC, Unicode in NFM mode.
NEWFUN(NO|YES) can override this.
NO = EBCDIC, YES = Unicode, plus new SQL functions can be used (NFM
mode only).
-
The
DBRM and Package/Plan in SYSIBM.SYSSTMT and SYSIBM.SYSPACKSTMT are more
difficult to read when they’re in Unicode.
-
Visual
Explain, the DB2 Administration Tool, or an ASCII emulator makes this
information legible.
-
Downloading
the data and using NOTEPAD to view it makes it legible.
-
CAST
can be used to make this data legible.
-
Certain
Functions and Routines are changed.
-
UX
(Unicode UTF-16 hexadecimal string) and GX (Unicode Graphic Hexadecimal
string) are new data types. Each
must be a multiple of 4 digits.
-
LENGTH
returns the actual storage length, not the number of characters in a
string.
-
CHARACTER_LENGTH
is a new function that returns the number of characters in a string.
-
POSITION
and SUBSTRING are also new functions.
-
CODEUNITS16,
CODEUNITS32, and OCTETS are new keywords for these new functions.
-
Multiple
CCSIDs can be used in a single SQL statement.
-
Unicode,
ASCII, and EBCDIC data from different languages can be used in a single
SQL statement.
-
The
default encoding scheme is used for data without a specified CCSID.
-
SBCS,
mixed, and DBCS data can be compared in a single SQL statement.
SYSIBM.SYSCOLUMNS determines the data type of a column.
-
An
encoding scheme or CCSID can change the data type of columns used in a
view. WARNING: this can cause different results from a view than working
directly with the tables. This
is implemented via a new column, APP_ENCODING_CCSID, in SYSIBM.SYSVIEWS.
-
Unicode
intermediate and final answer sets – which may be different than the
EBCDIC equivalents – can affect the results of a JOIN.
-
The
sequence in which the CCSIDs are specified for columns and string data
determines the character conversions that are performed by DB2 before
doing comparisons.
References.
Summary.
With
DB2 UDB for z/OS V8, Unicode is something you must understand.
With a parser and catalog that are both Unicode, V8 is
fundamentally different than V7 and prior releases.
As more and more organizations become global entities, the value of
Unicode will become abundantly apparent.
|