HOME
THE FIRM
Why BCS?
CLIENT LIST
Capabilities
Expertise
Proven Services
SUCCESS STORIES
ARTICLES
PRESENTATIONS

DB2 UDB for z/OS V8 AND UNICODE:
IT’S NO TOWER OF BABEL

Introduction.

DB2 UDB for z/OS Version 8 is the largest, most profound, highest impact version in the product’s history.  Consider:

  • There are more new lines of code in DB2 V8 than there were total lines of code in DB2 Version 1.1.

  • Some individuals say migrating to DB2 V8 is a migration equivalent to DB2 V6 and V7 combined.

  • The Quality Partnership Program – a beta program – lasted 14 months, the longest ever for DB2.

  • There are more changes to the catalog than any other release or version of DB2.

  • DB2 V8 requires 64-bit addressing.  A DB2 V8 address space can be up to 16 exobytes, 8 billion times the storage available with V7 and prior.

  • DB2 V8 entities have been increased by magnitudes.  For example:

    • Maximum table size has been increased from 16 terabytes to 128 terabytes.

    • Table names can now be up to 128 bytes instead of 18 bytes.

    • The number of partitions for a table has been increased from 254 to 4096.

    • SQL statement length can now be up to 2 MB instead of 32KB.

    • Archive logs have been increased to 10,000 from 1,000.

    •  And the list goes on and on.

  • The installation methodology of DB2 has – optionally – been re-worked through the implementation of msys for Setup as an installation technique.

Yet for all these product changes and enhancements, one of the largest modifications in DB2 V8 is the implementation of Unicode.  While earlier versions of DB2 had Unicode capability, DB2 V8 has extensively integrated character conversion into nearly every aspect of the product.  For example:

Once DB2 V8 is moved to New-Function Mode, DB2 catalog information is almost exclusively in Unicode.  This may cause the catalog to increase in size.

  • The DB2 parser operates exclusively in Unicode.

  • The DB2 precompiler generates Unicode DBRMs in NFM.

  • Data in user tables can optionally be stored in Unicode.

  • SQL statements using multiple CCSIDs usually perform string value comparison in Unicode.

  • SQL predicates’ collating sequence may be in Unicode, a collating sequence different than EBCDIC.

  • Especially if you convert DB2 data bases to Unicode, storage requirements may increase.

History and Data Types.

The Unicode data type and standard were created in the early 1990’s by 2 organizations – the International Organization for Standardization (ISO) 10646 and the Unicode project – to standardize and consolidate all of the data types that had evolved as information technology evolved.  Some of these disparate data types are:

An 8-bit EBCDIC data representation on S390 and AS400 platforms, and a 7-bit ASCII representation on Intel and UNIX platforms.  There are many ASCII representations, such as PC ASCII, AIX ASCII, and Windows ASCII.  The representation used on a particular platform is called the encoding scheme.

  • A Single Byte Character Set (SBCS) used for most traditional alphanumeric data, a Double Byte Character Set used for graphic data and the Korean, Chinese, and Japanese character sets, and Mixed Data that combines both SBCS and DBCS (which is specified on the DB2 DSNTIPF installation panel).

  • National languages, which use different special characters such as phonetic symbols (like the German Umlaut), or different alphabets (like Russian).

  • Ideographic and symbol data types in addition to alphanumeric data types.

  • Binary, hexadecimal, packed decimal, and zoned decimal numeric data types.

Since developing 2 sets of standards for Unicode would have created a cumbersome situation, ISO and the Unicode project joined forces to create a single standard.  The current standard, Version 4.0 (ISO 10646-2:2003), contains 96,248 graphic and format characters, with 3 forms of the Unicode Transformation Standard:

Unicode will continue to expand and evolve, especially as ancient and obsolete languages are added to the standard.  However, no existing characters will be deleted from the current standard.

Terms and Concepts.

It is necessary to understand some basic aspects of Character Conversion:

Code Page – A code page is a matrix whereby a character is assigned a unique value – or code point – within a code page.  For example, in the U.S. EBCDIC Code Page 37, “A” maps to x’C1’ (hexadecimal C1), “B” maps to x’C2’, “J” maps to x’D1’, “0” maps to x’F0’, etc.  Here’s how it looks:

                 

Code Point – A code point is a specific position within a code page matrix that maps a character to a value.

CCSID – CCSID stands for Coded Character Set Identifier, a unique value that identifies a code page.  For example, CCSID 37 equates to U.S. EBCDIC, CCSID 437 equates to U.S. ASCII PC, CCSID 819 equates to U.S. AIX ASCII, and CCSID 1208 equates to Unicode.  DB2 also uses a System CCSID (SCCSID), specified on the DSNTIPF Installation Panel, and stored in DSNHDECP.  This is the default CCSID for the DB2 system.

CCSID Set – In the case of the Chinese, Japanese, and Korean languages, both SBCS and DBCS characters are required to represent the entire language due to the range and complexity of their symbols.  This requires a combination of CCSIDs (one for SBCS, one for mixed data, and one for DBCS) to define the entire language; the group of CCSIDs is collectively referred to as a CCSID Set.

Encoding Scheme – As mentioned above, either 8-bit EBCDIC or 7-bit ASCII is used as the encoding scheme for a particular platform.  The 2 data representations have different collating sequences: EBCDIC collates in the sequence of space, lower case letters, capital letters, and numbers, while ASCII collates in the sequence of space, numbers, capital letters, and lower case letters.

Application Encoding Scheme – This is the default encoding scheme that determines how application programs “see” the data.  Terminal emulators and programs must all use the same encoding scheme to compare and manipulate data – unless CCSID is appropriately specified in SQL statements – to be consistent.

Application encoding schemes can be specified via the following facilities:

  • The default DB2 application system encoding scheme (SCCSID) as specified in DSNHDECP.

  • Force all users to use the same CCSID on their terminal emulators

  • Specifying the ENCODING BIND parameter when a program is bound.

  • Specifying the CCSID in the SQLDA.

  • Specifying CCSID when executing a DECLARE VARIABLE.

  • Set the CURRENT APPLICATION CODING SCHEME special register.

Data Storage Encoding Scheme – The data storage encoding scheme determines what CCSID is used to store the data, (US EBCDIC, German ASCII, Unicode, etc.).  Table spaces within a DB2 data base can use different CCSIDs, but all tables within a table space must use the same CCSID.

Data storage encoding schemes can be determined via the following facilities:

  • The ENSCHEME value in DSNHDECP (default is EBCDIC).  This is the default used in the CREATE DATABASE, CREATE TABLESPACE, CREATE DISTINCT TYPE, CREATE FUNCTION, CREATE GLOBAL TEMPORARY TABLE, and DECLARE GLOBAL TEMPORARY TABLE DDL statements.

  • Specifying CCSID on the CREATE DATABASE DDL statement.

  • Specifying CCSID on the CREATE TABLESPACE DDL statement.

  • Specifying CCSID on the CREATE TABLE DDL statement.

  • Specifying CCSID on the CREATE PROCEDURE DDL statement.

Character Conversion Implementation.

Character conversion is necessary when different data types are processed in a given transaction, program, or stored procedure.  Data inputted via a terminal emulator may use a different code page than the data in a table used for comparison or storage.  Data in one table may use a different CCSID than data in another table that it is processed against.  Data in one server may use a different code page than the server it is transferred to.  Through the proper use of encoding schemes and CCSIDs using the above methods, DB2 has the information needed to convert data to a different type, in both intermediate and final steps.  The number of scenarios are many, depending on the DB2 functions involved (processing tables versus views, use of ORDER BY, CAST, BETWEEN, etc.) and the number of different data types involved in a process.  In fact, the number of character conversions is so numerous that it’s beyond the scope of this article.  A good discussion of this topic can be found in DB2 UDB for z/OS Version 8: Everything You Ever Wanted To Know…, And More – Chapter 6: Unicode in DB2 for z/OS, at http://www.redbooks.ibm.com/redbooks/pdfs/sg246079.pdf.

There are 2 different techniques for performing character conversion: use of SYSIBM.SYSTRINGS, or z/OS Conversion Services.  SYSIBM.SYSSTRINGS is the original character conversion technique, but starting with DB2 V7, z/OS Conversion Services is another option, and is the recommended method.  z/OS Conversion Services must be set up for DB2, and APAR OA04069 provides a default way to do this.  OA04069 requires 9,862 fixed, real pages, or 39 Megabytes, to store 96,248 characters, because it provides for all languages, encoding schemes, etc.  Most customized conversion images take around 2 Megabytes, because only a few languages and encoding schemes are usually needed, so use of customized conversion images is recommended.  In addition, character conversion performance is significantly improved with z/OS V1.4 – as opposed to z/OS V1.3 – so V1.4 is highly recommended.  Another character conversion performance enhancement is the fact that DB2 UDB V8 will perform inline code conversion in some cases.  Lastly, the zSeries line of processors has new hardware instructions that are used in character conversion, further improving performance.

DASD and real storage requirements are another consideration, because they will increase by some amount when Unicode is used.  Converting EBCDIC or ASCII to Unicode will cause certain 1-byte characters to require 2 bytes in Unicode.  This is called an expanding conversion (a contracting conversion is where a 2 byte Unicode character is a single byte in ASCII or EBCDIC).  Any ASCII character over x’127’ (or EBCDIC equivalent) will take 2 bytes in Unicode.  Since the space, 0-9, a-z, A-Z and most punctuation and symbols (., !, $, #, etc.) are all represented in Unicode by a single byte, conversion of these characters will not increase storage requirements.  Only special characters such as the cent, English pound, and pronunciation characters such as the German Umlaut take 2 bytes. Languages such as German and French are impacted to a greater degree than U.S. English.

Application Programming Considerations and Impact.

Collating sequences are impacted when converting data to Unicode.  In the case of ASCII, this impact is minimal, because the first 127 code points are identical between ASCII and Unicode.  EBCDIC is another story, however.  While the character sequence in both Unicode and ASCII is space, 0-9, A-Z, then a-z, the character sequence in EBCDIC is space, a-z, A-Z, and 0-9.  As a consequence, the collating sequence of an ORDER BY, BETWEEN, MIN/MAX, and other similar SQL functions are different with Unicode than with EBCDIC.  However, this is only in the situation where the columns being collated contain a mix of numeric, upper case and lower case alphabetic characters (and possibly symbols).  One area that may be immediately impacted after you move to New-Function Mode is queries against the DB2 catalog, because it is converted to Unicode in the NFM migration step.  Queries against existing data tables won’t be affected until they are converted to Unicode.

A technique you can use to cause Unicode data to be collated as EBCDIC data is through the use of CAST, although doing so makes it impossible to use an index to replace the ORDER BY in the following statement: 

   SELECT CAST(column_name AS CHAR(nn) CCSID EBCDIC) AS new_column_name                    FROM Unicode_table_name  
            
       WHERE column_name LIKE ‘xxxxx’  
            
   ORDER BY new_column_name

Here are some other application considerations:

  • Special characters, such as $, @, #, |, etc. vary by EBCDIC code page.  SQL statements containing these characters are interpreted differently depending on the CCSID or encoding scheme that is specified.

    • Using DECLARE VARIABLE with the correct CCSID is one way to avoid this situation.

    • Processing these characters in hexadecimal eliminates this problem.

    • Unicode eliminates this problem.

  • In all modes (CM, ENFM, NFM), the DB2 precompiler converts program source to UTF-8, precompiles the source, then converts it back to its original format.  The resultant DBRM may be in EBCDIC or Unicode.

    • The code page of the source code is either the default (SCCSID) or the CCSID precompiler option.

    •       The default DBRM data type in CM and ENFM mode is EBCDIC, Unicode in NFM mode.  NEWFUN(NO|YES) can override this.  NO = EBCDIC, YES = Unicode, plus new SQL functions can be used (NFM mode only).

    •       The DBRM and Package/Plan in SYSIBM.SYSSTMT and SYSIBM.SYSPACKSTMT are more difficult to read when they’re in Unicode.

      • Visual Explain, the DB2 Administration Tool, or an ASCII emulator makes this information legible.

      • Downloading the data and using NOTEPAD to view it makes it legible.

      • CAST can be used to make this data legible.

  • Certain Functions and Routines are changed.

    • UX (Unicode UTF-16 hexadecimal string) and GX (Unicode Graphic Hexadecimal string) are new data types.  Each must be a multiple of 4 digits.

    • LENGTH returns the actual storage length, not the number of characters in a string.

    • CHARACTER_LENGTH is a new function that returns the number of characters in a string.

    • POSITION and SUBSTRING are also new functions.

    • CODEUNITS16, CODEUNITS32, and OCTETS are new keywords for these new functions.

  • Multiple CCSIDs can be used in a single SQL statement.

    • Unicode, ASCII, and EBCDIC data from different languages can be used in a single SQL statement.

    • The default encoding scheme is used for data without a specified CCSID.

    • SBCS, mixed, and DBCS data can be compared in a single SQL statement.  SYSIBM.SYSCOLUMNS determines the data type of a column.

    • An encoding scheme or CCSID can change the data type of columns used in a view.  WARNING: this can cause different results from a view than working directly with the tables.  This is implemented via a new column, APP_ENCODING_CCSID, in SYSIBM.SYSVIEWS.

    • Unicode intermediate and final answer sets – which may be different than the EBCDIC equivalents – can affect the results of a JOIN.

    • The sequence in which the CCSIDs are specified for columns and string data determines the character conversions that are performed by DB2 before doing comparisons.

References.

Title

ID

URL

Unicode Website

  http://www.unicode.org

Unicode code charts

  http://www.unicode.org/charts

Unicode downloadable book

  http://www.unicode.org/unibook

DB2 V8: Everything You Ever Wanted To Know,… and More: Chapter 6 – Unicode in DB2 for z/OS

SG24-6079 http://www.redbooks.ibm.com/redbooks/pdfs/sg246079.pdf
DB2 V8 SQL Reference: Section 1.18 – Character Conversion SC18-7426

 

http://publibfp.boulder.ibm.com/cgi-bin/bookmgr/BOOKS/DSNSQJ10/1.18

 

DB2 V8 Installation Guide: Appendix A – Character Conversion GC18-7418

 

http://publibfp.boulder.ibm.com/cgi-bin/bookmgr/BOOKS/DSNIGJ10/APPENDIX1.1
DB2 V8 Release Planning Guide: Section 2.3 – Unicode Enhancements SC18-7425

 

http://publibfp.boulder.ibm.com/cgi-bin/bookmgr/BOOKS/DSNRGJ10/2.3

 

DB2 V8 Technical Preview, Appendix A – Unicode Definitions

SG24-6871 http://www.redbooks.ibm.com/redbooks/pdfs/sg246871.pdf

 

z/OS Support for Unicode™. Using Conversion Services

SA22-7649

 

http://publibfp.boulder.ibm.com/cgi-bin/bookmgr/BOOKS/IEA2UN20/CCONTENTS

DB2 V7 and V8 Unicode Support

  ftp://ftp.software.ibm.com/software/db2storedprocedure/db2zos390/techdocs/F10.pdf
Unicode Performance in DB2 for z/OS   http://www.idug.org/idug/member/journal/aug04/article04.cfm
All For One and One For All – Introducing Unicode (Part 1)   http://www.idug.org/idug/member/journal/nov03/article06.cfm

 

All For One and One For All – Introducing Unicode (Part 2)   http://www.idug.org/idug/member/journal/may04/article03.cfm

 

Summary.

With DB2 UDB for z/OS V8, Unicode is something you must understand.  With a parser and catalog that are both Unicode, V8 is fundamentally different than V7 and prior releases.  As more and more organizations become global entities, the value of Unicode will become abundantly apparent.

 

 

Send mail to solution@mainframehelp.com with questions or comments about this web site.
Copyright © 2006 Best Customer Solutions, Inc.
Last modified: 06/13/2006