Bug ID: JDK-6183404 Many eudc characters are incorrectly mapped in MS936 and GBK converter

Details
Type: Bug
Submit Date: 2004-10-22
Status: Closed
Updated Date: 2013-06-26
Project Name: JDK
Resolved Date: 2012-06-12
Component: core-libs
OS: windows_2000
Sub-Component: java.nio.charsets
CPU: x86
Priority: P4
Resolution: Fixed
Affected Versions: 6
Fixed Versions:

Description
Some 538 UDC (user-defined characters) are wrongly mapped, and some 412 non-UDC characters are missing, in Java's MS936 converter.

For example, 0xA7A0 is supposed to be mapped to U+E765 in Unicode (UTF-8 bytes 0xEE9DA5), whereas Java's MS936 maps it to U+E79F.
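
To see what a given runtime does with this byte pair, a minimal sketch along the following lines may help. It assumes the converter is reachable under the java.nio names "x-mswin-936" (for MS936) and "GBK", and the class and method names are hypothetical. It prints the decoded code point and its UTF-8 form so the result can be compared with the U+E765 expected by this report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EudcDecodeCheck {
    // UTF-8 encoding of a single code point as upper-case hex.
    static String utf8Hex(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode one double-byte sequence with the named charset and describe
    // the first resulting code point. Charsets that are not installed in
    // this runtime are reported rather than treated as an error.
    static String describe(String charsetName, int lead, int trail) {
        if (!Charset.isSupported(charsetName)) {
            return charsetName + ": not available in this runtime";
        }
        byte[] bytes = { (byte) lead, (byte) trail };
        String decoded = new String(bytes, Charset.forName(charsetName));
        int cp = decoded.codePointAt(0);
        return String.format("%s: 0x%02X%02X -> U+%04X (UTF-8 %s)",
                charsetName, lead, trail, cp, utf8Hex(cp));
    }

    public static void main(String[] args) {
        // The report expects U+E765; the affected converter produced U+E79F.
        System.out.println(describe("x-mswin-936", 0xA7, 0xA0));
        System.out.println(describe("GBK", 0xA7, 0xA0));
    }
}
```

The output depends on the JDK build in use, so no expected value is shown here.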

A comparison of the MS936 and GBK mappings indicates that Microsoft code page 936 differs slightly from GBK in its UDC mappings. However, Java's implementation of MS936 appears to be the same as GBK.
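
A comparison of this kind can be approximated from within Java by decoding every double-byte code in a range with two charsets and listing the codes whose results differ. The sketch below is illustrative only: the class and method names are hypothetical, the range is an assumption, and it compares the two Java converters with each other, not with the Microsoft tables.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class CharsetDiff {
    // Render every code point of s as "U+XXXX", separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // One entry per double-byte code whose decoding differs between a and b.
    // Byte pairs that are invalid in a charset decode to U+FFFD, so they
    // only show up when the two charsets disagree about them.
    static List<String> diff(Charset a, Charset b,
                             int leadLo, int leadHi, int trailLo, int trailHi) {
        List<String> out = new ArrayList<>();
        for (int lead = leadLo; lead <= leadHi; lead++) {
            for (int trail = trailLo; trail <= trailHi; trail++) {
                byte[] bytes = { (byte) lead, (byte) trail };
                String sa = new String(bytes, a);
                String sb = new String(bytes, b);
                if (!sa.equals(sb)) {
                    out.add(String.format("0x%02X%02X: %s vs %s",
                            lead, trail, codePoints(sa), codePoints(sb)));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Charset names are assumptions; both may be absent from a minimal runtime.
        if (Charset.isSupported("x-mswin-936") && Charset.isSupported("GBK")) {
            List<String> d = diff(Charset.forName("x-mswin-936"),
                    Charset.forName("GBK"), 0xA1, 0xA7, 0x40, 0xA0);
            d.forEach(System.out::println);
            System.out.println(d.size() + " differing codes");
        } else {
            System.out.println("MS936 and/or GBK not available in this runtime");
        }
    }
}
```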

A list of the incorrect mappings and missing characters is being provided to Sun separately.

The list is attached to this CR.
###@###.### 10/22/04 21:36 GMT
###@###.### 10/22/04 21:42 GMT

                                    

Comments
EVALUATION

See thread http://mail.openjdk.java.net/pipermail/i18n-dev/2012-May/000630.html. Here are some notes regarding the changes.

(1) The MS936 and GBK mappings have been updated to follow the maps used by Microsoft's MultiByteToWideChar and WideCharToMultiByte, especially for the EUDC/PUA entries in the 0xA140 - 0xA7A0 range; this is also the mapping that GB18030 uses.
(2) In GBK.map, added the euro sign entry U+20AC to 0xA2E3, which follows GB18030 (MS936 maps U+20AC to 0x80).
(3) The "412 non-UDC characters missing" claim is evidently inaccurate. The code points cited in the attached CodePage936 appear to result from incorrect use of WideCharToMultiByte: they are the "best fit" results returned when WideCharToMultiByte is invoked without the WC_NO_BEST_FIT_CHARS flag.
(4) These code point mapping changes are "incompatible" in nature. However, given that GB18030 (the spec owner of Chinese encoding) and MS936/iconv-gbk (vendor implementations) all use the same mappings, it does not seem reasonable for Java's MS936/GBK to stick with the "wrong" mappings, especially since these are EUDC/PUA code points. So I believe we should go ahead and just update them.
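
Notes (2) and (3) can be explored from the Java side with a strict encoder. This is only a loose analogue of the Win32 behaviour: Java has no "best fit" flag, but configuring a CharsetEncoder with CodingErrorAction.REPORT makes unmappable characters fail instead of being silently substituted, so a code point is only reported as mapped when the charset really maps it. The class and method names below are hypothetical, and the charset names are assumptions about what the runtime provides.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Optional;

class StrictEncode {
    // Encode text strictly; return the bytes as hex, or empty if any
    // character is malformed or has no mapping in the charset.
    static Optional<String> encodeHex(String text, Charset cs) {
        try {
            ByteBuffer bb = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            StringBuilder sb = new StringBuilder();
            while (bb.hasRemaining()) {
                sb.append(String.format("%02X", bb.get() & 0xFF));
            }
            return Optional.of(sb.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Euro sign: per note (2), GBK should give A2E3 and MS936 should give 80.
        for (String name : new String[] { "GBK", "x-mswin-936", "GB18030" }) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            String result = encodeHex("\u20AC", Charset.forName(name))
                    .orElse("unmappable");
            System.out.println(name + ": U+20AC -> " + result);
        }
    }
}
```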
                                     
2012-06-05
The charsets are generated by a tool (which has been tested and verified separately) from these updated mapping tables. No new test is needed.
                                     
2013-04-12
verified in b85
                                     
2013-04-12
EVALUATION

The MS936 mapping table needs to be regenerated, but not for Mustang.
###@###.### 2005-07-21 22:58:28 GMT
                                     
2005-07-21


