JDK-6183404 : Many eudc characters are incorrectly mapped in MS936 and GBK converter
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: windows_2000
  • CPU: x86
  • Submitted: 2004-10-22
  • Updated: 2014-09-24
  • Resolved: 2012-06-12
Fixed in: 7u40; 8 (b43)
Some 538 UDC (user-defined characters) are wrongly mapped, and some 412 non-UDC characters are missing, in Java's MS936 converter.

For example, 0xA7A0 is supposed to be mapped to U+E765 (0xEE9DA5 in UTF-8), whereas Java's MS936 maps it to U+E79F.
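The mapping in the example above can be checked against whatever JDK is at hand. The following is a minimal sketch, not part of the original report: the class and method names are hypothetical, and it assumes the runtime provides the `MS936` and `GBK` charsets (it skips any that are not supported). What it prints depends on whether the JDK in use already contains the fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Ms936MappingCheck {

    /** Decodes the bytes with the named charset and returns the first code point. */
    static int decodeFirstCodePoint(String charsetName, byte[] bytes) {
        String s = new String(bytes, Charset.forName(charsetName));
        return s.codePointAt(0);
    }

    /** Returns the UTF-8 encoding of a code point as an uppercase hex string. */
    static String utf8Hex(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xA7, (byte) 0xA0};
        for (String name : new String[] {"MS936", "GBK"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this runtime");
                continue;
            }
            int cp = decodeFirstCodePoint(name, input);
            // The report expects U+E765 (UTF-8 0xEE9DA5) for a correct mapping.
            System.out.printf("%s: 0xA7A0 -> U+%04X (UTF-8 0x%s)%n",
                    name, cp, utf8Hex(cp));
        }
    }
}
```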

A comparison of the MS936 and GBK mappings indicates that Microsoft code page 936 differs slightly from GBK in its UDC mappings. However, Java's implementation of MS936 appears to be the same as GBK.

A list of the incorrect mappings and missing characters is being provided to Sun separately.

The list is attached to this CR.
###@###.### 10/22/04 21:36 GMT
###@###.### 10/22/04 21:42 GMT

verified in b85

The charsets are generated by a tool (which has been tested and verified separately) from the updated mapping tables. No new test is needed.

EVALUATION See thread http://mail.openjdk.java.net/pipermail/i18n-dev/2012-May/000630.html. Here are some notes regarding the changes.

(1) The MS936 and GBK mappings have been updated to follow the maps used by MS MultiByteToWideChar and WideCharToMultiByte (especially for the eudc/pua entries in the 0xA140 - 0xA7A0 range), which are also the same as the mappings GB18030 uses.

(2) In GBK.map, added the euro sign entry U+20AC to 0xA2E3, which follows GB18030 (MS936 maps U+20AC to 0x80).

(3) The "412 non-UDC characters missing" claim is evidently inaccurate. It appears that the code points cited in the attached CodePage936 are the result of incorrect use of WideCharToMultiByte. They are the "best fit" results produced when WideCharToMultiByte is invoked without the WC_NO_BEST_FIT_CHARS flag.

(4) These code point mapping changes are incompatible in nature. However, given that GB18030 (the spec owner of Chinese encoding) and MS936/iconv-gbk (the vendor implementations) all use the same mappings, it does not seem reasonable for Java's MS936/GBK to stick with the "wrong" mappings, especially since these are eudc/pua code points. So I believe we should go ahead and just update them.
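Notes (1) and (2) can be probed on a given runtime with a sketch along the following lines. It is illustrative only: the class and method names are hypothetical, the trail-byte layout (0x40 - 0xA0, skipping 0x7F) is an assumption based on the usual GBK double-byte structure rather than something stated in this report, and "private use" is taken to mean the BMP Private Use Area U+E000 - U+F8FF.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class EudcRangeScan {

    /** Enumerates double-byte codes in 0xA140 - 0xA7A0 (assumed layout, see above). */
    static List<byte[]> eudcCandidates() {
        List<byte[]> codes = new ArrayList<>();
        for (int lead = 0xA1; lead <= 0xA7; lead++) {
            for (int trail = 0x40; trail <= 0xA0; trail++) {
                if (trail == 0x7F) {
                    continue; // assumed not to be a valid trail byte
                }
                codes.add(new byte[] {(byte) lead, (byte) trail});
            }
        }
        return codes;
    }

    /** True if the code point lies in the BMP Private Use Area. */
    static boolean isPrivateUse(int cp) {
        return cp >= 0xE000 && cp <= 0xF8FF;
    }

    /** Counts candidate codes that decode to exactly one private-use character. */
    static int countPrivateUse(Charset cs) {
        int count = 0;
        for (byte[] code : eudcCandidates()) {
            String s = new String(code, cs);
            if (s.length() == 1 && isPrivateUse(s.charAt(0))) {
                count++;
            }
        }
        return count;
    }

    /** Encodes text with the charset and returns the bytes as uppercase hex. */
    static String encodeHex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"MS936", "GBK"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this runtime");
                continue;
            }
            Charset cs = Charset.forName(name);
            System.out.printf("%s: %d of %d codes decode to private-use characters%n",
                    name, countPrivateUse(cs), eudcCandidates().size());
            // Note (2) expects 0xA2E3 for GBK and 0x80 for MS936.
            System.out.printf("%s: U+20AC -> 0x%s%n", name, encodeHex("\u20AC", cs));
        }
    }
}
```

The scan only reports what the running JDK does; comparing its output across a release before and after the fix is one way to see the incompatibility described in note (4).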

EVALUATION Need to regenerate the MS936 mapping table, but not for Mustang. ###@###.### 2005-07-21 22:58:28 GMT