JDK-6974189 : Re-open bug #4950409, make GB2312 an alias of GBK
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6u21
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: linux
  • CPU: x86
  • Submitted: 2010-08-03
  • Updated: 2012-03-20
  • Resolved: 2010-08-17
Description
A DESCRIPTION OF THE REQUEST :
Re-open bug #4950409, which requests that GB2312 be made an alias of GBK instead of EUC_CN.

This RFE was (in our opinion) erroneously marked as a duplicate of #4914869 when in fact it describes a different issue.

JUSTIFICATION :
GBK is an extension of the GB2312 character encoding and is fully backwards compatible. It allows encoding of additional Chinese characters in comparison to GB2312 and its derivatives.

As mentioned in the original RFE, many programs (mail clients in particular) use gb2312 as the encoding name when they mean GBK.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
That "gb2312" (at the very least) be added as an alias to the GBK character set.

This would result in Charset.forName("gb2312") returning an instance of the GBK Charset.

This would allow Java programs to correctly decode GBK-encoded data that is labelled "gb2312" (which, as the original submitter remarked, appears to be common practice).
ACTUAL -
Charset.forName("gb2312") returns the EUC_CN charset, which produces "unmappable" error characters when decoding Chinese text that is labelled gb2312 but actually contains GBK-encoded characters.
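
The behavior described above can be reproduced with a small test. This is a minimal sketch, not code from the original report: the class name Gb2312LabelDemo and the helper decodesStrictly are hypothetical, and the sample character U+9F8D (a traditional-form character) is assumed to be present in GBK but absent from GB 2312. It also assumes the runtime ships the extended charsets (GBK and GB2312).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class; not part of the JDK or of the original report.
final class Gb2312LabelDemo {

    // Returns true if the bytes decode without any malformed or
    // unmappable input under the named charset, false otherwise.
    static boolean decodesStrictly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: U+9F8D is encodable in GBK but not in GB 2312.
        byte[] gbkBytes = "\u9f8d".getBytes(Charset.forName("GBK"));

        System.out.println("canonical name for \"gb2312\": "
                + Charset.forName("gb2312").name());
        System.out.println("decodes as GBK:    "
                + decodesStrictly(gbkBytes, "GBK"));
        System.out.println("decodes as gb2312: "
                + decodesStrictly(gbkBytes, "gb2312"));
    }
}
```

With the default (non-strict) decoding used by new String(bytes, charset), the same input would typically yield replacement characters rather than an exception, which matches the symptom reported here.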

CUSTOMER SUBMITTED WORKAROUND :
The only workaround we found is to manually modify the charsets.jar located in the JRE's /lib directory, replacing the existing EUC_CN.class with a modified one compiled from the following code:

public class EUC_CN extends GBK {

}

This has the effect of replacing the EUC_CN charset with the GBK charset, which, as the latter is backwards compatible with the former, should not be a problem.

This is a very ugly hack, but seems to be the only workaround that works, as these mappings are very much hardcoded into the JRE.

Comments
EVALUATION It was a mistake to close #4950409 as a duplicate of #4914869; those are two different issues. However, I'm closing this one (again) as "will not fix". The GB2312 charset in Java's charset repository is an implementation of GB 2312, one of China's national encoding standards. While it is indeed a subset of GBK, GB2312 and GBK are two independently specified charsets, and their names are two separately registered IANA charset names. You can't simply alias GB2312 to GBK. The "aliasing" needed to work around incorrect use of the GB2312 name should be done at the application level, not at the platform level.
17-08-2010
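
The application-level aliasing suggested in the evaluation could look roughly like the following. This is only a sketch under stated assumptions: the class LenientCharsets and its method forLabel are hypothetical names, and the choice to remap only the label "gb2312" is an illustration taken from this report, not an official recommendation. Only Charset.forName is a real JDK call here.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper: resolves a charset label received from external
// data (for example a mail header) and treats "gb2312" as GBK.
final class LenientCharsets {

    private LenientCharsets() {
    }

    static Charset forLabel(String label) {
        // Charset names are case-insensitive, so normalise before comparing.
        String normalized = label.trim().toLowerCase(Locale.ROOT);
        if (normalized.equals("gb2312")) {
            // GBK is a superset of GB 2312, so correctly labelled GB 2312
            // data should still decode the same way.
            return Charset.forName("GBK");
        }
        // Everything else goes through the normal platform lookup.
        return Charset.forName(label.trim());
    }

    public static void main(String[] args) {
        // Assumption: U+9F8D is encodable in GBK but not in GB 2312.
        byte[] gbkBytes = "\u9f8d".getBytes(Charset.forName("GBK"));
        String decoded = new String(gbkBytes, forLabel("gb2312"));
        System.out.println(decoded.equals("\u9f8d"));
    }
}
```

Keeping the remapping in one place means the application decides which mislabelled inputs it tolerates, while Charset.forName("gb2312") elsewhere keeps its platform meaning.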