JDK-4950409 : GB2312 should be an alias of GBK, instead of EUC_CN
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 1.4.1
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: windows_2000
  • CPU: x86
  • Submitted: 2003-11-07
  • Updated: 2003-11-10
  • Resolved: 2003-11-10
Related Reports
Duplicate :  
Relates :  
Description

Name: rmT116609			Date: 11/06/2003


A DESCRIPTION OF THE REQUEST :
Sun Microsystems' sun.io.CharacterEncoding.java in JDK 1.4.1 contains the following statement:
aliasTable.put("gb2312", "EUC_CN");

In other words, JDK 1.4.1 sets GB2312 as an alias of EUC_CN.
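To see how a given runtime resolves the name, a minimal sketch along these lines can help. It uses the public java.nio.charset.Charset API rather than the internal sun.io table quoted above, so it only approximates that lookup; the class and method names are hypothetical, and the printed names depend on the JDK in use.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports which charset a label such as "gb2312" resolves to.
final class Gb2312AliasCheck {

    // Returns the canonical java.nio name for the given label (lookup is case-insensitive).
    static String canonicalName(String label) {
        return Charset.forName(label).name();
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("gb2312");
        // The canonical name and alias set vary between JDK releases.
        System.out.println("gb2312 resolves to: " + cs.name());
        System.out.println("aliases: " + cs.aliases());
        System.out.println("same charset as GBK? " + cs.equals(Charset.forName("GBK")));
    }
}
```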

Technically, there is nothing wrong with this, because Sun Microsystems follows the specifications precisely. However, most parties in the software industry do not follow this mapping.

Microsoft, Google, and all major Web sites in China (PRC), including Sina, Sohu, and Netease, treat GB2312 as the MIME name for GBK. Although I do not have access to their source code to provide evidence, I can give a good example that indicates their mapping. (If they do not have an explicit mapping, then they use GB2312 as the declaration or MIME name of GBK.)

Recall that the name of the former Chinese Prime Minister is Zhu Rongji. His Chinese name is 朱镕基 (U+6731 U+9555 U+57FA). When Zhu took office in 1998, none of the major Chinese Web sites could display his name properly. In particular, 镕 (U+9555) was shown as two characters. Here is an article from Sina (dated June 26, 2000) that shows the problem: http://news.sina.com.cn/china/2000-06-26/100918.html.

One of the reasons is that 镕 (U+9555) is not part of GB2312, which was (and is) the character set declaration for all major Web sites. 镕 (U+9555) is defined in GBK.
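The coverage difference can be checked with a short sketch like the one below. It asks each charset's encoder whether it can represent U+9555, assuming the runtime ships both the GB2312 and GBK charsets; the class and method names are hypothetical, and the answer for GB2312 depends on how that name is mapped in the JDK being tested.

```java
import java.nio.charset.Charset;

// Hypothetical helper: checks whether a charset can encode a given character.
final class Gb2312Coverage {

    // True if the named charset's encoder can represent the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rong = '\u9555'; // the middle character of the name discussed above
        for (String name : new String[] {"GB2312", "GBK"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + " is not available in this runtime");
                continue;
            }
            System.out.println(name + " can encode U+9555: " + canEncode(name, rong));
        }
    }
}
```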

Now all major Chinese Web sites are able to display 镕 (U+9555) properly. Here is a Web page containing the name of the former Chinese Prime Minister: http://news.gd.sina.com.cn/finance/2003-11-01/198823.html. They achieve this by encoding their pages in GBK. However, their character set declaration is still <meta http-equiv="content-type" content="text/html; charset=gb2312">.

Besides all major Chinese Web sites, Microsoft and Google also map GB2312 to GBK. People can send 镕 (U+9555) as GB2312 in Outlook Express 6.0. Google can find 镕 (U+9555) on a Web site that declares <meta http-equiv="content-type" content="text/html; charset=gb2312">.

The conclusion is that Sun Microsystems maps GB2312 to EUC_CN, while the rest of the software industry maps GB2312 to GBK.

This difference causes not only headaches for Java developers, but also bugs in Sun Microsystems' tools. For example, a user might send 镕 (U+9555) as GB2312 from Microsoft Outlook Express 6.0, and JavaMail then cannot recognize 镕 (U+9555). (Developers can try the example from http://developer.java.sun.com/developer/onlineTraining/JavaMail/exercises/MailSetup/.) Changing JavaMail's javamail.charset.map does not help.

Perhaps the only ally of Sun Microsystems is IBM, because ICU 2.4 also maps GB2312 to EUC rather than GBK.

To achieve better behavior, rather than simply following the specification, I suggest that GB2312 be made an alias of GBK instead of EUC_CN.

Changing from EUC_CN to GBK does not cause any backward compatibility problems, because GBK is compatible with EUC_CN (GB 2312-80). Here is some technical data from Ken Lunde, CJKV Information Processing.

"From a compatibility point of view, there is comfort in knowing that every character in GB 2312-80 is at the same code point in GBK", Ken Lunde, CJKV Information Processing, page 89.

"Note that the EUC_CN code set 1 encoding range, 0xA1A1 through 0xFEFE, forms a subset of GBK encoding. This is by design, and has the benefit of providing backward compatibility with EUC_CN encoding." Ken Lunde, CJKV Information Processing, page 171.
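The compatibility claim can be spot-checked with a sketch like the following, which encodes text made only of GB 2312-80 characters under both charsets and compares the resulting bytes. It is only a sample test, not a proof over the whole repertoire, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper: compares the byte encoding of a string under GB2312 and GBK.
final class GbkSupersetCheck {

    // True if the text encodes to identical bytes under both charsets.
    static boolean sameBytes(String text) {
        byte[] asGb2312 = text.getBytes(Charset.forName("GB2312"));
        byte[] asGbk = text.getBytes(Charset.forName("GBK"));
        return Arrays.equals(asGb2312, asGbk);
    }

    public static void main(String[] args) {
        // U+6731 and U+57FA are both part of GB 2312-80.
        String sample = "\u6731\u57FA";
        System.out.println("identical bytes under GB2312 and GBK: " + sameBytes(sample));
    }
}
```

A character outside GB 2312-80, such as U+9555, is deliberately left out of the sample, since a strict GB2312 encoder would substitute a replacement byte for it and the comparison would no longer measure compatibility.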

I believe that GB2312 is just one of several IANA names with this kind of problem; other IANA names are likely affected in a similar way.


JUSTIFICATION :
All major parties in the software industry, including Microsoft, Google, Sina, Sohu, and Netease, use GB2312 as the preferred MIME name for GBK. In other words, GB2312 should be an alias of GBK, not EUC_CN.

Java developers have problems with GB2312=EUC_CN. In particular, Chinese developers find it somewhat embarrassing that U+9555, one of the characters in the name of their former Prime Minister Zhu Rongji, is affected.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
GB2312 is set as an alias of GBK, instead of EUC_CN.
(Incident Review ID: 224221) 
======================================================================

Comments
EVALUATION This is a known issue, and a fix is in the pipeline for 1.5. Closing out as a duplicate of 4914869. ###@###.### 2003-11-10
10-11-2003