Bug ID: JDK-4838512 (cs) Default charsets must be hardwired

Type: Bug
Component: core-libs
Sub-Component: java.nio
Affected Version: 1.4.1,1.4.1_03,1.4.2,1.4.2_04

Priority: P3
Status: Resolved
Resolution: Fixed
OS: linux,solaris,solaris_8
CPU: generic,x86,sparc

Submitted: 2003-03-27
Updated: 2004-04-07
Resolved: 2003-10-24

Versions (Unresolved/Resolved/Fixed)

Other            Other
1.4.1_07 07      1.4.2_05
Fixed            Fixed

Related Reports

Duplicate :	JDK-5033591 - I18N - different behavior when reading invalid char in different locale
Duplicate :	JDK-5031167 - NLS: CONVERTING BYTE ARRAY TO STRING USING EUC_CN CHARSET IS NOT CONSISTENT
Duplicate :	JDK-5036209 - ByteToCharGB18030 exception in 1.4.2_04-b05
Relates :	JDK-4954023 - missing/dropped character in zh_CN.GB18030 locale
Relates :	JDK-5080151 - (cs) 1.4.2_0X: IllegalStateException: recursive invocation on non-english locales

Description

PROBLEM:
 In a multiprocessor environment (Linux), the attached sample code (A.java)
 produces the attached exception stack trace.


TEST PROGRAM:
==== A.java ===
public class A {
    public static void main(String arg[]) throws Exception {
	Thread t1 = new Test();
	Thread t2 = new Test();

	t1.start();
	t2.start();
    }

    static class Test extends Thread {
	public void run() {
	    while (!interrupted()) {
		try {
		    "a".getBytes("ASCII");
		    "a".getBytes("EUC-JP-LINUX");
		} catch (Exception e) {
		    e.printStackTrace();
		}
	    }
	}
    }
}    
===============

LOG DATA:
==== log ===
java.lang.Error: java.nio.charset.UnsupportedCharsetException: EUC-JP-LINUX
        at java.lang.StringCoding.lookupCharset(StringCoding.java:84)
        at java.lang.StringCoding.encode(StringCoding.java:361)
        at java.lang.StringCoding.encode(StringCoding.java:378)
        at java.lang.String.getBytes(String.java:608)
        at java.io.UnixFileSystem.canonicalize(Native Method)
        at java.io.File.getCanonicalPath(File.java:513)
        at java.io.FilePermission$1.run(FilePermission.java:209)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.io.FilePermission.init(FilePermission.java:203)
        at java.io.FilePermission.<init>(FilePermission.java:253)
        at sun.net.www.protocol.file.FileURLConnection.getPermission(FileURLConnection.java:193)
        at sun.net.www.protocol.jar.JarFileFactory.getPermission(JarFileFactory.java:111)
        at sun.net.www.protocol.jar.JarFileFactory.getCachedJarFile(JarFileFactory.java:81)
        at sun.net.www.protocol.jar.JarFileFactory.get(JarFileFactory.java:50)
        at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:85)
        at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:105)
        at java.net.URL.openStream(URL.java:960)
        at sun.misc.Service.parse(Service.java:203)
        at sun.misc.Service.access$100(Service.java:111)
        at sun.misc.Service$LazyIterator.hasNext(Service.java:257)
        at java.nio.charset.Charset$1.getNext(Charset.java:301)
        at java.nio.charset.Charset$1.hasNext(Charset.java:316)
        at java.nio.charset.Charset$2.run(Charset.java:359)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.nio.charset.Charset.lookupViaProviders(Charset.java:356)
        at java.nio.charset.Charset.lookup(Charset.java:383)
        at java.nio.charset.Charset.isSupported(Charset.java:405)
        at java.lang.StringCoding.lookupCharset(StringCoding.java:80)
        at java.lang.StringCoding.encode(StringCoding.java:361)
        at java.lang.String.getBytes(String.java:591)
        at A$Test.run(A.java:27)
Caused by: java.nio.charset.UnsupportedCharsetException: EUC-JP-LINUX
        at java.nio.charset.Charset.forName(Charset.java:428)
        at java.lang.StringCoding.lookupCharset(StringCoding.java:82)
        ... 30 more
============


CONFIGURATION:
 
 - MPU: Pentium III 800MHz X 2
 - OS : Turbo Linux 8 (kernel 2.4.18-5smp)
 - JRE: JDK1.4.1_01, 1.4.2(b15)


REPORT:
 
They suspect this is caused when several threads try to modify the cache
in Charset.
Specifically, java.nio.charset.Charset#isSupported and
java.nio.charset.Charset#forName should be atomic, but they are not.

The following is a possible scenario. Thread-A and Thread-B are created in
the test program.

Thread-A     Cache     Thread-B
--------    -------    ---------
  
              ASCII
    
   (1)          
                              (2)
             EUC-JP-LINUX
   (3)
                              (4)
               ASCII
   (5)


(1) In thread-A, "a".getBytes("EUC-JP-LINUX") runs as follows.

  "a".getBytes("EUC-JP-LINUX")
     -> StringCoding#lookupCharset
      -> Charset#isSupported("EUC-JP-LINUX")
       -> Charset#lookup("EUC-JP-LINUX")
  
  At this stage, Charset.cache holds "ASCII", so a cache miss occurs and
  Charset#lookupViaProviders is called.

(2) A sequence of "a".getBytes("EUC-JP-LINUX") finishes in thread-B.
    Here, the cache is set to "EUC-JP-LINUX".

(3) During the execution in (1), StringCoding#lookupCharset is called.
    (This is from the information in the above log.)
    In lookupCharset, Charset#isSupported("EUC-JP-LINUX") is called again
    and returns true because of a cache hit (the cache was set to
    "EUC-JP-LINUX" at (2)).

(4) A sequence of "a".getBytes("ASCII") finishes in thread-B.
    Here, the cache is set to "ASCII".

(5) Charset#forName("EUC-JP-LINUX") is called after
    Charset#isSupported("EUC-JP-LINUX") at (3).
    Here, a cache miss occurs and Charset#lookupViaProviders is called.
    However, lookupViaProviders is not re-entrant and returns null.
    As a result, an UnsupportedCharsetException appears to be thrown.
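
The interleaving in steps (1) to (5) can be replayed deterministically in a single thread. The following is only a minimal sketch under stated assumptions, not the JDK's code: the class CacheRaceSketch, its one-entry cache field, and the boolean recursion guard are hypothetical stand-ins for the behavior the report attributes to Charset. "Thread B" is simulated by writing the cache field directly between thread A's two calls.

```java
class CacheRaceSketch {
    // Hypothetical one-entry cache of the most recently returned charset name.
    private String cached = "ASCII";
    // True while "thread A" is inside a provider lookup (an assumption of this
    // sketch; a real implementation would track this per thread).
    private boolean inProviderLookup = false;

    // Returns the name on success, or null when a cache miss happens during
    // a provider lookup, because nested provider lookups are refused.
    String lookup(String name) {
        if (name.equals(cached)) {
            return name;              // cache hit
        }
        if (inProviderLookup) {
            return null;              // recursive provider lookup refused
        }
        cached = name;                // normal lookup succeeds, cache refreshed
        return name;
    }

    boolean isSupported(String name) {
        return lookup(name) != null;
    }

    String forName(String name) {
        String cs = lookup(name);
        if (cs == null) {
            throw new IllegalArgumentException("unsupported: " + name);
        }
        return cs;
    }

    // Replays steps (1)-(5) of the scenario in program order.
    static String replay() {
        CacheRaceSketch c = new CacheRaceSketch();
        c.inProviderLookup = true;                   // (1) A misses the cache, starts a provider lookup
        c.cached = "EUC-JP-LINUX";                   // (2) B finishes getBytes("EUC-JP-LINUX")
        boolean ok = c.isSupported("EUC-JP-LINUX");  // (3) A's nested check hits the cache
        c.cached = "ASCII";                          // (4) B finishes getBytes("ASCII")
        try {
            c.forName("EUC-JP-LINUX");               // (5) cache miss, nested lookup refused
            return "supported=" + ok + ", forName succeeded";
        } catch (IllegalArgumentException e) {
            return "supported=" + ok + ", forName failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

The check in step (3) and the use in step (5) disagree only because the shared cache changed between them, which is the check-then-act gap the submitter describes.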

===========================================================================

Comments

CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: 1.4.1_07 1.4.2_05 generic tiger-beta
FIXED IN: 1.4.1_07 1.4.2_05 tiger-beta
INTEGRATED IN: 1.4.1_07 1.4.2_05 tiger-b26 tiger-beta
14-06-2004
SUGGESTED FIX
Please see the attached webrevs. There are two slightly different fixes, one for 1.4.1 and another for 1.4.2 and later. For convenience these webrevs may also be viewed online at http://nio.sfba/rev/4838512. -- ###@###.### 2003/10/22
10-11-0188
WORK AROUND
Set the default encoding on the command line to force the use of the old sun.io EUC-JP-LINUX converter, e.g.,

    % java -Dfile.encoding='^AEUC-JP-LINUX' Foo

where ^A represents the ASCII character control-A, i.e., \u0001. This causes the old sun.io converter for EUC-JP-LINUX to be used whenever the default encoding is required, thereby preventing the recursive provider lookups which cause the reported problem.

Note that the system property "file.encoding" is implementation-private. The redefinition of this property is not, in general, guaranteed to work, and will likely fail to work in J2SE 1.5 or later releases. -- ###@###.### 2003/10/5
10-11-0171
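
To see which default encoding a given JVM has actually picked up, a small check along the following lines may help. This is a sketch only: the class name DefaultCharsetCheck is hypothetical, and Charset.defaultCharset() exists only in J2SE 5.0 and later, so on the 1.4.x releases discussed here only the file.encoding property can be inspected.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Reports the raw system property next to the charset the JVM resolved.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it under the affected locale (for example with LC_ALL=ja_JP on Linux) shows whether the default charset is the one involved in this report.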
EVALUATION
The analysis given by the submitter is on the right track, but the change required is more than a simple matter of making the Charset.isSupported and .forName methods atomic. That would not actually solve the reported problem. The suggested fix would mask the problem, but would fail in a future release when the old sun.io converters are removed.

The root cause of this bug is the fact that a platform's default charset cannot be loaded via the charset-provider mechanism. The default charset is used to translate filenames from Java UTF-16 strings into platform-specific strings. The provider mechanism itself needs to translate filenames in order to discover providers, hence a provider cannot provide the charset which is needed to discover and load itself. This is why the lookup code in the Charset class disallows recursive provider lookups.

In 1.4.1 and later releases the EUC-JP-LINUX charset is provided by the sun.nio.cs.ext.ExtendedCharsets provider. In contexts in which EUC-JP-LINUX is the default charset (e.g., LC_ALL=ja_JP on Linux) it would seem that this charset should appear to be unsupported, but in fact it works much of the time. The reason for this is the existence in the 1.4.x releases of a dual charset lookup mechanism which falls back to the old sun.io converters when a charset is not supported by the java.nio.charset APIs.

To see how this works, consider the following example. The evaluation of the expression "a".getBytes("EUC-JP-LINUX") first causes the code in the internal java.lang.StringCoding class to invoke the Charset.isSupported() method to see if that charset is supported. EUC-JP-LINUX is not a standard charset, so the lookup code in java.nio.charset.Charset tries to look it up via the provider mechanism. This lookup eventually results in a recursive invocation of the String.getBytes method on the same thread, this time to encode the filename of the charsets.jar file into EUC-JP-LINUX (since it's the default charset), which in turn results in a recursive provider lookup. This fails, since such lookups are disallowed, hence the String.getBytes method falls back to the old sun.io EUC-JP-LINUX converter. The initial provider lookup then succeeds, since it uses the old converter to encode the filename.

On a multiprocessor this scheme can break down if the timing is just right. As observed by the submitter, the Charset class contains a global cache of the most recently-returned charset. At the end of the scenario described above this cache will hold a reference to the EUC-JP-LINUX charset. If one thread causes the EUC-JP-LINUX charset to be removed from the cache in between another thread's invocations of the Charset.isSupported and .forName methods during the recursive provider lookup then an UnsupportedCharsetException will be thrown, as reported.

The solution suggested by the submitter will solve the problem, but at the cost of a synchronization operation and in a way that will fail when the sun.io converters are removed in a future release. A better solution is to recognize this fundamental limitation of the charset-provider mechanism and "hardwire" the ExtendedCharsets provider into the java.nio.charset.Charset lookup logic. The diffs for this change are in the suggested-fix section of this bug report.

An alternative solution would be to rework the sun.misc.Service code so that it does not load provider-descriptor files via URLs. It does this only because that's the only way to load multiple resource files of the same name. Since charset providers are, by definition, already on the class path, there's really no need to do another permission check on each provider-description file as is currently done by the clumsy JarURLConnection code. This solution would, however, most likely require a more complex and risky set of changes, to the Service, JarURLConnection, and (possibly) java.lang.ClassLoader classes, hence it is not proposed here. -- ###@###.### 2003/10/5
10-11-0171
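
The "hardwiring" idea from the evaluation can be pictured as a lookup order in which the built-in tables are consulted before any provider discovery takes place. The following is a minimal sketch under stated assumptions and is not the actual fix from the webrevs: the class HardwiredLookupSketch, the two maps, the DEFAULT constant, and the ThreadLocal guard are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class HardwiredLookupSketch {
    // Hypothetical built-in tables; the values just record which table answered.
    static final Map<String, String> STANDARD = new HashMap<>();
    static final Map<String, String> EXTENDED = new HashMap<>();
    // Assumed default charset of the platform in this sketch.
    static final String DEFAULT = "EUC-JP-LINUX";

    static {
        STANDARD.put("US-ASCII", "standard");
        EXTENDED.put("EUC-JP-LINUX", "extended");
    }

    // Per-thread guard that refuses nested provider lookups.
    static final ThreadLocal<Boolean> IN_PROVIDER_LOOKUP =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    static String lookupViaProviders(String name) {
        if (IN_PROVIDER_LOOKUP.get()) {
            return null;                      // recursion refused
        }
        IN_PROVIDER_LOOKUP.set(Boolean.TRUE);
        try {
            // Discovering providers needs the default charset to encode file
            // names. With the extended table hardwired, this nested lookup is
            // answered without re-entering provider discovery.
            String forFileNames = lookup(DEFAULT);
            if (forFileNames == null) {
                return null;                  // cannot even encode file names
            }
            return null;                      // no third-party provider in this sketch
        } finally {
            IN_PROVIDER_LOOKUP.set(Boolean.FALSE);
        }
    }

    static String lookup(String name) {
        String source = STANDARD.get(name);
        if (source == null) {
            source = EXTENDED.get(name);      // hardwired: no provider discovery
        }
        if (source == null) {
            source = lookupViaProviders(name);
        }
        return source;
    }

    public static void main(String[] args) {
        System.out.println(lookup("EUC-JP-LINUX"));
        System.out.println(lookup("X-UNKNOWN"));
    }
}
```

Because the default charset never reaches lookupViaProviders in this arrangement, its availability no longer depends on the state of a shared cache or on a fallback converter.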