Bug ID: JDK-4802209 String and OutputStreamWriter classes sometimes encode UTF-8 incorrectly

Type: Bug
Component: core-libs
Sub-Component: java.io
Affected Version: 1.4.2

Priority: P3
Status: Closed
Resolution: Fixed
OS: solaris_8
CPU: generic

Submitted: 2003-01-13
Updated: 2007-12-11
Resolved: 2003-11-11

Versions (Unresolved/Resolved/Fixed)

Other
5.0 b28 Fixed

Related Reports

Relates :

JDK-4752992 - (cs) Looking up non-NIO charsets is very slow

Description


Name: dfR10049			Date: 01/13/2003


The JCK tests for java.net/URL[Encoder/Decoder] fail if they are run after
the JCK tests for the java_io package in the same JVM, on Solaris 2.8
with LC_CTYPE set to "en_US.UTF-8".

The bug: the URLEncoder.encode method with the "UTF-8" encoding
processes surrogate pairs incorrectly if an instance of InputStreamReader
has been created and new URL(http_url).openConnection().connect() has been
called beforehand.

So creating an InputStreamReader instance and connecting to the
HTTP URL affects the output of the following call:
   URLEncoder.encode("\uD800\uDC00 \uD801\uDC01 ", "UTF-8")


Here is a minimal test that demonstrates the bug:
----------------- EncTest.java ------------------
import java.io.*;
import java.net.*;

public class EncTest {

    public static void main(String args[]) {
        try {
            String toEncode = "\uD800\uDC00 \uD801\uDC01 ";
            String enc1 = URLEncoder.encode(toEncode, "UTF-8");

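            // Steps that trigger the bug: create an InputStreamReader,
            // then connect to an HTTP URL (passed as the first argument).
            byte bytes[] = {};
            ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
            InputStreamReader reader = new InputStreamReader(bais, "8859_1");

            new URL(args[0]).openConnection().connect();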

            String enc2 = URLEncoder.encode(toEncode, "UTF-8");
            if (enc1.equals(enc2)) {
                System.out.println("Test passed: ");
            } else {
                System.out.println("Test failed: ");
            }
            System.out.println("    enc1: " + enc1);
            System.out.println("    enc2: " + enc2);

        } catch (Exception e) {
            System.out.println(e);
        }

    }

}
-----------------------------------------
#> uname -a
SunOS matmech 5.8 Generic_108528-14 sun4u sparc SUNW,Ultra-5_10

#> echo $LC_CTYPE 
en_US.UTF-8

#> java -version
java version "1.4.2-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-beta-b12)
Java HotSpot(TM) Client VM (build 1.4.2-beta-b12, mixed mode)

#> java EncTest <SOME AVAILABLE HTTP URL>
Test failed: 
    enc1: %F0%90%80%80+%F0%90%90%81+
    enc2: ++

Note: the bug is reproducible with jdk1.4.2 b11 and b12.
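
For reference, the expected enc1 value above can be derived by hand. The following is a minimal sketch, not part of the original report: it encodes a supplementary code point as four UTF-8 bytes and compares the result with what String.getBytes, OutputStreamWriter and URLEncoder produce. The class and helper names are hypothetical, and the code assumes a Java 8 or later runtime (code point APIs and StandardCharsets did not exist in 1.4.2).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class SurrogateUtf8Check {

    // Encode one supplementary code point (U+10000..U+10FFFF) as four UTF-8 bytes.
    static byte[] encodeSupplementary(int cp) {
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Format bytes in the %XX style that URLEncoder uses for escaped bytes.
    static String percentHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a string through OutputStreamWriter, the other path named in the bug title.
    static byte[] viaWriter(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // URLEncoder with the "UTF-8" encoding name, as in the test case above.
    static String urlEncodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pair = "\uD800\uDC00";   // surrogate pair for U+10000
        int cp = pair.codePointAt(0);

        System.out.println(percentHex(encodeSupplementary(cp)));                  // %F0%90%80%80
        System.out.println(percentHex(pair.getBytes(StandardCharsets.UTF_8)));    // %F0%90%80%80
        System.out.println(percentHex(viaWriter(pair)));                          // %F0%90%80%80
        System.out.println(urlEncodeUtf8("\uD800\uDC00 \uD801\uDC01 "));          // %F0%90%80%80+%F0%90%90%81+
    }
}
```

On a runtime without this bug, all four lines agree with the enc1 value reported above; the failing enc2 value ("++") shows both surrogate pairs contributing no bytes at all.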

======================================================================

Comments

CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: tiger tiger-beta
FIXED IN: tiger-beta
INTEGRATED IN: tiger-b28 tiger-beta
VERIFIED IN: tiger-rc

13-09-2004

EVALUATION

I can also reproduce this with mantis b11 and b12 on S8 and S9 in any UTF-8 locale. b10 is not affected. I cannot reproduce it with my own workspace, so I am currently rebuilding an original b11 workspace. I doubt it is a networking problem, since we have not changed anything in this area recently.

###@###.### 2003-01-13

The problem results from the nio bug fix for 4752992, which changed the way character converters are cached in the platform. A side effect of that fix is that once an old-style (non-nio) converter is cached, that converter is always used subsequently in preference to the new-style converter. The old UTF-8 converter has a specific bug with surrogate pair characters, and this test case exposes it.

Another important factor is that NetworkClient needs to test whether the platform's default encoding is compatible with US-ASCII, and it uses the Converter API directly for that purpose. One solution would be to change this to use only public String APIs, which would avoid this particular manifestation of the problem. This is a simple change (and it should have been done this way from the start), but there are other places in the platform where the Converters are called directly, and there may be problems there as well. We need to check with the nio team whether they intend to fix the problem or whether we need to change our code.

###@###.### 2003-01-14

This problem could affect any code, not just java.net, that converts from characters to UTF-8 using the String.getBytes methods or the OutputStreamWriter class. For the bug to appear, the old-style UTF-8 char-to-byte converter must be specifically requested via the internal sun.io API, as is done during the opening of a URL connection in the en_US.UTF-8 locale. A simple fix, which is not specific to the networking code, is to disable the caching of the (broken) sun.io UTF8 char-to-byte converter. See the suggested fix for the diff.

-- ###@###.### 2003/1/20

01-11-0186
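
The caching side effect described in the evaluation can be illustrated with a toy model. This is only a sketch under stated assumptions: none of these names are the real sun.io classes, the "legacy" encoder simply drops surrogate chars to imitate the observed "++" output, and the lookup logic is a simplification of what the evaluation describes. It assumes Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, not the real sun.io.Converters code.
class ConverterCacheSketch {

    interface CharEncoder {
        byte[] encode(String s);
    }

    // Stand-in for the broken old-style converter. Assumption: it drops
    // surrogate chars, which imitates the "++" result seen in the report.
    static final CharEncoder LEGACY_UTF8 = s -> {
        StringBuilder kept = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (!Character.isSurrogate(c)) {
                kept.append(c);
            }
        }
        return kept.toString().getBytes(StandardCharsets.UTF_8);
    };

    // Stand-in for the correct new-style converter.
    static final CharEncoder NEW_UTF8 = s -> s.getBytes(StandardCharsets.UTF_8);

    private final Map<String, CharEncoder> cache = new HashMap<>();
    private final boolean skipCachingLegacyUtf8;

    ConverterCacheSketch(boolean skipCachingLegacyUtf8) {
        this.skipCachingLegacyUtf8 = skipCachingLegacyUtf8;
    }

    // Models a direct request through the internal old-style API.
    CharEncoder requestLegacy(String enc) {
        CharEncoder c = cache.get(enc);
        if (c == null) {
            c = LEGACY_UTF8;
            // The suggested fix amounts to skipping this cache step for UTF-8.
            if (!(skipCachingLegacyUtf8 && enc.equals("UTF-8"))) {
                cache.put(enc, c);
            }
        }
        return c;
    }

    // Models a public-API conversion: a cached converter wins over the new-style one.
    byte[] getBytes(String s, String enc) {
        CharEncoder cached = cache.get(enc);
        return (cached != null ? cached : NEW_UTF8).encode(s);
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00 ";   // one surrogate pair plus a space

        ConverterCacheSketch buggy = new ConverterCacheSketch(false);
        System.out.println(buggy.getBytes(s, "UTF-8").length);   // 5
        buggy.requestLegacy("UTF-8");                            // pollutes the cache
        System.out.println(buggy.getBytes(s, "UTF-8").length);   // 1

        ConverterCacheSketch fixed = new ConverterCacheSketch(true);
        fixed.requestLegacy("UTF-8");                            // not cached
        System.out.println(fixed.getBytes(s, "UTF-8").length);   // 5
    }
}
```

The model shows why the result of an unrelated call (here requestLegacy, in the report the URL connection) can change the bytes produced by later conversions in the same JVM.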

SUGGESTED FIX

!sccsdiff ../../src/share/classes/sun/net/NetworkClient.java 1.33
175 lines
2c2
<  * @(#)NetworkClient.java 1.33 01/12/03
---
>  * %W% %E%
22c22
<  * @version 1.33, 12/03/01
---
>  * @version %I%, %G%
108,109c108
< CharToByteConverter ctob = CharToByteConverter.getConverter (encoding);
< byte[] b = ctob.convertAll (chkS.toCharArray());
---
> byte[] b = chkS.getBytes (encoding);

==== Diff for sun.io.Converters:

*** /tmp/geta26505 Mon Jan 20 18:23:20 2003
--- Converters.java Mon Jan 20 15:42:58 2003
***************
*** 246,252 ****
  c = cache(type, enc);
  if (c == null) {
      c = getConverterClass(type, enc);
!     cache(type, enc, c);
  }
  }
  return newConverter(enc, c);
--- 246,253 ----
  c = cache(type, enc);
  if (c == null) {
      c = getConverterClass(type, enc);
!     if (!c.getName().equals("sun.io.CharToByteUTF8"))
!         cache(type, enc, c);
  }
  }
  return newConverter(enc, c);

-- ###@###.### 2003/1/20

01-11-0186
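
The NetworkClient part of the suggested fix replaces the direct converter call with String.getBytes. A standalone sketch of that kind of US-ASCII compatibility check, using only public APIs, might look as follows. The class name, method name and probe string are assumptions made for illustration; this is not the actual NetworkClient source.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the real sun.net.NetworkClient code.
class AsciiSupersetCheck {

    // Probe characters chosen for illustration; all are in the US-ASCII range.
    private static final String PROBE =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_.!~*'();/?:@&=+$,";

    // Returns true if the named encoding maps the probe characters
    // to the same bytes as US-ASCII does.
    static boolean isAsciiSuperset(String encoding) {
        byte[] expected = PROBE.getBytes(StandardCharsets.US_ASCII);
        try {
            // Public String API, as in the suggested fix, instead of
            // requesting a char-to-byte converter directly.
            byte[] actual = PROBE.getBytes(encoding);
            return Arrays.equals(expected, actual);
        } catch (UnsupportedEncodingException e) {
            // An unknown encoding cannot be confirmed as compatible.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isAsciiSuperset("UTF-8"));       // true
        System.out.println(isAsciiSuperset("ISO-8859-1"));  // true
        System.out.println(isAsciiSuperset("UTF-16"));      // false
    }
}
```

Because the check goes through String.getBytes, it never asks the internal API for an old-style converter, which is why the evaluation notes that this change avoids this particular manifestation of the problem.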