JDK-4974686 : Only in 1.4.2, URLEncoder outputs exception in Japanese locale
  • Type: Bug
  • Status: Closed
  • Resolution: Fixed
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Priority: P3
  • Affected Version: 1.4.2_02
  • OS: solaris_8
  • CPU: sparc
  • Submit Date: 2004-01-07
  • Updated Date: 2004-04-22
  • Resolved Date: 2004-03-30
Other
1.4.2_05 05 Fixed
Description
The attached program fails with an UnknownCharacterException when run in the Japanese locale.

REPRODUCE:

 (1) Compile the following program with -deprecation.

====>
import java.io.PrintWriter;
import java.net.*;
import java.io.*;

public class DecodeTests {

    final static String stringSetUTF = 
        "\uD800\uDC00";

               // a string of surrogate pairs can be expressed as 4 bytes

    /* standalone interface */
    public static void main(String argv[]) {
        String encoded = URLEncoder.encode(stringSetUTF);
    }
}

<===


 (2) Launch "java DecodeTests" with LANG=ja, 
     then you will see the below exception.
 
goedel[38]% java DecodeTests
Exception in thread "main" java.lang.Error: UnknownCharacterException thrown in substititution mode
        at sun.io.CharToByteConverter.convertAny(CharToByteConverter.java:160)
        at sun.nio.cs.StreamEncoder$ConverterSE.implWrite(StreamEncoder.java:210)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:124)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:178)
        at java.net.URLEncoder.encode(URLEncoder.java:234)
        at java.net.URLEncoder.encode(URLEncoder.java:149)
        at DecodeTests.main(DecodeTests.java:15)
Caused by: sun.io.UnknownCharacterException
        at sun.io.CharToByteEUC_JP_Solaris.convert(CharToByteEUC_JP_Solaris.java:97)
        at sun.io.CharToByteConverter.convertAny(CharToByteConverter.java:139)
        ... 7 more

NOTE: 

 1) Locale dependency

   If you change the locale to English (LANG=C), the test program runs without error.

goedel[39]% setenv LANG C
goedel[40]% java DecodeTests
goedel[41]% 


 2) JDK version dependency
    This issue occurs only in 1.4.2 (_0X); it does not occur in 1.4.1 or 1.5 b32.
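
The locale dependency follows from the deprecated one-argument URLEncoder.encode(String), which uses the platform default encoding; under LANG=ja the stack trace above shows that resolving to the EUC_JP_Solaris converter. As a minimal sketch only (this is not the fix recorded in this report, and the report does not say whether it avoids the failure on an affected 1.4.2 build), the two-argument URLEncoder.encode(String, String) names the encoding explicitly. The class name ExplicitCharsetEncode and the helper encodeUtf8 are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ExplicitCharsetEncode {

    // U+10000, written as the same surrogate pair used in the test case.
    static final String SURROGATE_PAIR = "\uD800\uDC00";

    // Encodes with a named charset instead of the platform default,
    // so the result does not depend on LANG.
    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+10000 occupies four bytes in UTF-8: F0 90 80 80.
        System.out.println(encodeUtf8(SURROGATE_PAIR)); // prints %F0%90%80%80
    }
}
```

The four escaped bytes match the comment in the test program that a surrogate pair can be expressed as 4 bytes.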


CONFIGURATION:
  OS : SunOS goedel 5.8 Generic_108528-13 sun4u sparc SUNW,Ultra-60
  JDK : 1.4.2_02
    java version "1.4.2_02"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03)
    Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode)

==================================================================

Comments
CONVERTED DATA

BugTraq+ Release Management Values
  COMMIT TO FIX: 1.4.2_05 generic
  FIXED IN: 1.4.2_05
  INTEGRATED IN: 1.4.2_05
  VERIFIED IN: 1.4.2_05
2004-06-14

SUGGESTED FIX

In ./src/share/classes/sun/io/CharToByteEUC_JP_Solaris, between lines 97-107, an additional check should be put in along these lines:

    if (subMode) {
        outputSize = subBytes.length;
        if (byteOff + outputSize > outEnd)
            throw new ConversionBufferFullException();
        for (int i = 0; i < subBytes.length; i++)
            outputBytes[byteOff++] = subBytes[i];
        charOff += 1;
    } else {
        badInputLength = 1;
        throw new UnknownCharacterException();
    }
    inputSize = 2;

This ensures that, by default, if a surrogate pair straddles two successive invocations of the sun.io.CharToByteEUC_JP_Solaris converter (e.g., when the OutputStreamWriter write() method is called twice in succession without a flush, with one character of the surrogate pair per invocation), substitution happens and no exception is thrown.

###@###.### 2004-02-19
2004-02-19
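
The substitute-or-throw check in the suggested fix can be illustrated in isolation. The following is a toy sketch, not the sun.io source: the class SubstitutionSketch, the method convert, and the exception UnmappableCharException are all hypothetical, and the "ASCII maps to itself, everything else is unmappable" rule is an assumption made only to keep the example small. It shows each half of a surrogate pair arriving in its own invocation and being replaced by the substitution byte when substitution mode is on.

```java
import java.util.Arrays;

public class SubstitutionSketch {

    // Default substitution byte named in the evaluation: '?' (0x3f).
    static final byte[] SUB_BYTES = { 0x3f };

    /** Hypothetical stand-in for sun.io.UnknownCharacterException. */
    static class UnmappableCharException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    // Toy converter (assumption): ASCII maps to itself, anything else is
    // unmappable. In substitution mode an unmappable char is replaced by
    // SUB_BYTES; otherwise the converter throws.
    static byte[] convert(char[] input, boolean subMode) {
        byte[] output = new byte[input.length * SUB_BYTES.length];
        int byteOff = 0;
        for (int charOff = 0; charOff < input.length; charOff++) {
            char c = input[charOff];
            if (c < 0x80) {
                output[byteOff++] = (byte) c;
            } else if (subMode) {
                for (int i = 0; i < SUB_BYTES.length; i++) {
                    output[byteOff++] = SUB_BYTES[i];
                }
            } else {
                throw new UnmappableCharException();
            }
        }
        return Arrays.copyOf(output, byteOff);
    }

    public static void main(String[] args) {
        // Each half of the pair is passed in a separate invocation.
        byte[] high = convert(new char[] { '\uD800' }, true);
        byte[] low = convert(new char[] { '\uDC00' }, true);
        System.out.println((char) high[0] + "" + (char) low[0]); // prints ??
    }
}
```

With subMode set to false, the same input throws, which corresponds to the else branch of the suggested check.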

EVALUATION

Found the cause of the problem. In URLEncoder.java, the writer used in 1.4.1 was a BufferedWriter, but in 1.4.2 it was changed to an OutputStreamWriter. OutputStreamWriter handles each half of the surrogate pair separately and gives UnknownCharacterException. I changed the writer in the URLEncoder class back to BufferedWriter and it works fine. I am not sure why BufferedWriter was removed in 1.4.2.

###@###.### 2004-02-18

The problem is actually within the EUC_JP_Solaris converter implementation (added as part of 4765370). The converter does not correctly perform substitution when it encounters a surrogate pair that straddles two successive invocations of the converter's convert(...) method. EUC_JP_Solaris does not support encoding of surrogates, but by default the converter should map any occurrence of them to the default substitution byte, 0x3f. See the suggested fix for more details.

Note that EUC_JP_Solaris support in 1.5.0 is provided by a java.nio charset implementation, which handles substitution correctly when surrogate pairs appear anywhere within the input. This bug is therefore exclusive to J2SE 1.4.2 and does not affect Tiger.

###@###.### 2004-02-19
2004-02-19
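
The scenario described in the evaluation, one character of a surrogate pair per write() call with and without a BufferedWriter in front, can be tried on a current JDK with the sketch below. It is an illustration under assumptions, not a reproduction of the 1.4.2 failure: US-ASCII is used as a stand-in charset because it is always available and likewise cannot encode U+10000, and the class SplitSurrogateWrite and method writeHalvesSeparately are hypothetical names. Since the report states that releases from 1.5.0 onward handle substitution correctly, no exception is expected here.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class SplitSurrogateWrite {

    // Writes the two halves of U+10000 in two separate write() calls,
    // flushes once, and returns the bytes that reached the sink.
    static byte[] writeHalvesSeparately(boolean buffered) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(sink, StandardCharsets.US_ASCII);
        if (buffered) {
            // Mirrors the 1.4.1 arrangement mentioned in the evaluation.
            w = new BufferedWriter(w);
        }
        try {
            w.write('\uD800'); // high surrogate
            w.write('\uDC00'); // low surrogate
            w.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        for (boolean buffered : new boolean[] { false, true }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : writeHalvesSeparately(buffered)) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println("buffered=" + buffered + ": " + hex.toString().trim());
        }
    }
}
```

Every byte printed should be the substitution byte 0x3f in both modes, which is the behavior the suggested fix aims to restore for the 1.4.2 converter.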