Bug ID: JDK-4974686 Only in 1.4.2, URLEncoder outputs exception in Japanese locale

Details
Type:
Bug
Submit Date:
2004-01-07
Status:
Closed
Updated Date:
2004-04-22
Project Name:
JDK
Resolved Date:
2004-03-30
Component:
core-libs
OS:
solaris_8
Sub-Component:
java.nio.charsets
CPU:
sparc
Priority:
P3
Resolution:
Fixed
Affected Versions:
1.4.2_02
Fixed Versions:
1.4.2_05 (05)


Description
The attached program fails with an Error caused by UnknownCharacterException when run in a Japanese locale.

REPRODUCE:

 (1) Compile the following program with -deprecation.

====>
import java.io.PrintWriter;
import java.net.*;
import java.io.*;

public class DecodeTests {

    final static String stringSetUTF = 
        "\uD800\uDC00";

               // this surrogate pair (U+10000) encodes as 4 bytes in UTF-8

    /* standalone interface */
    public static void main(String argv[]) {
        String encoded = URLEncoder.encode(stringSetUTF);
    }
}

<===
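
For comparison, the reproducer calls the deprecated one-argument URLEncoder.encode(String), which uses the platform default encoding and therefore depends on the locale. The following is a minimal sketch, not part of the original report, that uses the two-argument overload with an explicit "UTF-8" charset name to show the 4-byte form mentioned in the program's comment. The class name SurrogateUrlEncoding and its helper method are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative class name; not part of the original report.
final class SurrogateUrlEncoding {

    // U+10000, written as the same surrogate pair the reproducer uses.
    static final String SURROGATE_PAIR = "\uD800\uDC00";

    // Encodes with an explicit charset name so the result does not depend
    // on the platform default encoding (and therefore not on LANG).
    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+10000 is F0 90 80 80 in UTF-8, so this prints %F0%90%80%80.
        System.out.println(encodeUtf8(SURROGATE_PAIR));
    }
}
```

With an explicit charset the output is the same under LANG=ja and LANG=C, which is why the one-argument overload is the part of the reproducer that exposes the locale dependency.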


 (2) Run "java DecodeTests" with LANG=ja;
     the following exception is thrown.
 
goedel[38]% java DecodeTests
Exception in thread "main" java.lang.Error: UnknownCharacterException thrown in substititution mode
        at sun.io.CharToByteConverter.convertAny(CharToByteConverter.java:160)
        at sun.nio.cs.StreamEncoder$ConverterSE.implWrite(StreamEncoder.java:210)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:124)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:178)
        at java.net.URLEncoder.encode(URLEncoder.java:234)
        at java.net.URLEncoder.encode(URLEncoder.java:149)
        at DecodeTests.main(DecodeTests.java:15)
Caused by: sun.io.UnknownCharacterException
        at sun.io.CharToByteEUC_JP_Solaris.convert(CharToByteEUC_JP_Solaris.java:97)
        at sun.io.CharToByteConverter.convertAny(CharToByteConverter.java:139)
        ... 7 more

NOTE: 

 1) Locale dependency

   If you change the locale to English (LANG=C), the test program runs without error.

goedel[39]% setenv LANG C
goedel[40]% java DecodeTests
goedel[41]% 


 2) JDK version dependency
    This issue occurs only in 1.4.2 (including the _0X update releases), not in 1.4.1 or 1.5 b32.


CONFIGURATION:
  OS : SunOS goedel 5.8 Generic_108528-13 sun4u sparc SUNW,Ultra-60
  JDK : 1.4.2_02
    java version "1.4.2_02"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_02-b03)
    Java HotSpot(TM) Client VM (build 1.4.2_02-b03, mixed mode)

==================================================================

                                    

Comments
SUGGESTED FIX

./src/share/classes/sun/io/CharToByteEUC_JP_Solaris

Between lines 97 and 107, an additional check should be added along these lines:

if (subMode) {
    outputSize = subBytes.length;
    if (byteOff + outputSize > outEnd)
        throw new ConversionBufferFullException();
    for (int i = 0; i < subBytes.length; i++)
        outputBytes[byteOff++] = subBytes[i];
    charOff += 1;
} else {
    badInputLength = 1;
    throw new UnknownCharacterException();
}
inputSize = 2;


This ensures that substitution happens by default, and no exception is thrown,
when a surrogate pair straddles two successive invocations of the
sun.io.CharToByteEUC_JP_Solaris converter. That situation arises, for example,
when the OutputStreamWriter write() method is called twice in succession
without a flush, once for each character of the surrogate pair.
###@###.### 2004-02-19
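
The effect of the subMode branch can be sketched in isolation. The class below is a hypothetical stand-in, not the sun.io.CharToByteEUC_JP_Solaris source: it treats only ASCII as mappable and assumes that every other char, including a lone surrogate half, is either replaced by the substitution byte 0x3f or rejected, depending on a subMode flag. All names in it are illustrative.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for a sun.io-style converter; names are illustrative.
final class SubstitutionSketch {

    static final byte SUBSTITUTION_BYTE = 0x3f; // ASCII '?'

    // Converts one batch of chars. Only ASCII is treated as mappable here;
    // any other char (including a lone surrogate half) is substituted when
    // subMode is true and rejected with an exception when it is false.
    static byte[] convert(char[] input, boolean subMode) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char c : input) {
            if (c < 0x80) {
                out.write(c);
            } else if (subMode) {
                out.write(SUBSTITUTION_BYTE);
            } else {
                throw new IllegalArgumentException(
                        "unmappable char U+" + Integer.toHexString(c).toUpperCase());
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The two halves of U+10000 arrive in separate invocations,
        // as they do when write() is called once per char.
        byte[] first = convert(new char[] {'\uD800'}, true);
        byte[] second = convert(new char[] {'\uDC00'}, true);
        System.out.println((char) first[0] + "" + (char) second[0]); // prints ??
    }
}
```

As in the suggested fix, each half of a split pair is substituted on its own, so a pair that straddles two calls produces two substitution bytes and no exception.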
                                     
2004-02-19
EVALUATION

Found the cause of the problem. In URLEncoder.java, the writer used in 1.4.1 was a BufferedWriter, but in 1.4.2 it was changed to an OutputStreamWriter. The OutputStreamWriter handles each 16-bit half of the surrogate pair separately and throws UnknownCharacterException. I changed the writer in the URLEncoder class back to a BufferedWriter and it works fine.

I am not sure why BufferedWriter was removed in 1.4.2.

###@###.### 2004-02-18
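
The write pattern described here, one write() call per char of the pair, can be sketched as follows. This is an illustrative comparison, not the URLEncoder source, and the class and method names are hypothetical. It assumes a current JDK, where OutputStreamWriter is backed by a java.nio encoder that carries a pending high surrogate over to the next write; it therefore does not reproduce the 1.4.2 failure, it only shows the two call patterns side by side.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative class name; not taken from URLEncoder.
final class SplitWriteSketch {

    // Writes each char with its own write(int) call, so the two halves of
    // a surrogate pair reach the encoder in separate invocations.
    static byte[] writeCharByChar(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            for (int i = 0; i < s.length(); i++) {
                w.write(s.charAt(i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Writes the whole string in a single call.
    static byte[] writeWhole(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String pair = "\uD800\uDC00"; // U+10000
        // Reports whether both call patterns produced the same bytes.
        System.out.println(Arrays.equals(writeCharByChar(pair), writeWhole(pair)));
    }
}
```

A BufferedWriter in front of the OutputStreamWriter hides the split because it hands both chars to the underlying writer in one call, which matches the observation that restoring it made the exception disappear.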

The problem is actually within the EUC_JP_Solaris converter implementation
(added as part of 4765370). The converter does not correctly perform
substitution when it encounters a surrogate pair which straddles two
successive invocations of the converter's convert(...) method.
EUC_JP_Solaris doesn't support encoding of surrogates, but the converter
should by default map any occurrence of them to the default substitution
byte, 0x3f ('?'). See the suggested fix for more details.

Note that the EUC_JP_Solaris support in 1.5.0 is provided by a java.nio charset
implementation which does handle substitution correctly when surrogate pairs
appear anywhere within the input. This bug is therefore exclusive to J2SE 1.4.2
and does not affect Tiger (1.5.0).

###@###.### 2004-02-19
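
The java.nio substitution behaviour referred to above can be sketched with the public CharsetEncoder API. This is a minimal illustration under one stated assumption: US-ASCII is used as a stand-in charset because the EUC_JP_Solaris charset may not be installed in every runtime. Like EUC_JP_Solaris, US-ASCII cannot encode U+10000, so the encoder falls back to its replacement bytes. The class name is illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative class name; US-ASCII stands in for EUC_JP_Solaris.
final class NioSubstitutionSketch {

    // Encodes s with the given charset, replacing anything the charset
    // cannot represent instead of reporting an error.
    static byte[] encodeWithReplacement(String s, Charset cs) {
        try {
            ByteBuffer bb = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithReplacement("a\uD800\uDC00b", StandardCharsets.US_ASCII);
        // The surrogate pair is replaced by the encoder's replacement byte(s).
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

The default replacement for such an encoder is the single byte 0x3f, the same substitution byte that the 1.4.2 converter was expected to emit.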
                                     
2004-02-19
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
1.4.2_05
generic

FIXED IN:
1.4.2_05

INTEGRATED IN:
1.4.2_05

VERIFIED IN:
1.4.2_05


                                     
2004-06-14


