Bug ID: JDK-4407610 java.net.URLDecoder.decode(st,"UTF-16") works incorrectly on '+' sign

Versions (Unresolved/Resolved/Fixed)

Other: 1.4.0 beta (Fixed)


Name: dfR10049			Date: 01/24/2001



URLDecoder.decode(st,"UTF-16") works incorrectly on the '+' sign. If this sign occurs in
the encoded string and the sequence that follows it contains encoded unsafe characters (%xy),
these characters are decoded incorrectly.

Please see an example demonstrating the bug below:
----------- Decoder.java ----------------
import java.net.*;

public class Decoder {

   public static void main(String args[]) {

        boolean passed = true;
        String enc = "UTF-16";
        String strings[] = {
            "\u0100\u0101", 
            "\u0100 \u0101", 
            "\u0100 \u0101\u0102", 
            "\u0100 \u0101 \u0102", 
            "\u0100\u0101\u0102",
        };

        try {
             for (int i = 0; i < strings.length; i++) {
                 String encoded = URLEncoder.encode(strings[i], enc);
                 System.out.println("ecnoded: " + encoded);
                 String decoded = URLDecoder.decode(encoded, enc);
                 System.out.print("init:    ");
                 printString(strings[i]);
                 System.out.print("decoded: ");
                 printString(decoded);
                 if (strings[i].equals(decoded)) {
                      System.out.println(" - correct - \n");
                 } else {
                      System.out.println(" - incorrect - \n");
                      passed = false;
                 }
             }
        } catch (Exception e) {
            System.out.println("  exception: " + e);
            passed = false; // an unexpected exception is also a failure
        }

        System.out.println("Test " + (passed ? "passed" : "failed"));
    }   

    static void printString(String s) {
         for (int i = 0; i < s.length(); i++) {
              System.out.print((int)s.charAt(i) + " ");
         }
         System.out.println();
    }
}
#----------------- output from the test ----------------------

#> java -version
java version "1.4.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b48)
Java HotSpot(TM) Client VM (build 1.4beta-B48, mixed mode)

#> java Decoder
ecnoded: %FF%FE%00%01%01%01
init:    256 257 
decoded: 256 257 
 - correct - 

ecnoded: %FF%FE%00%01+%01%01
init:    256 32 257 
decoded: 256 32 65533 
 - incorrect - 

ecnoded: %FF%FE%00%01+%01%01%02%01
init:    256 32 257 258 
decoded: 256 32 65533 
 - incorrect - 

ecnoded: %FF%FE%00%01+%01%01+%02%01
init:    256 32 257 32 258 
decoded: 256 32 65533 32 65533 
 - incorrect - 

ecnoded: %FF%FE%00%01%01%01%02%01
init:    256 257 258 
decoded: 256 257 258 
 - correct - 

Test failed

#-------------------------------------------------------------
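In the failing cases above, the escapes that follow a '+' form a separate run of %xy sequences, and only the first run in each string begins with %FF%FE. The following is a minimal sketch for inspecting this; the class EscapeRuns and its methods are hypothetical helpers written for illustration and are not part of java.net.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class EscapeRuns {

    // One or more consecutive %xy escapes.
    private static final Pattern RUN = Pattern.compile("(?:%[0-9A-Fa-f]{2})+");

    /** Returns each maximal run of %xy escapes found in an encoded string. */
    static List<String> escapeRuns(String encoded) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(encoded);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    /** True if the run starts with a UTF-16 byte-order mark in either byte order. */
    static boolean startsWithBom(String run) {
        String upper = run.toUpperCase(Locale.ROOT);
        return upper.startsWith("%FF%FE") || upper.startsWith("%FE%FF");
    }

    public static void main(String[] args) {
        // Second encoded string from the test output above.
        for (String run : escapeRuns("%FF%FE%00%01+%01%01")) {
            System.out.println(run + " starts with BOM: " + startsWithBom(run));
        }
    }
}
```

For the string used in main, the first run (%FF%FE%00%01) is reported as starting with the mark and the second run (%01%01) is not.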
======================================================================

CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: merlin-beta
FIXED IN: merlin-beta
INTEGRATED IN: merlin-beta

14-06-2004

EVALUATION

The problem here is that one encoder instance is created per call to URLEncoder.encode(), whereas URLDecoder.decode() creates a separate decoder instance for each encoded segment (i.e. each sequence of %xy%xy chars). For most encodings this is not a problem, because the encoded output does not depend on the state of the encoder instance. However, for UTF-16 (unlike UTF-16BE or UTF-16LE) a newly created encoder outputs a pair of bytes %FE%FF (the byte-order mark; the test output above shows it as %FF%FE) before the first character. Therefore, for this encoding the decoder gets out of step with the encoder.

The solution adopted here is to force URLEncoder.encode() to create a new encoder instance for each sequence of characters to be encoded, thus keeping the encoder and decoder in sync. This means the %FE%FF bytes will be generated at the beginning of each encoded sequence. The choice is somewhat arbitrary and only really underlines the point that UTF-16 should not be used as an encoding for URLs. However, we need to guarantee that the decoder produces the same string as the un-encoded original.

Note that the problem is not restricted to sequences separated by a '+'. It also occurs for any sequence separated by a non-reserved ASCII character.

michael.mcmahon@ireland 2001-01-26
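The two points made in the evaluation can be checked with a short sketch: a fresh UTF-16 encoding of a string carries two extra leading bytes compared with UTF-16BE, and on a JDK that contains this fix the encode/decode round trip is expected to reproduce the original string. The class Utf16BomSketch and its method names are hypothetical and exist only for this illustration; that the runtime in use includes the fix is an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical class, for illustration only.
class Utf16BomSketch {

    /** Byte count when the string is encoded as UTF-16 (byte-order mark included). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    /** Byte count when the string is encoded as UTF-16BE (no byte-order mark). */
    static int utf16beLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** True if encoding and then decoding with "UTF-16" gives back the original. */
    static boolean roundTrips(String s) {
        try {
            String encoded = URLEncoder.encode(s, "UTF-16");
            return s.equals(URLDecoder.decode(encoded, "UTF-16"));
        } catch (UnsupportedEncodingException e) {
            // UTF-16 is a required charset, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String one = "\u0100";
        System.out.println("UTF-16 bytes:   " + utf16Length(one));
        System.out.println("UTF-16BE bytes: " + utf16beLength(one));

        // Assumes a JDK that contains the fix described in this report.
        System.out.println("round trip ok:  " + roundTrips("\u0100 \u0101"));
    }
}
```

The two-byte difference between the two lengths is the byte-order mark that each new UTF-16 encoder emits. The sketch deliberately does not assert a byte order for the mark, since the evaluation and the 1.4.0-beta test output show it in different orders.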

26-01-2001

Relates :	JDK-4402456 - URLDecoder.decode(String s, String enc) fails with certain input
Relates :	JDK-6415062 - 30 MB memory trashed to get 30 kb string url encoded
Relates :	JDK-4725737 - REGRESSION: URLEncoder degraded performance in a multithreaded environment