Bug ID: JDK-6521166 Exception when opening file URLConnection with percent encoded 4 byte UTF8 char

Type: Bug
Component: core-libs
Sub-Component: java.net
Affected Version: 6

Priority: P3
Status: Closed
Resolution: Fixed
OS: windows_xp
CPU: x86

Submitted: 2007-02-05
Updated: 2011-05-17
Resolved: 2011-05-17

Versions (Unresolved/Resolved/Fixed)

JDK 7: 7 b10 (Fixed)

When a URL includes a surrogate-pair character that the program has percent-encoded as UTF-8,
opening it through URLConnection throws an exception.

REPRODUCE:
Compile the attached TEST.java and invoke java TEST. The following exception shows up.

K:\shares2\hitachi\URLencoding-UTF8>java TEST
1st connect...
URL=file:///c:/temp/%F0%A1%88%BD.xml
java.lang.IllegalArgumentException
        at sun.net.www.ParseUtil.decode(ParseUtil.java:185)
        at sun.net.www.protocol.file.Handler.openConnection(Handler.java:65)
        at sun.net.www.protocol.file.Handler.openConnection(Handler.java:55)
        at java.net.URL.openConnection(URL.java:945)
        at TEST.main(TEST.java:12)
2nd connect...
succeeded
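
The attached TEST.java is not reproduced in this report. As a rough illustration only, the first connect could be sketched as below. The class name ReproSketch and the helper failsToOpen are hypothetical and not taken from the attachment; only the URL string comes from the output above. On a release that contains the fix, the call is expected to succeed rather than throw.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;

// Hypothetical stand-in for the first connect in the attached TEST.java.
final class ReproSketch {

    // Returns true if openConnection() fails with IllegalArgumentException,
    // which is the failure shown in the stack trace above.
    static boolean failsToOpen(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            // The file handler decodes the percent-encoded path here;
            // openConnection() does not read the file itself.
            url.openConnection();
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String spec = "file:///c:/temp/%F0%A1%88%BD.xml";
        System.out.println("URL=" + spec);
        System.out.println(failsToOpen(spec) ? "IllegalArgumentException" : "succeeded");
    }
}
```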

SUGGESTED FIX see sun.net.www.ParseUtil.decode

14-02-2007

EVALUATION sun.net.www.protocol.file.Handler calls sun.net.www.ParseUtil.decode to decode any percent-encoded characters. This fails with IllegalArgumentException because decode only handles UTF-8 sequences of up to 3 bytes. That is not sufficient to decode UTF-8 sequences of 4 bytes (or more), which is what the testcase passes in its first attempt. The testcase also tries to access the same Unicode character in its second attempt, but this time the percent encoding passed is actually a 6 byte modified UTF-8 encoding: the character is first represented as a surrogate pair (as in UTF-16), and then each surrogate is encoded with UTF-8 individually, in sequence. This works because the decoder sees it as two 3 byte UTF-8 sequences, and characters outside of the Basic Multilingual Plane are represented internally as surrogate pairs. The fix is to provide a more robust implementation of sun.net.www.ParseUtil.decode.

14-02-2007
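
To illustrate the kind of decoding the evaluation calls for, here is a minimal sketch that collects each run of percent escapes into bytes and hands them to the standard UTF-8 CharsetDecoder, which accepts 4 byte sequences. It is not the actual ParseUtil change; the class name PercentDecodeSketch and its decode method are hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the sun.net.www.ParseUtil implementation.
final class PercentDecodeSketch {

    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '%') {
                out.append(c);
                i++;
                continue;
            }
            // Gather the whole run of consecutive %XX escapes as raw bytes,
            // so a multi-byte UTF-8 sequence is decoded as one unit.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            while (i < s.length() && s.charAt(i) == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad hex digits at index " + i);
                }
                bytes.write((hi << 4) | lo);
                i += 3;
            }
            try {
                // A reporting decoder rejects malformed input instead of
                // silently substituting replacement characters.
                out.append(StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes.toByteArray())));
            } catch (CharacterCodingException e) {
                throw new IllegalArgumentException("malformed UTF-8 in percent escapes", e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("file:///c:/temp/%F0%A1%88%BD.xml");
        // The 4 byte sequence F0 A1 88 BD decodes to one supplementary code point,
        // stored in the String as a surrogate pair.
        int cp = decoded.codePointAt("file:///c:/temp/".length());
        System.out.println(String.format("U+%X", cp)); // prints U+2123D
    }
}
```

Because the decoder is configured to report malformed input, this sketch is stricter than the behavior described in the evaluation: it is expected to reject the 6 byte surrogate form that the second connect relies on. Whether a real fix should keep accepting that form is a design choice this example does not settle.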