JDK-6521166 : Exception when opening file URLConnection with percent encoded 4 byte UTF8 char
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.net
  • Affected Version: 6
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2007-02-05
  • Updated: 2011-05-17
  • Resolved: 2011-05-17
Fixed in: JDK 7 (7 b10)
Description
When a URL contains a character that requires a surrogate pair, and the program percent-encodes
it as UTF-8 and opens it through URLConnection, an exception is thrown.

REPRODUCE:
Compile the attached TEST.java and invoke java TEST. The following exception shows up.

K:\shares2\hitachi\URLencoding-UTF8>java TEST
1st connect...
URL=file:///c:/temp/%F0%A1%88%BD.xml
java.lang.IllegalArgumentException
        at sun.net.www.ParseUtil.decode(ParseUtil.java:185)
        at sun.net.www.protocol.file.Handler.openConnection(Handler.java:65)
        at sun.net.www.protocol.file.Handler.openConnection(Handler.java:55)
        at java.net.URL.openConnection(URL.java:945)
        at TEST.main(TEST.java:12)
2nd connect...
succeeded
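
The URL in the trace above percent-encodes a single supplementary code point, U+2123D, as one 4-byte UTF-8 sequence. The sketch below is only an illustration (the attached TEST.java is not reproduced here, and the class and method names are hypothetical). It prints that 4-byte form next to the surrogate-wise 6-byte form that the evaluation describes for the second connect. The second string is derived from that description, not copied from the test case.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK or of the attached test case.
final class SupplementaryEncodingDemo {

    /** Renders each byte as %XX with uppercase hex digits. */
    static String percentEncode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 3);
        for (byte b : bytes) {
            sb.append('%').append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Standard UTF-8: a supplementary code point becomes one 4-byte sequence. */
    static String standardUtf8(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return percentEncode(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Surrogate-wise form: each UTF-16 code unit is encoded on its own. */
    static String surrogateWise(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(percentEncode(encodeUnit(unit)));
        }
        return sb.toString();
    }

    /** Applies the UTF-8 bit layout to a single 16-bit unit, surrogates included. */
    static byte[] encodeUnit(char u) {
        if (u < 0x80) {
            return new byte[] {(byte) u};
        }
        if (u < 0x800) {
            return new byte[] {(byte) (0xC0 | (u >> 6)), (byte) (0x80 | (u & 0x3F))};
        }
        return new byte[] {
            (byte) (0xE0 | (u >> 12)),
            (byte) (0x80 | ((u >> 6) & 0x3F)),
            (byte) (0x80 | (u & 0x3F))
        };
    }

    public static void main(String[] args) {
        int cp = 0x2123D; // the code point encoded in the failing URL
        System.out.println(standardUtf8(cp));  // %F0%A1%88%BD
        System.out.println(surrogateWise(cp)); // %ED%A1%84%ED%B8%BD
    }
}
```

The surrogate-wise layout is the one that CESU-8 and Java's modified UTF-8 use for supplementary characters, which is why a decoder limited to 3-byte sequences can still read it.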

Comments
SUGGESTED FIX see sun.net.www.ParseUtil.decode
14-02-2007

EVALUATION sun.net.www.protocol.file.Handler calls sun.net.www.ParseUtil.decode to decode any percent-encoded characters. This fails with IllegalArgumentException because decode only handles UTF-8 sequences of up to 3 bytes. That is not sufficient to decode UTF-8 sequences of 4 bytes (or more), which is what the test case passes in its first attempt. In its second attempt the test case accesses the same Unicode character, but this time the percent encoding passed is a 6-byte modified UTF-8 encoding: the character is first represented as a surrogate pair (as in UTF-16), and each surrogate is then encoded in UTF-8 individually, in sequence. This works because the decoder sees the input as two 3-byte UTF-8 sequences, and characters outside the Basic Multilingual Plane are represented internally as surrogate pairs. The fix is to provide a more robust implementation of sun.net.www.ParseUtil.decode.
14-02-2007
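
As an illustration of what a more robust decoder could look like, the sketch below collects each run of percent escapes into bytes and hands the whole run to the standard UTF-8 CharsetDecoder, so sequence length is no longer capped at 3 bytes. It is a minimal sketch under stated assumptions, not the change that was made to sun.net.www.ParseUtil. The class name PercentDecoderSketch and its helpers are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; not the JDK implementation.
final class PercentDecoderSketch {

    /** Decodes %XX escapes as UTF-8; other characters are copied unchanged. */
    static String decode(String s) {
        if (s.indexOf('%') < 0) {
            return s; // nothing to decode
        }
        StringBuilder out = new StringBuilder(s.length());
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '%') {
                out.append(c);
                i++;
                continue;
            }
            // Gather the whole run of consecutive escapes before decoding,
            // so multi-byte sequences of any valid length stay together.
            run.reset();
            while (i < s.length() && s.charAt(i) == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                run.write((hex(s.charAt(i + 1)) << 4) | hex(s.charAt(i + 2)));
                i += 3;
            }
            out.append(decodeUtf8(run.toByteArray()));
        }
        return out.toString();
    }

    /** Strict UTF-8 decoding: malformed input is reported, not replaced. */
    private static String decodeUtf8(byte[] bytes) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return dec.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed UTF-8 in escape sequence", e);
        }
    }

    /** Accepts ASCII hex digits only. */
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("not a hex digit: " + c);
    }

    public static void main(String[] args) {
        String path = decode("file:///c:/temp/%F0%A1%88%BD.xml");
        // The four escaped bytes decode to one code point, U+2123D.
        System.out.println(Integer.toHexString(path.codePointAt(16))); // 2123d
    }
}
```

Delegating to CharsetDecoder keeps the byte-level UTF-8 rules out of the URL code. One design choice to note: this sketch reports malformed sequences as IllegalArgumentException, matching the exception type seen in the trace.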