Bug ID: JDK-4297837 Silent Recovery from bad UTF-8

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 1.2.0

Priority: P4
Status: Closed
Resolution: Duplicate
OS: generic
CPU: generic

Submitted: 1999-12-08
Updated: 2002-01-16
Resolved: 2002-01-16

Name: mf23781			Date: 12/08/99


Java. At least the Sun JDK 1.2 UTF-8 converter will gladly convert 
0xED 0xA0 0x88 0xED 0xBD 0x85 (bad UTF-8 for U+12345) to 
"\uD808\uDF45" (correct UTF-16 for the same).
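For comparison, the following is a minimal sketch of how the same two byte sequences can be checked against a strict decoder. It assumes a current JDK (7 or later, for StandardCharsets); it cannot be run on JDK 1.2, which predates java.nio. The class name StrictUtf8Check and the helper isWellFormedUtf8 are hypothetical names chosen for this illustration. On current JDKs the six-byte form is expected to be rejected and the four-byte form accepted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    // Hypothetical helper: returns true only if the JDK's UTF-8 decoder
    // accepts the whole input without reporting malformed bytes.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed input is reported as an exception, not silently repaired.
            return false;
        }
    }

    public static void main(String[] args) {
        // Each surrogate of U+12345 encoded separately as a 3-byte sequence.
        byte[] sixByteForm = {
            (byte) 0xED, (byte) 0xA0, (byte) 0x88,
            (byte) 0xED, (byte) 0xBD, (byte) 0x85
        };
        // U+12345 encoded directly as one 4-byte sequence.
        byte[] fourByteForm = {
            (byte) 0xF0, (byte) 0x92, (byte) 0x8D, (byte) 0x85
        };
        System.out.println("six-byte form well-formed:  " + isWellFormedUtf8(sixByteForm));
        System.out.println("four-byte form well-formed: " + isWellFormedUtf8(fourByteForm));
    }
}
```

Setting CodingErrorAction.REPORT is what turns a bad sequence into an exception; the convenience constructor new String(bytes, charset) substitutes replacement characters instead.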

>  Are such programs considered useful or harmful?

Good question!  It can "repair" some things that otherwise wouldn't 
work, but security and reliability often depend on some things *not* 
working when they shouldn't.  Similar to non-minimal UTF-8 encodings 
(of ASCII nulls, for instance).

BTW, I had thought that Java had no support at all for UTF-16, but 
I just verified that JDK 1.2 will correctly transcode "\uD808\uDF45" 
to UTF-8 0xF0 0x92 0x8D 0x85.  Converting back, however, yields 
nothing: no characters, no exception, nothing at all.
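The round trip described above can be reproduced with a short sketch like the one below. It assumes a current JDK (7 or later) rather than JDK 1.2, so it shows the repaired behaviour, not the failure reported here. SurrogateRoundTrip and toHex are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateRoundTrip {
    // Hypothetical helper: formats bytes as "0xF0 0x92 ..." for comparison
    // with the byte values quoted in the report.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\uD808\uDF45";            // UTF-16 surrogate pair for U+12345
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println("encoded: " + toHex(utf8));
        System.out.println("round trip preserved: " + decoded.equals(original));
        System.out.println("code point: U+"
                + Integer.toHexString(decoded.codePointAt(0)).toUpperCase());
    }
}
```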

(Review ID: 98805)

======================================================================

EVALUATION

There are two issues covered by this bug.

First, the UTF-8 decoder in Java 1.4 and earlier does decode a six-byte representation of a surrogate pair. A separate bug (4486841) tracks the possible inclusion of a stricter decoder, in accordance with the corrigendum of Unicode 3.0.1, which would reject the six-byte representation and decode only the four-byte representation of surrogate pairs in UTF-8.

Second, prior to 1.4, a single-char read from an InputStreamReader returned (char) -1 when the input byte stream contained a legal four-byte UTF-8 encoding of a surrogate pair. This has been fixed in 1.4, so that subsequent single-char reads return the high and low surrogate chars. See integrated bug ID 4251997.

###@###.### 2002-01-16
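The second issue in the evaluation concerns single-char reads through an InputStreamReader. The sketch below reads the four-byte encoding of U+12345 one char at a time. It assumes a current JDK (7 or later); SingleCharReads and readAllChars are hypothetical names for this illustration. With the fix described above, the reads are expected to yield the high surrogate 0xD808 followed by the low surrogate 0xDF45.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SingleCharReads {
    // Hypothetical helper: calls Reader.read() repeatedly and collects each
    // returned char value until end of stream (-1).
    static List<Integer> readAllChars(byte[] utf8) {
        List<Integer> chars = new ArrayList<>();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                chars.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chars;
    }

    public static void main(String[] args) {
        // Four-byte UTF-8 encoding of U+12345.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x92, (byte) 0x8D, (byte) 0x85};
        for (int c : readAllChars(utf8)) {
            System.out.println("0x" + Integer.toHexString(c).toUpperCase());
        }
    }
}
```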

16-01-2002

Duplicate :	JDK-4251997 - UTF-8 Surrogate Decoding is Broken
Relates :	JDK-4391895 - UTF8 Decoder Broken