JDK-4297837 : Silent Recovery from bad UTF-8
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 1.2.0
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: generic
  • CPU: generic
  • Submitted: 1999-12-08
  • Updated: 2002-01-16
  • Resolved: 2002-01-16
Related Reports
Duplicate :  
Relates :  
Description

Name: mf23781			Date: 12/08/99


Java. At least the Sun JDK 1.2 UTF-8 converter will gladly convert 
0xED 0xA0 0x88 0xED 0xBD 0x85 (bad UTF-8 for U+12345) to 
"\uD808\uDF45" (correct UTF-16 for the same).

>  Are such programs considered useful or harmful?

Good question!  It can "repair" some things that otherwise wouldn't 
work, but security and reliability often depend on some things *not* 
working when they shouldn't.  Similar to non-minimal UTF-8 encodings 
(of ASCII nulls, for instance).
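
Note: the strictness concern above can be illustrated with a minimal sketch. It assumes a current JDK and the java.nio.charset API, which postdates the JDK 1.2 converter discussed in this report; the class and method names are hypothetical. A decoder configured with CodingErrorAction.REPORT is expected to reject both the six-byte surrogate form and the non-minimal 0xC0 0x80 encoding of NUL, while accepting the four-byte form of U+12345.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
final class StrictUtf8Check {
    // Returns true only if the strict decoder accepts every byte of the input.
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable input was reported
        }
    }

    public static void main(String[] args) {
        byte[] sixByte = {(byte) 0xED, (byte) 0xA0, (byte) 0x88,
                          (byte) 0xED, (byte) 0xBD, (byte) 0x85};
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};
        byte[] fourByte = {(byte) 0xF0, (byte) 0x92, (byte) 0x8D, (byte) 0x85};

        System.out.println(isWellFormedUtf8(sixByte));     // expected: false
        System.out.println(isWellFormedUtf8(overlongNul)); // expected: false
        System.out.println(isWellFormedUtf8(fourByte));    // expected: true
    }
}
```

Reporting the error, rather than silently replacing or repairing the input, leaves the decision about bad data to the caller.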

BTW, I had thought that Java had no support at all for UTF-16, but 
I just verified that JDK 1.2 will correctly transcode "\uD808\uDF45" 
to UTF-8 0xF0 0x92 0x8D 0x85.  Converting back, however, yields 
nothing: no characters, no exception, nothing at all.
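
Note: the byte sequences quoted in this report can be reproduced by hand. The following is a sketch of the bit arithmetic only, not the converter's actual implementation; the class and method names are hypothetical, and Character.toChars requires Java 5 or later. It derives the surrogate pair for U+12345, then encodes it once as two 3-byte sequences (the six-byte form) and once as a single 4-byte sequence.

```java
// Hypothetical illustration of the encodings discussed in this report.
final class SurrogateBytes {
    // Encode one 16-bit code unit in the pattern 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] threeByteForm(char unit) {
        return new byte[] {
            (byte) (0xE0 | (unit >> 12)),
            (byte) (0x80 | ((unit >> 6) & 0x3F)),
            (byte) (0x80 | (unit & 0x3F))
        };
    }

    // Six-byte form: each surrogate of the pair is encoded separately.
    static byte[] sixByteForm(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        char[] pair = Character.toChars(codePoint); // high surrogate, low surrogate
        byte[] out = new byte[6];
        System.arraycopy(threeByteForm(pair[0]), 0, out, 0, 3);
        System.arraycopy(threeByteForm(pair[1]), 0, out, 3, 3);
        return out;
    }

    // Four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx over the code point itself.
    static byte[] fourByteForm(int codePoint) {
        return new byte[] {
            (byte) (0xF0 | (codePoint >> 18)),
            (byte) (0x80 | ((codePoint >> 12) & 0x3F)),
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),
            (byte) (0x80 | (codePoint & 0x3F))
        };
    }

    // Format bytes as space-separated 0xNN values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(sixByteForm(0x12345)));  // 0xED 0xA0 0x88 0xED 0xBD 0x85
        System.out.println(hex(fourByteForm(0x12345))); // 0xF0 0x92 0x8D 0x85
    }
}
```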

(Review ID: 98805)

======================================================================

Comments
EVALUATION

There are two issues covered by this bug.

First, it is true that the UTF-8 decoder in Java <= 1.4 will decode a six-byte representation of a surrogate pair. A separate bug (4486841) tracks the possible inclusion of a stricter decoder, in accordance with the Unicode 3.0.1 corrigendum, which would reject the 6-byte representation and decode only the 4-byte representation of surrogate pairs in UTF-8.

Second, prior to 1.4 a single-char read from an InputStreamReader would return (char) -1 when the input byte stream contained a legal 4-byte surrogate pair encoded in UTF-8. This has been fixed in 1.4, so that successive single-char reads return the high and low surrogate chars. See integrated bug ID 4251997.

2002-01-16
16-01-2002
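
Note: a minimal sketch of the single-char read case from the evaluation. It assumes a JDK that includes the 1.4 fix and uses later conveniences (StandardCharsets, try-with-resources, UncheckedIOException), so it will not compile on the releases this bug was filed against. The class and method names are hypothetical. Reading the 4-byte encoding of U+12345 one char at a time is expected to yield the high surrogate, then the low surrogate, then end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reproduction helper, not taken from the original report.
final class SingleCharReads {
    // Collects every value returned by Reader.read() until end of stream (-1).
    static List<Integer> readOneAtATime(byte[] utf8) {
        List<Integer> units = new ArrayList<>();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                units.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return units;
    }

    public static void main(String[] args) {
        byte[] fourByte = {(byte) 0xF0, (byte) 0x92, (byte) 0x8D, (byte) 0x85};
        for (int unit : readOneAtATime(fourByte)) {
            System.out.printf("0x%04X%n", unit); // expected: 0xD808 then 0xDF45
        }
    }
}
```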