Bug ID: JDK-2213354 Incorrect UTF8 conversion for sequence ED 31

Type: Backport
Backport of: JDK-7082884
Component: core-libs
Sub-Component: java.nio.charsets

Priority: P3
Status: Closed
Resolution: Fixed

Submitted: 2011-08-24
Updated: 2012-05-30
Resolved: 2012-05-30

JDK 8
8 b14Fixed

EVALUATION Unicode Standard added "Addition Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated in 6.0 again with further "clarification" [2] regarding how a "conformance" implementation should handle ill-formed UTF-8 byte sequence. Basically it says (1) the conversion process should not interprets any ill-formed code unit sequence *(2) such process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences (3) and recommend a "best practice" of "maximal valid subpart" for replacement The new UTF-8 charset implementation we put in JDK7 (and back-ported to previous release since then) follows the new constraints in most cases. Except 2 corner cases (we are aware of so far for now"). 7082884 is one of them. The current implementation decode new String(new byte[]{(byte)0xed, 31}, "UTF8") into one single char "\ufffd". while it should be "\ufffd\u001f" instead, according to the new constraints (the first 0xed is ill-formed, and the second 0x1f is still valid non-ill-formed, so it should not be consumed, even the first byte 0xed is a leading byte a three-byte utf-8 byte sequence). The reason I called it a "corner case" is because then new UTF-8 implementation handles it correctly in most cases, for example new String(new byte[]{(byte)0xed, 31, 'a'}, "UTF8"); does return the expected result "\ufffd\u001f\u0061" The corner case here is that the 0xed is the leading byte of a three-byte utf-8 byte sequence, but we actually only 2 bytes total in pipe. The current UTF-8 decoder implementation will not even look into the following bytes when it has a leading byte of a 3-byte utf-8 sequence and it has less than 3 bytes to work on. In this case it simply returns "underflow", means "I need more bytes". Unfortunately its upper level implementation, CharsetDecoder, will simply treat this "underflow" status as a malformed byte sequence of "2" (it's reasonable for CharsetDecoder to make such a decision as well, see the decoder does not have enough bytes to handle these 2 bytes, and we don't have any more bytes following, so the rest are all "malformed"). The fix is to further look into the following bytes when we have a leading byte, even don't have enough bytes to complete the conversion. The webrev is at http://cr.openjdk.java.net/~sherman/7082884/webrev Another corner case is how to deal with the old 5-6 bytes byte sequence, such as "fc 80 80 8f bf bf", we are now treating them as 1 malformed utf-8 byte sequence, so any of these 5-6 bytes "old" formed will be treated one malformed character and then replaced by one "\ufffd". But according to the new "best practice" recommendation, it probably should be replaced by 6 \ufffd. if I understand the recommendation correctly. Personally I feel the existing implementation is a more reasonable approach, opinion?

24-08-2011