EVALUATION
The Unicode Standard added "Additional Constraints on conversion of ill-formed UTF-8"
in version 5.1 [1] and updated them in 6.0 with further "clarification" [2] regarding
how a "conformant" implementation should handle ill-formed UTF-8 byte
sequences. Basically it says
(1) the conversion process should not interpret any ill-formed code unit sequence
(2) such a process must not treat any adjacent well-formed code unit sequences
as being part of those ill-formed code unit sequences
(3) and it recommends a "best practice" of "maximal valid subpart" for replacement
The new UTF-8 charset implementation we put in JDK7 (and back-ported to previous
releases since then) follows the new constraints in most cases, except for 2 corner cases
(that we are aware of so far). 7082884 is one of them.
The current implementation decodes
new String(new byte[]{(byte)0xed, 31}, "UTF8")
into one single char "\ufffd",
while it should be "\ufffd\u001f" instead, according to the new constraints (the first
byte 0xed is ill-formed on its own, and the second byte 0x1f is still well-formed, so it
should not be consumed, even though the first byte 0xed is the leading byte of a three-byte
UTF-8 byte sequence).
The reason I call it a "corner case" is that the new UTF-8 implementation handles
it correctly in most cases, for example
new String(new byte[]{(byte)0xed, 31, 'a'}, "UTF8");
does return the expected result "\ufffd\u001f\u0061".
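A minimal sketch for reproducing the two decodes above on whatever JDK is at hand.
The class and helper names (Utf8TruncatedLead, decode, escape) are made up for
illustration, and StandardCharsets.UTF_8 is used in place of the "UTF8" name only to
avoid the checked exception. On a JDK without the fix the first line printed should be
a single replacement character; with the fix it should show the replacement character
followed by 0x1f.

```java
import java.nio.charset.StandardCharsets;

public class Utf8TruncatedLead {
    // Render every UTF-16 code unit as a backslash, 'u' and four hex digits,
    // so that replacement and control characters are visible in the output.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Decode with the String constructor, which replaces malformed input
    // with U+FFFD instead of throwing.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The corner case: a three-byte lead (0xed) with only one byte after it.
        System.out.println(escape(decode(new byte[] {(byte) 0xed, 31})));
        // The same prefix with one more byte in the pipe.
        System.out.println(escape(decode(new byte[] {(byte) 0xed, 31, 'a'})));
    }
}
```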
The corner case here is that 0xed is the leading byte of a three-byte UTF-8 byte sequence,
but we actually have only 2 bytes in total in the pipe. The current UTF-8 decoder implementation
will not even look into the following bytes when it has a leading byte of a 3-byte UTF-8
sequence and it has fewer than 3 bytes to work on. In this case it simply returns "underflow",
which means "I need more bytes". Unfortunately its upper-level implementation, CharsetDecoder,
will simply treat this "underflow" status as a malformed byte sequence of length "2" (it's reasonable
for CharsetDecoder to make such a decision as well: the decoder does not have enough
bytes to handle these 2 bytes, and we don't have any more bytes following, so the rest are
all "malformed").
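The underflow-to-malformed step described above can be observed through the public
CharsetDecoder API. The sketch below is only a probe, not the decoder's internals: the
class and method names (UnderflowAtEndOfInput, decodeOnce) are hypothetical. It feeds a
lone 0xed lead byte, first saying more input may follow and then saying the input is
complete, and finally feeds the two-byte input from this report. For that last call the
reported malformed length depends on whether the JDK has the fix.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UnderflowAtEndOfInput {
    // Run a single decode() call with REPORT actions so the raw CoderResult
    // is returned instead of being turned into a replacement character.
    static CoderResult decodeOnce(byte[] bytes, boolean endOfInput) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(16), endOfInput);
    }

    public static void main(String[] args) {
        byte[] loneLead = {(byte) 0xed};
        // More input may follow: the decoder asks for more bytes (underflow).
        System.out.println(decodeOnce(loneLead, false));
        // No more input: leftover bytes are reported as malformed.
        System.out.println(decodeOnce(loneLead, true));
        // The input from this report; the malformed length is JDK-dependent.
        System.out.println(decodeOnce(new byte[] {(byte) 0xed, 31}, true));
    }
}
```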
The fix is to look further into the following bytes when we have a leading byte, even if we
don't have enough bytes to complete the conversion.
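As a rough illustration of that idea (this is not the code in the webrev; the class and
method names TruncatedLeadCheck, isContinuation and malformedLength are hypothetical),
a decoder that sees a three-byte lead with fewer than three bytes available could
inspect the second byte, if present, before deciding between "malformed" and
"underflow". The second-byte ranges checked below are the standard well-formedness
rules for UTF-8 three-byte sequences.

```java
public class TruncatedLeadCheck {
    // A continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xc0) == 0x80;
    }

    /**
     * For a three-byte lead at src[pos] with fewer than three bytes available
     * before limit: returns 1 if the lead byte can already be reported as a
     * malformed sequence of length 1, or 0 if the decoder must wait for more
     * input (underflow).
     */
    static int malformedLength(byte[] src, int pos, int limit) {
        int b1 = src[pos] & 0xff;
        if (limit - pos < 2) {
            return 0; // only the lead byte is present, nothing to inspect yet
        }
        int b2 = src[pos + 1] & 0xff;
        boolean badSecond = !isContinuation(b2)
                || (b1 == 0xe0 && b2 < 0xa0)   // overlong encoding
                || (b1 == 0xed && b2 >= 0xa0); // UTF-16 surrogate range
        return badSecond ? 1 : 0;
    }

    public static void main(String[] args) {
        // 0x1f cannot continue 0xed, so 0xed alone is malformed.
        System.out.println(malformedLength(new byte[] {(byte) 0xed, 31}, 0, 2));
        // 0x80 is a valid second byte, so the decoder still needs a third byte.
        System.out.println(malformedLength(new byte[] {(byte) 0xed, (byte) 0x80}, 0, 2));
    }
}
```

With a check like this the second byte is left unconsumed and gets decoded on its own,
which is what constraint (2) asks for.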
The webrev is at
http://cr.openjdk.java.net/~sherman/7082884/webrev
Another corner case is how to deal with the old 5-6 byte sequences, such as
"fc 80 80 8f bf bf". We are now treating them as 1 malformed UTF-8 byte sequence, so any
of these "old" 5-6 byte forms will be treated as one malformed character and then replaced
by one "\ufffd". But according to the new "best practice" recommendation, it probably should
be replaced by 6 "\ufffd", if I understand the recommendation correctly. Personally I feel the
existing implementation is the more reasonable approach. Opinions?
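To see which of the two behaviours a given JDK has, a small probe such as the one
below can count the replacement characters produced for the example sequence. The
class and method names (LegacySixByteProbe, countReplacements) are hypothetical. A
count of 1 corresponds to the existing behaviour described here, and a count of 6 to
one replacement per byte.

```java
import java.nio.charset.StandardCharsets;

public class LegacySixByteProbe {
    // Decode with replacement and count how many U+FFFD characters come out.
    static int countReplacements(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == 0xFFFD) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // The old six-byte form quoted above: fc 80 80 8f bf bf.
        byte[] legacy = {
            (byte) 0xfc, (byte) 0x80, (byte) 0x80,
            (byte) 0x8f, (byte) 0xbf, (byte) 0xbf
        };
        System.out.println(countReplacements(legacy));
    }
}
```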