JDK-8150449 : "A 'reversed byte-order mark' cannot occur within middle of stream" is not correct
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 8u66,9
  • Priority: P4
  • Status: Resolved
  • Resolution: Not an Issue
  • OS: generic
  • CPU: generic
  • Submitted: 2016-02-20
  • Updated: 2019-01-04
  • Resolved: 2016-05-26
Description
FULL PRODUCT VERSION :
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


A DESCRIPTION OF THE PROBLEM :
In reference to: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/cs/UnicodeDecoder.java#94

The comment states "A reversed BOM cannot occur within middle of stream", which has not been true since Unicode 1.0, *before* the introduction of UTF-16.  (http://www.unicode.org/faq/private_use.html#sentinel6)

To fix this bug, the corresponding test & error return should simply be removed.

From Unicode 3.0, Chapter 3, p. 46:

To ensure that round-trip transcoding is possible, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE, FFFF, and unpaired surrogates.

and clarified in Unicode 4.0:

To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.

http://www.unicode.org/faq/utf_bom.html#utf16-7
http://www.unicode.org/faq/utf_bom.html#utf16-8


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
new String("\ufffe".getBytes("UTF-16"), "UTF-16").equals("\ufffe")
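The one-liner above can be expanded into a small standalone program. This is only a sketch: the class name `ReversedBomRoundTrip` and the helpers `hex`, `roundTrip`, and `strictDecode` are hypothetical names chosen for illustration and are not part of the attached test case. It prints the encoded bytes, the result of the round trip, and whether a decoder configured with `CodingErrorAction.REPORT` rejects the input. No expected output is stated because it depends on the JDK in use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReversedBomRoundTrip {
    // Render bytes as space-separated upper-case hex, e.g. "FE FF".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Encode and decode with the UTF-16 charset, as in the report's one-liner.
    // new String(...) replaces malformed input instead of throwing.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16);
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Decode with malformed input reported rather than replaced.
    // Returns null if the decoder rejects the bytes.
    static String strictDecode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_16.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String original = "\ufffe";
        byte[] encoded = original.getBytes(StandardCharsets.UTF_16);
        System.out.println("encoded bytes:    " + hex(encoded));

        String decoded = roundTrip(original);
        System.out.println("round trip equal: " + decoded.equals(original));
        for (char c : decoded.toCharArray()) {
            System.out.printf("decoded code unit: U+%04X%n", (int) c);
        }

        System.out.println("strict decode ok: " + (strictDecode(encoded) != null));
    }
}
```

If the decoder treats the noncharacter as malformed, as this report describes, the round trip comparison is false and the strict decode fails.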


REPRODUCIBILITY :
This bug can be reproduced always.


Comments
The FAQ entry below indicates that U+FEFF should NOT occur in the middle of a stream unless a protocol supports its use as a BOM there (by default it is a BOM only at the beginning of the stream). So it is arguably NOT incorrect to treat it as a "malformed" byte sequence. It might not be harmful to take it as-is when it appears in the middle, but I don't see a compelling reason to make such an "incompatible" change after these decoders have behaved this way for decades. Closed as "Not an Issue". If you disagree, reopen it and add more supporting material.

http://www.unicode.org/faq/utf_bom.html#bom6

Q: What should I do with U+FEFF in the middle of a file?

A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character. [AF]
26-05-2016

Attached the test case provided by the submitter. Output on JDK 8u76 and 9 ea: false.
23-02-2016