Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 9.0pe,6,6u1

Priority: P2
Status: Resolved
Resolution: Fixed
OS: linux_redhat_3.0,linux_redhat_4.0
CPU: generic,x86

Submitted: 2006-01-31
Updated: 2010-05-11
Resolved: 2006-02-18

JDK 6
6 b73 Fixed

Consider this small test program:

import java.io.*;

public class test {

    /*
     * Make sure 0xFEFF is encoded as this byte sequence: EF BB BF, when
     * UTF-8 is being used, and parsed back into 0xFEFF.
     */
    public static void main(String[] args) throws Exception {

        /*
         * Write
         */
        FileOutputStream fos = new FileOutputStream("bom.txt");
        OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
        osw.write(0xFEFF);
        osw.close();

        /*
         * Parse
         */
        FileInputStream fis = new FileInputStream("bom.txt");
        InputStreamReader isr = new InputStreamReader(fis, "UTF8");
        char bomChar = (char) isr.read();
        System.out.println("Parsed: "
                           + Integer.toHexString(bomChar).toUpperCase());
        if (bomChar != 0xFEFF) {
            throw new Exception("Invalid BOM: "
                                + Integer.toHexString(bomChar).toUpperCase());
        }
        isr.close();

    }

}

On Linux JDK1.6 Beta b59d, the char is parsed correctly. 
However, on Linux JDK1.6 rc b69, the program throws this exception:

  Exception in thread "main" java.lang.Exception: Invalid BOM: FFFF
          at test.main(test.java:28)

See 6378267 for more details.

EVALUATION The update we made to recognize the BOM in UTF-8 (4508058) is correct according to the Unicode Standard. Our assumption was that the change would rarely break real-world applications, because breakage would imply that they were not following the Unicode Standard. Unfortunately, we have found a common application for which that assumption was incorrect. We will back out the changes associated with 4508058, thus reverting to our previous behaviour of not recognizing the BOM in UTF-8, so that the decoder again returns it as the character 0xFEFF. No changes were ever made to BOM handling for UTF-16 or UTF-32, as these encodings require the BOM to be processed in order to determine byte order.
02-02-2006
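As a minimal sketch of what the reverted behaviour means for callers, the following assumes a JDK whose UTF-8 decoder passes the signature EF BB BF through as 0xFEFF (the behaviour restored by this fix). The class name Utf8BomCheck and the helper stripLeadingBom are hypothetical names chosen for illustration; they are not part of the JDK or of the reporter's code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BomCheck {

    /** Hypothetical helper: drops one leading U+FEFF if the decoder left it in. */
    public static String stripLeadingBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == '\uFEFF') ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF; the bytes for "hi" follow it.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        // Assumption: the decoder does not strip the signature.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("decoded length: " + decoded.length());
        System.out.println("first char: "
                           + Integer.toHexString(decoded.charAt(0)).toUpperCase());

        // The caller removes the BOM itself, if one is present.
        System.out.println("after strip: " + stripLeadingBom(decoded));
    }
}
```

Because the helper only removes the first char when it actually is 0xFEFF, the same caller code also works unchanged against a decoder that has already discarded the BOM.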
EVALUATION Sorry if I wasn't clear: We currently detect the encoding of a JSP file from its BOM, reset the input stream, and pass the input stream to the appropriate parser, based on the JSP file's syntax. Notice that the SAXParser (invoked for XML syntax) would choke on a BOM-free stream. If the JSP page is in classic syntax, we remember whether a BOM was present by setting a flag, set the stream's encoding to that derived from the BOM, and have our parser read and parse the JSP page from the stream. If the BOM flag is set, the parser knows to discard the first char. With JDK 1.6, this approach no longer works, because the decoder will already have discarded the BOM, so our parser will discard the wrong char. Also, I never got an answer as to whether the automatic BOM-stripping is now also done for UTF-32 and UTF-16, or just for UTF-8.
31-01-2006
EVALUATION I don't understand why the technique that works with 1.6, namely "discarding the BOM manually", doesn't also work with 1.5. If you consistently pass a BOM-free stream to the decoder, the behavior will be unchanged.
31-01-2006
EVALUATION The "encoding autodetection" approach mentioned above is exactly what we have been doing. However, after having deduced an encoding from the BOM bytes (if present), we need to reset the input stream so we can later pass it to the javax.xml.parsers.SAXParser (for JSP pages in XML syntax) or our "hand-written" parser (for JSP pages in classic syntax). This means that in the classic syntax case, we must discard the BOM manually when running against a JRE with a version < 1.6, but must rely on the decoder to do this for us as of JRE 1.6. (In the XML syntax case, the javax.xml.parsers.SAXParser takes care of this.) The problem is that we cannot implement different behaviour depending on the JRE version we're running against. Also, do the following encodings discard a BOM as of JDK 1.6: UTF-32 BE, UTF-32 LE, UTF-16 BE, UTF-16 LE? Or was the change made to UTF-8 only?
31-01-2006
EVALUATION It sounds like the customer code is doing "encoding detection". But specifying UTF-8 is already choosing an encoding... If you really want to examine the input to auto-detect encodings (in general, an impossible task), then read the first few bytes from the input stream directly as bytes, and compare them to the various encodings of the BOM. If a BOM is detected, deduce the encoding, discard the BOM, and read the rest of the input using the detected encoding. If you don't find something that looks like a BOM, guess and pray. Auto-detection of encodings is never reliable; it is only a heuristic. The change to UTF-8 is incompatible, but a strong case can be made that UTF-8 is specified by a standard, and so the change was simply a bug fix. What do other implementations of UTF-8 do?
31-01-2006
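The detection procedure described in the evaluation above (read the first few bytes as bytes, compare them to the encodings of the BOM, discard the BOM, decode the rest) could be sketched as follows. This is an illustration under stated assumptions, not the JSP container's actual code: the class name BomSniffer, the method open, and the fallback parameter are hypothetical, and the UTF-32BE and UTF-32LE charsets are assumed to be available in the running JRE (they are looked up by name only when such a BOM is seen).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /**
     * Hypothetical helper: returns a Reader positioned just after any BOM.
     * If no BOM is found, the caller-supplied fallback charset is used.
     */
    public static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];
        int n = 0;
        // Read up to 4 bytes; a short stream simply yields fewer.
        while (n < 4) {
            int r = pin.read(head, n, 4 - n);
            if (r < 0) break;
            n += r;
        }

        Charset cs = fallback;
        int bomLen = 0;
        // UTF-32 is tested first because FF FE 00 00 also begins with the
        // UTF-16LE signature FF FE.
        if (n >= 4 && head[0] == 0 && head[1] == 0
                && head[2] == (byte) 0xFE && head[3] == (byte) 0xFF) {
            cs = Charset.forName("UTF-32BE"); bomLen = 4;
        } else if (n >= 4 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE
                && head[2] == 0 && head[3] == 0) {
            cs = Charset.forName("UTF-32LE"); bomLen = 4;
        } else if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            cs = StandardCharsets.UTF_8; bomLen = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            cs = StandardCharsets.UTF_16BE; bomLen = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            cs = StandardCharsets.UTF_16LE; bomLen = 2;
        }

        // Push back only the bytes after the BOM, so the decoder never sees it.
        if (n > bomLen) {
            pin.unread(head, bomLen, n - bomLen);
        }
        return new InputStreamReader(pin, cs);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        try (Reader r = open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1)) {
            System.out.println((char) r.read()); // prints "h"
        }
    }
}
```

Because the BOM is consumed before any decoder is created, a sketch like this should behave the same whether or not the decoder itself strips a BOM. The "guess and pray" case is represented only by the fallback argument; no heuristic is attempted.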
EVALUATION I am reopening this bug because it is breaking backwards-compatibility. The JSP container in the Java EE 5 RI and SJSAS 9.0 has been relying on detecting a BOM, setting the appropriate encoding, and discarding the BOM bytes before reading the input. The purpose of the test program I provided with the bug report was to demonstrate the issue. It is not just a matter of changing the test program to make things work. This has worked up until JDK 1.6, and I expect it to continue to work with that JDK release. If you want to support the new functionality of automatically detecting and discarding a BOM, this should be enabled with a flag, but not by default. We cannot have our container implement one behaviour when running on JDK 1.5, and a different behaviour when running on JDK 1.6. Just curious, have the following encodings been updated as well: UTF-32 BE, UTF-32 LE, UTF-16 BE, UTF-16 LE?
31-01-2006
EVALUATION The UTF8 charset has been updated to recognize the sequence EF BB BF as a "BOM", as specified at http://www.unicode.org/faq/utf_bom.html#BOM, so this UTF-8 signature is now skipped during decoding if it appears at the beginning of the input stream. See 4508058. The assumption of "parsed back into 0xFEFF" therefore no longer stands; suggest the test case be updated accordingly.
31-01-2006

Relates :	JDK-4508058 - UTF-8 encoding does not recognize initial BOM
Relates :	JDK-6959785 - UTF-8 encoding does not recognize initial BOM