JDK-6378911 : UTF-8 decoder handling of byte-order mark has changed

Details
Type: Bug
Submit Date: 2006-01-31
Status: Resolved
Updated Date: 2010-05-11
Project Name: JDK
Resolved Date: 2006-02-18
Component: core-libs
OS: linux_redhat_4.0, linux_redhat_3.0
Sub-Component: java.nio.charsets
CPU: x86, generic
Priority: P2
Resolution: Fixed
Affected Versions: 9.0pe, 6, 6u1
Fixed Versions:

Description
Consider this small test program:

import java.io.*;

public class test {

    /*
     * Make sure 0xFEFF is encoded as this byte sequence: EF BB BF, when
     * UTF-8 is being used, and parsed back into 0xFEFF.
     */
    public static void main(String[] args) throws Exception {

        /*
         * Write
         */
        FileOutputStream fos = new FileOutputStream("bom.txt");
        OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
        osw.write(0xFEFF);
        osw.close();

        /*
         * Parse
         */
        FileInputStream fis = new FileInputStream("bom.txt");
        InputStreamReader isr = new InputStreamReader(fis, "UTF8");
        char bomChar = (char) isr.read();
        System.out.println("Parsed: "
                           + Integer.toHexString(bomChar).toUpperCase());
        if (bomChar != 0xFEFF) {
            throw new Exception("Invalid BOM: "
                                + Integer.toHexString(bomChar).toUpperCase());
        }
        isr.close();

    }

}

On Linux JDK1.6 Beta b59d, the char is parsed correctly. 
However, on Linux JDK1.6 rc b69, the program throws this exception:

  Exception in thread "main" java.lang.Exception: Invalid BOM: FFFF
          at test.main(test.java:28)

See 6378267 for more details.

                                    

Comments
EVALUATION

The UTF-8 charset has been updated to recognize the sequence EF BB BF as a
"BOM", as specified at http://www.unicode.org/faq/utf_bom.html#BOM, so this
UTF-8 signature is now skipped during decoding if it appears at
the beginning of the input stream. See 4508058. The assumption of
"parsed back into 0xFEFF" therefore no longer holds; we suggest the test case
be updated accordingly.
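
For reference, the FFFF in the reported exception is consistent with the reader hitting end of stream: if the decoder skips the signature, bom.txt has no characters left, read() returns -1, and casting -1 to char gives 0xFFFF. A minimal sketch of that effect, using in-memory bytes instead of bom.txt (the class and method names are illustrative, not part of the original test):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EofCastDemo {

    /** Reads one char from the given UTF-8 bytes; returns "EOF" or the hex value of the char. */
    static String firstChar(byte[] utf8Bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int c = reader.read();              // -1 signals end of stream
            if (c == -1) {
                return "EOF";
            }
            return Integer.toHexString(c).toUpperCase();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Casting the end-of-stream marker to char yields 0xFFFF.
        System.out.println(Integer.toHexString((char) -1).toUpperCase());
        // An empty stream is what remains once a lone BOM has been skipped.
        System.out.println(firstChar(new byte[0]));
    }
}
```

Checking the int returned by read() against -1 before casting would make the test report "end of stream" rather than a misleading FFFF.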
                                     
2006-01-31
EVALUATION

I am reopening this bug because it is breaking backwards-compatibility. The JSP container in the Java EE 5 RI and SJSAS 9.0 has been relying on detecting a BOM, setting the appropriate encoding, and discarding the BOM bytes before reading the input. The purpose of the test program I provided with the bug report was to demonstrate the issue. It is not just a matter of changing the test program to make things work.

This has worked up until JDK 1.6, and I expect it to continue to work with that JDK release.

If you want to support the new functionality of automatically detecting and discarding a BOM, this should be enabled with a flag, but not by default. We cannot have our container implement one behaviour when running on JDK 1.5, and a different behaviour when running on JDK 1.6.

Just curious, have the following encodings been updated as well:

  UTF-32 BE
  UTF-32 LE
  UTF-16 BE
  UTF-16 LE
                                     
2006-01-31
EVALUATION

It sounds like the customer code is doing "encoding detection".
But specifying UTF-8 is already choosing an encoding...

If you really want to examine the input to auto-detect encodings
(in general, an impossible task), then read the first few bytes
from the input stream directly *as bytes*, and compare to the
various encodings of BOM.  
If a BOM is detected, deduce the encoding, discard the BOM,
and read the rest of the input using the detected encoding.
If you don't find something that looks like a BOM, guess and pray.
Auto-detection of encodings is never reliable; it is only a heuristic.
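
The byte-level approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the JDK or the JSP container: the class name, method names, and the fallback parameter are hypothetical, and the whole input is assumed to fit in a byte array. Longer signatures are checked first so that FF FE 00 00 (UTF-32LE) is not mistaken for FF FE (UTF-16LE); a UTF-16LE stream whose first character after the BOM is U+0000 would still be misread, which is one reason this remains a heuristic.

```java
import java.nio.charset.Charset;

public class BomSniffer {

    // Longer signatures first, so UTF-32LE is tested before UTF-16LE.
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},   // UTF-32BE
        {0xFF, 0xFE, 0x00, 0x00},   // UTF-32LE
        {0xEF, 0xBB, 0xBF},         // UTF-8
        {0xFE, 0xFF},               // UTF-16BE
        {0xFF, 0xFE},               // UTF-16LE
    };
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    /** Returns the index of the BOM that the bytes start with, or -1 if none matches. */
    private static int matchBom(byte[] head) {
        for (int i = 0; i < BOMS.length; i++) {
            int[] bom = BOMS[i];
            if (head.length < bom.length) {
                continue;
            }
            boolean same = true;
            for (int j = 0; j < bom.length; j++) {
                if ((head[j] & 0xFF) != bom[j]) {
                    same = false;
                    break;
                }
            }
            if (same) {
                return i;
            }
        }
        return -1;
    }

    /** Charset name implied by a leading BOM, or null when no BOM is present. */
    static String charsetName(byte[] head) {
        int i = matchBom(head);
        return i < 0 ? null : NAMES[i];
    }

    /** Number of leading bytes occupied by the BOM (0 when there is none). */
    static int bomLength(byte[] head) {
        int i = matchBom(head);
        return i < 0 ? 0 : BOMS[i].length;
    }

    /** Decodes data with the detected charset, skipping the BOM; uses fallback if no BOM is found. */
    static String decode(byte[] data, String fallback) {
        String name = charsetName(data);
        int skip = bomLength(data);
        Charset cs = Charset.forName(name != null ? name : fallback);
        return new String(data, skip, data.length - skip, cs);
    }
}
```

Because the BOM is removed at the byte level, the decoder always sees a BOM-free stream, so the result does not depend on whether the decoder itself strips the signature.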

The change to UTF-8 is incompatible, but a strong case can be made
that UTF-8 is specified by a standard, and so the change was simply a bug fix.
What do other implementations of UTF-8 do?
                                     
2006-01-31
EVALUATION

The "encoding autodetection" approach mentioned above is exactly what we have been doing.

However, after having deduced an encoding from the BOM bytes (if present), we need to reset the input stream so we can later pass it to the javax.xml.parsers.SAXParser (for JSP pages in XML syntax) or our "hand-written" parser (for JSP pages in classic syntax).

This means that in the classic syntax case, we must discard the BOM manually (in the XML syntax case, the javax.xml.parsers.SAXParser is taking care of this) when running against a JRE with a version < 1.6, but must rely on the decoder to do this for us as of JRE 1.6. The problem is that we cannot implement different behaviour depending on the JRE version we're running against. 

Also, do these encodings also discard a BOM as of JDK 1.6:

  UTF-32 BE
  UTF-32 LE
  UTF-16 BE
  UTF-16 LE

or was the change made to UTF-8 only?
                                     
2006-01-31
EVALUATION

I don't understand why the technique that works with 1.6,
namely "discarding the BOM manually", doesn't also work with 1.5.
If you consistently pass a BOM-free stream to the decoder, the
behavior will be unchanged.
                                     
2006-01-31
EVALUATION

Sorry if I wasn't clear: We currently detect the encoding of a JSP file from its BOM, reset the input stream, and pass the input stream to the appropriate parser, based on the JSP file's syntax. Notice that the SAXParser (invoked for XML syntax) would choke on a BOM-free stream. If the JSP page is in classic syntax, we remember if a BOM was present by setting a flag, set the stream's encoding to that derived from the BOM, and have our parser read and parse the JSP page from the stream. If the BOM flag is set, the parser knows to discard the first char. With JDK 1.6, this approach no longer works, because the decoder will already have discarded the BOM, so our parser will look at the wrong char.

Also, I never got an answer if the automatic BOM-stripping is now also done in the case of UTF-32 and UTF-16, or just UTF-8.
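
One way a hand-written parser might avoid depending on the decoder's behaviour is to inspect the first decoded char and drop it only when it is actually U+FEFF, instead of unconditionally discarding the first char whenever the BOM flag is set. The sketch below is hypothetical (class and method names are invented for illustration and are not taken from the container's source); it also assumes a leading U+FEFF is never meant as content.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BomTolerantReader {

    /** Wraps the reader and consumes a leading U+FEFF if, and only if, one is present. */
    static Reader skipLeadingBom(Reader in) {
        PushbackReader pr = new PushbackReader(in, 1);
        try {
            int c = pr.read();
            if (c != -1 && c != 0xFEFF) {
                pr.unread(c);               // not a BOM: put the char back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pr;
    }

    /** Drains the reader into a String (helper for the demonstration). */
    static String readAll(Reader r) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same result whether or not the decoder already removed the BOM.
        System.out.println(readAll(skipLeadingBom(new StringReader("\uFEFF<%@ page %>"))));
        System.out.println(readAll(skipLeadingBom(new StringReader("<%@ page %>"))));
    }
}
```

With this check the parser yields the same characters whether the decoder passes the BOM through or strips it.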
                                     
2006-01-31
EVALUATION

The update we made to recognize the BOM in UTF-8 (4508058) is
correct according to the Unicode Standard.  Our assumption was that
our change should rarely break any real-world applications because
that would imply that they were not following the Unicode Standard.

Unfortunately, we have found a common application where our
assumption was incorrect.  We will back out the changes associated
with 4508058, thus reverting to our previous behaviour of ignoring
the BOM for UTF-8.

No changes were ever made to BOM handling for UTF-16 or UTF-32, as
these encodings use multi-byte code units and require the BOM to be processed.
                                     
2006-02-02


