JDK-6959785 : UTF-8 encoding does not recognize initial BOM
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6u10,8
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: generic,windows_xp
  • CPU: generic,x86
  • Submitted: 2010-06-09
  • Updated: 2019-03-26
Other: tbd, Unresolved
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Description
FULL PRODUCT VERSION :


ADDITIONAL OS VERSION INFORMATION :
all OS

A DESCRIPTION OF THE PROBLEM :
A UTF-8 stream can optionally begin with a byte order mark (see, for example, http://www.unicode.org.unicode/faq/utf_bom.html). This is the character U+FEFF, which is represented as EF BB BF in UTF-8. Java's UTF-8 encoding does not recognize this character as a BOM, though; the result of reading such a stream is a sequence of characters beginning with U+FEFF.
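
A minimal sketch of the behavior described above, assuming a JDK with java.nio.charset.StandardCharsets (Java 7 or later). The class name BomDemo and the helper decode are made up for illustration and are not part of the original report.

```java
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Decodes the bytes with the JDK's standard UTF-8 charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed here by "hi".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String s = decode(withBom);
        // The BOM survives decoding as the first char of the string.
        System.out.printf("length=%d first=U+%04X%n", s.length(), (int) s.charAt(0));
    }
}
```

The decoded string has three chars rather than two, because U+FEFF is kept as ordinary content.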

See bug:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Look at the comments there too, and at the number of votes.


REPRODUCIBILITY :
This bug can be reproduced always.

CUSTOMER SUBMITTED WORKAROUND :
Application code must recognize and skip the BOM itself.
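
One way application code might apply this workaround is sketched below. It is only an illustration: the names BomSkippingWorkaround, skipLeadingBom, and readUtf8 are hypothetical, and the approach (peek at the first decoded char via mark/reset) is one of several that would work.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkippingWorkaround {
    // Wraps the reader and consumes the first char only if it is U+FEFF.
    static Reader skipLeadingBom(Reader in) throws IOException {
        BufferedReader r = new BufferedReader(in);
        r.mark(1);
        if (r.read() != 0xFEFF) {
            r.reset(); // not a BOM (or empty input): put the char back
        }
        return r;
    }

    // Convenience helper: decodes UTF-8 bytes, dropping one leading BOM.
    static String readUtf8(byte[] bytes) {
        try (Reader r = skipLeadingBom(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(readUtf8(withBom).length()); // BOM is not counted
    }
}
```

Input without a BOM passes through unchanged, so the helper can be applied unconditionally.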

Comments
http://unicode.org/faq/utf_bom.html#BOM
26-03-2019

My earlier suggestion (from 2012-04-16) is probably incorrect, as the character properties are determined by the Unicode standard and can't be changed arbitrarily by Java. As shown by JDK-4508058 and JDK-6378911, eliding the BOM by default is also probably the wrong approach. I can think of a couple of possibilities. 1) Have a new Reader, or a mode of an existing Reader, that ignores an initial BOM. 2) Have a UTF-8 charset decoder that ignores an initial BOM. An additional possibility would be to have whatever mechanism is chosen ignore all BOMs and not just an initial one. Strictly speaking, a BOM should appear only at the start of the text. However, various tools might copy BOM characters into other files, resulting in BOM characters being sprinkled throughout a file. An option to ignore all BOM characters throughout a file might be warranted.
26-03-2019
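
To make the Reader-based possibility from the comment above concrete, here is a rough sketch of a filtering Reader with a switch between "initial BOM only" and "all BOMs". It is not an existing or proposed JDK class; BomStrippingReader, its allBoms flag, and the strip helper are hypothetical names chosen for illustration.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BomStrippingReader extends FilterReader {
    private static final char BOM = '\uFEFF';
    private final boolean allBoms;   // true: drop every U+FEFF; false: only a leading one
    private boolean atStart = true;  // true until the first char has been seen

    public BomStrippingReader(Reader in, boolean allBoms) {
        super(in);
        this.allBoms = allBoms;
    }

    private boolean shouldDrop(char c) {
        boolean drop = c == BOM && (allBoms || atStart);
        atStart = false;
        return drop;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int c = super.read();
            if (c == -1 || !shouldDrop((char) c)) return c;
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) return n;
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = buf[off + i];
                if (!shouldDrop(c)) buf[off + kept++] = c; // compact in place
            }
            if (kept > 0) return kept; // otherwise the chunk was all BOMs: read more
        }
    }

    // mark/reset would desynchronize atStart, so this sketch does not support it.
    @Override
    public boolean markSupported() {
        return false;
    }

    // Illustrative helper: filters an in-memory string through the reader.
    static String strip(String text, boolean allBoms) {
        try (Reader r = new BomStrippingReader(new StringReader(text), allBoms)) {
            StringBuilder sb = new StringBuilder();
            for (int c; (c = r.read()) != -1; ) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, skip() is inherited unchanged and therefore counts BOM characters; a real implementation would need to decide how skip, mark, and reset should behave.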

This has received a little bit of play in the blogosphere: http://weblogs.java.net/blog/cayhorstmann/archive/2012/04/10/bomed-out-notepad-and-javac I don't think this is actually javac's problem, though. It seems silly to have to modify every application (in this case javac is an application) to ignore the BOM. The fix for 4508058 was to have the streams code elide the BOM, but this was considered incompatible and was backed out by 6378911. Another approach the library might take is to update the character properties for the BOM. For example, Character.isWhitespace(BOM) and Character.isSpaceChar(BOM) return false. If they returned true, then javac, which presumably ignores whitespace, would ignore the BOM as well. But I don't know if Unicode permits this.
23-03-2019
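
The character properties mentioned in the comment above can be checked directly. This is a small sketch; the class name BomProperties and the BOM constant are illustrative only.

```java
public class BomProperties {
    static final char BOM = '\uFEFF';

    public static void main(String[] args) {
        // Both whitespace predicates reject the BOM.
        System.out.println("isWhitespace: " + Character.isWhitespace(BOM));
        System.out.println("isSpaceChar:  " + Character.isSpaceChar(BOM));
        // Unicode classifies U+FEFF as a format character (general category Cf).
        System.out.println("is FORMAT:    " + (Character.getType(BOM) == Character.FORMAT));
    }
}
```

Because the general category comes from the Unicode Character Database, code that skips whitespace via these predicates will not skip a BOM.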

EVALUATION: Maybe we should give it a try again in JDK 7.
16-08-2010