JDK-6206835 : Byte Order Mark not supported by default SaxParser
  • Type: Enhancement
  • Component: xml
  • Sub-Component: org.xml.sax
  • Affected Version: 1.4.2
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2004-12-10
  • Updated: 2012-04-25
  • Resolved: 2007-12-07
Description
A DESCRIPTION OF THE REQUEST :
Most editors on Windows save files in UTF-8, UTF-16BE, UTF-16LE, etc. with an initial "Byte Order Mark" (BOM).

The BOM is covered by the XML "W3C Recommendation 04 February 2004":
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing

It is described in a more accessible way here:
http://www.i18ngurus.com/encyclopedia/byte_order_mark.html :
"
byte order mark

Also known as BOM.

Name given to the Unicode character U+FEFF when used at the beginning of a Unicode byte stream. This invisible character generally known as ZERO WIDTH NO-BREAK SPACE (ZWNBSP) serves to identify unambiguously the Unicode transformation form used (and especially the byte order) for the stream. Indeed U+FFFE is a noncharacter so there is no risk of misinterpretation.

The following represents the byte signature of the character U+FEFF with the various Unicode Transformation Forms:
Bytes        Encoding
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8
"

But the default SAXParser in JDK 1.4.2_03 does not handle this. I think it is a Crimson SAXParser that is initialised when you do the following:

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();

I get errors if I try to parse a file saved in UTF-8 (or some other Unicode format such as UTF-16BE and UTF-16LE) in this way:

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

  public static void testXML(){
    try{
      File ansiiFile=new File("Test ansii.xml");//File is saved in ISO-8859-1
      File utf8File=new File("Test utf8.xml");//File is saved in UTF-8 with initial Byte Order Mark

      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser saxParser = factory.newSAXParser();

      //try ISO-8859-1
      InputSource in = new InputSource(new InputStreamReader(new FileInputStream(ansiiFile),"ISO-8859-1"));
      saxParser.parse(in,new DefaultHandler());

      //try UTF-8
      in = new InputSource(new InputStreamReader(new FileInputStream(utf8File),"UTF-8"));
      saxParser.parse(in,new DefaultHandler());
    }catch(Exception e){
      e.printStackTrace();
    }
  }

The first file parses well, but when I parse the file encoded in UTF-8 I get this error:

org.xml.sax.SAXParseException: The Document root element is missing.
at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3170)
at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:501)
at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
at test.Test.testXML(Test.java:44)
at test.Test.main(Test.java:25)

That is because the XML parser expects a '<' as the first character, but it really should handle the case where the first character is a Byte Order Mark.
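
As a minimal sketch of why the parser sees something other than '<': when bytes are decoded through an InputStreamReader with "UTF-8", the BOM is not dropped but delivered as the character U+FEFF. The class name BomDecodeDemo and the method firstChar are made up for this illustration and are not part of the original report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class BomDecodeDemo {
  // Decodes the given bytes as UTF-8 and returns the first character read
  // (or -1 if the input is empty).
  static int firstChar(byte[] bytes) throws IOException {
    Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8");
    try {
      return reader.read();
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // EF BB BF is the UTF-8 BOM, followed by the document "<a/>"
    byte[] withBom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, '<', 'a', '/', '>'};
    // The first decoded character is the BOM itself, not '<'
    System.out.println(Integer.toHexString(firstChar(withBom)));
  }
}
```

So a parser that is handed such a Reader receives U+FEFF ahead of the root element and has to skip it itself.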


JUSTIFICATION :
Since the Byte Order Mark is covered by the "W3C Recommendation 04 February 2004", Sun should make sure that the SAXParser included in the JDK supports it.

Anyone will be very confused when the file looks fine in whatever editor it is opened in (NotePad, EditPlus 2, etc. on Windows), since those editors handle the Byte Order Mark, while the SAXParser fails with an exception.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The default SAXParser in the JDK should handle an initial Byte Order Mark in XML files when parsing them.
ACTUAL -
The default SAXParser included in JDK 1.4.2_03 cannot handle UTF-8 files with an initial Byte Order Mark.
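
For comparison, a hedged sketch of handing the parser the raw bytes instead of a Reader, so that encoding detection (the autodetection described in the W3C section linked above) is left to the parser. Whether a leading BOM is accepted this way depends on the parser implementation behind SAXParserFactory; the behaviour of Crimson in 1.4.2 is not verified here. The class name ByteStreamParseDemo and the method parses are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ByteStreamParseDemo {
  // Returns true if the default SAX parser accepts the document when it is
  // given a byte stream (no Reader, no encoding chosen by the caller).
  static boolean parses(byte[] xml) {
    try {
      SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
      InputSource in = new InputSource(new ByteArrayInputStream(xml));
      saxParser.parse(in, new DefaultHandler());
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // UTF-8 BOM followed by the document "<a/>"
    byte[] withBom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, '<', 'a', '/', '>'};
    System.out.println("parsed with BOM: " + parses(withBom));
  }
}
```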

---------- BEGIN SOURCE ----------
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Test{
  public static void testXML(){
    try{
      File ansiiFile=new File("Test ansii.xml");//File is saved in ISO-8859-1
      File utf8File=new File("Test utf8.xml");//File is saved in UTF-8 with initial Byte Order Mark

      SAXParserFactory factory = SAXParserFactory.newInstance();
      SAXParser saxParser = factory.newSAXParser();

      //try ISO-8859-1
      InputSource in = new InputSource(new InputStreamReader(new FileInputStream(ansiiFile),"ISO-8859-1"));
      saxParser.parse(in,new DefaultHandler());

      //try UTF-8
      in = new InputSource(new InputStreamReader(new FileInputStream(utf8File),"UTF-8"));
      saxParser.parse(in,new DefaultHandler());
    }catch(Exception e){
      e.printStackTrace();
    }
  }

  public static void main(String[] args){
    testXML();
  }
}


---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
This code will work:
------------------------- Test -------------------
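import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Test{
  //_f is the XML file, handler is your initialised handler
  public static void parse(File _f, DefaultHandler handler) throws Exception{
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    //Open an InputSource in this BOM-controlled way instead
    InputSource in = new InputSource(BOMUtil.getReader(_f,"UTF-8"));
    saxParser.parse(in,handler);
  }
}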
------------------------- BOMUtil -------------------
import java.io.*;

public class BOMUtil {
  public final static int NONE=-1;
  public final static int UTF32BE=0;
  public final static int UTF32LE=1;
  public final static int UTF16BE=2;
  public final static int UTF16LE=3;
  public final static int UTF8=4;

  public final static byte[] UTF32BEBOMBYTES = new byte[]{(byte)0x00 ,(byte)0x00 ,(byte)0xFE ,(byte)0xFF ,};
  public final static byte[] UTF32LEBOMBYTES = new byte[]{(byte)0xFF ,(byte)0xFE ,(byte)0x00 ,(byte)0x00 ,};
  public final static byte[] UTF16BEBOMBYTES = new byte[]{(byte)0xFE ,(byte)0xFF ,};
  public final static byte[] UTF16LEBOMBYTES = new byte[]{(byte)0xFF ,(byte)0xFE ,};
  public final static byte[] UTF8BOMBYTES    = new byte[]{(byte)0xEF ,(byte)0xBB ,(byte)0xBF ,};

  public final static byte[][] BOMBYTES=new byte[][]{
    UTF32BEBOMBYTES,
    UTF32LEBOMBYTES,
    UTF16BEBOMBYTES,
    UTF16LEBOMBYTES,
    UTF8BOMBYTES,
  };

  public final static int MAXBOMBYTES=4;//no BOM sequence is longer than 4 bytes

  public static int getBOMType(byte[] _bomBytes){
    return getBOMType(_bomBytes,_bomBytes.length);
  }

  public static int getBOMType(byte[] _bomBytes, int _length){
    for (int i = 0; i < BOMBYTES.length; i++) {
      for(int j=0; j<_length && j<BOMBYTES[i].length; j++){
        if(_bomBytes[j]!=BOMBYTES[i][j]) break;
        if(_bomBytes[j]==BOMBYTES[i][j] && j==BOMBYTES[i].length-1) return i;
      }
    }
    return NONE;
  }

  public static int getBOMType(File _f) throws IOException{
    FileInputStream fIn=new FileInputStream(_f);
    byte[] buff=new byte[MAXBOMBYTES];
    int read=fIn.read(buff);
    int BOMType=getBOMType(buff,read);
    fIn.close();
    return BOMType;
  }

  public static int getSkipBytes(int BOMType){
    if(BOMType<0 || BOMType>=BOMBYTES.length) return 0;
    return BOMBYTES[BOMType].length;
  }

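  /**
   * Opens the file and returns a Reader positioned after any initial BOM.
   * The BOM is only skipped; the bytes that follow are decoded with the given encoding.
   * @param _f the file to read
   * @param encoding the character encoding used to decode the file
   */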
  public static Reader getReader(File _f, String encoding) throws IOException{
    int BOMType=getBOMType(_f);
    int skipBytes=getSkipBytes(BOMType);
    FileInputStream fIn=new FileInputStream(_f);
    fIn.skip(skipBytes);
    Reader reader=new InputStreamReader(fIn,encoding);
    return reader;
  }
}
###@###.### 2004-12-10 00:52:07 GMT
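
A possible single-pass variant of the workaround above, sketched under stated assumptions: instead of opening the file twice, it peeks at the first bytes through a PushbackInputStream, consumes a BOM if one is found, and lets the BOM choose the encoding. The class name BomSkippingReader and the method open are made up for this illustration. The "UTF-32BE" and "UTF-32LE" charset names are assumed to be available in the running JVM; if they are not, InputStreamReader throws UnsupportedEncodingException.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomSkippingReader {
  private static int u(byte x) { return x & 0xFF; }

  // Returns a Reader over the stream with any leading BOM consumed.
  // A recognised BOM decides the encoding; otherwise defaultEncoding is used.
  public static Reader open(InputStream in, String defaultEncoding) throws IOException {
    PushbackInputStream pin = new PushbackInputStream(in, 4);
    byte[] head = new byte[4];
    int n = 0;
    // read() may deliver fewer bytes than requested, so loop until 4 bytes or EOF
    while (n < head.length) {
      int r = pin.read(head, n, head.length - n);
      if (r < 0) break;
      n += r;
    }

    String encoding = defaultEncoding;
    int bomLength = 0;
    // Check the 4-byte signatures first: FF FE is also a prefix of the UTF-32LE BOM
    if (n >= 4 && u(head[0]) == 0x00 && u(head[1]) == 0x00
               && u(head[2]) == 0xFE && u(head[3]) == 0xFF) {
      encoding = "UTF-32BE"; bomLength = 4;
    } else if (n >= 4 && u(head[0]) == 0xFF && u(head[1]) == 0xFE
                      && u(head[2]) == 0x00 && u(head[3]) == 0x00) {
      encoding = "UTF-32LE"; bomLength = 4;
    } else if (n >= 3 && u(head[0]) == 0xEF && u(head[1]) == 0xBB && u(head[2]) == 0xBF) {
      encoding = "UTF-8"; bomLength = 3;
    } else if (n >= 2 && u(head[0]) == 0xFE && u(head[1]) == 0xFF) {
      encoding = "UTF-16BE"; bomLength = 2;
    } else if (n >= 2 && u(head[0]) == 0xFF && u(head[1]) == 0xFE) {
      encoding = "UTF-16LE"; bomLength = 2;
    }

    // Push back everything that was read beyond the BOM
    if (n > bomLength) pin.unread(head, bomLength, n - bomLength);
    return new InputStreamReader(pin, encoding);
  }
}
```

It could be used as new InputSource(BomSkippingReader.open(new FileInputStream(_f),"UTF-8")). One trade-off of this sketch: a UTF-16LE document whose first character is U+0000 would be mistaken for UTF-32LE, which should not occur in well-formed XML.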

Comments
EVALUATION This is a very old bug. Crimson is no longer supported or part of JAXP. As part of the CR garbage collection process, we are closing all these old bugs.
07-12-2007

EVALUATION supporting UTF w/BOM is part of the spec so bug is being accepted.
10-12-2005