A DESCRIPTION OF THE REQUEST :
Most editors on Windows save files in UTF-8, UTF-16BE, UTF-16LE, etc. with an initial "Byte Order Mark" (BOM). The BOM is covered by the "W3C Recommendation 04 February 2004":
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing

It is described in a more easily understandable way here:
http://www.i18ngurus.com/encyclopedia/byte_order_mark.html :

"
byte order mark
Also known as BOM. Name given to the Unicode character U+FEFF when used at the beginning of a Unicode byte stream. This invisible character generally known as ZERO WIDTH NO-BREAK SPACE (ZWNBSP) serves to identify unambiguously the Unicode transformation form used (and especially the byte order) for the stream. Indeed U+FFFE is a noncharacter so there is no risk of misinterpretation.

The following represents the byte signature of the character U+FEFF with the various Unicode Transformation Forms:

Bytes          Encoding
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
"

However, the default SAXParser in JDK 1.4.2_03 does not handle an initial BOM.
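To illustrate the UTF-8 row of the signature table: the sketch below decodes a few bytes as UTF-8 and prints the first decoded character. Java's UTF-8 decoder keeps the BOM, so it shows up as U+FEFF in front of the '<'. The class name BomDecodeDemo and the sample bytes are made up for this illustration, and StandardCharsets requires a newer JDK than the 1.4.2 discussed in this report.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    /** Decodes the bytes as UTF-8 and returns the first decoded character. */
    static char firstChar(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).charAt(0);
    }

    public static void main(String[] args) {
        // "<a/>" preceded by the UTF-8 BOM (EF BB BF)
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        // The same document without a BOM
        byte[] withoutBom = {'<', 'a', '/', '>'};

        // The BOM survives decoding as U+FEFF, so '<' is not the first character
        System.out.println(Integer.toHexString(firstChar(withBom)));    // feff
        System.out.println(Integer.toHexString(firstChar(withoutBom))); // 3c
    }
}
```

A parser that is handed such a character stream sees U+FEFF before the root element unless it, or the caller, removes it.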
I think it is a Crimson SAXParser that is initialised when you do the following:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

I get errors if I try to parse a file saved in UTF-8 (or some other Unicode format such as UTF-16BE and UTF-16LE) in this way:

    import java.io.*;
    import javax.xml.parsers.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public static void testXML(){
        try{
            File ansiiFile=new File("Test ansii.xml");//File is saved in ISO-8859-1
            File utf8File=new File("Test utf8.xml");//File is saved in UTF-8 with initial Byte Order Mark

            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            //try ISO-8859-1
            InputSource in = new InputSource(new InputStreamReader(new FileInputStream(ansiiFile),"ISO-8859-1"));
            saxParser.parse(in,new DefaultHandler());

            //try UTF-8
            in = new InputSource(new InputStreamReader(new FileInputStream(utf8File),"UTF-8"));
            saxParser.parse(in,new DefaultHandler());
        }catch(Exception e){
            e.printStackTrace();
        }
    }

The first file parses well, but when I parse the file encoded in UTF-8 I get this error:

    org.xml.sax.SAXParseException: The Document root element is missing.
        at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3182)
        at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3170)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:501)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
        at test.Test.testXML(Test.java:44)
        at test.Test.main(Test.java:25)

That is because the XML parser expects a '<' as the first character, but it really should handle the case where the first character is a Byte Order Mark.

JUSTIFICATION :
Since the Byte Order Mark is recommended by the "W3C Recommendation 04 February 2004", Sun should make sure that the SAXParser included in the JDK supports it.
Anyone will be very confused when the file "looks" fine no matter which editor you open it in (Notepad, EditPlus 2, etc. on Windows), since those editors handle the Byte Order Mark, but the SAXParser just fails.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The default SAXParser in the JDK should handle an initial Byte Order Mark in XML files when parsing them.
ACTUAL -
The default SAXParser included in JDK 1.4.2_03 cannot handle UTF-8 files with an initial Byte Order Mark.

---------- BEGIN SOURCE ----------
import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Test{

    public static void testXML(){
        try{
            File ansiiFile=new File("Test ansii.xml");//File is saved in ISO-8859-1
            File utf8File=new File("Test utf8.xml");//File is saved in UTF-8 with initial Byte Order Mark

            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();

            //try ISO-8859-1
            InputSource in = new InputSource(new InputStreamReader(new FileInputStream(ansiiFile),"ISO-8859-1"));
            saxParser.parse(in,new DefaultHandler());

            //try UTF-8
            in = new InputSource(new InputStreamReader(new FileInputStream(utf8File),"UTF-8"));
            saxParser.parse(in,new DefaultHandler());
        }catch(Exception e){
            e.printStackTrace();
        }
    }

    public static void main(String[] args){
        testXML();
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
This code will work:

------------------------- Test -------------------
public class Test{
    //.... initialise your handler
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    //Open an InputSource in this BOM-controlled way instead
    InputSource in = new InputSource(BOMUtil.getReader(_f,"UTF-8"));
    saxParser.parse(in,handler);
}

------------------------- BOMUtil -------------------
import java.io.*;

public class BOMUtil {

    //BOM type codes; each value is also the index into BOMBYTES
    public final static int NONE=-1;
    public final static int UTF32BE=0;
    public final static int UTF32LE=1;
    public final static int UTF16BE=2;
    public final static int UTF16LE=3;
    public final static int UTF8=4;

    public final static byte[] UTF32BEBOMBYTES = new byte[]{(byte)0x00, (byte)0x00, (byte)0xFE, (byte)0xFF};
    public final static byte[] UTF32LEBOMBYTES = new byte[]{(byte)0xFF, (byte)0xFE, (byte)0x00, (byte)0x00};
    public final static byte[] UTF16BEBOMBYTES = new byte[]{(byte)0xFE, (byte)0xFF};
    public final static byte[] UTF16LEBOMBYTES = new byte[]{(byte)0xFF, (byte)0xFE};
    public final static byte[] UTF8BOMBYTES = new byte[]{(byte)0xEF, (byte)0xBB, (byte)0xBF};

    //The UTF-32 entries come before the UTF-16 ones, so FF FE 00 00 is
    //matched as UTF-32LE before FF FE can match as UTF-16LE
    public final static byte[][] BOMBYTES=new byte[][]{
        UTF32BEBOMBYTES,
        UTF32LEBOMBYTES,
        UTF16BEBOMBYTES,
        UTF16LEBOMBYTES,
        UTF8BOMBYTES,
    };

    public final static int MAXBOMBYTES=4;//no BOM sequence is longer than 4 bytes

    public static int getBOMType(byte[] _bomBytes){
        return getBOMType(_bomBytes,_bomBytes.length);
    }

    /**
     * Returns the type of the BOM found in the first _length bytes, or NONE
     */
    public static int getBOMType(byte[] _bomBytes, int _length){
        for (int i = 0; i < BOMBYTES.length; i++) {
            for(int j=0; j<_length && j<BOMBYTES[i].length; j++){
                if(_bomBytes[j]!=BOMBYTES[i][j])
                    break;
                //all bytes of this BOM matched
                if(j==BOMBYTES[i].length-1)
                    return i;
            }
        }
        return NONE;
    }

    public static int getBOMType(File _f) throws IOException{
        FileInputStream fIn=new FileInputStream(_f);
        try{
            byte[] buff=new byte[MAXBOMBYTES];
            int read=fIn.read(buff);//-1 for an empty file, which yields NONE
            return getBOMType(buff,read);
        }finally{
            fIn.close();
        }
    }

    public static int getSkipBytes(int BOMType){
        if(BOMType<0 || BOMType>=BOMBYTES.length)
            return 0;
        return BOMBYTES[BOMType].length;
    }

    /**
     * Opens the file and skips the BOM bytes, if any, before decoding
     * @param _f the file to read
     * @param encoding the encoding used to decode the rest of the file
     */
    public static Reader getReader(File _f, String encoding) throws IOException{
        int BOMType=getBOMType(_f);
        int skipBytes=getSkipBytes(BOMType);
        FileInputStream fIn=new FileInputStream(_f);
        fIn.skip(skipBytes);
        Reader reader=new InputStreamReader(fIn,encoding);
        return reader;
    }
}

###@###.### 2004-12-10 00:52:07 GMT
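The workaround above opens the file twice, once to detect the BOM and once to read past it. As a rough sketch of a single-pass variant, the code below peeks at the first bytes of any InputStream through a PushbackInputStream and pushes them back when they are not a UTF-8 BOM. The class name BomSkipper and the method skipUtf8Bom are hypothetical, only the UTF-8 signature is handled, and this is an illustration under those assumptions rather than a tested replacement for BOMUtil.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BomSkipper {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // read() may deliver fewer bytes than requested, so loop until full or EOF
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!isBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees the full content
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws Exception {
        // "<a/>" preceded by the UTF-8 BOM, standing in for a file saved by an editor
        byte[] doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(doc));
        InputSource source = new InputSource(new InputStreamReader(clean, "UTF-8"));
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
        System.out.println("parsed without a BOM in front of the root element");
    }
}
```

The same wrapping can be applied to a FileInputStream. Supporting the UTF-16 and UTF-32 signatures would need a four-byte pushback buffer and the matching order used in BOMUtil.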