Name: gm110360
Date: 10/15/2002

FULL PRODUCT VERSION :
java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-b21)
Java HotSpot(TM) Client VM (build 1.4.1-b21, mixed mode)

FULL OPERATING SYSTEM VERSION :
Red Hat Linux 7.1, glibc-2.2.4-24, 2.4.9-34

A DESCRIPTION OF THE PROBLEM :
If crimson reads from a slow InputStream, the encoding is set to UTF-8 (the fallback), no matter what encoding the document declares.

I took a look at the crimson sources, and the code appears to be buggy. It works like this:

"try to read 5 characters"
"if that worked, parse the encoding"
"if you read less than 5 characters, fall back to UTF-8"

That is fatal if the document uses another encoding, because a single read on a slow stream can legitimately return fewer than 5 characters.

The problem can be avoided by using a BufferedInputStream, but it should be fixed in the parser (it took us three days to figure out). On slow InputStreams the problem occurs rarely; the test case provided shows the error every time.

This bug probably occurs on any platform.

I'd really like to supply a patch, but since the crimson code seems to be generated by a parser generator or something like that, a patch would be useless. Additionally, the above code is used in several places inside crimson. We worked around it by supplying a new PushBackInputStream class in the same package that hides and makes use of the original one, but that is not the way it should be done, I think...

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. get Test.java and test.xml from this bug report.
2. javac Test.java
3. java Test

EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected result would be (no error):

hmm, lets see what we got here: class org.apache.crimson.jaxp.DocumentBuilderImpl
<?xml version="1.0" encoding="UTF-8"?>
<user id="a"/>

Actual result is (the parser assumes UTF-8, which is wrong):

hmm, lets see what we got here: class org.apache.crimson.jaxp.DocumentBuilderImpl
Exception in thread "main" org.xml.sax.SAXParseException: Character conversion error: "Malformed UTF-8 char -- is an XML encoding declaration missing?"

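The "Malformed UTF-8 char" error above is what a strict UTF-8 decoder reports when it meets an ISO-8859-1 umlaut byte. The following is a minimal sketch of that mismatch using the JDK's own charset decoder, not crimson's code. The class name FallbackDecodeSketch and the helper decodesAsUtf8 are hypothetical, and "ä" is only assumed as a stand-in for the umlaut in test.xml.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class FallbackDecodeSketch {

    // Returns true if the bytes decode cleanly as UTF-8, false if the decoder
    // rejects them. A decoder from newDecoder() reports malformed input by
    // default instead of silently replacing it.
    static boolean decodesAsUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: "ä" (byte 0xE4 in ISO-8859-1) stands in for the umlaut
        // used as the attribute value in test.xml.
        byte[] latin1 = "<user id=\"\u00e4\"/>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("valid as UTF-8: " + decodesAsUtf8(latin1));
        System.out.println("as ISO-8859-1:  "
                + new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

The same bytes are well-formed in the declared encoding, so the failure comes only from the parser picking the wrong decoder.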
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" org.xml.sax.SAXParseException: Character conversion error: "Malformed UTF-8 char -- is an XML encoding declaration missing?" (line number may be too low).
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
        at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1072)
        at org.apache.crimson.parser.InputEntity.isXmlDeclOrTextDeclPrefix(InputEntity.java:914)
        at org.apache.crimson.parser.Parser2.maybeXmlDecl(Parser2.java:1009)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:486)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
        at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
        at Test.main(Test.java:50)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Test.java
---- snipp
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import javax.xml.transform.dom.*;
import java.io.*;
import org.xml.sax.*;
import org.w3c.dom.*;

class Test {

    // Simulates a slow stream: every bulk read returns at most one byte.
    public static class A extends InputStream {
        final InputStream m_in;

        public A(InputStream in) {
            m_in = in;
        }

        public int read() throws IOException {
            return m_in.read();
        }

        public int read(byte[] buffer) throws IOException {
            return read(buffer, 0, buffer.length);
        }

        public int read(byte[] buffer, int off, int len) throws IOException {
            if (len == 0)
                return 0;
            int b = read();
            if (b == -1)
                return -1;
            // Deliver a single byte even if the caller asked for more.
            buffer[off] = (byte) b;
            return 1;
        }
    }

    public static void main(String[] args) throws Exception {
        A in = new A(new FileInputStream("test.xml"));
        DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = fac.newDocumentBuilder();
        System.out.println("hmm, lets see what we got here: " + db.getClass());

        // Parsing fails here when the parser falls back to UTF-8.
        Document o = db.parse(new InputSource(in));

        // Write the parsed document to stdout.
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        t.transform(new DOMSource(o), new StreamResult(System.out));
    }
}
-- snipp

test.xml
-- snipp
<?xml version="1.0" encoding="iso-8859-1"?>
<user id="��"/>
-- snipp

The attribute id has a German umlaut as its value, which is valid in ISO-8859-1.
---------- END SOURCE ----------

CUSTOMER WORKAROUND :
Using a BufferedInputStream instead of a plain InputStream worked in 100% of my tries.
(Review ID: 165807)
======================================================================
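
The short-read problem in the description can be sketched outside the parser. A single call to InputStream.read(byte[], int, int) may return fewer bytes than requested, so code that needs the first 5 bytes of a document has to loop until they have arrived or the stream ends. The sketch below is not crimson's code and not a patch for it. The names ShortReadSketch, Trickle and readUpTo are hypothetical, and Trickle only mimics class A from Test.java.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ShortReadSketch {

    // Hypothetical stand-in for a slow stream: like class A in Test.java,
    // every bulk read hands back at most one byte.
    static class Trickle extends InputStream {
        private final InputStream in;

        Trickle(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int b = in.read();
            if (b == -1) return -1;
            buf[off] = (byte) b;
            return 1;
        }
    }

    // Keeps reading until 'want' bytes have arrived or the stream ends.
    // Returns the number of bytes actually stored in buf.
    static int readUpTo(InputStream in, byte[] buf, int want) throws IOException {
        int total = 0;
        while (total < want) {
            int n = in.read(buf, total, want - total);
            if (n < 0) break; // end of stream
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[5];

        // One read call: a short count here does not mean the data is short.
        int single = new Trickle(new ByteArrayInputStream(xml)).read(buf, 0, 5);

        // Looping read: collects all 5 bytes despite the slow stream.
        int looped = readUpTo(new Trickle(new ByteArrayInputStream(xml)), buf, 5);

        System.out.println("single read: " + single + ", looped read: " + looped);
    }
}
```

Treating a short count from one read call as "no XML declaration present" is the step that leads to the UTF-8 fallback. Only end of stream before 5 bytes would justify that conclusion.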