Bug ID: JDK-4763569 Crimson encoding detection fails on slow InputStreams.

Type: Bug
Component: xml
Sub-Component: org.xml.sax
Affected Version: 1.4.1
Priority: P3
Status: Closed
Resolution: Cannot Reproduce
OS: linux
CPU: x86
Submitted: 2002-10-16
Updated: 2012-04-25
Resolved: 2002-12-27
Name: gm110360			Date: 10/15/2002


FULL PRODUCT VERSION :
java version "1.4.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-b21)
Java HotSpot(TM) Client VM (build 1.4.1-b21, mixed mode)


FULL OPERATING SYSTEM VERSION :
Redhat Linux 7.1, glibc-2.2.4-24, 2.4.9-34

A DESCRIPTION OF THE PROBLEM :
If Crimson reads from a slow InputStream, the encoding is
set to UTF-8 (the fallback), no matter what encoding the
document declares.
I took a look at the Crimson sources, and the code appears
to be buggy. It works like this:
"try to read 5 characters"
"if that worked, parse the encoding"
"if you read fewer than 5 characters, fall back to UTF-8"

That's fatal if you use another encoding. The problem can
be avoided by using a BufferedInputStream, but it should
be fixed in the parser (it took us three days to figure out).
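
A minimal sketch of the failure mode described above, assuming the
detection logic really is a single read() call followed by a length
check. This is not Crimson's actual code; ShortReadDemo, OneByteStream
and naiveDetect are hypothetical names made up for illustration.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ShortReadDemo {
    /** Hands out at most one byte per bulk read, like class A in the test case. */
    static class OneByteStream extends FilterInputStream {
        OneByteStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            int b = in.read();
            if (b == -1) return -1;
            buf[off] = (byte) b;
            return 1;
        }
    }

    /**
     * Hypothetical detector: issues ONE read() for the 5-byte prefix and
     * treats a short read as "no XML declaration present".
     */
    static String naiveDetect(InputStream in) throws IOException {
        byte[] prefix = new byte[5];
        int n = in.read(prefix);      // may legally return fewer than 5 bytes
        if (n < 5) return "UTF-8";    // short read mistaken for a missing declaration
        String head = new String(prefix, StandardCharsets.US_ASCII);
        return head.equals("<?xml") ? "declaration" : "UTF-8";
    }
}
```

Given the same bytes, naiveDetect returns "declaration" for a plain
ByteArrayInputStream and "UTF-8" once the stream is wrapped in
OneByteStream, because InputStream.read(byte[]) is allowed to return
fewer bytes than requested.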

With real slow InputStreams the problem occurs only
occasionally; the test case provided triggers the error every time.

This bug probably occurs on any platform.

I'd really like to supply a patch, but since the Crimson
code seems to be generated by a parser generator or
something like that, it would be useless. Additionally,
the above code is used in several places inside Crimson.
We fixed it by supplying a new PushBackInputStream class
in the same package that hides and makes use of the
original one, but that's not the way it should be done,
I think...
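
The replacement class itself is not part of this report. As a rough
sketch of the general idea only, a wrapper can keep reading until the
requested number of bytes has arrived or the stream ends.
FullReadInputStream is a hypothetical name, and this is an assumption
about the approach, not the class we actually shipped.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Wrapper whose bulk read only returns short at end of stream. */
class FullReadInputStream extends FilterInputStream {
    FullReadInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                // End of stream: report what was collected, or -1 if nothing.
                return total == 0 ? -1 : total;
            }
            if (n == 0) break;   // guard against a non-conforming stream
            total += n;
        }
        return total;
    }
}
```

The trade-off is that such a wrapper blocks until the full request is
satisfied, which is what the 5-character prefix check needs but may be
undesirable for interactive streams.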


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Get Test.java and test.xml from this bug report.
2. javac Test.java
3. java Test



EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected result would be (no error):
hmm, lets see what we got here: class org.apache.crimson.jaxp.DocumentBuilderImpl
<?xml version="1.0" encoding="UTF-8"?>
<user id="a"/>

Actual result is (the parser assumes UTF-8, which is wrong):
hmm, lets see what we got here: class org.apache.crimson.jaxp.DocumentBuilderImpl
Exception in thread "main" org.xml.sax.SAXParseException: Character conversion error: "Malformed UTF-8 char -- is an XML encoding declaration missing?"


ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" org.xml.sax.SAXParseException: Character conversion error: "Malformed UTF-8 char -- is an XML encoding declaration missing?" (line number may be too low).
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1100)
        at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1072)
        at org.apache.crimson.parser.InputEntity.isXmlDeclOrTextDeclPrefix(InputEntity.java:914)
        at org.apache.crimson.parser.Parser2.maybeXmlDecl(Parser2.java:1009)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:486)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:442)
        at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
        at Test.main(Test.java:50)


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Test.java
---- snipp
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;
import javax.xml.transform.dom.*;
import java.io.*;
import org.xml.sax.*;
import org.w3c.dom.*;

class Test {
        // Simulates a slow stream: each bulk read returns at most one byte.
        public static class A extends InputStream
        {
                final InputStream m_in;

                public A ( InputStream in )
                {
                        m_in = in;
                }
 
                public int read ()
                throws IOException
                {
                        return m_in.read();
                }
 
                public int read( byte[] buffer )
                throws IOException
                {
                        return read ( buffer, 0, buffer.length) ;
                }
 
                public int read ( byte[] buffer, int off, int len )
                throws IOException
                {
                        if ( len == 0 ) return 0;
                        int b = read();
                        if ( b == -1 ) return -1;
                        buffer[off] = (byte)b;
                        return 1;
                }
        }
 
        public static void main(String[] args) throws Exception {
                A in = new A ( new FileInputStream ( "test.xml"));

                DocumentBuilderFactory fac = DocumentBuilderFactory.newInstance();
                DocumentBuilder db = fac.newDocumentBuilder();

                System.out.println("hmm, lets see what we got here: " + db.getClass());

                Document o  = db.parse(new InputSource (in));
                TransformerFactory tf = TransformerFactory.newInstance();
                Transformer t = tf.newTransformer();
                t.transform ( new DOMSource ( o), new StreamResult ( System.out ));
        }
 
}
-- snipp

test.xml
-- snipp
<?xml version="1.0" encoding="iso-8859-1"?>
<user id="ä"/>
-- snipp

The attribute id has a German umlaut as its value, which is valid in ISO-8859-1.

---------- END SOURCE ----------

CUSTOMER WORKAROUND :
Wrapping the InputStream in a BufferedInputStream worked in
100% of my tries.
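
A minimal sketch of this workaround, assuming the standard JAXP entry
points already used in Test.java. BufferedParse and its parse method
are hypothetical names for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class BufferedParse {
    /** Parses raw through a BufferedInputStream instead of handing it over directly. */
    static Document parse(InputStream raw)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The buffer sits between the parser and the slow source.
        InputStream in = new BufferedInputStream(raw);
        return db.parse(new InputSource(in));
    }
}
```

For a file this corresponds to passing
new BufferedInputStream(new FileInputStream("test.xml")) to the parser.
Note that BufferedInputStream only keeps refilling while the underlying
stream reports available() > 0, so this makes short reads less likely
rather than ruling them out.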
(Review ID: 165807) 
======================================================================
EVALUATION Sorry, I couldn't reproduce this problem. However, I would suggest using Xerces 2. We don't plan to fix bugs in Crimson going forward. ###@###.### 2002-10-30
30-10-2002