FULL PRODUCT VERSION :
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b15)
Java HotSpot(TM) Client VM (build 25.45-b02, mixed mode, sharing)
ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows [Version 6.1.7601]
SunOS xxxxx 5.10 Generic_147440-07 sun4v sparc sun4v
Linux yyyyy 2.6.18-348.el5xen #1 SMP Wed Nov 28 22:04:26 EST 2012 i686 i686 i386 GNU/Linux
A DESCRIPTION OF THE PROBLEM :
There appears to be a bug in the Apache XERCES UTF8Reader class bundled with Java that can occur when the input file contains 4-byte UTF-8 characters.
See https://issues.apache.org/jira/browse/XERCESJ-1257
Apparently this bug has existed in the Apache XERCES source since 2007 and still has not been fixed.
In 2007 Robert Stojnic posted a patch that still had a problem in it. Michael Glavassevich then committed a different fix and closed the issue, but that fix was also faulty: it caused a different exception, "org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence", and the issue was reopened.
Robert Stojnic described a slight modification to his original patch in a later comment (09/Jul/07) that gives a correct fix, but he did not update the posted UTF8Reader.patch file or attach a new patch file. The issue has been in limbo ever since, and the bug in the Apache XERCES sources has never been properly fixed: it was reported in 2007 and it is now 2015.
Apparently others using the affected Apache source code have resorted to patching it themselves before incorporating it into their products. I don't know how to get the Apache XERCES maintainers to fix this bug, so I suggest that the Oracle Java developers do the same and patch the XERCES code included with Oracle's Java, unless you know how to get the fix made in Apache XERCES and can then incorporate it.
This is the revised patch (from Robert Stojnic) that fixes the problem. It differs slightly from the UTF8Reader.patch attachment on the referenced Apache XERCES issue; Robert proposed the change in a comment at 09/Jul/07 13:07. At https://issues.apache.org/jira/browse/LUCENE-1591, in a comment at 18/Apr/09 12:13, Michael McCandless indicated that this patch was used to fix the bug in the XERCES sources included in the LUCENE product:
--- src/org/apache/xerces/impl/io/UTF8Reader.java 2006-11-23 00:36:53.000000000 +0100
+++ ../../xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java 2007-06-28 02:02:44.000000000 +0200
@@ -534,6 +534,16 @@
invalidByte(4, 4, b2);
}
+ // check if output buffer is large enough to hold 2 surrogate chars
+ if(out + 1 >= offset + length ){
+ fBuffer[0] = (byte)b0;
+ fBuffer[1] = (byte)b1;
+ fBuffer[2] = (byte)b2;
+ fBuffer[3] = (byte)b3;
+ fOffset = 4;
+ return out - offset;
+ }
+
// decode bytes into surrogate characters
int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
if (uuuuu > 0x10) {
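The check added by the patch guards against one specific situation: a 4-byte UTF-8 sequence decodes to two Java chars (a surrogate pair), so the decoder needs two free slots in the output array. The following is a minimal stand-alone sketch of that idea, not the XERCES code. The class and method names are hypothetical, and it assumes the four input bytes are already known to form a valid sequence.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateBoundaryDemo {

    /**
     * Decodes one 4-byte UTF-8 sequence into out[pos] and out[pos + 1].
     * Returns the number of chars written: 2 on success, or 0 if the two
     * surrogates would not fit, in which case the caller must retry later
     * with more room. No validation of the input bytes is done here.
     */
    static int decodeFourBytes(byte[] in, char[] out, int pos) {
        if (pos + 1 >= out.length) {
            return 0; // not enough room for high + low surrogate
        }
        int codePoint = ((in[0] & 0x07) << 18)
                      | ((in[1] & 0x3F) << 12)
                      | ((in[2] & 0x3F) << 6)
                      |  (in[3] & 0x3F);
        return Character.toChars(codePoint, out, pos);
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as four bytes in UTF-8.
        byte[] utf8 = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        System.out.println(utf8.length);                         // 4
        System.out.println(decodeFourBytes(utf8, buffer, 8190)); // 2
        System.out.println(decodeFourBytes(utf8, buffer, 8191)); // 0
    }
}
```

Without the room check, a write starting at index 8191 of an 8192-char buffer would place the low surrogate at index 8192, which is consistent with the index reported in the ArrayIndexOutOfBoundsException in this report.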
ADDITIONAL REGRESSION INFORMATION:
Problem does not occur in the following version of Java:
Java version "1.5.0_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
Java HotSpot(TM) Client VM (build 1.5.0_12-b04, mixed mode, sharing)
Problem seems to start to occur with Java version 1.6
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
This is the java command that causes the ArrayIndexOutOfBoundsException in UTF8Reader:
java -Dxml.catalog.files=catalog.xml -cp xpp.jar:saxon9pe.jar:resolver.jar com.xyenterprise.xpp.xslt.XppTransform -x:org.apache.xml.resolver.tools.ResolvingXMLReader -y:org.apache.xml.resolver.tools.ResolvingXMLReader -r:org.apache.xml.resolver.tools.CatalogResolver -s:divxml.xml -xsl:basic.xsl -o:output.xml
I don't see any way to attach the files needed to reproduce the error. I need to somehow get you the catalog.xml, xpp.jar, saxon9pe.jar, resolver.jar, divxml.xml, and basic.xsl files.
Due to the nature of the bug, reproducing it requires the exact divxml.xml input file. Our example divxml.xml contains a number of 4-byte UTF-8 characters, but any modification to the file, including changing from DOS line endings to UNIX line endings, prevents the error from occurring.
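As a rough illustration of why a line-ending change can make the error disappear, the sketch below lists the byte offsets of 4-byte UTF-8 sequences in a byte array and shows that the same character lands at a different offset once CR LF becomes LF. The class and method names are hypothetical, and the sketch assumes valid UTF-8 input. It does not model how the parser's reads line up with its internal buffers, so it cannot predict which offsets actually trigger the exception.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FourByteOffsets {

    /**
     * Returns the byte offset of every 4-byte UTF-8 lead byte (0xF0..0xF4).
     * In valid UTF-8 these values never occur as continuation bytes.
     */
    static List<Integer> find(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b >= 0xF0 && b <= 0xF4) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String fourByteChar = "\uD83D\uDE00"; // U+1F600
        byte[] dos = ("a\r\nb\r\n" + fourByteChar).getBytes(StandardCharsets.UTF_8);
        byte[] unix = ("a\nb\n" + fourByteChar).getBytes(StandardCharsets.UTF_8);
        System.out.println(find(dos));  // [6]
        System.out.println(find(unix)); // [4]
    }
}
```

Running the same scan over the real input file (for example, on the result of java.nio.file.Files.readAllBytes) would show where its 4-byte characters sit before and after any modification.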
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
A good output.xml result without getting an exception.
ACTUAL -
An ArrayIndexOutOfBoundsException is thrown from UTF8Reader.read, and the transformation fails.
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Warning: at xsl:stylesheet on line 1 column 80 of basic.xsl:
Running an XSLT 1.0 stylesheet with an XSLT 2.0 processor
java.lang.ArrayIndexOutOfBoundsException: 8192
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanLiteral(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanAttributeValue(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.xml.sax.helpers.XMLFilterImpl.parse(Unknown Source)
at org.apache.xml.resolver.tools.ResolvingXMLFilter.parse(ResolvingXMLFilter.java:141)
at net.sf.saxon.event.Sender.sendSAXSource(Sender.java:397)
at net.sf.saxon.event.Sender.send(Sender.java:156)
at net.sf.saxon.Controller.transform(Controller.java:1689)
at net.sf.saxon.Transform.processFile(Transform.java:1157)
at net.sf.saxon.Transform.doTransform(Transform.java:752)
at com.xyenterprise.xpp.xslt.XppTransform.main(XppTransform.java:58)
Fatal error during transformation: java.lang.ArrayIndexOutOfBoundsException: 8192
REPRODUCIBILITY :
This bug can be reproduced often.
---------- BEGIN SOURCE ----------
I need to somehow get you the catalog.xml, xpp.jar, saxon9pe.jar, resolver.jar, divxml.xml, and basic.xsl files for the java command that fails (detailed in the Steps to Reproduce).
How do I get those files to you?
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
One workaround that we've found is to replace the 4-byte UTF-8 characters in the input file with numeric character references. The buffer boundary problem in UTF8Reader then does not occur.
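A minimal sketch of this workaround follows. It replaces every code point above U+FFFF (the ones that need four bytes in UTF-8) with a hexadecimal character reference. The class and method names are hypothetical. It assumes the document has already been decoded to a String and that the affected characters appear only in text content or attribute values, since character references are not recognized inside CDATA sections, comments, or names.

```java
import java.util.Locale;

final class SupplementaryEscaper {

    /** Replaces each supplementary code point with a reference such as &#x1F600;. */
    static String escapeSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "<a t=\"\uD83D\uDE00\"/>";
        System.out.println(escapeSupplementary(input)); // <a t="&#x1F600;"/>
    }
}
```

After this preprocessing the file contains no 4-byte UTF-8 sequences, so the surrogate-pair path in the reader is not exercised.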