FULL PRODUCT VERSION :
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Darwin boolean.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64
A DESCRIPTION OF THE PROBLEM :
The Xerces-based SAX parser bundled with the JDK rejects supplementary Unicode characters (code points >= U+10000) in XML 1.0/1.1 comments. The same bug existed in the original Apache codebase (XMLScanner), where it was fixed in revision 319636 (2003-12-16).
Parsing the following XML snippet results in
SAXParseException; systemId: <omitted>; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
With the comment removed, the <literal> element, which contains the same Unicode character as the comment, parses without error.
<!-- Entry for Kanji: 𠀋 -->
<character>
<literal>𠀋</literal>
<codepoint>
<cp_value cp_type="ucs">2000B</cp_value>
<cp_value cp_type="jis213">1-14-2</cp_value>
</codepoint>
</character>
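For reference, the value 0xd840 in the error message is the high half of the UTF-16 surrogate pair for U+2000B. The following is a minimal sketch of that mapping; the class name SurrogatePairDemo and the helper toSurrogates are made up for illustration and are not part of Xerces.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: splits a supplementary code point (>= U+10000)
    // into its UTF-16 surrogate pair using the standard library.
    static char[] toSurrogates(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x2000B);
        // pair[0] is the high surrogate, pair[1] the low surrogate.
        System.out.printf("high=0x%04x low=0x%04x%n", (int) pair[0], (int) pair[1]);
        // A high surrogate on its own is not a character; only the pair is.
        System.out.println(Character.isHighSurrogate(pair[0]));
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x2000B);
    }
}
```

The parser sees the two code units separately, which is why a check applied to the first unit alone reports 0xd840 rather than 2000B.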
The problem is in XMLScanner.scanComment(XMLStringBuffer). After a surrogate pair has been detected and successfully scanned, an additional check, isInvalidLiteral(c), is still performed on c, which at that point holds the high surrogate. This check has to move into an else branch following the XMLChar.isHighSurrogate(c) test.
The following diff from the Apache codebase summarizes the necessary change:
Index: src/org/apache/xerces/impl/XMLScanner.java
===================================================================
--- src/org/apache/xerces/impl/XMLScanner.java (revision 319635)
+++ src/org/apache/xerces/impl/XMLScanner.java (revision 319636)
@@ -757,7 +757,7 @@
if (XMLChar.isHighSurrogate(c)) {
scanSurrogates(text);
}
- if (isInvalidLiteral(c)) {
+ else if (isInvalidLiteral(c)) {
reportFatalError("InvalidCharInComment",
new Object[] { Integer.toHexString(c) });
fEntityScanner.scanChar();
@@ -951,6 +951,7 @@
}
}
else if (c != -1 && XMLChar.isHighSurrogate(c)) {
+ fStringBuffer3.clear();
if (scanSurrogates(fStringBuffer3)) {
fStringBuffer.append(fStringBuffer3);
if (entityDepth == fEntityDepth) {
@@ -1354,6 +1355,14 @@
return (XMLChar.isNameStart(value));
} // isValidNameStartChar(int): boolean
+ // returns true if the given character is
+ // a valid high surrogate for a nameStartChar
+ // with respect to the version of XML understood
+ // by this scanner.
+ protected boolean isValidNameStartHighSurrogate(int value) {
+ return false;
+ } // isValidNameStartHighSurrogate(int): boolean
+
protected boolean versionSupported(String version ) {
return version.equals("1.0");
} // version Supported
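The effect of the first hunk can be illustrated with a simplified model of the loop. This is only a sketch under stated assumptions: CommentScanSketch, scan and the fixed flag are hypothetical names, the surrogate handling stands in for scanSurrogates, and isInvalidLiteral is modeled as "any lone surrogate code unit is invalid", which is an approximation of the real check and not the Xerces implementation.

```java
public class CommentScanSketch {
    // Assumed stand-in for isInvalidLiteral(c): a single surrogate
    // code unit is never a valid XML character by itself.
    static boolean isInvalidLiteral(int c) {
        return c >= 0xD800 && c <= 0xDFFF;
    }

    // Returns true if the comment text is accepted.
    // fixed == false models "if (...) {} if (...)", fixed == true models "if (...) {} else if (...)".
    static boolean scan(String text, boolean fixed) {
        int i = 0;
        while (i < text.length()) {
            int c = text.charAt(i);
            boolean high = Character.isHighSurrogate((char) c);
            if (high) {
                // Stand-in for scanSurrogates: require and consume the low surrogate.
                if (i + 1 >= text.length() || !Character.isLowSurrogate(text.charAt(i + 1))) {
                    return false;
                }
                i += 2;
            }
            // Before the fix this check always runs, and c still holds the high surrogate.
            boolean checkLiteral = !fixed || !high;
            if (checkLiteral && isInvalidLiteral(c)) {
                return false;
            }
            if (!high) {
                i++;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String comment = " \uD840\uDC0B ";
        System.out.println("before fix: " + scan(comment, false));
        System.out.println("after fix:  " + scan(comment, true));
    }
}
```

In this model the unfixed variant rejects every well-formed surrogate pair, while a lone low surrogate is rejected by both variants, so the else branch does not weaken the check for genuinely invalid input.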
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Parse an XML document that contains a supplementary character (code point >= U+10000) inside a comment. See test case below.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Since supplementary characters are valid in XML 1.0/1.1 comments, the expected result is that such XML can be parsed with SAX/Xerces.
ACTUAL -
SAXParseException: An invalid XML character (Unicode: <omitted>) was found in the comment.
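The expectation can be checked against the Char production of the XML 1.0 specification, which includes the range #x10000-#x10FFFF. The sketch below tests code points rather than UTF-16 code units; the class Xml10CharCheck and the method isXml10Char are illustrative names, not JDK or Xerces API.

```java
public class Xml10CharCheck {
    // XML 1.0 Char production, applied to a full code point:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        String commentText = " \uD840\uDC0B ";
        // codePoints() joins each surrogate pair into one code point before testing.
        boolean allValid = commentText.codePoints().allMatch(Xml10CharCheck::isXml10Char);
        System.out.println("comment text valid: " + allValid);
        // The high surrogate alone falls in the excluded gap between #xD7FF and #xE000.
        System.out.println("0xD840 alone valid: " + isXml10Char(0xD840));
    }
}
```

Tested as a code point, U+2000B is valid; it only looks invalid when the high surrogate 0xD840 is tested in isolation, which matches the reported error message.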
ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/Users/dehmer/Development/carbon/kanjidic2.small.xml; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanComment(XMLScanner.java:789)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanComment(XMLDocumentFragmentScannerImpl.java:1038)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:904)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)
REPRODUCIBILITY :
This bug is always reproducible.
---------- BEGIN SOURCE ----------
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
/**
* Unicode character U+2000B should be valid in XML 1.0/1.1 comments.
* The corresponding surrogate pair is (high) 0xd840, (low) 0xdc0b.
*
* org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 8;
* An invalid XML character (Unicode: 0xd840) was found in the comment.
*/
public class XMLScannerSupplementalCharactersInComment {
    private final static String XML[] = {
        "<tag>\uD840\uDC0B</tag>", // passes, since char is not in comment position.
        "<!-- \uD840\uDC0B --><dontCare/>" // fails => SAXParseException
    };

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        for (String xml : XML) {
            try (ByteArrayInputStream stream = new ByteArrayInputStream(xml.getBytes("UTF-8"))) {
                System.out.print("parsing: '" + xml + "'... ");
                parser.parse(stream, new DefaultHandler());
                System.out.println("passed.");
            }
            catch (SAXParseException unexpected) {
                System.out.println("failed. " + unexpected.getMessage());
            }
        }
    }
}
---------- END SOURCE ----------