JDK-8072081 : Supplementary characters are rejected in comments
  • Type: Bug
  • Component: xml
  • Sub-Component: javax.xml.parsers
  • Affected Version: 8,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: other
  • CPU: x86
  • Submitted: 2015-02-01
  • Updated: 2016-07-21
  • Resolved: 2015-12-10
Fixed in:
  • JDK 7: 7u111
  • JDK 8: 8u102
  • JDK 9: 9 b97
Description
FULL PRODUCT VERSION :
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Darwin boolean.local 14.1.0 Darwin Kernel Version 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64 x86_64

A DESCRIPTION OF THE PROBLEM :
SAX/Xerces rejects supplementary Unicode characters (code points >= U+10000) in XML 1.0/1.1 comments. This bug existed in the original Apache codebase (XMLScanner) and was fixed there in revision 319636 (2003-12-16).

Parsing the XML snippet below fails with:
SAXParseException; systemId: <omitted>; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.

After the comment is removed, the <literal> element, which contains the same Unicode character as the comment, is parsed without error.

<!-- Entry for Kanji: 𠀋 -->
<character>
<literal>𠀋</literal>
<codepoint>
<cp_value cp_type="ucs">2000B</cp_value>
<cp_value cp_type="jis213">1-14-2</cp_value>
</codepoint>
</character>
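
For reference, the code point 2000B from the snippet lies outside the Basic Multilingual Plane, so a Java string stores it as two UTF-16 code units. The following is a minimal sketch of that mapping using only `java.lang.Character`; the class name and the `utf16Units` helper are hypothetical and not part of the parser.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: formats the UTF-16 code units of a code point as hex.
    static String utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x2000B;
        // A supplementary code point needs a high and a low surrogate.
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true
        System.out.println(utf16Units(cp)); // 0xd840 0xdc0b
    }
}
```

The first unit, 0xd840, is the value named in the exception message.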

The problem is in XMLScanner.scanComment(XMLStringBuffer). After a surrogate pair has been detected and successfully scanned, an additional check (isInvalidLiteral(c)) is still performed on the current character, which is the high surrogate and therefore fails the check. This check has to move into an else branch following the XMLChar.isHighSurrogate(c) test.
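
The control-flow issue can be illustrated with a simplified model. This is only a sketch, not the Xerces code: `CommentScanSketch`, `firstRejected`, `isInvalidUnit` and the `useElse` flag are hypothetical names, the validity check is reduced to "a surrogate unit on its own is invalid", and entity handling is left out.

```java
public class CommentScanSketch {
    // Hypothetical stand-in for the per-unit validity check: a UTF-16
    // surrogate unit examined on its own is not a valid XML character.
    static boolean isInvalidUnit(char c) {
        return Character.isSurrogate(c);
    }

    // Returns the index of the first rejected UTF-16 unit, or -1 if the text is accepted.
    // useElse = false models two independent ifs; useElse = true models if / else-if.
    static int firstRejected(String text, boolean useElse) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int next = i + 1;
            boolean high = Character.isHighSurrogate(c);
            if (high) {
                if (next < text.length() && Character.isLowSurrogate(text.charAt(next))) {
                    next = i + 2; // well-formed pair: consume both units
                } else {
                    return i; // unpaired high surrogate
                }
            }
            // Without the else, this check also runs on a high surrogate
            // whose pair has just been accepted above.
            boolean runCheck = !useElse || !high;
            if (runCheck && isInvalidUnit(c)) {
                return i;
            }
            i = next;
        }
        return -1;
    }

    public static void main(String[] args) {
        String comment = " \uD840\uDC0B ";
        System.out.println(firstRejected(comment, false)); // 1
        System.out.println(firstRejected(comment, true));  // -1
    }
}
```

In this model the two-`if` variant rejects the high surrogate at index 1 even though its pair is well formed, while the `else if` variant accepts the whole string and still rejects an unpaired surrogate.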

The following diff from the Apache codebase summarizes the necessary change:

Index: src/org/apache/xerces/impl/XMLScanner.java
===================================================================
--- src/org/apache/xerces/impl/XMLScanner.java  (revision 319635)
+++ src/org/apache/xerces/impl/XMLScanner.java  (revision 319636)
@@ -757,7 +757,7 @@
                 if (XMLChar.isHighSurrogate(c)) {
                     scanSurrogates(text);
                 }
-                if (isInvalidLiteral(c)) {
+                else if (isInvalidLiteral(c)) {
                     reportFatalError("InvalidCharInComment",
                                      new Object[] { Integer.toHexString(c) }); 
                     fEntityScanner.scanChar();
@@ -951,6 +951,7 @@
                     }
                 }
                 else if (c != -1 && XMLChar.isHighSurrogate(c)) {
+                    fStringBuffer3.clear();
                     if (scanSurrogates(fStringBuffer3)) {
                         fStringBuffer.append(fStringBuffer3);
                         if (entityDepth == fEntityDepth) {
@@ -1354,6 +1355,14 @@
         return (XMLChar.isNameStart(value)); 
     } // isValidNameStartChar(int):  boolean
     
+    // returns true if the given character is 
+    // a valid high surrogate for a nameStartChar 
+    // with respect to the version of XML understood 
+    // by this scanner.
+    protected boolean isValidNameStartHighSurrogate(int value) {
+        return false; 
+    } // isValidNameStartHighSurrogate(int):  boolean
+    
     protected boolean versionSupported(String version ) {
         return version.equals("1.0");
     } // version Supported

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Parse XML that contains a supplementary character inside a comment. See the test case below.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Since supplementary characters are valid in XML 1.0/1.1 comments, such XML should be parsed successfully by SAX/Xerces.
ACTUAL -
SAXParseException: An invalid XML character (Unicode: <omitted>) was found in the comment.
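
The expectation can be checked against the Char production of XML 1.0, which includes the range #x10000-#x10FFFF (XML 1.1 includes the same supplementary range). The sketch below evaluates that production per code point rather than per UTF-16 unit; the class and method names are hypothetical and it is not how the parser itself implements the check.

```java
public class XmlCharCheck {
    // XML 1.0 Char production, evaluated on a full code point:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True if every code point of the text matches the Char production.
    // An unpaired surrogate is reported by codePoints() as its own value
    // and therefore fails the check.
    static boolean allXml10Chars(String text) {
        return text.codePoints().allMatch(XmlCharCheck::isXml10Char);
    }

    public static void main(String[] args) {
        System.out.println(allXml10Chars(" \uD840\uDC0B ")); // true
        System.out.println(allXml10Chars("\uD840"));         // false
    }
}
```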


ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" org.xml.sax.SAXParseException; systemId: file:/Users/dehmer/Development/carbon/kanjidic2.small.xml; lineNumber: 1; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
	at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
	at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1436)
	at com.sun.org.apache.xerces.internal.impl.XMLScanner.scanComment(XMLScanner.java:789)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanComment(XMLDocumentFragmentScannerImpl.java:1038)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:904)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
	at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:649)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:333)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;

/**
 * Unicode U+2000B is valid in XML 1.0/1.1 comments.
 * The corresponding surrogate pair is (high) 0xd840, (low) 0xdc0b.
 *
 * org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 8;
 * An invalid XML character (Unicode: 0xd840) was found in the comment.
 */
public class XMLScannerSupplementalCharactersInComment {
    private static final String[] XML = {
            "<tag>\uD840\uDC0B</tag>", // passes, since char is not in comment position.
            "<!-- \uD840\uDC0B --><dontCare/>" // fails => SAXParseException
    };

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();

        for(String xml : XML) {
            try (ByteArrayInputStream stream = new ByteArrayInputStream(xml.getBytes("UTF-8"))) {
                System.out.print("parsing: '" + xml + "'... ");
                parser.parse(stream, new DefaultHandler());
                System.out.println("passed.");
            }
            catch(SAXParseException unexpected) {
                System.out.println("failed. " + unexpected.getMessage());
            }
        }
    }
}

---------- END SOURCE ----------


Comments
Tested with JDK 7u76 and 8u31; the issue reproduces on both JDK 7 and JDK 8.
02-02-2015