JDK-8203810 : XML Transformer produces separately escaped surrogate pair instead of codepoint
  • Type: Bug
  • Component: xml
  • Sub-Component: javax.xml.transform
  • Affected Version: 8u171
  • Priority: P4
  • Status: Closed
  • Resolution: Not an Issue
  • OS: linux_ubuntu
  • CPU: x86_64
  • Submitted: 2018-05-24
  • Updated: 2018-05-29
  • Resolved: 2018-05-29
Description
ADDITIONAL SYSTEM INFORMATION :
Ubuntu 18.04, Oracle Java 1.8.0_171

A DESCRIPTION OF THE PROBLEM :
When serializing XML that contains a character represented by a Unicode surrogate pair, "\uD840\uDC0B", the XML Transformer escapes each surrogate separately, which makes the resulting XML unparseable, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character. I have tried several approaches and none worked.
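The parser error above follows from the XML 1.0 "Char" production, which excludes the surrogate range U+D800 to U+DFFF. The check below is a minimal sketch of that rule; the class and method names are illustrative and are not part of any JDK or Xerces API.

```java
// Sketch: tests a code point against the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
// Names are hypothetical, chosen for this example only.
final class XmlCharCheck {
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(55360));   // high surrogate 0xD840: false
        System.out.println(isXml10Char(56331));   // low surrogate 0xDC0B: false
        System.out.println(isXml10Char(131083));  // U+2000B: true
    }
}
```

Both halves of the pair fall in the excluded range, so neither "&#55360;" nor "&#56331;" can be a legal character reference on its own, while "&#131083;" is legal.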


Output of my test:
Character: 𠀋
EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
  ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
EXPECTED PARSED CHAR 𠀋

This seems to be the same issue as https://stackoverflow.com/questions/41636186/xml-support-for-new-utf-8-like-smileys

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Serialize XML containing a character that consists of a high surrogate followed by a low surrogate, "\uD840\uDC0B".

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
&#131083;
ACTUAL -
&#55360;&#56331;
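The relationship between the two numbers in ACTUAL and the single number in EXPECTED is the standard UTF-16 decoding arithmetic. The sketch below works it through for this pair; the class and the `combine` helper are illustrative names, while `Character.toCodePoint` and `String.codePointAt` are the standard JDK calls.

```java
// Sketch: combines a UTF-16 surrogate pair into one code point.
// SurrogateMath and combine() are hypothetical names for this example.
final class SurrogateMath {
    static int combine(char high, char low) {
        // 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        String value = "\uD840\uDC0B";
        char high = value.charAt(0);
        char low = value.charAt(1);

        System.out.println((int) high);                         // 55360
        System.out.println((int) low);                          // 56331
        System.out.println(combine(high, low));                 // 131083
        System.out.println(Character.toCodePoint(high, low));   // 131083
        System.out.println(value.codePointAt(0));               // 131083
    }
}
```

Here 0xD840 - 0xD800 = 0x40, shifted left by 10 bits gives 0x10000, and 0xDC0B - 0xDC00 = 0xB, so the sum is 0x2000B, which is 131083 in decimal.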

---------- BEGIN SOURCE ----------
        String value = "\uD840\uDC0B";
        System.out.println("Character: " + value);
        System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
        StringWriter writer = new StringWriter();

        final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = documentBuilder.newDocument();
        final Element rootEl = dom.createElement("a");
        rootEl.setTextContent(value);
        dom.appendChild(rootEl);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
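        // Note: this property is set after transform() has already run, so it does not affect the output above.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");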
        String xml = writer.toString();
        System.out.println("  ACTUAL: " + xml);

        InputSource inputSource = new InputSource();
        inputSource.setCharacterStream(new StringReader(xml));
        System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
---------- END SOURCE ----------
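For comparison with the reproducer above, the following is a minimal sketch of escaping text per code point rather than per UTF-16 `char`, which is what yields the EXPECTED form. It is not the Transformer's implementation and not a fix for it; `CodePointEscaper` and `escapeNonAscii` are hypothetical names, and the sketch assumes the input contains no unpaired surrogates.

```java
// Sketch: escapes markup characters and every non-ASCII code point as a
// decimal numeric character reference. Hypothetical helper, not a JDK API.
// Assumption: the input has no unpaired surrogates; codePoints() would pass
// a lone surrogate through as its own value, producing an invalid reference.
final class CodePointEscaper {
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        // codePoints() yields one int per code point, so a valid surrogate
        // pair arrives here as a single value such as 131083.
        text.codePoints().forEach(cp -> {
            if (cp == '&') {
                out.append("&amp;");
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "\uD840\uDC0B";
        System.out.println("<a>" + escapeNonAscii(value) + "</a>"); // <a>&#131083;</a>
    }
}
```

Iterating with `charAt` instead of `codePoints()` would visit the two surrogates separately, which is the shape of the ACTUAL output in this report.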

FREQUENCY : always



Comments
From submitter: Thank you for the quick response. I should have been more thorough with my investigation; it's actually a bug in Xalan. https://issues.apache.org/jira/browse/XALANJ-2617
29-05-2018

Got the following expected output with JDK 8u172 on ubuntu 14.0.4 :
$ jdk1.8.0_172/bin/java JI9053942
Character: 𠀋
EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
ACTUAL: <?xml version="1.0" encoding="UTF-8" standalone="no"?><a>&#131083;</a>
ACTUAL PARSED CHAR 𠀋
Sent a mail to the submitter asking if there are any additional inputs I could use to reproduce the issue locally.
25-05-2018