ADDITIONAL SYSTEM INFORMATION :
Ubuntu 18.04, Oracle Java 1.8.0_171

A DESCRIPTION OF THE PROBLEM :
When serializing XML whose text contains a character encoded as a Unicode surrogate pair, such as "\uD840\uDC0B", the XML Transformer escapes the high and low surrogates as two separate character references, which makes the resulting XML unparseable. I have tried several approaches and none worked.

e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character.

Output of my test:
Character: 𠀋
EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
EXPECTED PARSED CHAR 𠀋

This seems to be the same issue as https://stackoverflow.com/questions/41636186/xml-support-for-new-utf-8-like-smileys

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Serialize XML containing a character made up of a high surrogate followed by a low surrogate, "\uD840\uDC0B".

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
&#131083; (𠀋)
ACTUAL -
&#55360;&#56331;

---------- BEGIN SOURCE ----------
String value = "\uD840\uDC0B";
System.out.println("Character: " + value);
System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");

StringWriter writer = new StringWriter();
final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document dom = documentBuilder.newDocument();
final Element rootEl = dom.createElement("a");
rootEl.setTextContent(value);
dom.appendChild(rootEl);

Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
// Note: this property is set after transform() has run, so it does not affect the output above.
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");

String xml = writer.toString();
System.out.println(" ACTUAL: " + xml);

InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xml));
System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
---------- END SOURCE ----------

FREQUENCY : always
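
To illustrate the difference between the two escapings described in this report, here is a minimal sketch. It is not the serializer's actual code. The class `SurrogateEscapeDemo` and its methods `escapePerChar`, `escapePerCodePoint`, and `parseTextOrNull` are hypothetical names introduced only for this example. The sketch escapes the string "\uD840\uDC0B" once per UTF-16 code unit and once per code point, then passes each result to a standard JAXP parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; none of these names come from the JDK or the report.
final class SurrogateEscapeDemo {

    // Escapes each UTF-16 code unit separately, so a surrogate pair
    // turns into two numeric character references.
    static String escapePerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append("&#").append((int) s.charAt(i)).append(';');
        }
        return sb.toString();
    }

    // Escapes each Unicode code point, so a surrogate pair
    // turns into a single numeric character reference.
    static String escapePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append("&#").append(cp).append(';'));
        return sb.toString();
    }

    // Returns the text content of the root element, or null if the
    // parser rejects the document.
    static String parseTextOrNull(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String value = "\uD840\uDC0B";
        String perChar = "<a>" + escapePerChar(value) + "</a>";
        String perCodePoint = "<a>" + escapePerCodePoint(value) + "</a>";

        System.out.println(perChar);
        System.out.println(perCodePoint);
        // The per-code-point form parses back to the original string.
        System.out.println(value.equals(parseTextOrNull(perCodePoint)));
        // The per-code-unit form is rejected by the parser.
        System.out.println(parseTextOrNull(perChar) == null);
    }
}
```

A reference to a lone surrogate is rejected because surrogate code points are not legal XML characters. Only the combined code point can be written as a character reference.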