JDK-8145974 : XMLStreamWriter produces invalid XML for surrogate pairs on OutputStreamWriter
  • Type: Bug
  • Component: xml
  • Sub-Component: jaxp
  • Affected Version: 7u60,8u72
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: windows_7
  • CPU: x86_64
  • Submitted: 2015-10-01
  • Updated: 2021-10-31
  • Resolved: 2016-05-12
JDK 7         JDK 8         JDK 9
7u111 Fixed   8u102 Fixed   9 b119 Fixed
Related Reports
Duplicate :  
Relates :  
Description
FULL PRODUCT VERSION :
1.7.0_65

ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows [Version 6.1.7601]

A DESCRIPTION OF THE PROBLEM :
I have narrowed down a problem where our application produced XML which it could not parse back. The XML contained "character references", but the references had invalid values (XML permits only certain ranges for them). It turned out that these character references are generated specifically for characters outside the BMP, i.e. characters that are encoded as a surrogate pair in a Java string. Further investigation revealed that this happens only when constructing the XMLStreamWriter with an OutputStreamWriter. The surrogates are encoded as valid UTF-8 multibyte sequences when using a plain OutputStream. The error cannot, however, be in the OutputStreamWriter, since character references are specific to XML, of which the OutputStreamWriter knows nothing.
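
To illustrate the "valid ranges" mentioned above, here is a minimal sketch that checks a code point against the Char production of XML 1.0 and formats character references for the smiley U+1F60A. The class and method names (CharRefDemo, isXmlChar, charRef) are hypothetical and are not part of the JDK or of the fix; the sketch only shows why one reference per surrogate is rejected by a parser while one reference for the combined code point is accepted.

```java
public class CharRefDemo {

    /** Returns true if cp is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Formats a single value as a hexadecimal character reference. */
    static String charRef(int cp) {
        return "&#x" + Integer.toHexString(cp) + ";";
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE0A"; // U+1F60A stored as a surrogate pair
        char high = smiley.charAt(0);
        char low = smiley.charAt(1);
        int codePoint = smiley.codePointAt(0);

        // One reference per surrogate: neither value is a legal XML character.
        System.out.println(charRef(high) + charRef(low) + " valid: "
                + (isXmlChar(high) && isXmlChar(low)));

        // One reference for the combined code point: legal.
        System.out.println(charRef(codePoint) + " valid: " + isXmlChar(codePoint));
    }
}
```

The first line printed contains the same two references that appear in the broken output reported for this issue.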

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
I am attaching a test program which clearly demonstrates the problem.


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
package com.dramaqueen.exporters;

import static org.junit.Assert.*;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

import org.junit.Test;

@SuppressWarnings("nls")
public class StreamVersusWriterTest {

	@Test
	public void streamVersusWriter() {
		String charset = "UTF-8";

		ByteArrayOutputStream streamA = new ByteArrayOutputStream();
		ByteArrayOutputStream streamB = new ByteArrayOutputStream();

		XMLOutputFactory factory = XMLOutputFactory.newInstance();
		try {
			XMLStreamWriter writerA = factory.createXMLStreamWriter(streamA,
				charset);
			generateXML(writerA, charset);

			OutputStreamWriter streamWriter = new OutputStreamWriter(streamB,
				charset);
			XMLStreamWriter writerB = factory.createXMLStreamWriter(
				streamWriter);
			generateXML(writerB, charset);
			
			// Decode with the charset the writers used, not the platform default
			String outputA = streamA.toString(charset);
			String outputB = streamB.toString(charset);

			System.out.println("output using OutputStream      : " + outputA);
			System.out.println("output using OutputStreamWriter: " + outputB);
			
//			assertEquals(outputA, outputB);

			readXML(outputA.getBytes(charset), charset);
			readXML(outputB.getBytes(charset), charset);
		
		} catch (XMLStreamException e) {
			e.printStackTrace();
//			assertTrue(false);
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
//			assertTrue(false);
		}
	}

	private void generateXML(XMLStreamWriter writer, String charset)
			throws XMLStreamException {
		// Char sequence containing a smiley which is encoded as a surrogate
		// pair in the Java string
		String sequence = "A😊�Bß";
		writer.writeStartDocument(charset, "1.0");
		writer.writeStartElement("a");
		writer.writeCharacters(sequence);
		writer.writeEndElement();
		writer.writeEndDocument();
		writer.flush();
	}

	private void readXML(byte[] xmlData, String charset)
			throws XMLStreamException {
		InputStream stream = new ByteArrayInputStream(xmlData);
		XMLInputFactory factory = XMLInputFactory.newInstance();
		XMLStreamReader xmlReader
			= factory.createXMLStreamReader(stream, charset);
		while (xmlReader.hasNext())
			xmlReader.next();
	}
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use OutputStream, not OutputStreamWriter
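
The workaround can be sketched as follows: the XMLStreamWriter is created directly from an OutputStream together with an encoding name, so no OutputStreamWriter is involved. The class name OutputStreamWorkaround and its helper methods write and readBack are hypothetical and chosen only for this illustration; the factory and writer calls are the standard javax.xml.stream API already used in the test case above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class OutputStreamWorkaround {

    /** Writes <a>text</a> through a writer created from an OutputStream. */
    static byte[] write(String text) throws XMLStreamException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Pass the stream and the encoding; do not wrap it in an OutputStreamWriter.
        XMLStreamWriter writer = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(out, "UTF-8");
        writer.writeStartDocument("UTF-8", "1.0");
        writer.writeStartElement("a");
        writer.writeCharacters(text);
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.flush();
        writer.close();
        return out.toByteArray();
    }

    /** Parses the document back and returns the character data it contains. */
    static String readBack(byte[] xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml), "UTF-8");
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamReader.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String input = "A\uD83D\uDE0AB\u00DF"; // contains U+1F60A as a surrogate pair
        byte[] xml = write(input);
        System.out.println(new String(xml, StandardCharsets.UTF_8));
        System.out.println("round trip ok: " + input.equals(readBack(xml)));
    }
}
```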


Comments
URL: http://hg.openjdk.java.net/jdk9/jdk9/jaxp/rev/f92e8518bb34 User: lana Date: 2016-05-18 20:42:16 +0000
18-05-2016

No issues related to the fix in the recent core-libs nightly. SQE OK to take it in PSU16_03.
16-05-2016

URL: http://hg.openjdk.java.net/jdk9/dev/jaxp/rev/f92e8518bb34 User: aefimov Date: 2016-05-12 22:22:15 +0000
12-05-2016

A patch with a possible solution is attached.
06-05-2016

The bug is reproducible on the latest JDK 9 builds.
03-02-2016

Attached Test case executed on following versions:
  • JDK 7u60 - Fail
  • JDK 8u66 - Fail
  • JDK 8u72 - Fail
  • JDK 9ea b93 - Pass

Here is the output on failed versions:
output using OutputStream : <?xml version="1.0" encoding="UTF-8"?><a>A😊�Bß</a>
output using OutputStreamWriter: <?xml version="1.0" encoding="UTF-8"?><a>A&#xd83d;&#xde0a;�Bß</a>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,51]
Message: Character reference "&#
	at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
	at StreamVersusWriterTest.readXML(StreamVersusWriterTest.java:37)
	at StreamVersusWriterTest.main(StreamVersusWriterTest.java:68)

Here is the output on JDK 9:
output using OutputStream : <?xml version="1.0" encoding="UTF-8"?><a>AðŸË?Šï ¿½Bß</a>
output using OutputStreamWriter: <?xml version="1.0" encoding="UTF-8"?><a>AÃ°Å¸Ë ?Šï¿½Bß</a>
22-12-2015
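
The garbled characters in the JDK 9 output above are consistent with UTF-8 bytes having been decoded with a single-byte charset before printing. That is an assumption: the charset actually in effect on the test machine is not recorded in this report. The sketch below uses ISO-8859-1 purely as a stand-in for such a charset, and the class name DecodeDemo is hypothetical; it shows that the same bytes read correctly once they are decoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    /** Decodes the given bytes with an explicitly named charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "A\uD83D\uDE0AB\u00DF"; // A, U+1F60A, B, sharp s
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Wrong charset (stand-in for a single-byte platform default):
        // every UTF-8 byte becomes a separate character.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);
        // Matching charset: the original text is recovered.
        String right = decode(utf8, StandardCharsets.UTF_8);

        System.out.println("bytes: " + utf8.length
                + ", chars when mis-decoded: " + wrong.length()
                + ", chars when decoded as UTF-8: " + right.length());
        System.out.println("recovered original: " + text.equals(right));
    }
}
```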