JDK-8158619 : Very large CDATA section in XML document causes OOME
  • Type: Bug
  • Component: xml
  • Sub-Component: org.xml.sax
  • Affected Version: 8,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2016-02-04
  • Updated: 2018-08-21
  • Resolved: 2016-11-18
JDK 9
9 b147 (Fixed)
Related Reports
Relates :  
Sub Tasks
JDK-8175792 :  
Description
FULL PRODUCT VERSION :


A DESCRIPTION OF THE PROBLEM :
The XMLReader implementation loads the CDATA section completely into memory before sending it back to the application event handler. When operating on a very large CDATA section, the JVM will hit an OutOfMemoryError.

The third-party Xerces implementation reads the CDATA section and delivers it to the handler in chunks.
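
Chunked delivery only helps if the application's handler consumes each characters() call as it arrives instead of accumulating the text itself. The following is a minimal sketch of such a handler, assuming a hypothetical class name StreamingTextHandler (it is not part of the report or the JDK). It forwards every chunk to a Writer and records how many callbacks were made and how large the biggest one was. On an affected JDK the parser still buffers the whole CDATA section internally, so this handler alone does not avoid the OutOfMemoryError.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards each characters() chunk to a Writer instead of buffering it. */
public class StreamingTextHandler extends DefaultHandler {
    private final Writer sink;
    public long callbacks;    // number of characters() calls received
    public long totalChars;   // sum of all chunk lengths
    public int largestChunk;  // length of the biggest single chunk

    public StreamingTextHandler(Writer sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        callbacks++;
        totalChars += length;
        largestChunk = Math.max(largestChunk, length);
        try {
            // Only the range [start, start + length) of ch belongs to this event.
            sink.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        StreamingTextHandler handler = new StreamingTextHandler(out);
        // Tiny in-memory document; a real test would parse a large file instead.
        String xml = "<xml><![CDATA[hello]]></xml>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.printf("callbacks=%d totalChars=%d largestChunk=%d%n",
                handler.callbacks, handler.totalChars, handler.largestChunk);
    }
}
```

When run against a large document, a largestChunk value close to the size of the whole CDATA section indicates that the parser is not chunking.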

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create an XML document with a CDATA section containing, for instance, 500 MB of Base64 content. Run the included sample program with -Xmx256m as the maximum heap size. The program fails with an OutOfMemoryError because the CDATA section is loaded fully into memory (in a custom StringBuilder-like buffer).
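
The report does not include the test document itself. One possible way to produce a comparable file is sketched below; the class name GenerateTestXml, the 76-character line pattern, and the default size are assumptions for illustration, not details from the submitter. The payload uses only letters from the Base64 alphabet, so it resembles Base64 text without being a meaningful encoding.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical generator for a test.xml file with one large CDATA section. */
public class GenerateTestXml {

    /**
     * Writes an XML file whose single CDATA section holds at least
     * targetChars characters, and returns the number of characters written
     * inside the CDATA section.
     */
    public static long generate(Path out, long targetChars) throws IOException {
        // Build one 76-character line plus a newline (77 characters in total).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 76; i++) {
            sb.append((char) ('A' + i % 26));
        }
        String line = sb.append('\n').toString();

        long written = 0;
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<xml><![CDATA[");
            while (written < targetChars) {
                w.write(line);
                written += line.length();
            }
            w.write("]]></xml>\n");
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Size in megabytes; defaults to 500 as in the steps above.
        long megabytes = args.length > 0 ? Long.parseLong(args[0]) : 500;
        long chars = generate(Paths.get("test.xml"), megabytes * 1024 * 1024);
        System.out.println("CDATA characters written: " + chars);
    }
}
```

The generator streams the file to disk line by line, so it can itself run with a small heap.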

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The parser calls back with multiple chunks of the CDATA content, e.g.

characters: start=0 length=76
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
characters: start=0 length=77
...
ACTUAL -
The parser calls back with the complete CDATA content, e.g.

characters: start=0 length=24138365

ERROR MESSAGES/STACK TRACES THAT OCCUR :
OutOfMemoryError

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.io.*;

import org.xml.sax.*;
import org.xml.sax.helpers.*;

class Test {
    public static void main(String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(new ConsoleHandler());
        try (InputStream is = new FileInputStream("test.xml")) {
            reader.parse(new InputSource(is));
        }
    }

    static class ConsoleHandler extends DefaultHandler {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            System.out.printf("characters: start=%d length=%d%n", start, length);
        }
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
The only workaround is to use the third-party Xerces implementation.
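
The submitter does not say how the Xerces parser was selected. One common way, sketched below under the assumption that the Apache Xerces-J jar (xercesImpl.jar) is on the classpath, is to pass its SAX parser class name to XMLReaderFactory.createXMLReader(String). The class name XercesWorkaround and the fallback to the JDK's built-in parser are illustrative choices; an application that depends on chunked delivery may prefer to fail instead of falling back.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

/** Hypothetical helper that prefers Apache Xerces-J when it is available. */
public class XercesWorkaround {

    // SAX parser class shipped in Apache Xerces-J (assumed to be on the classpath).
    static final String XERCES_SAX_PARSER = "org.apache.xerces.parsers.SAXParser";

    public static XMLReader createReader() throws SAXException {
        try {
            // Throws SAXException if the class cannot be loaded or instantiated.
            return XMLReaderFactory.createXMLReader(XERCES_SAX_PARSER);
        } catch (SAXException e) {
            System.err.println("Xerces not available, using the JDK parser: " + e.getMessage());
            return XMLReaderFactory.createXMLReader();
        }
    }

    public static void main(String[] args) throws SAXException {
        XMLReader reader = createReader();
        System.out.println("Using parser: " + reader.getClass().getName());
    }
}
```

Running the main method prints the class that was actually chosen, which makes it easy to confirm whether the workaround is in effect.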


Comments
I tried to reproduce the issue described in the bug report locally with JDK 8u72 and JDK 8u92, but I was unable to. The output I get is shown below; it shows that the parser calls back with multiple chunks of CDATA rather than with the complete CDATA content at once. Can you please let me know whether the issue happens only with Base64-encoded data? If so, can you please share your XML file?

...
characters: start=0 length=305
characters: start=0 length=697
...
characters: start=0 length=5
characters: start=1587 length=1
03-11-2016

Response from submitter:
---------------------------------------------------------------------------
This program below reproduces the problem using JDK 8u92, i.e. the CDATA section is not read in chunks.

bash4.3$ ~/jdk1.8.0_92/bin/java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
bash4.3$
bash4.3$ cat Test.java
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Test {
    public static void main(String[] args) throws Exception {
        XMLReader r = XMLReaderFactory.createXMLReader();
        r.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                System.out.printf("characters: start=%d length=%d%n", start, length);
            }
        });
        try (InputStream in = new FileInputStream("test.xml")) {
            r.parse(new InputSource(in));
        }
    }
}

For the test data, just put a big amount of text in a CDATA section; it doesn't matter if it's Base64 encoded or not. For example:

<xml>
<![CDATA[
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
]]>
</xml>
03-11-2016