JDK-6341770 : Xerces cannot handle relative entity includes with non-ASCII base URL
  • Type: Bug
  • Component: xml
  • Sub-Component: org.xml.sax
  • Affected Version: 5.0
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux,windows_xp
  • CPU: x86
  • Submitted: 2005-10-25
  • Updated: 2012-04-25
  • Resolved: 2005-12-09
  • Fixed in: JDK 6, build b62
Description
I have a Fedora Core 4 Linux system which uses UTF-8 as the system locale. Consequently Java normally has no problems using non-ASCII characters in filenames (and neither does any other major software).

However, run this test case:

---%<---
import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class Test {
    public static void main(String[] args) throws Exception {
        File dir = File.createTempFile("sko\u0159ice", null);
        dir.delete();
        dir.mkdir();
        File main = new File(dir, "main.xml");
        PrintWriter w = new PrintWriter(new FileWriter(main));
        w.println("<!DOCTYPE r [<!ENTITY aux SYSTEM \"aux.xml\">]>");
        w.println("<r>&aux;</r>");
        w.flush();
        w.close();
        File aux = new File(dir, "aux.xml");
        w = new PrintWriter(new FileWriter(aux));
        w.println("<x/>");
        w.flush();
        w.close();
        System.out.println("Parsing: " + main);
        SAXParserFactory.newInstance().newSAXParser().parse(main, new DefaultHandler() {
            public void startElement(String uri, String localname, String qname, Attributes attr) throws SAXException {
                System.out.println("encountered <" + qname + ">");
            }
        });
        System.out.println("OK.");
    }
}
---%<---

On JDK 1.4.2 it works; on JDK 5.0 and later it does not:

---%<---
java version "1.4.2_09"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05)
Java HotSpot(TM) Client VM (build 1.4.2_09-b05, mixed mode)

Parsing: /tmp/skořice17343.tmp/main.xml
encountered <r>
encountered <x>
OK.
---%<---
java version "1.5.0_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_05-b05)
Java HotSpot(TM) Client VM (build 1.5.0_05-b05, mixed mode, sharing)

Parsing: /tmp/skořice42181.tmp/main.xml
encountered <r>
Exception in thread "main" java.net.MalformedURLException: no protocol: aux.xml
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:968)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:905)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:843)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1334)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1756)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:311)
	at Test.main(Test.java:25)
---%<---
java version "1.6.0-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-ea-b57)
Java HotSpot(TM) Client VM (build 1.6.0-ea-b57, mixed mode, sharing)

Parsing: /tmp/skořice26384.tmp/main.xml
encountered <r>
Exception in thread "main" java.net.MalformedURLException: no protocol: aux.xml
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:657)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1319)
	at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1256)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1896)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3019)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:664)
	at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:524)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807)
	at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
	at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
	at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
	at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:376)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:312)
	at Test.main(Test.java:25)
---%<---

There are three possibilities: (1) SAXParser.parse(File,...) fails to encode non-ASCII characters in the filename as UTF-8 octets using %xx syntax; (2) it calls File.toURI, which is supposed to do that but does not, and Crimson simply never checked for this condition; or (3) the non-ASCII character in the URI is acceptable and Xerces is incorrectly rejecting it. I suspect a combination of #1 and #2. There is another bug filed somewhere noting that JAXP does not call File.toURI, but even if it did, the result apparently does not escape non-ASCII characters, which it should if I read the RFC correctly.
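The difference between possibilities (2) and (3) can be seen without running a parser at all. The following is a minimal sketch, not part of the original test case: the class name SystemIdDemo and its two helper methods are hypothetical, and the path is only an example that need not exist. It prints the system identifier a file would get from File.toURI() as-is, and the same URI after URI.toASCIIString() has escaped the non-ASCII characters.

```java
import java.io.File;

// Hypothetical demo class; only File.toURI() and URI.toASCIIString() are real JDK APIs.
class SystemIdDemo {

    /** URI text exactly as File.toURI() renders it; non-ASCII characters are left unescaped. */
    static String rawSystemId(File f) {
        return f.toURI().toString();
    }

    /** The same URI with non-ASCII characters written as %xx-escaped UTF-8 octets. */
    static String asciiSystemId(File f) {
        return f.toURI().toASCIIString();
    }

    public static void main(String[] args) {
        // Example path only; nothing is read from or written to disk.
        File main = new File("/tmp/sko\u0159ice/main.xml");
        System.out.println("raw:   " + rawSystemId(main));
        System.out.println("ascii: " + asciiSystemId(main));
    }
}
```

On a Unix-like system the second line should end in sko%C5%99ice/main.xml, since U+0159 is the two octets C5 99 in UTF-8. Whether the parser would accept either form on an affected JDK is not something this sketch demonstrates.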

Comments
EVALUATION Fix available in Mustang b62.
09-12-2005

EVALUATION Reopening. Evaluator writes "test case contains non-ascii character which is not a valid URI" but this is misleading. The test case does *not* create a URI. It creates a java.io.File whose path contains a non-ASCII character, which as far as I know is perfectly acceptable (if the OS supports such characters, which at least Linux does). The test then calls parse(File,...). As I already wrote in the description, if the non-ASCII character is forbidden in the URI, then there is a bug in the parse(File,...) method: it needs to create a valid URI for this file, perhaps by escaping octets in the normal way. The same goes for the InputSource(File) constructor, and for File.toURI() and new File(URI). These methods bear responsibility for handling any Unicode restrictions in the URI specification (as do the java.net.URI constructors, probably). Furthermore, even if an invalid URI were passed to the parser directly (as a java.lang.String, since these APIs predate java.net.URI), the parser ought to reject the malformed URI immediately, not work *most of the time* and then fail with an apparently meaningless error when the source document happens to use a relative entity include.
15-11-2005

EVALUATION The given test case contains a non-ASCII character, which is not valid in a URI (RFC 2396). Hence parse() fails. This is expected behaviour.
15-11-2005

SUGGESTED FIX The given test case contains a non-ASCII character, which is not valid in a URI. Hence parse() fails. This is expected behaviour.
15-11-2005

WORK AROUND
1. Do not try to parse XML files in non-ASCII locations.
2. If you do, do not use relative entity includes.
3. If you must, use an entity resolver which produces an absolute URI, e.g.:

    final File aux = new File(dir, "aux.xml");
    // ...
    SAXParserFactory.newInstance().newSAXParser().parse(main, new DefaultHandler() {
        public void startElement(String uri, String localname, String qname, Attributes attr) throws SAXException {
            System.out.println("encountered <" + qname + ">");
        }
        public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
            if (systemId.equals("aux.xml")) {
                return new InputSource(aux.toURI().toString());
            } else {
                return null;
            }
        }
    });
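Workaround 3 hard-codes a single entity name. A more general variant might resolve every system identifier against the URI of the main document. The sketch below assumes, as the workaround above does, that the parser hands the resolver the unresolved identifier (such as "aux.xml"). The class name RelativeEntityHandler and the helper resolveAgainst are hypothetical, and escaping the result with URI.toASCIIString() is a choice made here, not something the report prescribes.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: resolves entity system IDs relative to the main document.
class RelativeEntityHandler extends DefaultHandler {

    private final URI base;

    RelativeEntityHandler(File document) {
        // The document's own URI serves as the base for relative references.
        this.base = document.toURI();
    }

    /** Resolves systemId against base and escapes non-ASCII characters as %xx octets. */
    static String resolveAgainst(URI base, String systemId) {
        return base.resolve(systemId).toASCIIString();
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException, SAXException {
        if (systemId == null) {
            return null; // nothing to resolve; let the parser use its default behaviour
        }
        try {
            return new InputSource(resolveAgainst(base, systemId));
        } catch (IllegalArgumentException e) {
            // systemId is not parseable as a URI; fall back to the parser's default handling
            return null;
        }
    }
}
```

It would be used in place of the anonymous DefaultHandler, for example parse(main, new RelativeEntityHandler(main)). An identifier that is already absolute passes through URI.resolve unchanged apart from the escaping.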
25-10-2005

SUGGESTED FIX Use File.toURI and fix that method to escape non-ASCII characters in filenames.
25-10-2005