JDK-4233012 : HTMLEditorKit can't cope with HTTP-EQUIV setting charset

Details
Type: Bug
Submit Date: 1999-04-26
Status: Closed
Updated Date: 1999-05-14
Project Name: JDK
Resolved Date: 1999-05-14
Component: client-libs
OS: generic
Sub-Component: javax.swing
CPU: generic
Priority: P4
Resolution: Not an Issue
Affected Versions: 1.2.0
Fixed Versions:

Related Reports

Sub Tasks

Description

Name: vi73552			Date: 04/26/99


I have an InputStream containing an HTML file with the header line
   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

I create an InputStreamReader from it and invoke HTMLEditorKit.read().

I get a ChangedCharSetException, with getCharSetSpec() returning "text/html; charset=iso-8859-1".

Attempting to create an InputStreamReader with this string as the encoding throws UnsupportedEncodingException.

Using just the last part (iso-8859-1), I still get the ChangedCharSetException.

Is this a bug in HTMLEditorKit?
----------------------------------
Oops! Sorry about the brief report. I was in a hurry to knock off last night.
I have now isolated the problem into a single piece of code for demonstration
purposes. This is cut down from a larger piece of software which uses the Java
Activation Framework (hence the use of InputStreams).


I have attached the code, and two trivial HTML documents. Untitled.html was
created with Netscape Communicator 4.5. Untitled2.html is the same file with
one head element removed (using emacs).

% javac tests/HTMLEditorKitBug.java
% java tests/HTMLEditorKitBug Untitled.html
Encoding: ISO8859_1
Char set changed to text/html; charset=iso-8859-1
Retrying with encoding iso-8859-1
Char set still changed to text/html; charset=iso-8859-1
Failed!
% java tests/HTMLEditorKitBug Untitled2.html
Encoding: ISO8859_1
Document read successfully

I have tested this with four JVMs on Solaris:
Solaris VM (build Solaris_JDK_1.2_01, native threads, sunwjit)
Classic VM (build JDK-1.2-V, green threads, sunwjit)
Classic VM (build JDK-1.2.1-K, green threads, sunwjit)
java full version "JDK1.1.7M" (using swing 1.1)

All JVMs tested produce identical results.

I would like to remove the two lines 
      if (encoding.startsWith("text/html; charset="))
	encoding = encoding.substring(19);
and use the result of getCharSetSpec() directly in the Reader constructor, but
at the moment, that doesn't seem to work.

I hope I've given enough info this time. Please contact me if you have further
questions or suggestions.

Cheers,
  Scott Davis
---------------------------------------------------
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.BadLocationException;

/** This is a short program to demonstrate the problem reported in Bug
 * report Review ID: 56831 HTMLEditorKit can't cope with HTTP-EQUIV
 * setting charset. */
class HTMLEditorKitBug
{
  /** Main program - argument should be an HTML file */
  public static void main(String args[])
    throws FileNotFoundException {
    InputStream is = new FileInputStream(args[0]);
    EditorKit ek = new HTMLEditorKit();
    Document doc = ek.createDefaultDocument();
    try {
      InputStreamReader isr = new InputStreamReader(is);
      System.out.println("Encoding: " + isr.getEncoding());
      ek.read(new BufferedReader(isr), doc, 0);
      System.out.println("Document read successfully");
    } catch (ChangedCharSetException ccs) {
      String encoding = ccs.getCharSetSpec();
      System.err.println("Char set changed to " + encoding);
      if (encoding.startsWith("text/html; charset="))
        encoding = encoding.substring(19);
      try {
        System.out.println("Retrying with encoding " + encoding);
        is = new FileInputStream(args[0]);
        Reader isr = new InputStreamReader(is, encoding);
        ek.read(new BufferedReader(isr), doc, 0);
      } catch (ChangedCharSetException ccse) {
        System.err.println("Char set still changed to " + ccse.getCharSetSpec());
        System.err.println("Failed!");
      } catch (IOException ie) {
        System.err.println("IOException with different charset:\n\t" + ie);
      } catch (BadLocationException bl) {
        System.err.println("Bad Location with different charset:\n\t" + bl);
      }
    } catch (IOException ie) {
      System.err.println("IOException:" + ie);
    } catch (BadLocationException bl) {
      System.err.println("Bad Location:" + bl);
    }
    System.exit(0);
  }
  
}
-----------------------------------------------
Untitled.html
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="Scott Davis">
   <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.6 sun4m) [Netscape]">
</head>
<body>
Trial file
<p>This is a simple HTML file to test a bug in Java reding files which
contain
<br>&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<p>Scott
<br>&nbsp;
</body>
</html>
-----------------------------------------------------
Untitled2.html
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta name="Author" content="Scott Davis">
   <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.6 sun4m) [Netscape]">
</head>
<body>
Trial file
<p>This is a simple HTML file to test a bug in Java reding files which
contain
<br>&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<p>Scott
<br>&nbsp;
</body>
</html>

(Review ID: 56831) 
======================================================================

                                    

Comments
PUBLIC COMMENTS

The parser will throw a ChangedCharSetException every time a char set tag is encountered, unless the IgnoreCharsetDirective property is set on the Document. To avoid this you need to put the property IgnoreCharsetDirective with a value of Boolean.TRUE on the Document, e.g.:

	    doc.putProperty("IgnoreCharsetDirective", new Boolean(true));

This is what JEditorPane.read does.
scott.violet@eng 1999-05-14
                                     
1999-05-14
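
For reference, a minimal sketch of how the advice above could be combined with the retry logic from the submitter's test program. It is an illustration under stated assumptions, not the code used by JEditorPane: the class name CharsetRetryExample and the helpers charsetFrom and read are hypothetical, and the first pass assumes ISO-8859-1 is good enough to reach the META tag. The first read is allowed to throw ChangedCharSetException; the charset name is then extracted from getCharSetSpec(), and the second read uses a fresh Document with IgnoreCharsetDirective set so the same META tag does not throw again.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

class CharsetRetryExample {

  /**
   * Hypothetical helper: extracts "iso-8859-1" from a spec such as
   * "text/html; charset=iso-8859-1". A spec without "charset=" is
   * assumed to already be a bare charset name.
   */
  static String charsetFrom(String spec) {
    int i = spec.toLowerCase(Locale.ROOT).indexOf("charset=");
    String value = (i < 0) ? spec : spec.substring(i + "charset=".length());
    int semi = value.indexOf(';');
    if (semi >= 0) value = value.substring(0, semi);
    return value.replace("\"", "").trim();
  }

  /** Hypothetical helper: reads HTML bytes, retrying once if the charset changes. */
  static Document read(byte[] html) throws IOException, BadLocationException {
    HTMLEditorKit kit = new HTMLEditorKit();
    Document doc = kit.createDefaultDocument();
    // First pass: assumed default encoding, charset directive NOT ignored.
    try (Reader r = new InputStreamReader(
        new ByteArrayInputStream(html), StandardCharsets.ISO_8859_1)) {
      kit.read(r, doc, 0);
      return doc;
    } catch (ChangedCharSetException e) {
      String charset = charsetFrom(e.getCharSetSpec());
      // Second pass: fresh document, directive ignored so the META tag
      // does not trigger another ChangedCharSetException.
      Document retry = kit.createDefaultDocument();
      retry.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
      try (Reader r = new InputStreamReader(
          new ByteArrayInputStream(html), charset)) {
        kit.read(r, retry, 0);
      }
      return retry;
    }
  }

  public static void main(String[] args) throws Exception {
    String html = "<html><head>"
        + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
        + "</head><body><p>Trial file</p></body></html>";
    Document doc = read(html.getBytes(StandardCharsets.ISO_8859_1));
    System.out.println(doc.getText(0, doc.getLength()).trim());
  }
}
```

The sketch re-reads from a byte array because the stream has to be reopened for the second pass. In the submitter's program the equivalent step is creating a new FileInputStream, with the difference that the property is put on the Document before the second ek.read() call.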


