Bug ID: JDK-6836089 Swing HTML parser can't properly decode codepoints outside the Unicode Plane 0 into a surrogate pair

Details
Type: Bug
Submit Date: 2009-04-30
Status: Closed
Updated Date: 2012-09-07
Project Name: JDK
Resolved Date: 2010-04-06
Component: client-libs
OS: generic
Sub-Component: javax.swing
CPU: generic
Priority: P3
Resolution: Fixed
Affected Versions: 6u13
Fixed Versions: 6u19-rev (b07)

Description
The statement

   System.out.println("\ud840\udc00".codePointAt(0));

prints

   131072

because \ud840 and \udc00 form a surrogate pair (a high surrogate followed by a low surrogate) that encodes the single code point U+20000.

However, if the same code point is written as a numeric character reference in HTML:

   JTextPane htmlPane = new JTextPane();
   htmlPane.setEditorKit(new HTMLEditorKit());

   htmlPane.setText("<html><head></head><body>&#131072;</body></html>");

the entity reference is not parsed into a surrogate pair. The call

   System.out.println(htmlPane.getText());

prints

<html>
  <head>
    
  </head>
  <body>
    &#0;
  </body>
</html>

rather than

<html>
  <head>
    
  </head>
  <body>
    &#55360;&#56320;
  </body>
</html>


or at least

<html>
  <head>
    
  </head>
  <body>
    &#131072;
  </body>
</html>
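
The values 55360 and 56320 in the expected output above are the decimal forms of the surrogate pair \ud840 \udc00. As a minimal sketch of where they come from (the class and method names below are illustrative and do not appear in the JDK sources), the standard UTF-16 arithmetic can be written as:

```java
public class SurrogateMath {
    // High (leading) surrogate of a supplementary code point (above U+FFFF).
    static int highSurrogate(int codePoint) {
        return 0xD800 + ((codePoint - 0x10000) >> 10);
    }

    // Low (trailing) surrogate of a supplementary code point (above U+FFFF).
    static int lowSurrogate(int codePoint) {
        return 0xDC00 + ((codePoint - 0x10000) & 0x3FF);
    }

    public static void main(String[] args) {
        int codePoint = 131072; // U+20000, the code point used in this report
        System.out.println(highSurrogate(codePoint)); // 55360 (0xD840)
        System.out.println(lowSurrogate(codePoint));  // 56320 (0xDC00)

        // Cross-check against the JDK's own conversion.
        char[] pair = Character.toChars(codePoint);
        System.out.println(pair[0] == highSurrogate(codePoint)
                && pair[1] == lowSurrogate(codePoint)); // true
    }
}
```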

                                    

Comments
SUGGESTED FIX

The Parser.parseEntityReference() method, which is part of HTML parsing, does not check whether the code point lies within the BMP.

The suggested fix is to add that check and, when the code point lies outside the BMP, convert it into a surrogate pair. The pseudocode looks as follows:

if (codepoint <= BMP_HIGHER_LIMIT) {
    // default behaviour
} else {
    // convert into surrogate pair
    // form string "&#HIGH_SURROGATE;&#LOW_SURROGATE;"
    // return toCharArray()
}
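
One possible shape of that check in Java is sketched below. This is not the actual Parser.parseEntityReference() code or the committed fix: the class and method names are hypothetical, and the sketch returns the surrogate chars directly through Character.toChars() rather than forming the "&#HIGH_SURROGATE;&#LOW_SURROGATE;" string mentioned in the pseudocode.

```java
public class EntityDecodeSketch {
    // Hypothetical helper: maps the numeric value of a character reference
    // such as &#131072; to the chars that represent it in a Java String.
    static char[] decodeNumericEntity(int codePoint) {
        if (codePoint < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
            // Within the BMP: one char is enough (the default behaviour).
            return new char[] { (char) codePoint };
        }
        // Outside the BMP: toChars() returns the high and low surrogates.
        // It throws IllegalArgumentException for values above U+10FFFF.
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] chars = decodeNumericEntity(131072);
        System.out.println(chars.length);                       // 2
        System.out.println(new String(chars).codePointAt(0));   // 131072
    }
}
```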
                                     
2009-06-16
EVALUATION

The Parser.parseEntityReference() method, which is part of HTML parsing, does not check whether the code point lies within the BMP (Basic Multilingual Plane). As a result, the parser cannot convert a code point outside the BMP into the corresponding surrogate pair.
                                     
2009-06-20


