JDK-6836089 : Swing HTML parser can't properly decode codepoints outside the Unicode Plane 0 into a surrogate pair
  • Type: Bug
  • Component: client-libs
  • Sub-Component: javax.swing
  • Affected Version: 6u13
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2009-04-30
  • Updated: 2012-09-07
  • Resolved: 2010-04-06
The Version table provides details about the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 6: 6u19-rev b07, Fixed
JDK 7: 7-pool, Fixed
JDK 8: 8-pool, Fixed
Description
The statement 

   System.out.println("\ud840\udc00".codePointAt(0));

prints

   131072

because \ud840 and \udc00 are the high and low surrogates that together encode the single code point U+20000.
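
The relationship between the code point and its two surrogates can be checked by hand. The following is a minimal sketch of the standard UTF-16 arithmetic, compared against Character.toChars(); the class SurrogateMath and its methods high() and low() are hypothetical names introduced only for illustration.

```java
class SurrogateMath {
    // High (leading) surrogate of a supplementary code point:
    // the top 10 bits of (cp - 0x10000), offset by 0xD800.
    static char high(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: the bottom 10 bits, offset by 0xDC00.
    static char low(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = "\ud840\udc00".codePointAt(0);
        char[] pair = Character.toChars(cp);

        System.out.println(cp);
        System.out.println((int) high(cp) + " " + (int) low(cp));
        // The manual arithmetic agrees with the library conversion.
        System.out.println(pair[0] == high(cp) && pair[1] == low(cp));
    }
}
```

Running main prints 131072, then 55360 56320, then true. The two decimal values are the ones that appear in the expected output further down.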

If the same code point is passed to an HTML text pane as a numeric character reference:
 
   JTextPane htmlPane = new JTextPane();
   htmlPane.setEditorKit(new HTMLEditorKit());

   htmlPane.setText("<html><head></head><body>&#131072;</body></html>");

the numeric character reference is not decoded into a surrogate pair.

   System.out.println(htmlPane.getText());

prints

<html>
  <head>
    
  </head>
  <body>
    &#0;
  </body>
</html>

rather than

<html>
  <head>
    
  </head>
  <body>
    &#55360;&#56320;
  </body>
</html>


or at least

<html>
  <head>
    
  </head>
  <body>
    &#131072;
  </body>
</html>

Comments
EVALUATION The Parser.parseEntityReference() method, which is part of HTML parsing, does not check whether a code point lies within the BMP (Basic Multilingual Plane). As a result, the parser cannot convert a supplementary code point into the corresponding surrogate pair.
20-06-2009
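
The &#0; in the reported output is consistent with the code point being narrowed to a single 16-bit char without a range check. That is an assumption about the parser internals, not something the evaluation states. A minimal sketch of the effect, using a hypothetical helper narrow():

```java
class NarrowingDemo {
    // Narrow an int code point to one char, as an unchecked cast would.
    // Hypothetical helper; it only illustrates the loss of the upper bits.
    static char narrow(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int cp = 131072; // 0x20000, first code point of Plane 2

        // The low 16 bits of 0x20000 are all zero.
        System.out.println((int) narrow(cp));
        // The code point is supplementary and needs two chars in UTF-16.
        System.out.println(Character.isSupplementaryCodePoint(cp));
        System.out.println(Character.charCount(cp));
    }
}
```

Running main prints 0, then true, then 2.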

SUGGESTED FIX Parser.parseEntityReference(), which is part of HTML parsing, does not check whether a code point lies within the BMP. The suggested fix is to add that check and, when the code point lies outside the BMP, convert it into a surrogate pair. The pseudocode looks as follows:

   if (codepoint <= BMP_HIGHER_LIMIT) {
       // default behaviour
   } else {
       // convert into surrogate pair
       // form string "&#HIGH_SURROGATE;&#LOW_SURROGATE;"
       // return toCharArray()
   }
16-06-2009
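
The pseudocode in the suggested fix could be sketched in Java roughly as follows. This is not the change that was actually committed to the JDK. The class EntityDecoder and its methods decode() and toReferences() are hypothetical, and rejecting invalid code points with an exception is an assumption made for the example.

```java
class EntityDecoder {
    // Decode a numeric character reference value into UTF-16 chars.
    // Hypothetical helper, not the JDK's Parser.parseEntityReference().
    static char[] decode(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            // Assumption: reject values outside 0..0x10FFFF.
            throw new IllegalArgumentException("Invalid code point: " + codePoint);
        }
        if (codePoint < Character.MIN_SUPPLEMENTARY_CODE_POINT) {
            // Default behaviour: a BMP code point fits in one char.
            return new char[] { (char) codePoint };
        }
        // Outside the BMP: convert into a high/low surrogate pair.
        return Character.toChars(codePoint);
    }

    // Form the "&#HIGH_SURROGATE;&#LOW_SURROGATE;" string from the pseudocode.
    static String toReferences(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : decode(codePoint)) {
            sb.append("&#").append((int) c).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toReferences(131072));
        System.out.println(toReferences(65));
    }
}
```

Running main prints &#55360;&#56320; and then &#65;. Character.toChars() alone already handles both branches; the explicit check is kept only to mirror the structure of the pseudocode.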