JDK-6798514 : Charset UTF-8 accepts CESU-8 codings
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2009-01-28
  • Updated: 2011-02-16
  • Resolved: 2009-03-04
Description
FULL PRODUCT VERSION :
C:\Programme\Java\jdk1.6.0_03\bin>java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode)


ADDITIONAL OS VERSION INFORMATION :
Windows XP SR-2

A DESCRIPTION OF THE PROBLEM :
RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."

The current UTF-8 implementation does not reject the invalid sequences from "ED A0 80" to "ED BF BF". It decodes them to surrogate code units instead, as CESU-8 does.
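
For illustration, the byte range in question can be detected before decoding. This is a minimal sketch, not part of the JDK: the class name SurrogateByteCheck and the method indexOfEncodedSurrogate are hypothetical. It assumes the input is otherwise meant to be UTF-8, where 0xED can only appear as a lead byte.

```java
class SurrogateByteCheck {
    // Hypothetical helper: returns the index of the first three-byte
    // sequence in the range ED A0 80 .. ED BF BF (U+D800..U+DFFF encoded
    // as if it were a UTF-8 scalar value), or -1 if there is none.
    public static int indexOfEncodedSurrogate(byte[] bytes) {
        for (int i = 0; i + 2 < bytes.length; i++) {
            int b0 = bytes[i] & 0xFF;
            int b1 = bytes[i + 1] & 0xFF;
            int b2 = bytes[i + 2] & 0xFF;
            // ED 80..9F xx is valid (U+D000..U+D7FF); only a second byte
            // of A0..BF lands in the surrogate range.
            if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF && b2 >= 0x80 && b2 <= 0xBF) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                          (byte) 0xED, (byte) 0xBF, (byte) 0xBF };
        System.out.println(indexOfEncodedSurrogate(sample));
    }
}
```

A caller that needs strict RFC 3629 behaviour could run such a check and treat any non-negative result as malformed input.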

This may be by design, but it should at least be documented prominently, and any surrogate pairs created should be valid.


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1.) Decode the following byte sequence with the UTF-8 decoder: "ED, A0, 80, ED, BF, BF"
2.) Decode the following byte sequence with the UTF-8 decoder: "ED, BF, BF, ED, A0, 80"
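
The two steps above can be sketched with a CharsetDecoder configured to report malformed input. This is an illustrative reproduction, not the submitter's original test: the class name Utf8SurrogateRepro and the method describe are hypothetical. The outcome depends on the JDK in use. The report describes 1.6.0_03 as producing surrogate code units, and other releases may report malformed input instead, so no fixed output is claimed here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

class Utf8SurrogateRepro {
    // Hypothetical helper: decodes the bytes as UTF-8 and returns either
    // "malformed" or the decoded chars in U+XXXX notation.
    public static String describe(byte[] input) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(input);
        // UTF-8 never yields more chars than input bytes.
        CharBuffer out = CharBuffer.allocate(input.length);
        CoderResult result = decoder.decode(in, out, true);
        if (result.isMalformed()) {
            return "malformed";
        }
        decoder.flush(out);
        out.flip();
        StringBuilder sb = new StringBuilder();
        while (out.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) out.get()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] step1 = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                         (byte) 0xED, (byte) 0xBF, (byte) 0xBF };
        byte[] step2 = { (byte) 0xED, (byte) 0xBF, (byte) 0xBF,
                         (byte) 0xED, (byte) 0xA0, (byte) 0x80 };
        System.out.println("1.) " + describe(step1));
        System.out.println("2.) " + describe(step2));
    }
}
```

Charset.forName("UTF-8") is used rather than StandardCharsets so that the sketch also compiles on Java 6, the affected version.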


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
1.) CoderResult.isMalformed()
2.) CoderResult.isMalformed()

ACTUAL -
1.) valid surrogate pair: U+D800 + U+DFFF
2.) invalid surrogate pair: U+DFFF + U+D800


REPRODUCIBILITY :
This bug can be reproduced always.

Comments
EVALUATION The latest Unicode recommendation regarding this issue is at http://www.unicode.org/versions/corrigendum1.html, which states: "To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses." The "non-shortest forms" of supplementary characters are still "allowed" to be decoded (though they are not generated). The UTF-8 charset implementation was updated recently (#4486841) to follow the recommendation. The decision for now is that we are not going to update the implementation to prohibit the non-shortest forms for supplementary characters. We will reconsider this position should the Standard change or new security concerns arise in the future.
04-03-2009