FULL PRODUCT VERSION :
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode)
ADDITIONAL OS VERSION INFORMATION :
Windows XP SR-2
A DESCRIPTION OF THE PROBLEM :
RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."
The current UTF-8 decoder implementation is not protected against the invalid sequences from "ED A0 80" to "ED BF BF". Instead of rejecting them, it creates surrogate pairs, as CESU-8 decoding does.
This may be by design, but it should at least be documented prominently, and any surrogate pairs the decoder creates should be valid.
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1.) Decode the following byte sequence with the UTF-8 decoder: "ED A0 80 ED BF BF"
2.) Decode the following byte sequence with the UTF-8 decoder: "ED BF BF ED A0 80"
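The two steps above can be sketched as a small test program. This is only an illustrative sketch: the class name Utf8SurrogateCheck and the helper decodeStrict are hypothetical names chosen here, not part of the JDK. The decoder is configured with CodingErrorAction.REPORT so that a conforming implementation signals malformed input with an exception instead of silently substituting a replacement character. On the build reported above, the program would presumably print surrogate code units rather than "rejected".

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class for reproducing the report; not part of the JDK.
class Utf8SurrogateCheck {

    // Decodes the bytes as UTF-8 and returns the resulting UTF-16 code units
    // as "U+XXXX" tokens, or "rejected" if the decoder reports an error.
    static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            CharBuffer chars = decoder.decode(ByteBuffer.wrap(bytes));
            StringBuilder sb = new StringBuilder();
            while (chars.hasRemaining()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("U+%04X", (int) chars.get()));
            }
            return sb.toString();
        } catch (CharacterCodingException e) {
            return "rejected";
        }
    }

    public static void main(String[] args) {
        // Step 1: ED A0 80 ED BF BF
        byte[] first = {
            (byte) 0xED, (byte) 0xA0, (byte) 0x80,
            (byte) 0xED, (byte) 0xBF, (byte) 0xBF
        };
        // Step 2: ED BF BF ED A0 80
        byte[] second = {
            (byte) 0xED, (byte) 0xBF, (byte) 0xBF,
            (byte) 0xED, (byte) 0xA0, (byte) 0x80
        };
        System.out.println("1.) " + decodeStrict(first));
        System.out.println("2.) " + decodeStrict(second));
    }
}
```

A decoder that follows RFC 3629 should print "rejected" for both inputs, since neither byte sequence is valid UTF-8.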
EXPECTED VERSUS ACTUAL BEHAVIOR :
1.) A valid surrogate pair is produced: U+D800 + U+DFFF
2.) An invalid surrogate pair is produced: U+DFFF + U+D800
This bug can always be reproduced.
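Why the first result counts as a valid pair and the second does not can be checked with the standard Character methods: a pair is valid only when a high surrogate (U+D800 to U+DBFF) is followed by a low surrogate (U+DC00 to U+DFFF). The sketch below is illustrative only; the class name SurrogateOrderCheck and the helper describePair are hypothetical.

```java
// Hypothetical helper class; only the Character methods are JDK API.
class SurrogateOrderCheck {

    // Reports whether the two UTF-16 code units form a valid surrogate pair
    // (high surrogate first, low surrogate second).
    static String describePair(char first, char second) {
        if (Character.isSurrogatePair(first, second)) {
            int codePoint = Character.toCodePoint(first, second);
            return String.format("valid pair, code point U+%04X", codePoint);
        }
        return "not a valid pair";
    }

    public static void main(String[] args) {
        char high = (char) 0xD800;
        char low = (char) 0xDFFF;
        System.out.println("U+D800 + U+DFFF: " + describePair(high, low));
        System.out.println("U+DFFF + U+D800: " + describePair(low, high));
    }
}
```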