JDK-6392804 : Inappropriate output of \ufffd in various decoders
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 6
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2006-03-02
  • Updated: 2011-03-08
  • Resolved: 2011-03-08

  • Fixed in JDK 6: 6u21-rev
  • Fixed in JDK 7: 7 b06
Description
\ufffd, a.k.a. REPLACE_CHAR, should be output by a decoder's
decodeXXXLoop only if the input bytes have the semantics of REPLACE_CHAR,
not simply because the input is malformed or unmappable.  Those cases
already have their own results, MALFORMED[n] and UNMAPPABLE[n].

--------------------------------------------------------
public class Decode {
    private static boolean isAscii(char c) {
	return c < '\u0080';
    }

    private static boolean isPrintable(char c) {
	return ('\u0020' < c) && (c < '\u007f');
    }

    public static void main(String[] args) throws Throwable {
	if (args.length < 2)
	    throw new Exception("Usage: java Decode CHARSET BYTE [BYTE ...]");
	String cs = args[0];
	byte[] bytes = new byte[args.length-1];
	for (int i = 1; i < args.length; i++) {
	    String arg = args[i];
	    bytes[i-1] =
		(arg.length() == 1 && isAscii(arg.charAt(0))) ?
		(byte) arg.charAt(0) :
		arg.equals("ESC") ? 0x1b :
		arg.equals("SO")  ? 0x0e :
		arg.equals("SI")  ? 0x0f :
		arg.equals("SS2") ? (byte) 0x8e :
		arg.equals("SS3") ? (byte) 0x8f :
		arg.matches("0x.*") ? Integer.decode(arg).byteValue() :
		Integer.decode("0x"+arg).byteValue();
	}
	String s = new String(bytes, cs);

	for (int j = 0; j < s.length(); j++) {
	    if (j > 0)
		System.out.print(' ');
	    char c = s.charAt(j);
	    if (isPrintable(c))
		System.out.print(c);
	    else if (c == '\u001b') System.out.print("ESC");
	    else
		System.out.printf("\\u%04x", (int) c);
	}
	System.out.print("\n");
    }
}
--------------------------------------------------------
 $ jver 6 javac Decode.java && for cs in ISO-2022-JP ISO-2022-JP-2 x-windows-50220 x-windows-50221 x-windows-iso2022jp ; do echo $cs;  jver 6 java Decode $cs ESC 24 40 00 00; done; echo EUC-TW ; jver 6 java Decode EUC-TW 8e 98 ad e5
ISO-2022-JP
\ufffd
ISO-2022-JP-2
\ufffd
x-windows-50220
\ufffd
x-windows-50221
\ufffd
x-windows-iso2022jp
\ufffd
EUC-TW
\ufffd

Comments
EVALUATION Interestingly, the attached test case actually cannot be used to verify the fix, because (1) the byte-to-char conversion in our String class always replaces, so \ufffd will always be returned for an unmappable code point, even with the fix, and (2) 0x98 is an illegal value for the second byte of a cs2 EUC-TW byte sequence. It does expose the problems, though.
2006-12-11
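
Because new String(bytes, cs) always replaces, one way to observe what a decoder itself reports is to drive a CharsetDecoder directly with CodingErrorAction.REPORT. The following is only a sketch, not the regression test that went in with the fix: the class name DecodeReport and the method describe are made up for illustration, and the sample inputs use UTF-8 and US-ASCII because those charsets are always available, whereas the ISO-2022-JP variants and EUC-TW from the report may not be present in every runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the JDK or of the original test case.
public class DecodeReport {
    // Decodes with REPORT actions and returns the chars written before
    // decoding stopped, followed by the final CoderResult.
    public static String describe(String charsetName, byte[] bytes) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int capacity =
            (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte()) + 1;
        CharBuffer out = CharBuffer.allocate(capacity);

        CoderResult result = decoder.decode(in, out, true);
        if (result.isUnderflow()) {
            // Normal completion: let the decoder emit any pending output.
            result = decoder.flush(out);
        }
        out.flip();

        StringBuilder sb = new StringBuilder("chars=[");
        for (int i = 0; i < out.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("\\u%04x", (int) out.charAt(i)));
        }
        return sb.append("] result=").append(result).toString();
    }

    public static void main(String[] args) {
        // EF BF BD is the UTF-8 encoding of U+FFFD, so the replacement
        // character here really is what the input means.
        System.out.println(describe("UTF-8",
            new byte[] {(byte) 0xef, (byte) 0xbf, (byte) 0xbd}));
        // 0x80 is not a valid US-ASCII byte; it should be reported as an
        // error and not turned into U+FFFD by the decoder.
        System.out.println(describe("US-ASCII", new byte[] {(byte) 0x80}));
    }
}
```

The first call should yield chars=[\ufffd] with an UNDERFLOW result, the second no chars and MALFORMED[1]. Passing one of the charset names from the description, where installed, shows whether that decoder writes \ufffd itself or returns an error result.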

EVALUATION The iso2022jp decoder needs to check whether the return value from decode() is REPLACE_CHAR (which semantically means an unmappable char).
2006-03-04
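
The check described in the 2006-03-04 evaluation can be sketched as follows. This is not the actual ISO2022_JP decoder code: the class ReplaceCharCheck, the lookup method, and its one-entry table are all hypothetical stand-ins for a real double-byte mapping, and escape-sequence handling is left out. The point is only that a lookup returning REPLACE_CHAR is turned into an UNMAPPABLE result and is not written to the output.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;

// Hypothetical illustration, not the JDK's ISO-2022-JP decoder.
public class ReplaceCharCheck {
    public static final char REPLACE_CHAR = '\uFFFD';

    // Toy two-byte lookup: only the pair (0x21, 0x21) is mapped.
    // Every other pair returns REPLACE_CHAR, meaning "no mapping".
    public static char lookup(int b1, int b2) {
        return (b1 == 0x21 && b2 == 0x21) ? 'A' : REPLACE_CHAR;
    }

    // Simplified decode loop over two-byte sequences.
    public static CoderResult decodeLoop(ByteBuffer src, CharBuffer dst) {
        while (src.remaining() >= 2) {
            if (!dst.hasRemaining()) {
                return CoderResult.OVERFLOW;
            }
            int mark = src.position();
            int b1 = src.get() & 0xff;
            int b2 = src.get() & 0xff;
            char c = lookup(b1, b2);
            if (c == REPLACE_CHAR) {
                // Report the two bytes as unmappable and leave them
                // unconsumed; do not emit U+FFFD from the loop.
                src.position(mark);
                return CoderResult.unmappableForLength(2);
            }
            dst.put(c);
        }
        return CoderResult.UNDERFLOW;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.wrap(new byte[] {0x21, 0x21, 0x00, 0x00});
        CharBuffer dst = CharBuffer.allocate(4);
        System.out.println(decodeLoop(src, dst));
    }
}
```

With this shape, whether \ufffd ends up in the output is decided by the caller's CodingErrorAction (REPLACE, REPORT or IGNORE) and not by the loop.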