Other |
---|
1.4.0 beta Fixed |
Name: yyT116575 Date: 11/22/2000

java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

This bug is related to the following bugs:

4251997 - UTF-8 Surrogate Decoding is Broken
4297837 - Silent Recovery from bad UTF-8
4344267 - Broken UTF-8 conversion of split surrogate

but I've included a test case to highlight the problem. While I've stated that this bug affects 1.3, it affects all previous versions as well.

The problem with the UTF8 decoder is that it does not properly handle surrogate characters. The documentation states that surrogates are not yet supported, but 1) they should be, and 2) they seem to be supported anyway (or at least partially). It's alright not to support them at this time, but the support should be consistent and the behavior should be defined.

For example, when bytes are passed to the String constructor with an encoding name of "UTF8", the surrogate characters are decoded correctly. However, if the same bytes are read from a byte stream, the InputStreamReader decoding UTF8 silently skips the surrogates! Strange. I would at least have thought that both methods would use the same underlying decoder code.

If the decision NOT to support surrogates stands, then perhaps the reader should throw some kind of exception to signal the error. Passing over them silently can cause problems for the application.

In addition, I believe that the UTF8 decoder also does not support reading a UTF8 byte-order mark (BOM) at the beginning of the input. (It *does* occur in the real world -- e.g. Microsoft adds UTF8 BOMs to a lot of their documents.) It's not strictly disallowed; it's just weird, and the decoder should be able to handle it.

```java
/* Test case. Doesn't test ability to detect BOM. */

import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.io.Reader;

public class BrokenUTF8 {

    // MAIN

    public static void main(String[] argv) throws Exception {

        System.out.println("#");
        System.out.println("# Byte array");
        System.out.println("#");

        // UTF-8 encoding of U+10000, which needs a surrogate pair in UTF-16
        final byte[] bytes = { (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80 };
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0x00FF;
            System.out.println("byte["+i+"]: 0x"+Integer.toHexString(c));
        }

        System.out.println("#");
        System.out.println("# Converting bytes: new String(bytes, \"UTF8\")");
        System.out.println("#");

        // Path 1: decode through the String constructor
        String s = new String(bytes, "UTF8");
        int slen = s.length();
        for (int i = 0; i < slen; i++) {
            int c = s.charAt(i);
            System.out.println("s.charAt("+i+"): 0x"+Integer.toHexString(c));
        }

        System.out.println("#");
        System.out.println("# Converting bytes: new InputStreamReader(stream,\"UTF8\")");
        System.out.println("#");

        // Path 2: decode the same bytes through an InputStreamReader
        InputStream stream = new ByteArrayInputStream(bytes);
        InputStream streamReporter = new InputStreamReporter(stream);
        Reader reader = new InputStreamReader(streamReporter, "UTF8");
        int c = -1;
        do {
            c = reader.read();
            String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
            System.out.println("Reader.read(): "+cs);
        } while (c != -1);

        System.out.println("#");
        System.out.println("# Done.");
        System.out.println("#");
    }

    // Classes

    /** Logs every read made against the wrapped stream. */
    static class InputStreamReporter extends FilterInputStream {

        // Constructors

        public InputStreamReporter(InputStream stream) {
            super(stream);
        }

        // InputStream methods

        public int read() throws IOException {
            int c = in.read();
            System.out.print("InputStream.read(): 0x");
            if (c != -1) {
                System.out.print(Integer.toHexString(c));
            } else {
                System.out.print("EOF");
            }
            System.out.println();
            return c;
        }

        public int read(byte[] buffer, int offset, int length) throws IOException {
            int count = super.in.read(buffer, offset, length);
            System.out.println("InputStream.read(byte[],"+offset+','+length+"): "+count);
            return count;
        }

    } // class InputStreamReporter

} // class BrokenUTF8
```

(Review ID: 112649)
======================================================================
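The report's suggestion that a decoder should signal an error rather than skip input silently can be illustrated with a minimal sketch. It assumes a runtime that provides the `java.nio.charset` API and `StandardCharsets`, neither of which exists in the 1.3.0 release the report was filed against. The class name `StrictUtf8Sketch` and the helper `decodeStrict` are hypothetical names chosen for illustration; this is not the fix that was applied to the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Sketch {

    // Hypothetical helper: decodes UTF-8 and throws on malformed input
    // instead of skipping or replacing the offending bytes.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws Exception {
        // Same four bytes as the test case: the UTF-8 form of U+10000.
        byte[] bytes = { (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80 };
        String s = decodeStrict(bytes);
        // U+10000 maps to the surrogate pair 0xd800, 0xdc00.
        for (int i = 0; i < s.length(); i++) {
            System.out.println("s.charAt(" + i + "): 0x" + Integer.toHexString(s.charAt(i)));
        }

        // A truncated sequence is reported as an error rather than dropped.
        byte[] truncated = { (byte) 0xF0, (byte) 0x90, (byte) 0x80 };
        try {
            decodeStrict(truncated);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```

With `CodingErrorAction.REPORT`, the caller decides how to recover from bad input, which is the consistent, defined behavior the report asks for.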
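The test case above does not exercise the BOM concern. As a caller-side workaround, the three BOM bytes (EF BB BF) can be consumed before the stream reaches the reader. The sketch below is one possible approach, not a description of how the decoder itself behaves or was changed; `Utf8BomSketch`, `skipUtf8Bom`, and `readAll` are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

final class Utf8BomSketch {

    // Hypothetical helper: returns a stream positioned after a leading
    // EF BB BF sequence if one is present, otherwise at the original start.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until
        // three bytes are available or the stream ends.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the reader sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    // Reads the whole stream as UTF-8 text after BOM removal.
    static String readAll(byte[] bytes) throws IOException {
        Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(bytes)), "UTF-8");
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        byte[] withoutBom = { 'h', 'i' };
        System.out.println(readAll(withBom));    // prints "hi"
        System.out.println(readAll(withoutBom)); // prints "hi"
    }
}
```

Using a `PushbackInputStream` keeps the check non-destructive: input that does not start with a BOM is passed through unchanged.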