Bug ID: JDK-4391895 UTF8 Decoder Broken

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 1.3.0

Priority: P3
Status: Resolved
Resolution: Fixed
OS: windows_nt
CPU: x86

Submitted: 2000-11-22
Updated: 2000-12-20
Resolved: 2000-12-20

Other: 1.4.0 beta (Fixed)


Name: yyT116575			Date: 11/22/2000


java version "1.3.0"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.0-C)
Java HotSpot(TM) Client VM (build 1.3.0-C, mixed mode)

This bug is related to the following bugs:
  4251997 - UTF-8 Surrogate Decoding is Broken
  4297837 - Silent Recovery from bad UTF-8
  4344267 - Broken UTF-8 conversion of split surrogate
but I've included a test case to highlight the problem.

While I've stated that this bug affects 1.3, it affects
all previous versions as well.

The problem with the UTF8 decoder is that it does not
properly handle surrogate characters. It is stated in
the documentation that surrogates are not supported as
of yet but 1) they should be, and 2) they seem to be
supported anyway (or at least partially). It's alright
to not support them at this time but the support should
be consistent and the behavior should be defined.

For example, when bytes are passed to the String
constructor with an encoding name of "UTF8", the
surrogate characters are decoded correctly. However, if
the surrogates appear in a byte stream, the surrogates
are silently skipped! Strange. I would at least have
thought that both methods would use the same underlying
decoder code.

Also, the InputStreamReader decoding UTF8 silently skips
surrogates in the input stream. If the decision to NOT
support surrogates stands as is, then perhaps the reader
should throw some kind of exception to signal the error.
Passing over them silently can cause problems for the
application.

In addition, I believe that the UTF8 decoder also does not
support reading a UTF8 byte-order-mark (BOM) at the
beginning of the input. (It *does* occur in the real world
-- e.g. Microsoft adds UTF8 BOMs to a lot of their
documents.) A BOM is not strictly disallowed in UTF8; it's
just unusual, and the decoder should be able to handle it.
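
For illustration only: an application could strip a leading BOM itself
before handing the stream to InputStreamReader. The sketch below assumes
the BOM is the byte sequence EF BB BF at the very start of the input. The
class name Utf8BomSkipper and the method skipBom are hypothetical helpers,
not JDK API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not part of the JDK): strips a leading UTF-8 BOM.
class Utf8BomSkipper {

    // Returns a stream positioned just after the BOM if one is present;
    // otherwise the returned stream yields exactly the original bytes.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read call may return fewer than 3 bytes, so loop.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream before 3 bytes were available
            }
            n += r;
        }
        boolean hasBom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The result of skipBom(stream) can then be wrapped in an InputStreamReader
as usual.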

/* Test case. Doesn't test ability to detect BOM. */
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.io.Reader;

public class BrokenUTF8 {

    // MAIN

    public static void main(String[] argv) throws Exception {
        System.out.println("#");
        System.out.println("# Byte array");
        System.out.println("#");
        final byte[] bytes = {
            (byte)0xF0, (byte)0x90, (byte)0x80, (byte)0x80
        };
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0x00FF;
            System.out.println("byte["+i+"]: 0x"+Integer.toHexString(c));
        }
        System.out.println("#");
        System.out.println("# Converting bytes: new String(bytes, \"UTF8\")");
        System.out.println("#");
        String s = new String(bytes, "UTF8");
        int slen = s.length();
        for (int i = 0; i < slen; i++) {
            int c = s.charAt(i);
            System.out.println("s.charAt("+i+"): 0x"+Integer.toHexString(c));
        }
        System.out.println("#");
        System.out.println("# Converting bytes: new InputStreamReader(stream,\"UTF8\")");
        System.out.println("#");
        InputStream stream = new ByteArrayInputStream(bytes);
        InputStream streamReporter = new InputStreamReporter(stream);
        Reader reader = new InputStreamReader(streamReporter, "UTF8");
        int c = -1;
        do {
            c = reader.read();
            String cs = c != -1 ? "0x"+Integer.toHexString(c) : "EOF";
            System.out.println("Reader.read(): "+cs);
        } while (c != -1);
        System.out.println("#");
        System.out.println("# Done.");
        System.out.println("#");
    }

    // Classes

    static class InputStreamReporter extends FilterInputStream {

        // Constructors

        public InputStreamReporter(InputStream stream) {
            super(stream);
        }

        // InputStream methods

        public int read() throws IOException {
            int c = in.read();
            // Print the "0x" prefix only for real byte values, not for EOF.
            System.out.print("InputStream.read(): ");
            if (c != -1) {
                System.out.print("0x" + Integer.toHexString(c));
            }
            else {
                System.out.print("EOF");
            }
            System.out.println();
            return c;
        }

        public int read(byte[] buffer, int offset, int length) throws IOException {
            int count = super.in.read(buffer, offset, length);
            System.out.println("InputStream.read(byte[],"+offset+','+length+"): "+count);
            return count;
        }

    } // class InputStreamReporter

} // class BrokenUTF8
(Review ID: 112649) 
======================================================================

CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: merlin-beta
FIXED IN: merlin-beta
INTEGRATED IN: merlin-beta

14-06-2004

WORK AROUND

Name: yyT116575			Date: 11/22/2000

Write your own UTF8 decoder. It would be correct and much faster than the
one supplied with the JDK.
======================================================================

11-06-2004
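
As a rough illustration of what the workaround above involves, here is a
minimal sketch of decoding a single four-byte UTF8 sequence (such as the
F0 90 80 80 used in the test case) into a UTF-16 surrogate pair. The class
FourByteUtf8 and its decode method are hypothetical names; a full decoder
would also need to handle one- to three-byte sequences and reject overlong
encodings.

```java
// Hypothetical sketch, not the JDK's decoder: handles 4-byte sequences only.
class FourByteUtf8 {

    // Decodes the 4-byte UTF-8 sequence starting at b[off] and returns
    // the corresponding { high surrogate, low surrogate } pair.
    static char[] decode(byte[] b, int off) {
        int b0 = b[off] & 0xFF;
        int b1 = b[off + 1] & 0xFF;
        int b2 = b[off + 2] & 0xFF;
        int b3 = b[off + 3] & 0xFF;
        // Lead byte must be 11110xxx; the rest must be 10xxxxxx.
        if ((b0 & 0xF8) != 0xF0 || (b1 & 0xC0) != 0x80
                || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("malformed 4-byte sequence");
        }
        // Assemble the 21-bit code point from 3 + 6 + 6 + 6 payload bits.
        int cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
               | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range");
        }
        // Split the 20-bit offset into two 10-bit halves.
        int v = cp - 0x10000;
        char high = (char) (0xD800 + (v >> 10));
        char low = (char) (0xDC00 + (v & 0x3FF));
        return new char[] { high, low };
    }
}
```

For the test case bytes F0 90 80 80 the code point is U+10000, so the
offset is 0 and the pair is 0xD800, 0xDC00.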

EVALUATION

The surrogate handling bugs/inconsistencies in the UTF-8 encoder/decoder
code are being addressed in merlin. The UTF-8 converter, along with the
other core converters, is being replaced at the same time as the
introduction of the new pluggable charset SPI as part of JSR-051. The
intention is to support encoding and decoding of surrogates to the extent
of addressing the issues raised in bugIDs 4251997, 4297837 and 4344267.
BugID 4328816 is being used to track this issue.

The InputStreamReader skipping over surrogates in an encoded byte stream
happens when the single-character read() method is used. If
read(char[],off,len) is used, the ucs4 representation will be correctly
decoded. This happens because the read() method incorrectly assumes that
a max of 1 char will be produced from a given utf8 byte sequence. This
assumption is invalid when the encoded stream contains the ucs4
representation of surrogate pairs. The implementation should change to
allow a subsequent read() to return the low surrogate char value.

The failure of the decoder to handle an initial BOM is a bug detail beyond
what is already covered in the quoted related bugs above. To keep this
issue tracked I will keep this bug open.

Ian.Little@Ireland 11/23/2000

23-11-2000
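
The change suggested in the evaluation above (letting a subsequent read()
return the low surrogate) can be sketched as a wrapper that implements the
single-character read() on top of read(char[],off,len) and holds any extra
char until the next call. This is only an illustration of the idea, not the
actual merlin implementation; the class name PendingCharReader is
hypothetical.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical wrapper: read() never drops the second char of a pair.
class PendingCharReader extends Reader {
    private final Reader in;
    private final char[] buf = new char[2]; // room for one surrogate pair
    private int pos = 0;  // next char to hand out
    private int len = 0;  // number of valid chars in buf

    PendingCharReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        // Refill only when every buffered char has been returned.
        while (pos == len) {
            int n = in.read(buf, 0, buf.length);
            if (n < 0) {
                return -1;
            }
            pos = 0;
            len = n;
        }
        return buf[pos++];
    }

    @Override
    public int read(char[] cbuf, int off, int length) throws IOException {
        if (length == 0) {
            return 0;
        }
        // Drain chars left over from an earlier read() first.
        int n = 0;
        while (pos < len && n < length) {
            cbuf[off + n++] = buf[pos++];
        }
        return n > 0 ? n : in.read(cbuf, off, length);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Wrapping a reader whose array-based read decodes the pair correctly, as the
evaluation describes, lets two consecutive read() calls return the high and
then the low surrogate.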

Relates :	JDK-4328816 - Unicode 2.0 surrogate support
Relates :	JDK-4297837 - Silent Recovery from bad UTF-8
Relates :	JDK-4344267 - Broken UTF-8 conversion of split surrogate-pair
Relates :	JDK-4251997 - UTF-8 Surrogate Decoding is Broken