Bug ID: JDK-7082884 Incorrect UTF8 conversion for sequence ED 31

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 7

Priority: P4
Status: Closed
Resolution: Won't Fix
OS: generic
CPU: generic

Submitted: 2011-08-24
Updated: 2015-01-29
Resolved: 2012-11-09

JDK 8: 8 (Fixed)

SYNOPSIS
--------
Incorrect UTF8 conversion for sequence ED 31

OPERATING SYSTEM
----------------
All

FULL JDK VERSION
----------------
Java 6 (tested with 1.6.0_26)
Java 7 (tested with GA / b147)

PROBLEM DESCRIPTION from LICENSEE
---------------------------------
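The byte sequence ED 31 is not decoded correctly. Note that the test case writes the second byte as the decimal literal 31, so the bytes actually decoded are 0xED 0x1F.

The Unicode Standard's guidance on UTF-8 conversion states that each maximal subpart of an ill-formed subsequence should be replaced by a single U+FFFD before decoding moves on to the next byte. In this case ED is a valid lead byte of a three-byte sequence, but the second byte is not a valid continuation byte. Therefore ED alone should be replaced by U+FFFD, and the second byte should then be decoded on its own. It is a valid single-byte sequence and decodes to U+001F (1f).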
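The replacement rule described above can be sketched as a small hand-written decoder. This is only an illustration of the maximal-subpart policy, not the JDK's implementation; the class and method names (MaximalSubpartSketch, decode, toHex) are hypothetical, and the byte ranges follow the well-formed UTF-8 byte sequence table of the Unicode Standard.

```java
public class MaximalSubpartSketch {

    /** Decodes UTF-8, emitting one U+FFFD per maximal ill-formed subpart. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            if (b0 < 0x80) {                 // single-byte (ASCII) sequence
                out.append((char) b0);
                i++;
                continue;
            }
            int need = 0;                    // continuation bytes expected
            int cp = 0;                      // code point being assembled
            int lo = 0x80, hi = 0xBF;        // allowed range for the next byte
            if (b0 >= 0xC2 && b0 <= 0xDF) {
                need = 1; cp = b0 & 0x1F;
            } else if (b0 >= 0xE0 && b0 <= 0xEF) {
                need = 2; cp = b0 & 0x0F;
                if (b0 == 0xE0) lo = 0xA0;   // excludes overlong forms
                if (b0 == 0xED) hi = 0x9F;   // excludes surrogates
            } else if (b0 >= 0xF0 && b0 <= 0xF4) {
                need = 3; cp = b0 & 0x07;
                if (b0 == 0xF0) lo = 0x90;   // excludes overlong forms
                if (b0 == 0xF4) hi = 0x8F;   // excludes values above U+10FFFF
            } else {                         // 80..C1 and F5..FF never start a sequence
                out.append('\uFFFD');
                i++;
                continue;
            }
            int j = i + 1;
            boolean ok = true;
            for (int k = 0; k < need; k++, j++) {
                if (j >= in.length) { ok = false; break; }   // truncated input
                int b = in[j] & 0xFF;
                if (b < lo || b > hi) { ok = false; break; } // not a valid continuation
                cp = (cp << 6) | (b & 0x3F);
                lo = 0x80; hi = 0xBF;        // later bytes use the generic range
            }
            if (ok) {
                out.appendCodePoint(cp);
            } else {
                out.append('\uFFFD');        // one replacement for the bytes consumed so far
            }
            i = j;                           // on failure, the offending byte is re-examined
        }
        return out.toString();
    }

    /** Formats each UTF-16 code unit as lowercase hex, separated by spaces. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same bytes as the test case: 0xED followed by decimal 31 (0x1F).
        System.out.println(toHex(decode(new byte[] {(byte) 0xED, 31})));  // prints "fffd 1f"
    }
}
```

The key design point is that the index is advanced only past the bytes that formed the ill-formed subpart, so the byte that broke the sequence gets a fresh chance to be decoded as the start of a new sequence.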

TESTCASE
--------
public class RegTest {
    public static void main (String args[]) throws Exception {
        // Note: 31 is a decimal literal, so the bytes are 0xED 0x1F.
        byte[] test1 = new byte[] {(byte)0xED, 31};
        String s1 = stringToHex(new String(test1, "UTF8"));
        System.out.println(s1);
    }

    public static String stringToHex( String base ) {
        StringBuffer buffer = new StringBuffer();
        int intValue;
        for (int x = 0; x < base.length(); x ++) {
            intValue = base.charAt(x);
            String hex = Integer.toHexString(intValue);
            if (hex.length() == 1) {
                buffer.append("0" + hex + " ");
            } else {
                buffer.append(hex + " ");
            }
        }
        return buffer.toString();
    }
}

REPRODUCTION INSTRUCTIONS
-------------------------
1. javac RegTest.java
2. java RegTest

Actual Output:
fffd

Expected Output:
fffd 1f

This is a P4 and will probably not be fixed in JDK 6.

09-11-2012

EVALUATION
----------
The submitter is correct. While the current implementation gives the best performance, the Standard appears to suggest that we MUST return malform(1) in the case of a "mixed" illegal UTF-8 byte sequence.

24-08-2011
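
To see what malformed length a given JDK reports for this input, a decoder can be run with the REPORT action and the resulting CoderResult inspected. This is a minimal sketch: the class name MalformedLengthProbe and the helper probe are hypothetical, and reading "malform(1)" as a malformed-input result of length 1 is an assumption about the evaluator's shorthand. The length printed depends on the JDK in use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedLengthProbe {

    /** Decodes the bytes with error reporting enabled and returns the first result. */
    static CoderResult probe(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(bytes.length);
        // endOfInput = true: no further bytes will follow this buffer.
        return decoder.decode(in, out, true);
    }

    public static void main(String[] args) {
        // Same bytes as the test case: 0xED followed by decimal 31 (0x1F).
        CoderResult result = probe(new byte[] {(byte) 0xED, 31});
        if (result.isMalformed()) {
            // A length of 1 means only ED is treated as the ill-formed subpart.
            System.out.println("malformed, length " + result.length());
        } else {
            System.out.println(result);
        }
    }
}
```

A reported length of 1 corresponds to the behavior the submitter expects, since the decoder would then resume at the second byte and decode it as a single-byte sequence.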