Bug ID: JDK-8039751 UTF-8 decoder fails to handle some edge cases correctly

Type: Bug
Component: core-libs
Sub-Component: java.nio.charsets
Affected Version: 7u51,8u20

Priority: P3
Status: Closed
Resolution: Fixed
OS: windows_2008
CPU: x86

Submitted: 2014-04-09
Updated: 2022-08-05
Resolved: 2014-04-12

JDK 8	JDK 9
8u20 (Fixed)	9 b10 (Fixed)

FULL PRODUCT VERSION :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

EXTRA RELEVANT SYSTEM CONFIGURATION :
None required.

A DESCRIPTION OF THE PROBLEM :
The Apache Tomcat team has put together a test case [1] that demonstrates multiple
UTF-8 decoding bugs. You'll see from the change history of that file that the UTF-8
decoder in Java 8 is a significant improvement over the Java 7
implementation but a number of bugs still remain.

I thought it would be helpful to walk through one of the test case
examples. The code at line 99 onwards of the test case is as follows:

99 	// JVM decoder does not report error until all 4 bytes are available
100 	TEST_CASES.add(new Utf8TestCase(
101 	"Invalid code point - out of range",
102 	new int[] {0x41, 0xF4, 0x90, 0x80, 0x80, 0x41},
103 	2,
104 	"A\uFFFD\uFFFD\uFFFD\uFFFDA").addForJvm(ERROR_POS_PLUS2));

It is the ".addForJvm(...)" part that indicates that the standard Java
decoder does not handle this case correctly. The parameter to that
method call (or calls) indicates the problem (or problems). In this case
the invalid UTF-8 sequence is detected, but it is detected 2 bytes
later than it should have been.

The first byte is correctly decoded to 'A'.

The second byte is correctly interpreted as marking the start of a 4
byte UTF-8 sequence. Recall that a 4 byte UTF-8 sequence takes the form:

11110aaa 10bbbbbb 10cccccc 10dddddd

Recall also that the code point associated with the above four byte
sequence is:

000aaabb bbbbcccc ccdddddd


Therefore, if the first byte of the 4 byte sequence is 0xF4 then the
code point must be:

000100bb bbbbcccc ccdddddd

Recall that the valid range of Unicode code points is zero to 0x10FFFF or
in binary:
00010000 11111111 11111111

When the third byte of the input (0x90) is read, it maps to the second
byte in the 4 byte sequence as follows:
10010000
10bbbbbb

This provides 6 more bits for the code point (bbbbbb = 010000) which gives:

00010001 0000cccc ccdddddd

The smallest value this can take is 0x110000.

At this point it is known that whatever the values of the third and fourth
bytes in the sequence, the code point is going to be greater than
0x10FFFF and therefore it can - and should - be rejected as invalid at
this point. The standard Java decoder does not do this.
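
The arithmetic in the walk-through above can be checked with a short
sketch. The class and method names here are hypothetical and are not part
of the Tomcat test case or the JDK. The sketch only covers the upper bound
check for 4 byte sequences. It is not a complete UTF-8 validator (overlong
forms and surrogates are not considered).

```java
public class Utf8EarlyRejectSketch {

    // Assemble the code point of a 4 byte sequence: 000aaabb bbbbcccc ccdddddd
    public static int codePoint(int b1, int b2, int b3, int b4) {
        return ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12)
                | ((b3 & 0x3F) << 6) | (b4 & 0x3F);
    }

    // Smallest code point reachable once the first two bytes are known,
    // i.e. with the remaining 12 bits (c and d) all set to zero
    public static int minCodePoint(int b1, int b2) {
        return codePoint(b1, b2, 0x80, 0x80);
    }

    // True if the first two bytes alone prove the code point exceeds 0x10FFFF
    public static boolean rejectAfterTwoBytes(int b1, int b2) {
        return minCodePoint(b1, b2) > Character.MAX_CODE_POINT;
    }

    public static void main(String[] args) {
        // 0xF4 0x90: prints "min code point: 0x110000" then "reject early: true"
        System.out.printf("min code point: 0x%X%n", minCodePoint(0xF4, 0x90));
        System.out.println("reject early: " + rejectAfterTwoBytes(0xF4, 0x90));
        // 0xF4 0x8F can still yield a valid code point: prints "reject early: false"
        System.out.println("reject early: " + rejectAfterTwoBytes(0xF4, 0x8F));
    }
}
```

Put another way, when the first byte is 0xF4 the second byte has to be in
the range 0x80 to 0x8F for the code point to stay at or below 0x10FFFF.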

The requirement to reject the invalid sequence and the importance of
doing it as soon as possible - particularly when the decoder has been
configured to use replacement characters - is discussed in the Unicode
specification 6.2, chapter 3, page 96 "Constraints on Conversion Processes".
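
To see how the position at which an error is detected affects the output
when replacement characters are in use, the same bytes can be decoded with
CodingErrorAction.REPLACE. This is a minimal sketch and the class and
method names are hypothetical. The Tomcat test case quoted above expects
"A\uFFFD\uFFFD\uFFFD\uFFFDA" for this input. What a given JRE actually
prints depends on its decoder, so the sketch just dumps the decoded chars.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8ReplaceSketch {

    // Decode the whole input in one call, substituting U+FFFD for malformed input
    public static String decodeWithReplacement(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares the exception
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {0x41, (byte) 0xF4, (byte) 0x90, (byte) 0x80, (byte) 0x80, 0x41};
        String decoded = decodeWithReplacement(input);

        // Print each decoded char as U+XXXX so replacement characters are visible
        StringBuilder sb = new StringBuilder();
        for (char c : decoded.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        System.out.println(sb.toString().trim());
    }
}
```

A decoder that swallows too many bytes into one invalid sequence would
produce fewer replacement characters than the test case expects, or lose
the trailing 'A'.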


The other test cases in the unit test exercise various edge cases for a
UTF-8 decoder.

The issues with the standard Java decoder may be summarised as:
- not always detecting an invalid sequence early enough
- sometimes incorrectly swallowing a valid byte as part of a preceding
  invalid byte sequence
- sometimes incorrectly swallowing an invalid byte as part of a preceding
  invalid byte sequence

The nature of these errors is such that they often appear in combination
for a particular test case.

In order to avoid any potential security issue with the incorrect
decoding of a UTF-8 sequence - particularly in URLs - Tomcat has had to
implement its own UTF-8 decoder. I am aware that Jetty has also had to
take this approach and I assume other Servlet containers have as well.

It would be great to see these bugs in the UTF-8 decoder fixed so that
Tomcat (and the other containers that have had to implement their own
decoders) can drop that code and use the standard Java decoder.


[1] http://svn.apache.org/viewvc/tomcat/trunk/test/org/apache/tomcat/util/buf/TestUtf8.java?view=markup


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Decode the following byte sequence one byte at a time with the standard UTF-8 decoder:

0x41, 0xF4, 0x90, 0x80, 0x80, 0x41

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
An error should be thrown after processing the third byte (0x90).
ACTUAL -
An error is thrown after processing the fifth byte.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
package org.apache.markt;

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Bug {

    public static void main(String[] args) {
        int[] input = new int[] { 0x41, 0xF4, 0x90, 0x80, 0x80, 0x41};

        int len = input.length;

        ByteBuffer bb = ByteBuffer.allocate(len);
        CharBuffer cb = CharBuffer.allocate(len);

        // Configure decoder to fail on an error
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);

        // Add each byte one at a time. The decoder should fail as soon as
        // an invalid sequence has been provided
        for (int i = 0; i < len; i++) {
            bb.put((byte) input[i]);
            bb.flip();
            CoderResult cr = decoder.decode(bb, cb, false);
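            if (cr.isError()) {
                if (i == 2) {
                    // Error reported at the expected position
                    return;
                }
                throw new IllegalStateException("Error first detected at index " +
                        i + " rather than at index 2");
            }
            bb.compact();
        }
        // Reaching this point means the decoder never reported the error
        throw new IllegalStateException("No error detected");
    }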
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Use a custom UTF-8 decoder:
http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/tomcat/util/buf/Utf8Decoder.java?view=annotate

Mark Thomas has confirmed the fix for Tomcat users. The Tomcat UTF-8 tests now all pass without having to add any workarounds for bugs in the JRE provided UTF-8 decoder. This looks good to back-port to me. Cheers, Mark
12-05-2014
URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/3bcdaa697fce User: lana Date: 2014-04-23 16:11:32 +0000
23-04-2014
URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/3bcdaa697fce User: sherman Date: 2014-04-12 21:40:25 +0000
12-04-2014
review thread : http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-April/026328.html
10-04-2014