A DESCRIPTION OF THE REQUEST :
Short summary: CharsetEncoder.maxBytesPerChar() returns 4.0 for UTF-8, but the *real* value should be 3.0. While a code point can produce a 4-byte UTF-8 sequence, every such code point is a supplementary character that requires *two UTF-16 chars*, so it contributes only 4/2 = 2.0 bytes per char. The true worst case is a BMP code point in the range U+0800 to U+FFFF, which encodes to 3 bytes from a single char.
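The arithmetic above can be checked directly; this minimal sketch (class name is illustrative) compares a supplementary code point against a BMP code point:

```java
import java.nio.charset.StandardCharsets;

public class BytesPerCharDemo {
    public static void main(String[] args) {
        // U+1F600 is supplementary: 4 UTF-8 bytes spread over 2 UTF-16 chars.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());                              // 2
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
        // U+FFFF is in the BMP: 3 UTF-8 bytes from a single char -- the real worst case.
        String bmp = new String(Character.toChars(0xFFFF));
        System.out.println(bmp.length());                                        // 1
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length);         // 3
    }
}
```

So the supplementary code point yields 4/2 = 2.0 bytes per char, while the BMP code point yields 3/1 = 3.0.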
JUSTIFICATION :
This is a performance issue, not a correctness issue: the code path for String.getBytes("UTF-8") allocates a *worst-case* sized buffer computed from this value. Reducing it from 4.0 to 3.0 would cut that allocation by 25% and reduce garbage-collection pressure in string-processing applications.
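A rough sketch of the sizing arithmetic involved (the exact internal buffer-sizing code in the JDK is not reproduced here; the 1M-char string length is a hypothetical example):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        int length = 1_000_000; // hypothetical 1M-char input string
        // The worst-case buffer is sized from maxBytesPerChar; with the
        // reported 4.0 that is ~4 MB, though 3 bytes per char always suffices.
        long worstCase = (long) (length * (double) enc.maxBytesPerChar());
        long sufficient = 3L * length;
        System.out.println("worst-case buffer: " + worstCase + " bytes");
        System.out.println("always sufficient: " + sufficient + " bytes");
    }
}
```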
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() should return 3.0
See the example code for a program that computes and verifies this value.
ACTUAL -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() returns 4.0
---------- BEGIN SOURCE ----------
import java.nio.charset.Charset;

public class Test {
    public static void main(String[] arguments)
            throws java.io.UnsupportedEncodingException {
        System.out.println("Reported max bytes per char: " +
                Charset.forName("UTF-8").newEncoder().maxBytesPerChar());
        double maxBytesPerChar = -1;
        // Walk every Unicode code point and measure UTF-8 bytes per UTF-16 char.
        for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
            String s = new String(Character.toChars(i));
            assert 0 < s.length() && s.length() <= 2; // run with -ea to enable
            byte[] utf8 = s.getBytes("UTF-8");
            double bytesPerChar = utf8.length / (double) s.length();
            if (bytesPerChar > maxBytesPerChar) {
                maxBytesPerChar = bytesPerChar;
            }
        }
        System.out.println("Computed real max bytes per char: " + maxBytesPerChar);
    }
}
---------- END SOURCE ----------