JDK-6957230 : CharsetEncoder.maxBytesPerChar() reports 4 for UTF-8; should be 3
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.nio.charsets
  • Affected Version: 7
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86
  • Submitted: 2010-05-31
  • Updated: 2014-09-22
  • Resolved: 2011-03-08
JDK 7: 7 b121 (Fixed)
Description
A DESCRIPTION OF THE REQUEST :
Short summary: CharsetEncoder.maxBytesPerChar() returns 4.0 for UTF-8, but the *real* value should be 3.0. A code point can produce a 4-byte UTF-8 sequence, but only code points outside the Basic Multilingual Plane do so, and each of those requires *two UTF-16 chars* (a surrogate pair). Such code points therefore have a bytes-per-char value of 2.
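
To illustrate the ratio described above, here is a minimal sketch that computes UTF-8 bytes per Java char for three sample code points. The class name Utf8BytesPerChar and the helper bytesPerChar are hypothetical names chosen for this illustration, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesPerChar {
    // Ratio of UTF-8 bytes to UTF-16 code units (Java chars) for a single code point.
    public static double bytesPerChar(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length / (double) s.length();
    }

    public static void main(String[] args) {
        System.out.println(bytesPerChar(0x0041));   // 'A': 1 byte, 1 char, prints 1.0
        System.out.println(bytesPerChar(0x20AC));   // euro sign: 3 bytes, 1 char, prints 3.0
        System.out.println(bytesPerChar(0x1F600));  // supplementary: 4 bytes, 2 chars, prints 2.0
    }
}
```

The 4-byte case never exceeds 2 bytes per char, so the worst case per char comes from 3-byte sequences such as the euro sign.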


JUSTIFICATION :
This is a performance issue, not a correctness issue: the code path for String.getBytes("UTF-8") allocates a *worst-case* sized buffer, computed from this value. Reducing the value from 4.0 to 3.0 makes that temporary buffer a quarter smaller, which should reduce garbage collection rates for string processing applications.
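
The following is a rough sketch of worst-case buffer sizing of the kind described above. It is not the JDK's internal code: the class WorstCaseBuffer and its methods worstCaseSize and encode are hypothetical names, and the sizing formula (char count multiplied by maxBytesPerChar()) is an assumption about how such a buffer would typically be computed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WorstCaseBuffer {
    // Upper bound on the encoded size, derived from the encoder's reported maximum.
    public static int worstCaseSize(int charCount) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        return (int) Math.ceil(charCount * (double) enc.maxBytesPerChar());
    }

    // Encode into a worst-case sized scratch buffer, then trim to the bytes actually written.
    public static byte[] encode(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(worstCaseSize(s.length()));
        enc.encode(CharBuffer.wrap(s), out, true);
        enc.flush(out);
        return Arrays.copyOf(out.array(), out.position());
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        System.out.println("Scratch buffer size: " + worstCaseSize(s.length()));
        System.out.println("Bytes actually used: " + encode(s).length);
    }
}
```

The scratch buffer becomes garbage as soon as the trimmed copy is returned, so a smaller reported maximum translates directly into less short-lived allocation per call.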


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() should return 3.0

See the example code for a program that computes and verifies this value.
ACTUAL -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() returns 4.0

---------- BEGIN SOURCE ----------
import java.nio.charset.Charset;

public class Test {
    public static void main(String[] arguments)
            throws java.io.UnsupportedEncodingException {
        System.out.println("Reported max bytes per char: " +
                Charset.forName("UTF-8").newEncoder().maxBytesPerChar());

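        // Scan every Unicode code point and track the largest bytes-per-char ratio.
        double maxBytesPerChar = -1;
        for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
            // One char for BMP code points, two (a surrogate pair) for supplementary ones.
            String s = new String(Character.toChars(i));
            assert 0 < s.length() && s.length() <= 2;  // only checked when run with -ea
            byte[] utf8 = s.getBytes("UTF-8");

            // UTF-8 bytes produced per UTF-16 char of input.
            double bytesPerChar = utf8.length / (double) s.length();
            if (bytesPerChar > maxBytesPerChar) {
                maxBytesPerChar = bytesPerChar;
            }
        }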

        System.out.println("Computed real max bytes per char: " +
                maxBytesPerChar);
    }
}

---------- END SOURCE ----------

Comments
EVALUATION: The submitter's argument appears to be right. Although the API doc of maxBytesPerChar() uses the word "character" in "maximum number of bytes that will be produced for each character of input", it would be more accurate to say the 16-bit "char" unit of the CharBuffer.
16-08-2010