A DESCRIPTION OF THE REQUEST :
Short summary: CharsetEncoder.maxBytesPerChar() returns 4.0 for UTF-8, but the *real* value should be 3.0. While a code point can produce a 4-byte UTF-8 sequence, every such code point is a supplementary character that requires *two UTF-16 chars*, so it contributes only 4/2 = 2.0 bytes per char. The true worst case is a BMP code point in the range U+0800 to U+FFFF, which encodes to 3 bytes from a single char.
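The arithmetic above can be checked directly; this minimal sketch (class name is illustrative) compares a supplementary code point against a BMP code point:

```java
import java.nio.charset.StandardCharsets;

public class BytesPerCharDemo {
    public static void main(String[] args) {
        // U+1F600 is supplementary: 4 UTF-8 bytes spread over 2 UTF-16 chars.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());                              // 2
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
        // U+FFFF is in the BMP: 3 UTF-8 bytes from a single char -- the real worst case.
        String bmp = new String(Character.toChars(0xFFFF));
        System.out.println(bmp.length());                                        // 1
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length);         // 3
    }
}
```

So the supplementary code point yields 4/2 = 2.0 bytes per char, while the BMP code point yields 3/1 = 3.0.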
JUSTIFICATION :
This is a performance issue, not a correctness issue: the code path for String.getBytes("UTF-8") allocates a *worst-case* sized buffer computed from this value. Reducing it from 4.0 to 3.0 would cut that allocation by 25% and reduce garbage-collection pressure in string-processing applications.
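A rough sketch of the sizing arithmetic involved (the exact internal buffer-sizing code in the JDK is not reproduced here; the 1M-char string length is a hypothetical example):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class WorstCaseBuffer {
    public static void main(String[] args) {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        int length = 1_000_000; // hypothetical 1M-char input string
        // The worst-case buffer is sized from maxBytesPerChar; with the
        // reported 4.0 that is ~4 MB, though 3 bytes per char always suffices.
        long worstCase = (long) (length * (double) enc.maxBytesPerChar());
        long sufficient = 3L * length;
        System.out.println("worst-case buffer: " + worstCase + " bytes");
        System.out.println("always sufficient: " + sufficient + " bytes");
    }
}
```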
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() should return 3.0
See the example code for a program that computes and verifies this value.
ACTUAL -
Charset.forName("UTF-8").newEncoder().maxBytesPerChar() returns 4.0
---------- BEGIN SOURCE ----------
import java.nio.charset.Charset;

public class Test {
    public static void main(String[] arguments)
            throws java.io.UnsupportedEncodingException {
        System.out.println("Reported max bytes per char: " +
                Charset.forName("UTF-8").newEncoder().maxBytesPerChar());
        double maxBytesPerChar = -1;
        // Walk every Unicode code point and measure UTF-8 bytes per UTF-16 char.
        for (int i = 0; i <= Character.MAX_CODE_POINT; i++) {
            String s = new String(Character.toChars(i));
            assert 0 < s.length() && s.length() <= 2; // run with -ea to enable
            byte[] utf8 = s.getBytes("UTF-8");
            double bytesPerChar = utf8.length / (double) s.length();
            if (bytesPerChar > maxBytesPerChar) {
                maxBytesPerChar = bytesPerChar;
            }
        }
        System.out.println("Computed real max bytes per char: " + maxBytesPerChar);
    }
}
---------- END SOURCE ----------