Bug ID: JDK-8282429 StringBuilder/StringBuffer.toString() skip compressing for UTF16 strings

Fixed in: JDK 19 (19 b17)

Both StringBuilder and StringBuffer are subclasses of AbstractStringBuilder. When clients append chars to AbstractStringBuilder, it inflates the internal byte array if the incoming chars can't be encoded in LATIN1. The field coder records the outcome: it is either LATIN1 or UTF16.

Here is StringBuilder::toString(). The UTF16 path doesn't use the information that value can't be encoded in LATIN1, which AbstractStringBuilder already knows. StringBuffer::toString() is similar.

    public String toString() {
        // Create a copy, don't share the array
        return isLatin1() ? StringLatin1.newString(value, 0, count)
                          : StringUTF16.newString(value, 0, count);
    }

As a result, StringUTF16.newString() attempts to compress value again if String.COMPACT_STRINGS is true. It ends up allocating a new array of len bytes but the compression can't succeed. 

    public static byte[] compress(byte[] val, int off, int len) {
        byte[] ret = new byte[len];
        if (compress(val, off, ret, 0, len) == len) {
            return ret;
        }
        return null;
    }
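To make the cost concrete, here is a simplified, self-contained model of the compression attempt. It is not the JDK code: the class name CompressSketch, the allocatedBytes counter, and the char[]-based signature are assumptions made for illustration (the real method operates on a UTF16-encoded byte[]).

```java
/** Simplified model of a LATIN1 compression attempt; not the JDK implementation. */
final class CompressSketch {
    /** Hypothetical counter that makes the allocated bytes visible. */
    static long allocatedBytes = 0;

    /**
     * Tries to narrow len chars starting at off into one byte each.
     * Returns null if any char does not fit in LATIN1 (code point above 0xFF).
     */
    static byte[] compress(char[] val, int off, int len) {
        byte[] ret = new byte[len];   // allocated before we know the attempt will succeed
        allocatedBytes += len;
        for (int i = 0; i < len; i++) {
            char c = val[off + i];
            if (c > 0xFF) {
                return null;          // ret is discarded and becomes garbage
            }
            ret[i] = (byte) c;
        }
        return ret;
    }

    public static void main(String[] args) {
        char[] chars = new char[1021];
        java.util.Arrays.fill(chars, 'a');
        chars[1020] = '\u3042';       // HIRAGANA LETTER A, outside LATIN1
        byte[] result = compress(chars, 0, chars.length);
        System.out.println(result == null);   // prints true
        System.out.println(allocatedBytes);   // prints 1021
    }
}
```

In this model the failure is only detected at the last char, so the whole array is allocated and almost fully written before being thrown away.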

Here is an example of that case. Only the last char appended to the StringBuilder can't be encoded in LATIN1.

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Benchmark)
    @Fork(3)
    @Warmup(iterations=10)
    @Measurement(iterations = 10)
    public class MyBenchmark {
        @Param({"1024"})
        public int SIZE;

        @Benchmark
        public String testMethod() {
            StringBuilder sb = new StringBuilder(SIZE);
            for (int i = 0; i < SIZE - 4; ++i) {
                sb.append('a');
            }
            sb.append("あ"); // can't be encoded in latin-1
            return sb.toString();
        }
    }

The initial capacity of the StringBuilder is SIZE bytes. When it encounters the last character 'あ', the builder inflates its internal array to 2 * SIZE bytes and changes its coder from LATIN1 to UTF16. sb.toString() then takes the !isLatin1() path, and StringUTF16::compress() fails. The allocation in compress() is wasted.
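The sizes involved can be worked out from the benchmark's parameters. The sketch below is plain arithmetic only; the class and method names are hypothetical, and array object headers are ignored.

```java
/** Back-of-the-envelope sizes for the benchmark; names are illustrative only. */
final class WastedAllocationSketch {
    /** Chars in the builder when toString() is called: SIZE - 4 copies of 'a' plus one 'あ'. */
    static int charCount(int size) {
        return (size - 4) + 1;
    }

    /** Bytes in the builder's array after inflating from LATIN1 to UTF16. */
    static int inflatedCapacityBytes(int size) {
        return 2 * size;
    }

    /** Bytes allocated and discarded by the failed compression attempt: one per char. */
    static int wastedBytes(int size) {
        return charCount(size);
    }

    /** Bytes in the resulting UTF16 String's array: two per char. */
    static int resultBytes(int size) {
        return 2 * charCount(size);
    }

    public static void main(String[] args) {
        int size = 1024;
        System.out.println(charCount(size));             // prints 1021
        System.out.println(inflatedCapacityBytes(size)); // prints 2048
        System.out.println(wastedBytes(size));           // prints 1021
        System.out.println(resultBytes(size));           // prints 2042
    }
}
```

So for SIZE = 1024, each toString() call allocates roughly 1 KB that is never used, on top of the 2042 bytes the result actually needs.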

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/7671 Date: 2022-03-03 02:36:58 +0000
29-05-2025
Changeset: bab431cc Author: Xin Liu <xliu@openjdk.org> Date: 2022-04-01 04:42:03 +0000 URL: https://git.openjdk.java.net/jdk/commit/bab431cc120fe09be371dadef0c1caf79ec9eef4
01-04-2022
A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7671 Date: 2022-03-03 02:36:58 +0000
03-03-2022
There are two scenarios. Many clients use StringBuilder in grow-only mode: they only grow the internal byte[] (field value) via append or insert, e.g. java.io.BufferedReader.readLine(). In the other scenario, clients use StringBuilder like an ArrayList and may reset value using setLength(0). I have seen that in javac, in com.sun.tools.javac.parser.readToken(). Grow-only mode has a nice property: once value has been inflated, its contents can't be represented in LATIN1. Therefore, toString() can skip the compression attempt in the StringUTF16 case.
01-03-2022
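The grow-only property described in the comment above could be exploited roughly as follows. This is a toy model under stated assumptions, not the change in the pull request: the class GrowOnlySketch, the flag name maybeLatin1, and the compressAttempts counter are illustrative names, and the builder stores chars directly instead of a coded byte[].

```java
/** Toy builder that skips the compression attempt while it is grow-only. */
final class GrowOnlySketch {
    private char[] value = new char[16];
    private int count = 0;
    private boolean utf16 = false;       // models the coder: false = LATIN1, true = UTF16
    private boolean maybeLatin1 = false; // set once contents were removed after inflation
    int compressAttempts = 0;            // hypothetical counter, for demonstration only

    GrowOnlySketch append(char c) {
        if (count == value.length) {
            value = java.util.Arrays.copyOf(value, count * 2);
        }
        if (c > 0xFF) {
            utf16 = true;                // "inflate": the builder stays UTF16 from now on
        }
        value[count++] = c;
        return this;
    }

    /** Only shrinking is supported in this sketch. */
    void setLength(int newLength) {
        if (newLength < 0 || newLength > count) {
            throw new IllegalArgumentException("newLength out of range: " + newLength);
        }
        if (newLength < count && utf16) {
            maybeLatin1 = true;          // the non-LATIN1 chars may have been dropped
        }
        count = newLength;
    }

    @Override
    public String toString() {
        if (utf16 && maybeLatin1) {
            compressAttempts++;          // a real implementation would try to compress here
        }
        // A grow-only UTF16 builder reaches this point without any compression attempt.
        return new String(value, 0, count);
    }

    public static void main(String[] args) {
        GrowOnlySketch sb = new GrowOnlySketch();
        sb.append('a').append('\u3042');
        sb.toString();
        System.out.println(sb.compressAttempts); // prints 0
        sb.setLength(0);
        sb.append('b');
        sb.toString();
        System.out.println(sb.compressAttempts); // prints 1
    }
}
```

The flag is deliberately conservative: once contents have been removed from an inflated builder, the sketch always attempts compression again, which preserves the behavior needed for the setLength(0) scenario.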

Relates: JDK-8325730 - StringBuilder.toString allocation for the empty String
Relates: JDK-8332282 - AbstractStringBuilder.toString spec needs amendments for empty strings