This was first spotted by Amir Hadadi in https://stackoverflow.com/questions/70272651/missing-bounds-checking-elimination-in-string-constructor
It looks like in the following code
    while (offset < sl) {
        int b1 = bytes[offset];
        if (b1 >= 0) {
            dst[dp++] = (byte)b1;
            offset++; // <---
            continue;
        }
        if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                offset + 1 < sl) {
            int b2 = bytes[offset + 1];
            if (!isNotContinuation(b2)) {
                dst[dp++] = (byte)decode2(b1, b2);
                offset += 2;
                continue;
            }
        }
        // anything not a latin1, including the repl
        // we have to go with the utf16
        break;
    }
bounds check elimination is not performed when the byte array is accessed via bytes[offset].
The reason, I guess, is that the offset variable is modified within the loop (marked with an arrow).
A possible fix is to change the loop condition:
    while (offset < sl) ---> while (offset >= 0 && offset < sl)
However, the best solution would be to improve the C2 optimization so that it handles all such cases.
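To make the proposed change concrete, here is a minimal, self-contained sketch of the Latin-1 fast-path loop with the patched condition. It is not the JDK source: the class name Latin1FastPathSketch, the method decodeLatin1Prefix, and the simplified bodies of isNotContinuation and decode2 are assumptions written for illustration only. The sketch shows only where the extra offset >= 0 check goes. Whether C2 then drops the bounds check has to be confirmed with a benchmark or by inspecting the generated code.

```java
import java.nio.charset.StandardCharsets;

public class Latin1FastPathSketch {

    // Simplified stand-in: true when b is NOT a UTF-8 continuation byte (10xxxxxx).
    static boolean isNotContinuation(int b) {
        return (b & 0xc0) != 0x80;
    }

    // Simplified stand-in: combines a two-byte UTF-8 sequence into its code point.
    static int decode2(int b1, int b2) {
        return ((b1 & 0x1f) << 6) | (b2 & 0x3f);
    }

    // Hypothetical helper: copies the leading Latin-1 part of UTF-8 input into dst
    // and returns the number of bytes written. dst must be at least `length` long.
    static int decodeLatin1Prefix(byte[] bytes, int offset, int length, byte[] dst) {
        int sl = offset + length;
        int dp = 0;
        // Patched condition: the explicit lower bound is the proposed change.
        while (offset >= 0 && offset < sl) {
            int b1 = bytes[offset];
            if (b1 >= 0) {
                // Plain ASCII byte.
                dst[dp++] = (byte) b1;
                offset++;
                continue;
            }
            if ((b1 == (byte) 0xc2 || b1 == (byte) 0xc3) && offset + 1 < sl) {
                // Two-byte sequence that still fits in Latin-1 (U+0080..U+00FF).
                int b2 = bytes[offset + 1];
                if (!isNotContinuation(b2)) {
                    dst[dp++] = (byte) decode2(b1, b2);
                    offset += 2;
                    continue;
                }
            }
            // Anything that is not Latin-1 ends the fast path.
            break;
        }
        return dp;
    }

    public static void main(String[] args) {
        // "fløde " followed by a Cyrillic letter, written with escapes to avoid
        // source-encoding issues.
        byte[] utf8 = "fl\u00f8de \u042f".getBytes(StandardCharsets.UTF_8);
        byte[] dst = new byte[utf8.length];
        int n = decodeLatin1Prefix(utf8, 0, utf8.length, dst);
        System.out.println("Latin-1 prefix length: " + n);
    }
}
```

The loop stops at the first byte that cannot be represented in Latin-1, which is why the benchmark input ends with a Cyrillic letter: the whole Latin-1 prefix goes through this loop before the fallback path is taken.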
The following benchmark shows the effect of the patch, roughly a 27% reduction in average time:
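    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class StringConstructorBenchmark {
        private byte[] array;

        @Setup
        public void setup() {
            // Latin-1 text ending with a Cyrillic letter
            String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
            array = str.getBytes(StandardCharsets.UTF_8);
        }

        @Benchmark
        public String newString() {
            return new String(array, 0, array.length, StandardCharsets.UTF_8);
        }
    }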
//baseline
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  173,092 ± 3,048  ns/op
//patched
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  126,908 ± 2,355  ns/op
A similar, though smaller, improvement (about 10%) is observed in String.translateEscapes() for the same String as in the benchmark above:
//baseline
Benchmark                                    Mode  Cnt   Score   Error  Units
StringConstructorBenchmark.translateEscapes  avgt  100  53,627 ± 0,850  ns/op
//patched
Benchmark                                    Mode  Cnt   Score   Error  Units
StringConstructorBenchmark.translateEscapes  avgt  100  48,087 ± 1,129  ns/op