Bug ID: JDK-8279888 Local variable independently used by multiple loops can interfere with loop optimizations

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 18,19

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2022-01-11
Updated: 2025-05-16
Resolved: 2022-04-25

JDK 19
19 b20Fixed

JDK-8279833 deals with a point fix where moving a method local variable into two separate scope locals avoids a regression introduced by JEP 254.

I've attached a JMH benchmark based on a slightly modified version of the pre-JDK-8279833 version of String.encodeUTF8_UTF16 (loopsWithSharedLocal), along with a version with the same patch applied as in JDK-8279833 (loopsWithScopedLocal):

Benchmark                        (negativeStart)  Mode  Cnt     Score   Error  Units
LoopLocals.loopsWithScopedLocal             true  avgt    2  2828.665          ns/op
LoopLocals.loopsWithScopedLocal            false  avgt    2  1164.918          ns/op
LoopLocals.loopsWithSharedLocal             true  avgt    2  5355.671          ns/op
LoopLocals.loopsWithSharedLocal            false  avgt    2  1164.860          ns/op

The interesting case is when we're spending most of the time in the second loop (negativeStart = true). When not splitting the locals this variant is about half as fast.

I expect that C2 should be able to act as if the variables were local to each individual scope in general here. I've tried simplifying the benchmark, e.g., by removing the byte[] allocations, but then the issue disappears.

Changeset: 32593df3 Author: Roland Westrelin <roland@openjdk.org> Date: 2022-04-25 09:30:00 +0000 URL: https://git.openjdk.java.net/jdk/commit/32593df392cfd139e10849c2a5db0a377fd1ce9c
25-04-2022
A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7352 Date: 2022-02-04 14:41:55 +0000
04-02-2022
Ok, so JDK-8278518 allows rearranging code like that in some cases, for better and worse, and my patch for JDK-8279833 accidentally prevents this by subtly changing the bytecode. I'm not sure what to make of this, but on a high level I think we should work to reduce the "surprise factor" from subtle (and fragile) changes as the ones in JDK-8279833
14-01-2022
Question: why does defining c in the body of the second loop affect the shape to prohibit this optimization in this case? The shape of the CFG (at the bytecode level) is different in the 2 cases. loopsWithSharedLocal() is compiled down to something like: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; continue; } if (c < 0x800) { dst[dp++] = (byte)(0xc0 \| (c >> 6)); dst[dp++] = (byte)(0x80 \| (c & 0x3f)); continue; } if (Character.isSurrogate(c)) { int uc = -1; char c2; if (Character.isHighSurrogate(c) && sp < sl && Character.isLowSurrogate(c2 = getChar(val, sp))) { uc = Character.toCodePoint(c, c2); } if (uc < 0) { dst[dp++] = '?'; } else { dst[dp++] = (byte)(0xf0 \| ((uc >> 18))); dst[dp++] = (byte)(0x80 \| ((uc >> 12) & 0x3f)); dst[dp++] = (byte)(0x80 \| ((uc >> 6) & 0x3f)); dst[dp++] = (byte)(0x80 \| (uc & 0x3f)); sp++; // 2 chars } continue; } { // 3 bytes, 16 bits dst[dp++] = (byte)(0xe0 \| ((c >> 12))); dst[dp++] = (byte)(0x80 \| ((c >> 6) & 0x3f)); dst[dp++] = (byte)(0x80 \| (c & 0x3f)); continue; } } that is, there are gotos back to the beginning of the loop at the end of every conditional block. loopsWithScopedLocal() has a single goto back to the loop header. That makes a difference because c2 doesn't build the same IR in the 2 cases. In the first case, it sees a loop with several backedges which it transforms to a loop nest, each loop with a single backedge (because that's the canonical representation of a loop in c2). The inner loop is the one that's really getting optimized which in this case is: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else { break; } } While in the second case, there's a single loop with a complicated body.
13-01-2022
Refactored the micros and added a variant with more randomly mixed chars, which still exhibit the issue at hand when I run on a build with the JDK-8278518 patch: Benchmark (variant) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal startNonASCII avgt 5 2679.191 ± 438.853 ns/op LoopLocals.loopsWithScopedLocal endNonASCII avgt 5 1087.806 ± 72.943 ns/op LoopLocals.loopsWithScopedLocal mixed avgt 5 3941.537 ± 237.364 ns/op LoopLocals.loopsWithSharedLocal startNonASCII avgt 5 1592.903 ± 225.704 ns/op LoopLocals.loopsWithSharedLocal endNonASCII avgt 5 1144.281 ± 194.808 ns/op LoopLocals.loopsWithSharedLocal mixed avgt 5 6953.854 ± 630.521 ns/op The mixed variant is synthetic but somewhat more representative of real text, i.e., a mix of ASCII and higher code points. With JDK-8278518 we get a too optimistic speculation of the second loop in the startNonAscii variant since it'll almost only see ASCII bytes in that test. I've uploaded the updated microbenchmark. Since they're potentially misleading then perhaps the start/endNonASCII should be removed from any final version of this.
12-01-2022
Interesting interaction. Question: why does defining c in the body of the second loop affect the shape to prohibit this optimization in this case? (We should probably add a few more mixed micros: the else branch in the second loop might not be rare on more realistic UTF-16 Strings, but skew towards something a bit more balanced. The current micros skew it hard so that only a single branch in the second loop is exercised in each case.)
12-01-2022
I measure this: Benchmark (negativeStart) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal true avgt 15 693.615 ± 28.509 ns/op LoopLocals.loopsWithScopedLocal false avgt 15 403.623 ± 11.235 ns/op LoopLocals.loopsWithSharedLocal true avgt 15 1304.995 ± 305.077 ns/op LoopLocals.loopsWithSharedLocal false avgt 15 409.803 ± 7.183 ns/op With the patch from JDK-8278518: Benchmark (negativeStart) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal true avgt 15 692.950 ± 22.341 ns/op LoopLocals.loopsWithScopedLocal false avgt 15 403.563 ± 8.676 ns/op LoopLocals.loopsWithSharedLocal true avgt 15 491.926 ± 22.503 ns/op LoopLocals.loopsWithSharedLocal false avgt 15 411.230 ± 10.395 ns/op And ScopedLocal is now slower than SharedLocal. The reason I think is that SharedLocal's second loop: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else if (c < 0x800) { .. is compiled as: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else { break; } } with the less common else branch code moved in an outer loop. This allows the inner loop body to optimize much better. This happens at the ciTypeFlow level because of the particular bytecode shape of SharedLocal but I suppose we could get c2 to transform the IR of: for (;;) { if () { // common } else { // uncommon } } into: for (;;) { for (;;) { if () { // common } else { got uncommon; } } break; uncommon: // uncommon } Is it worth it? What do you think [~kvn]?
12-01-2022
ILW = possible performance regression, micro benchmark, no workaround = MMH = P3
12-01-2022
JDK 11u and 17u could be affected to based on JDK-8279833.
11-01-2022
@roland can you look on it?
11-01-2022

Causes :	JDK-8312011 - C2: Runtime increase for a double while loop code
Relates :	JDK-8279833 - Loop optimization issue in String.encodeUTF8_UTF16
Relates :	JDK-8319793 - C2 compilation fails with "Bad graph detected in build_loop_late" after JDK-8279888
Relates :	JDK-8281565 - Regression 5-7% in SPECjvm2008-Compress-G1 on x64 Linux and Mac in 19-b8
Relates :	JDK-8356278 - TestDuplicateBackedge fails due to missing CountedLoop in IR
Relates :	JDK-8293978 - Duplicate simple loop back-edge will crash the vm
Relates :	JDK-8281565 - Regression 5-7% in SPECjvm2008-Compress-G1 on x64 Linux and Mac in 19-b8
Relates :	JDK-8286451 - C2: assert(nb == 1) failed: only when the head is not shared