JDK-8279888 : Local variable independently used by multiple loops can interfere with loop optimizations
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 18,19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-01-11
  • Updated: 2023-11-09
  • Resolved: 2022-04-25
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 19
19 b20Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
JDK-8279833 deals with a point fix where moving a method local variable into two separate scope locals avoids a regression introduced by JEP 254.

I've attached a JMH benchmark based on a slightly modified version of the pre-JDK-8279833 version of String.encodeUTF8_UTF16 (loopsWithSharedLocal), along with a version with the same patch applied as in JDK-8279833 (loopsWithScopedLocal):

Benchmark                        (negativeStart)  Mode  Cnt     Score   Error  Units
LoopLocals.loopsWithScopedLocal             true  avgt    2  2828.665          ns/op
LoopLocals.loopsWithScopedLocal            false  avgt    2  1164.918          ns/op
LoopLocals.loopsWithSharedLocal             true  avgt    2  5355.671          ns/op
LoopLocals.loopsWithSharedLocal            false  avgt    2  1164.860          ns/op

The interesting case is when we're spending most of the time in the second loop (negativeStart = true). When not splitting the locals this variant is about half as fast.

I expect that C2 should be able to act as if the variables were local to each individual scope in general here. I've tried simplifying the benchmark, e.g., by removing the byte[] allocations, but then the issue disappears.
Comments
Changeset: 32593df3 Author: Roland Westrelin <roland@openjdk.org> Date: 2022-04-25 09:30:00 +0000 URL: https://git.openjdk.java.net/jdk/commit/32593df392cfd139e10849c2a5db0a377fd1ce9c
25-04-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7352 Date: 2022-02-04 14:41:55 +0000
04-02-2022

Ok, so JDK-8278518 allows rearranging code like that in some cases, for better and worse, and my patch for JDK-8279833 accidentally prevents this by subtly changing the bytecode. I'm not sure what to make of this, but on a high level I think we should work to reduce the "surprise factor" from subtle (and fragile) changes as the ones in JDK-8279833
14-01-2022

Question: why does defining c in the body of the second loop affect the shape to prohibit this optimization in this case? The shape of the CFG (at the bytecode level) is different in the 2 cases. loopsWithSharedLocal() is compiled down to something like: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; continue; } if (c < 0x800) { dst[dp++] = (byte)(0xc0 | (c >> 6)); dst[dp++] = (byte)(0x80 | (c & 0x3f)); continue; } if (Character.isSurrogate(c)) { int uc = -1; char c2; if (Character.isHighSurrogate(c) && sp < sl && Character.isLowSurrogate(c2 = getChar(val, sp))) { uc = Character.toCodePoint(c, c2); } if (uc < 0) { dst[dp++] = '?'; } else { dst[dp++] = (byte)(0xf0 | ((uc >> 18))); dst[dp++] = (byte)(0x80 | ((uc >> 12) & 0x3f)); dst[dp++] = (byte)(0x80 | ((uc >> 6) & 0x3f)); dst[dp++] = (byte)(0x80 | (uc & 0x3f)); sp++; // 2 chars } continue; } { // 3 bytes, 16 bits dst[dp++] = (byte)(0xe0 | ((c >> 12))); dst[dp++] = (byte)(0x80 | ((c >> 6) & 0x3f)); dst[dp++] = (byte)(0x80 | (c & 0x3f)); continue; } } that is, there are gotos back to the beginning of the loop at the end of every conditional block. loopsWithScopedLocal() has a single goto back to the loop header. That makes a difference because c2 doesn't build the same IR in the 2 cases. In the first case, it sees a loop with several backedges which it transforms to a loop nest, each loop with a single backedge (because that's the canonical representation of a loop in c2). The inner loop is the one that's really getting optimized which in this case is: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else { break; } } While in the second case, there's a single loop with a complicated body.
13-01-2022

Refactored the micros and added a variant with more randomly mixed chars, which still exhibit the issue at hand when I run on a build with the JDK-8278518 patch: Benchmark (variant) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal startNonASCII avgt 5 2679.191 ± 438.853 ns/op LoopLocals.loopsWithScopedLocal endNonASCII avgt 5 1087.806 ± 72.943 ns/op LoopLocals.loopsWithScopedLocal mixed avgt 5 3941.537 ± 237.364 ns/op LoopLocals.loopsWithSharedLocal startNonASCII avgt 5 1592.903 ± 225.704 ns/op LoopLocals.loopsWithSharedLocal endNonASCII avgt 5 1144.281 ± 194.808 ns/op LoopLocals.loopsWithSharedLocal mixed avgt 5 6953.854 ± 630.521 ns/op The mixed variant is synthetic but somewhat more representative of real text, i.e., a mix of ASCII and higher code points. With JDK-8278518 we get a too optimistic speculation of the second loop in the startNonAscii variant since it'll almost only see ASCII bytes in that test. I've uploaded the updated microbenchmark. Since they're potentially misleading then perhaps the start/endNonASCII should be removed from any final version of this.
12-01-2022

Interesting interaction. Question: why does defining c in the body of the second loop affect the shape to prohibit this optimization in this case? (We should probably add a few more mixed micros: the else branch in the second loop might not be rare on more realistic UTF-16 Strings, but skew towards something a bit more balanced. The current micros skew it hard so that only a single branch in the second loop is exercised in each case.)
12-01-2022

I measure this: Benchmark (negativeStart) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal true avgt 15 693.615 ± 28.509 ns/op LoopLocals.loopsWithScopedLocal false avgt 15 403.623 ± 11.235 ns/op LoopLocals.loopsWithSharedLocal true avgt 15 1304.995 ± 305.077 ns/op LoopLocals.loopsWithSharedLocal false avgt 15 409.803 ± 7.183 ns/op With the patch from JDK-8278518: Benchmark (negativeStart) Mode Cnt Score Error Units LoopLocals.loopsWithScopedLocal true avgt 15 692.950 ± 22.341 ns/op LoopLocals.loopsWithScopedLocal false avgt 15 403.563 ± 8.676 ns/op LoopLocals.loopsWithSharedLocal true avgt 15 491.926 ± 22.503 ns/op LoopLocals.loopsWithSharedLocal false avgt 15 411.230 ± 10.395 ns/op And ScopedLocal is now slower than SharedLocal. The reason I think is that SharedLocal's second loop: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else if (c < 0x800) { .. is compiled as: while (sp < sl) { c = getChar(val, sp++); if (c < 0x80) { dst[dp++] = (byte)c; } else { break; } } with the less common else branch code moved in an outer loop. This allows the inner loop body to optimize much better. This happens at the ciTypeFlow level because of the particular bytecode shape of SharedLocal but I suppose we could get c2 to transform the IR of: for (;;) { if () { // common } else { // uncommon } } into: for (;;) { for (;;) { if () { // common } else { got uncommon; } } break; uncommon: // uncommon } Is it worth it? What do you think [~kvn]?
12-01-2022

ILW = possible performance regression, micro benchmark, no workaround = MMH = P3
12-01-2022

JDK 11u and 17u could be affected to based on JDK-8279833.
11-01-2022

@roland can you look on it?
11-01-2022