Bug ID: JDK-8359420 C2: JDK-8323582 slows C2 down by around 2%

JDK 27: Unresolved

Deferring to JDK 27 for now, as this minor regression requires additional work. Please re-target to JDK 26 if a fix becomes ready in time.
07-11-2025
C2 after JDK-8323582 with -XX:-LoopMultiversioning is still around 2% slower than before JDK-8323582. This indicates that most of the overhead may not come from multiversioning itself but perhaps from surrounding changes (see the attached plots).
17-06-2025
We decided to defer it to JDK 26 for now, given that the solutions are not straightforward and are on the riskier side.
16-06-2025
ILW = Small performance regression with Superword changes addressing alignment and aliasing, Dacapo benchmarks, no workaround = HLM = P4
16-06-2025
I investigated the example here a little more: It seems in OSR, we at first have predicates, but then we have to peel to obtain the counted loop shape. And during peeling the predicates are lost, because we insert the peeled instructions.

---------------- before loop opts -------------
(rr) p find_node(84)->dump_bfs(10,find_node(25),"#c")
dist dump
---------------------------------------------
10 25 CallLeaf === 5 1 7 8 1 (10 ) [[ 26 28 ]] # OSR_migration_end void ( rawptr:BotPTR ) !jvms: Test::test @ bci:2 (line 12)
9 26 Proj === 25 [[ 40 ]] #0 !jvms: Test::test @ bci:2 (line 12)
8 40 ParsePredicate === 26 39 [[ 41 50 ]] #Loop !jvms: Test::test @ bci:2 (line 12)
7 50 IfTrue === 40 [[ 51 ]] #1 !jvms: Test::test @ bci:2 (line 12)
6 51 ParsePredicate === 50 39 [[ 52 61 ]] #Profiled_Loop !jvms: Test::test @ bci:2 (line 12)
5 61 IfTrue === 51 [[ 62 ]] #1 !jvms: Test::test @ bci:2 (line 12)
4 62 ParsePredicate === 61 39 [[ 63 72 ]] #Auto_Vectorization_Check !jvms: Test::test @ bci:2 (line 12)
3 72 IfTrue === 62 [[ 73 ]] #1 !jvms: Test::test @ bci:2 (line 12)
2 73 ParsePredicate === 72 39 [[ 74 83 ]] #Loop_Limit_Check !jvms: Test::test @ bci:2 (line 12)
1 83 IfTrue === 73 [[ 84 ]] #1 !jvms: Test::test @ bci:2 (line 12)
0 84 Region === 84 180 83 [[ 84 101 88 189 ]] #reducible !jvms: Test::test @ bci:2 (line 12)

------------- after peeling -----------
(rr) p find_node(221)->dump_bfs(100,find_node(25),"#c")
dist dump
---------------------------------------------
14 25 CallLeaf === 5 1 7 8 1 (10 ) [[ 26 28 ]] # OSR_migration_end void ( rawptr:BotPTR ) !jvms: Test::test @ bci:2 (line 12)
13 26 Proj === 25 [[ 194 ]] #0 !jvms: Test::test @ bci:2 (line 12)
12 194 If === 26 98 [[ 195 196 ]] P=0.999999, C=-1.000000
11 195 IfTrue === 194 [[ 40 110 ]] #1
10 40 ParsePredicate === 195 39 [[ 193 50 ]] #Loop !jvms: Test::test @ bci:2 (line 12)
9 50 IfTrue === 40 [[ 51 ]] #1 !jvms: Test::test @ bci:2 (line 12)
8 51 ParsePredicate === 50 39 [[ 52 61 ]] #Profiled_Loop !jvms: Test::test @ bci:2 (line 12)
7 61 IfTrue === 51 [[ 62 ]] #1 !jvms: Test::test @ bci:2 (line 12)
6 62 ParsePredicate === 61 39 [[ 63 72 ]] #Auto_Vectorization_Check !jvms: Test::test @ bci:2 (line 12)
5 72 IfTrue === 62 [[ 73 ]] #1 !jvms: Test::test @ bci:2 (line 12)
4 73 ParsePredicate === 72 39 [[ 74 83 ]] #Loop_Limit_Check !jvms: Test::test @ bci:2 (line 12)
3 83 IfTrue === 73 [[ 212 ]] #1 !jvms: Test::test @ bci:2 (line 12)
2 212 If === 83 203 [[ 213 222 ]] P=1.000000, C=15360.000000 !orig=116 !jvms: Test::test @ bci:7 (line 12)
1 213 IfTrue === 212 [[ 221 ]] #1 !orig=117 !jvms: Test::test @ bci:7 (line 12)
0 221 Loop === 221 213 117 [[ 214 221 234 237 ]] partial_peel !orig=[197]

------- analysis -----
We see that the "212 If" is inserted between the predicates and the Loop. That means we lost the predicates. From what I see, this "212 If" is the exit condition of the loop, that was peeled out now. Of course this is a simple loop, and so in more complicated cases the graph might look very different.
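As a rough source-level analogy of what the dumps above show (C2 works on its IR graph, not on Java source, so this is only a sketch under that assumption): peeling the exit condition out of the loop leaves a guarding `if` in front of the loop head, which is the position the "212 If" takes between the predicates and the Loop. The class and method names are hypothetical.

```java
public class PeelSketch {
    // Original shape: the exit condition is tested at the loop head.
    static int sumOriginal(int[] a, int start) {
        int i = start;
        int sum = 0;
        while (true) {
            if (i >= a.length) {
                break;          // exit condition inside the loop
            }
            sum += a[i];
            i++;
        }
        return sum;
    }

    // Peeled shape: one copy of the exit condition now sits in front of the loop.
    static int sumPeeled(int[] a, int start) {
        int i = start;
        int sum = 0;
        // (In the IR, the Parse Predicates sit above this point.)
        if (i < a.length) {     // peeled exit condition, analogous to "212 If"
            do {
                sum += a[i];
                i++;
            } while (i < a.length);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(sumOriginal(a, 1) == sumPeeled(a, 1));
    }
}
```

Both methods compute the same result; the only difference is where the first exit test lives. Anything that has to be directly above the loop head, as the predicates do, is now separated from it by the peeled test.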
13-06-2025
Background information:

Auto vectorization sometimes requires runtime checks:
- JDK-8323582: check if the vectors are aligned
- JDK-8324751: check if the memory references alias

We have two options how to add these runtime checks:
- Predicate: if it fails we deopt
- Multiversioning: duplicate the loop, have a fast and slow loop.

In regular compilation, we at first have predicates, and so we can speculate that the checks always pass. If a check fails, we deopt and recompile without predicates. Then, we have to use multiversioning instead, to ensure we have optimal vectorized performance when the check passes, and still reasonably good non-vectorized but still compiled performance if the check fails.

In OSR, we currently are not able to produce the predicates, and so we resort to multiversioning directly.

An additional difficulty: if there are no predicates, we must decide if we want to use multiversioning when the loop is still in its single iteration state. But at this point, we have no idea if we will ever be able to vectorize and if we would need a runtime check. So currently, we just multiversion in all cases, and if we do not vectorize, or vectorize without a runtime check, we eventually drop the slow-loop.

Further: until we are sure that we need the slow-loop, i.e. when we insert runtime checks for vectorization, we keep the slow-loop in a "delayed" mode, so that it is not yet further optimized. This means we save some compile time there. But we cannot avoid doing anything with the delayed slow-loop: we still have to include it in the loop-tree and run IGVN over its nodes, so that adds some minimal amount of compile time.

I see a few options for mitigation:
- If anybody is heavily affected by this, then disable multiversioning for now: -XX:-LoopMultiversioning
- Alternative1: disable multiversioning during OSR. Because in OSR, we often struggle to add the predicates, and without predicates we resort to multiversioning. Risk: if we need runtime-checks for auto vectorization, then we cannot vectorize if there are neither predicates nor multiversioning.
- Alternative2: make sure that the predicates exist in OSR. That way, we can use the predicate and do not have to resort to multiversioning. This would be optimal.
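To make the fast-loop/slow-loop idea concrete, here is a minimal source-level sketch. It is only an analogy: C2 does this on its IR, and every name below is hypothetical. The "fast" loop imitates a vectorized loop by loading four elements before storing them, which is only correct when the source and destination ranges do not overlap. The selector `if` plays the role of the runtime check.

```java
public class MultiversionSketch {
    // Hypothetical runtime check: true if the two ranges cannot overlap.
    static boolean noAlias(int[] a, int aOff, int[] b, int bOff, int len) {
        return a != b || aOff + len <= bOff || bOff + len <= aOff;
    }

    // Slow loop: plain scalar semantics, always correct.
    static void slowLoop(int[] a, int aOff, int[] b, int bOff, int len) {
        for (int i = 0; i < len; i++) {
            b[bOff + i] = a[aOff + i] + 1;
        }
    }

    // Fast loop: imitates 4-lane vector code (load all lanes, then store all lanes).
    // Only equivalent to slowLoop when the ranges do not alias.
    static void fastLoop(int[] a, int aOff, int[] b, int bOff, int len) {
        final int lanes = 4;
        int[] vec = new int[lanes];
        int i = 0;
        for (; i + lanes <= len; i += lanes) {
            for (int l = 0; l < lanes; l++) {
                vec[l] = a[aOff + i + l] + 1;   // "vector load" and add
            }
            for (int l = 0; l < lanes; l++) {
                b[bOff + i + l] = vec[l];       // "vector store"
            }
        }
        for (; i < len; i++) {                  // scalar tail
            b[bOff + i] = a[aOff + i] + 1;
        }
    }

    // Multiversioned loop: the selector picks the fast or the slow version.
    static void multiversioned(int[] a, int aOff, int[] b, int bOff, int len) {
        if (noAlias(a, aOff, b, bOff, len)) {
            fastLoop(a, aOff, b, bOff, len);
        } else {
            slowLoop(a, aOff, b, bOff, len);
        }
    }

    public static void main(String[] args) {
        int[] x = new int[9];
        multiversioned(x, 0, x, 1, 8);          // overlapping ranges take the slow loop
        System.out.println(java.util.Arrays.toString(x));
    }
}
```

With a predicate instead of multiversioning, the `else` branch would be a deoptimization rather than a second compiled loop, which is why the predicate variant does not need to duplicate the loop.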
13-06-2025
I wrote a very simple Test.java. I repeat compilation 100x.

For good benchmarking, I did:
echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

Running it with multiversioning enabled:
./java -XX:CompileCommand=compileonly,Test::test -Xbatch -XX:+CITime -XX:-TieredCompilation -XX:RepeatCompilation=100 -XX:+LoopMultiversioning -XX:CompileCommand=printcompilation,Test::test -XX:-TraceLoopMultiversioning Test.java
C2 Compile Time: 5.751 s
...
IdealLoop: 3.701 s
AutoVectorize: 0.259 s

And with multiversioning disabled:
./java -XX:CompileCommand=compileonly,Test::test -Xbatch -XX:+CITime -XX:-TieredCompilation -XX:RepeatCompilation=100 -XX:-LoopMultiversioning -XX:CompileCommand=printcompilation,Test::test -XX:-TraceLoopMultiversioning Test.java
C2 Compile Time: 5.650 s
...
IdealLoop: 3.628 s
AutoVectorize: 0.260 s

The impact is not huge, but it is about 0.1s out of 5.7s, so about 2% as you measured.

Interesting about this case: we indeed only multiversion in OSR, but not regular compilation:
./java -XX:CompileCommand=compileonly,Test::test -Xbatch -XX:-TieredCompilation -XX:+LoopMultiversioning -XX:CompileCommand=printcompilation,Test::test -XX:+TraceLoopMultiversioning Test.java
CompileCommand: compileonly Test.test bool compileonly = true
CompileCommand: PrintCompilation Test.test bool PrintCompilation = true
4142 98 % b Test::test @ 2 (26 bytes)
Multiversion Loop: N255/N117 counted [int,int),+1 (2147483648 iters) rc has_sfpt strip_mined
Loop Multiversioning:
- Loop-Selector-If: 259 If
- True-Path-Loop (=Orig / Fast): 255 CountedLoop
- False-Path-Loop (=Clone / Slow): 273 CountedLoop
4158 99 b Test::test (26 bytes)
4175 100 % b Test::test @ 2 (26 bytes)
Multiversion Loop: N250/N117 counted [int,int),+1 (15364 iters) rc has_sfpt strip_mined
Loop Multiversioning:
- Loop-Selector-If: 254 If
- True-Path-Loop (=Orig / Fast): 250 CountedLoop
- False-Path-Loop (=Clone / Slow): 268 CountedLoop
4190 101 b Test::test (26 bytes)

This makes sense: in OSR compilation, we struggle to generate the predicates before the loops. If there are predicates like in regular compilation, then we do not have to multiversion, and we do not have to duplicate the loop.
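The "about 2%" figure can be checked from the -XX:+CITime numbers quoted above. The small helper below is only a sketch for that arithmetic; the class and method names are made up, and the inputs are the four timings from this comment.

```java
import java.util.Locale;

public class CompileTimeOverhead {
    // Relative overhead of the "enabled" run over the "disabled" run, in percent.
    static double overheadPercent(double enabledSeconds, double disabledSeconds) {
        return (enabledSeconds - disabledSeconds) / disabledSeconds * 100.0;
    }

    public static void main(String[] args) {
        // Timings copied from the two runs above (+LoopMultiversioning vs -LoopMultiversioning).
        System.out.printf(Locale.ROOT, "C2 Compile Time: %.2f%%%n", overheadPercent(5.751, 5.650));
        System.out.printf(Locale.ROOT, "IdealLoop:       %.2f%%%n", overheadPercent(3.701, 3.628));
    }
}
```

This gives roughly 1.8% for total C2 compile time and roughly 2.0% for IdealLoop, so almost all of the measured difference in this single-loop test sits in loop optimizations. The AutoVectorize phase is essentially unchanged (0.259 s vs 0.260 s).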
13-06-2025
I'll also try to create a smaller reproducer. Probably a single loop would already do the trick.
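For illustration only, a single-loop test of the kind meant here might look like the sketch below. This is NOT the attached Test.java; the class name, array size and repetition count are assumptions. The idea is just that one long-running counted loop over an array gives C2 both an OSR compilation (entered in the middle of the loop) and a regular compilation of the same method.

```java
public class SingleLoopSketch {
    // One simple counted loop over an array: a typical auto-vectorization candidate.
    static void test(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i]++;
        }
    }

    public static void main(String[] args) {
        // Sizes are arbitrary assumptions: the loop runs long enough within one call
        // that an OSR compilation is likely, and the method is called often enough
        // that a regular compilation is likely as well.
        int[] a = new int[100_000];
        for (int rep = 0; rep < 100; rep++) {
            test(a);
        }
        System.out.println(a[0]);
    }
}
```

To try it with the flags used elsewhere in this issue, the CompileCommand patterns would need to name SingleLoopSketch::test instead of Test::test.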
13-06-2025
[~rcastanedalo] Thanks for the report! I'd have to do some specific digging here.

In JDK-8323582, I added the auto-vectorization predicate and multiversioning. When the predicate is available, we should not have an impact on performance. But if there is no predicate (e.g. during OSR), then we multiversion, and that means more nodes, and probably that slows things down a bit. That would not be surprising, because loop-opts can be an expensive part of compilation, and adding more loops in the multiversioning might make things even slower.

There are probably some things we could do here. But we'd have to verify first that Multiversioning is indeed the issue. Since you have it all set up already: would you mind running once with and once without multiversioning?
-XX:+LoopMultiversioning
-XX:-LoopMultiversioning

If it is indeed multiversioning during OSR, then we could think about disabling multiversioning during OSR.

FYI: multiversioning is not just important for JDK-8323582, but also for JDK-8324751.
13-06-2025