JDK-8351409 : Avoid scalar cmov in extreme long min/max loop scenarios
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 25,26
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2025-03-07
  • Updated: 2025-09-02
Description
The fix for JDK-8307513 (PR https://github.com/openjdk/jdk/pull/20098) introduced edge cases where performance degradations can be observed. These regressions can be summarised as follows:

Regression 1: Given a loop with a long min/max reduction pattern with one side of the branch taken nearly 100% of the time, when SuperWord finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.

Regression 2: Given a loop with a long min/max reduction pattern with one side of the branch taken nearly 100% of the time, when the platform does not support the vector instructions to achieve this (e.g. AVX-512 quadword vpmax/vpmin), then HotSpot will use scalar instructions (cmov) and performance will regress.

Regression 3: Given a loop with a long min/max non-reduction pattern (e.g. longLoopMax) with one side of the branch taken nearly 100% of the time, when the platform does not vectorize it (either for lack of CPU instruction support, or because SuperWord finds it not profitable), then HotSpot will use scalar instructions (cmov) and performance will regress.

What all these regressions have in common is that in these extreme scenarios the compiler emits scalar cmov instructions. So, the idea to fix this would be to detect these extreme scenarios and use branching code instead (e.g. cmp + mov).
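The loop shape common to the regressions above can be sketched as follows. This is an illustrative reconstruction, not one of the issue's benchmarks; the class and method names are made up. Each iteration's Math.max depends on the previous accumulator value, so a scalar cmov forms a serial latency chain through the loop, while a well-predicted branch would let the CPU speculate past it.

```java
// Hypothetical sketch of the long max reduction pattern from the
// regressions: with branch-biased input (one side taken ~100% of the
// time), C2 currently emits a scalar cmov when vectorization fails.
public class LongMaxReduction {
    static long maxOf(long[] a) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            // Intrinsified; each step depends on the previous one,
            // so the cmovs form a serial dependency chain.
            max = Math.max(max, a[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        long[] a = new long[1024];
        // Ascending input: the "new max" side is taken ~100% of the time.
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(maxOf(a)); // prints 1023
    }
}
```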
Comments
Reassigned to myself as I start looking into how to implement the reassociation approach. Thanks [~epeter] and [~qamai]!
02-09-2025

[~galder] I quickly hard-coded a benchmark that has a re-associated reduction order. And it is even faster than either the branching or cmove approach!

[empeter@emanuel bin]$ /home/empeter/Documents/oracle/jdk-25.0.1/bin/java -XX:CompileCommand=compileonly,Test*::test TestWithPrimitiveAndIfReassociated.java
CompileCommand: compileonly Test*.test bool compileonly = true
Warmup
Running
elapsed: 454519078
elapsed: 453438869
elapsed: 452792577
elapsed: 452490211
elapsed: 452410593
elapsed: 453310977
elapsed: 452583189
elapsed: 454780144
elapsed: 459231304
elapsed: 452641150
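A re-associated reduction order of the kind mentioned above can be sketched roughly as follows. This is a hand-written illustration under my own assumptions, not the attached benchmark; names are made up. Splitting the accumulator breaks the single serial cmov chain into several independent chains that the CPU can execute in parallel, which is why this can beat both the cmov and the branching variants.

```java
// Hypothetical sketch: four independent accumulators break the
// serial dependency chain of a long max reduction.
public class ReassociatedMax {
    static long maxReassociated(long[] a) {
        long m0 = Long.MIN_VALUE, m1 = Long.MIN_VALUE;
        long m2 = Long.MIN_VALUE, m3 = Long.MIN_VALUE;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            // The four max operations are independent of each other.
            m0 = Math.max(m0, a[i]);
            m1 = Math.max(m1, a[i + 1]);
            m2 = Math.max(m2, a[i + 2]);
            m3 = Math.max(m3, a[i + 3]);
        }
        for (; i < a.length; i++) m0 = Math.max(m0, a[i]); // tail
        return Math.max(Math.max(m0, m1), Math.max(m2, m3));
    }
}
```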
18-08-2025

[~galder] Adding another reduction benchmark which cannot vectorize because of control flow (secondary exit). We may at some point be able to vectorize that one, but there will always be more examples that cannot vectorize.

[empeter@emanuel bin]$ /home/empeter/Documents/oracle/jdk-24.0.2/bin/java -XX:CompileCommand=compileonly,Test*::test TestWithPrimitiveAndIf.java
CompileCommand: compileonly Test*.test bool compileonly = true
Warmup
Running
elapsed: 520682141
elapsed: 521333123
elapsed: 519278126
elapsed: 517268944
elapsed: 518391889
elapsed: 525869308
elapsed: 519731880
elapsed: 524735159
elapsed: 523835247
elapsed: 522167307

[empeter@emanuel bin]$ /home/empeter/Documents/oracle/jdk-25.0.1/bin/java -XX:CompileCommand=compileonly,Test*::test TestWithPrimitiveAndIf.java
CompileCommand: compileonly Test*.test bool compileonly = true
Warmup
Running
elapsed: 814749503
elapsed: 819140631
elapsed: 826808455
elapsed: 811745721
elapsed: 812361086
elapsed: 812100515
elapsed: 812073014
elapsed: 812355067
elapsed: 811596182
elapsed: 811464468

Cases like these make me question the idea that these reductions are corner cases. They are probably not much less common than min/max reductions in general. I also had a benchmark with an Array of objects that wrap a long value. But here I could not see a performance difference between JDK24 and JDK25: it seems the bottleneck is chasing the object pointers from the array to the objects where the values are located. The reduction chain of cmoves is not the limiting factor any more.
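A reduction with a secondary exit of the kind described above might look like the following sketch (my own illustrative reconstruction, not the attached TestWithPrimitiveAndIf benchmark; the class and method names are made up). The early return creates a second loop exit, which keeps SuperWord from vectorizing the reduction, so the scalar cmov chain remains on the critical path.

```java
// Hypothetical sketch: the early return is a secondary loop exit,
// which currently prevents vectorization of the max reduction.
public class MaxWithSecondaryExit {
    static long maxUntilNegative(long[] a) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < 0) return max; // secondary exit
            max = Math.max(max, a[i]);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxUntilNegative(new long[]{4, 7, -1, 100})); // prints 7
    }
}
```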
18-08-2025

[~galder] I've looked at your report. There are some quite significant regressions from JDK-8307513, where we now have CMoves on the critical latency path of a reduction. On darwin/aarch64 you reported a 266% regression on some benchmarks. That is a lot! And I'm not yet fully convinced that it is an edge case. But I'll discuss it with [~thartmann], and see if he thinks it is ok too.
18-08-2025

Just a quick note: We will also see regressions in loops where Long.min/max is on the critical latency chain, and does not get vectorized because we never reach SuperWord, or because we cannot match the loop pattern for other reasons. It is not really a "profitability" issue, but rather a pattern match issue. There will always be code shapes that we cannot teach the auto vectorizer to handle. How could we deal with these critical latency chains?
- Turn them into branches, so the branch predictor can fast-forward over them. Tricky, because we would probably need to know the branching probability, and that is difficult (profile pollution because of different use sites for Long.min/max; extracting the probability now that we already intercept/intrinsify the Long.min/max call and do not inline; and keeping the probability stored in a MinL/MaxL node has other downsides such as global value numbering).
- Alternative: detect critical latency chains, and break them by reassociating. That would actually be more of an IGVN or other phase that could do that. And it would not just apply to Long.min/max, but also Integer.min/max (see JDK-8352082), and even other (binary) operators (add, mul, ...), see JDK-8345245.
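The two scalar code shapes contrasted above can be sketched side by side (an illustrative sketch; the class and method names are made up). Both compute the same result; they differ only in the machine code C2 would typically emit and hence in how the dependency chain behaves.

```java
// Hypothetical sketch of the two scalar alternatives for a max step.
public class ScalarShapes {
    // cmov-style: branchless select; constant latency, but every step
    // sits on a serial dependency chain through the accumulator.
    static long maxCmovStyle(long max, long v) {
        return Math.max(max, v);
    }

    // branch-style: compare + conditional jump + mov; when one side is
    // taken ~100% of the time, the predictor lets the CPU run ahead,
    // breaking the latency chain.
    static long maxBranchStyle(long max, long v) {
        if (v > max) max = v;
        return max;
    }
}
```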
18-08-2025

I've been experimenting with branching-based intrinsics for MinL/MaxL to see how they would perform. I've just attached a couple of patches that I used for the experiments. I also summarised my experiments, added performance numbers, and provided a conclusion in the attached PDF. The bottom line is that although branching can offer some advantages, the improvements are not seen across the board. Implementing a solution that targets only the advantageous situations does not seem worth it. I'm unassigning myself from this issue.
24-07-2025

JDK-8357530 offers some mitigation when profitability is the reason not to vectorize. As mentioned in the issue/PR, the flag might not help the end user much unless they can specifically pinpoint an issue in this area, so benchmarking is where it can help the most.
27-05-2025

I've created JDK-8352082 for the int case.
14-03-2025

Re: int case I will try to compile a similar list of regressions and see how likely the circumstances are.
13-03-2025

Re: int case I think it should be a separate RFE because the circumstances under which it can happen are narrower. For example, vectorized int min/max support is available on a wider set of architectures because AVX-512 is not needed, and IIRC it still works with the smaller 128-bit registers on aarch64. I think such an RFE could do with listing the situations, just like it's done for long above.
13-03-2025

ILW = Minor performance regression, in edge cases with CMOV, lower ConditionalMoveLimit (?) = MLM = P4
10-03-2025

[~galder] Would you also tackle the int case, or should we file a separate RFE for that? See: https://github.com/openjdk/jdk/pull/20098#issuecomment-2662706564
07-03-2025