JDK-8344085 : C2 SuperWord: improve vectorization for small loop iteration count
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2024-11-13
  • Updated: 2025-10-14
Description
SuperWord is well suited for loops with a high iteration count, where we spend most of the time in the super-unrolled main-loop. This gives us good throughput.

But for small iteration counts, there are some issues:
- We align the loop in the pre-loop. This can take up to vector-length (the number of elements per vector) iterations.
- At the zero-trip guard of the main-loop, we then check whether enough iterations are left to enter the super-unrolled main-loop. This requires super-unroll-factor * vector-length remaining iterations. If we have fewer, we go directly into the post-loop and never enter vectorized code at all.
- This is why we filed JDK-8307084: if we do not have enough iterations for the super-unrolled main-loop, we should at least enter the vectorized post-loop. That has lower throughput, but is useful not just for cleanup after the main-loop but also for small iteration counts where we cannot use the main-loop.
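The pre/main/post decomposition described above can be sketched as a scalar model in plain Java (an illustrative sketch only, not the compiler's actual IR; VLEN and UNROLL are assumed values, and the pre-loop is simplified to the worst-case misalignment):

```java
// Scalar model of C2's pre/main/post loop structure for an array copy.
public class LoopPhases {
    static final int VLEN = 8;   // assumed vector length (elements per vector)
    static final int UNROLL = 4; // assumed super-unroll factor

    // Copies b into a; returns how many iterations ran in pre/main/post loops.
    static int[] copyPhased(int[] a, int[] b) {
        int n = a.length;
        int[] trips = new int[3]; // iterations executed in pre, main, post
        int i = 0;
        // Pre-loop: assume worst-case misalignment, i.e. VLEN-1 scalar iterations.
        int preEnd = Math.min(n, VLEN - 1);
        for (; i < preEnd; i++) a[i] = b[i];
        trips[0] = i;
        // Zero-trip guard: enter the main-loop only if UNROLL * VLEN iterations remain.
        while (n - i >= UNROLL * VLEN) {
            // Stand-in for the super-unrolled, vectorized main-loop body.
            for (int k = 0; k < UNROLL * VLEN; k++, i++) a[i] = b[i];
            trips[1] += UNROLL * VLEN;
        }
        // Post-loop: scalar cleanup. For small n, ALL remaining work lands here.
        for (; i < n; i++) { a[i] = b[i]; trips[2]++; }
        return trips;
    }
}
```

With these assumed factors, a 20-iteration loop spends 7 iterations in the pre-loop and the remaining 13 in the scalar post-loop: the guard needs 32 remaining iterations, so the vectorized main-loop never runs.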

One idea was JDK-8308994. That may work for some platforms, though it is maybe not worth the complexity. Still, some interesting data was gathered in the plots / benchmarks, see:
https://github.com/openjdk/jdk/pull/14581
Look at the saw-tooth plots it produces: we can see that there are distinct phases.

We should have a similarly good benchmark, so that we can experiment effectively.
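A minimal sketch of such a benchmark in plain Java (not JMH; the class name, rep counts, and kernel are made up for illustration, and the timings are only indicative, not rigorous):

```java
// Sweep over small iteration counts, looking for the saw-tooth phases
// described above as n crosses pre-loop / main-loop thresholds.
public class SawtoothSweep {
    // Simple vectorizable kernel: a[i] = b[i] + c[i] for the first n elements.
    static void kernel(int[] a, int[] b, int[] c, int n) {
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
    }

    public static void main(String[] args) {
        final int max = 128;
        final int reps = 20_000;
        int[] a = new int[max], b = new int[max], c = new int[max];
        for (int i = 0; i < max; i++) { b[i] = i; c[i] = 2 * i; }
        // Warm up so C2 compiles kernel() before we measure.
        for (int w = 0; w < 100_000; w++) kernel(a, b, c, max);
        for (int n = 1; n <= max; n++) {
            long t0 = System.nanoTime();
            for (int r = 0; r < reps; r++) kernel(a, b, c, n);
            long t1 = System.nanoTime();
            System.out.printf("n=%3d ns/op=%.1f%n", n, (t1 - t0) / (double) reps);
        }
    }
}
```

Plotting ns/op against n should reproduce the saw-tooth shape from the PR above if the pre/main/post phase transitions matter on the machine under test.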

Another issue is the pre-loop: it aligns the vectors for the main-loop. But it seems that on modern CPUs such alignment is not as performance-relevant as on older CPUs. So here we waste iterations that we might need to enter the vectorized loops at all.
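The cost of that alignment can be made concrete with a small sketch (the element size and vector width are assumed example values, not what C2 necessarily uses):

```java
// Sketch: how many scalar pre-loop iterations alignment can cost,
// assuming 4-byte elements and a 32-byte vector width (example values).
public class AlignmentCost {
    static int preLoopIters(long baseAddr, int elemSize, int vectorBytes) {
        long misalign = baseAddr % vectorBytes;
        if (misalign == 0) return 0;
        // Scalar iterations needed until the address is vector-aligned.
        return (int) ((vectorBytes - misalign) / elemSize);
    }
}
```

For a worst-case misaligned base, that is 7 scalar iterations with these numbers; for a 10-iteration loop, most of the work is then burned in the pre-loop before vectorized code is even reachable.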
Comments
[~qamai] That is surely an option I'm considering. It would be relatively easy to implement given the machinery we have now. But there is quite a bit of overhead for multiversioning as we have it now: more code, less locality, more cache pressure. We could of course do the trap/multiversion approach, just like with aliasing runtime checks; that would help as long as all iteration counts are small. In a sense, "emit vectorized loops with different vector sizes", i.e. drain loops with progressively smaller sizes, is also a form of multiversioning. If I'm not mistaken, I've seen Graal do the "progressively smaller vector" approach. I will run the benchmarks with Graal, and see if that might be a good approach. Saves us needing to write the whole experiment ;)
14-10-2025
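The "progressively smaller vector" drain-loop idea from the comment above can be modeled in scalar Java (an illustrative sketch only; a real implementation would emit one vector operation per chunk instead of just advancing the index):

```java
// Scalar model of drain loops with progressively smaller "vector" sizes:
// consume the trip count with width-8 chunks, then width-4, 2, and 1.
public class DrainLoops {
    static int[] chunksByWidth(int tripCount) {
        int[] chunks = new int[4]; // chunks processed at widths 8, 4, 2, 1
        int i = 0;
        int idx = 0;
        for (int w = 8; w >= 1; w /= 2, idx++) {
            while (tripCount - i >= w) {
                i += w;       // stand-in for one vector op of width w
                chunks[idx]++;
            }
        }
        return chunks;
    }
}
```

For example, a 23-iteration loop would run two width-8 chunks and one chunk each at widths 4, 2, and 1, so even a count below the full vector length executes mostly "vectorized" chunks instead of falling back to an all-scalar post-loop.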

[~epeter] I wonder: can/should we multi-version on the loop iteration count, too?
14-10-2025

We could also come back to [~qamai]'s idea, and drop the pre-loop for loops with a sufficiently small iteration count. That may be a little difficult if we don't know the iteration count statically, but there are plenty of cases where the init/limit are statically known. What we don't want to do: drop the pre-loop and lose alignment, and then suddenly we deal with large arrays and get a massive performance loss (20%+) due to misalignment.

I suppose we could also go the traditional loop-vectorizer way, and analyze a single iteration of the loop. Then find a way to "widen" the instructions to vectors directly, skipping all the unrolling steps in loop-opts, and doing "unroll + vectorize" in one step. This would also allow us to emit loops for different vector sizes, if the analysis is able to deal with multiple "widening factors". As a consequence, we may then be able to go a step further and implement JDK-8308994 properly on machines that support masked loads/stores efficiently.
14-10-2025

[~pminborg] just reported a new case: JDK-8368245. The issue is that we keep making SuperWord better, which gives us nice speedups for medium-to-large iteration counts. But each such improvement can also create a regression for small iteration count loops. That's not great.
14-10-2025

[~pminborg] sees performance issues in MemorySegment loops, especially for small iteration counts: JDK-8343773. This is not surprising, just another reason to keep working on this issue.

I had another idea: for small iteration counts, alignment is probably not very important. JDK-8355094 showed that alignment matters especially for larger iteration counts, since it avoids split loads and stores, which increase the micro-op count in the CPU pipeline. But if we only have very few iterations anyway, it may be better to spend one more iteration in vectorized code rather than spending all of those iterations in the pre-loop for alignment and the post-loop for cleanup.

FYI: we can already simulate disabling automatic alignment with: -XX:+UnlockDiagnosticVMOptions -XX:SuperWordAutomaticAlignment=0

I predict that this could improve speed for small iteration counts, but worsen performance for large iteration counts, as split memops become more relevant.
28-05-2025
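For reference, an A/B run with the diagnostic flags quoted in the comment above might look like this (the benchmark class name is hypothetical; the flags require a JDK where these diagnostic options are available):

```shell
# Baseline: default SuperWord automatic alignment.
java SmallLoopBenchmark

# Same benchmark with automatic alignment disabled
# (diagnostic flags quoted in the comment above).
java -XX:+UnlockDiagnosticVMOptions -XX:SuperWordAutomaticAlignment=0 SmallLoopBenchmark
```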

[~qamai] I'm not sure we can drop the pre-loop completely. RangeCheck elimination sometimes also uses the pre/post loops: basically, we restrict the main-loop to the "safe" region where we know the range checks pass, leaving the iterations before and after to the pre/post loops. But at least we could limit the pre-loop to a single iteration - that should be relatively easy.
21-11-2024

I think dropping the pre-loop is reasonable. I think clang also does not introduce a pre-loop: https://godbolt.org/z/Whf38edj9

Another benefit of dropping the pre-loop is that it is then easier to reason about the trip count of the main loop. Currently, we cannot fully unroll this loop after vectorization:

    for (int i = 0; i < 64; i++) { a[i] = b[i]; }

Or eliminate the post-loop in these kinds of loops:

    for (int i = 0; i < x * 64; i++) { a[i] = b[i]; }
16-11-2024
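The trip-count argument in the comment above can be made concrete with a small sketch (an 8-lane vector is an assumed example width):

```java
// Sketch of the trip-count arithmetic: without a pre-loop, a loop of
// n iterations splits into n / vlen vector iterations plus an n % vlen
// scalar post-loop. With a pre-loop of statically unknown length, neither
// count is a compile-time constant. (vlen = 8 is an assumed vector length.)
public class TripCount {
    static int[] trips(int n, int vlen) {
        return new int[] { n / vlen, n % vlen };
    }
}
```

With vlen = 8, the constant 64-iteration copy above becomes exactly 8 vector iterations with an empty post-loop, so it could be fully unrolled; and for a limit of x * 64 the post-loop count is always 0, so the post-loop could be eliminated.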