Bug ID: JDK-8368061 C2 SuperWord: allow more control over loop unrolling and super-unrolling

Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 26

Priority: P4
Status: Open
Resolution: Unresolved

Submitted: 2025-09-19
Updated: 2025-09-19

The flag LoopMaxUnroll currently behaves somewhat unexpectedly, and does not allow explicit control over unrolling before vectorization and after vectorization.

Such control would be quite helpful for debugging performance issues, such as those encountered in JDK-8367158, where we want to compare different unrolling factors of vectorized code. For example, we may want to compare auto-vectorized code with different super-unrolling factors against the fill/copy intrinsics.

--------------------- How LoopMaxUnroll currently works ----------------

The description of the flag is a bit weak:

  product(intx,  LoopMaxUnroll, 16,                                         \
          "Maximum number of unrolls for main loop")                        \
          range(0, max_jint)                                                \

It seems to suggest it might cover the factor of "pre-vec-unroll * super-unroll", i.e. total unroll. But that does not seem to be the case.

In IdealLoopTree::policy_unroll:

  _local_loop_unroll_limit  = LoopUnrollLimit;
  _local_loop_unroll_factor = 4;
  int future_unroll_cnt = cl->unrolled_count() * 2;
  if (!cl->is_vectorized_loop()) {
    if (future_unroll_cnt > LoopMaxUnroll) return false;
  } else {
    // obey user constraints on vector mapped loops with additional unrolling applied
    int unroll_constraint = (cl->slp_max_unroll()) ? cl->slp_max_unroll() : 1;
    if ((future_unroll_cnt / unroll_constraint) > LoopMaxUnroll) return false;
  }

So when we are checking if we can unroll, there are these cases:
- scalar loop -> do not unroll more than LoopMaxUnroll
- vector main loop -> do not unroll more than LoopMaxUnroll*slp_max_unroll
- vector drain loop -> do not unroll more than LoopMaxUnroll. Q: can that lead to super-unrolling? Probably not...?
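
To make these cases concrete, here is a minimal standalone sketch of just the quoted check. It is not HotSpot code: the function names and parameters are hypothetical stand-ins for cl->unrolled_count(), cl->is_vectorized_loop(), cl->slp_max_unroll() and the LoopMaxUnroll flag, and it ignores every other limit in policy_unroll (node counts, trip counts, etc.). It also assumes the drain loop case corresponds to slp_max_unroll being 0.

```cpp
#include <cassert>

// Hypothetical model of the quoted LoopMaxUnroll check.
// Returns true if the check alone would still allow doubling the unroll count.
bool passes_loop_max_unroll_check(int unrolled_count,    // models cl->unrolled_count()
                                  bool is_vectorized,    // models cl->is_vectorized_loop()
                                  int slp_max_unroll,    // models cl->slp_max_unroll(), 0 if unset
                                  int loop_max_unroll) { // models the LoopMaxUnroll flag
  int future_unroll_cnt = unrolled_count * 2;
  if (!is_vectorized) {
    return future_unroll_cnt <= loop_max_unroll;
  }
  int unroll_constraint = (slp_max_unroll != 0) ? slp_max_unroll : 1;
  return (future_unroll_cnt / unroll_constraint) <= loop_max_unroll;
}

// Largest unroll count reachable by repeated doubling from 1,
// considering only the check above. The cap avoids integer overflow
// for very large flag values and is not part of the modeled logic.
int max_reachable_unroll(bool is_vectorized, int slp_max_unroll, int loop_max_unroll) {
  const int cap = 1 << 20;
  int count = 1;
  while (count < cap &&
         passes_loop_max_unroll_check(count, is_vectorized, slp_max_unroll, loop_max_unroll)) {
    count *= 2;
  }
  return count;
}
```

With the default LoopMaxUnroll=16, this model gives a bound of 16 for a scalar loop, 16*8=128 for a vectorized loop with slp_max_unroll=8, and 16 again for a vectorized loop with slp_max_unroll unset.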

It seems we also always set _local_loop_unroll_factor = 4, which may be relevant below.

  if (phase->C->do_superword()) {
    // Only attempt slp analysis when user controls do not prohibit it
    if (!range_checks_present() && (LoopMaxUnroll > _local_loop_unroll_factor)) {
      // Once policy_slp_analysis succeeds, mark the loop with the
      // maximal unroll factor so that we minimize analysis passes
      if (future_unroll_cnt >= _local_loop_unroll_factor) {
        policy_unroll_slp_analysis(cl, phase, future_unroll_cnt);
      }
    }
  }

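So LoopMaxUnroll must be strictly greater than _local_loop_unroll_factor (4 at this point), i.e. 8 or larger if we only consider powers of two, otherwise we don't do the policy_unroll_slp_analysis. Curious!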
-> Investigate!
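
As a starting point for that investigation, a small sketch of the gate as quoted, taken literally. This is a hypothetical standalone model, not HotSpot code: all names are made up, and it assumes _local_loop_unroll_factor is still 4 when the gate is evaluated.

```cpp
#include <cassert>

// Hypothetical model of the condition guarding policy_unroll_slp_analysis.
bool would_attempt_slp_analysis(bool do_superword,             // models phase->C->do_superword()
                                bool range_checks_present,     // models range_checks_present()
                                int loop_max_unroll,           // models the LoopMaxUnroll flag
                                int local_loop_unroll_factor,  // models _local_loop_unroll_factor
                                int future_unroll_cnt) {       // models cl->unrolled_count() * 2
  return do_superword
      && !range_checks_present
      && loop_max_unroll > local_loop_unroll_factor       // strict comparison against the factor
      && future_unroll_cnt >= local_loop_unroll_factor;   // only once we would unroll by 4 or more
}
```

In this model, LoopMaxUnroll=4 never reaches the analysis, while LoopMaxUnroll=5 or 8 does once future_unroll_cnt is 4. Whether values between 5 and 7 behave sensibly in the real code would need to be checked.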

  int slp_max_unroll_factor = cl->slp_max_unroll();
  if ((LoopMaxUnroll < slp_max_unroll_factor) && FLAG_IS_DEFAULT(LoopMaxUnroll) && UseSubwordForMaxVector) {
    LoopMaxUnroll = slp_max_unroll_factor;
  }

We may now update the flag, if still in default mode. But this is a global update... so next time we come through here we cannot update it any more, right? Does this look good at all?
-> Investigation: ok, it does get updated repeatedly. I played around with an example like TestLoopMaxUnrollIncreasing.java
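
A sketch of why the repeated update is plausible. This is a hypothetical model, not HotSpot code: FlagModel and the function name are made up, and it assumes that the plain assignment to LoopMaxUnroll does not change what FLAG_IS_DEFAULT reports, which would match the observation above.

```cpp
#include <cassert>

// Hypothetical stand-in for a global VM flag and its "still default" state.
struct FlagModel {
  int value;        // models the current value of LoopMaxUnroll
  bool is_default;  // models FLAG_IS_DEFAULT(LoopMaxUnroll)
};

// Models the quoted update: raise the global flag if the user did not set it.
void maybe_raise_loop_max_unroll(FlagModel& flag,
                                 int slp_max_unroll_factor,         // models cl->slp_max_unroll()
                                 bool use_subword_for_max_vector) { // models UseSubwordForMaxVector
  if (flag.value < slp_max_unroll_factor && flag.is_default && use_subword_for_max_vector) {
    // Assumption: plain assignment, so is_default is left untouched.
    flag.value = slp_max_unroll_factor;
  }
}
```

Under that assumption the global value can only grow: a loop with a larger slp_max_unroll raises it again, and every loop compiled afterwards sees the raised value, even loops that would never have needed it.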

We also use the flag to limit the search for reduction chains in SuperWord:
src/hotspot/share/opto/superword.cpp:  PathEnd path_to_phi = find_in_path(n, input, LoopMaxUnroll, has_my_opcode,
src/hotspot/share/opto/superword.cpp:  PathEnd path_from_phi = find_in_path(first, input, LoopMaxUnroll, has_my_opcode,

policy_unroll_slp_analysis
 - unrolling_analysis sets _local_loop_unroll_factor, mark_passed_slp, and set_slp_max_unroll.
 - Maybe we can refactor the code, so we don't set random states all over the place?
 - eventually, we then set _local_loop_unroll_limit, which seems to be the real limit... but that is a node limit, which is a bit strange and could possibly lead to inaccurate super-unrolling. The heuristic is also odd.
   -> this probably leads to the horrible over-unrolling we have seen when we pass slp analysis but fail to vectorize!

TODO: policy_unroll_slp_analysis - meaning of related fields
TODO: consider deprecating UseSubwordForMaxVector - I see no reason to disable it ever. We also have no tests for its correctness if disabled.
TODO: consider deprecating SuperWordLoopUnrollAnalysis - ah but it is false on arm only... strange!

It is also surprising that do_maximally_unroll() seems to ignore LoopMaxUnroll and can only be controlled by LoopUnrollLimit. I would have expected a loop with 10 iterations, which is normally maximally unrolled, not to be unrolled when I set LoopMaxUnroll=2, but it still is.

19-09-2025

Relates :	JDK-8367158 - C2: create better fill and copy benchmarks, taking alignment into account
Relates :	JDK-8187601 - Unrolling more when SLP auto-vectorization failed
Relates :	JDK-8129920 - Vectorized loop unrolling