[~qamai] had this idea, I'm filing it for him.
When we vectorize reductions, we try to move them out of the loop, see PhaseIdealLoop::move_unordered_reduction_out_of_loop introduced in JDK-8302652 / https://github.com/openjdk/jdk/pull/13056.
That still leaves us with a chain of vector-adds inside the loop, whose loop-carried latency can become the bottleneck. I'm copying this from elsewhere:
[~qamai]:
Reassociation idea: A reduction loop is latency-bound, so we can reassociate the operations of an unrolled loop to saturate the ALU and load/store units. E.g.: transforming x4 + (x3 + (x2 + (x1 + x))) into x + (x4 + (x3 + (x2 + x1))). This should be easier and introduce less register pressure compared to having several dedicated reduction lanes.
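To illustrate the transform on scalars (illustrative sketch only, class and method names are made up): both association orders compute the same sum, but in the chained form every add sits on the loop-carried dependence through the accumulator, while in the reassociated form only the final add does.

```java
public class ReassocSketch {
    // Chained: x4 + (x3 + (x2 + (x1 + x))) -- the accumulator feeds the
    // innermost add, so all four adds are on the loop-carried chain.
    static int chained(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc = a[i + 3] + (a[i + 2] + (a[i + 1] + (a[i] + acc)));
        }
        return acc;
    }

    // Reassociated: x + (x4 + (x3 + (x2 + x1))) -- the three inner adds
    // are independent of the accumulator; only one add is on the chain.
    static int reassociated(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc = acc + (a[i + 3] + (a[i + 2] + (a[i + 1] + a[i])));
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        if (chained(a) != reassociated(a)) throw new AssertionError();
        System.out.println(chained(a)); // 36
    }
}
```

For integer adds the two orders are exactly equal; for floating point this reassociation would change the result, which is why it only applies to unordered reductions.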
[~epeter]:
Ok, yes. After moving the reduction out of the loop, we now have add-vectors in a sequence.
This has high latency. We could further improve things this way:
- give each its own phi -> smaller latency but requires more registers
- reassociate them -> if we do it right, i.e. xv = xv + (xv4 + (xv3 + (xv2 + xv1))), then the latency is still minimal, but the register pressure on the backedge is smaller. Nice idea!
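The two options above can be sketched in scalar form (hypothetical names, written as scalar Java so it stands alone; the real transform would operate on vector-add nodes in the IR):

```java
public class ReductionShapes {
    // Option 1: give each its own phi -- four independent accumulators,
    // merged after the loop. Short dependence chains, but four live
    // registers carried around the backedge.
    static int multiPhi(int[] a) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < a.length; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    // Option 2: reassociate -- acc = acc + (x4 + (x3 + (x2 + x1))).
    // Only one add on the backedge chain, and only one accumulator
    // register live across iterations.
    static int reassoc(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc += a[i + 3] + (a[i + 2] + (a[i + 1] + a[i]));
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = new int[64];
        for (int i = 0; i < a.length; i++) a[i] = i;
        if (multiPhi(a) != reassoc(a)) throw new AssertionError();
        System.out.println(reassoc(a)); // 2016
    }
}
```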
I may soon refactor away PhaseIdealLoop::move_unordered_reduction_out_of_loop, and move it into VLoop::optimize, so we can already predict during auto-vectorization if we can move the reduction nodes out of the loop, which makes vectorization more profitable.
So this optimization would have to be stand-alone. Maybe it could be done in IGVN after loop-opts, once super-unrolling is done.
It would require that we find a benchmark where the reduction latency is the bottleneck, and not any other computation or memory operation.
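A kernel of roughly this shape might serve as a starting point (hypothetical sketch, not a tuned JMH benchmark; a proper measurement would use JMH): a pure long-add reduction over a small, in-cache array, so the loop-carried add chain rather than memory bandwidth or other computation is what dominates.

```java
public class ReductionLatencyKernel {
    // Single loop-carried add chain, no other work in the loop body.
    static long sum(long[] a) {
        long acc = 0;
        for (long v : a) {
            acc += v;
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 12]; // small enough to stay in cache
        for (int i = 0; i < a.length; i++) a[i] = i;
        long s = 0;
        // Repeat so the JIT compiles the loop and the add chain dominates.
        for (int r = 0; r < 1000; r++) s = sum(a);
        System.out.println(s); // 8386560
    }
}
```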