[~qamai] had this idea, I'm filing it for him.
When we vectorize reductions, we try to move them out of the loop, see PhaseIdealLoop::move_unordered_reduction_out_of_loop introduced in JDK-8302652 / https://github.com/openjdk/jdk/pull/13056.
That still leaves us with a chain of vector-adds inside the loop, whose loop-carried latency can become the bottleneck. I'm copying this from elsewhere:
[~qamai]:
Reassociation idea: A reduction loop is latency-bound, so we can reassociate the operations of an unrolled loop to saturate the ALU and load/store units. E.g.: transforming x4 + (x3 + (x2 + (x1 + x))) into x + (x4 + (x3 + (x2 + x1))). This should be easier and introduce less register pressure compared to having several dedicated reduction lanes.
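To illustrate the transform on scalars (illustrative sketch only, class and method names are made up): both association orders compute the same sum, but in the chained form every add sits on the loop-carried dependence through the accumulator, while in the reassociated form only the final add does.

```java
public class ReassocSketch {
    // Chained: x4 + (x3 + (x2 + (x1 + x))) -- the accumulator feeds the
    // innermost add, so all four adds are on the loop-carried chain.
    static int chained(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc = a[i + 3] + (a[i + 2] + (a[i + 1] + (a[i] + acc)));
        }
        return acc;
    }

    // Reassociated: x + (x4 + (x3 + (x2 + x1))) -- the three inner adds
    // are independent of the accumulator; only one add is on the chain.
    static int reassociated(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc = acc + (a[i + 3] + (a[i + 2] + (a[i + 1] + a[i])));
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        if (chained(a) != reassociated(a)) throw new AssertionError();
        System.out.println(chained(a)); // 36
    }
}
```

For integer adds the two orders are exactly equal; for floating point this reassociation would change the result, which is why it only applies to unordered reductions.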
[~epeter]:
Ok, yes. After moving the reduction out of the loop, we now have add-vectors in a sequence.
This has high latency. We could further improve things this way:
- give each its own phi -> smaller latency but requires more registers
- reassociate them -> if we do it right, i.e. xv = xv + (xv4 + (xv3 + (xv2 + xv1))), then the latency is still minimal, but the register pressure on the backedge is smaller. Nice idea!
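The two options above can be sketched in scalar form (hypothetical names, written as scalar Java so it stands alone; the real transform would operate on vector-add nodes in the IR):

```java
public class ReductionShapes {
    // Option 1: give each its own phi -- four independent accumulators,
    // merged after the loop. Short dependence chains, but four live
    // registers carried around the backedge.
    static int multiPhi(int[] a) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < a.length; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    // Option 2: reassociate -- acc = acc + (x4 + (x3 + (x2 + x1))).
    // Only one add on the backedge chain, and only one accumulator
    // register live across iterations.
    static int reassoc(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i += 4) {
            acc += a[i + 3] + (a[i + 2] + (a[i + 1] + a[i]));
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = new int[64];
        for (int i = 0; i < a.length; i++) a[i] = i;
        if (multiPhi(a) != reassoc(a)) throw new AssertionError();
        System.out.println(reassoc(a)); // 2016
    }
}
```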
I may soon refactor away PhaseIdealLoop::move_unordered_reduction_out_of_loop, and move it into VLoop::optimize, so we can already predict during auto-vectorization if we can move the reduction nodes out of the loop, which makes vectorization more profitable.
So this optimization would have to be stand-alone. Maybe it could be done in IGVN after loop-opts, once super-unrolling is done.
It would require that we find a benchmark where the reduction latency is the bottleneck, and not any other computation or memory operation.
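A kernel of roughly this shape might serve as a starting point (hypothetical sketch, not a tuned JMH benchmark; a proper measurement would use JMH): a pure long-add reduction over a small, in-cache array, so the loop-carried add chain rather than memory bandwidth or other computation is what dominates.

```java
public class ReductionLatencyKernel {
    // Single loop-carried add chain, no other work in the loop body.
    static long sum(long[] a) {
        long acc = 0;
        for (long v : a) {
            acc += v;
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = new long[1 << 12]; // small enough to stay in cache
        for (int i = 0; i < a.length; i++) a[i] = i;
        long s = 0;
        // Repeat so the JIT compiles the loop and the add chain dominates.
        for (int r = 0; r < 1000; r++) s = sum(a);
        System.out.println(s); // 8386560
    }
}
```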