Bug ID: JDK-8302652 [SuperWord] Reduction should happen after loop, when possible

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 21
21 b24Fixed

Currently, we do all reductions inside the loop. This makes sense for floating-point Add and Mul, where the order of reduction must be strictly linear, so as not to violate IEEE specification (basically the rounding would be ever so slightly different, and lead to wrong results).

Pseudocode:

acc = init
For (i ...) {
   vec = "some vector ops"; // vec holds vector of results from this iteration
   vector_reduction(vec, acc); // reduces vector vec into scalar accumulator acc
}
// use acc

However, in integer-reductions, and some floating-point reductions that do not require the linear order (Min / Max), we can do better. We can use a vector-accumulator in the loop, and do the reduction on this vector only after the loop. This should significantly reduce the work per loop iteration.

v_acc = scalar_to_vector(init); // depends on reduction op how we would do this
For (i ...) {
   vec = "some vector ops"; // vec holds vector of results from this iteration
   v_acc = vector_elememt_wise_reduction(v_acc, vec);
}
acc = vector_reduction(v_acc);
// use acc

Note: we already have different reduction implementations.
We already do a "recursive folding" for ints (C2_MacroAssembler::reduce8I), and a "linear folding" for floats (C2_MacroAssembler::reduce8F).
https://github.com/openjdk/jdk/blob/db1b48ef3bb4f8f0fbb6879200c0655b7fe006eb/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1895-L1941
https://github.com/openjdk/jdk/blob/db1b48ef3bb4f8f0fbb6879200c0655b7fe006eb/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L2096-L2120

I found this while working on JDK-8302139, where I implemented an IR test for SuperWord reductions, and checked out the generated code.

Changeset: 06b0a5e0 Author: Emanuel Peter <epeter@openjdk.org> Date: 2023-05-23 08:05:13 +0000 URL: https://git.openjdk.org/jdk/commit/06b0a5e03852dfed9f1dee4791fc71b4e4e1eeda
23-05-2023
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13056 Date: 2023-03-16 07:29:44 +0000
05-05-2023
I have a WIP patch, and I see some improvements in some of my benchmarks. I will continue working on this. My current approach is via IGVN.
15-03-2023
I added a simple test case with int addition reduction. ./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=print,Test::test -XX:+TraceNewVectors -XX:+TraceSuperWord Test.java I drew the current before/after picture, to better understand the current graph. This was the commandline: ./java -Xbatch -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=print,Test::testx -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+TraceLoopOpts -XX:LoopMaxUnroll=5 Test.java
14-03-2023

Blocks :	JDK-8302139 - Speed up SuperWord reduction tests
Relates :	JDK-8310130 - C2: assert(false) failed: scalar_input is neither phi nor a matchin reduction
Relates :	JDK-8302662 - [SuperWord] Vectorize loop when value from last iteration is used after loop
Relates :	JDK-8307513 - C2: intrinsify Math.max(long,long) and Math.min(long,long)
Relates :	JDK-8314612 - TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
Relates :	JDK-8309647 - [Vector API] Move Reduction outside loop when possible
Relates :	JDK-8345245 - C2 SuperWord: further improve latency after PhaseIdealLoop::move_unordered_reduction_out_of_loop
Relates :	JDK-8307516 - C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction