JDK-8307516 : C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
Type:Enhancement
Component:hotspot
Sub-Component:compiler
Affected Version:21
Priority:P4
Status:Open
Resolution:Unresolved
Submitted:2023-05-05
Updated:2024-05-07
The Version table provides details related to the release that this issue/RFE will be addressed.
Unresolved : Release in which this issue/RFE will be addressed. Resolved: Release in which this issue/RFE has been resolved. Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.
As discussed in the recent reduction speedup work:
https://github.com/openjdk/jdk/pull/13056 JDK-8302652
Comments
Hi [~jcao]!
Thanks for the interest in this subject.
Yes, IF the reduction can be moved out of the loop, then vectorization would always seem to be profitable.
Have you done the same benchmark with float / integer?
There are a few reasons for a cost-model:
1) Not all reductions can be moved out of the loop.
1a) float/doulbe reductions require to be reduced in the same order as specified in the java program, otherwise rounding would produce different results (VectorAPI simply defines reductions as unordered and can ignore this issue).
1b) There are some examples I have where the int reductions cannot be moved out of the loop either - though these are unlikely to ever be in any real java code - the examples are very contrived edge cases.
2) Currently, we do not make good use of pack / extract / shuffle operations. If we want to introduce them, then we need some way of judging if the overhead of these additional instructions is worth it.
3) We may want to have multiple vectorization strategies, and then pick the most profitable of all of them.
4) If I get the chance to introduce If-conversion, then profitability is even more difficult to decide without a cost-model.
Feel free to ask more, I'm planning to much more work around the C2 auto-vectorizer in the foreseeable future.
21-11-2023
Hi Emanual. I think simple reductions should be profitable. My understanding is that after https://bugs.openjdk.org/browse/JDK-8302652, reductions in loops are always profitable if we can hoist them to after the loop.
The only case I'm aware where we cannot legally move out the reduction is when we use the accumulator inside the loop. It should be possible to check this inside `Superword::profitable`. Would we still want a cost model if this is the case?
After this change:
```
diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index dd3f1d10dee..781458f4fd9 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -1987,7 +1987,7 @@ bool SuperWord::profitable(Node_List* p) {
if (is_marked_reduction(p0)) {
Node* second_in = p0->in(2);
Node_List* second_pk = my_pack(second_in);
- if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
+ if (second_pk == nullptr) {
// Unmark reduction if no parent pack or if not enough work
// to cover reduction expansion overhead
_loop_reductions.remove(p0->_idx);
```
And running this simple case
```
class Reduction {
final static int SIZE = 1000000;
int reduction(int[] a) {
int s = 0;
for (int i = 0; i < SIZE; ++i) {
s += a[i];
}
return s;
}
public static void main(String[] args) {
Reduction reduction = new Reduction();
int[] arr = new int[SIZE];
for (int i = 0; i < 20000; ++i) {
reduction.reduction(arr);
}
}
}
```
The runtime improves from 8.2 seconds to 3.3 seconds on my x86 Linux machine.
21-11-2023
With the re-analysis of JDK-8260943, it is becoming clear that we need a cost model.
With this task here, we can introduce a cost model, build the packset optimistically, and then check with the cost-model if we want to apply the packset.
Later, we can also use this to unify the CompileCommand Option Vectorize (_do_vector_loop), such that we try both approaches for the "seeding".
More info in the PR of 8260943:
https://github.com/openjdk/jdk/pull/13930
17-05-2023
This should be related to JDK-8188313 (~5.5 years ago, someone had proposed to consider re-enabling C2's vectorization of simple reductions).