JDK-8309908 : C2 SuperWord: improve packing strategy (lookahead?)
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 20,21,22
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2023-06-13
  • Updated: 2025-01-07
Description
Currently, extend_pairset_with_more_pairs_by_following_use_and_def is greedy.

This means that we can end up with swapped inputs, and then the nodes further up the graph cannot be vectorized.

In the literature, there are algorithms that do some "look-ahead" to generate better packings.

Related issue: JDK-8303113
(here we even have issues with creating adjacent memops)
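
For illustration, here is a standalone version of the general pattern mentioned in the comments below, where the operands are already commuted in the source (the class and warm-up loop are made up for this sketch; this is not the attached Test2.java):

public class CommutedOperands {
    static int[] a = new int[1024];
    static int[] b = new int[1024];
    static int[] c = new int[1024];

    // Hand-unrolled by 2, with the operands of the second add commuted in the
    // source. A packer that cannot reorder edges will fail to pack the two
    // adds, even though they compute the same values.
    static void test(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length - 1; i += 2) {
            a[i + 0] = b[i + 0] + c[i + 0];
            a[i + 1] = c[i + 1] + b[i + 1];
        }
    }

    public static void main(String[] args) {
        for (int k = 0; k < 10_000; k++) {
            test(a, b, c); // warm up so C2 compiles test() and SuperWord runs
        }
    }
}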

--------------------- Original description: -----------------

The attached Test2.java does not vectorize, as we can see:
./java -Xbatch -XX:CompileCommand=compileonly,Test2::test -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose -XX:+TraceLoopOpts -XX:UseAVX=2 Test2.java

See the attached "swap.png": there was an edge swap. This means that the left and right operations are shuffled, and therefore the vector inputs/outputs no longer align.

If I comment out the commute operation during IGVN, then it does vectorize:

diff --git a/src/hotspot/share/opto/addnode.cpp b/src/hotspot/share/opto/addnode.cpp
index cf8f58d8e23..820fa5f9c84 100644
--- a/src/hotspot/share/opto/addnode.cpp
+++ b/src/hotspot/share/opto/addnode.cpp
@@ -118,11 +118,11 @@ static bool commute(PhaseGVN* phase, Node* add) {
     return true;
   }
 
-  // Otherwise, sort inputs (commutativity) to help value numbering.
-  if( in1->_idx > in2->_idx ) {
-    add->swap_edges(1, 2);
-    return true;
-  }
+  // // Otherwise, sort inputs (commutativity) to help value numbering.
+  // if( in1->_idx > in2->_idx ) {
+  //   add->swap_edges(1, 2);
+  //   return true;
+  // }
   return false;
 }
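
For context, the sorting that this experimental diff disables ("sort inputs (commutativity) to help value numbering") is what lets GVN merge adds that differ only in operand order. A hypothetical sketch of such a shape (my own illustration, not the attached Test2.java):

public class CommuteExample {
    // The two adds only differ in operand order. The idx-based input sorting
    // in commute() canonicalizes them, so GVN can fold them into a single
    // AddI node; with the sorting commented out (as in the diff above),
    // they stay separate.
    static int test2(int b, int c) {
        int x = b + c;
        int y = c + b;
        return x + y;
    }

    public static void main(String[] args) {
        System.out.println(test2(3, 4)); // prints 14
    }
}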


Comments
We should also look into removing CloneMap / CompileCommand Vectorize. The reason some cases require it is that the same array is referenced at constant offsets in the same loop. This confuses the packing - in fact the problem already shows up when finding adjacent memrefs.
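
Presumably the kind of loop meant here looks something like the following sketch (an assumption for illustration, not a reduced test case), where one array is referenced at two different constant offsets in the same iteration:

public class SameArrayOffsets {
    static int[] a = new int[1024];
    static int[] b = new int[1024];

    static void test(int[] a, int[] b) {
        for (int i = 0; i < a.length - 4; i++) {
            // b is referenced at two different constant offsets in the same
            // iteration; after unrolling, the packer has to sort out which
            // loads of b are adjacent to which, which is where the confusion
            // described above can arise.
            a[i] = b[i] + b[i + 4];
        }
    }

    public static void main(String[] args) {
        for (int k = 0; k < 10_000; k++) { test(a, b); }
    }
}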
07-01-2025

Is SuperWord::order_def_uses related to this issue?
10-02-2024

It seems there is already some kind of edge-swapping code around: SuperWord::opnd_positions_match. Why does that not work?

I edited Test2.java: ./java -Xbatch -XX:CompileCommand=compileonly,Test2::test* -XX:+TraceNewVectors -XX:+TraceSuperWord -XX:+Verbose -XX:+TraceLoopOpts -XX:UseAVX=2 -XX:+PrintIdeal Test2.java

If I remove the swap, then it vectorizes, but then the two adds in "test2" do not get merged into one; "Identity" does not recognize it.

I looked into the idea of removing the swap and adjusting the cmp. But that does not work. For Global Value Numbering to work, we assume that only nodes with the same inputs at the same indices can be the same, see "hash_find_insert". Only if all inputs are the same do we call "cmp", to see if the nodes may be different not because of the inputs, but because of some other internal state. I don't think we want to change how Global Value Numbering works just for this.

Rather, we probably have to do something smarter inside SuperWord, and also take care of more general cases, like:

for (i ...) {
  a[i+0] = b[i+0] + c[i+0]
  a[i+1] = c[i+1] + b[i+1]
}

Finding an optimal solution is probably out of scope; SuperWord is somewhat optimistic / greedy.

Another solution I have found is "lookahead-SLP" [1]. The idea is to look ahead a little, and see if one can swap/reorder the edges to make packing more efficient. A later paper [2] reorders even more.

Basically, one would start some sort of local search at two nodes that are already packed. The left one might be the reference; the right one has to be adjusted to it as far as possible. Whenever we encounter a commutable node, we branch the search (swap or not?). Or we use "super-nodes" if we have a subgraph that is commutable, and reorder everything in it (factorial explosion of options).

This search is relatively straightforward if we go from use to def, via the inputs up the graph. But if we go down, then we may have multiple uses which cannot directly be expected to match. So maybe we would have to enumerate all the ways the outputs could be swapped, and branch over those?

[1] Look-ahead SLP: auto-vectorization in the presence of commutative operations
https://dl.acm.org/doi/10.1145/3168807
https://llvm.org/devmtg/2018-04/slides/Rocha-Look-Ahead%20SLP.pdf

[2] Super-Node SLP: Optimized Vectorization for Code Sequences Containing Operators and Their Inverse Elements
https://rcor.me/papers/cgo19snslp.pdf
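
To make the local-search idea a bit more concrete, here is a small self-contained Java sketch of the look-ahead matching from [1] (my own toy illustration, not HotSpot code; the Node class and the scoring are invented for this example): two candidate trees are walked from use to def, and at every commutative node both operand orders are tried, keeping the better match.

import java.util.List;

public class LookaheadSketch {
    // A toy expression node: an operation name and its inputs (binary ops only).
    record Node(String op, boolean commutative, List<Node> in) {
        static Node leaf(String name) { return new Node(name, false, List.of()); }
        static Node add(Node l, Node r) { return new Node("Add", true, List.of(l, r)); }
    }

    // How well do 'left' and 'right' match for packing? Higher is better.
    // For commutative nodes we branch: try the original operand order and the
    // swapped order, and keep the better score (the "look-ahead").
    static int score(Node left, Node right, int depth) {
        if (depth == 0) return 0;                        // look-ahead budget exhausted
        if (!left.op().equals(right.op())) return 0;     // different operations: no match
        if (left.in().isEmpty() || right.in().isEmpty()) return 1; // matching leaves
        int straight = score(left.in().get(0), right.in().get(0), depth - 1)
                     + score(left.in().get(1), right.in().get(1), depth - 1);
        if (!left.commutative()) return 1 + straight;
        int swapped  = score(left.in().get(0), right.in().get(1), depth - 1)
                     + score(left.in().get(1), right.in().get(0), depth - 1);
        // If swapping the right node's operands matches better, a smarter
        // SuperWord could reorder those edges before extending the packset.
        return 1 + Math.max(straight, swapped);
    }

    public static void main(String[] args) {
        // a[i+0] = b[i+0] + c[i+0];  vs  a[i+1] = c[i+1] + b[i+1];
        Node lhs = Node.add(Node.leaf("LoadB"), Node.leaf("LoadC"));
        Node rhs = Node.add(Node.leaf("LoadC"), Node.leaf("LoadB"));
        System.out.println("match score = " + score(lhs, rhs, 4)); // swapped order wins
    }
}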
06-10-2023

Idea discussed with [~thartmann]: Do not swap, but override the compare method. Needs more investigation.
15-06-2023

Converted to RFE because this is not a regression but an issue that probably existed since day 1.
13-06-2023

Some ideas for solutions:
- Disable the commute / edge-swap operation. Probably not a great idea in general.
- Disable it only inside an unrolled loop, so that we keep consistency. But the loop info is not really available during IGVN, so one would maybe have to mark the nodes somehow as "please do not commute".
- Make SuperWord more powerful and have it swap the edges. This would be the most powerful option and could even parallelize code that was "swapped" already in the Java source code:

for (i ...) {
  a[i+0] = b[i+0] + c[i+0]
  a[i+1] = c[i+1] + b[i+1]
}
13-06-2023

I don't think there is any regression here. We can also convert this into an RFE if that is preferred.
13-06-2023