JDK-8305717 : SuperWord: Vectorization in opposite direction traversal cases
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 21
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2023-04-06
  • Updated: 2023-06-05
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Description
Given this loop:

    for (int i = 0; i < DST.length; i++) {
        DST[i] = SRC1[i] + SRC2[DST.length - 1 - i];
    }

C2 should be able to vectorize this loop by reversing the elements after loading from SRC2. Similarly, a reduction loop like this should also be vectorized:

    int sum = 0;
    for (int i = 0; i < DST.length; i++) {
        sum += SRC1[i] + SRC2[DST.length - 1 - i];
    }

Comments
[~qamai] Of course one of the fundamental issues with first is aliasing. Assuming all arrays are of the same type (eg all int or all float), then they all land in the same memory slice. Arrays DST and SRC2 could be the same array, and form a cyclic dependency. I guess that is another thing we might want to address, either via profiling/speculation/uncommon trap, or via multi-versioning (one loop assuming no aliasing, another that assumes they can alias).
05-06-2023

[~epeter] Tbh I don't think it is a necessity to achieve the optimal cut. In the worst case for an n-operand node, we have to do n shuffles, which leads to n + 1 operations, this is still a win for the vast majority of circumstances, as most nodes only have 2 or 3 operands. Furthermore, I think a greedy isolate cut heuristic would yield acceptable solutions in most cases, and optimal ones when there are only 2 colours.
25-05-2023

[~qamai] I have had a vague idea recently: If we can limit the vector-shuffles to "reverse", ie all vectors are either "forward" or "backward", then we can split all vectors in 2 sets. We can now restate the problem as a Min-Cut problem: we would like to find the smallest cut (the lowest number of shuffles). Solving a Max-Flow / Min-Cut problem is relatively cheap. I'd have to do some more theoretic algo-work to design the concrete algorithm (and check if it even works, especially building the flow-graph).
24-05-2023

Sure, please go ahead. But as far as I know, this whole vector-shuffle stuff is an optimization problem, some solve it with ILP (integer linear programming). That is most likely out of scope for a JIT. But maybe there is another approach, that at least works for most cases we could be interested in.
24-04-2023

[~epeter] I have known Clang and GCC's capability to autovectorize vector shuffle before, a possible application of this particular use case is for the computation of polynomial products. I have not come up with an idea to accomplish this yet and I would be glad to give it a try if you do not have a plan in the near future. Thanks a lot.
11-04-2023

[~qamai] Did you randomly find this, or are there performance relevant use-cases? Would you like to work on this?
11-04-2023

This is a bit related to JDK-8305707. The question is how exactly how we'd want to deal with this. As an exception, where we only permute/reverse after a load / before a store? Or more general, where we allow permutation of the vector lanes in general? I think this RFE would be a bit of a bigger one, as lots of the SuperWord code only works in the forward direction. We do not just have to implement it, but also add lots of test cases (including IR tests), to ensure we have the required coverage.
11-04-2023