JDK-8306989 : C2: generalize search in SLP reduction analysis
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 21
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2023-04-27
  • Updated: 2023-06-21
Description
Currently (after the integration of JDK-8287087), reduction analysis for superword vectorization works by searching for reduction chains that follow the same input index, modulo edges that are swapped when canonicalizing commutative nodes (see swap_edges(1, 2) calls in addnode.cpp and mulnode.cpp). This is relatively fast and covers most of the reductions found in practice, but requires keeping track of whether the input edges of a commutative node have been swapped (using the Node::Flag_has_swapped_edges flag). Forgetting to update this flag when the inputs of a commutative node are swapped can result in missing vectorization opportunities.
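To make the commutative-edge issue concrete, here is an illustrative Java sketch (the method names are invented for this example) of the two kinds of scalar reduction loops involved: one where the accumulator flows through the same input position of each add, and one where the programmer wrote the operands in the opposite order, which the current analysis can only follow if edge canonicalization and the Node::Flag_has_swapped_edges bookkeeping are consistent.

```java
// Illustrative only: the kind of scalar reduction loops SLP reduction
// analysis targets. In sumSameIndex the accumulator reaches each add
// through the same input edge; in sumSwapped the operands of the
// commutative add are written in the opposite order.
public class ReductionExamples {
    static int sumSameIndex(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum = sum + a[i]; // accumulator always on the first input
        }
        return sum;
    }

    static int sumSwapped(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum = a[i] + sum; // user-swapped operands of the commutative add
        }
        return sum;
    }
}
```

Both loops compute the same reduction; only the shape of the ideal graph differs.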

A simpler alternative, to be explored in this RFE, is to further generalize reduction analysis into a complete depth- or breadth-first search that explores both inputs of every potential reduction node. This has the advantage of being conceptually simpler, more modular (not relying on swapped edge state), and slightly more complete (being able to find reductions on partially unrolled loops with user-swapped inputs), at the possible expense of higher computational cost (see qualitative comparison in https://github.com/openjdk/jdk/pull/13120#issue-1634092373 and further discussion in the review of JDK-8287087). A specific aspect that should be evaluated is the cost of searching during the x64 matching phase through long chains of floating-point Min/Max nodes, since reduction analysis is performed for each individual node in the chain.
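The generalized search can be sketched as follows. This is not C2's actual Node API; the Node class, opcode strings, and depth bound below are hypothetical stand-ins used to show the idea: starting from a candidate node, recurse through *both* inputs looking for a chain of same-opcode nodes that closes back on the loop phi, so no swapped-edge state is needed.

```java
// A minimal sketch (not C2's real Node API) of a depth-first search that
// explores both inputs of every candidate reduction node. A bounded depth
// keeps the cost of long chains (e.g. floating-point Min/Max) in check.
public class ReductionSearch {
    static final class Node {
        final String op;      // e.g. "AddI", "Phi", "LoadI" (illustrative)
        final Node[] inputs;  // data inputs
        Node(String op, Node... inputs) { this.op = op; this.inputs = inputs; }
    }

    // Returns true if a chain of 'op' nodes leads from 'start' back to
    // 'phi' within 'depth' steps, trying both inputs at each step.
    static boolean findChain(Node start, Node phi, String op, int depth) {
        if (start == phi) return true;
        if (depth == 0 || !start.op.equals(op)) return false;
        for (Node in : start.inputs) {
            if (findChain(in, phi, op, depth - 1)) return true;
        }
        return false;
    }
}
```

Because both inputs are tried at every step, a chain where the user swapped the accumulator to the second operand is found just as easily as the canonical form.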
Comments
[~epeter] Interesting suggestion. If we want to explore this, we might draw inspiration from how matching over repeated, hierarchical graph patterns is approached in https://robcasloz.github.io/publications/CastanedaColeEa_PPoPP_2021.pdf - a paper I happened to co-author a few years ago :) See e.g. Figure 3.
21-06-2023

[~rcastanedalo] As you know, there has recently been a conversation about widening the scope of reductions, for example to cover this polynomial case:

for (int i = fromIndex; i < end; i++) { result = 31 * result + a[i]; }

This would make the whole algorithm more complicated, but would be really powerful. It is related to this prior work:
https://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A1521542&dswid=-6366
https://inside.java/2021/01/27/extending-c2-autovectorization-capabilities/
And it would make these hand-crafted solutions obsolete: JDK-8282664
13-06-2023
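The polynomial case above goes beyond a plain commutative reduction because each iteration multiplies the accumulator by a constant. A hedged sketch of the underlying algebraic rewrite (method names invented for this example): unrolling by 4 and expanding 31*(31*(31*(31*r + a[i]) + a[i+1]) + a[i+2]) + a[i+3] yields independent multiply-add terms with constant coefficients 31^3, 31^2, 31^1, 31^0 that a vectorizer could evaluate in parallel. The identity holds under Java's wrap-around int arithmetic.

```java
// Illustrative only: the algebraic rewrite behind vectorizing the
// polynomial reduction r = 31*r + a[i]. Unrolling by 4 gives
//   r' = 31^4*r + 31^3*a[i] + 31^2*a[i+1] + 31*a[i+2] + a[i+3]
// whose four data terms are mutually independent.
public class PolynomialReduction {
    static int scalar(int[] a) {
        int r = 1;
        for (int i = 0; i < a.length; i++) r = 31 * r + a[i];
        return r;
    }

    static int unrolled4(int[] a) {
        int r = 1, i = 0;
        int c3 = 31 * 31 * 31, c2 = 31 * 31, c1 = 31; // 31^3, 31^2, 31^1
        int c4 = 31 * c3;                              // 31^4
        for (; i + 4 <= a.length; i += 4) {
            r = c4 * r + c3 * a[i] + c2 * a[i + 1] + c1 * a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) r = 31 * r + a[i];   // scalar tail
        return r;
    }
}
```

Both methods produce identical results for any input, including lengths that are not a multiple of 4.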