JDK-8287087 : C2: perform SLP reduction analysis on-demand
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 19
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-05-20
  • Updated: 2023-10-04
  • Resolved: 2023-04-27
  • Fixed in: JDK 21 b21
Description
Currently, SLP reduction vectorization follows a two-step approach:

(1) identify and mark reduction nodes and their corresponding loops early, to inform the unrolling policy (see the example loop below),
(2) later reuse this information (which nodes are marked as reductions) to guide SLP vectorization.
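
For illustration (this example is not taken from the issue or its attachments), the following is a minimal Java loop of the kind step (1) is concerned with: the accumulator flows through a loop Phi, and the '+=' becomes an AddI node whose result feeds back into that Phi, which is the reduction pattern the analysis marks before unrolling.

public class IntSumReduction {
    // 'acc' flows through a loop Phi; the '+=' becomes an AddI node whose
    // result feeds back into that Phi, i.e. a reduction candidate.
    static int sum(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = new int[10_000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        // Warm up so that C2 compiles sum(); whether the reduction is actually
        // vectorized also depends on the target CPU and compiler flags.
        int result = 0;
        for (int iter = 0; iter < 10_000; iter++) {
            result = sum(a);
        }
        System.out.println(result);
    }
}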

Because (1) is performed before loop unrolling, this approach can identify reductions in a simple and efficient manner. However, it has a serious drawback that makes it hard to maintain: many different high-level loop transformations and low-level Ideal transformations can invalidate the node and loop reduction marks (flags) during the loop optimization iterations that run between steps (1) and (2), leading in the worst case to SLP miscompilations based on inconsistent reduction information. JDK-8261147 and JDK-8279622 report actual miscompilations caused by this design problem.

A potential solution is to run step (1) on-demand: first to inform the unrolling policy, and then again to guide SLP vectorization. This would eliminate this source of bugs by construction, because no analysis information would be kept "on the side" where it could become invalid. If a node found to be a reduction before unrolling stops being a reduction due to some loop or Ideal transformation, it will simply not be identified and treated as a reduction by SLP vectorization.

The solution suggested in this RFE requires generalizing reduction analysis to also identify reductions in unrolled loops. For example, the analysis should be able to detect that node (166 AddI) in reduction-before-unrolling.pdf (attached) is a reduction, and also that nodes {166 AddI, 504 AddI, 579 AddI, 591 AddI} form a reduction chain in the corresponding unrolled loop in reduction-unrolled-x4.pdf. While detecting reduction chains in unrolled loops is necessarily more costly than marking single reduction nodes before unrolling, the cost is never worse than ("number of phi nodes in the loop" x LoopMaxUnroll), under the assumption that all nodes in an unrolled reduction chain are connected via the same input number. This assumption is key to avoiding a combinatorial explosion, and it seems to hold in practice, since reduction chains are formed by cloning the same reduction node a number of times.
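
The following is a self-contained Java sketch of the chain search that this assumption enables (it uses a simplified, hypothetical node model and a made-up LOOP_MAX_UNROLL constant, not HotSpot's actual Node class or VM flag): starting from the loop Phi's back-edge input, follow a single fixed input index, so each Phi costs at most LoopMaxUnroll steps.

import java.util.ArrayList;
import java.util.List;

public class ReductionChainSketch {
    // Simplified stand-in for an Ideal node: an opcode name plus input edges.
    static final class Node {
        final String opcode;
        final List<Node> inputs = new ArrayList<>();
        Node(String opcode) { this.opcode = opcode; }
    }

    static final int LOOP_MAX_UNROLL = 16; // stand-in for C2's LoopMaxUnroll

    // Follow input 'inputIdx' starting from the Phi's back-edge input.  If the
    // walk returns to the Phi within LOOP_MAX_UNROLL steps and all visited
    // nodes share the same opcode, the visited nodes form a reduction chain.
    static List<Node> findChain(Node phi, int backEdgeIdx, int inputIdx) {
        Node current = phi.inputs.get(backEdgeIdx);
        String opcode = current.opcode;
        List<Node> chain = new ArrayList<>();
        for (int step = 0; step < LOOP_MAX_UNROLL; step++) {
            if (!current.opcode.equals(opcode)) {
                return List.of(); // mixed opcodes: not a reduction chain
            }
            chain.add(current);
            Node next = current.inputs.get(inputIdx);
            if (next == phi) {
                return chain;     // cycle closed through the Phi: chain found
            }
            current = next;
        }
        return List.of();         // no cycle within the unroll bound: give up
    }

    public static void main(String[] args) {
        // Mimic a 4x-unrolled int-sum reduction: Phi -> AddI -> AddI -> AddI -> AddI -> Phi.
        Node phi = new Node("Phi");
        Node prev = phi;
        for (int i = 0; i < 4; i++) {
            Node add = new Node("AddI");
            add.inputs.add(prev);              // input 0: previous chain element
            add.inputs.add(new Node("LoadI")); // input 1: loop-variant operand
            prev = add;
        }
        phi.inputs.add(prev); // back edge: last AddI of the chain
        System.out.println("Chain length: " + findChain(phi, 0, 0).size()); // prints 4
    }
}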

A prototype of this alternative design is available at https://github.com/robcasloz/jdk/tree/JDK-8287087.

Work left to do:

- Evaluate the overhead of doing general reduction analysis on unrolled loops and its contribution to total C2 execution time (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/superword.cpp#L115).

- Evaluate the overhead of doing per-node general reduction analysis for x86-64 min/max floating-point intrinsic selection (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/node.cpp#L3236).

- Test that the same reduction loops are vectorized as before. As part of this effort, IR test framework checks could be added to the reduction vectorization test cases for more robust regression testing. Some examples can be found in the prototype, e.g. https://github.com/robcasloz/jdk/blob/general-reduction-analysis/test/hotspot/jtreg/compiler/loopopts/superword/RedTest_int_x64.java, and a simplified sketch of such a check is given after this list. This would require additional support in the IR test framework for multi-target IR matching, see JDK-8280120.

- Test that the x86-64 min/max floating-point intrinsic selection works at least as well as (and possibly better than) the original implementation. The potential improvement over the original implementation comes from the fact that on-demand reduction analysis would be performed on *every* min/max floating-point node visited during instruction selection, not just on those visited in the context of SLP analysis.
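
As a simplified sketch of the kind of IR check mentioned above (the jtreg @test header and required flags are omitted, and the class name is made up; the prototype's RedTest_int_x64.java linked above is the authoritative example, while the ir_framework annotations and the IRNode.ADD_REDUCTION_VI rule come from the existing HotSpot test library):

import compiler.lib.ir_framework.*;

public class SimpleIntReductionIRCheck {
    static final int[] A = new int[1024];

    public static void main(String[] args) {
        TestFramework.run();
    }

    // Expect at least one vectorized int add-reduction node in the C2 IR when
    // the target supports it (gated on AVX2 here purely as an example; this is
    // where multi-target IR matching, JDK-8280120, would help).
    @Test
    @IR(applyIfCPUFeature = {"avx2", "true"},
        counts = {IRNode.ADD_REDUCTION_VI, "> 0"})
    static int sumReduction() {
        int acc = 0;
        for (int i = 0; i < A.length; i++) {
            acc += A[i];
        }
        return acc;
    }
}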
Comments
Changeset: 1be80a44 Author: Roberto Castañeda Lozano <rcastanedalo@openjdk.org> Date: 2023-04-27 09:39:53 +0000 URL: https://git.openjdk.org/jdk/commit/1be80a4445cf74adc9b2cd5bf262a897f9ede74f
27-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13120 Date: 2023-03-21 14:49:26 +0000
22-03-2023