JDK-8287087 : C2: perform SLP reduction analysis on-demand
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 19
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-05-20
  • Updated: 2023-10-04
  • Resolved: 2023-04-27
  • Fixed in: JDK 21 b21
Description
Currently, SLP reduction vectorization follows a two-step approach:

(1) identify and mark reduction nodes and their corresponding loops early, to inform the unrolling policy (see the example loop below),
(2) later reuse this information (which nodes are marked as reductions) to guide SLP vectorization.
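
For illustration (this example is not taken from the issue or its attachments), the following is a minimal Java loop of the kind step (1) is concerned with: the accumulator flows through a loop Phi, and the '+=' becomes an AddI node whose result feeds back into that Phi, which is the reduction pattern the analysis marks before unrolling.

public class IntSumReduction {
    // 'acc' flows through a loop Phi; the '+=' becomes an AddI node whose
    // result feeds back into that Phi, i.e. a reduction candidate.
    static int sum(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] a = new int[10_000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        // Warm up so that C2 compiles sum(); whether the reduction is actually
        // vectorized also depends on the target CPU and compiler flags.
        int result = 0;
        for (int iter = 0; iter < 10_000; iter++) {
            result = sum(a);
        }
        System.out.println(result);
    }
}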

Because (1) is performed before loop unrolling, this approach can identify reductions in a simple and efficient manner. However, it has a serious drawback that makes it hard to maintain: many different high-level loop transformations and low-level Ideal transformations can invalidate the node and loop reduction marks (flags) during the loop optimization iterations that run between steps (1) and (2), leading in the worst case to SLP miscompilations based on inconsistent reduction information. JDK-8261147 and JDK-8279622 report actual miscompilations caused by this design problem.

A potential solution is to run step (1) on-demand: first to inform the unrolling policy, and then again to guide SLP vectorization. This would eliminate this source of bugs by construction, because no analysis information would be kept "on the side" where it could become invalid. If a node found to be a reduction before unrolling stops being a reduction due to some loop or Ideal transformation, it will simply not be identified and treated as a reduction by SLP vectorization.

The solution suggested in this RFE requires generalizing reduction analysis to also identify reductions in unrolled loops. For example, the analysis should be able to detect that node (166 AddI) in reduction-before-unrolling.pdf (attached) is a reduction, and also that nodes {166 AddI, 504 AddI, 579 AddI, 591 AddI} form a reduction chain in the corresponding unrolled loop in reduction-unrolled-x4.pdf. While detecting reduction chains in unrolled loops is necessarily more costly than marking single reduction nodes before unrolling, the cost is never worse than ("number of phi nodes in the loop" x LoopMaxUnroll), under the assumption that all nodes in an unrolled reduction chain are connected via the same input number. This assumption is key to avoiding a combinatorial explosion, and it seems to hold in practice, since reduction chains are formed by cloning the same reduction node a number of times.
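
The following is a self-contained Java sketch of the chain search that this assumption enables (it uses a simplified, hypothetical node model and a made-up LOOP_MAX_UNROLL constant, not HotSpot's actual Node class or VM flag): starting from the loop Phi's back-edge input, follow a single fixed input index, so each Phi costs at most LoopMaxUnroll steps.

import java.util.ArrayList;
import java.util.List;

public class ReductionChainSketch {
    // Simplified stand-in for an Ideal node: an opcode name plus input edges.
    static final class Node {
        final String opcode;
        final List<Node> inputs = new ArrayList<>();
        Node(String opcode) { this.opcode = opcode; }
    }

    static final int LOOP_MAX_UNROLL = 16; // stand-in for C2's LoopMaxUnroll

    // Follow input 'inputIdx' starting from the Phi's back-edge input.  If the
    // walk returns to the Phi within LOOP_MAX_UNROLL steps and all visited
    // nodes share the same opcode, the visited nodes form a reduction chain.
    static List<Node> findChain(Node phi, int backEdgeIdx, int inputIdx) {
        Node current = phi.inputs.get(backEdgeIdx);
        String opcode = current.opcode;
        List<Node> chain = new ArrayList<>();
        for (int step = 0; step < LOOP_MAX_UNROLL; step++) {
            if (!current.opcode.equals(opcode)) {
                return List.of(); // mixed opcodes: not a reduction chain
            }
            chain.add(current);
            Node next = current.inputs.get(inputIdx);
            if (next == phi) {
                return chain;     // cycle closed through the Phi: chain found
            }
            current = next;
        }
        return List.of();         // no cycle within the unroll bound: give up
    }

    public static void main(String[] args) {
        // Mimic a 4x-unrolled int-sum reduction: Phi -> AddI -> AddI -> AddI -> AddI -> Phi.
        Node phi = new Node("Phi");
        Node prev = phi;
        for (int i = 0; i < 4; i++) {
            Node add = new Node("AddI");
            add.inputs.add(prev);              // input 0: previous chain element
            add.inputs.add(new Node("LoadI")); // input 1: loop-variant operand
            prev = add;
        }
        phi.inputs.add(prev); // back edge: last AddI of the chain
        System.out.println("Chain length: " + findChain(phi, 0, 0).size()); // prints 4
    }
}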

A prototype of this alternative design is available at https://github.com/robcasloz/jdk/tree/JDK-8287087.

Work left to do:

- Evaluate the overhead of doing general reduction analysis on unrolled loops and its contribution to total C2 execution time (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/superword.cpp#L115).

- Evaluate the overhead of doing per-node general reduction analysis for x86-64 min/max floating-point intrinsic selection (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/node.cpp#L3236).

- Test that the same reduction loops are vectorized as before. As part of this effort, IR test framework checks could be added to the reduction vectorization test cases for more robust regression testing. Some examples can be found in the prototype, e.g. https://github.com/robcasloz/jdk/blob/general-reduction-analysis/test/hotspot/jtreg/compiler/loopopts/superword/RedTest_int_x64.java, and a simplified sketch of such a check is given after this list. This would require additional support in the IR test framework for multi-target IR matching, see JDK-8280120.

- Test that the x86-64 min/max floating-point intrinsic selection works at least as well as (and possibly better than) the original implementation. The potential improvement over the original implementation comes from the fact that on-demand reduction analysis would be performed on *every* min/max floating-point node visited during instruction selection, not just on those visited in the context of SLP analysis.
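
As a simplified sketch of the kind of IR check mentioned above (the jtreg @test header and required flags are omitted, and the class name is made up; the prototype's RedTest_int_x64.java linked above is the authoritative example, while the ir_framework annotations and the IRNode.ADD_REDUCTION_VI rule come from the existing HotSpot test library):

import compiler.lib.ir_framework.*;

public class SimpleIntReductionIRCheck {
    static final int[] A = new int[1024];

    public static void main(String[] args) {
        TestFramework.run();
    }

    // Expect at least one vectorized int add-reduction node in the C2 IR when
    // the target supports it (gated on AVX2 here purely as an example; this is
    // where multi-target IR matching, JDK-8280120, would help).
    @Test
    @IR(applyIfCPUFeature = {"avx2", "true"},
        counts = {IRNode.ADD_REDUCTION_VI, "> 0"})
    static int sumReduction() {
        int acc = 0;
        for (int i = 0; i < A.length; i++) {
            acc += A[i];
        }
        return acc;
    }
}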
Comments
Changeset: 1be80a44 Author: Roberto Castañeda Lozano <rcastanedalo@openjdk.org> Date: 2023-04-27 09:39:53 +0000 URL: https://git.openjdk.org/jdk/commit/1be80a4445cf74adc9b2cd5bf262a897f9ede74f
27-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13120 Date: 2023-03-21 14:49:26 +0000
22-03-2023