JDK-8308994 : C2: Re-implement experimental post loop vectorization
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 21
  • Priority: P3
  • Status: In Progress
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2023-05-29
  • Updated: 2024-05-21
Other: tbd (Unresolved)
Description
Post loop vectorization takes advantage of the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in JDK-8153998 in 2016 under C2's experimental feature PostLoopMultiversioning to support x86 AVX-512 vector masks. Due to insufficient maintenance, it had been broken for a very long time. Last year, we took over JDK-8183390 to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. Because we proposed making post loop vectorization non-experimental in a future JDK release, we ran some stress tests earlier this year and found more problems. They fall into three areas: stability, maintainability and performance.
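To make the idea of a masked tail concrete, here is a minimal sketch in plain Java that simulates it with scalar code. It is not C2 output and does not use real vector instructions: the class name, the method names and the lane count of 16 are assumptions made for illustration only. The main loop handles full blocks of lanes, and the tail is handled by a single "vector" step in which a mask switches off the lanes that lie past the end of the array.

```java
import java.util.Arrays;

// Scalar simulation of a masked tail; all names here are hypothetical.
final class MaskedTailSketch {
    // Assumed vector width in int lanes (for example 512 bits / 32 bits).
    static final int LANES = 16;

    // Reference version: an ordinary scalar loop.
    static void addScalar(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Main loop over full blocks, then one masked step for the tail.
    static void addMaskedTail(int[] a, int[] b, int[] c) {
        int n = c.length;
        int i = 0;
        // Main loop: every lane is active.
        for (; i + LANES <= n; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Tail: build a mask that is true only for in-range lanes.
        if (i < n) {
            boolean[] mask = new boolean[LANES];
            for (int lane = 0; lane < LANES; lane++) {
                mask[lane] = i + lane < n;
            }
            // One masked "vector" operation; inactive lanes touch no memory.
            for (int lane = 0; lane < LANES; lane++) {
                if (mask[lane]) {
                    c[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 37; // 2 full blocks of 16 plus a tail of 5
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        int[] expected = new int[n];
        int[] actual = new int[n];
        addScalar(a, b, expected);
        addMaskedTail(a, b, actual);
        System.out.println(Arrays.equals(expected, actual));
    }
}
```

On hardware with predicate support, the tail step would correspond to masked vector loads, adds and stores rather than a per-lane branch.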

1. Stability
Multiple C2 crash or mis-compilation issues related to post loop vectorization were filed on JBS, including JDK-8301657, JDK-8301904, JDK-8301944, JDK-8304774, JDK-8308949 and perhaps more with recent C2 patches.

2. Maintainability
The original implementation is based on multi-versioned post loops, and its code is mixed into SuperWord. But post loop vectorization does not actually use the SLP algorithm, so the current SuperWord code contains a lot of special handling for post loops. As more and more features are added to SuperWord, the legacy code is becoming increasingly difficult to maintain and extend.

3. Performance
Post loop vectorization was expected to bring a clear performance benefit for loops with few iterations, but JMH tests showed that it did not. A main reason is that the main loop's minimum-trip guard jumps over the multi-versioned vector post loop when the whole loop has very few iterations (see JDK-8307084 for details). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements with a single data size.
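The guard problem can be pictured with a simplified control-flow model. The sketch below is an assumption-laden illustration, not C2's actual loop layout: the class name, the constants and the omission of the pre-loop are all hypothetical. It only counts how many iterations each loop would execute when the vector post loop is reachable solely through the main loop's minimum-trip guard.

```java
import java.util.Arrays;

// Simplified model of the loop layout described above; names and constants are hypothetical.
final class PostLoopGuardSketch {
    static final int LANES = 16;      // assumed lanes per vector
    static final int MAIN_STEP = 64;  // assumed iterations per main-loop trip (vectorized and unrolled)

    // Returns {iterations in main loop, in masked vector post loop, in scalar post loop}.
    static int[] route(int n) {
        int main = 0;
        int masked = 0;
        int scalar = 0;
        int i = 0;
        if (n >= MAIN_STEP) { // minimum-trip guard of the main loop
            while (i + MAIN_STEP <= n) {
                i += MAIN_STEP;
                main += MAIN_STEP;
            }
            // Vector post loop: only reachable when the guard above passed.
            while (i < n) {
                int active = Math.min(LANES, n - i);
                i += active;
                masked += active;
            }
        }
        // Scalar post loop: picks up whatever is left.
        while (i < n) {
            i++;
            scalar++;
        }
        return new int[] {main, masked, scalar};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(route(5)));  // short loop
        System.out.println(Arrays.toString(route(70))); // long loop with a tail
    }
}
```

In this model a loop with 5 iterations never reaches the masked post loop, which matches the observation that short loops saw no benefit. Applying the mask transformation to the scalar post loop itself would avoid that, under the same assumptions.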

For better stability, maintainability and performance, we now propose to deprecate the current multi-versioning framework and completely re-implement the experimental post loop vectorization for both x86 AVX-512 and AArch64 SVE. Our new proposal is to add a standalone ideal loop phase (outside SuperWord) that performs the vector mask transformation directly on the original scalar post loop.

The patch for this is expected to target JDK 22.

Comments
Since the old implementation of post loop vectorization was cleaned up in JDK-8311691, I have re-linked its related issues and closed them.
17-07-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/14581 Date: 2023-06-21 08:24:19 +0000
21-06-2023

Okay, let's leave them open for now.
30-05-2023

I am not sure about the process, and I don't know whether anyone else is interested in fixing those issues. Our new patch for this JBS task will propose removing all legacy code related to PostLoopMultiversioning.
30-05-2023

Just wondering, should we close the related issues as duplicates?
30-05-2023