JDK-8183390 : Fix and re-enable post loop vectorization
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10,11,12,13,14,15,16,17,18,19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2017-07-03
  • Updated: 2022-07-08
  • Resolved: 2022-04-05
Fixed in: JDK 19 (19 b17)
The fix for JDK-8183103 disabled PostLoopMultiversioning and UseAVX=3 by default in JDK 9 due to problems with the vectorization of post loops.

We should implement a full fix and re-enable both features in JDK 10 (or a JDK 9 update release).
This is a fix for an experimental feature, and it is based on the masked vector matching rules introduced in JDK 18, so there is no need to backport.

Changeset: 741be461
Author: Pengfei Li <pli@openjdk.org>
Date: 2022-04-05 23:50:13 +0000
URL: https://git.openjdk.java.net/jdk/commit/741be46138c4a02f1d9661b3acffb533f50ba9cf

I have also attached my slides as a design doc for future reference: https://bugs.openjdk.java.net/secure/attachment/98430/JDK-8183390.pdf

JMH test results (Cipher parts) on Arm Neoverse v1:
  • AESGCMBench.decrypt: -0.14%
  • AESGCMBench.decryptMultiPart: 1.24%
  • AESGCMBench.encrypt: 6.17%
  • AESGCMBench.encryptMultiPart: 1.94%
  • AESGCMByteBuffer.decrypt: -1.97%
  • AESGCMByteBuffer.decrypt: 0.28%
  • AESGCMByteBuffer.decryptMultiPart: -0.93%
  • AESGCMByteBuffer.decryptMultiPart: 0.13%
  • AESGCMByteBuffer.encrypt: 3.55%
  • AESGCMByteBuffer.encrypt: 5.53%
  • AESGCMByteBuffer.encryptMultiPart: -0.77%
  • AESGCMByteBuffer.encryptMultiPart: 3.88%
  • CipherBench.ChaCha20Poly1305.decrypt: -1.64%
  • CipherBench.ChaCha20Poly1305.encrypt: -1.33%
  • CipherBench.GCM.decrypt: 0.05%
  • CipherBench.GCM.encrypt: 0.45%
  • KeyAgreementBench.EC.generateSecret: -0.24%
  • KeyAgreementBench.XDH.generateSecret: 0.10%
  • KeyPairGeneratorBench.generateKeyPair: 0.03%
  • KeyPairGeneratorBench.generateKeyPair: 10.99%

It looks like there are more regressions on x86 and more improvements on AArch64 SVE.

Performance testing found a few regressions (x64), based on both performance runs:
  • Crypto-ChaCha20Poly1305.encrypt: ~2%
  • Crypto-EC.generateSecret: ~3.5%
  • Crypto-XDH.generateSecret: ~1.7%
  • Renaissance-ChiSquare: ~1.2%

There were no noticeable confirmed improvements.

I attached the log from the TestSuperwordFailsUnrolling test failure. It is the only failure.

[~jbhateja] [~thartmann] I have opened a pull request for this fix. https://github.com/openjdk/jdk/pull/6828

[~jbhateja] [~thartmann] I already have a patch that fixes and re-enables post loop vectorization using masked operations. My patch fixes several issues and is fully tested. With it, the post loop feature works on both x86 AVX-512 and AArch64 SVE. I haven't submitted the patch for review yet because the patch it depends on (https://github.com/openjdk/jdk/pull/5873) has not been merged.
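
For readers unfamiliar with the idea, here is a minimal plain-Java model of what "post loop vectorization using masked operations" means. It is only an illustration under stated assumptions, not the C2 implementation: the class name MaskedPostLoopSketch, the method add, and the lane count of 16 are all hypothetical, and the "vector" operations are simulated with inner loops over lanes.

```java
final class MaskedPostLoopSketch {
    // Hypothetical vector length in elements (for example, 16 ints in a 512-bit vector).
    static final int LANES = 16;

    /** Element-wise a + b, modeled as a full-vector main loop plus one masked post-loop step. */
    static int[] add(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int n = a.length;
        int[] out = new int[n];
        int i = 0;

        // Main loop: every iteration covers LANES elements, so no mask is needed.
        for (; i + LANES <= n; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                out[i + lane] = a[i + lane] + b[i + lane];
            }
        }

        // Post loop: fewer than LANES elements remain. Instead of a scalar tail,
        // a single masked "vector" step handles them; inactive lanes are skipped.
        if (i < n) {
            boolean[] mask = new boolean[LANES];
            for (int lane = 0; lane < LANES; lane++) {
                mask[lane] = i + lane < n; // lane is active only while in bounds
            }
            for (int lane = 0; lane < LANES; lane++) {
                if (mask[lane]) {
                    out[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = new int[37];
        int[] b = new int[37];
        for (int k = 0; k < a.length; k++) {
            a[k] = k;
            b[k] = 2 * k;
        }
        int[] out = add(a, b);
        System.out.println("last element: " + out[out.length - 1]);
    }
}
```

With 37 elements and 16 lanes, the model runs two full main-loop steps and one masked step with 5 active lanes. On real hardware the mask would map to an AVX-512 opmask or an SVE predicate register.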

Thanks for the update, Jatin.

Hi Tobias, I plan to address this along with extending the existing post-atomic loop to handle masked operations. This should prevent generation of any post scalar tail. But I am not sure if this can be addressed in JDK 18.

[~jbhateja] any updates on this? JDK-8247838 includes a reproducer.


[~vdeshpande] Vivek, can someone in Intel take this bug?

As for a fix to PostLoopMultiversioning, there are two possible paths to take. The first and easier one is:

a.) Add an attribute to loops, dependenceVectorConsistent, initialized to true. Then analyze and try to prove it false with the following constraints on the fully rolled loop (unroll factor = 1) before entry to the multiversion clone code. For each memory access, annotate a description of its dependence vector in a collection of strings:
  • An address expression that moves in the iteration space in the direction of the induction variable: for an up-counted loop this is ">", for a down-counted loop this is "<".
  • An address expression that moves in the iteration space contrary to the direction of the induction variable: for an up-counted loop this is "<", for a down-counted loop this is ">".

Once the analysis is complete on the rolled loop, if the collection contains a vector moving in the opposite direction of the induction variable, then dependenceVectorConsistent is false. Reason: for the PostLoopMultiversioning model only, superword does not currently sort the accesses by alignment for the direction of usage of vectors, so the transformation becomes illegal.

Note: you may come up with a simpler annotation that just indicates the existence of an opposing dependence vector with respect to the loop direction. All other cases are consistent with the initial value of dependenceVectorConsistent (true), as the direction of usage of vectors is consistent in these cases. This value can then be used to prevent entry into the multiversioning clone code (insert_scalar_rced_post_loop(...)), or as a guarded return in the code itself, preventing any downstream actions, as the post loop is never modified to be processed by superword in these cases.

b.) The second, which does not require the change in (a), is to enable sorting for non-power-of-two fixup loops which have variant residual predicated iterations that fit under the currently mapped power-of-two size. An example is a maximum of 16 iterations, which could fit a predicated loop that maps 1..16 iterations in fixup space. This is a bit more involved, though, as the number of residual iterations is not constant, i.e. the predication size is variable.
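
The analysis in option (a) can be sketched as follows. This is a simplified model in plain Java, not HotSpot code: the class DependenceVectorCheck, the MemoryAccess type, and the strideSign field are hypothetical names introduced for illustration. It also assumes each access has already been summarized by the sign of its per-iteration address stride, and that loop-invariant accesses (stride 0) contribute no annotation.

```java
import java.util.ArrayList;
import java.util.List;

final class DependenceVectorCheck {
    enum LoopDirection { UP_COUNTED, DOWN_COUNTED }

    /** Hypothetical summary of one memory access in the rolled loop. */
    static final class MemoryAccess {
        final String name;
        final int strideSign; // +1: address increases per iteration, -1: decreases, 0: invariant

        MemoryAccess(String name, int strideSign) {
            this.name = name;
            this.strideSign = strideSign;
        }
    }

    /**
     * Builds the collection of annotations. Following the scheme in the comment,
     * ">" ends up marking an increasing address and "<" a decreasing one,
     * whichever way the loop counts.
     */
    static List<String> annotate(List<MemoryAccess> accesses) {
        List<String> annotations = new ArrayList<>();
        for (MemoryAccess access : accesses) {
            if (access.strideSign > 0) {
                annotations.add(">");
            } else if (access.strideSign < 0) {
                annotations.add("<");
            }
            // Assumption: invariant accesses (stride 0) are not annotated.
        }
        return annotations;
    }

    /** Starts from "true" and becomes false if any access opposes the loop direction. */
    static boolean dependenceVectorConsistent(LoopDirection direction, List<MemoryAccess> accesses) {
        String opposing = (direction == LoopDirection.UP_COUNTED) ? "<" : ">";
        return !annotate(accesses).contains(opposing);
    }

    public static void main(String[] args) {
        // Models an up-counted loop such as: a[i] = b[n - 1 - i]
        List<MemoryAccess> accesses = new ArrayList<>();
        accesses.add(new MemoryAccess("a[i]", +1));
        accesses.add(new MemoryAccess("b[n - 1 - i]", -1));
        System.out.println(dependenceVectorConsistent(LoopDirection.UP_COUNTED, accesses));
    }
}
```

In this model, a false result corresponds to the case where the loop would be kept out of the multiversioning clone code, so the post loop is left for the scalar path.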

As of the writing of this comment, there are no known issues with post loop vectorization disabled and UseAVX=3 (7-5-2017).

Another problem with post loop vectorization showed up (JDK-8183319) which should be fixed with this change as well.

ILW = Experimental feature is broken, with -XX:+PostLoopMultiversioning and -XX:UseAVX=3, no workaround = MMH = P3