JDK-8334431 : C2 SuperWord: fix performance regression due to store-to-load-forwarding failures
Type:Bug
Component:hotspot
Sub-Component:compiler
Affected Version:24
Priority:P2
Status:Closed
Resolution:Fixed
CPU:x86_64
Submitted:2024-06-18
Updated:2025-04-21
Resolved:2024-11-20
A pull request was submitted for review.
Branch: master
URL: https://git.openjdk.org/jdk/pull/21521
Date: 2024-10-15 11:33:04 +0000
07-11-2024
[~kvn][~ecaspole] Maybe I can hack something together that just rejects vectorization if I detect a guaranteed store-to-load-forwarding failure. I think that could be done actually. It won't cover all cases nor will it be perfect. In the long-run we may want something better, but it could work for now. I'll investigate.
14-10-2024
[~ecaspole] I see this is a P2 for JDK24, so relatively high priority.
The problem is this:
There is a "set of loops" that currently vectorize but that would be faster if they did not vectorize. The problem is store-to-load-forwarding failure.
The SHA code has intrinsics on most modern machines, but not on some older ones like the Coffee Lake machine you found the regression on. Without intrinsics, the SHA Java code happens to be a case where vectorization leads to a regression.
We could eventually try to create a cost-model with a heuristic that predicts whether we will have such store-to-load-forwarding failures. But that is currently too complicated, i.e. I have other priorities. But I hope to get there eventually.
Do you think it is acceptable to leave this as "Won't Fix" for now?
We could give people a workaround by adding a -XX:CompileCommand=UseSuperWord,Test::test,false option.
This would at least allow users to disable SuperWord on a per-method basis if this is critical for them. Currently, one can only enable/disable SuperWord globally with -XX:-UseSuperWord.
14-10-2024
Here is my blog post about the topic:
https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html
25-06-2024
I was actually able to create some benchmarks which show the effect of store-to-load-forwarding and the failure thereof quite well. And these examples show this effect irrespective of JDK-8325155. That means even before JDK-8325155 we had some cases where we SuperWord-ed, and that was actually slower than without SuperWord.
The numbers below are for the attached VectorLoadToStoreForwarding.java.
I ran it with: make test TEST="micro:vm.compiler.VectorLoadToStoreForwarding" CONF=linux-x64
I'll have to run it again with more runs to be sure the numbers are stable.
Benchmark (SIZE) (seed) Mode Cnt Score Error Units
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_0 80 0 avgt 26.599 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_1 80 0 avgt 30.829 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_2 80 0 avgt 83.134 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_3 80 0 avgt 57.173 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_4 80 0 avgt 51.338 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_5 80 0 avgt 39.334 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_6 80 0 avgt 34.384 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_7 80 0 avgt 33.468 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_8 80 0 avgt 33.040 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingNoSuperWord.benchmark_9 80 0 avgt 30.244 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_0 80 0 avgt 11.415 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_1 80 0 avgt 30.491 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_2 80 0 avgt 42.592 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_3 80 0 avgt 275.857 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_4 80 0 avgt 33.821 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_5 80 0 avgt 142.923 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_6 80 0 avgt 142.265 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_7 80 0 avgt 141.702 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_8 80 0 avgt 32.395 ns/op
VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_9 80 0 avgt 77.071 ns/op
24-06-2024
Nice analysis, Emanuel.
I think you need to detect store-to-load-forwarding (or missing of it) for vector operations.
We don't have intrinsics for all shapes of Java code. There is code which will be affected even when SHA intrinsics are supported, because it is unrelated to them. You can also reproduce/test changes on a modern CPU by disabling the intrinsic and observing the effect of store-to-load-forwarding.
I don't think you need to revert your changes but you need to have a solution for JDK 24.
24-06-2024
Here is the summary:
1.
The regression most likely shows up on all machines that have "store-to-load-forwarding".
This optimization works as follows:
- Stores first go to a store-buffer before they go to the cache-line.
- Loads first check if the value can be forwarded directly from the store-buffer, which is much faster than loading even from a cache-line.
- But this forwarding has some restrictions, to simplify a bit: they must start at the same address, and the load data must be contained in the store data (i.e. equal or smaller).
- If these requirements are not met, and the store and load alias (i.e. have an overlapping memory region), there is a store-load dependency that the CPU detects, and since there is no forwarding, the CPU has to stall the load until the store is committed to the cache-line. This incurs a penalty of many CPU cycles.
2.
Why does this affect SuperWord / AutoVectorization, and why did this happen after JDK-8325155?
- JDK-8325155 removed certain restrictions on vectorization, i.e. that the vector lanes have some "alignment". This allows some more patterns to vectorize, which run into this store-load penalty because the store-to-load-forwarding optimization cannot be applied by the CPU.
- Imagine we have a store to addresses [x+0] and [x+1], and then a load from addresses [x+1] and [x+2]. There is a store-load dependency on [x+1]. In a scalar loop, the stored value can be fetched directly from the store-buffer, so we do not have to wait for the store to reach the cache-line. But if we vectorize to a store of [x+0, x+1] and a load of [x+1, x+2], the vector-store cannot be forwarded to the vector-load from the store-buffer, and we have to wait until the store reaches the cache-line before we can load. While vectorization generally leads to speedup, this store-load dependency that must go through the cache-line (rather than the faster store-buffer forwarding the scalars enjoy) can make the vectorized loop slower.
3.
The regression was detected on the benchmark "SPECjvm2008-Crypto.signverify", particularly in the C2 compiled method "sun.security.provider.SHA.implCompress0", and more specifically on the line:
https://github.com/openjdk/jdk/blob/642084629a9a793a055cba8a950fdb61b7450093/src/java.base/share/classes/sun/security/provider/SHA.java#L158
However, this is part of the SHA compression code that has a corresponding intrinsic "vmIntrinsics::_sha_implCompress", which is about 3x faster than the non-intrinsified C2-compiled code. To my knowledge, most x64 and aarch64 machines support this intrinsic - it is enabled with the CPU-feature "sha" (check with "-Xlog:os+cpu"). But there are some machines like the Mac x64 machine here (Coffee Lake-B processor [3.0 GHz Intel Core i5-8500B]) that apparently do not support the "sha" CPU-feature. Thus, instead of using the "vmIntrinsics::_sha_implCompress" intrinsic, they end up C2-compiling the method "sun.security.provider.SHA.implCompress0", which contains the loop that has the vectorization-regression.
4.
We could of course revert JDK-8325155, which caused this regression, but that would be unfortunate: it was a really nice refactoring that reduced complexity and enables more features in the future.
I will probably try to develop a cost-model soon, and I can try to integrate the cost of these "store-load" dependency penalties when store-to-load-forwarding fails.
A more tangential question: can we really not have a better intrinsic for the SHA code for those machines? Probably that is simply not worth it, or the required CPU instructions are just not available.
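The half-overlap from point 2 can be sketched with two hypothetical recurrences (the names distance3/distance2 and the initialization are illustrative, not from the attached benchmark; distance3 mirrors the SHA-style W-expansion with dependence distance 3):

```java
import java.util.Arrays;

public class ForwardingSketch {
    static final int N = 80;

    // Distance-3 recurrence: iteration t stores w[t] and iteration t+3
    // loads it back. Packed into 2-lane vectors, a later 2-lane load
    // half-overlaps an earlier 2-lane store (shifted by one element),
    // so the CPU cannot forward from the store-buffer.
    static int[] distance3() {
        int[] w = new int[N];
        Arrays.fill(w, 0, 16, 1);
        for (int t = 16; t < N; t++) {
            w[t] = w[t - 3] ^ w[t - 16];
        }
        return w;
    }

    // Distance-2 recurrence: with 2-lane vectors, a later load reads
    // exactly the addresses of an earlier 2-lane store, so
    // store-to-load forwarding can still succeed for the vector code.
    static int[] distance2() {
        int[] w = new int[N];
        Arrays.fill(w, 0, 16, 1);
        for (int t = 16; t < N; t++) {
            w[t] = w[t - 2] ^ w[t - 16];
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(distance3()[N - 1] + " " + distance2()[N - 1]);
    }
}
```

Whether C2 actually vectorizes either loop depends on the platform and flags; the point is only the address overlap pattern the two distances produce.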
24-06-2024
If I disable the "implCompress", then I can see the same regression on my machine - about an 18% regression.
With perfasm, I can see that there is quite a bit of time spent on the same loop as I analyzed before, so it is likely that the vectorization of that loop is the cause for the regression - just like on Mac x64.
But with the intrinsic enabled, performance is much better - about 3x as fast.
./jdk-24+1-15/fastdebug/bin/java -jar specjvm2008-jmh-1.25.jar com.oracle.crypto.Signverify.signverify -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_md5_implCompress,_sha_implCompress,_sha2_implCompress,_sha5_implCompress,_sha3_implCompress" -prof perfasm
Benchmark Mode Cnt Score Error Units
Signverify.signverify thrpt 15 1314.315 ± 23.876 ops/min
Signverify.signverify:·asm thrpt NaN
./jdk-24+1-16/fastdebug/bin/java -jar specjvm2008-jmh-1.25.jar com.oracle.crypto.Signverify.signverify -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_md5_implCompress,_sha_implCompress,_sha2_implCompress,_sha5_implCompress,_sha3_implCompress" -prof perfasm
Benchmark Mode Cnt Score Error Units
Signverify.signverify thrpt 15 1111.993 ± 16.286 ops/min
Signverify.signverify:·asm thrpt NaN
./jdk-24+1-16/fastdebug/bin/java -jar specjvm2008-jmh-1.25.jar com.oracle.crypto.Signverify.signverify -jvmArgs "" -prof perfasm
Benchmark Mode Cnt Score Error Units
Signverify.signverify thrpt 15 3320.990 ± 132.087 ops/min
Signverify.signverify:·asm thrpt NaN
24-06-2024
With perfasm, I can see that we indeed are intrinsifying on my linux x64 avx512 (and probably other machines).
./jdk-24+1-16/fastdebug/bin/java -jar specjvm2008-jmh-1.25.jar com.oracle.crypto.Signverify.signverify -jvmArgs "" -prof perfasm
....[Hottest Regions]...............................................................................
19.32% libjvm.so SpaceMangler::mangle_region
18.20% runtime stub StubRoutines::sha1_implCompressMB
17.89% libjvm.so montgomery_multiply
17.60% runtime stub StubRoutines::md5_implCompressMB
11.16% runtime stub StubRoutines::sha256_implCompressMB
1.16% libjvm.so SharedRuntime::montgomery_square
1.15% c2, level 4 java.math.MutableBigInteger::divideMagnitude, version 2, compile id 1049
1.13% c2, level 4 spec.benchmarks.crypto.signverify.Main::harnessMain, version 2, compile id 1861
1.11% c2, level 4 spec.benchmarks.crypto.signverify.Main::harnessMain, version 2, compile id 1861
1.11% c2, level 4 spec.benchmarks.crypto.signverify.Main::harnessMain, version 2, compile id 1861
1.09% c2, level 4 spec.benchmarks.crypto.signverify.Main::harnessMain, version 2, compile id 1861
1.07% c2, level 4 spec.benchmarks.crypto.Util::getTestData, version 4, compile id 1286
0.92% libc-2.31.so __memset_avx2_erms
0.69% c2, level 4 java.math.BigInteger::oddModPow, version 3, compile id 1253
0.51% libjvm.so SharedRuntime::montgomery_square
0.44% libjvm.so BitMap::find_first_bit_impl<0ul, false>
0.44% libjvm.so CollectedHeap::fill_with_array
0.35% libjvm.so SharedRuntime::montgomery_multiply
0.28% runtime stub StubRoutines::mulAdd
0.25% libjvm.so MemAllocator::mem_allocate_inside_tlab_slow
4.14% <...other 3534 warm regions...>
....................................................................................................
100.00% <totals>
....[Hottest Methods (after inlining)]..............................................................
19.32% libjvm.so SpaceMangler::mangle_region
18.20% runtime stub StubRoutines::sha1_implCompressMB
17.89% libjvm.so montgomery_multiply
17.60% runtime stub StubRoutines::md5_implCompressMB
11.16% runtime stub StubRoutines::sha256_implCompressMB
4.57% c2, level 4 spec.benchmarks.crypto.signverify.Main::harnessMain, version 2, compile id 1861
1.72% libjvm.so SharedRuntime::montgomery_square
1.45% c2, level 4 java.math.MutableBigInteger::divideMagnitude, version 2, compile id 1049
1.08% c2, level 4 spec.benchmarks.crypto.Util::getTestData, version 4, compile id 1286
1.03% c2, level 4 java.math.BigInteger::oddModPow, version 3, compile id 1253
0.92% libc-2.31.so __memset_avx2_erms
0.87% kernel [unknown]
0.46% libjvm.so SharedRuntime::montgomery_multiply
0.44% libjvm.so BitMap::find_first_bit_impl<0ul, false>
0.44% libjvm.so CollectedHeap::fill_with_array
0.28% runtime stub StubRoutines::mulAdd
0.25% libjvm.so MemAllocator::mem_allocate_inside_tlab_slow
0.16% c2, level 4 java.math.MutableBigInteger::modInverse, version 3, compile id 1531
0.15% c2, level 4 java.math.MutableBigInteger::add, version 3, compile id 1217
0.13% c2, level 4 java.math.MutableBigInteger::subtract, version 2, compile id 1147
1.90% <...other 934 warm methods...>
....................................................................................................
100.00% <totals>
....[Distribution by Source]........................................................................
47.46% runtime stub
41.32% libjvm.so
9.27% c2, level 4
0.96% libc-2.31.so
0.87% kernel
0.06%
0.01% ld-2.31.so
0.01% libpthread-2.31.so
0.01% [vdso]
0.00% Unknown, level 0
0.00% c1, level 1
0.00% perf-1503748.map
0.00% interpreter
0.00% c1, level 3
0.00% hsdis-amd64.so
0.00% libjava.so
....................................................................................................
100.00% <totals>
24-06-2024
I ran the full benchmark on my linux x64 avx512 machine:
./jdk-24+1-16/fastdebug/bin/java -jar specjvm2008-jmh-1.25.jar com.oracle.crypto.Signverify.signverify -jvmArgs "" -prof jfr
emanuel@emanuel-oracle:/oracle-work/JDK-8334431-sw-regression$ ./jdk-24+1-15/fastdebug/bin/jfr view hot-methods /oracle-work/JDK-8334431-sw-regression/com.oracle.crypto.Signverify.signverify-Throughput/profile.jfr
Java Methods that Executes the Most
Method Samples Percent
------------------------------------------------------------------------------------------------------- ------- -------
java.lang.ThreadLocal.get() 2,231 36.91%
java.math.MutableBigInteger.mulsub(int[], int[], int, int, int) 762 12.61%
spec.benchmarks.crypto.Util.getTestData(String) 568 9.40%
java.math.BigInteger.oddModPow(BigInteger, BigInteger) 487 8.06%
jdk.internal.util.Preconditions.checkFromIndexSize(int, int, int, BiFunction) 401 6.63%
java.io.ByteArrayInputStream.read(byte[], int, int) 309 5.11%
spec.benchmarks.crypto.signverify.Main.harnessMain() 196 3.24%
java.math.MutableBigInteger.divideMagnitude(MutableBigInteger, MutableBigInteger, boolean) 189 3.13%
java.math.BigInteger.implMontgomeryMultiplyChecks(int[], int[], int[], int, int[]) 101 1.67%
java.math.MutableBigInteger.add(MutableBigInteger) 88 1.46%
java.math.MutableBigInteger.subtract(MutableBigInteger) 62 1.03%
java.math.MutableBigInteger.modInverse(MutableBigInteger) 33 0.55%
java.math.MutableBigInteger.compare(MutableBigInteger) 29 0.48%
java.math.MutableBigInteger.getLowestSetBit() 28 0.46%
java.math.MutableBigInteger.unsignedLongCompare(long, long) 24 0.40%
java.io.ByteArrayOutputStream.write(int) 21 0.35%
java.math.MutableBigInteger.primitiveRightShift(int) 20 0.33%
java.math.BigInteger.montgomerySquare(int[], int[], int, long, int[]) 19 0.31%
java.math.BigInteger.addOne(int[], int, int, int) 18 0.30%
java.math.BigInteger.stripLeadingZeroBytes(int, byte[], int, int) 17 0.28%
java.math.BigInteger.montReduce(int[], int[], int, int) 17 0.28%
java.math.MutableBigInteger.primitiveLeftShift(int) 16 0.26%
java.math.BigInteger.montgomeryMultiply(int[], int[], int[], int, long, int[]) 16 0.26%
java.math.MutableBigInteger.normalize() 15 0.25%
java.math.MutableBigInteger.leftShift(int)
It seems that maybe the method in question is not compiled. There could be a reason for that - maybe inlining, intrinsics, etc.
This indicates that other platforms really spend much less time on the method "sun.security.provider.SHA.implCompress0", which encounters the SuperWord regression on the Mac x64 machine.
24-06-2024
[~sviswanathan] Excellent, I speculated about the write-buffer optimizations yesterday, and quickly read up about that online before I saw your message. Thanks for confirming it!
A quote from 3.6.4.1:
Store-forwarding restrictions vary with each microarchitecture. The following rules help satisfy size and
alignment restrictions for store forwarding:
Assembly/Compiler Coding Rule 41. (H impact, M generality) A load that forwards from a store must have the
same address start point and therefore the same alignment as the store data.
Assembly/Compiler Coding Rule 42. (H impact, M generality) The data of a load which is forwarded from a store
must be completely contained within the store data.
The scalar loop in test1 can perfectly forward the stores to later loads - they have the same size and are 4-byte aligned. But the vectorized loop has 8-byte stores, and the 8-byte loads are shifted by 4 bytes, so they neither start at the same address nor is the load contained within the store data - both rules above are violated.
For test7, the 8-byte stores are at the same addresses as the 8-byte loads, so the stores in the write-buffer can be loaded directly, and we avoid waiting for the store to reach the cache-line, which has a penalty of a few cycles.
This is absolutely fascinating, I definitely learned something new here.
I'll have to think a bit about what we can do to avoid the regression: I'm thinking we might need a way to predict if store-load forwarding is going to be successful or not - this affects the latency of the loop body.
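The two coding rules quoted above can be condensed into a small predicate (a simplification; canForward and its parameters are made-up names, and real microarchitectures have additional conditions):

```java
public class ForwardingRules {
    // Rule 41: the load must start at the same address as the store.
    // Rule 42: the load data must be contained within the store data.
    // Given the same start address, containment reduces to the load
    // being no wider than the store.
    static boolean canForward(long storeAddr, int storeBytes,
                              long loadAddr, int loadBytes) {
        return loadAddr == storeAddr && loadBytes <= storeBytes;
    }

    public static void main(String[] args) {
        // Scalar test1: 4-byte store and 4-byte load at the same address.
        System.out.println(canForward(100, 4, 100, 4));  // forwards
        // Vectorized test1: 8-byte store, 8-byte load shifted by 4 bytes.
        System.out.println(canForward(100, 8, 104, 8));  // stalls
    }
}
```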
22-06-2024
The situation that Emanuel is describing in A.java between test1 and test7 is due to Store-to-Load forwarding optimizations in the CPU. It is described in the Intel Software Optimization Manual (https://cdrdv2.intel.com/v1/dl/getContent/671488) at https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html, section 3.6.4.1 Store-to-Load-Forwarding Restriction on Size and Alignment.
22-06-2024
Maybe Intel can answer your questions. [~jbhateja] and [~sviswanathan] can you look?
21-06-2024
The strange thing:
The observations of regression on my microbenchmark A.java go across all platforms I could inspect: different x64 variants with avx2 or avx512, and even aarch64 neon/asimd.
But the Crypto.signverify as a whole only seems to detect the regression on Mac x64 (Coffee Lake-B processor [3.0 GHz Intel Core i5-8500B]).
So we are missing something else here, and I need to track this down next week.
21-06-2024
I spent today looking more deeply into this, and testing the microbenchmarks on all sorts of platforms, various x64 machines and even an aarch64 asimd/neon machine.
See the attached A.java with all the analysis, benchmark numbers and comments.
But a quick summary here before the weekend:
- Unrolling has some small effect, and we can look into improving that. But it is not the most significant effect.
- A general observation is that 2-element-vectors can lead to speedup over no vectorization, and 4-element-vectorization is generally even more profitable. But in this case, 2-element-vectors are not profitable - but why?
- What becomes clear in test9 / test9b is that dependencies between iterations can have a drastic effect. These tests generate the same machine / assembly code, but on the CPU it seems the loads/stores are detected to NOT alias, and therefore the dependencies are dropped on the CPU, even though they are present on assembly/machine code level.
- So test1 and test9 have such cross-iteration dependencies, and so some stores of previous iterations must happen before loads of later iterations - this increases the latency drastically.
- But test7 and test8 have similar dependencies, just at an offset 2 - test1 and test9 have an offset 3. This is quite peculiar that this has such an adverse effect. This leads me to speculate that there could be some store-load caching that test7/8 benefit from (exactly the same 2-element-vectors are stored and then loaded again), whereas test1/9 cannot benefit from that caching (the 2-element-vector store is offset by 1 from the 2-element-vector load, and they only half-overlap). It would be interesting if there is actually such a caching, or any other effect that explains this difference.
At any rate:
JDK-8325155 allows some new patterns to be vectorized, in particular patterns that have dependencies between iterations, such that vector stores of earlier iterations have to happen before vector loads of later iterations. It seems that some of these patterns, especially the 2-element-vector cases, are not profitable with vectorization.
2-element-vector cases are particularly problematic, because these usually have dependencies between iterations i and i+2, which are quite close. If we instead look at 4-element-vector cases, the dependency is at iterations i and i+4, and we can already parallelize 4 iterations and not just 2 - this is more profitable for vectorization.
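The relationship between dependence distance and usable vector length can be sketched as follows (maxLanes is a hypothetical helper for illustration, not C2 code):

```java
public class LaneCount {
    // For a recurrence w[t] = f(w[t - d]), a vector covering
    // iterations t .. t+L-1 must not cross the dependence, so L <= d.
    // Rounding down to a power of two gives the usable lane count.
    static int maxLanes(int dependenceDistance) {
        // largest power of two <= dependenceDistance (assumes d >= 1)
        return Integer.highestOneBit(dependenceDistance);
    }

    public static void main(String[] args) {
        System.out.println(maxLanes(3)); // SHA W-expansion case: 2 lanes
        System.out.println(maxLanes(4)); // distance 4 allows 4 lanes
    }
}
```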
21-06-2024
I'm now running the same benchmark on my x64 avx512 linux machine, with master (including JDK-8325155), just to see the impact there.
For consistency, I disabled turboboost.
------ No SuperWord -----
time /oracle-work/jdk-fork2/build/linux-x64-debug/jdk/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:-UseSuperWord A.java
...
Unroll 16( 7) Loop: N1591/N246 counted [int,73),+8 (65 iters) main has_sfpt strip_mined
...
real 0m15.330s
user 0m15.281s
sys 0m0.229s
------ With SuperWord -----
time /oracle-work/jdk-fork2/build/linux-x64-debug/jdk/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UseSuperWord A.java
....
SuperWord::apply_vectorization Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
TraceNewVectors [SuperWord]: 1410 LoadVector === 667 1367 1353 [[ 1349 1332 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1352],1133,353,[144] !jvms: A::test @ bci:19 (line 11)
TraceNewVectors [SuperWord]: 1411 LoadVector === 667 1367 1355 [[ 1347 1330 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1354],1135,429,[195] !jvms: A::test @ bci:33 (line 11)
TraceNewVectors [SuperWord]: 1412 LoadVector === 667 1367 1357 [[ 1348 1331 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1356],1137,391,[170] !jvms: A::test @ bci:26 (line 11)
TraceNewVectors [SuperWord]: 1413 LoadVector === 667 1367 1351 [[ 1349 1332 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1350],1131,315,[119] !jvms: A::test @ bci:13 (line 11)
TraceNewVectors [SuperWord]: 1414 XorV === _ 1413 1410 [[ 1348 1331 ]] #vectord[2]:{int} !orig=[1349],1130,145 !jvms: A::test @ bci:20 (line 11)
TraceNewVectors [SuperWord]: 1415 XorV === _ 1414 1412 [[ 1347 1330 ]] #vectord[2]:{int} !orig=[1348],1129,171 !jvms: A::test @ bci:27 (line 11)
TraceNewVectors [SuperWord]: 1416 XorV === _ 1415 1411 [[ 1338 1329 ]] #vectord[2]:{int} !orig=[1347],1128,196 !jvms: A::test @ bci:34 (line 11)
TraceNewVectors [SuperWord]: 1417 RotateRightV === _ 1416 211 [[ 1337 1328 ]] #vectord[2]:{int} !orig=[1338],1126,214 !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
TraceNewVectors [SuperWord]: 1418 StoreVector === 1361 1367 1339 1417 [[ 315 1125 1131 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !orig=[1337],1125,235,1152 !jvms: A::test @ bci:43 (line 12)
TraceNewVectors [SuperWord]: 1419 LoadVector === 667 1367 1134 [[ 1130 145 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1133],[353],[144] !jvms: A::test @ bci:19 (line 11)
TraceNewVectors [SuperWord]: 1420 LoadVector === 667 1367 1136 [[ 1128 196 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1135],[429],[195] !jvms: A::test @ bci:33 (line 11)
TraceNewVectors [SuperWord]: 1421 LoadVector === 667 1367 1138 [[ 1129 171 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1137],[391],[170] !jvms: A::test @ bci:26 (line 11)
TraceNewVectors [SuperWord]: 1422 LoadVector === 667 1418 1132 [[ 1130 145 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1131],[315],[119] !jvms: A::test @ bci:13 (line 11)
TraceNewVectors [SuperWord]: 1423 XorV === _ 1422 1419 [[ 1129 171 ]] #vectord[2]:{int} !orig=[1130],[145] !jvms: A::test @ bci:20 (line 11)
TraceNewVectors [SuperWord]: 1424 XorV === _ 1423 1421 [[ 1128 196 ]] #vectord[2]:{int} !orig=[1129],[171] !jvms: A::test @ bci:27 (line 11)
TraceNewVectors [SuperWord]: 1425 XorV === _ 1424 1420 [[ 1126 214 ]] #vectord[2]:{int} !orig=[1128],[196] !jvms: A::test @ bci:34 (line 11)
TraceNewVectors [SuperWord]: 1426 RotateRightV === _ 1425 211 [[ 1125 235 ]] #vectord[2]:{int} !orig=[1126],[214] !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
TraceNewVectors [SuperWord]: 1427 StoreVector === 1361 1418 1127 1426 [[ 1367 660 238 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !orig=[1125],[235],1152 !jvms: A::test @ bci:43 (line 12)
SuperWord::transform_loop: success
TraceNewVectors [MacroLogic]: 1576 MacroLogicV === _ 1422 1419 1421 1575 [[ ]] #vectord[2]:{int}
TraceNewVectors [MacroLogic]: 1577 MacroLogicV === _ 1413 1410 1412 1575 [[ ]] #vectord[2]:{int}
real 0m34.676s
user 0m34.577s
sys 0m0.293s
------ Explanation -----
Ouch, also here vectorization is actually worse, i.e. leads to a slowdown.
Is it maybe down to lesser unrolling, because the SuperWord version only has 4x unrolling, but the one without SuperWord has 16x unrolling?
Side note: I get the same performance with -XX:UseAVX=2. With -XX:UseAVX=1, RotateRightV is not implemented for 2-element vectors (or maybe not at any size, I did not check); this prevents vectorization and leads to faster performance:
WARNING: Removed pack: not implemented at any smaller size:
0: 1126 RotateRight === _ 1128 211 [[ 1125 ]] #int !orig=214 !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
1: 214 RotateRight === _ 196 211 [[ 235 ]] #int !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
------ No SuperWord, -XX:LoopMaxUnroll=4 -------
time /oracle-work/jdk-fork2/build/linux-x64-debug/jdk/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:-UseSuperWord -XX:LoopMaxUnroll=4 A.java
...
Unroll 4(31) Loop: N1141/N246 counted [int,79),+2 (65 iters) main has_sfpt strip_mined
...
real 0m15.310s
user 0m15.252s
sys 0m0.228s
--------- Explanation --------
No, it seems to still be the vectorization. I'll have to do some more investigation.
But it seems the problem is not just for Mac x64 (AVX2), but more general.
It's a little strange that only the Mac x64 picked this up on the initially reported benchmark (SPECjvm2008-Crypto.signverify) though.
21-06-2024
[~ecaspole] gave me access to such a Mac x64 machine. I was able to reduce the benchmark to this code:
public class A {
    public static void main(String[] args) {
        int[] W = new int[80];
        for (int i = 0; i < 100_000_000; i++) {
            test(W);
        }
    }

    public static void test(int[] W) {
        for (int t = 16; t <= 79; t++) {
            int temp = W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16];
            W[t] = Integer.rotateLeft(temp, 1);
        }
    }
}
I ran it with the jdk build before and after JDK-8325155.
Note about this test: we have a "read-backward" scenario here, i.e. to write at position W[t] we look back 3 iterations to W[t-3]. Hence, we can only expect to vectorize at most 3 elements, and if we go with powers-of-2 vector lengths, then 2-element-vectors.
-------------------------- before JDK-8325155 ------------------------
time ./jdk-24+1-15/fastdebug/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts A.java
...
1394 99 b 4 A::test (51 bytes)
...
Unroll 4(31) Loop: N1141/N246 counted [int,79),+2 (65 iters) main has_sfpt strip_mined
...
After Superword::filter_packs_for_profitable
PackSet::print: 0 packs
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
Loop: N0/N0 has_sfpt
Loop: N647/N652 predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
VLoop::check_preconditions
Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
1361 CountedLoop === 1361 268 246 [[ 1328 1337 1361 1125 1366 1367 263 235 ]] inner stride: 4 main of N1361 strip mined !orig=[1141],[269],[257],[69] !jvms: A::test @ bci:9 (line 11)
VLoop::check_preconditions: failed: loop only wants to be unrolled
real 0m8.746s
user 0m8.699s
sys 0m0.054s
------ Explanation -------
The loop is unrolled 4x, then SuperWord is attempted, but the loads / stores are not deemed "profitable" because the vector lanes do not have the same "alignment" (see JDK-8325155, where I removed this requirement).
Hence, there is no vectorization, and the final result is just the loop with 4x unrolling.
A rough benchmarking gives us about 8.7 sec on that machine.
-------------------------- after JDK-8325155 ------------------------
time ./jdk-24+1-16/fastdebug/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts A.java
...
1371 99 b 4 A::test (51 bytes)
...
Unroll 4(31) Loop: N1141/N246 counted [int,79),+2 (65 iters) main has_sfpt strip_mined
...
SuperWord::output Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
TraceNewVectors [SuperWord]: 1410 LoadVector === 667 1367 1353 [[ 1349 1332 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1352],1133,353,[144] !jvms: A::test @ bci:19 (line 11)
TraceNewVectors [SuperWord]: 1411 LoadVector === 667 1367 1355 [[ 1347 1330 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1354],1135,429,[195] !jvms: A::test @ bci:33 (line 11)
TraceNewVectors [SuperWord]: 1412 LoadVector === 667 1367 1357 [[ 1348 1331 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1356],1137,391,[170] !jvms: A::test @ bci:26 (line 11)
TraceNewVectors [SuperWord]: 1413 LoadVector === 667 1367 1351 [[ 1349 1332 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1350],1131,315,[119] !jvms: A::test @ bci:13 (line 11)
TraceNewVectors [SuperWord]: 1414 XorV === _ 1413 1410 [[ 1348 1331 ]] #vectord[2]:{int} !orig=[1349],1130,145 !jvms: A::test @ bci:20 (line 11)
TraceNewVectors [SuperWord]: 1415 XorV === _ 1414 1412 [[ 1347 1330 ]] #vectord[2]:{int} !orig=[1348],1129,171 !jvms: A::test @ bci:27 (line 11)
TraceNewVectors [SuperWord]: 1416 XorV === _ 1415 1411 [[ 1338 1329 ]] #vectord[2]:{int} !orig=[1347],1128,196 !jvms: A::test @ bci:34 (line 11)
TraceNewVectors [SuperWord]: 1417 RotateRightV === _ 1416 211 [[ 1337 1328 ]] #vectord[2]:{int} !orig=[1338],1126,214 !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
TraceNewVectors [SuperWord]: 1418 StoreVector === 1361 1367 1339 1417 [[ 315 1125 1131 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !orig=[1337],1125,235,1152 !jvms: A::test @ bci:43 (line 12)
TraceNewVectors [SuperWord]: 1419 LoadVector === 667 1367 1134 [[ 1130 145 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1133],[353],[144] !jvms: A::test @ bci:19 (line 11)
TraceNewVectors [SuperWord]: 1420 LoadVector === 667 1367 1136 [[ 1128 196 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1135],[429],[195] !jvms: A::test @ bci:33 (line 11)
TraceNewVectors [SuperWord]: 1421 LoadVector === 667 1367 1138 [[ 1129 171 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1137],[391],[170] !jvms: A::test @ bci:26 (line 11)
TraceNewVectors [SuperWord]: 1422 LoadVector === 667 1418 1132 [[ 1130 145 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectord[2]:{int} (does not depend only on test, unknown control) !orig=[1131],[315],[119] !jvms: A::test @ bci:13 (line 11)
TraceNewVectors [SuperWord]: 1423 XorV === _ 1422 1419 [[ 1129 171 ]] #vectord[2]:{int} !orig=[1130],[145] !jvms: A::test @ bci:20 (line 11)
TraceNewVectors [SuperWord]: 1424 XorV === _ 1423 1421 [[ 1128 196 ]] #vectord[2]:{int} !orig=[1129],[171] !jvms: A::test @ bci:27 (line 11)
TraceNewVectors [SuperWord]: 1425 XorV === _ 1424 1420 [[ 1126 214 ]] #vectord[2]:{int} !orig=[1128],[196] !jvms: A::test @ bci:34 (line 11)
TraceNewVectors [SuperWord]: 1426 RotateRightV === _ 1425 211 [[ 1125 235 ]] #vectord[2]:{int} !orig=[1126],[214] !jvms: Integer::rotateLeft @ bci:7 (line 1673) A::test @ bci:40 (line 12)
TraceNewVectors [SuperWord]: 1427 StoreVector === 1361 1418 1127 1426 [[ 1367 660 238 ]] @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched Memory: @int[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; !orig=[1125],[235],1152 !jvms: A::test @ bci:43 (line 12)
SuperWord::transform_loop: success
real 0m19.363s
user 0m19.312s
sys 0m0.060s
----- Explanation -----
The loop is also 4x unrolled, and SuperWord is attempted.
Now, after JDK-8325155, "profitable" no longer performs the "alignment" checks on the vector lanes, and vectorization succeeds.
We can see that lines 11 and 12 are vectorized into 2 sets of 2-element vectors. We have:
2x 4 LoadVector
2x 3 XorV
2x 1 RotateRightV
2x 1 StoreVector
Looking at this, one would think that it should actually lead to faster execution, as vector operations are generally faster than scalar operations, and we have fewer ops in the loop body.
Instead, the 8.7 sec becomes 19.3 sec, which is quite terrible.
We need to investigate why this vectorization actually leads to a performance loss.
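A plausible mechanism for the slowdown, assuming the loop has a distance-3 store-to-load dependence as sketched above (this is a working hypothesis, not yet confirmed by profiling): in the scalar loop, each 4-byte store is forwarded to the identical 4-byte load a few iterations later, while the vectorized loop issues 8-byte (2-lane) stores and loads that only partially overlap. x86 store-to-load forwarding generally fails on partial overlaps, so each load must stall until the store has drained to the L1 cache, serializing the iterations:

```
scalar, 4-byte accesses (forwarding succeeds: same address, same size):
  store w[i]            [####]
  load  w[i]            [####]

vectorized, 2-lane packs (forwarding fails: partial overlap):
  store {w[i-4],w[i-3]} [####|####]
  load  {w[i-3],w[i-2]}      [####|####]   <- covers only one lane of the store
```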
-------------------------- after JDK-8325155 without SuperWord ------------------------
Just as a sanity check:
time ./jdk-24+1-16/fastdebug/bin/java -XX:CompileCommand=compileonly,A::test* -XX:CompileCommand=printcompilation,A::* -XX:CompileCommand=TraceAutoVectorization,*::*,PRECONDITIONS,BODY,POINTERS,SW_REJECTIONS,SW_INFO -Xbatch -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:-UseSuperWord A.java
1381 99 b 4 A::test (51 bytes)
Counted Loop: N269/N246 counted [16,80),+1 (-1 iters)
Loop: N0/N0 has_sfpt
Loop: N268/N267 limit_check profile_predicated predicated
Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (-1 iters) has_sfpt strip_mined
Predicate IC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Predicate RC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Predicate RC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Predicate RC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Predicate RC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Predicate RC Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt rce strip_mined
Loop: N0/N0 has_sfpt
Loop: N268/N267 limit_check profile_predicated predicated sfpts={ 270 }
Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt strip_mined
PreMainPost Loop: N269/N246 limit_check profile_predicated predicated counted [16,80),+1 (65 iters) has_sfpt strip_mined
Unroll 2 Loop: N269/N246 counted [int,80),+1 (65 iters) main has_sfpt strip_mined
Exceeding node budget: 320 < 557
Loop: N0/N0 has_sfpt
Loop: N647/N652 limit_check profile_predicated predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1141/N246 counted [int,79),+2 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
Unroll 4(31) Loop: N1141/N246 counted [int,79),+2 (65 iters) main has_sfpt strip_mined
Loop: N0/N0 has_sfpt
Loop: N647/N652 limit_check profile_predicated predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
Unroll 8(15) Loop: N1361/N246 counted [int,77),+4 (65 iters) main has_sfpt strip_mined
Loop: N0/N0 has_sfpt
Loop: N647/N652 limit_check profile_predicated predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1591/N246 counted [int,73),+8 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
Unroll 16( 7) Loop: N1591/N246 counted [int,73),+8 (65 iters) main has_sfpt strip_mined
Loop: N0/N0 has_sfpt
Loop: N647/N652 limit_check profile_predicated predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1890/N246 counted [int,65),+16 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
PredicatesOff
Loop: N0/N0 has_sfpt
Loop: N647/N652 predicated counted [16,int),+1 (4 iters) pre
Loop: N268/N267 sfpts={ 270 }
Loop: N1890/N246 counted [int,65),+16 (65 iters) main has_sfpt strip_mined
Loop: N609/N614 counted [int,80),+1 (4 iters) post
real 0m8.266s
user 0m8.232s
sys 0m0.041s
---- Explanation ----
It seems that without SuperWord we now get 16x unrolling, leading to a faster time of 8.3 sec.
This is probably a little faster than the non-vectorized 4x-unrolled version from before JDK-8325155, because of the increased unrolling.
But it is definitely much faster than with SuperWord enabled.
So the problem truly seems to lie with the vectorization, and with JDK-8325155 specifically.
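As an interim workaround (also suggested elsewhere in this issue), SuperWord can be disabled for just the affected method with a CompileCommand. This is a config fragment to append to the java command line; the method pattern A::test matches this reproducer and would need to be adapted (e.g. to the SHA implementation methods) in a real application:

```
-XX:CompileCommand=UseSuperWord,A::test,false
```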
---- to be continued -----
21-06-2024
This looks like a very specific machine, which I currently do not have access to. [~ecaspole] said he would come up with a plan for how to debug this.
18-06-2024
ILW = Significant performance regression, single benchmark on Mac x64, no known workaround = HLH = P2