JDK-8323609 : C2: Odd vectorization breakage with DBB.getLong loop
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 23
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • Submitted: 2024-01-11
  • Updated: 2024-01-15
  • Resolved: 2024-01-15
Related Reports
Duplicate :  
Description
Spotted this during related performance work. If you run the current bytebuffer microbenchmarks, then one of them stands out:

```
% CONF=linux-x86_64-server-release make images test TEST="micro:ByteBuffers.testDirect.*Long" MICRO="FORK=1;OPTIONS=-p size=131072" 

ByteBuffers.testDirectLoopGetLong:        1904.220 +- 0.555  ns/op
ByteBuffers.testDirectLoopGetLongRO:  1914.562 +- 7.225  ns/op
ByteBuffers.testDirectLoopGetLongSwap:     4839.337 +- 2.398  ns/op  <---- !!!
ByteBuffers.testDirectLoopGetLongSwapRO:   1902.759 +- 0.812  ns/op
ByteBuffers.testDirectLoopPutLong:        2068.266 +- 2.197  ns/op
ByteBuffers.testDirectLoopPutLongSwap:      2104.532 +- 2.153  ns/op
```

testDirectLoopGetLongSwap is way out of band: at about 4839 ns/op it takes roughly 2.5x as long as its ~1900 ns/op siblings.

Perfasm shows that in the bad case we have not auto-vectorized the loop: there is a sequence of 8-byte reads+adds. Good cases are all auto-vectorized with 256-bit reads. What is even more funky is that the bad case gets "repaired" when one asks for the read-only (RO) version of it, see testDirectLoopGetLongSwapRO!

(Note that "swap" is misleading, it "swaps" default big-endian BB to little-endian, which matches x86.)

This reliably reproduces on Xeon Platinum 8124M. I have not investigated deeply (at least not yet).
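For readers without the microbenchmark sources at hand, the loop shape under discussion can be sketched roughly as below. This is not the actual ByteBuffers benchmark: the class name `DirectGetLongSketch` and the method `sumLongs` are made up for illustration, and it only assumes the public `java.nio` API. It shows a sum reduction over a direct buffer in the default big-endian order, in little-endian order (the "swap" case), and through a read-only view.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the JDK microbenchmark itself.
final class DirectGetLongSketch {

    // Sum all longs in the buffer with absolute reads: a load plus an add per
    // iteration, i.e. a plain reduction loop.
    static long sumLongs(ByteBuffer bb) {
        long sum = 0;
        for (int i = 0; i + Long.BYTES <= bb.capacity(); i += Long.BYTES) {
            sum += bb.getLong(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int size = 131072; // mirrors the "-p size=131072" parameter above

        // Default byte order of a new ByteBuffer is big-endian.
        ByteBuffer be = ByteBuffer.allocateDirect(size);
        // "Swap" case: little-endian, which matches the native order on x86.
        ByteBuffer le = ByteBuffer.allocateDirect(size).order(ByteOrder.LITTLE_ENDIAN);

        for (int i = 0; i < size; i += Long.BYTES) {
            be.putLong(i, i);
            le.putLong(i, i);
        }

        // A read-only view does not inherit the byte order, so set it again.
        ByteBuffer leRO = le.asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);

        // All three loops compute the same value; only the generated code differs.
        System.out.println(sumLongs(be) == sumLongs(le));
        System.out.println(sumLongs(le) == sumLongs(leRO));
    }
}
```

Running this says nothing about performance by itself; it only makes the four Get variants concrete. Timing them would still need a harness such as the JMH run shown above.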
Comments
I took the benchmark and ran it on JDK 17 and JDK 21, and I don't think it is a regression per se. The whole thing gets interesting after observing the benchmark results change quite a bit after JDK-8286401! I ran the tests on the same benchmark JAR with JDK-8286401, which resembles mainline state of it.

```
# JDK 21
ByteBuffers.testDirectLoopGetLong        131072  avgt  10  2438.782 ± 2.387  ns/op
ByteBuffers.testDirectLoopGetLongRO      131072  avgt  10  1895.233 ± 5.196  ns/op
ByteBuffers.testDirectLoopGetLongSwap    131072  avgt  10  4838.730 ± 0.516  ns/op
ByteBuffers.testDirectLoopGetLongSwapRO  131072  avgt  10  1907.789 ± 1.281  ns/op
ByteBuffers.testDirectLoopPutLong        131072  avgt  10  3143.003 ± 4.817  ns/op
ByteBuffers.testDirectLoopPutLongSwap    131072  avgt  10  3181.492 ± 6.562  ns/op

# JDK 17
ByteBuffers.testDirectLoopGetLong        131072  avgt  10  8055.508 ± 4.676  ns/op
ByteBuffers.testDirectLoopGetLongRO      131072  avgt  10  8072.991 ± 4.990  ns/op
ByteBuffers.testDirectLoopGetLongSwap    131072  avgt  10  4843.763 ± 3.319  ns/op
ByteBuffers.testDirectLoopGetLongSwapRO  131072  avgt  10  8115.210 ± 2.318  ns/op
ByteBuffers.testDirectLoopPutLong        131072  avgt  10  3170.164 ± 1.835  ns/op
ByteBuffers.testDirectLoopPutLongSwap    131072  avgt  10  3164.374 ± 3.375  ns/op
```
15-01-2024

Hence, this may be a regression, but probably not. If it is a regression, then I think it must have been like this for a while. See my RFE: JDK-8307516

TL;DR: The SuperWord reduction heuristic thinks that a loop containing just a reduction is not worth vectorizing. This comes from the time before JDK-8302652, when the whole vector->scalar reduction was always done inside the loop, and therefore had an overhead over the scalar code. Vectorization was only profitable if there were other operations in the loop that could also be vectorized, and the benefit of that outweighed the cost of vectorizing the reduction. In the current examples, the shuffling is such an additional operation, which when vectorized leads to a significant benefit that outweighs the cost of reduction vectorization.

But now with JDK-8302652, the simple reduction is simply a vector-add (vpaddq). This is cheaper than having many scalar adds in sequence. And that is why we need to adjust the reduction heuristic, as I suggested a while ago in JDK-8307516.

Ah, and why not just apply the patch right away? The problem is that float/double reductions are still kept inside the loop, and so for those a simple reduction still has more overhead when vectorized.
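To make the cost argument concrete, here is a small scalar model in Java of the two reduction shapes. It is only an illustration under stated assumptions: `ReductionSketch`, `LANES`, `scalarSum` and `laneSum` are hypothetical names, the inner lane loop merely stands in for a single vector add, and nothing here is C2 code. The point is that with per-lane accumulators the vector->scalar step happens once, after the loop.

```java
// Hypothetical model of a reduction with the vector->scalar step moved out of the loop.
final class ReductionSketch {

    // Assumed lane count: 4 longs would fit a 256-bit vector.
    static final int LANES = 4;

    // Scalar shape: one dependent add per element.
    static long scalarSum(long[] a) {
        long sum = 0;
        for (long v : a) {
            sum += v;
        }
        return sum;
    }

    // Vector-like shape: each main-loop iteration adds LANES elements into LANES
    // independent accumulators (standing in for one vector add); the accumulators
    // are combined into a scalar only once, after the loop.
    static long laneSum(long[] a) {
        long[] acc = new long[LANES];
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            for (int l = 0; l < LANES; l++) {
                acc[l] += a[i + l];
            }
        }
        long sum = 0;
        for (long v : acc) {
            sum += v; // single horizontal reduction after the loop
        }
        for (; i < a.length; i++) {
            sum += a[i]; // leftover elements that do not fill a full vector
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1003];
        for (int i = 0; i < data.length; i++) {
            data[i] = 31L * i - 7;
        }
        // long addition is associative (with wraparound), so both shapes agree.
        System.out.println(scalarSum(data) == laneSum(data));
    }
}
```

The reordering is only valid because integer addition is associative. Floating-point addition is not, which fits the remark above that float/double reductions are still kept inside the loop.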
12-01-2024

It is as I expected, a problem with the reduction heuristic.

----------------------
Apply this patch:

diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index 4b1d9e54572..53ec7fc0839 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -1891,7 +1891,7 @@ bool SuperWord::profitable(Node_List* p) {
     if (is_marked_reduction(p0)) {
       Node* second_in = p0->in(2);
       Node_List* second_pk = my_pack(second_in);
-      if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
+      if ((second_pk == nullptr)) {
         // Unmark reduction if no parent pack or if not enough work
         // to cover reduction expansion overhead
         _loop_reductions.remove(p0->_idx);

-----------------------
New results:

Benchmark                                    (size)  Mode  Cnt     Score    Error  Units
ByteBuffers.testDirectLoopGetLong            131072  avgt   10   822.898 ± 11.028  ns/op
ByteBuffers.testDirectLoopGetLong:asm        131072  avgt            NaN             ---
ByteBuffers.testDirectLoopGetLongRO          131072  avgt   10   830.093 ± 25.866  ns/op
ByteBuffers.testDirectLoopGetLongRO:asm      131072  avgt            NaN             ---
ByteBuffers.testDirectLoopGetLongSwap        131072  avgt   10   641.121 ±  1.537  ns/op
ByteBuffers.testDirectLoopGetLongSwap:asm    131072  avgt            NaN             ---
ByteBuffers.testDirectLoopGetLongSwapRO      131072  avgt   10   818.680 ±  6.343  ns/op
ByteBuffers.testDirectLoopGetLongSwapRO:asm  131072  avgt            NaN             ---
ByteBuffers.testDirectLoopPutLong            131072  avgt   10  1413.511 ± 25.368  ns/op
ByteBuffers.testDirectLoopPutLong:asm        131072  avgt            NaN             ---
ByteBuffers.testDirectLoopPutLongSwap        131072  avgt   10  1404.666 ± 15.598  ns/op
ByteBuffers.testDirectLoopPutLongSwap:asm    131072  avgt            NaN             ---

------------
testDirectLoopGetLongSwap:
16 times: vpaddq (%rdx),%zmm4,%zmm4
and then reduce the vector down to scalar after the loop.
12-01-2024

Yes, that agrees with my perfasm logs too: in the bad case, there are no vector instructions in sight. The loop is just unrolled, and then we do the "generic" reads and writes, 8 bytes at a time.
12-01-2024

Benchmark                                (size)  Mode  Cnt     Score     Error  Units
ByteBuffers.testDirectLoopGetLong        131072  avgt   10   846.519 ±  63.010  ns/op
ByteBuffers.testDirectLoopGetLongRO      131072  avgt   10   825.922 ±  33.193  ns/op
ByteBuffers.testDirectLoopGetLongSwap    131072  avgt   10  3508.293 ± 157.550  ns/op
ByteBuffers.testDirectLoopGetLongSwapRO  131072  avgt   10   829.467 ±  20.075  ns/op
ByteBuffers.testDirectLoopPutLong        131072  avgt   10  1426.645 ±  34.974  ns/op
ByteBuffers.testDirectLoopPutLongSwap    131072  avgt   10  1407.044 ±  30.759  ns/op

I reproduced it on my machine, with some AVX512 features. The difference is huge. Interesting... :)

CONF=linux-x64 make images test TEST="micro:ByteBuffers.testDirect.*Long" MICRO="FORK=1;OPTIONS=-p size=131072 -prof perfasm"

Assembly code used:

--- testDirectLoopGetLong:
16 times:
  vmovdqu32 0x3c0(%r9),%zmm3
  vmovdqu64 -0x669542(%rip),%zmm20
  vpshufb %zmm20,%zmm3,%zmm20
  vpaddq %zmm3,%zmm2,%zmm2
(load, shuffle, and accumulate into zmm2; after the loop the vector is reduced to a single long with vextracti64x4 etc.)

--- testDirectLoopGetLongRO:
16 times:
  vmovdqu32 0x3c0(%r9),%zmm3
  vmovdqu64 -0x7019c2(%rip),%zmm20
  vpshufb %zmm20,%zmm3,%zmm20
  vpaddq %zmm3,%zmm2,%zmm2
(load, shuffle, and accumulate into zmm2; after the loop the vector is reduced to a single long with vextracti64x4 etc.)

--- testDirectLoopGetLongSwap:
16 times:
  0x10(%rdx),%rax
(only load one long at a time, reduce into rax) -> bad, this will be slow.

--- testDirectLoopGetLongSwapRO:
16 times:
  vmovdqu32 0x3c0(%r9),%zmm3
  vmovdqu64 -0x66dd42(%rip),%zmm20
  vpshufb %zmm20,%zmm3,%zmm20
  vmovdqu64 -0x66dd52(%rip),%zmm3
  vpaddq %zmm3,%zmm2,%zmm2
(load, shuffle, and accumulate into zmm2; after the loop the vector is reduced to a single long with vextracti64x4 etc.)

--- testDirectLoopPutLong:
16 times:
  vmovdqu32 %zmm0,0x40(%r14)

--- testDirectLoopPutLongSwap:
16 times:
  vmovdqu32 %zmm2,(%rdi)
12-01-2024

I don't know if it vectorized before; this was the first time I ran those benchmarks on that machine. I could bisect, probably, but not right now.
12-01-2024

I think still AVX2 only on that machine:

% shipilev-jdk/build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+PrintFlagsFinal 2>&1 | grep UseAVX
     int UseAVX = 2 {ARCH product} {default}
12-01-2024

[~shade] thanks for the report. I'll have a look. Do you think this ever did work (i.e. did vectorize)? Does your machine support AVX512? What features does it have?

I attached a Test1.java. Could you run it and tell me what logs you are getting?

./java -XX:CompileCommand=printcompilation,Test1::test* -XX:+TraceLoopOpts -XX:+TraceSuperWord -XX:+Verbose -XX:UseAVX=2 Test1.java

I get this with AVX2:

--------------------------
test1:
After combine_packs
packset
Pack: 0
 align: 0 784 LoadL === 787 7 785 [[ 783 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=729,367 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test1 @ bci:15 (line 20)
 align: 8 779 LoadL === 787 7 780 [[ 778 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=367 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test1 @ bci:15 (line 20)
 align: 16 729 LoadL === 787 7 730 [[ 728 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=367 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test1 @ bci:15 (line 20)
 align: 24 367 LoadL === 787 7 365 [[ 428 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test1 @ bci:15 (line 20)
Pack: 1
 align: 0 783 AddL === _ 786 784 [[ 778 ]] Type:long !orig=728,428,742 !jvms: Test::test1 @ bci:18 (line 20)
 align: 8 778 AddL === _ 783 779 [[ 728 ]] Type:long !orig=428,742 !jvms: Test::test1 @ bci:18 (line 20)
 align: 16 728 AddL === _ 778 729 [[ 428 ]] Type:long !orig=428,742 !jvms: Test::test1 @ bci:18 (line 20)
 align: 24 428 AddL === _ 728 367 [[ 786 628 544 ]] Type:long !orig=742 !jvms: Test::test1 @ bci:18 (line 20)
Unprofitable
 783 AddL === _ 786 784 [[ 778 ]] Type:long !orig=728,428,742 !jvms: Test::test1 @ bci:18 (line 20)
Unprofitable
 784 LoadL === 787 7 785 [[ 783 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=729,367 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test1 @ bci:15 (line 20)
After filter_packs
packset

--------------------------
test2:
After combine_packs
packset
Pack: 0
 align: 0 783 LoadL === 785 7 784 [[ 774 782 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=724,368 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 8 778 LoadL === 785 7 779 [[ 772 777 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=368 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 16 724 LoadL === 785 7 725 [[ 719 723 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=368 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 24 368 LoadL === 785 7 366 [[ 389 479 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
Pack: 1
 align: 0 782 ReverseBytesL === _ 783 [[ 774 ]] Type:long !orig=723,389 !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:12 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 8 777 ReverseBytesL === _ 778 [[ 772 ]] Type:long !orig=389 !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:12 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 16 723 ReverseBytesL === _ 724 [[ 719 ]] Type:long !orig=389 !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:12 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 24 389 ReverseBytesL === _ 368 [[ 479 ]] Type:long !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:12 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
Pack: 2
 align: 0 774 CMoveL === _ 381 783 782 [[ 773 ]] Type:long !orig=719,479,[390] !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:15 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 8 772 CMoveL === _ 381 778 777 [[ 771 ]] Type:long !orig=479,[390] !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:15 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 16 719 CMoveL === _ 381 724 723 [[ 718 ]] Type:long !orig=479,[390] !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:15 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
 align: 24 479 CMoveL === _ 381 368 389 [[ 425 ]] Type:long !orig=[390] !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:15 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
Pack: 3
 align: 0 773 AddL === _ 786 774 [[ 771 ]] Type:long !orig=718,425,735 !jvms: Test::test2 @ bci:18 (line 28)
 align: 8 771 AddL === _ 773 772 [[ 718 ]] Type:long !orig=425,735 !jvms: Test::test2 @ bci:18 (line 28)
 align: 16 718 AddL === _ 771 719 [[ 425 ]] Type:long !orig=425,735 !jvms: Test::test2 @ bci:18 (line 28)
 align: 24 425 AddL === _ 718 479 [[ 786 619 533 ]] Type:long !orig=735 !jvms: Test::test2 @ bci:18 (line 28)
Unimplemented
 774 CMoveL === _ 381 783 782 [[ 773 ]] Type:long !orig=719,479,[390] !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:15 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
Unprofitable
 773 AddL === _ 786 774 [[ 771 ]] Type:long !orig=718,425,735 !jvms: Test::test2 @ bci:18 (line 28)
Unprofitable
 782 ReverseBytesL === _ 783 [[ 774 ]] Type:long !orig=723,389 !jvms: jdk.internal.misc.Unsafe::convEndian @ bci:12 (line 3819) jdk.internal.misc.Unsafe::getLongUnaligned @ bci:8 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
Unprofitable
 783 LoadL === 785 7 784 [[ 774 782 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe (does not depend only on test, unknown control) Type:long !orig=724,368 !jvms: jdk.internal.misc.Unsafe::getLongUnaligned @ bci:5 (line 3555) jdk.internal.misc.ScopedMemoryAccess::getLongUnalignedInternal @ bci:15 (line 2583) jdk.internal.misc.ScopedMemoryAccess::getLongUnaligned @ bci:6 (line 2571) java.nio.DirectByteBuffer::getLong @ bci:13 (line 838) java.nio.DirectByteBuffer::getLong @ bci:12 (line 855) Test::test2 @ bci:15 (line 28)
After filter_packs
packset

-----------------------------------------
test3:
After filter_packs
packset
Pack: 0
 align: 0 767 StoreL === 774 778 773 501 [[ 766 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe Memory: @rawptr:BotPTR, idx=Raw; !orig=714,384,728 !jvms: jdk.internal.misc.Unsafe::putLongUnaligned @ bci:10 (line 3676) jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal @ bci:17 (line 2604) jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned @ bci:8 (line 2592) java.nio.DirectByteBuffer::putLong @ bci:18 (line 867) java.nio.DirectByteBuffer::putLong @ bci:13 (line 886) Test::test3 @ bci:15 (line 38)
 align: 8 766 StoreL === 774 767 770 501 [[ 714 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe Memory: @rawptr:BotPTR, idx=Raw; !orig=384,728 !jvms: jdk.internal.misc.Unsafe::putLongUnaligned @ bci:10 (line 3676) jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal @ bci:17 (line 2604) jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned @ bci:8 (line 2592) java.nio.DirectByteBuffer::putLong @ bci:18 (line 867) java.nio.DirectByteBuffer::putLong @ bci:13 (line 886) Test::test3 @ bci:15 (line 38)
 align: 16 714 StoreL === 774 766 718 501 [[ 618 384 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe Memory: @rawptr:BotPTR, idx=Raw; !orig=384,728 !jvms: jdk.internal.misc.Unsafe::putLongUnaligned @ bci:10 (line 3676) jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal @ bci:17 (line 2604) jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned @ bci:8 (line 2592) java.nio.DirectByteBuffer::putLong @ bci:18 (line 867) java.nio.DirectByteBuffer::putLong @ bci:13 (line 886) Test::test3 @ bci:15 (line 38)
 align: 24 384 StoreL === 774 714 381 501 [[ 614 778 532 420 ]] @rawptr:BotPTR, idx=Raw; unaligned unsafe Memory: @rawptr:BotPTR, idx=Raw; !orig=728 !jvms: jdk.internal.misc.Unsafe::putLongUnaligned @ bci:10 (line 3676) jdk.internal.misc.ScopedMemoryAccess::putLongUnalignedInternal @ bci:17 (line 2604) jdk.internal.misc.ScopedMemoryAccess::putLongUnaligned @ bci:8 (line 2592) java.nio.DirectByteBuffer::putLong @ bci:18 (line 867) java.nio.DirectByteBuffer::putLong @ bci:13 (line 886) Test::test3 @ bci:15 (line 38)
... and then it goes on to super-unroll.

--------------------------------
Sadly, the logs are a bit cryptic, and I hope to improve that in the future. But I think one issue is that these cases are reductions (test1 and test2), and hence we consider them unprofitable (a bad decision that I hope to address in the future). But since you see a difference in performance with the RO case, there must be more to it than that. I see that the generated code for RO also includes CMove nodes, which is curious.
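As a reading aid for the test2 packs: the ReverseBytesL and CMoveL nodes are both attributed to `Unsafe::convEndian` in the log. A select of that shape is what one would expect from a conditional that picks between the raw and the byte-reversed value when the endianness flag is not a compile-time constant. The sketch below is a paraphrase for illustration, not the JDK source; `ConvEndianSketch`, `NATIVE_BIG` and `sum` are assumed names.

```java
import java.nio.ByteOrder;

// Hypothetical paraphrase of an endianness-converting read; not the JDK implementation.
final class ConvEndianSketch {

    // True when the platform itself is big-endian (false on x86).
    static final boolean NATIVE_BIG = ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;

    // Keep the value if the requested order matches the platform, else reverse it.
    // With a non-constant 'big', a compiler may compute both values and select one,
    // which would correspond to a ReverseBytesL feeding a CMoveL.
    static long convEndian(boolean big, long n) {
        return big == NATIVE_BIG ? n : Long.reverseBytes(n);
    }

    // Reduction over converted values: load -> (reverse, select) -> add.
    static long sum(long[] raw, boolean big) {
        long s = 0;
        for (long v : raw) {
            s += convEndian(big, v);
        }
        return s;
    }

    public static void main(String[] args) {
        long x = 0x0102030405060708L;
        System.out.println(Long.toHexString(convEndian(NATIVE_BIG, x)));
        System.out.println(Long.toHexString(convEndian(!NATIVE_BIG, x)));
    }
}
```

In this model the add in `sum` depends on the select. That matches the log, where the AddL pack is dropped as unprofitable once the CMoveL pack it feeds from is reported as unimplemented.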
12-01-2024

ILW = Loop not vectorized as expected, single microbenchmark with C2, no known workaround = MLH = P4
12-01-2024

FYI, [~epeter]
12-01-2024