JDK-8368245 : Copying of native memory segments is slower for certain sizes after JDK-8324751
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 26
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2025-09-22
  • Updated: 2025-11-07
JDK 27 : Unresolved
Description
Copying is now faster for most workloads in JDK 26 compared to JDK 25.

However, for certain sizes (e.g., 64 bytes), copying of memory segments is now a bit slower than in JDK 25.
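
For reference, the operation in question boils down to a single MemorySegment.copy call between two segments. A minimal stand-alone sketch (class and field names here are illustrative; the actual numbers in the comments come from the openjdk.bench.java.lang.foreign.SegmentBulkCopy micro):

import java.lang.foreign.MemorySegment;

public class CopySketch {
    // Illustrative size; the regression is most visible around 64 bytes.
    static final int ELEM_SIZE = 64;
    static final MemorySegment heapSrcSegment = MemorySegment.ofArray(new byte[ELEM_SIZE]);
    static final MemorySegment heapDstSegment = MemorySegment.ofArray(new byte[ELEM_SIZE]);

    public static void main(String[] args) {
        // This is the call the benchmark times: copy ELEM_SIZE bytes from src to dst.
        MemorySegment.copy(heapSrcSegment, 0, heapDstSegment, 0, ELEM_SIZE);
    }
}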


Comments
Deferring to JDK 27 for now, as this minor regression is an edge case and fixing it would require significant engineering work. Please re-target to JDK 26 if a fix becomes ready in time.
07-11-2025

Looking at preliminary results from JDK-8367158 / https://github.com/openjdk/jdk/pull/27315, I think that we have a more fundamental issue with SuperWord and small iteration counts: for small iteration counts, our current SuperWord architecture generally leads to regressions (20-50%); for medium-large iteration counts, we get speedups (2x-20x). Consequence: whenever we cover a new case with SuperWord, we get nice speedups for medium-large iteration counts, but also a regression for small iteration counts. That is not great. In a sense, this is a duplicate report of JDK-8343773, which we closed as a duplicate of JDK-8344085. But of course this is also a regression from JDK-8324751, since that one now allows vectorization of copy loops (regression for small iteration counts, but speedup for medium-large iteration counts).
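
To make the iteration-count trade-off concrete, here is a hypothetical stand-alone copy loop of the kind SuperWord targets (illustrative only, not the JDK code):

public class SmallTripCountExample {
    // C2 splits such a loop into a scalar pre-loop, a (super-unrolled) vectorized
    // main loop, and a scalar post-loop. For large 'len' the vectorized main loop
    // dominates and we win big; for small 'len' (a few dozen elements) we may never
    // reach the main loop and only execute the scalar pre/post loops.
    static void copyBytes(byte[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = src[i];
        }
    }
}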
14-10-2025

I ran the benchmark with these two configurations:

make test TEST="micro:SegmentBulkCopy.heapSegmentCopyJava" CONF=linux-x64 TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=1" MICRO="OPTIONS=-prof perfasm -p ELEM_SIZE=64"

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
SegmentBulkCopy.heapSegmentCopyJava           64  avgt   30  20.907 ± 2.132  ns/op

make test TEST="micro:SegmentBulkCopy.heapSegmentCopyJava" CONF=linux-x64 TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0" MICRO="OPTIONS=-prof perfasm -p ELEM_SIZE=64"

Benchmark                            (ELEM_SIZE)  Mode  Cnt   Score   Error  Units
SegmentBulkCopy.heapSegmentCopyJava           64  avgt   30  14.034 ± 2.644  ns/op

See attached log files:
benchmark_SegmentBulkCopy_heapSegmentCopyJava_64_default.log
benchmark_SegmentBulkCopy_heapSegmentCopyJava_64_novectorization.log

Summarizing the observations:
- novectorization (like before JDK-8324751): uses pre-loop, unrolled main-loop, post-loop. Especially the unrolled (non-vectorized) main-loop gets the bulk of the work done here.
- default: vectorizes. But the main-loop requires a lot of iterations to enter. We have too few, and so we only spend time in the non-vectorized and non-unrolled pre/post loops.
- default: spends a few percent on the aliasing runtime check. Maybe we could optimize that away, given that we have a dominating "src.overlaps(dst)" check.

It could be that JDK-8307084 helps a bit, by allowing us to enter the vectorized drain-loop more often. But there will still be a barrier of entry to that one.

Let's think about it:
- The benchmark has 64 bytes. The current MemorySegment impl processes it with 8-byte longs, so we only have 8 iterations.
- In theory, we could of course handle the copy in a single 64-byte load/store, but that's not realistic given that we will do some work in the pre-loop, at least 1 iteration; that is just a limitation of our current pre/main/post architecture. We could still use 32- or at least 16-byte accesses, though, or at least some unrolled segment of 8-byte accesses.

I'll have to do some more benchmarking on small sizes, to see where else vectorization is currently slower than scalar. I had done some vectorization benchmarking with JDK-8344118, but never really compared the performance to the scalar alternative. We need to fix that with JDK-8367158.

One more thought: maybe it would be ideal to not just have a super-unrolled main loop and a full-vector-size drain-loop. Because that means we have:

pre:   stride 1
main:  stride 64 * super_unroll
drain: stride 64
post:  stride 1

It could be good to have something between the drain and post loops, to handle iteration counts between 2..64 better. In a sense, that is what SegmentBulkOperations.copy and also the GraalVM auto-vectorizer do: they have multiple loops with different vector lengths. Something like:

pre:     stride 1
main:    stride 64 * super_unroll
drain64: stride 64
drain32: stride 32
drain16: stride 16
drain8:  stride 8
drain4:  stride 4
drain2:  stride 2
post:    stride 1

This would be taking the work of JDK-8307084 a step further. Ok, maybe that's too many loops. Maybe we can do every second one or something like that. But this approach is a bit difficult to do with a SuperWord algorithm: here we have to first unroll (maybe 64x), and then vectorize, which means we cannot emit smaller vectors. To get all the different vector sizes, a regular "widening" vectorizer would be more adequate.

An alternative: if the iteration count is too low, just don't vectorize at all. But that leads to a different tradeoff based on profiling: as long as the iteration count is always low, that works fine. But if we occasionally have larger iteration counts, then we would definitely profit from vectorization with super-unrolling.

I do think that addressing low-iteration-count cases is super important. I have heard from many sources, internal and external, that the average array size is not very large, maybe on the order of 10-100, but not really 1000+. I don't have data for that now, but I still think it is worth investing more in the 10-100 range.

------------------------------
Copying from a Slack conversation:

Approach 1
We could profile the average iteration count -> if it is too low, don't vectorize but do scalar loop unrolling instead. Downside: what if we suddenly find larger arrays? Or we have a very mixed use of small and large? Then we don't get good performance on large iteration counts any more.

Approach 2
Like Approach 1, but recompile with vectorization if we ever find a large loop. But what if large arrays occur, but only rarely? Then we are back to bad performance on small arrays.

Approach 3
Multiversion: send small iteration counts to an unvectorized branch, larger ones to a vectorized branch. Maybe, but that means we would now need to multiversion basically everywhere. Not great for code size, and it also has a minor perf impact (pressure on the code cache, and worse code locality).

Approach 4
BulkOp / Graal approach:

pre:     stride 1 -> until aligned
main:    stride 64 * super_unroll
drain64: stride 64
drain32: stride 32
drain16: stride 16
drain8:  stride 8
drain4:  stride 4
drain2:  stride 2
post:    stride 1 -> read "drain1"

If we have a small iteration count, we just end up using one of the later drain loops. E.g. size=14 -> 1 iter in pre, 1 iter in drain8, 1 iter in drain4, 1 iter in drain1=post. This would probably give the best performance over all iteration counts. Maybe there could be some issues with branch prediction, since we now have more branches, not sure. But it would require us to generate vector loops for different vector sizes, which is not something we can easily do with the current architecture.
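
As a rough Java model of the Approach 4 shape (hand-written and purely illustrative: C2 would emit this as machine code, the concrete strides are assumptions, and the VarHandles just stand in for wide vector accesses):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class DrainCopySketch {
    private static final VarHandle LONG =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
    private static final VarHandle INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.nativeOrder());
    private static final VarHandle SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.nativeOrder());

    // Decreasing-stride copy: wide steps first, then progressively smaller "drain"
    // steps, so a small 'len' still only takes a handful of iterations.
    static void copy(byte[] src, byte[] dst, int len) {
        int i = 0;
        for (; i + 32 <= len; i += 32) {                // "main": 32 bytes per iteration
            LONG.set(dst, i,      (long) LONG.get(src, i));
            LONG.set(dst, i + 8,  (long) LONG.get(src, i + 8));
            LONG.set(dst, i + 16, (long) LONG.get(src, i + 16));
            LONG.set(dst, i + 24, (long) LONG.get(src, i + 24));
        }
        if (i + 16 <= len) {                            // drain16
            LONG.set(dst, i,     (long) LONG.get(src, i));
            LONG.set(dst, i + 8, (long) LONG.get(src, i + 8));
            i += 16;
        }
        if (i + 8 <= len) { LONG.set(dst, i, (long) LONG.get(src, i));    i += 8; } // drain8
        if (i + 4 <= len) { INT.set(dst, i, (int) INT.get(src, i));       i += 4; } // drain4
        if (i + 2 <= len) { SHORT.set(dst, i, (short) SHORT.get(src, i)); i += 2; } // drain2
        if (i < len)      { dst[i] = src[i]; }                                      // "post" / drain1
    }
}

With this shape, a small copy only executes a few of the later drain steps instead of paying the entry cost of a wide, super-unrolled main loop.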
14-10-2025

I looked at the benchmark that [~pminborg] ran. Specifically, this one is slower:

openjdk.bench.java.lang.foreign.SegmentBulkCopy.heapSegmentCopyJava-ELEM_SIZE:64
  -24.97%  Linux aarch64
  -33.33%  Linux x64
  -24.99%  MacOSX aarch64
  -42.69%  Windows x64

For larger sizes, we see an opposite trend:

openjdk.bench.java.lang.foreign.SegmentBulkCopy.heapSegmentCopyJava-ELEM_SIZE:512
  18.80%  Linux aarch64
  21.70%  Linux x64
  22.14%  MacOSX aarch64
  21.16%  Windows x64

openjdk.bench.java.lang.foreign.SegmentBulkCopy.heapSegmentCopyJava-ELEM_SIZE:4096
  43.63%  Linux aarch64
  46.13%  Linux x64
  38.44%  MacOSX aarch64
  42.21%  Windows x64

But there are also lines where we have speedups and slowdowns mixed, depending on the platform. The results look a bit random; I'm wondering if they are super stable. Maybe we have some variance/noise due to random alignment. I'll investigate.

This is the test in question:

@Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.copy=31"})
@Benchmark
public void heapSegmentCopyJava() {
    MemorySegment.copy(heapSrcSegment, 0, heapDstSegment, 0, ELEM_SIZE);
}

This eventually delegates to SegmentBulkOperations.copy:

@ForceInline
public static void copy(AbstractMemorySegmentImpl src, long srcOffset,
                        AbstractMemorySegmentImpl dst, long dstOffset,
                        long size) {

    Utils.checkNonNegativeIndex(size, "size");
    // Implicit null check for src and dst
    src.checkAccess(srcOffset, size, true);
    dst.checkAccess(dstOffset, size, false);

    if (size <= 0) {
        // Do nothing
    } else if (size < NATIVE_THRESHOLD_COPY && !src.overlaps(dst)) {
        // 0 < size < FILL_NATIVE_LIMIT : 0...0X...XXXX
        //
        // Strictly, we could check for !src.asSlice(srcOffset, size).overlaps(dst.asSlice(dstOffset, size) but
        // this is a bit slower and it likely very unusual there is any difference in the outcome. Also, if there
        // is an overlap, we could tolerate one particular direction of overlap (but not the other).

        // 0...0X...X000
        final int limit = (int) (size & (NATIVE_THRESHOLD_COPY - Long.BYTES));
        int offset = 0;
        for (; offset < limit; offset += Long.BYTES) {
            final long v = SCOPED_MEMORY_ACCESS.getLongUnaligned(src.sessionImpl(), src.unsafeGetBase(), src.unsafeGetOffset() + srcOffset + offset, !Architecture.isLittleEndian());
            SCOPED_MEMORY_ACCESS.putLongUnaligned(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + dstOffset + offset, v, !Architecture.isLittleEndian());
        }
        int remaining = (int) size - offset;
        // 0...0X00
        if (remaining >= Integer.BYTES) {
            final int v = SCOPED_MEMORY_ACCESS.getIntUnaligned(src.sessionImpl(), src.unsafeGetBase(), src.unsafeGetOffset() + srcOffset + offset, !Architecture.isLittleEndian());
            SCOPED_MEMORY_ACCESS.putIntUnaligned(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + dstOffset + offset, v, !Architecture.isLittleEndian());
            offset += Integer.BYTES;
            remaining -= Integer.BYTES;
        }
        // 0...00X0
        if (remaining >= Short.BYTES) {
            final short v = SCOPED_MEMORY_ACCESS.getShortUnaligned(src.sessionImpl(), src.unsafeGetBase(), src.unsafeGetOffset() + srcOffset + offset, !Architecture.isLittleEndian());
            SCOPED_MEMORY_ACCESS.putShortUnaligned(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + dstOffset + offset, v, !Architecture.isLittleEndian());
            offset += Short.BYTES;
            remaining -= Short.BYTES;
        }
        // 0...000X
        if (remaining == 1) {
            final byte v = SCOPED_MEMORY_ACCESS.getByte(src.sessionImpl(), src.unsafeGetBase(), src.unsafeGetOffset() + srcOffset + offset);
            SCOPED_MEMORY_ACCESS.putByte(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + dstOffset + offset, v);
        }
        // We have now fully handled 0...0X...XXXX
    } else {
        // For larger sizes, the transition to native code pays off
        SCOPED_MEMORY_ACCESS.copyMemory(src.sessionImpl(), dst.sessionImpl(),
                src.unsafeGetBase(), src.unsafeGetOffset() + srcOffset,
                dst.unsafeGetBase(), dst.unsafeGetOffset() + dstOffset, size);
    }
}

And I think we should end up in the multiple Java loops with the L/I/S/B variants.
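
Working the ELEM_SIZE=64 case through the code above, as a hypothetical stand-alone calculation (the concrete threshold value below is only a stand-in; the benchmark forces a very high threshold via -Djava.lang.foreign.native.threshold.power.copy=31):

public class SixtyFourByteCase {
    public static void main(String[] args) {
        long size = 64;
        long threshold = 1L << 31;                      // stand-in for NATIVE_THRESHOLD_COPY
        long limit = size & (threshold - Long.BYTES);   // = 64 (low 3 bits cleared)
        long longIterations = limit / Long.BYTES;       // = 8 iterations of the long loop
        long remaining = size - limit;                  // = 0 -> int/short/byte tails skipped
        System.out.println(limit + " " + longIterations + " " + remaining);
    }
}

So for 64 bytes it is the 8-iteration long loop that C2 now vectorizes, and 8 iterations is too few to reach the vectorized main loop, which matches the perfasm analysis above.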
14-10-2025

[~pminborg] Thanks for filing this! I will investigate when I get back from vacation. We may also need to benchmark more values, e.g. all iteration counts from 0..1024. I'm also working on a better benchmark for these cases here: JDK-8367158
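
A hypothetical JMH shape for such a sweep (names and parameter values are illustrative, not the existing SegmentBulkCopy micro):

import java.lang.foreign.MemorySegment;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class SmallCopySweep {
    // Sweep small sizes densely; the interesting range is roughly 0..1024 bytes.
    @Param({"0", "8", "16", "32", "48", "64", "96", "128", "256", "512", "1024"})
    int size;

    MemorySegment src, dst;

    @Setup
    public void setup() {
        src = MemorySegment.ofArray(new byte[1024]);
        dst = MemorySegment.ofArray(new byte[1024]);
    }

    @Benchmark
    public void copy() {
        MemorySegment.copy(src, 0, dst, 0, size);
    }
}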
29-09-2025

Thanks for the report [~pminborg]! Emanuel is currently on vacation but, assuming this is not urgent/blocking, he will have a look once he's back.
29-09-2025

ILW = Slower copying of memory segments, only certain sizes, no workaround? = MLH = P4
22-09-2025