Bug ID: JDK-8371551 memory segment bulk copy operation performance when using plain loops

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 26

Priority: P4
Status: New
Resolution: Unresolved

Submitted: 2025-11-10
Updated: 2025-11-10

I ran some experiments comparing the performance of memory segment bulk operations against plain Java loops. Here are some results (unscientific benchmark attached):

FILL

Benchmark                       Mode  Cnt        Score        Error  Units
BulkOps.segment_fill            avgt   10   119323.358 ±   3484.991  ns/op
BulkOps.segment_fill_int_loop   avgt   10  2055700.828 ± 101325.298  ns/op
BulkOps.segment_fill_long_loop  avgt   10    47875.953 ±   1727.711  ns/op

COPY

Benchmark                                     Mode  Cnt      Score      Error  Units
BulkOps.segment_copy_static                   avgt   10  86283.631 ± 4562.169  ns/op
BulkOps.segment_copy_static_int_loop          avgt   10  82480.038 ± 3476.123  ns/op
BulkOps.segment_copy_static_long_loop         avgt   10  78929.262 ± 2100.533  ns/op
BulkOps.segment_copy_static_small             avgt   10      4.346 ±    0.037  ns/op
BulkOps.segment_copy_static_small_int_loop    avgt   10      5.110 ±    0.055  ns/op
BulkOps.segment_copy_static_small_long_loop   avgt   10      4.208 ±    0.026  ns/op

MISMATCH

Benchmark                                 Mode  Cnt       Score       Error  Units
BulkOps.mismatch_large_segment            avgt   10   38011.887 ±  2219.403  ns/op
BulkOps.mismatch_large_segment_int_loop   avgt   10  778412.959 ± 11380.481  ns/op
BulkOps.mismatch_large_segment_long_loop  avgt   10  283515.423 ±  7737.791  ns/op
BulkOps.mismatch_small_segment            avgt   10       2.719 ±     0.097  ns/op
BulkOps.mismatch_small_segment_int_loop   avgt   10       2.963 ±     0.030  ns/op
BulkOps.mismatch_small_segment_long_loop  avgt   10       2.892 ±     0.011  ns/op

Overall, this is really great progress. I think we're close to being able to use plain loops for these routines in the memory segment (and maybe even ByteBuffer) implementation classes.

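One notable hiccup is that loops using int induction variables are still significantly slower than those using long induction variables: roughly 43x slower for fill (2055700 vs. 47876 ns/op) and about 2.7x slower for the large mismatch case (778413 vs. 283515 ns/op). The copy loops are much closer to each other.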
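To make the int versus long distinction concrete, here is a minimal sketch of the two loop shapes next to the bulk call. It is not the attached benchmark: the class name FillLoopShapes and the method names are hypothetical, and the long-loop body is assumed to be the int-loop body with only the induction variable type changed. It assumes JDK 22 or later, where java.lang.foreign is a final API.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class, not part of the attached benchmark.
final class FillLoopShapes {

    // int induction variable compared against a long limit:
    // byteSize() returns long, so the exit check becomes (long) i < limit.
    static void fillIntLoop(MemorySegment segment, byte value) {
        for (int i = 0; i < segment.byteSize(); i++) {
            segment.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    // long induction variable compared against the same long limit.
    static void fillLongLoop(MemorySegment segment, byte value) {
        for (long i = 0; i < segment.byteSize(); i++) {
            segment.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    // Bulk operation provided by the MemorySegment API.
    static void fillBulk(MemorySegment segment, byte value) {
        segment.fill(value);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment a = arena.allocate(1024);
            MemorySegment b = arena.allocate(1024);
            MemorySegment c = arena.allocate(1024);
            fillIntLoop(a, (byte) 42);
            fillLongLoop(b, (byte) 42);
            fillBulk(c, (byte) 42);
            // mismatch returns -1 when two segments have identical contents.
            System.out.println(a.mismatch(b) == -1 && b.mismatch(c) == -1);
        }
    }
}
```

All three methods produce the same segment contents; the only difference that matters to the compiler is the type of the loop variable relative to the long limit.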

Another issue (a known one) is that the intrinsic for mismatch is still faster than a loop. This is due to limitations of auto-vectorization in the presence of control flow, as mismatch needs to branch out of the loop as soon as a mismatch is detected.
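For illustration, a plain-loop mismatch might look like the sketch below. The early return inside the loop body is the control flow in question. The class and method names are hypothetical and the code is not taken from the attached benchmark. The return value follows the convention of MemorySegment.mismatch (-1 when the contents are equal, otherwise the offset of the first difference). It assumes JDK 22 or later.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class, not part of the attached benchmark.
final class MismatchLoopShape {

    // Byte-wise comparison with a long induction variable.
    static long mismatchLongLoop(MemorySegment a, MemorySegment b) {
        long limit = Math.min(a.byteSize(), b.byteSize());
        for (long i = 0; i < limit; i++) {
            if (a.get(ValueLayout.JAVA_BYTE, i) != b.get(ValueLayout.JAVA_BYTE, i)) {
                return i; // early exit: the branch out of the loop
            }
        }
        // Equal over the common prefix: -1 if the sizes match too,
        // otherwise the length of the shorter segment.
        return a.byteSize() == b.byteSize() ? -1 : limit;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment a = arena.allocate(1024);
            MemorySegment b = arena.allocate(1024);
            a.fill((byte) 7);
            b.fill((byte) 7);
            b.set(ValueLayout.JAVA_BYTE, 513, (byte) 8);
            // Both calls should report the same offset.
            System.out.println(mismatchLongLoop(a, b) == a.mismatch(b));
        }
    }
}
```

A fill or copy loop has a single exit governed by the induction variable, whereas this loop has a second, data-dependent exit.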

Linking with JDK-8331659. The BulkOps.segment_fill_int_loop case, with an int iv but a long limit, can be found in TestMemorySegment.java, in test testMemorySegmentBadExitCheck.

10-11-2025

I quickly looked at the code shape of BulkOps.segment_fill_int_loop:

    for (int i = 0 ; i < segment.byteSize() ; i++) {
        segment.set(ValueLayout.JAVA_BYTE, i, (byte)42);
    }

It has an int iv but a long limit. That's the same as this example:

    @Test
    @IR(counts = {IRNode.LOAD_VECTOR_B, "= 0",
                  IRNode.ADD_VB,        "= 0",
                  IRNode.STORE_VECTOR,  "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true", "rvv", "true"})
    // FAILS
    // Exit check: iv < long_limit  ->  (long)iv < long_limit
    // Thus, we have an int-iv, but a long-exit-check.
    // Is not properly recognized by either CountedLoop or LongCountedLoop
    static Object[] testMemorySegmentBadExitCheck(MemorySegment a) {
        for (int i = 0; i < a.byteSize(); i++) {
            long adr = i;
            byte v = a.get(ValueLayout.JAVA_BYTE, adr);
            a.set(ValueLayout.JAVA_BYTE, adr, (byte)(v + 1));
        }
        return new Object[]{ a };
    }

You can find it in compiler/loopopts/superword/TestMemorySegment.java. I'm tracking that with JDK-8331659.

10-11-2025

Thanks [~mcimadamore] for the report! I'll look at the results and the missing vectorization a bit later. I hope we can fill the gaps in the coming JDK versions :)

10-11-2025