JDK-8343773 : Superword/auto vectorization of fill pattern is slow on Aarch64 (for small iteration counts)
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • Submitted: 2024-11-07
  • Updated: 2025-05-28
  • Resolved: 2025-05-28
Other
tbd  Resolved
Related Reports
Duplicate :  
Relates :  
Description
The class `jdk.internal.foreign.SegmentBulkOperations` has a method `fill()`. It is hand-written to store in progressively smaller units (long -> int -> short -> byte) so that each store covers as many bytes as possible while traversing the segment.
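
To make the long -> int -> short -> byte idea concrete, here is a minimal sketch written against the public `java.lang.foreign` API (Java 22+). It is not the actual `SegmentBulkOperations.fill()` code, which uses internal `ScopedMemoryAccess` calls; the class name `WideUnitFillSketch` and its `fill` method are hypothetical and exist only for illustration.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration, not the JDK implementation.
final class WideUnitFillSketch {

    // Fills dst with value, always using the widest unit that still fits.
    static void fill(MemorySegment dst, byte value) {
        // Replicate the byte into all 8 lanes of a long.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;
        long len = dst.byteSize();
        long offset = 0;

        // Bulk of the segment: 8 bytes per store.
        for (; offset + Long.BYTES <= len; offset += Long.BYTES) {
            dst.set(ValueLayout.JAVA_LONG_UNALIGNED, offset, pattern);
        }
        // Tail of at most 7 bytes: one int, one short, one byte as needed.
        if (len - offset >= Integer.BYTES) {
            dst.set(ValueLayout.JAVA_INT_UNALIGNED, offset, (int) pattern);
            offset += Integer.BYTES;
        }
        if (len - offset >= Short.BYTES) {
            dst.set(ValueLayout.JAVA_SHORT_UNALIGNED, offset, (short) pattern);
            offset += Short.BYTES;
        }
        if (len - offset >= Byte.BYTES) {
            dst.set(ValueLayout.JAVA_BYTE, offset, value);
        }
    }
}
```

Because every lane of the pattern holds the same byte, the sketch does not depend on byte order. The unaligned layouts are used so that it also works on segments backed by a `byte[]`.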

It would be tempting to replace that method with something like this:

            final int end = (int) dst.length;
            // Rely on aligned auto vectorization
            for (int i = 0 ; i < end; i++) {
                SCOPED_MEMORY_ACCESS.putByte(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + i, value);
            }
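
The snippet above calls JDK-internal methods and cannot be compiled outside the JDK. As a rough public-API analogue of the same byte-at-a-time pattern, assuming Java 22+, one could write the following. The class name `ByteLoopFillSketch` is hypothetical, and no claim is made that C2 compiles this form identically to the internal one, since the public accessors add their own bounds and liveness checks.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration of the simple loop using only public API.
final class ByteLoopFillSketch {

    static void fill(MemorySegment dst, byte value) {
        // Like the internal snippet, this assumes the size fits in an int.
        final int end = (int) dst.byteSize();
        // One byte per iteration; any widening is left to the JIT compiler.
        for (int i = 0; i < end; i++) {
            dst.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }
}
```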

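C2/Graal would then be free to generate even more optimized code, for example superword (auto-vectorized) stores and loop unrolling. However, at least on macOS M1, the loop compiled by C2 is no faster than the manual version, and for heap segments it is sometimes much slower: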


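import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)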
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class SegmentBulk2Fill {

    @Param({"8", "32", "512", "2048", "32768"})
    public int ELEM_SIZE;

    byte[] array;
    MemorySegment heapSegment;
    MemorySegment nativeSegment;
    ByteBuffer buffer;

    @Setup
    public void setup() {
        array = new byte[ELEM_SIZE];
        heapSegment = MemorySegment.ofArray(array);
        nativeSegment = Arena.ofAuto().allocate(ELEM_SIZE, 8);
        buffer = ByteBuffer.wrap(array);
    }


    @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
    @Benchmark
    public void heapSegmentFillJava() {
        heapSegment.fill((byte) 0);
    }

    @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
    @Benchmark
    public void nativeSegmentFillJava() {
        nativeSegment.fill((byte) 0);
    }

}

$ make test TEST="micro:java.lang.foreign.SegmentBulk2Fill"  MICRO="OPTIONS=-p ELEM_SIZE=65536"

Base (current manual implementation)

Benchmark                               (ELEM_SIZE)  Mode  Cnt    Score    Error  Units
SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30  662.179 ± 23.403  ns/op
SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30  650.022 ± 11.491  ns/op

Loop (byte-at-a-time loop shown above)

Benchmark                               (ELEM_SIZE)  Mode  Cnt     Score       Error  Units
SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30  7314.986 ± 11931.163  ns/op
SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30   658.273 ±    16.371  ns/op

The C2 compiler vectorizes the loop with superword (16-byte stores) and unrolls it 16 times, so each main-loop iteration writes 256 bytes:

$ sudo make test TEST="micro:java.lang.foreign.SegmentBulk2Fill"  MICRO="VM_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:CompileCommand=TraceAutoVectorization,*Bulk2Fill.heapSegmentFillJava,ALL;OPTIONS=-prof dtraceasm -p ELEM_SIZE=65536" CONF=macosx-aarch64-debug

....[Hottest Region 1]..............................................................................
c2, level 4, org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub, version 5, compile id 717 

             0x00000001162887ec:   mov w8, #0xe800                // #59392
             0x00000001162887f0:   movk w8, #0x3, lsl #16
             0x00000001162887f4:   cmp w13, w8
             0x00000001162887f8:   csel w11, w12, w13, hi  // hi = pmore
             0x00000001162887fc:   add w13, w11, w4                ;*getstatic SCOPED_MEMORY_ACCESS {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - jdk.internal.foreign.SegmentBulkOperations::fill@46 (line 75)
                                                                       ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                       ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                       ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
            ;; B42: # out( B42 B43 ) <- in( B41 B42 ) Loop( B42-B42 inner main of N374 strip mined) Freq: 6.42119e+14
   0.04%  ?  0x0000000116288800:   add x11, x21, w4, sxtw
          ?  0x0000000116288804:   str q16, [x11]
          ?  0x0000000116288808:   str q16, [x11, #16]
          ?  0x000000011628880c:   str q16, [x11, #32]
          ?  0x0000000116288810:   str q16, [x11, #48]
          ?  0x0000000116288814:   str q16, [x11, #64]
   0.08%  ?  0x0000000116288818:   str q16, [x11, #80]
  15.41%  ?  0x000000011628881c:   str q16, [x11, #96]
   7.20%  ?  0x0000000116288820:   str q16, [x11, #112]
   0.02%  ?  0x0000000116288824:   str q16, [x11, #128]
          ?  0x0000000116288828:   str q16, [x11, #144]
  10.72%  ?  0x000000011628882c:   str q16, [x11, #160]
   0.96%  ?  0x0000000116288830:   str q16, [x11, #176]
          ?  0x0000000116288834:   str q16, [x11, #192]
          ?  0x0000000116288838:   str q16, [x11, #208]
  24.22%  ?  0x000000011628883c:   str q16, [x11, #224]
  34.94%  ?  0x0000000116288840:   str q16, [x11, #240]            ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
          ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
          ?                                                            ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
          ?                                                            ; - jdk.internal.foreign.SegmentBulkOperations::fill@65 (line 75)
          ?                                                            ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
          ?                                                            ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
          ?  0x0000000116288844:   add w4, w4, #0x100              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ?                                                            ; - jdk.internal.foreign.SegmentBulkOperations::fill@68 (line 74)
          ?                                                            ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
          ?                                                            ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
          ?                                                            ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
          ?  0x0000000116288848:   cmp w4, w13
          ?  0x000000011628884c:   b.lt 0x0000000116288800  // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - jdk.internal.foreign.SegmentBulkOperations::fill@43 (line 74)
                                                                       ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                       ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                       ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
            ;; B43: # out( B41 B44 ) <- in( B42 )  Freq: 6.81265e+09
   0.55%     0x0000000116288850:   ldr x6, [x28, #48]              ; ImmutableOopMap {r14=Oop r16=Oop c_rarg2=Oop c_rarg5=Derived_oop_c_rarg2 r19=Oop }
                                                                       ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                                       ; - (reexecute) jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
                                                                       ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                       ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                       ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
             0x0000000116288854:   ldr wzr, [x6]                   ;   {poll}
   0.08%     0x0000000116288858:   ldrb w8, [x28, #1184]
             0x000000011628885c:   cbz x8, 0x0000000116288874
            ;; 0x104DAB6FC
             0x0000000116288860:   mov x8, #0xb6fc                // #46844
                                                                       ;   {runtime_call JavaThread::verify_cross_modify_fence_failure(JavaThread*)}
             0x0000000116288864:   movk x8, #0x4da, lsl #16
             0x0000000116288868:   movk x8, #0x1, lsl #32
             0x000000011628886c:   mov x0, x28
             0x0000000116288870:   blr x8                          ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
                                                                       ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
....................................................................................................
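
The listing shows 16 `str q16` stores at offsets 0 through 240, followed by `add w4, w4, #0x100`, so one pass of the main loop covers 256 bytes. The sketch below is a deliberately simplified model of that arithmetic and nothing more. It ignores the pre-loop and post-loop that C2 also emits, and the class `UnrolledFillModel` with its methods is hypothetical. It only illustrates why small fills may never reach the main loop at all.

```java
// Hypothetical, simplified model of the unrolled main loop in the listing.
final class UnrolledFillModel {

    static final int VECTOR_BYTES = 16; // one `str q16` writes 16 bytes
    static final int UNROLL = 16;       // 16 stores per main-loop pass
    static final int BYTES_PER_PASS = VECTOR_BYTES * UNROLL; // 0x100 = 256

    // Number of full main-loop passes for a fill of the given length.
    static int mainLoopPasses(int length) {
        return length / BYTES_PER_PASS;
    }

    // Bytes that must be handled outside the main loop (assumption:
    // everything that does not fill a whole 256-byte pass).
    static int leftoverBytes(int length) {
        return length % BYTES_PER_PASS;
    }

    public static void main(String[] args) {
        for (int size : new int[] {8, 32, 512, 2048, 32768, 65536}) {
            System.out.printf("%6d bytes: %4d main-loop passes, %3d leftover%n",
                    size, mainLoopPasses(size), leftoverBytes(size));
        }
    }
}
```

Under this model the 8-byte and 32-byte cases run zero main-loop passes, which is consistent with the slow small-size results reported in this issue, although the model itself does not prove where the time goes.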


Comments
With [~pminborg]'s permission, I'll close this as a duplicate of JDK-8344085.
28-05-2025

[~pminborg] sent me yet another series of benchmarks:

Benchmark                                (ELEM_SIZE)  Mode  Cnt        Score        Error  Units
SegmentBulkFill.nativeSegmentFillJava              2  avgt   30        1.609 ±      0.054  ns/op
SegmentBulkFill.nativeSegmentFillJava              3  avgt   30        1.586 ±      0.052  ns/op
SegmentBulkFill.nativeSegmentFillJava              4  avgt   30        1.728 ±      0.042  ns/op
SegmentBulkFill.nativeSegmentFillJava              5  avgt   30        1.752 ±      0.060  ns/op
SegmentBulkFill.nativeSegmentFillJava              6  avgt   30        1.758 ±      0.057  ns/op
SegmentBulkFill.nativeSegmentFillJava              7  avgt   30        1.762 ±      0.065  ns/op
SegmentBulkFill.nativeSegmentFillJava              8  avgt   30        2.420 ±      0.062  ns/op
SegmentBulkFill.nativeSegmentFillJava             64  avgt   30        3.915 ±      0.232  ns/op
SegmentBulkFill.nativeSegmentFillJava            512  avgt   30        6.554 ±      0.174  ns/op
SegmentBulkFill.nativeSegmentFillJava           4096  avgt   30       45.112 ±      0.768  ns/op
SegmentBulkFill.nativeSegmentFillJava          32768  avgt   30      335.697 ±      6.970  ns/op
SegmentBulkFill.nativeSegmentFillJava         262144  avgt   30     4188.986 ±     19.753  ns/op
SegmentBulkFill.nativeSegmentFillJava        2097152  avgt   30    33083.550 ±    190.234  ns/op
SegmentBulkFill.nativeSegmentFillJava       16777216  avgt   30   295656.160 ±   4475.285  ns/op
SegmentBulkFill.nativeSegmentFillJava      134217728  avgt   30  2841740.323 ±  84982.192  ns/op
SegmentBulkFill.nativeSegmentFillLoop              2  avgt   30        1.731 ±      0.022  ns/op
SegmentBulkFill.nativeSegmentFillLoop              3  avgt   30        3.317 ±      0.105  ns/op
SegmentBulkFill.nativeSegmentFillLoop              4  avgt   30        4.070 ±      0.072  ns/op
SegmentBulkFill.nativeSegmentFillLoop              5  avgt   30        4.090 ±      0.099  ns/op
SegmentBulkFill.nativeSegmentFillLoop              6  avgt   30        5.071 ±      0.184  ns/op
SegmentBulkFill.nativeSegmentFillLoop              7  avgt   30        5.309 ±      0.016  ns/op
SegmentBulkFill.nativeSegmentFillLoop              8  avgt   30        5.694 ±      0.166  ns/op
SegmentBulkFill.nativeSegmentFillLoop             64  avgt   30        8.010 ±      0.348  ns/op
SegmentBulkFill.nativeSegmentFillLoop            512  avgt   30       16.987 ±      0.228  ns/op
SegmentBulkFill.nativeSegmentFillLoop           4096  avgt   30       48.399 ±      0.105  ns/op
SegmentBulkFill.nativeSegmentFillLoop          32768  avgt   30      338.067 ±     10.981  ns/op
SegmentBulkFill.nativeSegmentFillLoop         262144  avgt   30     4114.891 ±     28.666  ns/op
SegmentBulkFill.nativeSegmentFillLoop        2097152  avgt   30    32785.146 ±    112.855  ns/op
SegmentBulkFill.nativeSegmentFillLoop       16777216  avgt   30   286126.428 ±   2434.176  ns/op
SegmentBulkFill.nativeSegmentFillLoop      134217728  avgt   30  2790475.242 ±  45321.993  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            2  avgt   30        3.155 ±      0.105  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            3  avgt   30        2.801 ±      0.008  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            4  avgt   30        3.184 ±      0.066  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            5  avgt   30        2.952 ±      0.085  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            6  avgt   30        2.835 ±      0.071  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            7  avgt   30        2.831 ±      0.064  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe            8  avgt   30        2.494 ±      0.010  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe           64  avgt   30        2.558 ±      0.085  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe          512  avgt   30        6.510 ±      0.058  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe         4096  avgt   30       43.075 ±      1.675  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe        32768  avgt   30      358.993 ±     18.847  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe       262144  avgt   30     3648.437 ±    243.745  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe      2097152  avgt   30    57285.130 ±   4353.914  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe     16777216  avgt   30   508089.126 ±  31859.609  ns/op
SegmentBulkFill.nativeSegmentFillUnsafe    134217728  avgt   30  3867608.812 ± 339849.011  ns/op

The results are really quite clear: Loop / auto-vectorizer struggle with small sizes, but starting with about 4k auto-vectorization is consistently fastest. This is yet another reason to work more on improving small iteration count loops: JDK-8344085
28-05-2025
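
To put numbers on "struggle with small sizes", the following sketch divides the `nativeSegmentFillLoop` score by the `nativeSegmentFillJava` score for a few sizes. The scores are copied from the benchmark series in the comment above; the class `FillSlowdownSketch` and the choice of sizes are only illustrative, and error margins are ignored.

```java
// Hypothetical helper: ratio of Loop to Java scores from the comment above.
final class FillSlowdownSketch {

    static final int[] SIZES = {2, 8, 512, 4096, 32768};
    // ns/op for nativeSegmentFillJava at the sizes above.
    static final double[] JAVA_NS = {1.609, 2.420, 6.554, 45.112, 335.697};
    // ns/op for nativeSegmentFillLoop at the sizes above.
    static final double[] LOOP_NS = {1.731, 5.694, 16.987, 48.399, 338.067};

    // A value above 1.0 means the auto-vectorized loop is slower.
    static double slowdown(int index) {
        return LOOP_NS[index] / JAVA_NS[index];
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZES.length; i++) {
            System.out.printf("%6d bytes: Loop/Java = %.2f%n", SIZES[i], slowdown(i));
        }
    }
}
```

The loop is roughly 2.4x slower at 8 bytes and 2.6x slower at 512 bytes, while from 4096 bytes upward the two are within about 7% of each other.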

Fair comment [~chagedorn]. I've updated the issue to be an Enhancement and not a Bug.
13-11-2024

Sounds good, thanks [~pminborg]!
08-11-2024

Is this a regression or just an optimization opportunity? In the latter case, we could also treat it as an RFE.
08-11-2024