Bug ID: JDK-8299808 C2 SuperWord: investigate performance difference to ArrayFill

Versions (Unresolved/Resolved/Fixed)

Other: tbd (Unresolved)

We should try to get SuperWord performance as close as possible to array-fill intrinsics.

Check for existing benchmarks, including related benchmarks for MemorySegment with native memory.
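As a starting point before proper benchmarks are located, here is a rough sketch of a timing harness for a plain fill loop. It is not a JMH benchmark and the numbers it prints are only indicative. The class and method names (FillBench, fill, nanosPerCall) are made up for illustration. The idea would be to run the same class under different flag settings, for example with the default settings and with -XX:LoopUnrollLimit=1 as mentioned in the original description, and compare the output.

```java
public class FillBench {
    // Plain counted loop: the shape C2 may unroll and vectorize, or turn into a fill call.
    static void fill(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }

    // Crude timing: average nanoseconds per call, measured after an equally long warm-up.
    static double nanosPerCall(Runnable r, int iterations) {
        for (int i = 0; i < iterations; i++) {
            r.run(); // warm-up so the JIT has a chance to compile the loop
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            r.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        // Small sizes matter here: the original example fills only 14 elements.
        for (int size : new int[] {16, 64, 1024, 65536}) {
            int[] a = new int[size];
            double ns = nanosPerCall(() -> fill(a, 1), 20_000);
            System.out.printf("size=%d  %.1f ns/call%n", size, ns);
        }
    }
}
```

Running it once per flag combination keeps each measurement in its own JVM, so one configuration's compiled code cannot influence the other.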

Related issues:
JDK-8344085, JDK-8307084, JDK-8342692

-------------------- Original Description ---------------

During JDK-8299179, I found that ArrayFill is only done if somehow unrolling is disabled or otherwise circumvented. But it would be beneficial to prefer ArrayFill over unrolling.

For example, in this case we unroll instead of using ArrayFill, which is surprising to me and should be fixed:

    // Assumed declaration: a static field, so that the array escapes the method.
    static int[] intA;

    static void test() {
        // Note: currently unrolled, not intrinsified (unless -XX:LoopUnrollLimit=1)
        int[] arr = new int[22];
        for (int i = 6; i < 20; i++) {
            arr[i] = 1;
        }
        intA = arr;
    }

Suggestion: fix the bug, and add an IR verification that we indeed get a node like this:
260  CallLeafNoFP  === 120 1 59 8 9 (258 40 270 1 ) [[ 262 263 ]] # jint_fill void ( NotNull *+bot, int, long, half )

Do this for byte, short, int (long not yet implemented).
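A minimal sketch of what the per-type test methods could look like, mirroring the int example above for byte and short. The class name FillKinds and the fields byteA and shortA are hypothetical. The actual IR verification would need HotSpot's IR test framework and is not shown here; this only checks that the loops produce the expected array contents.

```java
public class FillKinds {
    // Static fields let the arrays escape, as intA does in the test above.
    static byte[] byteA;
    static short[] shortA;
    static int[] intA;

    static void fillBytes() {
        byte[] arr = new byte[22];
        for (int i = 6; i < 20; i++) {
            arr[i] = 1;
        }
        byteA = arr;
    }

    static void fillShorts() {
        short[] arr = new short[22];
        for (int i = 6; i < 20; i++) {
            arr[i] = 1;
        }
        shortA = arr;
    }

    static void fillInts() {
        int[] arr = new int[22];
        for (int i = 6; i < 20; i++) {
            arr[i] = 1;
        }
        intA = arr;
    }

    public static void main(String[] args) {
        fillBytes();
        fillShorts();
        fillInts();
        // Elements 6..19 are filled; everything else keeps its default value.
        System.out.println(byteA[5] + " " + byteA[6] + " " + byteA[19] + " " + byteA[20]);
        System.out.println(shortA[5] + " " + shortA[6] + " " + shortA[19] + " " + shortA[20]);
        System.out.println(intA[5] + " " + intA[6] + " " + intA[19] + " " + intA[20]);
    }
}
```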

Further, there is some commented-out code in PhaseIdealLoop::intrinsify_fill that was supposed to detect whether the fill overwrites the whole array and, if so, remove the zeroing. It is not clear whether this optimization is now done elsewhere or was simply forgotten. It could also be that, since we currently mostly unroll, the unrolled stores let the compiler detect that the initialization is overwritten, so the zeroing is dropped. Hence, if we switch from unrolling to ArrayFill without also removing the initialization, we may see a slowdown when filling the whole array.
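To illustrate the two cases that matter for the zeroing question, here is a small sketch with hypothetical names (WholeVsPartialFill, fillWhole, fillPartial). It says nothing about what C2 actually emits; it only shows when the default zeroing is observable at the Java level and when it is not.

```java
public class WholeVsPartialFill {
    // Every element is overwritten right after allocation, so the zeroing
    // performed by 'new int[n]' can never be observed by the program.
    static int[] fillWhole(int n, int value) {
        int[] arr = new int[n];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = value;
        }
        return arr;
    }

    // Only [from, to) is overwritten, so the elements outside that range
    // must still read as zero and the zeroing cannot be dropped entirely.
    static int[] fillPartial(int n, int from, int to, int value) {
        int[] arr = new int[n];
        for (int i = from; i < to; i++) {
            arr[i] = value;
        }
        return arr;
    }

    public static void main(String[] args) {
        int[] whole = fillWhole(22, 1);
        int[] partial = fillPartial(22, 6, 20, 1);
        System.out.println(whole[0] + " " + whole[21]);
        System.out.println(partial[0] + " " + partial[6] + " " + partial[21]);
    }
}
```

The whole-array case is the one where keeping the initialization would add redundant work on top of the fill.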

[~epeter] Just FYI: we always prefer ArrayFill on AArch64 because it performs much better when the value to be filled is zero (the default-initialization use case). AArch64 can use a cache maintenance instruction (DC ZVA) to zero large blocks. JMH numbers for this can be found in the TestArrayFill.zero**Array lines of https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2020-June/038605.html
06-03-2023
Hmm, currently OptimizeFill is guarded by avx512vlbw support: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L1681 Does your computer not support it? Or might there be another reason?
10-01-2023
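One way to check what a given machine ends up with is to query the flag values from inside the running JVM. The following is a sketch using the standard HotSpotDiagnosticMXBean; the class name FillFlags and the helper flagValue are made up for illustration, and which flags exist depends on the VM build and CPU architecture, so missing flags are reported as "unavailable" rather than treated as errors.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class FillFlags {
    // Returns the current value of a HotSpot flag, or "unavailable" if this VM
    // does not expose a flag with that name (for example, a build without C2).
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // getVMOption throws IllegalArgumentException for unknown flags.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"OptimizeFill", "LoopUnrollLimit", "UseAVX"}) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

The reported values reflect the flags after the VM has adjusted them for the detected CPU features, which is the state that matters for the question above.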
Yes, those. But, as Jatin said in a comment on PR #5967, the fill code was not optimized for long/double values: https://github.com/openjdk/jdk/pull/5967#issuecomment-948697276
10-01-2023
I was referring to JDK-8275047 and JDK-8247307. There is some evaluation in https://github.com/openjdk/jdk/pull/5967, [~jbhateja] might know more.
10-01-2023
[~thartmann] also told me that, but then said that the hand-written code for ArrayFill has since been improved and might now be better. [~kvn] Do you know which RFE did the work of studying unrolling+vectorization?
10-01-2023
There may be some cases where ArrayFill is faster, but that is not always so.
09-01-2023
I think there was an RFE and a study showing that unrolling+vectorization is better than the hand-written code for ArrayFill.
09-01-2023

Relates :	JDK-8299179 - ArrayFill with store on backedge needs to reduce length by 1
Relates :	JDK-8344085 - C2 SuperWord: improve vectorization for small loop iteration count