Bug ID: JDK-8349452 Performance regression for Arrays.fill() with AVX512

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 18,21,25,26

Priority: P3
Status: Open
Resolution: Unresolved
OS: generic
CPU: generic

Submitted: 2025-02-03
Updated: 2025-05-06

Versions (Unresolved/Resolved/Fixed)

JDK 26: Unresolved

Related Reports

Causes :

JDK-8275047 - Optimize existing fill stubs for AVX-512 target

Description

ADDITIONAL SYSTEM INFORMATION :
# Java version
java 23.0.2 2025-01-21
java 21.0.6 2025-01-21 LTS

# Operating system details
$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

A DESCRIPTION OF THE PROBLEM :
Performance regression for Arrays.fill() in 23.0.2 and 21.0.6 compared to 17.0.12.

REGRESSION : Last worked in version 17.0.14

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
The following steps show how to reproduce the issue in an Ubuntu Linux
environment, together with the corresponding results.

```
# set corresponding JAVA_HOME before running the commands

17.0.12-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  0.81s user 0.01s system 100% cpu 0.822 total
java -XX:TieredStopAtLevel=4 ByteMatrix  1.47s user 0.00s system 100% cpu 1.461 total

21.0.6-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  0.82s user 0.01s system 100% cpu 0.836 total
java -XX:TieredStopAtLevel=4 ByteMatrix  4.22s user 0.01s system 100% cpu 4.214 total

23.0.2-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  0.84s user 0.01s system 100% cpu 0.844 total
java -XX:TieredStopAtLevel=4 ByteMatrix  4.17s user 0.01s system 100% cpu 4.167 total
```

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The performance of Arrays.fill() should be similar in newer versions.
ACTUAL -
As shown above, there is a potential performance issue in the C2 JIT compiler.
In addition, the issue only appears when `height` is small. If we change
`ByteMatrix bM = new ByteMatrix(90, 1);` to `ByteMatrix bM = new ByteMatrix(90, 20);`,
the results look like this:
```
17.0.12-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  5.97s user 0.01s system 100% cpu 5.976 total
java -XX:TieredStopAtLevel=4 ByteMatrix  2.56s user 0.01s system 100% cpu 2.563 total

21.0.6-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  6.11s user 0.02s system 100% cpu 6.126 total
java -XX:TieredStopAtLevel=4 ByteMatrix  2.22s user 0.02s system 83% cpu 2.677 total

23.0.2-oracle:
java -XX:TieredStopAtLevel=1 ByteMatrix  9.54s user 0.01s system 87% cpu 10.913 total
java -XX:TieredStopAtLevel=4 ByteMatrix  2.26s user 0.02s system 99% cpu 2.290 total
```
Interestingly, with C2 (`-XX:TieredStopAtLevel=4`) on 21.0.6 and 23.0.2, the program runs faster with the larger `height` than with `height` set to 1.

---------- BEGIN SOURCE ----------
# ByteMatrix.java

```java
import java.util.Arrays;

public final class ByteMatrix {
    private final byte[][] bytes;
    public ByteMatrix(int width, int height) {
        bytes = new byte[(int) width][(int) height];
    }
    public void clear(byte value) {
        for (byte[] aByte : bytes) {
            Arrays.fill(aByte, value);
        }
    }
    public static void main(String[] args) {
        ByteMatrix bM = new ByteMatrix(90, 1);
        int N = 10000000;
        for (int i = 0; i < N; ++i) {
            bM.clear((byte) (i % 256));
        }
    }
}
```
---------- END SOURCE ----------

FREQUENCY : always

Comments

Tentatively deferring this to JDK 26 because it's an old issue. Feel free to still fix this in JDK 25 if there's time left.
06-05-2025
Workaround is to use -XX:-OptimizeFill runtime flag.
10-02-2025
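To confirm whether the `-XX:-OptimizeFill` workaround mentioned above is actually in effect for a running JVM, the flag can be read at run time. This is a minimal sketch, assuming a HotSpot JVM that includes the `jdk.management` module. The class name `OptimizeFillCheck` and the helper `vmFlag` are hypothetical, and the lookup returns an empty result on VMs that do not define the flag (for example builds without C2).

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

// Hypothetical helper class, not part of the bug report.
final class OptimizeFillCheck {

    /** Returns the current value of a HotSpot flag, or empty if this VM does not expose it. */
    static Optional<String> vmFlag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty();
            }
            // getVMOption throws IllegalArgumentException for unknown flags.
            return Optional.of(bean.getVMOption(name).getValue());
        } catch (IllegalArgumentException | UnsupportedOperationException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("OptimizeFill = "
                + vmFlag("OptimizeFill").orElse("<not available on this VM>"));
    }
}
```

Running it once with default flags and once with `-XX:-OptimizeFill` should show whether the option reached the VM.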
JDK-8275047 optimized the fill stub for AVX512 targets, showing 2-5x gains on fill sizes above 64 bytes. For small fill sizes call overhead seems to dominate the performance gains. This is a typical use case for partial inlining, where fill sizes below 64 bytes (vector size) should be inlined while bigger sizes should call optimized stub. We already have partial inlining in place for Arraycopy and mismatch operation.
10-02-2025
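The partial-inlining idea in the comment above can be illustrated at the Java level. This is only a sketch of the dispatch shape, not what C2 emits: the real change would live in the JIT compiler. The 64-byte threshold is taken from the comment's "vector size" remark, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of "inline small sizes, call the stub for large ones".
final class PartialInlineFill {

    // Threshold in bytes; 64 is the AVX-512 vector width cited in the comment.
    static final int INLINE_THRESHOLD = 64;

    static void fill(byte[] a, byte value) {
        if (a.length < INLINE_THRESHOLD) {
            // Small case: a short loop stands in for inlined code with no call overhead.
            for (int i = 0; i < a.length; i++) {
                a[i] = value;
            }
        } else {
            // Large case: delegate to the bulk routine (stands in for the optimized stub).
            Arrays.fill(a, value);
        }
    }

    public static void main(String[] args) {
        byte[] small = new byte[1];
        byte[] large = new byte[256];
        fill(small, (byte) 7);
        fill(large, (byte) 7);
        System.out.println(small[0] + " " + large[255]);
    }
}
```

Note that this is not a workaround: with OptimizeFill enabled, C2 may itself convert the small-case loop into a fill-stub call.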
Performance of optimized fill for different sizes. Each iteration runs with -XX:UseAVX=2 first and -XX:UseAVX=3 second; the values are the "Run. elapsed:" output of each run.

GNR>for i in 1 4 8 16 32 64 96 128 256; do echo "SIZE = $i"; java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -cp . ByteMatrix $i; java -Xbatch -XX:-TieredCompilation -XX:UseAVX=3 -cp . ByteMatrix $i; done

| SIZE | elapsed (UseAVX=2) | elapsed (UseAVX=3) |
|------|--------------------|--------------------|
| 1    | 1349113781         | 6520207560         |
| 4    | 1772088687         | 6506770516         |
| 8    | 2477918993         | 6507166582         |
| 16   | 2753495712         | 6654518500         |
| 32   | 3503449592         | 7200327857         |
| 64   | 8699340657         | 2341319386         |
| 96   | 7332777796         | 7262748797         |
| 128  | 9837441795         | 2815066980         |
| 256  | 10630173370        | 7555739001         |

Without OptimizeFill (Stub)

GNR>for i in 1 4 8 16 32 64 96 128 256; do echo "SIZE = $i"; java -Xbatch -XX:-TieredCompilation -XX:UseAVX=2 -cp . ByteMatrix $i; java -Xbatch -XX:-TieredCompilation -XX:UseAVX=3 -XX:-OptimizeFill -cp . ByteMatrix $i; done

| SIZE | elapsed (UseAVX=2) | elapsed (UseAVX=3, -OptimizeFill) |
|------|--------------------|-----------------------------------|
| 1    | 1349078338         | 1348925407                        |
| 4    | 1768614017         | 1746244560                        |
| 8    | 2450750783         | 2432879829                        |
| 16   | 2754119558         | 2815326161                        |
| 32   | 3611029103         | 3333030676                        |
| 64   | 8700533260         | 8985031446                        |
| 96   | 7332627568         | 6786500170                        |
| 128  | 9866379227         | 10093268601                       |
| 256  | 10628638554        | 10632333394                       |
10-02-2025
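The runs above pass the fill size as a command-line argument, but the modified source is not included in the report. One possible version is sketched below, assuming the argument is the `height` of the matrix and that the elapsed time is printed in nanoseconds as in the other comments. The class name `ByteMatrixSized` and the `run` helper are hypothetical.

```java
import java.util.Arrays;

// Hypothetical size-parameterised variant of the reproducer.
final class ByteMatrixSized {
    private final byte[][] bytes;

    ByteMatrixSized(int width, int height) {
        bytes = new byte[width][height];
    }

    void clear(byte value) {
        for (byte[] row : bytes) {
            Arrays.fill(row, value);
        }
    }

    /** Fills the whole matrix `iterations` times and returns the elapsed nanoseconds. */
    static long run(ByteMatrixSized m, int iterations) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; ++i) {
            m.clear((byte) (i % 256));
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // Assumption: the first argument is the row length (fill size); default 1.
        int height = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        ByteMatrixSized m = new ByteMatrixSized(90, height);
        int n = 10_000_000;
        run(m, n); // warmup so the timed pass runs compiled code
        System.out.println("Run.");
        System.out.println("elapsed: " + run(m, n));
    }
}
```

It could be invoked in the same way as in the loop above, for example with `ByteMatrixSized 64` in place of `ByteMatrix 64`.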
ILW = Performance regression, edge case with Arrays.fill and AVX-512, no workaround but use AVX2 = HLM = P3
05-02-2025
I confirmed that it's a regression from JDK-8275047 in JDK 18 b21.
05-02-2025
[~thartmann] is still running the build-search, so that we can confirm that it is JDK-8275047.
05-02-2025
It looks like it could be a regression from this change in JDK 18: JDK-8275047: Optimize existing fill stubs for AVX-512 target. I see that [~jbhateja] did performance testing back then, but only starting with array sizes >= 10. This regression appears on very small arrays, here 1 element.
05-02-2025
This could be the cause of the regression. [~jbhateja] Can you have a look? JDK-8275047
05-02-2025
Turbo boost can significantly alter the results, because it allows the CPU to work faster (higher clock speed) for a short amount of time, until the CPU heats up and it has to slow down to a sustainable level. Disabling turbo boost means that the CPU runs at a constant clock speed, and so the time measurement is more reliable.

Aaaah, we just found out that the issue is probably between:
- AVX2: no regression
- AVX3 / AVX512: regression.

Maybe there is some AVX512 Arrays.fill intrinsic that is not very good for small arrays.
05-02-2025
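Independent of turbo boost, a single timing can be noisy. One common mitigation, sketched here under the assumption that a handful of repeated rounds is affordable, is to time several rounds and report the median. The class and helper names are hypothetical, and this does not replace a proper harness such as JMH.

```java
import java.util.Arrays;

// Hypothetical helper: median of several timed rounds to damp clock-speed noise.
final class MedianTiming {

    /** Runs `work` `rounds` times and returns the median elapsed time in nanoseconds. */
    static long medianNanos(Runnable work, int rounds) {
        long[] samples = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long t0 = System.nanoTime();
            work.run();
            samples[r] = System.nanoTime() - t0;
        }
        return median(samples);
    }

    /** Median of the samples; for an even count, the midpoint of the two middle values. */
    static long median(long[] samples) {
        long[] s = samples.clone();
        Arrays.sort(s);
        int mid = s.length / 2;
        return (s.length % 2 == 1) ? s[mid] : s[mid - 1] + (s[mid] - s[mid - 1]) / 2;
    }

    public static void main(String[] args) {
        byte[][] matrix = new byte[90][1];
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; ++i) {
                for (byte[] row : matrix) {
                    Arrays.fill(row, (byte) (i % 256));
                }
            }
        };
        medianNanos(work, 3); // warmup rounds
        System.out.println("median elapsed: " + medianNanos(work, 7));
    }
}
```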
Creating a test that fails when it takes too much time, so we can run build-search:

    import java.util.Arrays;

    public final class ByteMatrix {
        private final byte[][] bytes;
        public ByteMatrix(int width, int height) {
            bytes = new byte[(int) width][(int) height];
        }
        public void clear(byte value) {
            for (byte[] aByte : bytes) {
                Arrays.fill(aByte, value);
            }
        }
        public static void main(String[] args) {
            System.out.println("Init and Warmup.");
            ByteMatrix bM = new ByteMatrix(90, 1);
            int N = 10000000;
            for (int i = 0; i < N; ++i) {
                bM.clear((byte) (i % 256));
            }
            System.out.println("Run.");
            long t0 = System.nanoTime();
            for (int i = 0; i < N; ++i) {
                bM.clear((byte) (i % 256));
            }
            long t1 = System.nanoTime();
            long t = t1 - t0;
            System.out.println("elapsed: " + (t));
            if (t > 4000000000L) {
                throw new RuntimeException("too slow");
            }
        }
    }

    /oracle-work/jdk-17.0.12/bin/java -XX:TieredStopAtLevel=4 ByteMatrix.java
    Init and Warmup.
    Run.
    elapsed: 2311643937

    /oracle-work/jdk-21.0.6/bin/java -XX:TieredStopAtLevel=4 ByteMatrix.java
    Init and Warmup.
    Run.
    elapsed: 8180353387
    Exception in thread "main" java.lang.RuntimeException: too slow
        at ByteMatrix.main(ByteMatrix.java:31)
05-02-2025
[~eaymane] You tested it on m1. I'm running it on x64 linux. Here is the slightly modified test - I just added some print statements and timing to ensure I'm not measuring time before my code even runs.

    import java.util.Arrays;

    public final class ByteMatrix {
        private final byte[][] bytes;
        public ByteMatrix(int width, int height) {
            bytes = new byte[(int) width][(int) height];
        }
        public void clear(byte value) {
            for (byte[] aByte : bytes) {
                Arrays.fill(aByte, value);
            }
        }
        public static void main(String[] args) {
            System.out.println("Run.");
            long t0 = System.nanoTime();
            ByteMatrix bM = new ByteMatrix(90, 1);
            int N = 10000000;
            for (int i = 0; i < N; ++i) {
                bM.clear((byte) (i % 256));
            }
            long t1 = System.nanoTime();
            System.out.println("elapsed: " + (t1 - t0));
        }
    }

I also disabled turbo-boost, which is critical for such measurements.

    /oracle-work/jdk-23.0.2/bin/java -XX:TieredStopAtLevel=1 ByteMatrix
    Run.
    elapsed: 1607271407
    /oracle-work/jdk-23.0.2/bin/java -XX:TieredStopAtLevel=4 ByteMatrix
    Run.
    elapsed: 8298166091
    /oracle-work/jdk-21.0.6/bin/java -XX:TieredStopAtLevel=4 ByteMatrix
    Run.
    elapsed: 8197218625
    /oracle-work/jdk-21.0.6/bin/java -XX:TieredStopAtLevel=1 ByteMatrix
    Run.
    elapsed: 1605899936
    /oracle-work/jdk-17.0.12/bin/java -XX:TieredStopAtLevel=1 ByteMatrix
    Run.
    elapsed: 1587202016
    /oracle-work/jdk-17.0.12/bin/java -XX:TieredStopAtLevel=4 ByteMatrix
    Run.
    elapsed: 2361920526
    /oracle-work/jdk-17.0.15/bin/java -XX:TieredStopAtLevel=1 ByteMatrix
    Run.
    elapsed: 1590344130
    /oracle-work/jdk-17.0.15/bin/java -XX:TieredStopAtLevel=4 ByteMatrix
    Run.
    elapsed: 2197434416

--------------------------------------

This looks like something really did get SIGNIFICANTLY slower from JDK17 -> JDK21. I'll investigate a little more, and maybe write a more pin-pointed benchmark.
05-02-2025
Unable to reproduce on an m1 mac system. The results are for the smaller 'height' value, and are in milliseconds:

    % java --version
    java 23 2024-09-17
    Java(TM) SE Runtime Environment (build 23+37-2369)
    Java HotSpot(TM) 64-Bit Server VM (build 23+37-2369, mixed mode, sharing)
    % java -XX:TieredStopAtLevel=1 ByteMatrix
    938
    % java -XX:TieredStopAtLevel=4 ByteMatrix
    1808
    ----------------
    % java --version
    java 21.0.6 2025-01-21 LTS
    Java(TM) SE Runtime Environment (build 21.0.6+8-LTS-188)
    Java HotSpot(TM) 64-Bit Server VM (build 21.0.6+8-LTS-188, mixed mode, sharing)
    % java -XX:TieredStopAtLevel=1 ByteMatrix
    937
    % java -XX:TieredStopAtLevel=4 ByteMatrix
    1770
    ----------------
    % java --version
    java 17.0.12 2024-07-16 LTS
    Java(TM) SE Runtime Environment (build 17.0.12+8-LTS-286)
    Java HotSpot(TM) 64-Bit Server VM (build 17.0.12+8-LTS-286, mixed mode, sharing)
    % java -XX:TieredStopAtLevel=1 ByteMatrix
    985
    % java -XX:TieredStopAtLevel=4 ByteMatrix
    1766
04-02-2025