Bug ID: JDK-8372153 AArch64: Performance regression in long reduction microbenchmarks after JDK-8340093

Versions (Unresolved/Resolved/Fixed)

JDK 26: Unresolved

JDK-8340093 caused performance regressions in several long reduction microbenchmarks on SVE machines.
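For readers unfamiliar with the benchmark, the following is a minimal sketch of the kind of long reduction loops involved. The class name, method names, and loop bodies are assumptions inferred from the benchmark names in this report; they are not copied from the VectorReduction2 source.

```java
// Hypothetical sketch of long reduction kernels; names and loop bodies are
// assumptions based on the benchmark names, not the actual JMH source.
final class LongReductionSketch {

    // Add reduction over element-wise products (cf. longAddDotProduct).
    // Assumes a and b have the same length.
    static long addDotProduct(long[] a, long[] b) {
        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    // Multiply reduction over element-wise products (cf. longMulDotProduct).
    // Assumes a and b have the same length.
    static long mulDotProduct(long[] a, long[] b) {
        long acc = 1;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i] * b[i];
        }
        return acc;
    }

    // Plain multiply reduction (cf. longMulSimple).
    static long mulSimple(long[] a) {
        long acc = 1;
        for (int i = 0; i < a.length; i++) {
            acc *= a[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3};
        long[] b = {4, 5, 6};
        System.out.println(addDotProduct(a, b));
        System.out.println(mulDotProduct(a, b));
        System.out.println(mulSimple(new long[] {2, 3, 4}));
    }
}
```

Every kernel in this sketch contains a long multiplication, either in the reduction itself or in the element-wise product feeding it.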

unit = ns/op
WI = with cost model
WO = without cost model
P0 = with cost model, but auto vectorization disabled, i.e. -XX:AutoVectorizationOverrideProfitability=0

128-bit SVE machine:

Benchmark                                             WI vs WO    WI vs P0
VectorReduction2.NoSuperword.longAddDotProduct        23.21%      22.92%
VectorReduction2.NoSuperword.longMulDotProduct        18.25%      17.96%
VectorReduction2.NoSuperword.longMulSimple            21.11%      21.16%
VectorReduction2.WithSuperword.longAddDotProduct      22.92%      23.03%
VectorReduction2.WithSuperword.longMulDotProduct      18.23%      18.19%
VectorReduction2.WithSuperword.longMulSimple          21.74%      21.04%

256-bit SVE machine:

Benchmark                                             WI vs WO    WI vs P0
VectorReduction2.WithSuperword.longMulDotProduct      39.32%      39.32%
VectorReduction2.WithSuperword.longMulSimple          23.88%      23.86%
VectorReduction2.NoSuperword.longMulDotProduct        39.33%      39.35%
VectorReduction2.NoSuperword.longMulSimple            23.87%      23.92%
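
The report does not state how the percentages were computed. A plausible reading, given that the unit is ns/op, is a relative change against the baseline configuration, where a positive value means the run with the cost model is slower. The sketch below assumes that formula; the class name, method name, and sample timings are hypothetical and not taken from the measurements.

```java
// Hypothetical helper: relative change of a candidate timing against a baseline.
// The formula is an assumption; the report does not say how its percentages
// were derived.
final class RelativeChange {

    // Returns (candidate - baseline) / baseline * 100.
    // With ns/op timings, a positive result means the candidate is slower.
    static double percentChange(double baselineNsPerOp, double candidateNsPerOp) {
        return (candidateNsPerOp - baselineNsPerOp) / baselineNsPerOp * 100.0;
    }

    public static void main(String[] args) {
        // Made-up timings for illustration only.
        double withoutCostModel = 100.0; // WO, ns/op
        double withCostModel = 125.0;    // WI, ns/op
        System.out.printf("WI vs WO: %.2f%%%n",
                percentChange(withoutCostModel, withCostModel));
    }
}
```

Under this reading, the negative "patch vs. master" figures reported later in the thread would indicate that the patch is faster than master.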

Initial ILW = Performance regression with long reductions, only on AArch64 and with SVE, no workaround = MLH = P4
21-11-2025
[~fgao] That sounds good to me, I have no rush here. And good to hear that JDK-8343689 does improve things! So at least for those cases the simple reductions are now enabled with JDK-8340093, so we have a mixed bag of improvements and regressions.
20-11-2025
[~epeter], thanks for your comments. I tested and compared the performance between the current mainline and the patch from JDK-8343689 on 256-bit SVE machines. The results (ns/op) are as follows:

Benchmark                                             patch vs. master
VectorReduction2.NoSuperword.longMulDotProduct        -54.12%
VectorReduction2.NoSuperword.longMulSimple            -50.16%
VectorReduction2.WithSuperword.longMulDotProduct      -54.10%
VectorReduction2.WithSuperword.longMulSimple          -50.16%

The patch significantly improves performance on 256-bit SVE machines and also outperforms the results from before the cost model was introduced. I haven’t tested on 128-bit SVE yet.

I don’t have enough time to continue with this issue at the moment, and I understand you don’t have access to SVE hardware. If there’s no immediate need, I plan to resume this task in early January. What do you think? Of course, if someone else is interested in the issue, please feel free to take it.
20-11-2025
[~fgao] I wonder what the overlap is with JDK-8343689, maybe it fixes part of the regression?
20-11-2025
Sadly, I don't have access to SVE machines, so I cannot really do performance investigations. One simple fix would be to adjust the cost model, or just disallow long multiplication for auto vectorization. We already do that for NEON, I think.
19-11-2025
[~fgao] Thanks for benchmarking this! It seems these are all cases with long multiplication. Are we actually using vector instructions for these in the backend? If so, are they especially slow / high latency / low throughput?
19-11-2025

Causes :	JDK-8340093 - C2 SuperWord: implement cost model
Relates :	JDK-8343689 - AArch64: Optimize MulReduction implementation