JDK-8355094 : Performance drop in auto-vectorized kernel due to split store
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24,25
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: x86_64
  • Submitted: 2025-04-21
  • Updated: 2025-05-22
  • Resolved: 2025-05-20
Fix Version: JDK 25 b24 (Fixed)
Description
The following benchmark kernel shows around a 20% performance drop with the latest JDK 25 build (25-ea+19-2255) vs the JDK 17 build (17.0.9+11-LTS-201), due to split stores.

https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/VectorLoadToStoreForwarding.java#L197
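
For orientation, the kernel has roughly this shape (an illustrative sketch with made-up names and sizes; the actual code is in the file linked above): each iteration loads the element written 20 iterations earlier and stores the result, so the auto-vectorized loop issues vector loads and vector stores that are 20 ints (80 bytes) apart.

// Illustrative sketch only, not the benchmark source.
class Benchmark20Sketch {
    static final int SIZE = 2048;            // made-up size
    static final int[] data = new int[SIZE];

    static void kernel() {
        for (int i = 20; i < SIZE; i++) {
            data[i] = data[i - 20] + 1;      // load at distance 20 behind the store
        }
    }
}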

Command line: perf stat -e cycles,instructions,mem_inst_retired.all_stores,mem_inst_retired.split_stores java -jar target/benchmarks.jar -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.vm.compiler.VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_20


JDK-25 PMU events
================
   92,58,13,16,800     cycles
   28,45,27,41,807     instructions              #    0.31  insn per cycle
    9,58,42,45,086      mem_inst_retired.all_stores
    4,49,51,55,071      mem_inst_retired.split_stores
      32.510948769     seconds time elapsed
      33.010587000     seconds user
       0.194167000      seconds sys

System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (Cascade Lake server)

Comments
Changeset: 277bb208 Branch: master Author: Emanuel Peter <epeter@openjdk.org> Date: 2025-05-20 13:51:47 +0000 URL: https://git.openjdk.org/jdk/commit/277bb208a2c6de888c57285854b6f5d030021f94
20-05-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/25065 Date: 2025-05-06 13:21:30 +0000
15-05-2025

I now have a draft. I have a new benchmark that is much more sensitive to load and store splitting over the cacheline boundary. https://github.com/openjdk/jdk/pull/25065
06-05-2025

Finally got access to a "Cascade Lake" machine. And these are the numbers I see on it:

./bench 0 0   Warmup... Benchmark... [time] 1215 ms
./bench 0 0   Warmup... Benchmark... [time] 1215 ms
./bench 1 1   Warmup... Benchmark... [time] 1291 ms
./bench 1 1   Warmup... Benchmark... [time] 1302 ms
./bench 1 0   Warmup... Benchmark... [time] 1298 ms
./bench 1 0   Warmup... Benchmark... [time] 1296 ms
./bench 0 1   Warmup... Benchmark... [time] 1372 ms
./bench 0 1   Warmup... Benchmark... [time] 1375 ms

All aligned is fastest (1215 ms). Both misaligned or only the load misaligned (about 1300 ms). Only the store misaligned (1375 ms) is worst. That is a first confirmation that on some platforms (e.g. Cascade Lake) alignment of stores is more important than alignment of loads.
23-04-2025

ILW = Large performance regression (20%), with targeted microbenchmark on specific hardware, no known workaround = MMH = P3
22-04-2025

Ran it on my machine: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz, aka Tiger Lake. Slightly modified to run 10x longer.

echo "1" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

[empeter@emanuel JDK-8355094-split-store]$ ./micro 0 0
[time] 849 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 0 0
[time] 844 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 0 0
[time] 849 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 1 0
[time] 1064 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 1 0
[empeter@emanuel JDK-8355094-split-store]$ ./micro 0 1
[time] 1045 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 0 1
[time] 1046 ms

[~qamai] I also always thought that misaligned loads/stores are not very consequential for performance. On my machine it does not seem to matter whether it is the load or the store that is misaligned, but I do get a very noticeable 20% slowdown if either of them is misaligned. Interestingly, if both are misaligned equally, we are only 15% slower:

[empeter@emanuel JDK-8355094-split-store]$ ./micro 1 1
[time] 1004 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 1 1
[time] 1001 ms
[empeter@emanuel JDK-8355094-split-store]$ ./micro 1 1
[time] 987 ms

It seems that this effect is very micro-architecture dependent. That is probably why I could not measure this slowdown when I made the changes. I need to do some experiments on Cascade Lake. [~jbhateja] Are there any other micro-architectures that are affected equally?

I think the fix would be relatively simple. The code now looks quite different than in JDK 17. We select the alignment reference in VTransform::determine_mem_ref_and_aw_for_main_loop_alignment. Right now, we just pick the reference with the largest width, but we could easily prioritize stores over loads. It is now all about creating a nice JMH benchmark where we can rule out "store to load forwarding", and running it on various micro-architectures to see where we can see the effect.
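
A minimal sketch of what such a JMH benchmark could look like (hypothetical names and parameter values, not the benchmark that was later added in the PR): src and dst are independent arrays, so there is no store-to-load forwarding in the kernel at all, and the store offset controls the relative alignment of loads and stores; since the two cannot both be vector-aligned at the same time, which one ends up split depends on which memory reference C2 picks as the alignment reference for the main loop.

import org.openjdk.jmh.annotations.*;

// Sketch only: no loop-carried dependency, so any slowdown for a non-zero
// offset comes from misaligned (split) vector loads or stores, not from
// store-to-load forwarding failures.
@State(Scope.Thread)
public class SplitStoreAlignmentSketch {
    static final int SIZE = 16 * 1024;

    @Param({"0", "1", "4"})   // relative misalignment, in ints (illustrative values)
    int storeOffset;

    int[] src = new int[SIZE + 64];
    int[] dst = new int[SIZE + 64];

    @Benchmark
    public void copyShifted() {
        for (int i = 0; i < SIZE; i++) {
            dst[i + storeOffset] = src[i] + 1;
        }
    }
}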
22-04-2025

[~jbhateja] Aha, I see. You mentioned "split store". From that alone I did not realize that you were thinking it is an alignment issue, where unaligned stores are split, which could lead to a slowdown. I thought it was a "store to load forwarding failure" problem, because of the forward dependency. In the end, it could really be a mix of issues here, depending on the platform. I'll look at your benchmark. From a quick glance, it looks like you are able to show an effect without the forward dependency, so ruling out "store to load forwarding failure". We should also create JMH versions of it, so we can integrate it. We could also have a Vector API example to show the effects of "store to load forwarding failure" as well as "split store". I'll have to investigate again how we pick the alignment reference. Prioritizing stores would hopefully be an easy fix.
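
For the Vector API idea, a rough sketch (hypothetical example, not code from this issue; it needs --add-modules jdk.incubator.vector): with a 32-byte species and a store offset that is not a multiple of the vector length, roughly every other intoArray store straddles a 64-byte cache line (assuming the array payload happens to start cache-line aligned), and there is no dependency between the arrays, so store-to-load forwarding plays no role.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Sketch only: explicit vectors make the store misalignment deliberate.
class VectorApiSplitStoreSketch {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;  // 8 ints per vector
    static final int SIZE = 1 << 14;
    static final int OFFSET = 5;  // illustrative misalignment, in ints

    static final int[] src = new int[SIZE + SPECIES.length() + OFFSET];
    static final int[] dst = new int[SIZE + SPECIES.length() + OFFSET];

    static void kernel() {
        for (int i = 0; i < SIZE; i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, src, i);
            v.add(1).intoArray(dst, i + OFFSET);  // 20-byte shift: the store is not vector-aligned
        }
    }
}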
22-04-2025

[~qamai] [~epeter], please find the results below.

SPR2> perf stat -C 1 -e cycles,instructions,mem_inst_retired.all_loads,mem_inst_retired.split_loads,mem_inst_retired.all_stores,mem_inst_retired.split_stores taskset -c 1 ./micro 0 0
[time] 42 ms

Performance counter stats for 'CPU(s) 1':

       565,556,910      cycles
     1,050,924,298      instructions              #    1.86  insn per cycle
       459,241,810      mem_inst_retired.all_loads
               418      mem_inst_retired.split_loads
       392,526,961      mem_inst_retired.all_stores
               661      mem_inst_retired.split_stores

       0.145991039 seconds time elapsed

SPR2> perf stat -C 1 -e cycles,instructions,mem_inst_retired.all_loads,mem_inst_retired.split_loads,mem_inst_retired.all_stores,mem_inst_retired.split_stores taskset -c 1 ./micro 16 0
[time] 51 ms

Performance counter stats for 'CPU(s) 1':

       473,498,237      cycles
     1,050,817,513      instructions              #    2.22  insn per cycle
       459,212,100      mem_inst_retired.all_loads
       128,000,390      mem_inst_retired.split_loads
       392,511,009      mem_inst_retired.all_stores
               621      mem_inst_retired.split_stores

       0.122545798 seconds time elapsed

SPR2> perf stat -C 1 -e cycles,instructions,mem_inst_retired.all_loads,mem_inst_retired.split_loads,mem_inst_retired.all_stores,mem_inst_retired.split_stores taskset -c 1 ./micro 0 16
[time] 64 ms

Performance counter stats for 'CPU(s) 1':

       646,746,797      cycles
     1,050,912,229      instructions              #    1.62  insn per cycle
       459,239,820      mem_inst_retired.all_loads
               430      mem_inst_retired.split_loads
       392,532,016      mem_inst_retired.all_stores
       128,000,608      mem_inst_retired.split_stores

       0.167128061 seconds time elapsed
21-04-2025

[~jbhateja] But do misaligned stores have any performance implication in JDK 25? Can you try to craft 3 assembly versions of this benchmark: one that aligns with respect to the loads, one that aligns with respect to the stores, and one that does not align at all, and see if there is a difference between the 3?
21-04-2025

[~qamai], forget JDK 17, I have removed it; the latest PMU events are only with JDK 25.
21-04-2025

I don't think misaligned stores explain the difference here. Your perf stat of JDK-17 shows a completely different instruction count, which may suggest that it runs a completely different machine code sequence. Modern architectures are very good at doing misaligned stores/loads. C++ compilers don't even try to align vectorized memory accesses. https://godbolt.org/z/eEshxPh71
21-04-2025

[~qamai], with jdk17-u: 736.701 ns/op, with jdk25: 905.233 ns/op. Please note that in 25 almost every other store is misaligned, which is the main concern here. Misaligned stores have a higher penalty in comparison to misaligned loads.

   92,41,57,74,459      cycles                                                        (66.61%)
   28,50,05,95,275      instructions              #    0.31  insn per cycle          (83.32%)
    9,55,23,82,331      mem_inst_retired.all_stores                                   (83.36%)
    4,44,73,21,398      mem_inst_retired.split_stores                                 (83.33%)
   10,48,89,59,668      mem_inst_retired.all_loads                                    (83.30%)
         13,43,279      mem_inst_retired.split_loads                                  (83.39%)

      32.697425274 seconds time elapsed
      33.000569000 seconds user
       0.166979000 seconds sys

Commandline: perf stat -e cycles,instructions,mem_inst_retired.all_stores,mem_inst_retired.split_stores,mem_inst_retired.all_loads,mem_inst_retired.split_loads java -Xbatch -XX:-TieredCompilation -jar target/benchmarks.jar -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.vm.compiler.VectorLoadToStoreForwarding.VectorLoadToStoreForwardingSuperWord.benchmark_20
21-04-2025

[~jbhateja] What you could do: Run the newer benchmarks from https://github.com/openjdk/jdk/pull/19880 on that machine. And play around with SuperWordStoreToLoadForwardingFailureDetection. Maybe we just need to set a different default value for your specific machine.
21-04-2025

My question is also how relevant this exact benchmark_20 really is. You found it in my benchmarks, where I was very thorough. It sits on a scale from "small" distances, where I avoid vectorization, to "large" distances, where vectorization is profitable. But in the middle, it is hard to get the cases exactly right with the existing heuristic. Of course, if someone actually needs this exact pattern in an important library, we should invest more. But before we know that for sure, I think the heuristic I implemented was a big step in the right direction, and a more complicated heuristic may not be worth it. [~jbhateja] What do you think?
21-04-2025

[~jbhateja] It seems to me this is a known and acceptable regression from JDK-8325155 / JDK-8334431. I don't know if this case is really worth fixing. Please look at the PR: https://github.com/openjdk/jdk/pull/21521

Essentially, we have the issue with store-to-load forwarding, and we need some kind of cut-off for the distance at which the store-to-load forwarding failure penalty is outweighed by the vectorization benefits. If you set the cut-off slightly off for a type, then you get this kind of 20% regression, because vectorization is a little slower than scalar. Currently, we have a hard threshold, SuperWordStoreToLoadForwardingFailureDetection, which cannot be perfectly accurate. If we wanted to be really accurate, we would probably have to determine a very precise latency for a failed store-to-load forwarding, and then we would need some kind of cost model that takes into account both latency and throughput of each instruction / dependency. But that is incredibly complex. Do you have any good ideas about how to make this more accurate?
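
To make the trade-off concrete, a hedged illustration (the distances are arbitrary examples, not the actual threshold values): with 32-byte (8-int) vectors, a small store-to-load distance means a vector load overlaps a store issued only one or two vector iterations earlier, which risks forwarding stalls, while a large distance makes vectorization clearly profitable.

// Illustration only; the distances are examples, not tuned cut-off values.
class CutoffIllustration {
    // Distance 12 (at least one 8-int vector, so vectorization is legal): each
    // vector load overlaps stores from the previous one or two vector
    // iterations, which can defeat store-to-load forwarding and make the
    // vectorized loop slower than scalar code on some micro-architectures.
    static void smallDistance(int[] a) {
        for (int i = 12; i < a.length; i++) {
            a[i] = a[i - 12] + 1;
        }
    }

    // Distance 256: the loaded elements were stored many iterations earlier,
    // so forwarding failures are unlikely and vectorization clearly pays off.
    static void largeDistance(int[] a) {
        for (int i = 256; i < a.length; i++) {
            a[i] = a[i - 256] + 1;
        }
    }
}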
21-04-2025

Hi [~epeter], it seems earlier we were preferring stores for aligning main loops. https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/opto/superword.cpp#L853
21-04-2025

[~jbhateja] It would be great if you can show the benchmark results, the machine code before and after. From the perf output I see that everything is different and I'm not convinced that split stores are the reason.
21-04-2025

[~epeter], what is your answer on not giving preference to the store as the alignment base? I can't find any mention of that in your comments.
21-04-2025