Bug ID: JDK-8324221 Benchmark regressions after JDK-8320317 and JDK-8321371

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 23

Priority: P3
Status: Closed
Resolution: Won't Fix
OS: linux

Submitted: 2024-01-19
Updated: 2024-02-29
Resolved: 2024-02-23

Benchmark regressions after the introduction of JDK-8320317 and JDK-8321371 in 23+5:

-6.4%  SPECjvm2008-XML.transform-ZGC  on  Linux-aarch64
-3%     DaCapo-pmd-large  on  Linux-x64

Regressions were isolated to jdk-23+5-230 by measuring CI builds.

This makes me wonder if SPECjvm2008 should be modified to not run when ZGC or Generational ZGC is in use... [~ecaspole] - is this a reasonable change to make?
29-02-2024
Removing NotRunnable lead to more spinning on monitors. Since SPECjvm2008 are intended to run 100% busy, and ZGC is intended to give low pause times, it's unlikely that anyone would run their application server 100% busy when using ZGC. The extra spinning on monitors is using more CPU, thereby starving ZGC activity. This is not a too realistic situation for real world use of ZGC. This single regression is understood and acceptable assuming other benchmarks show a benefit from JDK-8320317, which they do. Removing NotRunnable will also simplify the JVM code, which is also good from a maintenance perspective. The small DaCapo-pmd-large regression is also acceptable, given that other benchmarks show a benefit. Bottom line, this bug will be closed as "Will Not Fix".
23-02-2024
[~dice] Benchmarks were not run under virtualization.
14-02-2024
For historical background, NotRunnable() was connected, on Solaris, to the "schedctl" facility. This let a thread efficiently query the OS scheduling state of threads with just a simple load in user-land. It was generally prudent to avoid spinning if the current monitor owner wasn't on a CPU, as the odds of that spinning episode being successful (avoiding a voluntary context switch) were low-ish. The waiting thread was just better off blocking immediately. Schectl was documented although considered unstable, but we lobbied Solaris to provide a longer-term stable interface that the JVM could use. On other platforms I just connected NotRunnable to check the local JVM-specific thread state, which wasn't nearly as effective as using schedctl. As noted elsewhere, it's possible that using NotRunnable in its current form on linux might provide "accidentally" confer benefit simply by extended the spinning period (because of the call overheads) instead of actually informing the waiting thread about the owner's state. It's worth noting that there's some recent effort in the linux community to try emulate bits of the schedctl functionality with eBPF. At the moment, this isn't mature but it's something to keep an eye on. I never looked into it, but an older idea was to inspect the contents of the KUSER_SHARED_DATA on windows and try to inform NotRunnable() accordingly, based on the number of idle CPUs available. Having said that, it's likely that NotRunnable should still be removed. Finally, were the benchmarks run under virtualization? I ask because the monitor spin loop is sensitive to PLE (pause loop exit) detection in virtual machines. PLE is something of a confounding factor when benchmarking spin loop variations.
14-02-2024
The NotRunnable function provided an exit out of a spin loop if the thread that owns the lock that we're spinning on was observed to either be stalled on acquiring another lock or was in a blocked or native state. pmd uses one core but is multi threaded. I observed that the threads that own the monitor allocate arrays and stop for GC. The states observed are transient, but still provide a quick exit from the spin loop. The improvement to h2 seems to be because of removing NotRunnable and not exiting spin loop. The improvement to h2 seems to make the change worth keeping. But it would be good to solve the mystery of SPECjvm2008-XML.transform-ZGC (and only ZGC) having a 6% regression because of this change.
14-02-2024
[~coleenp] I've now rerun the test multiple times and unfortunately 'yield' does not seem to help the aarch64 case. I might be different reasons for this. When I first removed NotRunnable and got regression on aarch64, I saw that implementing SpinPause with a single yield instruction helped to reduce the regression. However when I implemented SpinPause on aarch64 for bsd it was decided that it should be configurable, so it's not just a single yield instruction that got merged. And also, you wrote "...may be that the extra loads to other places in NotRunnable, changes the behavior..." which might be just what's happening. Calling NotRunnable in the TrySpin loop might have acted as a SpinPause. [~aboldtch] Alerted me to the fact that SpinPause was only called once per 256 turns of the loop. Moving that SpinPause to where NotRunnable used to be fixed the regression in both DaCapo-pmd-large and SPECjvm2008-XML.transform-ZGC. Unfortunately it seems to hurt other tests like DaCapo-h2-large.
05-02-2024
[~fbredberg] Does 'yield' help the aarch64 case? have you rerun this?
29-01-2024
It may be that the extra loads to other places in NotRunnable, changes the behavior of some of the caching, making the spin pause more effective (?) on linux-x64. The linux-aarch64 regression might be explained by the SpinPause default of none.
29-01-2024

Relates :	JDK-8321371 - SpinPause() not implemented for bsd_aarch64/macOS
Relates :	JDK-8320317 - ObjectMonitor NotRunnable is not really an optimization