JDK-8361376 : Regressions 1-6% in several Renaissance in 26-b4 only MacOSX aarch64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 26
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: os_x
  • CPU: aarch64
  • Submitted: 2025-07-03
  • Updated: 2025-09-15
  • Resolved: 2025-09-09
JDK 26
26 b15: Fixed
Description
Seems to be related to JDK-8358821

Happens with LogRegression, NaiveBayes, Neo4jAnalytics, and PageRank
Comments
Changeset: f9640398
Branch: master
Author: Dean Long <dlong@openjdk.org>
Date: 2025-09-09 23:27:33 +0000
URL: https://git.openjdk.org/jdk/commit/f96403986b99008593e025c4991ee865fce59bb1
09-09-2025

A pull request was submitted for review.
Branch: master
URL: https://git.openjdk.org/jdk/pull/26399
Date: 2025-07-19 01:39:12 +0000
23-07-2025

I think Martin guessed correctly that the extra overhead is from locking or contention on the lock. Possible solutions I can think of:
1. use a per-nmethod lock
2. use atomic CAS w/o a lock
3. only do self-disarm, skip it in GC threads
Option 3 is based on the unconfirmed theory that the benchmark regression is due to disarm calls from GC threads, and not from self-disarm calls. A GC expert can correct me if I'm wrong, but I think the disarm() that the GC threads are doing is optional. If we skip it, because it's slightly expensive and not all nmethods are going to get entered between GC cycles, then we move the overhead out of the GC and into the mutator thread when it does a self-disarm.
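As a minimal, self-contained sketch of option 2, with std::atomic standing in for HotSpot's Atomic primitives (the names here are illustrative only, and the real change would also have to handle aarch64 memory ordering and instruction-cache details that are ignored here):

    #include <atomic>

    constexpr int kDisarmedValue = 0;   // stand-in for the "disarmed" guard value

    // Returns true if this call performed the armed -> disarmed transition,
    // false if the nmethod was already disarmed or another thread won the race.
    bool try_disarm(std::atomic<int>& guard_word) {
      int observed = guard_word.load(std::memory_order_acquire);
      while (observed != kDisarmedValue) {
        if (guard_word.compare_exchange_weak(observed, kDisarmedValue,
                                             std::memory_order_acq_rel,
                                             std::memory_order_acquire)) {
          return true;   // we disarmed it; no shared lock was needed
        }
        // compare_exchange_weak reloaded 'observed' on failure; loop and re-check
      }
      return false;      // already disarmed
    }

No shared lock is taken, so GC worker threads and mutator self-disarms of different nmethods never contend with each other.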
16-07-2025

The G1 "Pause Remark" phase is ~5x slower in the 327 (the patch in question) build. See the attached chart. Under debug log, there is a phase "G1 Complete Cleaning" where the time is going. baseline: [88.305s][debug][gc,phases,start ] GC(1900) G1 Complete Cleaning [88.308s][debug][gc,phases ] GC(1900) G1 Complete Cleaning 2.400ms patch: [25.202s][debug][gc,phases,start ] GC(444) G1 Complete Cleaning [25.217s][debug][gc,phases ] GC(444) G1 Complete Cleaning 14.955ms
11-07-2025

I did the same screen shot sequence on an x64 mini and there is no noticeable increase in system time. I will see if better profiling is available on Mac.
11-07-2025

Thanks Eric. Higher system time could indeed be the key. I don't know how to investigate system time details on Mac, but it may point to higher lock contention like Martin suspected. But then why are we seeing it only on macOS aarch64?
10-07-2025

I made a custom build of Renaissance JMH so it would run NaiveBayes with only 3 worker threads. In the attached screen shot, you can see it stays off the efficiency cores fairly well, but in the right-hand run, which is the JDK-8358821 change, there is much more system time (red) than in the left-side run, which is the immediately previous CI build without JDK-8358821. The scores for this 3-thread version still repro the problem:
baseline: 697.149 ± 2.636 ms/op
JDK-8358821: 773.131 ± 6.625 ms/op
So it seems like the system time might be the key here? For reference I added a similar pic of the default jar with 8 workers, and the same high system time is visible there.
10-07-2025

OCI has a "boost" feature, but I don't know if it is implemented by moving from economy to perf cores. I noticed that Renaissance-ScalaKmeans is very noisy and doesn't give reproducible results even with the same build, but if I use ActiveProcessorCount the results are stable.
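For reference, capping the JVM's view of available CPUs is a one-flag change; something along these lines (jar name illustrative) should keep the Spark local executor at 12 threads on the M4 mentioned below:

    java -XX:ActiveProcessorCount=12 -jar renaissance.jar naive-bayes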
09-07-2025

We will discuss it in the perf team, but it seems like if all the runs use all the cores and we are running a lot of samples, it will all average out? Also, in the real world, everything is running in cloud VMs, and this situation would not occur because the cloud vendor would not map customer VMs across perf and economy cores?
09-07-2025

It seems that some of these benchmarks automatically scale the number of threads to the number of cores, which seems problematic if the cores are not all the same. It also means that there are no spare cores for background tasks on the machine, so background tasks compete with the benchmark and cause noise.
08-07-2025

I noticed on my Mac M4 that the cores are not all the same: Total Number of Cores: 16 (12 performance and 4 efficiency) The benchmark is running on 16 threads: " NOTE: 'naive-bayes' benchmark uses Spark local executor with 16 (out of 16) threads. " It seems like this could lead to inconsistent results. [~ecaspole], wouldn't it be better to benchmark with 12 threads or less, to match the number of performance cores?
08-07-2025

It seems that moving the ThreadWXEnable didn't help. I am also trying the benchmarks with ZGC, but so far that isn't helping either.
08-07-2025

ILW = performance regression, MacOS aarch64 only, no workaround = MMH = P3
08-07-2025

[~mdoerr], yes, good theory. I'll try moving the ThreadWXEnable.
07-07-2025

Because the regression was only observed on MacOS aarch64: Is the ThreadWXEnable too expensive? If so, it can be moved like this: https://github.com/TheRealMDoerr/jdk/commit/522b1ef2e75509d91ac18a1acd27275fc0305e8e Could somebody check if that helps, please?
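For context, ThreadWXEnable wraps the per-thread W^X toggle that macOS requires before writing to JIT code pages. A simplified sketch of the underlying mechanism, assuming the libpthread call shown below (the real HotSpot code saves and restores the previous mode per thread):

    #include <pthread.h>   // pthread_jit_write_protect_np (macOS-only API)
    #include <string.h>

    void patch_jit_code(void* code, const void* new_bytes, size_t len) {
      pthread_jit_write_protect_np(0);   // make MAP_JIT pages writable for this thread
      memcpy(code, new_bytes, len);      // e.g. update an nmethod's entry/guard
      pthread_jit_write_protect_np(1);   // back to execute-only for this thread
      // A real patch would also invalidate the instruction cache
      // (sys_icache_invalidate) before the code is re-executed.
    }

If every disarm has to flip this state, or take a lock while the state is flipped, the cost would exist only on macOS aarch64, which would fit the platform-specific regression.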
07-07-2025

Maybe we should have a per-nmethod lock as ZGC already has? Maybe that could be moved to shared code?
04-07-2025

Dean, could you please have a look?
04-07-2025

My guess is that there's contention on the lock when many threads are disarming many nmethods: https://github.com/openjdk/jdk/blob/da0a51ce97453a47b2c7d11e5206774232309e69/src/hotspot/share/gc/shared/barrierSetNMethod.cpp#L81
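To make the hypothesis concrete, here is a self-contained toy of the suspected pattern, with made-up names and sizes: every disarm takes one process-wide mutex, so GC worker threads serialize even when they are disarming different nmethods:

    #include <cstddef>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct FakeNMethod { int guard_word = 1; };     // 1 = armed, 0 = disarmed

    std::mutex g_disarm_lock;                       // the single shared lock (the suspect)

    // Each worker disarms its own slice, but still takes the global lock per nmethod.
    void disarm_slice(std::vector<FakeNMethod>& nms, std::size_t begin, std::size_t end) {
      for (std::size_t i = begin; i < end; i++) {
        std::lock_guard<std::mutex> lock(g_disarm_lock);
        nms[i].guard_word = 0;
      }
    }

    int main() {
      const std::size_t kNMethods = 200000;         // arbitrary toy sizes
      const std::size_t kWorkers  = 8;
      std::vector<FakeNMethod> nms(kNMethods);
      std::vector<std::thread> workers;
      for (std::size_t w = 0; w < kWorkers; w++) {
        workers.emplace_back(disarm_slice, std::ref(nms),
                             w * kNMethods / kWorkers, (w + 1) * kNMethods / kWorkers);
      }
      for (std::thread& t : workers) t.join();
      return 0;
    }

With many workers and many nmethods, the time goes into lock handoffs rather than the trivial store, which would match the larger "G1 Complete Cleaning" times reported above.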
03-07-2025