JDK-8365493 : Regression on Pet Clinic app with Compact Object Headers
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 26
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86_64
  • Submitted: 2025-08-13
  • Updated: 2025-10-13
Description
Observing a regression in requests/second on the Spring Pet Clinic app when load testing with oha (an HTTP load generator).
The regression ranges from 8% to 10% when running oha remotely from the Pet Clinic app (oha on one machine, Pet Clinic on a remote machine, i.e. the full network stack is involved). The regression is much larger (~30%) when running both oha and Pet Clinic on the same machine but isolating each in its own set of processor ids using numactl.

Steps to reproduce:
1.) Grab a copy of Spring Pet Clinic (https://github.com/spring-projects/spring-petclinic) and follow the instructions there to build it. Instructions on how to launch Pet Clinic can also be found at that URL.

2.) Grab a copy of oha, an HTTP load generator (https://sourceforge.net/projects/oha.mirror/).

3.) Start Pet Clinic. The following command line was used to test with +UseCompactObjectHeaders on a 16-core AMD Rome machine running Linux:
$ JAVA=$HOME/jdks/jdk-26-b10/bin/java
$ numactl --physcpubind=8-15,24-31 ${JAVA} -Xmx16g -Xms16g -Xmn12g -XX:MetaspaceSize=128m -XX:ReservedCodeCacheSize=256m -XX:+UseParallelGC -Xlog:gc*:file=/tmp/coh-parallel-petclinic-gc.log -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -XX:+UseCompactObjectHeaders -jar ./target/spring-petclinic-3.5.0-SNAPSHOT.jar
* Note: you may have to adjust the range of processor ids for the machine you are running on.
For a baseline test without compact object headers, change -XX:+UseCompactObjectHeaders to -XX:-UseCompactObjectHeaders.

4.) Start the http load generator oha with the following command line:
$ OHA=<path to the oha http load generator>
$ numactl --physcpubind=1-7,17-23 ${OHA} -n 500000 --no-tui http://localhost:8080/vets.html
* Note: you may need to adjust the range of processor ids for the machine you are running on.

The oha http load generator will report statistics when it finishes. In its “Summary” section the last line is “Requests/sec:”

Averaging 5 runs with -UseCompactObjectHeaders, the system under test does 1086 requests / second.
Averaging 5 runs with +UseCompactObjectHeaders, the system under test does 688 requests / second.
Comments
Note that I am no longer working for Amazon, and I am not sure that I will be able to spend much time on this issue (or OpenJDK in general) in the future. Does that mean that the application juggles ~1024 locks *and* actually uses them (as opposed to throwaway locks like single-threaded use of StringBuffer) *and* reaches significant contention on them (why else would they all be inflated)? Also, why does that lack of cache capacity only show up on some hardware but not on others? If the solution really is to provide a thread-local cache of 1024 ObjectMonitors, then we need to think hard about how to avoid normal-behaving applications wasting 16KB per thread. Also, a cache of this size would perhaps benefit from a more clever lookup than a linear search (though I am not sure what would be feasible to do in assembly).
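For illustration, the structure being discussed is a small per-thread array of (object, monitor) pairs searched linearly. A minimal standalone C++ sketch (illustrative names, not the HotSpot code) that also shows where the 16KB-per-thread figure comes from:

#include <cstddef>
#include <cstdio>

// Illustrative stand-ins for HotSpot types; not the real definitions.
struct ObjectMonitor;            // opaque inflated monitor
using oop = void*;               // object reference

// A fixed-capacity per-thread cache mapping objects to their inflated
// monitors, looked up with a linear scan (as discussed above).
template <size_t CAPACITY>
struct MonitorCache {
  struct Entry { oop obj; ObjectMonitor* mon; };
  Entry entries[CAPACITY] = {};

  ObjectMonitor* lookup(oop obj) const {
    for (size_t i = 0; i < CAPACITY; i++) {        // linear search
      if (entries[i].obj == obj) return entries[i].mon;
    }
    return nullptr;                                // cache miss -> slow path
  }
};

int main() {
  // With 1024 entries of (oop, ObjectMonitor*) = 16 bytes on LP64,
  // each thread would carry roughly 16 KiB of cache, which is the
  // per-thread cost raised above.
  printf("per-thread cache size: %zu bytes\n", sizeof(MonitorCache<1024>));
  return 0;
}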
13-10-2025

Ramki, can you please look at this? I currently don't have time, and perhaps more importantly, don't have access to affected hardware.
13-10-2025

The old ObjectMonitor system used to keep a cache of up to 1024 ObjectMonitors attached to each thread, along with the global ObjectMonitor lists. It is interesting that 1024 thread-local ObjectMonitors has popped up again.
10-10-2025

An update: we instrumented the code with more counters in order to find the distribution of object monitor accesses through the cache and of lock counts (i.e. how many times an inflated monitor has been successfully entered). The results are presented in the attached screenshots. The conclusion is that a majority (about 60% of all OMs) see more than 1e6 accesses via the cache and more than 1e6 lock events over their lifetime. From the numbers alone we cannot conclude that these two groups consist of the same monitors, but it is likely. We think the only solution to the problem is to increase the OMCache capacity to a large value.
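For reference, a rough sketch of the kind of per-monitor instrumentation described above (hypothetical counter names, not the actual patch):

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical per-monitor counters, loosely mirroring those described above:
// one for accesses that went through the thread-local cache, one for
// successful inflated-monitor enters.
struct InstrumentedMonitor {
  std::atomic<uint64_t> cache_accesses{0};
  std::atomic<uint64_t> lock_events{0};

  void on_cache_access() { cache_accesses.fetch_add(1, std::memory_order_relaxed); }
  void on_locked()       { lock_events.fetch_add(1, std::memory_order_relaxed); }

  // Classify the monitor the way the attached distributions do:
  // "hot" means more than 1e6 events of each kind over its lifetime.
  bool hot() const {
    return cache_accesses.load(std::memory_order_relaxed) > 1000000 &&
           lock_events.load(std::memory_order_relaxed)    > 1000000;
  }
};

int main() {
  InstrumentedMonitor m;
  for (int i = 0; i < 2000000; i++) { m.on_cache_access(); m.on_locked(); }
  printf("hot monitor: %s\n", m.hot() ? "yes" : "no");
  return 0;
}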
07-10-2025

An update on the issue: as described before, we think the regression can be mitigated by minimizing the number of slow-path exits from the C2 emitted code. One more or less obvious approach is to increase the per-thread object monitor cache size; the default capacity is 8. I experimented with much larger values (up to 1024) and was able to reduce the regression from around 25% to just 8%. The results for runs made on my OCI instance are attached. One observation: when experimenting on my own laptop, with Linux running inside VirtualBox, a large enough object monitor cache let me exceed the performance of the default case (i.e. OMT off).
03-10-2025

I added a table showing scaling with problem size.
01-10-2025

Here is what [~fbredberg] and I have found on the issue. First, we profiled the code with gprofng. The major difference is in the time spent in ObjectMonitor::try_lock(). The call hierarchy also suggested that something is happening in the C2 emitted code:

try_lock() <- try_enter() <- LightweightSynchronizer::inflate_and_enter() <- LightweightSynchronizer::enter() <- SharedRuntime::monitor_enter_helper() <- SharedRuntime::complete_monitor_locking_C() <- C2 Runtime complete_monitor_locking_blob

Gprofng is a sampler and does not provide any counters, so we had to add those ourselves in the runtime code. Without OMT (the object monitor table): 1.1x10^9 calls to complete_monitor_locking_C(). With OMT: 2.4x10^9 calls. This means the slow path is taken approximately 2.18 times more often when OMT is used. We also added counters to measure how many times a thread was parked; the results are negligible compared to the overall load.

The next step was to study what is happening in the C2 emitted code. We added 11 counters in C2_MacroAssembler::fast_lock_lightweight(). The results are presented in the tables below:

-= WITHOUT OBJECT MONITOR TABLE =-
-XX:+UnlockDiagnosticVMOptions -XX:-UseObjectMonitorTable

     nCalls    %   counter label          explanation
=====================================================================
12616092809  100%  XYZ_0:  entry          (Calls to fast_lock_lightweight)
 6027954426   47%  XYZ_1:  fast_path      (Not locked when we entered fast_lock_lightweight, go fast-path)
 6030862583   47%  XYZ_2:  push           (Same as XYZ_1 but counting recursive locking as well; XYZ_2 - XYZ_1 = 2908157 recursive locks)
 6589215079   52%  XYZ_3:  inflated       (Number of times an inflated monitor was found)
          0    0%  XYZ_4:  monitor_found  (Label not used)
 5525186608   43%  XYZ_5:  monitor_locked (Successfully locked an inflated monitor found in the mark word, go fast-path)
         38    0%  XYZ_6:  slow_path_0    (Lock-stack is full, go slow-path)
       8336    0%  XYZ_7:  slow_path_1    (Already locked when entering fast_lock_lightweight, go slow-path)
          0    0%  XYZ_8:  slow_path_2    (Label not used)
 1118166433    8%  XYZ_9:  slow_path_3    (Failed to lock the inflated monitor found in the mark word, go slow-path)
10979600557   87%  XYZ_10: locked         (Successful fast-path returns: 87% => 13% return via the slow path)

Resulted in slow-path: 1636492252 = approx 1.6x10^9 = approx 9% of all calls.
Number of complete_monitor_locking_C() calls at runtime: approx 1.1x10^9.
Corresponding number of try_enter() calls at runtime: 1274712298 = approx 1.2x10^9.
Success rate without going to the runtime is 87%. The categories do not add up exactly because the counters were not atomic, i.e. they are underestimates.

We can draw more insights from the table (OMT off):
1) Less than half of the calls result in fast locking (i.e. with locking bits only).
2) The OM is in the mark word, so one can think of it as a cache with a hit rate of 1.
3) The overall efficiency of inflated locking is 1 * 0.83 = 0.83 (XYZ_5 / XYZ_3).
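To make the counter labels easier to follow, here is a strongly simplified C++ rendering of the branch structure the counters hang off when the monitor table is off (an illustration of the control flow only, not the emitted assembly; lock-stack handling and recursion are omitted):

#include <atomic>
#include <cstdio>

enum Outcome { FAST_LOCKED, INFLATED_LOCKED, SLOW_PATH };

struct Monitor { std::atomic<bool> owned{false}; };

struct MarkWord {
  // In the real mark word these are bits of one machine word; they are
  // modelled as separate fields here for readability.
  std::atomic<bool> locked{false};
  Monitor* inflated = nullptr;   // non-null once the lock has been inflated
};

Outcome fast_lock(MarkWord& mark) {
  if (mark.inflated == nullptr) {
    bool expected = false;
    // XYZ_1/XYZ_2: object not locked, try to take it with a CAS (fast path).
    if (mark.locked.compare_exchange_strong(expected, true)) return FAST_LOCKED;
    // XYZ_7: someone beat us to it -> slow path (inflate, maybe park).
    return SLOW_PATH;
  }
  // XYZ_3: mark word already points at an inflated monitor.
  bool expected = false;
  if (mark.inflated->owned.compare_exchange_strong(expected, true)) {
    return INFLATED_LOCKED;      // XYZ_5: locked the monitor found in the mark word
  }
  return SLOW_PATH;              // XYZ_9: monitor was contended -> slow path
}

int main() {
  MarkWord m;
  printf("first acquire: %d\n", fast_lock(m));   // FAST_LOCKED
  printf("second acquire: %d\n", fast_lock(m));  // SLOW_PATH (already locked)
  return 0;
}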
Same measurements with OMT on:

-= USE OBJECT MONITOR TABLE =-
-XX:+UnlockDiagnosticVMOptions -XX:+UseObjectMonitorTable

     nCalls    %   counter label          explanation
=====================================================================
12693895823  100%  XYZ_0:  entry          (Calls to fast_lock_lightweight)
 7421395388   58%  XYZ_1:  fast_path      (Not locked when entering fast_lock_lightweight, we got it, go fast-path)
 7374712402   58%  XYZ_2:  push           (Same as XYZ_1 but counting recursive locking as well; XYZ_1 - XYZ_2 = 46682986 recursive locks)
 5306447455   41%  XYZ_3:  inflated       (Number of times an inflated monitor was found)
 3878538243   30%  XYZ_4:  monitor_found  (Number of times the monitor was found in the OMT cache)
 3025048955   23%  XYZ_5:  monitor_locked (Successfully locked a monitor found in the OMT cache, go fast-path)
         46    0%  XYZ_6:  slow_path_0    (Lock-stack is full, go slow-path)
    9663155    0%  XYZ_7:  slow_path_1    (Locked by someone else when we entered fast_lock_lightweight, go slow-path)
 1478786346   11%  XYZ_8:  slow_path_2    (Failed to find the monitor in the OMT cache, go slow-path)
  908060162    7%  XYZ_9:  slow_path_3    (Failed to lock the monitor found in the OMT cache, go slow-path)
10010005726   78%  XYZ_10: locked         (Successful fast-path returns: 78% => 22% return via the slow path)

Resulted in slow-path: 2396509709 = approx 2.3x10^9 = approx 19% of all calls.
Number of complete_monitor_locking_C() calls at runtime: approx 2.3x10^9.
Corresponding number of try_enter() calls at runtime: 2697708213 = approx 2.7x10^9.
Success rate without going to the runtime is 78%. The categories do not add up exactly because the counters were not atomic, i.e. they are underestimates.

We can draw more insights from the table (OMT on):
1) Fast locking (i.e. with bits in the mark word) is more successful when OMT is on. This is an effect of spinning (see below).
2) XYZ_4 / XYZ_3 = approx 0.75, i.e. the thread-local OM cache hit rate is about 75%.
3) XYZ_5 / XYZ_4 = approx 0.77, i.e. of the monitors found in the cache, only 77% were successfully locked.
4) The combined "efficiency" of inflated locking is 0.75 * 0.77 = 0.57.

One would like to minimize the number of slow-path exits. We tried to vary the size of the thread-local OM cache, but it does not seem to have a noticeable effect on the cache hit rate. Another available knob is the diagnostic option LightweightFastLockingSpins, which, as stated in the option description, "Specifies the number of times lightweight fast locking will attempt to CAS the markWord before inflating. Between each CAS it will spin for exponentially more time, resulting in a total number of spins on the order of O(2^value)". Practically this means that low values give less fast locking; instead proper OMs are inflated and OM locking is performed. This is preferable under high contention, since with proper OMs one can avoid wasting resources on spinning and instead park a thread that could not enter the OM. High values give more fast locking, and as a result fewer proper OMs are inflated and used. We observed that higher values of LightweightFastLockingSpins increase the number of successful fast locks (i.e. bits only) in the C2 emitted code and reduce the number of inflated lockings, at the price of higher CPU consumption. Below are the results from measurements done in a not strictly scientific way on an OCI machine.
Value          Performance (Requests/sec)
===============================================
OMT off        1502.4774
1              1164.5188
3              1165.7632
5              1121.2268
7              1167.0685
9              1220.5972
11             1171.9015
13 (default)   1117.1044
15             1165.2829
17             1135.5648
19             1157.5625
21             1173.9941
23             1177.0004

No significant impact on performance was observed. This can probably be explained by the fact that low LightweightFastLockingSpins values imply more inflated OMs, but getting them in and out of the OM table is not free; that probably balances out any possible gains from having proper OMs. [~eosterlund] suggested disabling spinning before inflation, effectively setting LightweightFastLockingSpins to 0. This inflates the monitor sooner, right after the first failed fast-locking attempt, which is suitable for a high-contention scenario: resources are not spent on spinning, an OM is inflated as soon as possible, and a thread can be parked when needed. The regression is preserved in this case. So it looks like spinning has little to no effect on performance for this scenario with OMT on.

Conclusion so far: we have not found a single reason for the performance regression. Things to check:
1) Is it really the number of slow-path exits from the C2 emitted code that determines the performance?
2) If it is, can we increase the overall success rate of inflated locking there? Currently success_rate = cache_hit_rate * single_OM_lock_success_rate = 0.75 * 0.77 = 0.57. We do not have control over the first factor; can we increase the second? In the case without OMT, that rate is 0.83. Can we make it even higher? Obviously we can't have 0.75 * X > 0.83 since X <= 1, but maximizing X (single_OM_lock_success_rate) may be doable.

* Tables look nice when I edit them, but not so nice when they get published, sorry.
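For reference, the LightweightFastLockingSpins behaviour quoted above boils down to retrying the mark-word CAS with exponentially longer pauses before giving up and inflating. A minimal standalone C++ sketch of that retry policy (illustrative only; the real logic lives in the runtime and in the C2 emitted code, and uses a CPU pause hint rather than a yield):

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

// Retry the CAS up to `spins` times, pausing exponentially longer between
// attempts, so total spinning is on the order of O(2^spins). With spins == 0
// the first failed attempt already falls through to inflation.
bool try_fast_lock_with_backoff(std::atomic<uint64_t>& mark_word,
                                uint64_t unlocked, uint64_t locked,
                                int spins /* cf. LightweightFastLockingSpins */) {
  uint64_t backoff = 1;
  for (int attempt = 0; ; attempt++) {
    uint64_t expected = unlocked;
    if (mark_word.compare_exchange_strong(expected, locked)) {
      return true;                     // fast (mark-word) lock succeeded
    }
    if (attempt >= spins) {
      return false;                    // give up: inflate a proper ObjectMonitor
    }
    for (uint64_t i = 0; i < backoff; i++) {
      std::this_thread::yield();       // stand-in for a CPU spin-pause hint
    }
    backoff <<= 1;                     // exponentially longer wait each round
  }
}

int main() {
  std::atomic<uint64_t> mark{0};
  // First acquisition succeeds immediately; the second (word already "locked")
  // exhausts its spins and reports that inflation would be needed.
  printf("first: %d\n",  try_fast_lock_with_backoff(mark, 0, 1, 3));
  printf("second: %d\n", try_fast_lock_with_backoff(mark, 0, 1, 3));
  return 0;
}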
19-09-2025

We were able to reproduce the problem on two Intel machines, one with Linux running as a guest OS inside VirtualBox and one with a native installation; the regression is around 30%. We also reproduced it on an Apple M3 Pro laptop, where the regression is around 40%. Note that numactl is, as far as I know, not available on macOS. We tend to think that the problem is not AMD-specific.
05-09-2025

I am able to reproduce on an AMD EPYC Rome machine from OCI. The numbers I observe after a single run (not scientific, but still):
-UseCompactObjectHeaders: 1561.4794 requests/sec
+UseCompactObjectHeaders: 1273.1161 requests/sec
04-09-2025

This is a perf profile of a regression run:

 8.05%  http-nio-8080-e  libjvm.so        [.] ObjectMonitor::try_spin
 6.08%  http-nio-8080-e  libjvm.so        [.] ObjectMonitor::try_lock
 5.59%  http-nio-8080-e  libjvm.so        [.] LightweightSynchronizer::get_or_insert_monitor_from_table
 3.27%  http-nio-8080-e  libjvm.so        [.] LightweightSynchronizer::enter
 2.12%  http-nio-8080-e  libjvm.so        [.] LightweightSynchronizer::inflate_and_enter
 2.01%  http-nio-8080-e  libjvm.so        [.] SharedRuntime::complete_monitor_locking_C
 1.15%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7be54bd3
 1.12%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7bf92c85
 1.05%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7b6d58ca
 1.04%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7bb70e86
 0.80%  http-nio-8080-e  libjvm.so        [.] LightweightSynchronizer::fast_lock_spin_enter
 0.76%  http-nio-8080-e  libjvm.so        [.] AccessInternal::PostRuntimeDispatch<CardTableBarrierSet::AccessBarrier<594020ul, CardTableBarrierSet>, (AccessInterna
 0.53%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7be54f09
 0.53%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7b656b32
 0.50%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7be54338
 0.50%  http-nio-8080-e  [JIT] tid 92363  [.] 0x00007fdd7ba53e9b
 0.50%  http-nio-8080-e  libjvm.so        [.] ObjectMonitor::spin_enter
 0.50%  http-nio-8080-e  libjvm.so        [.] ObjectSynchronizer::FastHashCode
 0.45%  http-nio-8080-e  libjvm.so        [.] LightweightSynchronizer::get_or_insert_monitor

and for comparison a profile from the exact same run on an unaffected x86 machine:

22.40%  http-nio-8080-e  libjvm.so          [.] ObjectMonitor::try_lock
16.19%  http-nio-8080-e  libjvm.so          [.] ObjectMonitor::try_spin
 3.12%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567ead5c8
 3.00%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567ead265
 2.81%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b18a85
 2.79%  http-nio-8080-e  libjvm.so          [.] SpinPause
 2.76%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b18ddd
 2.53%  http-nio-8080-e  libjvm.so          [.] ObjectMonitor::spin_enter
 2.12%  http-nio-8080-e  libjvm.so          [.] LightweightSynchronizer::inflate_and_enter
 2.06%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f25676df9ca
 1.73%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b96004
 1.33%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b1ffe0
 1.00%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567ead5d0
 0.98%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b18d5f
 0.87%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b18de5
 0.82%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567ead5b8
 0.59%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f25676576d5
 0.50%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567b18dcd
 0.43%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f25676528f1
 0.39%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f25676573df
 0.31%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f256765754c
 0.29%  http-nio-8080-e  libjvm.so          [.] SharedRuntime::complete_monitor_locking_C
 0.28%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f25676577ac
 0.28%  http-nio-8080-e  [JIT] tid 2410295  [.] 0x00007f2567657632
 0.25%  http-nio-8080-e  libjvm.so          [.] ObjectMonitor::enter_with_contention_mark

The command line in both scenarios was:

${JAVA} -Xmx16g -Xms16g -Xmn12g -XX:MetaspaceSize=128m -XX:ReservedCodeCacheSize=256m -XX:+UseParallelGC '-Xlog:gc*:file=/tmp/coh-parallel-petclinic-gc.log' -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -XX:-UseCompactObjectHeaders -XX:+UnlockDiagnosticVMOptions -XX:+UseObjectMonitorTable -jar ./target/spring-petclinic-3.5.0-SNAPSHOT.jar

What sticks out is that complete_monitor_locking_C is quite a bit less heavy, and that the OMT code (LightweightSynchronizer::get_or_insert_monitor_from_table) seems absent from the non-affected run, even though it is also using OMT.
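For context, conceptually the object monitor table maps an object to its inflated monitor, so with OMT every inflated-monitor access that misses the thread-local cache needs a table lookup that mark-word-based monitors avoid. A trivial get-or-insert sketch in C++ (a mutex-protected map; the real HotSpot table is a lock-free concurrent hash table, so this only illustrates the extra lookup step):

#include <cstdio>
#include <mutex>
#include <unordered_map>

struct ObjectMonitor { /* owner, entry list, ... */ };
using oop = const void*;

class MonitorTable {
  std::mutex mutex_;
  std::unordered_map<oop, ObjectMonitor*> table_;
public:
  ObjectMonitor* get_or_insert(oop obj) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = table_.find(obj);
    if (it != table_.end()) return it->second;   // monitor already inflated
    ObjectMonitor* m = new ObjectMonitor();      // inflate a new monitor
    table_.emplace(obj, m);
    return m;
  }
};

int main() {
  MonitorTable t;
  int some_object = 0;
  ObjectMonitor* a = t.get_or_insert(&some_object);
  ObjectMonitor* b = t.get_or_insert(&some_object);
  printf("same monitor returned on second lookup: %s\n", a == b ? "yes" : "no");
  return 0;
}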
21-08-2025

I've been able to reproduce the regression on an older AMD processor (AWS instance type m5a):
Baseline: 1396 r/s
+COH: 1064 r/s
That looks like a ~30% regression. In fact, I have been able to reproduce this without compact headers by only enabling the object monitor table (-XX:+UseObjectMonitorTable). This mirrors our earlier finding that OMT performs somewhat badly on older AMD processors, and we currently don't have a good explanation for it. The regression doesn't happen on Intel, ARM and newer AMD processors. Maybe [~coleenp] or [~aboldtch] have more ideas? Axel implemented it, and Coleen knows a lot about it too and has experimented with its performance as well.
19-08-2025

Today I tried to reproduce the issue on a Xeon and a Graviton machine; in both cases I could not observe any regression. For example, averaging over 5 runs on the Xeon, I got 724 r/s without COH and 720 r/s with COH. It seems likely that the underlying issue is similar to JDK-8339114. When experimenting with that, we could only see the regression on (oldish) AMD processors, not on Intel and not on ARM. And it was caused by the object monitor table that compact object headers also enables (i.e. not directly caused by smaller headers, but by the new implementation of heavyweight object locking). Tomorrow I will try to get my hands on a machine with an AMD Rome or similar-generation processor and see if I can reproduce there.
18-08-2025

Attached a screenshot of gprofng profiles of a baseline and a +COH run, which suggests the regression is related to locking / monitors with +COH.
13-08-2025