JDK-8339114 : DaCapo xalan performance with -XX:+UseObjectMonitorTable
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 24
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2024-08-27
  • Updated: 2025-04-15
  • Resolved: 2025-04-02
  • Fix Version: JDK 25 b17 (Fixed)
Description
The DaCapo xalan benchmark is around 14% slower with -XX:+UseObjectMonitorTable. For now the OM table is off by default, so this is the regression we would see when it is turned on by default.


I have tried out a few ideas to see whether they affect the performance of xalan (I'm told it's pronounced zay-lon, not x-Alan):

1. Adjust the size of OMCache across 2, 4, 8, 12, and 24 entries (see the cache sketch after this list): none of them matters, so keeping it at 8.
2. Do not use OMCache at all: worse.
3. Do not clear the OM cache during GC (added an oops_do, which unfortunately keeps things alive): better hit rate, but no better performance overall.
4. Skip the OM cache in the fast path (quick_enter), since it seems to repeat checks: no difference.
5. Take out the spinning before inflating the monitor: worse, even though the spin hit rate is bad:
    _fast_lock_spin_failure = 37987135
    _fast_lock_spin_success = 556770
    _fast_lock_spin_attempt = 1039882
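
For reference, here is a simplified, self-contained model of the per-thread cache being tuned in idea 1. This is illustrative only, not the HotSpot source; the type and member names are stand-ins:

    // Simplified model of a per-thread object-monitor cache (illustrative;
    // the real OMCache lives in the HotSpot runtime).
    class ObjectMonitor;                        // stand-in forward declaration

    struct OMCacheModel {
      static const int kCapacity = 8;           // the size varied in idea 1
      struct Entry { const void* obj; ObjectMonitor* mon; };
      Entry entries[kCapacity] = {};

      // Linear scan from the front; the hit-index histogram in the comments
      // below shows that most hits land at index 0.
      ObjectMonitor* get_monitor(const void* obj) {
        for (int i = 0; i < kCapacity; i++) {
          if (entries[i].obj == obj) return entries[i].mon;
        }
        return nullptr;                         // miss: consult the shared table
      }
    };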

A table or OM-cache lookup on every monitor enter is 14% worse, since these monitors are contended.
Other benchmarks don't show this regression (except Dacapo23_spring, which may be the same thing).

perf shows xalan's time mostly in ObjectMonitor::TrySpin, both with and without the table. Adaptive spinning is something that really helps xalan, though.
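
To illustrate what adaptive spinning buys here, a minimal sketch of the idea (illustrative; the real logic lives in ObjectMonitor::TrySpin, and the budget numbers below are made up):

    // Minimal sketch of adaptive spinning: the spin budget grows after a
    // successful spin-acquire and shrinks after a failure, so threads stop
    // burning CPU on monitors where spinning rarely wins.
    struct AdaptiveSpin {
      int budget = 100;                         // current spin budget (made up)

      template <typename TryLock>
      bool try_spin(TryLock try_lock) {
        for (int i = 0; i < budget; i++) {
          if (try_lock()) {                     // acquired while spinning
            if (budget < 10000) budget *= 2;    // spin longer next time
            return true;
          }
        }
        if (budget > 10) budget /= 2;           // give up sooner next time
        return false;                           // caller inflates/parks instead
      }
    };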

Added some counters to the runtime code (C1-only performance was equivalently slower with the OM table, so ignoring c2_MacroAssembler for now):

===== DaCapo 9.12-MR1 xalan PASSED in 4435 msec =====
_om_cache_hits      = 2456302
_om_cache_misses    = 1327485
_try_enter_success  = 1198359
_try_enter_failure  = 1257943
_try_enter_slow_failure = 958268
_try_enter_slow_success = 1672344
_fast_lock_spin_attempt = 33427
_fast_lock_spin_success = 4896
_fast_lock_spin_failure = 28531
_table_lookups = 1339097
_table_hits    = 1338926
_items_count   = 171

Comments
Fix is causing numerous crashes and so is being backed out.
02-04-2025

Changeset: 49cb7aaa Branch: master Author: Roman Kennke <rkennke@openjdk.org> Date: 2025-04-02 15:57:32 +0000 URL: https://git.openjdk.org/jdk/commit/49cb7aaad903aa5209da9f4af4b484ff38c0fb8b
02-04-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/24098 Date: 2025-03-18 12:41:44 +0000
24-03-2025

I commented on your draft PR.
18-03-2025

This is a good observation. I wasn't really focused on the OMCache, but by the time we get to the slow enter path, we've looked for the monitor three times in this cache. Since xalan benefits from adaptive spinning, this improves the scores for me too, by even more: I'm down to 4% slower. That's okay for this improvement and this specific benchmark, in our opinion. This is great, thank you for figuring this out! I have a change with a few things: I moved some functions that call the ObjectMonitorTable up in the file to where the table is (there were too many functions with similar names and I wanted them sorted), inlined some FastHashCode code, used BasicLock.object_monitor_cache() for the exit path, and changed try_enter to spin_enter. I'll send this through another round of our internal testing. If you want to prepare your change for spin_enter (I think you already have the BasicLock change in your patch), I can review it and piggyback these other small changes on it in a separate PR.
18-03-2025

I posted a draft PR, which brings the regression down to ~3% for me: https://github.com/openjdk/jdk/pull/24098 It might be useful to run other relevant performance tests to verify that this does not accidentally regress other workloads. I am not sure about some other parts of the PR, though. For example, I am not sure whether avoiding the push-back of OMCache entries is useful. I'll do more experiments around that.
18-03-2025

My most plausible guess so far is that this test exaggerates the cost of the OMCache lookup because of the rather many locks that it is juggling/rotating. I think I can help performance a little by replacing the monitor->try_enter() with monitor->spin_enter() in LWS::quick_enter(). I've noticed that many quick_enter() calls take the slow path because of contention, which means we have to do the cache lookup (and a bunch of other stuff) again in the slow path. Doing a spin_enter() in quick_enter() gives it a better chance to actually succeed with the monitor it has already found, and avoids looking up the cache again. This gets me down to 'only' a ~7% regression.
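
A rough sketch of the shape of that change, with stand-in types rather than the real LightweightSynchronizer code:

    // Once quick_enter() has a monitor in hand (e.g. from the per-thread
    // cache), spin on it instead of making a single attempt that often
    // fails under contention and forces the slow path to repeat the
    // cache/table lookup. Stand-in types, illustrative only.
    struct MonitorStub {
      bool try_enter()  { return false; }       // one attempt, often loses races
      bool spin_enter() { return true;  }       // retries briefly before giving up
    };

    bool quick_enter_sketch(MonitorStub* m) {
      if (m != nullptr && m->spin_enter()) {    // was: m->try_enter()
        return true;                            // locked, slow path avoided
      }
      return false;                             // fall back to the slow path
    }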
17-03-2025

[~stuefe] I think you're the one with the old hardware. We're testing on the standard-config Oracle OCI machines: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm We run on one whole NUMA node, so 36 cores on Intel or 80 cores on Ampere.
13-03-2025

Strange. I ran DaCapo xalan on my benchmark machine (stock JVM with +/-UseObjectMonitorTable, i7-4770, RHEL 9, JVM process pinned to 6 cores), 100 warmups each. Oddly enough, I could not reproduce any performance issues. I measured benchmark timings as well as a battery of perf stats, including context switches. I see no difference between -UseObjectMonitorTable and +UseObjectMonitorTable that rises above the standard deviation (the differences are way below it). Maybe it's the limited number of cores, or the somewhat oldish hardware.
13-03-2025

I did a little histogram to see how many iterations OMCache::get_monitor() typically takes, and it looks like the vast majority of hits is at index 0:
    index 0: 89.5%
    index 1: 7.7%
    index 2: 0.7%
    index 3: 1.0%
    index 4: 0.04%
    index 5: 0.06%
    index 6: 0.49%
    index 7: 0.42%
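
The instrumentation behind a histogram like this can be as simple as counting the slot index on each hit; a sketch with hypothetical counter names, not HotSpot code:

    // Count which cache slot each hit lands in during the linear scan.
    static const int kSlots = 8;
    static long g_hits_at_index[kSlots] = {};   // g_hits_at_index[i]: hits in slot i

    int lookup_counted(const void* slots[], const void* obj) {
      for (int i = 0; i < kSlots; i++) {
        if (slots[i] == obj) {
          g_hits_at_index[i]++;                 // ~90% land at index 0
          return i;                             // hit at slot i
        }
      }
      return -1;                                // miss: not counted here
    }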
12-03-2025

I tried an experiment where I linked the monitors to the Klass pointer, which gives a really nice hit rate in quick_enter and avoids the table and the fence. It doesn't help; still at a 13% regression. My experiment where I walk the first 32 entries of the ObjectSynchronizer::_in_use_list worked a lot better: this gets the regression to 10% (a sketch follows below). I also added a park counter, because we were wondering whether the timing of the ObjectMonitorTable caused us to park more often. We don't; no difference. Need to get better information out of Linux perf.
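
A sketch of that in-use-list experiment, with stand-in types (the real list is ObjectSynchronizer::_in_use_list, and as noted in a later comment this kind of scan may not be correct with respect to deflation; it is an experiment, not a fix):

    // Scan a bounded prefix of a global in-use list for a monitor matching
    // the object, avoiding the table lookup when it is found.
    struct MonNode { const void* obj; MonNode* next; };
    static MonNode* g_in_use_list = nullptr;    // stand-in for the global list

    MonNode* scan_in_use_list(const void* obj, int limit) {   // limit = 32 above
      MonNode* n = g_in_use_list;
      for (int i = 0; i < limit && n != nullptr; i++, n = n->next) {
        if (n->obj == obj) return n;            // found; table lookup avoided
      }
      return nullptr;                           // give up, use the normal path
    }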
11-03-2025

Yes, I found the same thing with Linux perf. All the threads were busy in the compiler; the app was barely in the profiles. And stopping the table lookup in the exit path didn't help performance, as I had hoped it would.
11-03-2025

I did some experiments. Some observations:
- The problem reproduces on both aarch64 and x86_64.
- The problem reproduces with C1 only/without C2, so it appears to be a problem in the runtime.
- The workload doesn't seem to inflate an excessive number of monitors, nor does it run into excessive deflation.
- I configured the heap so large that we don't hit safepoints often, and thus avoid clearing the OM cache all the time. The problem persists.
- I instrumented OMCache::get_monitor() to count misses and hits. I am getting about 5% OM cache misses (not found), 95% hits, and 0% misses (deflating). Doesn't look too bad.
- I also instrumented OMCache::set_monitor(), and here it gets interesting. The cache never seems to have more than 6 entries (good), but it is *very often* pushing all entries up. I think the threads of the workload are cycling through ~6 monitors each, and every time a monitor is entered it is pushed onto the front of the cache, which pushes up all the other ~5 entries (see the sketch below). However, I have not yet found a better strategy.
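
The "pushing all entries up" behavior is the move-to-front insert of such a cache; a simplified model (illustrative, not the HotSpot source):

    // Move-to-front insert: the most recent monitor goes in slot 0 and the
    // existing entries shift up. When threads rotate through ~6 monitors,
    // nearly every enter pays this shift, matching the observation above.
    static const int kCap = 8;
    struct Slot { const void* obj; void* mon; };

    void set_monitor_model(Slot cache[], const void* obj, void* mon) {
      int end = kCap - 1;
      for (int i = 0; i < kCap; i++) {
        if (cache[i].obj == obj) { end = i; break; }  // stop at the old copy
      }
      for (int i = end; i > 0; i--) {
        cache[i] = cache[i - 1];                      // push entries up
      }
      cache[0] = Slot{obj, mon};                      // most recent at front
    }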
11-03-2025

The runtime monitor exit path seems to always do a table lookup. It could use the monitor cached in the BasicLock instead. But it doesn't help. *shrugs*
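
A sketch of that idea with stand-in types (not HotSpot code): the enter path has already stashed the monitor in the BasicLock, so exit can read it back instead of doing another table lookup:

    // Exit path reusing the monitor cached in the BasicLock by the enter
    // path, falling back to the table only when nothing was cached.
    struct Monitor { void exit() { /* release the monitor */ } };
    struct BasicLockModel { Monitor* cached_monitor = nullptr; };

    static Monitor g_table_hit;                 // stand-in for a table result
    Monitor* table_lookup(const void*) { return &g_table_hit; }  // stub

    void exit_sketch(const void* obj, BasicLockModel* lock) {
      Monitor* m = lock->cached_monitor;        // cached by the enter path
      if (m == nullptr) {
        m = table_lookup(obj);                  // fallback: table lookup
      }
      m->exit();
    }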
11-03-2025

One more thing: while perf hasn't been very helpful in finding the problem (so far), one thing that sticks out is that with +UOMT we seem to get more than twice as many context switches as with -UOMT. Not sure yet whether that is significant. And with C2 enabled, perf reports a *lot* of activity in JIT code, and this is one thing that I remember about Xalan: it generates bytecode for its XSLT transforms, which then gets compiled. So maybe compiling all that code simply takes more resources with +UOMT?
11-03-2025

I've been testing with a ShouldNotReachHere() in the deflation paths, so I agree that they're not an issue. In this benchmark there are many threads sharing about 100 org/apache/xpath/axes/IteratorPool locks. The OM table is somewhat empty, so the table hit rate should be good. But our suspicion is that the fence in the table lookup is causing the performance loss, so I've been finding ways to avoid going to the table. As you found, the OM cache seems very effective. And the problem does seem to be not in the C2 code but in the runtime. I have a change where I also search through the om_list for the monitor (which might not be 100% correct wrt deflation, but it's an experiment), since not all threads have already locked the shared IteratorPool objects. It gains about 3%, so I'm down to 10% worse on my x86 machine.

===== DaCapo 9.12-MR1 xalan PASSED in 1298 msec =====
_om_cache_hits           = 50397651
_om_cache_misses         = 22262472
_try_enter_success       = 24451180
_try_enter_failure       = 25946471   <= this is in quick_enter
_try_enter_om_list_hit   = 8099554
_try_enter_om_cache_hit  = 25954978
_try_enter_om_cache_miss = 22266795
_try_enter_slow_success  = 29126309
_try_enter_slow_failure  = 19095464
_fast_lock_spin_attempt  = 455016
_fast_lock_spin_success  = 307718     <= this fast-lock spinning seems good, but we should always already have monitors, so?
_fast_lock_spin_failure  = 147298
_table_lookups           = 22266902
_table_hits              = 22266719   <= this is the number I've been trying to reduce
_om_hash_code_hits       = 22266724
_om_hash_code_misses     = 178
_inflated_exit_path      = 282239     <= the exit path has the monitor and looks it up again, so pass the monitor to exit
ObjectMonitorTable entries: 183, dead: 113, table size: 32768

All these experiments have had limited success. We're having a brainstorming session today with [~aboldtch]. I'm still very suspicious of the fences in the ConcurrentHashTable. Maybe there's a way to have a lock-free, fence-free lookup?
11-03-2025