JDK-8353588 : [REDO] DaCapo xalan performance with -XX:+UseObjectMonitorTable
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 24
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2025-04-02
  • Updated: 2025-04-07
  • Resolved: 2025-04-03
Fixed in: JDK 25 (master)
Related Reports
Sub Tasks
JDK-8353584
Description
REDO of JDK-8339114 

The DaCapo xalan benchmark is around 14% slower with -XX:+UseObjectMonitorTable. For now the OM table is off by default, so this regression only appears once it is turned on.


I have tried out several ideas to see whether they affect the performance of xalan (I'm told it's pronounced zay-lon, not x-Alan). Ideas:

1. Adjust the size of OMCache, trying 2, 4, 8, 12, and 24 entries. None mattered; keeping it at 8.
2. Do not use OMCache at all: worse.
3. Do not clear the OM cache during GC (added an oops_do, which unfortunately keeps things alive). Better hit rate, but no better performance overall.
4. Skip the OM cache in the fast path (quick_enter), since it seems to repeat checks: no difference.
5. Take out spinning before inflating the monitor: worse, even though the success rate is low:
    _fast_lock_spin_failure = 37987135
    _fast_lock_spin_success = 556770
    _fast_lock_spin_attempt = 1039882

A table or OM-cache lookup on each monitor enter is 14% worse, because these monitors are contended.
Other benchmarks don't show this regression (except Dacapo23_spring, which may be the same thing).

perf on xalan shows time spent mostly in ObjectMonitor::TrySpin, both with and without the table. Adaptive spinning is something that really helps xalan, though.

Added some counters to the runtime code (C1-only performance was equivalently slower with the OM table, so ignoring c2_MacroAssembler for now):

===== DaCapo 9.12-MR1 xalan PASSED in 4435 msec =====
_om_cache_hits      = 2456302
_om_cache_misses    = 1327485
_try_enter_success  = 1198359
_try_enter_failure  = 1257943
_try_enter_slow_failure = 958268
_try_enter_slow_success = 1672344
_fast_lock_spin_attempt = 33427
_fast_lock_spin_success = 4896
_fast_lock_spin_failure = 28531
_table_lookups = 1339097
_table_hits    = 1338926
_items_count   = 171

Comments
Changeset: d894b781 Branch: master Author: Roman Kennke <rkennke@openjdk.org> Date: 2025-04-03 17:12:38 +0000 URL: https://git.openjdk.org/jdk/commit/d894b781b8f245ce8a5d28401c0abb5abb420bc8
03-04-2025

Posted this in the backout issue, did not see that there was a REDO already. Looks like the problem is that when a thread self-deoptimizes (e.g. when hitting an uncommon trap) and eliminated locks are restored, the new BasicLock is not cleared, so we may read garbage. Either we clear it when relocking:

```
diff --git a/src/hotspot/share/runtime/deoptimization.cpp b/src/hotspot/share/runtime/deoptimization.cpp
index 748cd0d7939..1cbccd18039 100644
--- a/src/hotspot/share/runtime/deoptimization.cpp
+++ b/src/hotspot/share/runtime/deoptimization.cpp
@@ -1662,6 +1662,9 @@ bool Deoptimization::relock_objects(JavaThread* thread, GrowableArray<MonitorInf
     }
     BasicLock* lock = mon_info->lock();
     if (LockingMode == LM_LIGHTWEIGHT) {
+      if (UseObjectMonitorTable) {
+        lock->clear_object_monitor_cache();
+      }
       // We have lost information about the correct state of the lock stack.
       // Entering may create an invalid lock stack. Inflate the lock if it
       // was fast_locked to restore the valid lock stack.
```

or we use a null BasicLock* when calling inflate_and_enter from LightweightSynchronizer::enter_for as a signal to not use the lock:

```
diff --git a/src/hotspot/share/runtime/lightweightSynchronizer.cpp b/src/hotspot/share/runtime/lightweightSynchronizer.cpp
index 908cbe080a3..f8fe70907c7 100644
--- a/src/hotspot/share/runtime/lightweightSynchronizer.cpp
+++ b/src/hotspot/share/runtime/lightweightSynchronizer.cpp
@@ -637,7 +637,7 @@ void LightweightSynchronizer::enter_for(Handle obj, BasicLock* lock, JavaThread*
   } else {
     do {
       // It is assumed that enter_for must enter on an object without contention.
-      monitor = inflate_and_enter(obj(), lock, ObjectSynchronizer::inflate_cause_monitor_enter, locking_thread, current);
+      monitor = inflate_and_enter(obj(), nullptr, ObjectSynchronizer::inflate_cause_monitor_enter, locking_thread, current);
       // But there may still be a race with deflation.
     } while (monitor == nullptr);
   }
@@ -1028,7 +1028,9 @@ ObjectMonitor* LightweightSynchronizer::inflate_and_enter(oop object, BasicLock*
     // There's no need to use the cache if we are locking
     // on behalf of another thread.
     if (current == locking_thread) {
-      monitor = lock->object_monitor_cache();
+      if (lock != nullptr) {
+        monitor = lock->object_monitor_cache();
+      }
       if (monitor == nullptr) {
         monitor = current->om_get_from_monitor_cache(object);
       }
```

Both solve the issue. This just reinforces what I said in the PR: we should make this BasicLock cache a C2 property and only use it when we know we come from C2.
03-04-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/24413 Date: 2025-04-03 11:59:58 +0000
03-04-2025