JDK-8320318 : ObjectMonitor Responsible thread
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-11-17
  • Updated: 2024-10-22
  • Resolved: 2024-09-30
Fix Version
JDK 24: 24 b18 (Fixed)
Description
The ObjectMonitor successor protocol designates a Responsible thread to avoid a memory barrier when storing null in the owner for unlocking a monitor.  The cost of memory barriers is a lot lower than 20+ yrs ago when this was implemented.  We could simplify this code by adding the write barrier and removing Responsible.

Performance test this, of course.
Comments
Changeset: 180affc5 Branch: master Author: Fredrik Bredberg <fbredberg@openjdk.org> Date: 2024-09-30 12:28:35 +0000 URL: https://git.openjdk.org/jdk/commit/180affc5718c9bf2f009d6a7aa129cc36335384a
30-09-2024

Please note that I've changed the LockUnlock.testContendedLock from @Threads(2) to @Threads(3) which makes this test run slower. The reason for this change was because it enabled me to increase the code coverage, and thereby execute all(?) the corner cases when doing ObjectMonitor locking. Even if I hadn't changed the number of threads from two to three, there might still be a slight decrease in performance, which is probably due to the added StoreLoad barrier in fast_unlock et al. However it's only there if you have an inflated monitor (i.e. you are experiencing contended locking). In a real world application where you inflate, park and unpark, one added StoreLoad doesn't seem to change the overall performance that much. Which is probably why we don't see any real regression when we run our performance tests (like DaCapo, Renaissance, SPECjvm etc.).
27-09-2024

[~fbredberg] can you please add a description of the new C2 fast_unlock protocol - thanks.
19-09-2024

Why do we have "a responsible thread" and are there any problems with it?

C2_MacroAssembler::fast_unlock_lightweight (inflated monitor case):
1. Check if the EntryList and the cxq list are empty, and if so unlock the monitor (setting the ObjectMonitor owner to NULL) and return through the fast path. Note that we don't issue any memory fence after unlocking the monitor.
2. If either of the lists is not empty, we must check for a successor. If there is no successor we return through the slow path. The successor protocol requires the unlocker to check if the successor is racingly set right after unlocking the monitor. Failing to do so will result in nobody waking up the successor until the next lock operation on the same ObjectMonitor. If there is no Java thread that wants to lock the ObjectMonitor, this leads to an indefinite hang, "stranding" the threads in the EntryList and cxq lists.
3. If there is a successor we unlock the monitor and issue a fence.
4. Now we recheck if there is a successor. If there is a successor after the monitor was unlocked (with a fence) we have successfully handed off the monitor and we return through the fast path.
5. If there is no successor we try to relock the monitor by setting the ObjectMonitor owner to the current thread.
6. If we successfully relocked the monitor, we return through the slow path.
7. If we didn't manage to relock the monitor, we assume that it was locked by another regular Java thread which will keep the monitor alive when it unlocks the monitor, so we return through the fast path.

Problems with the steps above:

Step 1: In order to perform as few operations as possible we unlock the monitor and return through the fast path as soon as we have ensured that the entry lists are empty. But this opens up a race where a thread adds itself to the cxq list just after we have checked that the entry lists are empty, but before we unlock the monitor, which may result in stranding. To handle this situation we give the first thread that enters any of the entry lists the job of being "responsible" for keeping the monitor protocol alive. The responsible thread uses a timed park instead of a normal indefinite park operation, periodically waking up to check for and recover from potential strandings such as the one described above.

Step 7: The reason we didn't manage to relock the monitor might not be that it was locked by another regular Java thread. Instead it might be locked by the deflator thread, which will not unpark any thread when it unlocks the monitor, thus leading to stranding. The deflator thread might need to unlock the monitor because some thread added itself to the entry list while the monitor was owned by the deflator thread, so deflation must be canceled. The responsible thread will save us from stranding in this case as well.

As pointed out by [~pchilanomate], there unfortunately seems to be a hole in the responsible scheme as well. "_Responsible is not always set, it is cleared when the responsible thread acquires the lock, or when the owner releases the monitor in the slow path. And if cxq or EntryList are not empty incoming threads will not set _Responsible even if it is null."

Example:
T1: Acquires the lock.
T2: Tries to acquire the lock. Since the entry lists are empty, T2 will make itself "the responsible". Puts itself on the entry list. Goes into a timed park.
T3: Tries to acquire the lock. Since the entry lists are not empty, T3 will not try to become the responsible. Goes into an infinite park.
T1: Releases the lock and unparks T2.
T2: Acquires the lock and resigns from being the responsible.

At this point the entry lists are not empty and we have no responsible. Furthermore, any new thread trying to acquire the lock will not make itself responsible, because the entry lists are not empty. We will suffer from stranding if T2 manages to go through all 7 steps described above when releasing the lock, and the reason T2 didn't manage to reacquire the lock was that it was locked by the deflator thread.
19-09-2024
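
The seven steps of the old unlock path listed in the comment above can be sketched as a small single-file model. This is only an illustration, not HotSpot code: `LegacyMonitorModel`, `LegacyExit` and `legacy_fast_unlock` are hypothetical names, the two lists are reduced to counters, and standard C++ atomics stand in for the generated assembly.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of an inflated monitor; not the real ObjectMonitor layout.
struct LegacyMonitorModel {
    std::atomic<const void*> owner{nullptr};
    std::atomic<const void*> succ{nullptr};
    std::atomic<int> entry_list_len{0};  // stands in for _EntryList
    std::atomic<int> cxq_len{0};         // stands in for _cxq
};

enum class LegacyExit { FastPath, SlowPath };

// Models the old seven-step unlock for the inflated case (assumed shape).
LegacyExit legacy_fast_unlock(LegacyMonitorModel& m, const void* self) {
    assert(m.owner.load() == self);  // only the owner may unlock

    // Step 1: both lists look empty, so unlock without a trailing fence.
    if (m.entry_list_len.load(std::memory_order_relaxed) == 0 &&
        m.cxq_len.load(std::memory_order_relaxed) == 0) {
        // A thread enqueuing itself between the check above and the store
        // below goes unnoticed: the race the Responsible thread covered.
        m.owner.store(nullptr, std::memory_order_release);
        return LegacyExit::FastPath;
    }

    // Step 2: waiters exist but no successor; take the slow path while
    // still holding the lock.
    if (m.succ.load() == nullptr) {
        return LegacyExit::SlowPath;
    }

    // Step 3: a successor exists; unlock and fence.
    m.owner.store(nullptr, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Step 4: successor still set after the unlock, hand-off succeeded.
    if (m.succ.load() != nullptr) {
        return LegacyExit::FastPath;
    }

    // Steps 5 and 6: no successor any more; try to relock and, on success,
    // let the slow path wake a waiter.
    const void* expected = nullptr;
    if (m.owner.compare_exchange_strong(expected, self)) {
        return LegacyExit::SlowPath;
    }

    // Step 7: someone else holds the lock. The old code assumed a regular
    // Java thread; if it is the deflator thread, nobody unparks a waiter.
    return LegacyExit::FastPath;
}
```

The model makes the two weak points visible: the unfenced store in step 1 and the unchecked assumption in step 7.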

[~dholmes] It might appear as a new C2 fast_unlock protocol. But in reality, C2 fast_unlock is now following the same protocol as the platform independent slow path. This protocol is described in the comment above ObjectMonitor::exit(), and in part looks like this:

-------- BEGIN COPY FROM ObjectMonitor::exit() COMMENT --------
This is the exit part of the locking protocol, often implemented in C2_MacroAssembler::fast_unlock()
1. A release barrier ensures that changes to monitor meta-data (_succ, _EntryList, _cxq) and data protected by the lock will be visible before we release the lock.
2. Release the lock by clearing the owner.
3. A storeload MEMBAR is needed between releasing the owner and subsequently reading meta-data to safely determine if the lock is contended (step 4) without an elected successor (step 5).
4. If both _EntryList and _cxq are null, we are done, since there is no other thread waiting on the lock to wake up. I.e. there is no contention.
5. If there is a successor (_succ is non-null), we are done. The responsibility for guaranteeing progress-liveness has now implicitly been moved from the exiting thread to the successor.
6. There are waiters in the entry list (_EntryList and/or cxq are non-null), but there is no successor (_succ is null), so we need to wake up (unpark) a waiting thread to avoid stranding. Note that since only the current lock owner can manipulate the _EntryList or drain _cxq, we need to reacquire the lock before we can wake up (unpark) a waiting thread.
-------- END COPY FROM ObjectMonitor::exit() COMMENT --------

What really *is* different is this: previously, if C2 fast_unlock decided to exit through the slow path, it would do so while still holding the lock. Now it will exit through the slow path while *not* holding the lock.

Instead of adding assembler code on all platforms to reacquire the lock, which is needed in order to manipulate the queue and unpark, the reacquire of the lock is placed in the platform independent SharedRuntime::monitor_exit_helper(). Which monitor to reacquire is transferred from fast_unlock to monitor_exit_helper via a pointer in the current JavaThread called _unlocked_inflated_monitor. Hope this info sheds some light.
19-09-2024
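
As a rough illustration of the exit protocol quoted in the comment above, here is a minimal sketch using standard C++ atomics. It is not the HotSpot implementation: `ThreadModel`, `MonitorModel`, `model_fast_unlock` and `model_monitor_exit_helper` are hypothetical names, the lists are reduced to counters, and the field `unlocked_inflated_monitor` only mimics the role described for _unlocked_inflated_monitor.

```cpp
#include <atomic>
#include <cassert>

struct MonitorModel;

// Hypothetical stand-in for JavaThread; only the hand-off pointer is modeled.
struct ThreadModel {
    MonitorModel* unlocked_inflated_monitor = nullptr;
};

// Hypothetical stand-in for ObjectMonitor.
struct MonitorModel {
    std::atomic<ThreadModel*> owner{nullptr};
    std::atomic<ThreadModel*> succ{nullptr};
    std::atomic<int> entry_list_len{0};  // stands in for _EntryList
    std::atomic<int> cxq_len{0};         // stands in for _cxq
};

enum class ExitResult { Done, NeedsSlowPath };

// Fast-path model of steps 1 to 6 of the exit protocol.
ExitResult model_fast_unlock(MonitorModel& m, ThreadModel& self) {
    assert(m.owner.load() == &self);  // only the owner may unlock

    // Steps 1 and 2: release barrier, then clear the owner.
    m.owner.store(nullptr, std::memory_order_release);

    // Step 3: StoreLoad barrier between the release and the reads below.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    // Step 4: nobody is waiting, so there is nothing to wake.
    if (m.entry_list_len.load() == 0 && m.cxq_len.load() == 0) {
        return ExitResult::Done;
    }

    // Step 5: an elected successor now carries the liveness responsibility.
    if (m.succ.load() != nullptr) {
        return ExitResult::Done;
    }

    // Step 6: waiters but no successor. Record which monitor the slow path
    // must reacquire, and leave WITHOUT holding the lock.
    self.unlocked_inflated_monitor = &m;
    return ExitResult::NeedsSlowPath;
}

// Slow-path model: reacquire the recorded monitor before waking a waiter.
// Returns true if this thread owns the monitor again; false means another
// thread grabbed it first (or nothing was recorded).
bool model_monitor_exit_helper(ThreadModel& self) {
    MonitorModel* m = self.unlocked_inflated_monitor;
    self.unlocked_inflated_monitor = nullptr;
    if (m == nullptr) {
        return false;
    }
    ThreadModel* expected = nullptr;
    return m->owner.compare_exchange_strong(expected, &self);
}
```

In this model a failed reacquire simply returns, on the assumption that whichever thread now owns the monitor will run the same exit protocol when it unlocks.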

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/19454 Date: 2024-05-29 12:58:02 +0000
09-09-2024

Hi [~swesonga] - After removing the concept of the Responsible thread, there is no longer any need to do timed parking. So we started to zoom in on timed vs infinite parking on Windows, and by going back to timed parking (with a very long timeout and only on Windows) the regression went away.

    // park self
    NOT_WINDOWS(current->_ParkEvent->park();)
    WINDOWS_ONLY(current->_ParkEvent->park((jlong) 0x10000000);)

This somehow seems to be related to the enabling of high resolution timers. Because if we go back to:

    // park self
    current->_ParkEvent->park();

and add calls to enable high resolution timers around the call to WaitForSingleObject when doing infinite parking, that also removes the regression.

Anyhow, I've created JDK-8339730 so that this issue only handles the removal of the ObjectMonitor Responsible thread. All discussion of the Windows regression should take place in JDK-8339730 instead.
09-09-2024
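
For readers unfamiliar with the difference between timed and indefinite parking mentioned in the comment above, here is a minimal sketch built on `std::condition_variable`. It is an assumption-laden illustration, not HotSpot's ParkEvent: `ParkEventModel` and its single-permit semantics are hypothetical, and it says nothing about the Windows timer behaviour under investigation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical single-permit park/unpark primitive.
class ParkEventModel {
public:
    // Indefinite park: blocks until a permit is available, then consumes it.
    void park() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return permit_; });
        permit_ = false;
    }

    // Timed park: returns true if a permit was consumed, false on timeout.
    // A thread parked this way always wakes up eventually, which is what
    // the Responsible thread relied on to recover from stranding.
    bool park(std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lock(mu_);
        bool signaled = cv_.wait_for(lock, timeout, [this] { return permit_; });
        permit_ = false;
        return signaled;
    }

    // Makes one permit available and wakes a parked thread, if any.
    void unpark() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            permit_ = true;
        }
        cv_.notify_one();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool permit_ = false;
};
```

With the Responsible thread gone, every waiter can in principle use the indefinite form, because the exit protocol itself guarantees that some thread is woken.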

Hi [~swesonga] - we run these benchmarks in Oracle OCI cloud, and we observed the regressions on both the BM.Standard3.64 and the BM.Optimized3.36 shapes, and we are running some version of Windows Server 2019 for all these runs. We are not using any special command line options for DaCapo. We run these in a loop 5 times each.

    --size large --iterations 3 tomcat
    --size large --iterations 5 spring

Shape info: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm

Thanks for helping with this!
05-09-2024

Hi [~coleenp], we do not currently have the DaCapo benchmark suite in-house. Our plan is to bring it in, validate our setup and then carry out performance baselining. Do you have any additional details of how these runs were executed e.g. benchmarking setup (hardware configuration, OS version/build, notable command line options) and any observed run-to-run variations for the listed configurations?
05-09-2024

[~dhanalla] [~swesonga] This change is a correctness fix to remove low probability thread stranding with contended locking. When we run the DaCapo benchmarks, we find that in some configurations the Spring and Tomcat benchmarks are much slower (around a 100% regression, but only in some runs; other runs don't show a regression). Other benchmarks are significantly faster for contended locking on Linux. Here is the Draft PR. Would you or someone at Microsoft be able to run this and let us know if what we're seeing is real or just a config problem at our end? Thanks.

https://github.com/openjdk/jdk/pull/19454

This is what we see:

                         OracleLinux-aarch64   OracleLinux-x64   Windows-x64
DaCapo23-spring-large    24.20%                6.88%             -40.29%
DaCapo23-tomcat-large    28.53%                2.05%             1.91%

Another run with the same binaries:

                         OracleLinux-aarch64   OracleLinux-x64   Windows-x64
DaCapo23-spring-large    24.76%                8.23%             -40.99%
DaCapo23-tomcat-large    2.08%                 0.00%             -104.48%
30-08-2024

[~pchilanomate] Add motivation from your loom locking work.
17-11-2023