JDK-8319137 : release _object in ObjectMonitor dtor to avoid races
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21,22
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2023-10-30
  • Updated: 2024-02-22
  • Resolved: 2023-11-22
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 21: 21.0.3 (Fixed)
JDK 22: 22 b26 (Fixed)
Description
The following stress test failed in my jdk-22+21 stress testing run:

StressWrapper_SetNameAtExit.java

which is a wrapper around the following test:

runtime/Thread/SetNameAtExit.java

Here's a snippet from the log file:

#section:main
----------messages:(6/341)----------
command: main -XX:+HeapDumpOnOutOfMemoryError -Xmx128m SetNameAtExit 6646
reason: User specified action: run main/othervm/timeout=6946 -XX:+HeapDumpOnOutOfMemoryError -Xmx128m SetNameAtExit 6646
started: Sat Oct 28 04:05:29 EDT 2023
Mode: othervm [/othervm specified]
finished: Sat Oct 28 04:48:23 EDT 2023
elapsed time (seconds): 2574.554
----------configuration:(0/0)----------
----------System.out:(18/1156)----------
About to execute for 6646 seconds.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/work/shared/bug_hunt/thread_SMR_stress/jdk22_exp.git/open/src/hotspot/share/oops/weakHandle.inline.hpp:38), pid=49774, tid=24579
#  assert(!is_null()) failed: Must be created
#
# JRE version: Java(TM) SE Runtime Environment (22.0) (slowdebug build 22-internal-2023-10-26-1411081.dcubed...)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (slowdebug 22-internal-2023-10-26-1411081.dcubed..., mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-aarch64)
# Core dump will be written. Default location: ../core.49774
#
# An error report file with more information is saved as:
# /System/Volumes/Data/work/shared/bug_hunt/thread_SMR_stress/jdk22_exp.git/build/macosx-aarch64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_StressWrapper_SetNameAtExit_java/StressWrapper_SetNameAtExit/hs_err_pid49774.log
[2565.454s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
----------System.err:(0/0)----------
----------rerun:(35/5474)*----------

and here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x000000012e811a10):  JavaThread "MainThread"        [_thread_in_Java, id=24579, stack(0x00000001717b4000,0x00000001719b7000) (2060K)]

Stack: [0x00000001717b4000,0x00000001719b7000],  sp=0x00000001719b5d00,  free space=2055k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x128b66c]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x93c  (weakHandle.inline.hpp:38)
V  [libjvm.dylib+0x128bd04]  VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, char*)+0x0
V  [libjvm.dylib+0x62cef4]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0x523da8]  WeakHandle::peek() const+0x58
V  [libjvm.dylib+0xf2eec4]  ObjectMonitor::object_peek() const+0x3c
V  [libjvm.dylib+0x117ca24]  ObjectSynchronizer::quick_enter(oopDesc*, JavaThread*, BasicLock*)+0x100
V  [libjvm.dylib+0x1094314]  SharedRuntime::monitor_enter_helper(oopDesc*, BasicLock*, JavaThread*)+0x34
V  [libjvm.dylib+0x10944d4]  SharedRuntime::complete_monitor_locking_C(oopDesc*, BasicLock*, JavaThread*)+0xa0
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v  ~RuntimeStub::_complete_monitor_locking_Java 0x0000000116ae2494
J 295% c2 SetNameAtExit.main([Ljava/lang/String;)V (247 bytes) @ 0x0000000116f977fc [0x0000000116f94d00+0x0000000000002afc]
j  java.lang.invoke.LambdaForm$DMH+0x0000024001002000.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@22-internal
j  java.lang.invoke.LambdaForm$MH+0x0000024001003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@22-internal
j  java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@22-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@22-internal
j  jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@22-internal
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@22-internal
j  com.sun.javatest.regtest.agent.MainWrapper$MainTask.run()V+134
j  java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@22-internal
j  java.lang.Thread.run()V+19 java.base@22-internal
v  ~StubRoutines::call_stub 0x00000001169d417c
Lock stack of current Java thread (top to bottom):
<empty>

The test failed in a single slowdebug config, i.e. in 1 of 9 runs during
the jdk-22+21 stress testing cycle.
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/12 Date: 2023-12-13 11:02:46 +0000
13-12-2023

[jdk21u-fix-request] Approval Request from Aleksey Shipilëv This effectively reverts [JDK-8256302](https://bugs.openjdk.org/browse/JDK-8256302), which introduced a race condition in JDK 21. The patch needs a few minor adjustments to match JDK 21. Testing passes.
27-11-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/401 Date: 2023-11-24 12:03:48 +0000
24-11-2023

Note for backport sequencing: JDK-8319896 needs to be backported first, otherwise the JavaThread::cast in new code would fail when called from final audit. (Edit: or maybe not, I can actually revert to pre-JDK 21 form, and we would not expose JDK 21 to the risk of rewrites right away; let me see...)
24-11-2023

Changeset: c39d001c Author: Patricio Chilano Mateo <pchilanomate@openjdk.org> Date: 2023-11-22 14:59:47 +0000 URL: https://git.openjdk.org/jdk/commit/c39d001c7a1ae9eb322a7bb621a03e18c9bf02a1
22-11-2023

The PR submitted is an effective backout of JDK-8256302 where the race was introduced.
20-11-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16738 Date: 2023-11-20 14:44:08 +0000
20-11-2023

[~pchilanomate] OK. Let's go with your patch. Aleksey wanted to proceed with his limited unlink patch and I think your patch will fit better with that. And then we can create a new Bug that covers the rewrite/PoC of the unlinking/deleting phase of monitor deflation.
20-11-2023

Regarding the 8256302 motivation of moving the release of oop storage out of the ObjectMonitor dtor, I've been running some micro benchmarks to compare how fast we can deflate. I see that the average time to deflate a monitor (accounting from the time we start walking the monitor list until we finish releasing the memory) goes from as low as 0.17us to as high as 1.5us. This is without introducing safepoint stops in the middle, since I just wanted to measure the deflation code. It seems there is a correlation where lower batch sizes result in being closer to the faster deflation rate, but I have actually seen this is variable. Regardless, this means that even at the worst deflation rate we should be able to handle any workload outside some artificial benchmark which creates inflated monitors at such a fast rate. I also measured the time it takes to release the oop storage vs the time it takes to free the ObjectMonitor memory, to see if it makes sense to split them and remain blocked for the second part, but I got mixed results so I'm not sure that's worth it.
20-11-2023

[~stefank] Ok, I already have a patch to send for review in case you decide otherwise.
20-11-2023

[~pchilanomate] Axel has a PoC patch ( https://github.com/openjdk/jdk/pull/16412#issuecomment-1818358260) that fixes this bug and the JDK-8319048. I'm assigning this bug to him, so that we can work on creating a proper PR for this. If you still want to own this Bug, then ping me.
20-11-2023

Yes, agreed. Sounds like we have a plan!
17-11-2023

Thanks [~eosterlund]. Okay, then I think the best thing is to move the release back to the ObjectMonitor dtor as it was before JDK-8256302. Yes, the old code was already polling for safepoints on each iteration so time to safepoint shouldn't be a problem.
17-11-2023

[~shade] is adding incremental polling to the unlink part of monitor deflation. Erik's suggestion is to add similar polling to the deletion loop (IIUC). If we end up adding polling for the deletion loop we should consider whether we need a flag to determine the number of monitors to delete between each poll. If we don't need a flag here, maybe we don't need it for the unlink phase either? Just some food for thought.
17-11-2023

I couldn't really tell from the numbers posted in JDK-8256302 whether moving the WeakHandle release was really a performance improvement, and I thought from reading it that _not_ polling for a safepoint between each ObjectMonitor destructor call was the cause of the performance improvement. If that's the case, the code could poll for a safepoint every N iterations of deleting ObjectMonitors, where N is something you can pick based on whichever of the tests in the original bug seemed to show the most difference. Releasing the WeakHandle in the destructor is where it really belongs unless there's a compelling performance reason not to do that.
17-11-2023
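The "poll every N iterations" idea in the comment above could look roughly like this sketch. Everything here is hypothetical stand-ins: `Monitor`, `safepoint_pending()` and `block_for_safepoint()` are not HotSpot's real types or API, just placeholders for the deletion loop's shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for ObjectMonitor; its destructor is where the WeakHandle
// release would happen in the proposed fix.
struct Monitor {};

static bool safepoint_pending() { return false; }  // stand-in safepoint poll
static void block_for_safepoint() {}               // stand-in block/yield

// Delete monitors in batches of n, checking for a pending safepoint
// between batches so time-to-safepoint stays bounded.
static size_t delete_with_polling(std::vector<Monitor*>& list, size_t n) {
    size_t deleted = 0;
    for (Monitor* m : list) {
        delete m;  // real code: ~ObjectMonitor() releases the WeakHandle here
        if (++deleted % n == 0 && safepoint_pending()) {
            block_for_safepoint();
        }
    }
    list.clear();
    return deleted;
}
```

Whether the batch size `n` deserves a tunable flag at all is exactly what the adjacent comments debate.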

Thanks. Also, there's a comment explaining why, for CLD creation races, we need to clear the weak handle OopStorage. I added it so as not to upset the GC barrier gods.
17-11-2023

[~pchilanomate] No, I think we should honour the current protocol of clearing oops before handing them back to OopStorage, which means that if we release the WeakHandle in the destructor of an ObjectMonitor, we simply have to be in_vm when deleting the ObjectMonitor. But isn't that a time to safepoint problem? No, not if we delete with incremental polling in the loop; then there will be no time to safepoint problem.
17-11-2023
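The protocol described above (clear the oop, then hand the slot back to the storage, all from the destructor) can be sketched as follows. All names here are hypothetical: `FakeStorage`, `WeakHandleSketch` and `MonitorSketch` stand in for OopStorage, WeakHandle and ObjectMonitor, and none of this is the real HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

using oop = uintptr_t;  // stand-in for HotSpot's oop

struct FakeStorage {
    int released = 0;
    void release(oop* slot) { ++released; delete slot; }
};

struct WeakHandleSketch {
    oop* _obj = nullptr;
    void release(FakeStorage* s) {
        if (_obj != nullptr) {
            *_obj = 0;         // honour the protocol: clear the oop first
            s->release(_obj);  // then hand the slot back to the storage
            _obj = nullptr;
        }
    }
};

struct MonitorSketch {
    WeakHandleSketch _object;
    FakeStorage* _storage;
    MonitorSketch(oop v, FakeStorage* s) : _storage(s) { _object._obj = new oop(v); }
    // Release in the destructor (the pre-JDK-8256302 shape): there is no
    // window in which a still-reachable monitor points at a freed slot.
    ~MonitorSketch() { _object.release(_storage); }
};
```

With the release back in the destructor, the deletion loop can poll for safepoints between monitors (as discussed above) without a time-to-safepoint problem.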

[~eosterlund] Sorry, just to clarify, are you saying we could have WeakHandle::release(OopStorage* storage) only do storage->release(_obj);? And not bother clearing the oop?
16-11-2023

> The current API effectively allows the WH to be deallocated whilst it is still in use!

Right. The WeakHandle and OopHandle APIs make no promises or claims that they are ok wrt concurrency. It's assumed that when you release the {Oop,Weak}Handle it's no longer in use.
16-11-2023

Ok, that's what we did prior to JDK-8256302 so we might need to go back to just releasing the storage when we delete the monitor. I think we don't need to be in a safepoint unsafe state to do that so we can remain blocked and keep that loop where we delete the monitors as is. The problem with being in a blocked state was clearing the handle (that's why we moved that call in 8256302 to ObjectMonitor::deflate_monitor()) but if we don't need to do that we are good.
06-11-2023

I thought we could release it in the OM dtor, which hopefully we are not racing with.
06-11-2023

But I think the problem here is the race between releasing the oop storage space and trying to read from it, even if we don't clear the handle, because the memory could be freed already by the time we do the load (I see this cleanup is done by the ServiceThread, i.e. not at a safepoint). Or if we let the GC always clear it, will it also release the oop storage space, and will this be done at a safepoint?
06-11-2023

I think the main reason the handle is cleared is just history. This used to be a strong oop, but we would run deflation as part of safepoint cleanup to remove as many of these strong roots as possible before the GC could run, which in practice would give it sort of weak-ish semantics, but not really. The deflation code would clear the strong oop or it would be kept alive by the GC. The oop was placed in an OopHandle, meaning it was still strong, and it worked similarly. I think I might have migrated it to use weak handles as weak semantics was really what the strong + deflate-before-safepoint code was trying to achieve. And I think that when I did that, I should probably have removed the explicit clearing, but didn’t realize it would cause any trouble either. But yeah, I don’t think we need to do any explicit clearing from deflation any longer since we moved to weak semantics for these roots.
06-11-2023

> I guess the current design was meant to avoid that overhead in the GC and just rely on the user to do the right thing.

But the user can't do the "right thing" here. When you allow this type of direct clearing then the accessing code and the clearing code must have a protocol to ensure the handle is never accessed after being cleared. You would need an additional piece of state indicating the handle is valid, which would have to be checked atomically with respect to acting on the handle - i.e. heavyweight locking. I'm unclear if the current situation is an oversight in the original changes (which were done in 2 parts: first switch to oopStorage, then remove TSM so we actually delete monitors), or whether we have since changed some assumptions/expectations that previously made it correct. Perhaps [~eosterlund] can comment? I am suspecting that the original design first moved the OMs to a place they could no longer be accessed, and then deleted them, thus freeing the oopStorage and the weakHandle.
06-11-2023

> The unavoidable usage races are safe if only the GC can do the clearing (which would require a safepoint)

I guess the current design was meant to avoid that overhead in the GC and just rely on the user to do the right thing. This behavior is actually not unique to weakHandle; the same issue exists when using OopHandle. If a thread calls release() on an OopHandle and another thread concurrently tries to load/store to that OopHandle, the load/store could be executed on already freed memory.
03-11-2023

Assuming there are no accesses after the referent is cleared kind of defeats the whole purpose of the API:

// A WeakHandle is a pointer to an oop that is stored in an OopStorage that is
// processed weakly by GC. The runtime structures that point to the oop must
// either peek or resolve the oop, the latter will keep the oop alive for
// the GC cycle. The runtime structures that reference the oop must test
// if the value is null. If it is null, it has been cleaned out by GC.
// This is the vm version of jweak but has different GC lifetimes and policies,
// depending on the type.

But I think the flaw here is that we should not be manually releasing a weakHandle as we do in monitor deflation. The weakHandle is supposed to be cleared by the GC. The unavoidable usage races are safe if only the GC can do the clearing (which would require a safepoint). So I think this is a misuse of this API.
03-11-2023

> If I comment that assert the test actually crashes with SIGSEGV instead.

Edit: The SIGSEGV crash was not the one I wanted to trigger. This was just a dereference of nullptr now that the assert was removed. I forgot that the call to os::naked_short_nanosleep() makes the compiler reload _obj when we come back. I moved the artificial delay to WeakHandle::peek() instead and used a local to simulate this scenario of reading and verifying _obj is not null but then crashing on the load. But at least for this test it doesn't crash. The race is still there though; it's just that we need to hit the case where the memory for the oop storage is actually freed by the OopStorage backend for the bug to be visible. I think this will only happen if the block where the address we are releasing lies is completely empty.

> Okay then it seems to me that weakHandle is a fundamentally broken API. The whole point is that the obj is supposed to be reclaimable whilst we still have a weakHandle for it. So releasing the obj should not break the weakHandle as the weakHandle is then supposed to be able to report the obj is now null.

> Looking closer at the API I'm not sure why we have the assert in peek/resolve instead of a check.

We do the check for peek/resolve for the OopHandle class. But in any case the check won't fix the race. We can always check and see that _obj is not null and then the deflater releases the storage right before making the load.

> The current API effectively allows the WH to be deallocated whilst it is still in use!

Yes, but I don't see how we can prevent that without locking. I guess the API just assumes release() is done at a time where there won't be any concurrent accesses. And given the asserts in peek/resolve, it also assumes there actually won't be any future accesses.
02-11-2023

Okay then it seems to me that weakHandle is a fundamentally broken API. The whole point is that the obj is supposed to be reclaimable whilst we still have a weakHandle for it. So releasing the obj should not break the weakHandle, as the weakHandle is then supposed to be able to report the obj is now null. The current API effectively allows the WH to be deallocated whilst it is still in use! Or maybe the issue is with the OopStorage mechanism itself - a weak handle needs a way to say to OopStorage "Here's something that was a pointer to an oop, can you tell me if it is still valid and if so return it and keep it alive for a while"??
02-11-2023
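The "ask the storage" idea floated above could look roughly like this sketch. To be clear, nothing like it exists in HotSpot: `SafeStorage` and `resolve_if_valid()` are purely hypothetical, and a real implementation could not afford a global mutex on the hot path; the point is only to show how doing the validity check and the load under one lock closes the race the comments describe.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_set>

using oop = uintptr_t;  // stand-in for HotSpot's oop

class SafeStorage {
    std::mutex _lock;
    std::unordered_set<oop*> _live;  // slots currently allocated
public:
    oop* allocate(oop v) {
        std::lock_guard<std::mutex> g(_lock);
        oop* slot = new oop(v);
        _live.insert(slot);
        return slot;
    }
    void release(oop* slot) {
        std::lock_guard<std::mutex> g(_lock);
        _live.erase(slot);
        delete slot;
    }
    // Check-and-load under one lock: returns the oop if the slot is
    // still live, 0 if it has already been released. There is no window
    // between "is it valid?" and the load.
    oop resolve_if_valid(oop* slot) {
        std::lock_guard<std::mutex> g(_lock);
        return _live.count(slot) != 0 ? *slot : 0;
    }
};
```

This is essentially the "heavyweight locking" trade-off called out in the later comment: the extra validity state must be checked atomically with respect to acting on the handle.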

> Do we really have a race here or is the expectation that we should just return null and the assertion is getting in the way?

Yes, we still have a race because we can always see that _obj is not null but right before making the load the address could be freed because the deflater released the oop storage. If I comment that assert the test actually crashes with SIGSEGV instead.
02-11-2023

> So peek() can return that the oop value pointed to by _obj is null, but _obj itself has to be != null.

Okay, the weakHandle code needs a little improvement in that it doesn't distinguish between the WH never having been set versus already having been released. The assertion message "Must be created" is misleading in the current case. I also think the API is flawed in that peek/resolve should not assert the way they do, but should simply return null if the object has already been released. Do we really have a race here or is the expectation that we should just return null and the assertion is getting in the way?
02-11-2023

So peek() can return that the oop value pointed to by _obj is null, but _obj itself has to be != null. The issue can be easily reproduced by running SetNameAtExit.java (and pretty much every test from what I see) with the options: "-XX:+UnlockDiagnosticVMOptions -XX:GuaranteedAsyncDeflationInterval=10 -XX:TieredStopAtLevel=3 -XX:LockingMode=0" and adding os::naked_short_nanosleep(1) in ObjectMonitor::object_peek() between the is_null() check and the peek() call. The problem is the race I mentioned above between a thread calling ObjectMonitor::object_peek() and the deflater thread releasing the oop storage for that weak handle in ObjectMonitor::deflate_monitor(). I see this bug was introduced by JDK-8256302 which moved the release of the oop storage from ~ObjectMonitor() to ObjectMonitor::deflate_monitor(). I don't see why we need to do object_peek() in quick_enter() though to identify the deflation case. If the monitor is deflated the owner change will just fail. But I see there are other usages of ObjectMonitor::object_peek() so I need to verify if those are safe and if not if we can get rid of them.
01-11-2023
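The window described in the comment above can be illustrated deterministically. In this sketch the "deflater" is called between the null check and the load to force the interleaving, and a `freed` flag simulates the slot being returned to the storage backend (in the real bug the final load is a use-after-free). `Slot`, `HandleSketch` and the function names are stand-ins, not HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

using oop = uintptr_t;  // stand-in for HotSpot's oop

struct Slot {
    oop value = 42;
    bool freed = false;  // simulates the slot being released to the backend
};

struct HandleSketch {
    Slot* _obj = nullptr;
    bool is_null() const { return _obj == nullptr; }
    // The real peek() would read freed memory; we simulate that by
    // returning 0 once the slot has been released.
    oop peek() const { return _obj->freed ? 0 : _obj->value; }
};

// The deflater's side: deflate_monitor() releasing the oop storage
// backing the handle.
static void deflater_releases(Slot& slot) { slot.freed = true; }

// object_peek() as in the buggy code: an is_null() check, then the
// load, with a window in between where deflation can run.
static oop racy_object_peek(const HandleSketch& h, Slot& slot) {
    if (h.is_null()) return 0;  // step 1: the check passes
    deflater_releases(slot);    // step 2: the deflater wins the race
    return h.peek();            // step 3: load from a released slot
}
```

The fix discussed in the comments removes the window entirely by not releasing the slot until the monitor itself is deleted, at which point no one can reach the handle any more.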

But peek() is allowed to return null, so I don't see how the assertion inside peek is valid.
01-11-2023

I think this is a race with the deflation thread. So in ObjectSynchronizer::quick_enter()->ObjectMonitor::object_peek(), between checking _object.is_null() and calling _object.peek() (which also calls is_null()) the deflation thread could run and deflate the monitor releasing the oop storage for _object. I'm guessing that we would only see this issue in a slowdebug build though, otherwise that repeated null check for _obj should be optimized out by the compiler. This sounds familiar in nature to JDK-8303086 where the issue was also reproducible only in slowdebug mode due to this repeated load and check. Although even if the _obj is not read twice we might load from a slot that has already been released, so the race is still there. I'll try tomorrow to see if I can reproduce it.
01-11-2023

inline oop WeakHandle::resolve() const {
  assert(!is_null(), "Must be created");
  return NativeAccess<ON_PHANTOM_OOP_REF>::oop_load(_obj);
}

inline oop WeakHandle::peek() const {
  assert(!is_null(), "Must be created");
  return NativeAccess<ON_PHANTOM_OOP_REF | AS_NO_KEEPALIVE>::oop_load(_obj);
}

Isn't the whole point of a weakHandle the fact that the referent may be gone because the weakHandle does not keep it alive? If so those asserts seem bogus to me.
01-11-2023

Is this another lockStack bug?
01-11-2023

ILW = HLH = P2
31-10-2023

Here's the logs from my jdk-22+21 stress run sighting on macos-aarch64:

$ unzip -l jdk-22+21_macosx-aarch64.8319137.zip
Archive:  jdk-22+21_macosx-aarch64.8319137.zip
   Length      Date    Time    Name
 ---------  ---------- -----   ----
    660587  10-28-2023 04:48   jdk-22+21_2/failures.macosx-aarch64/hs_err_pid49774.log
     23200  10-28-2023 04:48   jdk-22+21_2/failures.macosx-aarch64/StressWrapper_SetNameAtExit.jtr.slowdebug
 ---------                     -------
    683787                     2 files
30-10-2023