JDK-8318757 : VM_ThreadDump asserts in interleaved ObjectMonitor::deflate_monitor calls
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17,21,22
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2023-10-24
  • Updated: 2024-05-28
  • Resolved: 2023-11-16
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 21 JDK 22
21.0.4Fixed 22 b25Fixed
Related Reports
Blocks :  
Blocks :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
The following test failed in the JDK22 CI:

applications/runthese/RunThese8H.java

Here's snippets from the hs_err_pid file:

#  Internal Error (/opt/mach5/mesos/work_dir/slaves/afbc6042-3a24-4198-9369-18c663a3f74c-S26242/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/9bb4b2f2-b367-4f77-8070-a87bcc8390e0/runs/b97f4c9a-293f-423a-9fb9-8eaa8a0903d7/workspace/open/src/hotspot/share/runtime/objectMonitor.inline.hpp:123), pid=1966066, tid=1966105
#  assert(prev == old_value) failed: unexpected prev owner=0x0000000000000002, expected=0x0000000000000000
#
# JRE version: Java(TM) SE Runtime Environment (22.0+21) (fastdebug build 22-ea+21-1634)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-ea+21-1634, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x1463f3e]  ObjectMonitor::deflate_monitor()+0x87e

<snip>

---------------  T H R E A D  ---------------

Current thread (0x00007f955802b180):  VMThread "VM Thread"          [id=1966105, stack(0x00007f9540457000,0x00007f9540557000) (1024K)]

Stack: [0x00007f9540457000,0x00007f9540557000],  sp=0x00007f95405554a0,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1463f3e]  ObjectMonitor::deflate_monitor()+0x87e  (objectMonitor.inline.hpp:123)
V  [libjvm.so+0x17348a6]  ObjectSynchronizer::deflate_monitor_list(Thread*, LogStream*, elapsedTimer*, ObjectMonitorsHashtable*)+0xa6  (synchronizer.cpp:1603)
V  [libjvm.so+0x1734adc]  ObjectSynchronizer::deflate_idle_monitors(ObjectMonitorsHashtable*)+0x18c  (synchronizer.cpp:1687)
V  [libjvm.so+0x1880da8]  VM_ThreadDump::doit()+0x148  (vmOperations.cpp:289)
V  [libjvm.so+0x18815ff]  VM_Operation::evaluate()+0x13f  (vmOperations.cpp:71)
V  [libjvm.so+0x18a2251]  VMThread::evaluate_operation(VM_Operation*)+0x111  (vmThread.cpp:281)
V  [libjvm.so+0x18a2d8f]  VMThread::inner_execute(VM_Operation*)+0x3bf  (vmThread.cpp:435)
V  [libjvm.so+0x18a2f74]  VMThread::loop()+0x84  (vmThread.cpp:502)
V  [libjvm.so+0x18a3082]  VMThread::run()+0x92  (vmThread.cpp:175)
V  [libjvm.so+0x179c5da]  Thread::call_run()+0xba  (thread.cpp:220)
V  [libjvm.so+0x14a6e2a]  thread_native_entry(Thread*)+0x12a  (os_linux.cpp:785)
Comments
[jdk21u-fix-request] Approval Request from Aleksey Shipilëv This resolves potentially catastrophic bug in monitor deflation due to interleaving with thread dumping code that is frequently exercised by profilers. This is a part of atomic 21u integration, see 21u-dev PR for more details. Applies cleanly, but the new test needs -XX:+UnlockExperimentalVMOptions to gain access to -XX:LockingMode. After that amendment, all tests pass. Was in mainline for several months. Small bugtail: needs JDK-8320515 as the followup.
12-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/337 Date: 2024-03-06 19:12:00 +0000
07-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/384 Date: 2023-11-20 12:28:56 +0000
20-11-2023

Changeset: 87be6b69 Author: Stefan Karlsson <stefank@openjdk.org> Date: 2023-11-16 14:33:50 +0000 URL: https://git.openjdk.org/jdk/commit/87be6b69fe985eee01fc3344f9153d774db792c1
16-11-2023

This looks a generic bug in monitor deflation, and it blocks JDK-8319048 that would expose the same trouble on unlinking path quite reliably. The unlinking code is already broken, but it fails rarely since it does not check for safepoints too often (the trouble that JDK-8319048 would fix).
09-11-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16519 Date: 2023-11-06 09:46:11 +0000
06-11-2023

[~pchilanomate] Thanks for taking a look at this bug. Yes, I'd be happy to polish and try to get my patch upstreamed. My intention was to first discuss it with [~dcubed]. Dan could you ping me when you have time to discuss my proposed rewrite?
01-11-2023

I attached a reproducer. The problem is what [~stefank] pointed out about concurrent deflation operations by the VMThread and MonitorDeflationThread. In particular, in ObjectMonitor::deflate_monitor() the conditional "if (ObjectSynchronizer::is_final_audit() && owner_is_DEFLATER_MARKER())" doesn't consider that the call could be made by the VMThread due to a VM_ThreadDump operation, so in the cases where the monitor has already been deflated, instead of bailing out we continue and hit the assert when expecting the owner to be null. After fixing that small bug running the reproducer results in SIGSEGV or NMT verification code crashes instead, revealing a bigger issue: after the VM_ThreadDump finishes, if the MonitorDeflationThread was blocked in deflate_monitor_list() or unlink_deflated() it will resume iterating the list from monitors that could have already been deleted by the VMThread. So regarding the solutions, we can either keep the current overall logic and guard against this case (we could add a check after the MonitorDeflationThread unblocks to see if another deflation operation happened, and in that case restart the iteration from the head again) or we can go with [~stefank] approach of rewriting VM_ThreadDump so we avoid deflating in that operation and instead we just collect in a table the monitors we care about. [~stefank] I looked at your prototype and seems you have already figured out all the details, want to give your patch a try?
01-11-2023

FWIW, I have a prototype to rewrite VM_DumpThreads. As a part of that patch I remove the problematic monitor deflation from VM_DumpThreads: https://github.com/stefank/jdk/tree/object-monitor-dump As I see it VM_DumpThreads doesn't have to deflate monitors and even get better performance if we *instead* apply a correct filter over those monitors that we want to collect into the table. Experiments I did with millions of ObjectMonitors matched that assumption. There might be some benefits of cleaning out the _in_use_list, but I think that could be handled by triggering the MonitorDeflationThread to do an async deflation pass, instead of doing it in a safepoint operation.
31-10-2023

[~dcubed] I think this has started to happen now because the test was updated a week ago.
31-10-2023

ILW = HLH = P2 Priority may change if we have a known work-around.
31-10-2023

[~stefank] - Thanks for the analysis. Do you have any theory about why this failure mode started about a week ago? The possibility of two threads both doing monitor deflation has been in the code for a very long time, but this failure mode is recent... We used to have quite elaborate mechanisms in place to keep the two threads from messing with each other, but all that was simplified when [~eosterlund] (and I) got rid of Type Stable Memory and greatly simplified the list management stuff.
31-10-2023

I peeked at the core file for one of the problematic cases. Two threads are running monitor deflation. The first thread is the crashing VMThread running a safepointed VM_ThreadDump, which include a monitor deflation pass. The other thread is the MonitorDeflationThread that runs an async monitor deflation pass: SafepointSynchronize::block SafepointMechanism::process SafepointMechanism::process_if_requested ThreadBlockInVMPreprocess<void (JavaThread*)>::~ThreadBlockInVMPreprocess ThreadBlockInVM::~ThreadBlockInVM ObjectSynchronizer::chk_for_block_req ObjectSynchronizer::deflate_monitor_list ObjectSynchronizer::deflate_idle_monitors MonitorDeflationThread::monitor_deflation_thread_entry The stack trace doesn't contain all frames so I'm guessing how this thread got to chk_for_block_req. It seems likely that the thread is inside the loop that is deflating object monitors on the _in_use_list list, but that it hasn't reached the point where it tries to unlink the deflated monitors. In the deflation loop we have checks to see if it is time safepoint (chk_for_block_req). At some point it *is* time to block (the stack trace above), and the VMThread starts to run the VM_ThreadDump safepoint operation, which also tries to deflate monitors on the _in_use_list. I think this is what then causes the assert to fail. The MonitorDeflationThread sets the _owner to DEFLATER_MARKER (value 0x2) and release the _object OopHandle. Then when the VMThread finds this ObjectMonitor it sees that _object is "cleared" and expects _owner to be nullptr and not 0x2. The assert seems to come from this path: ``` if (obj == nullptr) { // If the object died, we can recycle the monitor without racing with // Java threads. The GC already broke the association with the object. set_owner_from(nullptr, DEFLATER_MARKER); ``` So, I think the problem is that we have these two interleaved monitor deflation passes.
31-10-2023

I was wondering if the dangling 0x2 was due to an anonymous owner value being accidentally left behind. However, that's not the case: src/hotspot/share/runtime/objectMonitor.hpp: // Separate _header and _owner on different cache lines since both can // have busy multi-threaded access. _header and _object are set at initial // inflation. The _object does not change, so it is a good choice to share // its cache line with _header. DEFINE_PAD_MINUS_SIZE(0, OM_CACHE_LINE_SIZE, sizeof(volatile markWord) + sizeof(WeakHandle)); // Used by async deflation as a marker in the _owner field. // Note that the choice of the two markers is peculiar: // - They need to represent values that cannot be pointers. In particular, // we achieve this by using the lowest two bits. // - ANONYMOUS_OWNER should be a small value, it is used in generated code // and small values encode much better. // - We test for anonymous owner by testing for the lowest bit, therefore // DEFLATER_MARKER must *not* have that bit set. #define DEFLATER_MARKER reinterpret_cast<void*>(2) public: // NOTE: Typed as uintptr_t so that we can pick it up in SA, via vmStructs. static const uintptr_t ANONYMOUS_OWNER = 1; So we have a dangling DEFLATER_MARKER value. Please note that we have only seen this failure starting a week ago (2023.10.24) so I think this failure mode is due to a somewhat recent change. I still suspect that this bug is due to something in the LM_LIGHTWEIGHT implementation, but I don't have a smoking gun yet.
31-10-2023

Here's snippets from the hs_err_pid file from the jdk-22+22-1699-tier3 sighting: applications/runthese/RunThese30M.java # Internal Error (/opt/mach5/mesos/work_dir/slaves/afbc6042-3a24-4198-9369-18c663a3f74c-S38645/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/c338546e-d243-4a4b-a798-bb484eb5f7a6/runs/9b6b9c1a-1ca6-401a-9f55-af2ee98bdfe8/workspace/open/src/hotspot/share/runtime/objectMonitor.inline.hpp:123), pid=628310, tid=628349 # assert(prev == old_value) failed: unexpected prev owner=0x0000000000000002, expected=0x0000000000000000 # # JRE version: Java(TM) SE Runtime Environment (22.0+22) (fastdebug build 22-ea+22-1699) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-ea+22-1699, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x1464bee] ObjectMonitor::deflate_monitor()+0x87e --------------- T H R E A D --------------- Current thread (0x00007ff8a002b180): VMThread "VM Thread" [id=628349, stack(0x00007ff880374000,0x00007ff880474000) (1024K)] Stack: [0x00007ff880374000,0x00007ff880474000], sp=0x00007ff8804724a0, free space=1017k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x1464bee] ObjectMonitor::deflate_monitor()+0x87e (objectMonitor.inline.hpp:123) V [libjvm.so+0x173bcd6] ObjectSynchronizer::deflate_monitor_list(Thread*, LogStream*, elapsedTimer*, ObjectMonitorsHashtable*)+0xa6 (synchronizer.cpp:1603) V [libjvm.so+0x173bf0c] ObjectSynchronizer::deflate_idle_monitors(ObjectMonitorsHashtable*)+0x18c (synchronizer.cpp:1687) V [libjvm.so+0x18880f8] VM_ThreadDump::doit()+0x148 (vmOperations.cpp:289) V [libjvm.so+0x188894f] VM_Operation::evaluate()+0x13f (vmOperations.cpp:71) V [libjvm.so+0x18a95a1] VMThread::evaluate_operation(VM_Operation*)+0x111 (vmThread.cpp:281) V [libjvm.so+0x18aa0df] VMThread::inner_execute(VM_Operation*)+0x3bf (vmThread.cpp:435) V [libjvm.so+0x18aa2c4] VMThread::loop()+0x84 (vmThread.cpp:502) V [libjvm.so+0x18aa3d2] VMThread::run()+0x92 (vmThread.cpp:175) V [libjvm.so+0x17a3a0a] Thread::call_run()+0xba (thread.cpp:220) V [libjvm.so+0x14a7aca] thread_native_entry(Thread*)+0x12a (os_linux.cpp:785)
27-10-2023