JDK-8311218 : fatal error: stuck in JvmtiVTMSTransitionDisabler::VTMS_transition_disable
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jvmti
  • Affected Version: 22
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2023-07-02
  • Updated: 2025-11-08
  • Resolved: 2023-12-19
Fix versions:
  • JDK 22: 22 — Fixed
  • JDK 23: b03 — Fixed
Description
The following test failed in the JDK22 CI:

applications/kitchensink/Kitchensink8H.java

Here are snippets from the hs_err_pid file:

#  Internal Error (/opt/mach5/mesos/work_dir/slaves/cd627e65-f015-4fb1-a1d2-b6c9b8127f98-S75464/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/f1e03cf8-a64b-4242-99b1-e99a91d8f432/runs/459520a4-c3ba-400b-9236-45383c6c91ff/workspace/open/src/hotspot/share/prims/jvmtiThreadState.cpp:358), pid=2574174, tid=2574242
#  fatal error: stuck in JvmtiVTMSTransitionDisabler::VTMS_transition_disable
#
# JRE version: Java(TM) SE Runtime Environment (22.0+5) (fastdebug build 22-ea+5-322)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22-ea+5-322, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x11c1507]  JvmtiVTMSTransitionDisabler::VTMS_transition_disable_for_all()+0x1f7

<snip>

---------------  T H R E A D  ---------------

Current thread (0x00007f4348011970):  JavaThread "Jvmti-AgentSampler" daemon [_thread_in_vm, id=2574242, stack(0x00007f43b2137000,0x00007f43b2238000) (1028K)]

Stack: [0x00007f43b2137000,0x00007f43b2238000],  sp=0x00007f43b2236b40,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x11c1507]  JvmtiVTMSTransitionDisabler::VTMS_transition_disable_for_all()+0x1f7  (jvmtiThreadState.cpp:358)
V  [libjvm.so+0x1152968]  JvmtiEnv::ResumeThread(_jobject*)+0x28  (jvmtiEnv.cpp:1082)
V  [libjvm.so+0x110459c]  jvmti_ResumeThread+0x18c  (jvmtiEnter.cpp:655)
C  [libJvmtiStressModule.so+0x25c8]  agent_sampler+0x5a8  (libJvmtiStressModule.c:292)
V  [libjvm.so+0x118fba9]  JvmtiAgentThread::call_start_function()+0x59  (jvmtiImpl.cpp:89)
V  [libjvm.so+0xeb322c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:719)
V  [libjvm.so+0x178be5a]  Thread::call_run()+0xba  (thread.cpp:217)
V  [libjvm.so+0x148d1dc]  thread_native_entry(Thread*)+0x11c  (os_linux.cpp:778)


This fatal error reminds me of:

JDK-8308985 vmTestbase/nsk/jvmti/scenarios/allocation/AP04/ap04t002/TestDescription.java stuck during VTMS_transition_disable_for_all

which was closed as a duplicate of:

JDK-8308978 regression with a deadlock involving FollowReferences
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk22/pull/23 Date: 2023-12-20 21:28:04 +0000
20-12-2023

Changeset: 0f8e4e0a Author: Serguei Spitsyn <sspitsyn@openjdk.org> Date: 2023-12-19 17:26:55 +0000 URL: https://git.openjdk.org/jdk/commit/0f8e4e0a81257c678e948c341a241dc0b810494f
19-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/17011 Date: 2023-12-07 06:28:43 +0000
07-12-2023

I've posted a PR with the fix. Please let me know if there are any comments, suggestions, or objections.
07-12-2023

> JVMTI is overly invasive but we could live with some try-finally around "critical code".

Alan, could you give an example of what you suggest for this fragment?

Thread.State threadState() {
    . . .
    case RUNNING:
        // if mounted then return state of carrier thread
        synchronized (carrierThreadAccessLock()) {
            Thread carrierThread = this.carrierThread;
            if (carrierThread != null) {
                return carrierThread.threadState();  // <== needs refactoring to allow jvmtiNotify
            }

Do you mean something like this?

    case RUNNING:
        notifyJvmtiHideSync(true);
        try {
            // if mounted then return state of carrier thread
            synchronized (carrierThreadAccessLock()) {
                Thread carrierThread = this.carrierThread;
                if (carrierThread != null) {
                    return carrierThread.threadState();
                }
            }
        } finally {
            notifyJvmtiHideSync(false);
        }

> VirtualThread.unmount is performance critical. Thread.interrupted is also performance
> critical but is thread confined (only the current thread can reset its own interrupt status).

I've intrinsified the notifyJvmtiHideSync method, so it has to be fast. Tested my prototype with mach5 tiers 1-6. It looks very good - no regressions noticed. Will try to come up with a better name for the notification method notifyJvmtiHideSync.
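The try/finally shape discussed above matters because the "hide" flag must be cleared on every exit path, including the early return taken inside the synchronized block. This is a minimal, self-contained sketch of that pattern; the names (notifyJvmtiHideSync, carrierThread, the ThreadLocal stand-in for the native flag) are illustrative, not the real VirtualThread internals.

```java
// Sketch of the try/finally hide pattern: the flag set before entering the
// "critical code" is cleared even when the synchronized block returns early.
public class HideSyncSketch {
    // Stand-in for the per-thread state the (hypothetical here) native
    // notifyJvmtiHideSync(boolean) would flip in the VM.
    private static final ThreadLocal<Boolean> HIDDEN = ThreadLocal.withInitial(() -> false);
    private static final Object lock = new Object();
    private static final Thread carrierThread = Thread.currentThread();

    static void notifyJvmtiHideSync(boolean hide) { HIDDEN.set(hide); }

    static Thread.State threadState() {
        notifyJvmtiHideSync(true);
        try {
            synchronized (lock) {
                Thread carrier = carrierThread;
                if (carrier != null) {
                    return carrier.getState();   // early return: finally still runs
                }
            }
        } finally {
            notifyJvmtiHideSync(false);          // never left set on any path
        }
        return Thread.State.RUNNABLE;
    }

    public static void main(String[] args) {
        Thread.State s = threadState();
        System.out.println("state=" + s + " hidden=" + HIDDEN.get());
    }
}
```

The finally block is what makes the early return inside the lock safe: without it, a return from the synchronized block would leave the thread permanently marked as hidden.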
06-12-2023

I have a prototype for the second approach which seems to be working well and reliably. The test from Patricio reliably deadlocks on each run without the fix, and it never deadlocks with the fix. The fix on the Handshake side is very simple:

diff --git a/src/hotspot/share/runtime/handshake.cpp b/src/hotspot/share/runtime/handshake.cpp
index 50c93d666e2..3dcabc94bb6 100644
--- a/src/hotspot/share/runtime/handshake.cpp
+++ b/src/hotspot/share/runtime/handshake.cpp
@@ -487,6 +487,9 @@ HandshakeOperation* HandshakeState::get_op_for_self(bool allow_suspend, bool che
   assert(_handshakee == Thread::current(), "Must be called by self");
   assert(_lock.owned_by_self(), "Lock must be held");
   assert(allow_suspend || !check_async_exception, "invalid case");
+  if (allow_suspend && _handshakee->is_in_critical_section()) {
+    allow_suspend = false;
+  }
   if (!allow_suspend) {
     return _queue.peek(no_suspend_no_async_exception_filter);
   } else if (check_async_exception && !_async_exceptions_blocked) {

The fix in VirtualThread.java is somewhat bigger and may raise performance overhead concerns. The notification can be intrinsified to address this potential overhead. It would be interesting to know Alan's opinion on this.
The fix in VirtualThread.java is:

diff --git a/src/java.base/share/classes/java/lang/VirtualThread.java b/src/java.base/share/classes/java/lang/VirtualThread.java
index c0bd7d30932..9de790e30c4 100644
--- a/src/java.base/share/classes/java/lang/VirtualThread.java
+++ b/src/java.base/share/classes/java/lang/VirtualThread.java
@@ -355,12 +355,14 @@ private void mount() {
         if (interrupted) {
             carrier.setInterrupt();
         } else if (carrier.isInterrupted()) {
+            notifyJvmtiHideSync(true);
             synchronized (interruptLock) {
                 // need to recheck interrupt status
                 if (!interrupted) {
                     carrier.clearInterrupt();
                 }
             }
+            notifyJvmtiHideSync(false);
         }

         // set Thread.currentThread() to return this virtual thread
@@ -378,10 +380,12 @@ private void unmount() {
         Thread carrier = this.carrierThread;
         carrier.setCurrentThread(carrier);

+        notifyJvmtiHideSync(true);
         // break connection to carrier thread, synchronized with interrupt
         synchronized (interruptLock) {
             setCarrierThread(null);
         }
+        notifyJvmtiHideSync(false);
         carrier.clearInterrupt();

         // notify JVMTI after unmount
@@ -738,6 +742,7 @@ void unpark() {
                 submitRunContinuation();
             }
         } else if ((s == PINNED) || (s == TIMED_PINNED)) {
+            notifyJvmtiHideSync(true);
             // unpark carrier thread when pinned.
             synchronized (carrierThreadAccessLock()) {
                 Thread carrier = carrierThread;
@@ -745,6 +750,7 @@ void unpark() {
                     U.unpark(carrier);
                 }
             }
+            notifyJvmtiHideSync(false);
         }
     }
 }
@@ -840,6 +846,7 @@ boolean joinNanos(long nanos) throws InterruptedException {
     public void interrupt() {
         if (Thread.currentThread() != this) {
             checkAccess();
+            notifyJvmtiHideSync(true);
             synchronized (interruptLock) {
                 interrupted = true;
                 Interruptible b = nioBlocker;
@@ -851,6 +858,7 @@ public void interrupt() {
                 Thread carrier = carrierThread;
                 if (carrier != null) carrier.setInterrupt();
             }
+            notifyJvmtiHideSync(false);
         } else {
             interrupted = true;
             carrierThread.setInterrupt();
@@ -868,10 +876,12 @@ boolean getAndClearInterrupt() {
         assert Thread.currentThread() == this;
         boolean oldValue = interrupted;
         if (oldValue) {
+            notifyJvmtiHideSync(true);
             synchronized (interruptLock) {
                 interrupted = false;
                 carrierThread.clearInterrupt();
             }
+            notifyJvmtiHideSync(false);
         }
         return oldValue;
     }
@@ -893,6 +903,7 @@ boolean getAndClearInterrupt() {
                 // runnable, not mounted
                 return Thread.State.RUNNABLE;
             case RUNNING:
+                notifyJvmtiHideSync(true);
                 // if mounted then return state of carrier thread
                 synchronized (carrierThreadAccessLock()) {
                     Thread carrierThread = this.carrierThread;
@@ -900,6 +911,7 @@ boolean getAndClearInterrupt() {
                         return carrierThread.threadState();
                     }
                 }
+                notifyJvmtiHideSync(false);
                 // runnable, mounted
                 return Thread.State.RUNNABLE;
             case PARKING:
@@ -990,6 +1002,7 @@ public String toString() {
             sb.append("]/");
         Thread carrier = carrierThread;
         if (carrier != null) {
+            notifyJvmtiHideSync(true);
             // include the carrier thread state and name when mounted
             synchronized (carrierThreadAccessLock()) {
                 carrier = carrierThread;
@@ -1000,6 +1013,7 @@ public String toString() {
                 sb.append(carrier.getName());
                 }
             }
+            notifyJvmtiHideSync(false);
         }
         // include virtual thread state when not mounted
         if (carrier == null) {
@@ -1097,6 +1111,8 @@ private void setCarrierThread(Thread carrier) {
     @JvmtiMountTransition
     private native void notifyJvmtiHideFrames(boolean hide);

+    private native void notifyJvmtiHideSync(boolean hide);
+
     private static native void registerNatives();

     static {
         registerNatives();
06-12-2023

Patricio's reproducer is very good. A 4th thread joins the party when the failure handler runs jcmd Thread.dump_to_file: since getting the stack trace of a target virtual thread needs the target to be mounted or unmounted, it has to back off and retry if the target is in transition.

> For the first approach it is up to Alan to decide if it is feasible to implement or not. It looks non-trivial to me.

The first approach isn't really feasible at this time, as there are a number of operations that have to coordinate with a target thread's carrier. Once the current effort on monitors gets to Object.wait, there is a bit more scope, as the coordination can be reduced to the pinned cases. As regards the second approach: JVMTI is overly invasive, but we could live with some try-finally around "critical code". VirtualThread.unmount is performance critical. Thread.interrupted is also performance critical but is thread confined (only the current thread can reset its own interrupt status).
06-12-2023

Patricio, thank you for the nice simple test and analysis, and David, thank you for sharing your opinion! For the first approach it is up to Alan to decide if it is feasible to implement or not. It looks non-trivial to me. The second approach is also non-trivial; it is what I wanted to try to implement:

> ... or we prevent suspending a thread while holding the interrupt lock in the other methods:
> unpark(), interrupt(), getAndClearInterrupt(), threadState(), toString().
05-12-2023

I attached a very simple reproducer that can be used when testing out a fix for this. I used getState() to show the deadlock rather than unpark() because it's easier to reproduce, but any method in the VirtualThread class that synchronizes on the interrupt lock can be used. In fact, with this exact test, replacing getState() with interrupt() or toString() also reproduces the issue.

I guess there are two ways to fix this. We either avoid synchronizing on the interrupt lock in mount/unmount while inside the JVMTI transition, or we prevent suspending a thread while holding the interrupt lock in the other methods: unpark(), interrupt(), getAndClearInterrupt(), threadState(), toString().

Some ideas about the first option. Can't we move notifyJvmtiMount() to after we synchronize on the interrupt lock, and notifyJvmtiUnmount() to before we synchronize on it? With respect to JVMTI operations, one issue I see with that is that we could observe a vthread that has carrierThread set but is not actually mounted, so getting the stack trace would fail, for example. But looking at JvmtiEnvBase::get_JavaThread_or_null(), we already check not only that the vthread has the carrierThread field set but also that the continuation is mounted on that carrier; otherwise we return a null JavaThread* indicating the vthread is unmounted. So even if we move those calls, we would still treat the vthread as unmounted in that window of time, since the continuation will not be mounted. We would need to correct the other callers of java_lang_VirtualThread::carrier_thread() in JVMTI code to do the same, i.e. only return a non-null carrier when the field is set and the continuation is mounted. Then I guess the other issue is that for JVMTI operations on the carrier we could read values for the interrupt status that actually correspond to the vthread. But that can already happen from what I see. JvmtiVTMSTransitionDisabler is a no-op for platform threads (unless we are using a JvmtiVTMSTransitionDisabler for all vthreads, which will indirectly wait for all carriers to finish a transition). Are there other issues I'm missing?
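The hazard described in the comments above is that a thread gets suspended while holding the interrupt lock, so any other thread that needs that lock (e.g. the unmount path) can make no progress. This self-contained sketch simulates that shape without JVMTI: suspension is simulated by an indefinite sleep, and the names (interruptLock, the thread roles) are illustrative only, not the real VirtualThread internals.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Simulates the deadlock-prone pattern: a lock holder is "suspended" and a
// second thread that needs the lock is stuck until the holder is "resumed".
public class SuspendWhileLockedSketch {
    public static void main(String[] args) throws Exception {
        ReentrantLock interruptLock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the role of Thread9: takes the lock, then is "suspended".
        Thread holder = new Thread(() -> {
            interruptLock.lock();
            lockHeld.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);   // "suspended" while holding the lock
            } catch (InterruptedException ignored) {
                // "resumed"
            } finally {
                interruptLock.unlock();
            }
        });
        holder.setDaemon(true);
        holder.start();
        lockHeld.await();

        // Plays the role of the unmounting carrier (Thread201): it cannot
        // acquire the lock while the holder is suspended.
        boolean acquired = interruptLock.tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("acquired while holder suspended: " + acquired);

        holder.interrupt();                     // simulate the resume
        acquired = interruptLock.tryLock(5, TimeUnit.SECONDS);
        System.out.println("acquired after resume: " + acquired);
    }
}
```

In the real bug the "resume" itself is blocked behind the stuck unmount transition, which is what closes the cycle into a deadlock; here the resume is independent so the program terminates.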
02-12-2023

JVMTI SuspendThread and all other suspend/resume functions set a JvmtiVTMSTransitionDisabler, which waits in its constructor for all VTMS transitions to complete. But Handshake::do_self_suspend() is not covered by the JvmtiVTMSTransitionDisabler. It is not Handshake-safe to self-suspend while in a critical section (thread is in a VTMS transition). For instance, we have this assert:

bool HandshakeState::suspend() {
  JVMTI_ONLY(assert(!_handshakee->is_in_VTMS_transition(), "no suspend allowed in VTMS transition");)
  . . .

But java_thread->is_in_VTMS_transition() is not set when VirtualThread.unpark() runs in this context:

@ChangesCurrentThread
void unpark() {
    Thread currentThread = Thread.currentThread();
    if (!getAndSetParkPermit(true) && currentThread != this) {
        int s = state();
        boolean parked = (s == PARKED) || (s == TIMED_PARKED);
        if (parked && compareAndSetState(s, RUNNABLE)) {
            if (currentThread instanceof VirtualThread vthread) {
                vthread.switchToCarrierThread();     // <= started TMP transition
                try {
                    submitRunContinuation();         // <= is_in_tmp_VTMS_transition bit is set
                } finally {
                    switchToVirtualThread(vthread);  // <= finished TMP transition
                }
            } else {
                submitRunContinuation();
            }
        } else if ((s == PINNED) || (s == TIMED_PINNED)) {
            // unpark carrier thread when pinned.
            synchronized (carrierThreadAccessLock()) {
                Thread carrier = carrierThread;
                if (carrier != null && ((s = state()) == PINNED || s == TIMED_PINNED)) {
                    U.unpark(carrier);               // <== is_in_VTMS_transition bit is NOT set in this context !!!
                }
            }
        }
    }
}

I wonder if the annotation @ChangesCurrentThread (or some other annotation) can be used. Will try to come up with a fix and discuss it with Patricio and David.

One complication in an attempt to fix it on the Handshakes implementation side is that self-suspension is done in a ThreadSelfSuspensionHandshake closure, which is a class derived from AsyncHandshakeClosure. HandshakeState::do_self_suspend() could check whether the current thread attempting to self-suspend is in a critical section. It is executed on the target thread, so if the thread is in a critical section then the ThreadSelfSuspensionHandshake closure has to be rejected and re-submitted. If we do that inside HandshakeState::do_self_suspend(), then on return the target thread will discover the re-submitted ThreadSelfSuspensionHandshake closure in the queue and will try to execute it right away again (with another rejection). We need to allow the target thread to get out of the critical section and away from this unsafe checkpoint. That is hard to do from the target thread and would be more natural to do from the requesting thread. But the problem is that the ThreadSelfSuspensionHandshake closure is asynchronous. It feels feasible to solve, but the Handshake submission needs to be re-arranged to allow the current thread to reject a suspend request while it is in a critical section.
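The "reject and retry" idea above amounts to: a suspend request is not honored at a poll point while the target is inside a critical section, only at the next poll point after the section is exited. This is a minimal single-threaded sketch of that control flow; all names (pollSuspend, the flags) are illustrative, not HotSpot's handshake code, which is both asynchronous and multi-threaded.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: a suspend request stays pending while the target is in a critical
// section, and is only honored at a poll point outside of it.
public class DeferredSuspendSketch {
    static final AtomicBoolean suspendRequested = new AtomicBoolean();
    static volatile boolean inCriticalSection;

    static void pollSuspend() {
        if (suspendRequested.get()) {
            if (inCriticalSection) {
                return;                        // reject: retry at a later poll point
            }
            suspendRequested.set(false);       // honored ("suspend" is a no-op here)
        }
    }

    public static void main(String[] args) {
        suspendRequested.set(true);            // requester asks for a suspend

        inCriticalSection = true;              // e.g. inside a VTMS transition
        pollSuspend();                         // rejected: request stays pending
        System.out.println("pending inside critical: " + suspendRequested.get());
        inCriticalSection = false;

        pollSuspend();                         // honored at the next poll point
        System.out.println("pending after critical: " + suspendRequested.get());
    }
}
```

The committed Handshake-side fix achieves the same effect by filtering the operation queue (treating allow_suspend as false while the handshakee is in a critical section), so the pending operation is simply not selected rather than rejected and re-submitted.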
15-11-2023

[~dholmes] > can the update to the interrupt state not be made outside of the critical part of the transition?

Right now, unmount has to synchronize with methods that access the carrier, e.g. the carrier's state is needed when a virtual thread is blocked on monitorenter or Object.wait. Once we have the monitor work in the loom repo, we will have the first piece needed to remove this coordination. The other piece is the equivalent of nioBlocker to safely propagate the interrupt when a virtual thread is in Object.wait, or to unpark the right thread when pinned while parked. There are a few other smaller pieces, and I think we can get there.
14-11-2023

[~alanb] can the update to the interrupt state not be made outside of the critical part of the transition? Maybe we need to re-examine how the interrupt state and unblocking are handled with VTs.
12-11-2023

> It seems inherently risky for the unmount transition to be able to block on this interruptLock - the transition is too critical to the JVMTI suspend/resume logic.

Thread.interrupt has to propagate the interrupt status to the target's carrier when mounted; this is a forced move, as the target may be blocked in Object.wait, which blocks the carrier. This requires coordination with unmount. Once the monitor work is further along, and specifically Object.wait, we should be able to remove the propagation of the interrupt status. In the meantime, if JVMTI SuspendThread is used to suspend a thread holding this critical lock, then deadlock is possible.
10-11-2023

Great job [~pchilanomate]! And thanks for digging into this too [~stefank]. It seems inherently risky for the unmount transition to be able to block on this interruptLock - the transition is too critical to the JVMTI suspend/resume logic.
10-11-2023

Leonid suggested disabling the module that does JVMTI SuspendThread/ResumeThread in the Kitchensink and RunThese30 stress tests. The tests will then keep running, but without failing in this scenario.
09-11-2023

The root cause of this deadlock seems to be that Thread9 was suspended while holding the interruptLock. The java.util.concurrent.CountDownLatch.countDown() call is involved in this bad code path. So, replacing the CountDownLatch class in Kitchensink with a simplified custom version would work, at least as a workaround. It still does not prevent this scenario from happening with some other classes. I do not have good enough knowledge of the java.util.concurrent package to understand what can be done to prevent this deadlock scenario in general. It seems we need Alan here.
09-11-2023

Patricio, thank you for the great analysis of this deadlock. Do you recognize it as the one related to the CountDownLatch class implementation we discussed a month ago? :) Am I right that it is the same issue? I also discussed this issue before with Leonid, as one of my newly developed tests had this kind of deadlock, so I had to replace the CountDownLatch class with my own simplified CountDownLatch implementation. I'm thinking about how to fix it. For instance, getting rid of the CountDownLatch class in Kitchensink could do the trick. Will talk to Leonid. Also, it is something that Alan may be interested to look at.
09-11-2023

[~pchilanomate] Okay, thanks. Right, the CountDownLatch issue was related to carrier thread starvation.
09-11-2023

[~sspitsyn] I think the CountDownLatch issue you mentioned is a different bug. I ran into that issue with the test VThreadEventTest.java, where I would get intermittent timeouts. But eventually I found out that the underlying issue was starvation in the ForkJoinPool, where a task was submitted but never processed. I was able to reliably reproduce that, and I verified that it was fixed after JDK-8288899 was integrated. I don't remember seeing this issue with suspend and the interruptLock before.
09-11-2023

[~pchilanomate] Great sleuthing! I thought about looking at the Java stack traces but never got to it.
09-11-2023

[~stefank] I followed your analysis looking at the crash from November 3 instead of the last one (jstack has issues with UseZGC) and I found the same scenario you describe. The reason why _is_in_VTMS_transition is true is that this thread is stuck in the unmount transition. I will use the equivalents of the thread ids you found in the last crash to explain the issue, i.e. I'm assuming the deadlock paths are exactly the same. The problem is that this lock is not some random lock; it is the interruptLock of the vthread mounted on carrier Thread201. Thread201 is stuck in java.lang.VirtualThread.unmount() trying to grab this interruptLock. The lock is taken by Thread9, which was suspended in java.lang.VirtualThread.unpark() while unparking the vthread mounted on Thread201. So Thread1 is waiting on Thread201 to finish the unmount transition, Thread201 is waiting on Thread9 to release the interruptLock, Thread9 is waiting on Thread1 to resume it... and we deadlock. Here are the relevant C and Java stack traces from the November 3 crash.

C and Java stack of thread stuck in unmount transition:

Thread 67 (LWP 1458748):
#0  0x00007f2b6a1ad898 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1  0x00007f2b68f6e185 in PlatformEvent::park_nanos (nanos=<optimized out>, this=<optimized out>)
#2  PlatformEvent::park_nanos (this=0x7f29ec1d4000, nanos=<optimized out>)
#3  0x00007f2b68f1ab93 in ObjectMonitor::EnterI (this=this@entry=0x7f2a1405dca0, current=current@entry=0x7f2a180be3f0)
#4  0x00007f2b68f1cee2 in ObjectMonitor::enter (this=0x7f2a1405dca0, current=current@entry=0x7f2a180be3f0)
#5  0x00007f2b691f57ab in ObjectSynchronizer::enter (obj=..., lock=lock@entry=0x7f2978fce7e0, current=current@entry=0x7f2a180be3f0)
#6  0x00007f2b690bea83 in SharedRuntime::monitor_enter_helper (obj=obj@entry=0x52813a910, lock=lock@entry=0x7f2978fce7e0, current=current@entry=0x7f2a180be3f0)
#7  0x00007f2b690bed2a in SharedRuntime::complete_monitor_locking_C (obj=0x52813a910, lock=0x7f2978fce7e0, current=0x7f2a180be3f0)
#8  0x00007f2b54c643c8 in ?? ()
#9  0x00000000a3e850b8 in ?? ()

"ForkJoinPool-1-worker-7" #7935 daemon prio=5 tid=0x00007f2a180be3f0 nid=1458748 waiting for monitor entry [0x00007f2978fce000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   JavaThread state: _thread_blocked
 - java.lang.VirtualThread.unmount() @bci=16, line=382 (Compiled frame)
 - waiting to lock <0x000000051f4285c0> (a java.lang.Object)
 - java.lang.VirtualThread.runContinuation() @bci=74, line=235 (Compiled frame)
 - java.lang.VirtualThread$$Lambda+0x00007f2aefefd168.run() @bci=4 (Compiled frame)
 - java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute() @bci=4, line=1726 (Compiled frame)
 - java.util.concurrent.ForkJoinTask$RunnableExecuteAction.compute() @bci=1, line=1717 (Compiled frame)
 - java.util.concurrent.ForkJoinTask$InterruptibleTask.exec() @bci=51, line=1641 (Compiled frame)
 - java.util.concurrent.ForkJoinTask.doExec() @bci=10, line=507 (Compiled frame)
 - java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.util.concurrent.ForkJoinTask, java.util.concurrent.ForkJoinPool$WorkQueue, int) @bci=49, line=1
 - java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, long, int) @bci=271, line=2077 (Compiled frame)
 - java.util.concurrent.ForkJoinPool.runWorker(java.util.concurrent.ForkJoinPool$WorkQueue) @bci=68, line=2028 (Compiled frame)
 - java.util.concurrent.ForkJoinWorkerThread.run() @bci=31, line=187 (Compiled frame)

C and Java stack of thread suspended with interruptLock held:

Thread 56 (LWP 1458628):
#0  0x00007f2b6a1ad4ac in pthread_cond_wait@@GLIBC_2.3.2 ()
#1  0x00007f2b68f6f39c in PlatformMonitor::wait (this=this@entry=0x7f2a1c1d6c78, millis=millis@entry=0)
#2  0x00007f2b68ec7524 in Monitor::wait_without_safepoint_check (this=this@entry=0x7f2a1c1d6c70, timeout=timeout@entry=0)
#3  0x00007f2b6886c4c3 in HandshakeState::do_self_suspend (this=this@entry=0x7f2a1c1d6c60) a1db-478c-8104-60c8b0af87dd-0196/executors/b4c42ec6-fe97-467c-9527-4ef28171ccf8/runs/f1b339ab-45c7-4eaa-87f8-53508338af71/workspace/open/src/hotspot/share/runtime/handshake.cpp:693
#4  0x00007f2b6886dde0 in ThreadSelfSuspensionHandshake::do_thread (this=<optimized out>, thr=0x7f2a1c1d66a0)
#5  0x00007f2b6886acf6 in HandshakeOperation::do_handshake (this=this@entry=0x7f2adcbbd750, thread=0x7f2a1c1d66a0)
#6  0x00007f2b6886b06f in HandshakeState::process_by_self (this=this@entry=0x7f2a1c1d6c60, allow_suspend=allow_suspend@entry=true, check_async_exception=check_async_exception@entry=true)
#7  0x00007f2b690abed7 in SafepointMechanism::process (thread=thread@entry=0x7f2a1c1d66a0, allow_suspend=allow_suspend@entry=true, check_async_exception=check_async_exception@entry=true)
#8  0x00007f2b6897b7ca in SafepointMechanism::process_if_requested (check_async_exception=true, allow_suspend=true, thread=0x7f2a1c1d66a0)
#9  SafepointMechanism::process_if_requested_with_exit_check (check_async_exception=true, thread=0x7f2a1c1d66a0)
#10 JavaThread::check_special_condition_for_native_trans (thread=0x7f2a1c1d66a0)
#11 0x00007f2b552599f8 in ?? ()
#12 0x000000051ab232f8 in ?? ()
#13 0x000000051f434708 in ?? ()

"Thread-2051" #7846 daemon prio=5 tid=0x00007f2a1c1d66a0 nid=1458628 runnable [0x00007f29d0ecc000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_blocked
 - jdk.internal.misc.Unsafe.unpark(java.lang.Object) @bci=0 (Compiled frame; information may be imprecise)
 - java.lang.VirtualThread.unpark() @bci=157, line=745 (Compiled frame)
 - locked <0x000000051f4285c0> (a java.lang.Object)
 - java.lang.System$2.unparkVirtualThread(java.lang.Thread) @bci=13, line=2657 (Compiled frame)
 - jdk.internal.misc.VirtualThreads.unpark(java.lang.Thread) @bci=4, line=93 (Compiled frame)
 - java.util.concurrent.locks.LockSupport.unpark(java.lang.Thread) @bci=12, line=179 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.signalNext(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) @bci=30, line=645 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(int) @bci=12, line=1147 (Compiled frame)
 - java.util.concurrent.CountDownLatch.countDown() @bci=5, line=290 (Compiled frame)
 - javasoft.sqe.tests.api.java.lang.management.ThreadMXBean.DumpAllThreads.validate_DumpAllThreads(boolean, boolean, int, java.util.function.Function) @bci=449, line=177 (Interpreted frame)
 - javasoft.sqe.tests.api.java.lang.management.ThreadMXBean.DumpAllThreads.test02(boolean, boolean, int) @bci=11, line=72 (Interpreted frame)
09-11-2023

I've found a deadlock while poking around in the core file.

Thread 1 (LWP 2097170) is trying to resume the suspended Thread 9:

raise
abort
os::abort
VMError::report_and_die
report_fatal
JvmtiVTMSTransitionDisabler::VTMS_transition_disable_for_all
JvmtiVTMSTransitionDisabler::JvmtiVTMSTransitionDisabler
JvmtiEnv::ResumeThread
jvmti_ResumeThread
jvmti_ResumeThread
agent_sampler
JvmtiAgentThread::call_start_function
JavaThread::thread_main_inner
Thread::call_run
thread_native_entry
start_thread
clone

Thread 9 (LWP 2105185) is waiting for Thread 1 to call resume AND it is holding the lock that Thread 201 tries to enter:

pthread_cond_wait@@GLIBC_2.3.2
PlatformMonitor::wait
Monitor::wait_without_safepoint_check
HandshakeState::do_self_suspend
ThreadSelfSuspensionHandshake::do_thread
HandshakeOperation::do_handshake
HandshakeState::process_by_self
SafepointMechanism::process_if_requested
SafepointMechanism::process_if_requested_with_exit_check
JavaThread::check_special_condition_for_native_trans
#11 0x00007f150857e2f8

Thread 201 (LWP 2105199) tries to take the lock Thread 9 is holding AND at the same time it is blocking Thread 1 from making progress because _is_in_VTMS_transition is true:

pthread_cond_timedwait@@GLIBC_2.3.2
PlatformEvent::park_nanos
PlatformEvent::park_nanos
ObjectMonitor::EnterI
ObjectMonitor::enter
ObjectSynchronizer::enter
InterpreterRuntime::monitorenter_obj
??

My question is still: why is _is_in_VTMS_transition true?
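The three-way cycle above (Thread1 waits on Thread201, Thread201 on Thread9, Thread9 on Thread1) is hard for the JDK to report automatically because one edge is a JVMTI suspension, not a monitor. For plain monitor cycles, though, the runtime can detect the deadlock directly. This self-contained sketch (not from the bug's reproducer) builds a deterministic two-thread monitor deadlock and surfaces it with ThreadMXBean.findMonitorDeadlockedThreads().

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Builds a two-thread monitor deadlock and detects it via ThreadMXBean.
public class DeadlockDetectSketch {
    public static void main(String[] args) throws Exception {
        final Object lockA = new Object();
        final Object lockB = new Object();
        // Both threads must hold their first lock before either tries the
        // second one, so the deadlock is deterministic, not timing-dependent.
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        spawn(lockA, lockB, bothHoldFirst);
        spawn(lockB, lockA, bothHoldFirst);

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 100 && ids == null; i++) {  // poll up to ~5s
            Thread.sleep(50);
            ids = mx.findMonitorDeadlockedThreads();
        }
        System.out.println("deadlocked threads: " + (ids == null ? 0 : ids.length));
    }

    static void spawn(Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try { latch.await(); } catch (InterruptedException ignored) { }
                synchronized (second) { }   // deadlocks here
            }
        });
        t.setDaemon(true);                  // let the JVM exit despite the deadlock
        t.start();
    }
}
```

In the crash analyzed here this detector would not fire, since the Thread9 edge is a handshake-based suspension rather than a monitor wait; the cycle had to be reconstructed by hand from the core file, as done above.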
09-11-2023

I took a look at the latest reported failure. That hang doesn't look to be related to ZGC. I do see something that I can't explain: there are a lot of threads that try to complete the JvmtiVTMSTransitionDisabler, but there is one thread inside a VTMS transition:

(gdb) p ThreadsSMRSupport::_java_thread_list->_threads[55]->_is_in_VTMS_transition
$198 = true
(gdb) p ThreadsSMRSupport::_java_thread_list->_threads[55]->_osthread->_thread_id
$199 = 2105199

and this seems to block the disabler. So, why is that thread in a VTMS transition? The stack trace for that thread doesn't show any hints of virtual threads or mounting/unmounting:

Thread 201 (LWP 2105199):
pthread_cond_timedwait
PlatformEvent::park_nanos
ObjectMonitor::EnterI
ObjectMonitor::enter
ObjectSynchronizer::enter
InterpreterRuntime::monitorenter_obj
0x00007f1507c731d6
0x0000000000000000

I guess the bug has to do with this? That the thread is marked as in a VTMS transition although it is not? OTOH I don't understand the JVMTI code either.

Just some question marks around the code (feel free to ignore):

void JvmtiVTMSTransitionDisabler::VTMS_vthread_mount(jobject vthread, bool hide) {
  if (hide) {
    VTMS_mount_begin(vthread);
  } else {
    VTMS_mount_end(vthread);
    if (JvmtiExport::should_post_vthread_mount()) {
      JvmtiExport::post_vthread_mount(vthread);
    }
  }
}

void JvmtiVTMSTransitionDisabler::VTMS_vthread_unmount(jobject vthread, bool hide) {
  if (hide) {
    if (JvmtiExport::should_post_vthread_unmount()) {
      JvmtiExport::post_vthread_unmount(vthread);
    }
    VTMS_unmount_begin(vthread, /* last_unmount */ false);
  } else {
    VTMS_unmount_end(vthread);
  }
}

From the Java code down into the JVM we set _is_in_VTMS_transition through these paths:

VirtualThread::mount calls
  notifyJvmtiMount(/*hide*/true) calls
    VTMS_mount_begin(vthread);  _is_in_VTMS_transition = true

VirtualThread::yield calls
  notifyJvmtiUnmount(/*hide*/true);
    VTMS_unmount_begin(vthread, /* last_unmount */ false);  _is_in_VTMS_transition = true
  yield
  notifyJvmtiMount(/*hide*/false);
    VTMS_mount_end(vthread);  _is_in_VTMS_transition = true

VirtualThread::unmount calls
  notifyJvmtiUnmount(/*hide*/false);
    VTMS_unmount_end(vthread);  _is_in_VTMS_transition = false

If we zoom in on:

VirtualThread::mount calls
  notifyJvmtiMount(/*hide*/true) calls
    VTMS_mount_begin(vthread);  _is_in_VTMS_transition = true

Where is the corresponding call that sets _is_in_VTMS_transition to false? What does "hide" mean?
09-11-2023

Here's the crashing thread's stack for the jdk-22+21-1636-tier3 sighting (applications/runthese/RunThese30M.java):

---------------  T H R E A D  ---------------

Current thread (0x00007f1018001750):  JavaThread "Jvmti-AgentSampler" daemon [_thread_in_vm, id=2097170, stack(0x00007f10ae75d000,0x00007f10ae85d000) (1024K)]

Stack: [0x00007f10ae75d000,0x00007f10ae85d000],  sp=0x00007f10ae85bb40,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x11dabe7]  JvmtiVTMSTransitionDisabler::VTMS_transition_disable_for_all()+0x1f7  (jvmtiThreadState.cpp:358)
V  [libjvm.so+0x116ad78]  JvmtiEnv::ResumeThread(_jobject*)+0x28  (jvmtiEnv.cpp:1082)
V  [libjvm.so+0x111c66c]  jvmti_ResumeThread+0x18c  (jvmtiEnter.cpp:655)
C  [libJvmtiStressModule.so+0x25c8]  agent_sampler+0x5a8  (libJvmtiStressModule.c:292)
V  [libjvm.so+0x11a9299]  JvmtiAgentThread::call_start_function()+0x59  (jvmtiImpl.cpp:89)
V  [libjvm.so+0xec412c]  JavaThread::thread_main_inner()+0xcc  (javaThread.cpp:720)
V  [libjvm.so+0x179c65a]  Thread::call_run()+0xba  (thread.cpp:220)
V  [libjvm.so+0x14a6eaa]  thread_native_entry(Thread*)+0x12a  (os_linux.cpp:785)
24-10-2023

This seems to be an issue with ZGC. The thread calling JvmtiEnv::ResumeThread() gets stuck in VTMS_transition_disable_for_all() because a carrier thread in the middle of an unmount transition is apparently stuck in ZPageAllocator::alloc_page_stall() while trying to allocate a new stackChunk object during freeze. This is not the only thread inside that ZGC code: in total I see 38 threads inside ZPageAllocator::alloc_page_stall() when we hit the assert, which makes it more suspicious.

This is the stack of the carrier thread that is blocking the thread doing JvmtiEnv::ResumeThread():

Java stack:
WARNING: could not get Thread object: java.lang.RuntimeException: ZCollectedHeap.oop_load_barrier not implemented
Could not get the java Thread object. Thread info will be limited.
tid=0x00007f428c55c610 nid=2660620 waiting on condition [0x00007f43b1218000]
   JavaThread state: _thread_blocked
 - jdk.internal.vm.Continuation.doYield() @bci=0 (Compiled frame; information may be imprecise)
WARNING: could not get Thread object: java.lang.RuntimeException: ZCollectedHeap.oop_load_barrier not implemented
 - jdk.internal.vm.Continuation.yield0(jdk.internal.vm.ContinuationScope, jdk.internal.vm.Continuation) @bci=18, line=360 (Compiled frame)
 - jdk.internal.vm.Continuation.yield(jdk.internal.vm.ContinuationScope) @bci=69, line=351 (Compiled frame)
 - java.lang.VirtualThread.yieldContinuation() @bci=12, line=431 (Compiled frame)
 - java.lang.VirtualThread.parkNanos(long) @bci=69, line=621 (Compiled frame)
 - java.lang.System$2.parkVirtualThread(long) @bci=20, line=2645 (Compiled frame)
 - jdk.internal.misc.VirtualThreads.park(long) @bci=4, line=67 (Compiled frame)
 - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=16, line=408 (Compiled frame)
 - applications.kitchensink.process.stress.modules.LockDeflationStressModule$LpHelloTask.run() @bci=3, line=351 (Compiled frame)
 - java.lang.Thread.runWith(java.lang.Object, java.lang.Runnable) @bci=5, line=1583 (Compiled frame)
 - java.lang.VirtualThread.run(java.lang.Runnable) @bci=63, line=309 (Compiled frame)
 - java.lang.VirtualThread$VThreadContinuation$1.run() @bci=8, line=190 (Compiled frame)
 - jdk.internal.vm.Continuation.enter0() @bci=4, line=320 (Compiled frame)
 - jdk.internal.vm.Continuation.enter(jdk.internal.vm.Continuation, boolean) @bci=1, line=312 (Compiled frame)
 - jdk.internal.vm.Continuation.enterSpecial(jdk.internal.vm.Continuation, boolean, boolean) @bci=0 (Compiled frame)
 - jdk.internal.vm.Continuation.run() @bci=122, line=248 (Compiled frame)
 - java.lang.VirtualThread.runContinuation() @bci=71, line=221 (Compiled frame)
 - java.lang.VirtualThread$$Lambda+0x00000008010d6b48.run() @bci=4 (Compiled frame)
 - java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec() @bci=4, line=1423 (Compiled frame)
 - java.util.concurrent.ForkJoinTask.doExec() @bci=10, line=387 (Compiled frame)
 - java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(java.util.concurrent.ForkJoinTask, java.util.concurrent.ForkJoinPool$WorkQueue) @bci=19, line=1312 (Compiled frame)
 - java.util.concurrent.ForkJoinPool.scan(java.util.concurrent.ForkJoinPool$WorkQueue, int, int) @bci=211, line=1843 (Compiled frame)
 - java.util.concurrent.ForkJoinPool.runWorker(java.util.concurrent.ForkJoinPool$WorkQueue) @bci=35, line=1808 (Interpreted frame)
 - java.util.concurrent.ForkJoinWorkerThread.run() @bci=31, line=188 (Compiled frame [deoptimized])

Native stack:
Thread 77 (LWP 2660620):
#0  0x00007f4812de5db6 in do_futex_wait.constprop () from /scratch/pchilano/debug/8311218/lib64/libpthread.so.0
#1  0x00007f4812de5ea8 in __new_sem_wait_slow.constprop.0 () from /scratch/pchilano/debug/8311218/lib64/libpthread.so.0
#2  0x00007f4811cf1b56 in PosixSemaphore::wait (this=this@entry=0x7f43b1217f38) at hotspot/os/posix/semaphore_posix.cpp:57
#3  0x00007f48120d7f98 in Semaphore::wait_with_safepoint_check (thread=0x7f428c55c610, this=0x7f43b1217f38) at hotspot/share/runtime/semaphore.inline.hpp:40
#4  ZFuture<bool>::get (this=0x7f43b1217f38) at hotspot/share/gc/z/zFuture.inline.hpp:50
#5  ZPageAllocation::wait (this=0x7f43b1217ee0) at hotspot/share/gc/z/zPageAllocator.cpp:164
#6  ZPageAllocator::alloc_page_stall (this=this@entry=0x7f480c0f1c30, allocation=allocation@entry=0x7f43b1217ee0) at hotspot/share/gc/z/zPageAllocator.cpp:533
#7  0x00007f48120d87e1 in ZPageAllocator::alloc_page_or_stall (this=this@entry=0x7f480c0f1c30, allocation=allocation@entry=0x7f43b1217ee0) at hotspot/share/gc/z/zPageAllocator.cpp:572
#8  0x00007f48120d8c93 in ZPageAllocator::alloc_page (this=this@entry=0x7f480c0f1c30, type=ZPageType::small, size=size@entry=2097152, flags=..., age=ZPageAge::eden) at hotspot/share/gc/z/zPageAllocator.cpp:710
#9  0x00007f481209af1f in ZHeap::alloc_page (this=0x7f480c0f1c30, type=<optimized out>, size=size@entry=2097152, flags=..., age=<optimized out>) at hotspot/share/gc/z/zHeap.cpp:227
#10 0x00007f48120cddb0 in ZObjectAllocator::alloc_page (this=0x7f480c0f1ed8, type=<optimized out>, size=2097152, flags=...) at hotspot/share/gc/z/zHeap.inline.hpp:40
#11 0x00007f48120ce126 in ZObjectAllocator::alloc_object_in_shared_page (this=0x7f480c0f1ed8, shared_page=0x7f480c10e010, page_type=ZPageType::small, page_size=<optimized out>, size=262144, flags=...) at hotspot/share/gc/z/zObjectAllocator.cpp:94
#12 0x00007f481208813c in ZAllocatorEden::alloc_tlab (size=262144, this=0x7f480c0f1ed8) at hotspot/share/gc/z/zAllocator.inline.hpp:46
#13 ZCollectedHeap::allocate_new_tlab (this=<optimized out>, min_size=<optimized out>, requested_size=32768, actual_size=0x7f43b1218180) at hotspot/share/gc/z/zCollectedHeap.cpp:146
#14 0x00007f4811a6b8df in MemAllocator::mem_allocate_inside_tlab_slow (this=this@entry=0x7f43b12182a0, allocation=...) at hotspot/share/gc/shared/memAllocator.cpp:305
#15 0x00007f4811a6c1cb in MemAllocator::mem_allocate_slow (this=0x7f43b12182a0, allocation=...) at hotspot/share/gc/shared/memAllocator.cpp:342
#16 0x00007f4811a6c281 in MemAllocator::mem_allocate (allocation=..., this=0x7f43b12182a0) at hotspot/share/gc/shared/memAllocator.cpp:360
#17 MemAllocator::allocate (this=this@entry=0x7f43b12182a0) at hotspot/share/gc/shared/memAllocator.cpp:367
#18 0x00007f4811160a9c in StackChunkAllocator::allocate (this=this@entry=0x7f43b12182a0) at hotspot/share/runtime/continuationFreezeThaw.cpp:1344
#19 0x00007f481119c8d1 in Freeze<Config<(oop_kind)1, ZBarrierSet> >::allocate_chunk (this=this@entry=0x7f43b1218410, stack_size=56) at hotspot/share/runtime/continuationFreezeThaw.cpp:1376
#20 0x00007f481119d357 in Freeze<Config<(oop_kind)1, ZBarrierSet> >::try_freeze_fast (this=this@entry=0x7f43b1218410) at hotspot/share/runtime/continuationFreezeThaw.cpp:545
#21 0x00007f48111821c2 in freeze_internal<Config<(oop_kind)1, ZBarrierSet> > (current=current@entry=0x7f428c55c610, sp=<optimized out>) at hotspot/share/runtime/continuationFreezeThaw.cpp:387
#22 0x00007f481118269b in Config<(oop_kind)1, ZBarrierSet>::freeze (sp=<optimized out>, thread=0x7f428c55c610) at hotspot/share/runtime/continuationFreezeThaw.cpp:271
#23 freeze<Config<(oop_kind)1, ZBarrierSet> > (current=0x7f428c55c610, sp=<optimized out>) at hotspot/share/runtime/continuationFreezeThaw.cpp:237
#24 0x00007f47fbd34f75 in ?? ()
#25 0x0000040251c00000 in ?? ()
#26 0x00007f47fc955368 in ?? ()
#27 0x000004005b09fed8 in ?? ()
#28 0x0000000000000000 in ?? ()

This is the stack of one of those other 37 threads (they all look similar):

Thread 75 (LWP 2574250):
#0  0x00007f4812de5db6 in do_futex_wait.constprop () from /scratch/pchilano/debug/8311218/lib64/libpthread.so.0
#1  0x00007f4812de5ea8 in __new_sem_wait_slow.constprop.0 () from /scratch/pchilano/debug/8311218/lib64/libpthread.so.0
#2  0x00007f4811cf1b56 in PosixSemaphore::wait (this=this@entry=0x7f43b1a2e468) at hotspot/os/posix/semaphore_posix.cpp:57
#3  0x00007f48120d7f98 in Semaphore::wait_with_safepoint_check (thread=0x7f480c5b4c90, this=0x7f43b1a2e468) at hotspot/share/runtime/semaphore.inline.hpp:40
#4  ZFuture<bool>::get (this=0x7f43b1a2e468) at hotspot/share/gc/z/zFuture.inline.hpp:50
#5  ZPageAllocation::wait (this=0x7f43b1a2e410) at hotspot/share/gc/z/zPageAllocator.cpp:164
#6  ZPageAllocator::alloc_page_stall (this=this@entry=0x7f480c0f1c30, allocation=allocation@entry=0x7f43b1a2e410) at hotspot/share/gc/z/zPageAllocator.cpp:533
#7  0x00007f48120d87e1 in ZPageAllocator::alloc_page_or_stall (this=this@entry=0x7f480c0f1c30, allocation=allocation@entry=0x7f43b1a2e410) at hotspot/share/gc/z/zPageAllocator.cpp:572
#8  0x00007f48120d8c93 in ZPageAllocator::alloc_page (this=this@entry=0x7f480c0f1c30, type=ZPageType::small, size=size@entry=2097152, flags=..., age=ZPageAge::eden) at hotspot/share/gc/z/zPageAllocator.cpp:710
#9  0x00007f481209af1f in ZHeap::alloc_page (this=0x7f480c0f1c30, type=<optimized out>, size=size@entry=2097152, flags=..., age=<optimized out>) at hotspot/share/gc/z/zHeap.cpp:227
#10 0x00007f48120cddb0 in ZObjectAllocator::alloc_page (this=0x7f480c0f1ed8, type=<optimized out>, size=2097152, flags=...) at hotspot/share/gc/z/zHeap.inline.hpp:40
#11 0x00007f48120ce126 in ZObjectAllocator::alloc_object_in_shared_page (this=0x7f480c0f1ed8, shared_page=0x7f480c107010, page_type=ZPageType::small, page_size=<optimized out>, size=262144, flags=...) at hotspot/share/gc/z/zObjectAllocator.cpp:94
#12 0x00007f481208813c in ZAllocatorEden::alloc_tlab (size=262144, this=0x7f480c0f1ed8) at hotspot/share/gc/z/zAllocator.inline.hpp:46
#13 ZCollectedHeap::allocate_new_tlab (this=<optimized out>, min_size=<optimized out>, requested_size=32768, actual_size=0x7f43b1a2e6b0) at hotspot/share/gc/z/zCollectedHeap.cpp:146
#14 0x00007f4811a6b8df in MemAllocator::mem_allocate_inside_tlab_slow (this=this@entry=0x7f43b1a2e720, allocation=...) at hotspot/share/gc/shared/memAllocator.cpp:305
#15 0x00007f4811a6c1cb in MemAllocator::mem_allocate_slow (this=0x7f43b1a2e720, allocation=...) at hotspot/share/gc/shared/memAllocator.cpp:342
#16 0x00007f4811a6c281 in MemAllocator::mem_allocate (allocation=..., this=0x7f43b1a2e720) at hotspot/share/gc/shared/memAllocator.cpp:360
#17 MemAllocator::allocate (this=this@entry=0x7f43b1a2e720) at hotspot/share/gc/shared/memAllocator.cpp:367
#18 0x00007f4811546209 in CollectedHeap::obj_allocate (__the_thread__=0x7f480c5b4c90, size=<optimized out>, klass=0x7f43b1a2e7a8, this=<optimized out>) at hotspot/share/gc/shared/collectedHeap.inline.hpp:36
#19 InstanceKlass::allocate_instance (this=<optimized out>, __the_thread__=__the_thread__@entry=0x7f480c5b4c90) at hotspot/share/oops/instanceKlass.cpp:1452
#20 0x00007f4811cd7cf0 in OptoRuntime::new_instance_C (klass=<optimized out>, current=0x7f480c5b4c90) at hotspot/share/opto/runtime.cpp:235
#21 0x00007f47fb8695c5 in ?? ()
#22 0x000004023b073900 in ?? ()
#23 0x00007f47fc3dd6a0 in ?? ()
#24 0x000000180000000c in ?? ()
#25 0x000004023b073920 in ?? ()
#26 0x0000000000000018 in ?? ()
#27 0x0000000000000000 in ?? ()
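For readers unfamiliar with the path in the Java stack above, here is a minimal, purely illustrative sketch (not the kitchensink module; the class name ParkSketch is made up) of the workload shape involved: a virtual thread that parks via LockSupport.parkNanos(). Parking unmounts the virtual thread from its carrier, and that unmount (freeze) path is where the VM may allocate a stackChunk object in the Java heap, the allocation that is stalling in ZPageAllocator::alloc_page_stall() in this failure.

```java
import java.util.concurrent.locks.LockSupport;

public class ParkSketch {
    public static void main(String[] args) throws Exception {
        // Start a virtual thread that parks briefly, mirroring the
        // VirtualThread.parkNanos -> Continuation.yield frames in the
        // carrier-thread stack above.
        Thread vt = Thread.ofVirtual().start(() -> {
            // parkNanos unmounts the virtual thread from its carrier;
            // freezing the continuation may allocate a stackChunk object,
            // which under ZGC can stall waiting for a page.
            LockSupport.parkNanos(1_000_000L); // ~1 ms
        });
        vt.join();
        System.out.println("virtual thread completed: " + !vt.isAlive());
    }
}
```

Under normal conditions the park and unpark complete quickly; the bug is that the JVMTI VTMS-transition handshake can end up waiting on a carrier thread whose freeze-time allocation never completes.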
11-07-2023

That's ok, David. Thank you for the information.
04-07-2023

[~dqu] Sorry, kitchensink is not something we can share or reduce. It's a stress test that is an amalgamation of numerous other tests, JVMTI commands, and a whole heap of other stuff. I have to wonder whether the fix for JDK-8303086 may have caused this. Paging [~sspitsyn].
03-07-2023

Hi Daniel, I noticed that the test `applications/kitchensink/Kitchensink8H.java` isn't available in the JDK code base. Is it an internal test case? I was wondering if it would be possible to share the code of this test, or perhaps a reduced version of it. Thanks!
03-07-2023