Bug ID: JDK-8360779 Possible deadlock involving JfrSamplerThread

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 26

Priority: P4
Status: New
Resolution: Unresolved
OS: os_x
CPU: aarch64

Submitted: 2025-06-27
Updated: 2025-06-30

An execution of our Runthese8H stress application encountered a timeout. As part of the timeout handling a core dump of the process was taken. That core dump shows that the VMThread is trying to initiate a safepoint but can't get the Threads_lock. Other threads are also blocked trying to acquire the Threads_lock for thread start/remove and other checks. Yet more threads are blocked trying to submit further VM operations. The owner of the Threads_lock is here:

thread #43
    frame #0: 0x000000018d150cc0 libsystem_kernel.dylib`__sigsuspend + 8
    frame #1: 0x0000000105c0302c libjvm.dylib`SR_handler(int, __siginfo*, void*) + 308
    frame #2: 0x000000018d1bade4 libsystem_platform.dylib`_sigtramp + 56
    frame #3: 0x000000018d185a00 libsystem_pthread.dylib`_pthread_create + 996
    frame #4: 0x0000000105aa095c libjvm.dylib`os::create_thread(Thread*, os::ThreadType, unsigned long) + 312
    frame #5: 0x00000001054e15cc libjvm.dylib`JavaThread::JavaThread(void (*)(JavaThread*, JavaThread*), unsigned long, MemTag) + 68
    frame #6: 0x0000000105642844 libjvm.dylib`JVM_StartThread + 1136

and it appears this thread is being suspended by the JfrSamplerThread and is waiting to be resumed. Here is the JfrSamplerThread:

thread #65
    frame #0: 0x000000018d150d9c libsystem_kernel.dylib`__ulock_wait2 + 8
    frame #1: 0x000000018d1b8aac libsystem_platform.dylib`_os_unfair_lock_lock_slow + 180
    frame #2: 0x000000018d183ee8 libsystem_pthread.dylib`pthread_kill + 152
    frame #3: 0x0000000105c02584 libjvm.dylib`PosixSignals::do_resume(OSThread*) + 116
    frame #4: 0x0000000105c026d0 libjvm.dylib`SuspendedThreadTask::internal_do_task() + 80
    frame #5: 0x0000000105575994 libjvm.dylib`JfrSamplerThread::sample_java_thread(JavaThread*) + 76
    frame #6: 0x00000001055756f8 libjvm.dylib`JfrSamplerThread::task_stacktrace(JfrSampleRequestType, JavaThread**) + 332
    frame #7: 0x0000000105575580 libjvm.dylib`JfrSamplerThread::run() + 300
    frame #8: 0x0000000105d34dbc libjvm.dylib`Thread::call_run() + 240
    frame #9: 0x0000000105aa0ef8 libjvm.dylib`thread_native_entry(Thread*) + 312
    frame #10: 0x000000018d1842e4 libsystem_pthread.dylib`_pthread_start + 136

It is trying to resume the target thread but for some reason is blocking at the OS level within pthread_kill.
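
The blocking chain described above can be written down as a small wait-for graph. This is only an illustrative model, not VM code: the thread and lock names come from the stack traces in this report, while the `waits_for` and `held_by` tables and the `blocking_chain` helper are hypothetical names introduced here. The owner of the OS-level lock inside pthread_kill is marked unknown because the core dump does not show it.

```python
# Toy wait-for graph for the first failure. Names are taken from the stack
# traces in this report; the graph itself is an illustrative model only.
from typing import Dict, List, Optional

# Each blocked party maps to the resource it is waiting for.
waits_for: Dict[str, str] = {
    "VMThread": "Threads_lock",
    "thread #43": "resume signal",
    "JfrSamplerThread (thread #65)": "os_unfair_lock in pthread_kill",
}

# Each resource maps to the party expected to release or provide it.
# None means the owner could not be determined from the core dump.
held_by: Dict[str, Optional[str]] = {
    "Threads_lock": "thread #43",
    "resume signal": "JfrSamplerThread (thread #65)",
    "os_unfair_lock in pthread_kill": None,
}


def blocking_chain(start: str) -> List[str]:
    """Follow waits_for/held_by links from `start` until the owner is
    unknown, the current thread is not waiting, or a cycle is found."""
    chain = [start]
    seen = {start}
    current: Optional[str] = start
    while current is not None and current in waits_for:
        resource = waits_for[current]
        chain.append(resource)
        current = held_by.get(resource)
        if current is None:
            chain.append("<unknown owner>")
        elif current in seen:
            chain.append(current + " (cycle)")
            break
        else:
            chain.append(current)
            seen.add(current)
    return chain


if __name__ == "__main__":
    print(" -> ".join(blocking_chain("VMThread")))
```

The chain ends at an unknown owner rather than looping back on itself, which matches the open question in this report: the hang is not yet a proven lock cycle, and the missing piece is who holds the lock that pthread_kill is waiting on.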

We have another failure that I think is related - but still on Aarch64 unfortunately. Also unfortunately we don't have native stack information only jhsdb. I can't see direct involvement of the signal code because there is no native stack info but the similarity is that one JavaThread holds the Threads_lock and is preventing the VMThread from establishing a safepoint, so other threads are piling up on different locks, and trying to submit VM operations.

Internal VM Mutex ThreadsLockThrottle_lock is owned by RunThese-TestRunner-Thread-0, nid=36355, address=0x000000014e869810
Internal VM Mutex Threads_lock is owned by RunThese-TestRunner-Thread-0, nid=36355, address=0x000000014e869810
Internal VM Mutex JNICritical_lock is owned by Thread-7219, nid=72499, address=0x000000013d53de10
Internal VM Mutex Heap_lock is owned by Thread-7219, nid=72499, address=0x000000013d53de10

----------------- 36355 -----------------
"RunThese-TestRunner-Thread-0" #35 daemon prio=5 tid=0x000000014e869810 nid=36355 runnable [0x0000000179eae000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_in_vm
0x0000000182ac8e88 ????????
0x0000000182af04e4 ????????
0x1e15000182aee140 ????????
0xfd4f800182ada840 ????????
0xe741800104feff00 ????????
0x0000000104a2f1ec __ZN10JavaThreadC1EPFvPS_S0_Em6MemTag + 0x44
0x0000000104b90d58 _JVM_StartThread + 0x470
0x0000000116001820 java.lang.Thread.start0() + 0xa0 (Native method)
0x0000000115a19614 * java.lang.Thread.start() bci:23 line:1417 (Interpreted frame)
0x000000010e670edc * applications.kitchensink.process.stress.modules.JckStressModule$TestRunner.runTest(java.lang.String, long) bci:400 line:314 (Compiled frame)

But there is no JfrSamplerThread in this scenario. Again we need to know why the thread holding the Threads_lock can't proceed.
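
For comparing lock ownership across failures, the "Internal VM Mutex ... is owned by ..." lines can be grouped by owning thread. The following is a minimal sketch that assumes the line format is exactly as quoted in this report. The regular expression and the `locks_by_owner` helper are hypothetical names introduced here and are not part of jhsdb.

```python
import re
from collections import defaultdict

# Sample lines copied from the jhsdb output quoted in this report.
SAMPLE = """\
Internal VM Mutex ThreadsLockThrottle_lock is owned by RunThese-TestRunner-Thread-0, nid=36355, address=0x000000014e869810
Internal VM Mutex Threads_lock is owned by RunThese-TestRunner-Thread-0, nid=36355, address=0x000000014e869810
Internal VM Mutex JNICritical_lock is owned by Thread-7219, nid=72499, address=0x000000013d53de10
Internal VM Mutex Heap_lock is owned by Thread-7219, nid=72499, address=0x000000013d53de10
"""

# Assumed line format: lock name, owner thread name, nid, thread address.
MUTEX_RE = re.compile(
    r"Internal VM Mutex (?P<lock>\S+) is owned by (?P<owner>.+?), "
    r"nid=(?P<nid>\d+), address=(?P<address>0x[0-9a-fA-F]+)"
)


def locks_by_owner(text):
    """Return {(owner name, nid): [lock names]} in order of appearance."""
    owners = defaultdict(list)
    for match in MUTEX_RE.finditer(text):
        key = (match.group("owner"), int(match.group("nid")))
        owners[key].append(match.group("lock"))
    return dict(owners)


if __name__ == "__main__":
    for (owner, nid), locks in locks_by_owner(SAMPLE).items():
        print(f"{owner} (nid={nid}): {', '.join(locks)}")
```

Grouping this way makes the shared pattern easier to spot: a single JavaThread inside JVM_StartThread owns the Threads_lock, and a second thread holds Heap_lock and JNICritical_lock.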
30-06-2025
> Why is libsystem_platform.dylib`_os_unfair_lock_lock_slow blocking? Has the core file timeout handler suspended an already suspended thread, such as in the case of the jstack SA agent interaction problem again (double suspension, used by thread dumps)?

The timeout handler doesn't suspend any threads, it just issues external commands to get stack dumps and core dumps AFAIK.

Is it possible for the sampler thread to do a double-suspend itself? i.e. it issues the signal to do the resume which would cause the target to try to return from sigsuspend, but the target doesn't get scheduled due to system load. Meanwhile the sampler tries to take another sample and hits the target with a second signal which hits while the sigsuspend is active - maybe whilst holding an internal lock. Then the next resume attempt hangs as observed because it can't take that lock?

> It might be necessary to restore the Threads_lock to be taken by the JFRSamplerThread, if only for that safety property.

That would certainly avoid the current scenario but I'm not sure it is necessarily a fix as such. We really need to know why the signal can't be sent.
29-06-2025
Hmm, macosx-aarch64... I'll wait until we get to see this issue on a platform where debugging a core file is a practical possibility...
27-06-2025
This could be an interaction problem resulting from the attempt to remove the use of Threads_lock for the JfrThreadSampler; see https://bugs.openjdk.org/browse/JDK-8358429. Perhaps an indirect effect of the JfrSamplerThread holding the Threads_lock during the sampling period is that this kind of situation cannot occur (no safepoints can then interleave during a JFR sampling period, as no suspendee can hold the Threads_lock). In theory, it should be okay for a suspended thread to keep the Threads_lock during the time of suspension. If it's held, then the VMThread cannot issue a safepoint operation... It might be necessary to restore the Threads_lock to be taken by the JFRSamplerThread, if only for that safety property.
27-06-2025
For JavaThreads running in state _thread_in_Java, the JfrSamplerThread does not use/need the Threads_lock. Why is the signal not delivered to the suspendee? Why is libsystem_platform.dylib`_os_unfair_lock_lock_slow blocking? Has the core file timeout handler suspended an already suspended thread, such as in the case of the jstack SA agent interaction problem again (double suspension, used by thread dumps)? It appears from your description that the timeout handlers kicked in due to an existing deadlock or livelock condition. I need to inspect the core files for more info.
27-06-2025