JDK-8334482 : Shenandoah: Deadlock when safepoint is pending during nmethods iteration
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17,21,23,24
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-06-18
  • Updated: 2024-08-05
  • Resolved: 2024-07-26
Fixed versions:
  • JDK 17: 17.0.13 (Fixed)
  • JDK 21: 21.0.5 (Fixed)
  • JDK 23: 23.0.2 (Fixed)
  • JDK 24: 24 b09 (Fixed)
Description
In one of our applications running Shenandoah on Corretto 17.0.11+10, we see safepoint timeouts showing that the Sweeper thread has not reached a safepoint after 1000ms.

```
# SafepointSynchronize::begin: Timeout detected:
[364.175s][warning][safepoint] # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
[364.175s][warning][safepoint] # SafepointSynchronize::begin: Threads which did not reach the safepoint:
[364.175s][warning][safepoint] # "Sweeper thread" #19 daemon prio=9 os_prio=0 cpu=1516.68ms elapsed=363.34s tid=0x00007ff04c1bbe50 nid=0x7eda runnable  [0x0000000000000000]
[364.175s][warning][safepoint]    java.lang.Thread.State: RUNNABLE
[364.175s][warning][safepoint]
[364.175s][warning][safepoint] # SafepointSynchronize::begin: (End of list)
```

```
Threads waiting in SuspendibleThreadSet:join for ShenandoahConcurrentWeakRootsEvacUpdate Task:

#0 0x00007fa753404377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa7526f2a1b in os::PlatformMonitor::wait(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#2 0x00007fa7526a0489 in Monitor::wait_without_safepoint_check(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#3 0x00007fa7528c81fa in SuspendibleThreadSet::join() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#4 0x00007fa7527dc27d in ShenandoahConcurrentWeakRootsEvacUpdateTask::work(unsigned int) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#5 0x00007fa7529e42bf in GangWorker::loop() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#6 0x00007fa7529e431f in GangWorker::run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#7 0x00007fa752930118 in Thread::call_run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#8 0x00007fa7526e7131 in thread_native_entry(Thread*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#9 0x00007fa7533fe44b in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa752f3552f in clone () from /lib64/libc.so.6

and our blocked sweeper thread, waiting for the evac threads to notify it:

#0 0x00007fa753404377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa7526f2a1b in os::PlatformMonitor::wait(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#2 0x00007fa7526a0489 in Monitor::wait_without_safepoint_check(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#3 0x00007fa75282169b in ShenandoahNMethodTable::flush_nmethod(nmethod*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#4 0x00007fa7526ad05a in nmethod::flush() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#5 0x00007fa7528c8fe2 in NMethodSweeper::process_compiled_method(CompiledMethod*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#6 0x00007fa7528c95a3 in NMethodSweeper::sweep_code_cache() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#7 0x00007fa7528c9eec in NMethodSweeper::sweep() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#8 0x00007fa7528ca126 in NMethodSweeper::sweeper_loop() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#9 0x00007fa75292c58b in JavaThread::thread_main_inner() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#10 0x00007fa752930118 in Thread::call_run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#11 0x00007fa7526e7131 in thread_native_entry(Thread*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#12 0x00007fa7533fe44b in start_thread () from /lib64/libpthread.so.0
#13 0x00007fa752f3552f in clone () from /lib64/libc.so.6
```

This appears to be a deadlock that occurs when another VM operation is requested at a bad time. See the attached reproducer; run it with `javac SweeperStuck.java && java -Xcomp -XX:+UseShenandoahGC -Xlog:safepoint=info -XX:+UnlockDiagnosticVMOptions -XX:+AbortVMOnSafepointTimeout -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000 SweeperStuck`
Comments
[jdk21u-fix-request] Approval Request from Aleksey Shipilëv
Fixes a Shenandoah deadlock. Applies with minor fuzz. The patch was in mainline for a short time, but passed multi-day stress testing. Risk is medium-low: the code change is simple enough, but affects a generic path in Shenandoah. We are picking this up into our downstream releases ahead of time as well.
31-07-2024

[jdk17u-fix-request] Approval Request from Aleksey Shipilëv
Fixes a Shenandoah deadlock. Applies with minor fuzz. The patch was in mainline for a short time, but passed multi-day stress testing. Risk is medium-low: the code change is simple enough, but affects a generic path in Shenandoah. We are picking this up into our downstream releases ahead of time as well.
31-07-2024

[jdk23u-fix-request] Approval Request from Aleksey Shipilëv
Fixes a Shenandoah deadlock. Applies cleanly. The patch was in mainline for a short time, but passed multi-day stress testing. Risk is medium-low: the code change is simple enough, but affects a generic path in Shenandoah.
30-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk17u-dev/pull/2751 Date: 2024-07-30 13:27:53 +0000
30-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk23u/pull/45 Date: 2024-07-30 10:57:34 +0000
30-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk21u-dev/pull/880 Date: 2024-07-30 11:56:49 +0000
30-07-2024

Changeset: 2aeb12ec Branch: master Author: Aleksey Shipilev <shade@openjdk.org> Date: 2024-07-26 11:20:40 +0000 URL: https://git.openjdk.org/jdk/commit/2aeb12ec03944c777d617d0be48982fd225b16e7
26-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/20309 Date: 2024-07-24 09:10:35 +0000
24-07-2024

It looks to me like the real problem is indeed the STS interaction: we cannot start the nmethod iteration in the ShenandoahConcurrentWeakRootsEvacUpdateTask constructor and then join the STS in its work(), which can potentially block. Any safepoint triggered between the constructor call and a worker picking up the task would deadlock the whole thing. The Sweeper is not really related to this: any pending safepoint would deadlock if there is a pending operation that waits for the nmethod iteration to be over. It could be the Sweeper, or it could be just the compiler patching the code and registering a new nmethod version.

I think I have a fix for this: avoid starting the nmethod iteration before we enter the STS; do it after. A crude version is attached as 8334482-crude-17u.patch.
23-07-2024
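
The ordering problem described in the comment above can be illustrated with a small toy model. This is a sketch only, not HotSpot code: the class `StsOrderingSketch`, the method `completes`, and both latches are hypothetical names invented for illustration. One latch stands in for "the nmethod iteration is over", the other for "the pending safepoint has completed". The model assumes a safepoint is already pending when the worker picks up its task, and that the sweeper-like thread cannot reach the safepoint while it waits for the iteration to end.

```java
import java.util.concurrent.CountDownLatch;

// Toy model of the ordering issue; all names here are hypothetical.
class StsOrderingSketch {

    /**
     * Runs one modelled scenario in which a safepoint is pending before the
     * GC worker picks up its task. Returns true if both threads finish and
     * false if they are still blocked after the timeout.
     */
    static boolean completes(boolean beginIterationBeforeStsJoin) {
        // Counted down when the modelled nmethod iteration is over.
        CountDownLatch iterationOver = new CountDownLatch(1);
        // Counted down when the modelled safepoint has been reached and released.
        CountDownLatch safepointOver = new CountDownLatch(1);

        Thread sweeper = new Thread(() -> {
            try {
                if (beginIterationBeforeStsJoin) {
                    // The iteration was begun at task construction, so the
                    // sweeper waits for it to end without polling for the safepoint.
                    iterationOver.await();
                }
                // The sweeper reaches the safepoint, so the safepoint can complete.
                safepointOver.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "model-sweeper");

        Thread worker = new Thread(() -> {
            try {
                // Joining the suspendible set blocks while a safepoint is pending.
                safepointOver.await();
                // The modelled iteration would run here; then it is marked as over.
                iterationOver.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "model-gc-worker");

        sweeper.start();
        worker.start();
        try {
            // Give both threads a bounded amount of time to finish.
            sweeper.join(1000);
            worker.join(1000);
            boolean finished = !sweeper.isAlive() && !worker.isAlive();
            // Unblock any stuck thread so the model itself always terminates.
            sweeper.interrupt();
            worker.interrupt();
            sweeper.join();
            worker.join();
            return finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration begun before STS join completes: " + completes(true));
        System.out.println("iteration begun after STS join completes:  " + completes(false));
    }
}
```

In this model, `completes(true)` returns false because each thread waits on the other, while `completes(false)` returns true. The model only captures the wait-for cycle; it says nothing about how the actual patch restructures the task.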

The STS in question was added by JDK-8307395, so maybe that one needs to be reconsidered.
23-07-2024

We have seen a second path to this deadlock, when C1 hotpatches the code and calls `ShenandoahNMethodTable::register_nmethod`:

Thread 37 (Thread 0x7f8a004c1700 (LWP 11028)):
#0 0x00007f8a260c9377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f8a253bb5eb in os::PlatformMonitor::wait(long) ()
#2 0x00007f8a253687f9 in Monitor::wait_without_safepoint_check(long) ()
#3 0x00007f8a254ea3b3 in ShenandoahNMethodTable::register_nmethod(nmethod*) ()
#4 0x00007f8a24c6ea8d in Runtime1::patch_code(JavaThread*, Runtime1::StubID) ()
#5 0x00007f8a24c6fe97 in Runtime1::move_klass_patching(JavaThread*) ()

ShenandoahConcurrentWeakRootsEvacUpdateTask is also parked on STS in that failure mode.
23-07-2024

It seems to be a deadlock, not just a delay. When the safepoint timeout triggers, we have:

```
Threads waiting in SuspendibleThreadSet:join for ShenandoahConcurrentWeakRootsEvacUpdate Task:

#0 0x00007fa753404377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa7526f2a1b in os::PlatformMonitor::wait(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#2 0x00007fa7526a0489 in Monitor::wait_without_safepoint_check(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#3 0x00007fa7528c81fa in SuspendibleThreadSet::join() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#4 0x00007fa7527dc27d in ShenandoahConcurrentWeakRootsEvacUpdateTask::work(unsigned int) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#5 0x00007fa7529e42bf in GangWorker::loop() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#6 0x00007fa7529e431f in GangWorker::run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#7 0x00007fa752930118 in Thread::call_run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#8 0x00007fa7526e7131 in thread_native_entry(Thread*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#9 0x00007fa7533fe44b in start_thread () from /lib64/libpthread.so.0
#10 0x00007fa752f3552f in clone () from /lib64/libc.so.6

and our blocked sweeper thread, waiting for the evac threads to notify it:

#0 0x00007fa753404377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa7526f2a1b in os::PlatformMonitor::wait(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#2 0x00007fa7526a0489 in Monitor::wait_without_safepoint_check(long) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#3 0x00007fa75282169b in ShenandoahNMethodTable::flush_nmethod(nmethod*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#4 0x00007fa7526ad05a in nmethod::flush() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#5 0x00007fa7528c8fe2 in NMethodSweeper::process_compiled_method(CompiledMethod*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#6 0x00007fa7528c95a3 in NMethodSweeper::sweep_code_cache() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#7 0x00007fa7528c9eec in NMethodSweeper::sweep() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#8 0x00007fa7528ca126 in NMethodSweeper::sweeper_loop() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#9 0x00007fa75292c58b in JavaThread::thread_main_inner() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#10 0x00007fa752930118 in Thread::call_run() () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#11 0x00007fa7526e7131 in thread_native_entry(Thread*) () from /local/apollo/package/local_1/AL2_x86_64/JDK17/JDK17-3703.0-0/jdk-17/lib/server/libjvm.so
#12 0x00007fa7533fe44b in start_thread () from /lib64/libpthread.so.0
#13 0x00007fa752f3552f in clone () from /lib64/libc.so.6
```

It seems that non-GC safepoints at the wrong time can trigger this deadlock. I have attached a reproducer which reliably triggers the timeout within ~15 seconds on my host.
`javac SweeperStuck.java && java -Xcomp -XX:+UseShenandoahGC -Xlog:safepoint=info -XX:+UnlockDiagnosticVMOptions -XX:+AbortVMOnSafepointTimeout -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000 SweeperStuck`
24-06-2024

Thanks Zhengyu - yes this is Corretto 17.0.11+10. Investigating further, actually the application never recovers from the stuck sweeper thread, so it seems more likely that there's something causing the sweeper not to be notified at all, not just delayed. I will do more investigation on the application and update here with my findings.
19-06-2024

CodeCache_lock is only taken at the beginning and end of the iteration, not during it, so I doubt it is what blocks safepoints. BTW, is this JDK 17? The Sweeper was removed in JDK 20.
18-06-2024

Zhengyu, do you remember why this code acquires CodeCache_lock, and/or why it acquires without checking for safepoint?
18-06-2024