JDK-8274687 : JDWP deadlocks if some Java thread reaches wait in blockOnDebuggerSuspend
  • Type: Bug
  • Component: core-svc
  • Sub-Component: debugger
  • Affected Version: 11,17,18
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-10-04
  • Updated: 2022-12-08
  • Resolved: 2021-11-15
  • JDK 11: 11.0.17 (Fixed)
  • JDK 17: 17.0.4 (Fixed)
  • JDK 18: 18 b24 (Fixed)
Description
Case 1: Deadlock on resume by debugger
======================================

The JDWP agent deadlocks the VM if

* A thread T is blocked in blockOnDebuggerSuspend because it called
  j.l.Thread.resume() on a thread "resumee" that is currently suspended by the
  debugger

* The debugger tries to resume one or all threads

because T holds handlerLock while it waits for the debugger to resume the
resumee, and the debugger needs handlerLock to perform that resume.
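
The lock pattern described above can be modelled outside the JVM internals. The following is only a sketch in plain Java, not agent code: the class name `Case1DeadlockSketch`, the method `agentCanTakeHandlerLock`, and the `resumeeSuspended` flag are hypothetical, and `java.util.concurrent` locks stand in for the agent's raw monitors. It shows that waiting on threadLock releases only threadLock, so a second thread that needs handlerLock cannot proceed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class Case1DeadlockSketch {
    // Hypothetical stand-ins for the agent's handlerLock and threadLock.
    static final ReentrantLock handlerLock = new ReentrantLock();
    static final ReentrantLock threadLock = new ReentrantLock();
    static final Condition resumed = threadLock.newCondition();
    static boolean resumeeSuspended = true; // guarded by threadLock

    /** Returns true if the modelled agent got handlerLock within timeoutMillis. */
    static boolean agentCanTakeHandlerLock(long timeoutMillis) throws InterruptedException {
        CountDownLatch tIsWaiting = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            handlerLock.lock();              // models event_callback taking handlerLock
            try {
                threadLock.lock();           // models blockOnDebuggerSuspend
                try {
                    tIsWaiting.countDown();
                    while (resumeeSuspended) {
                        resumed.await();     // releases threadLock only, not handlerLock
                    }
                } catch (InterruptedException e) {
                    // The demo interrupts T to unwind it; nothing else to do.
                } finally {
                    threadLock.unlock();
                }
            } finally {
                handlerLock.unlock();
            }
        }, "T");
        t.start();
        tIsWaiting.await();

        // Models the agent thread entering handlerLock before it can resume anything.
        boolean acquired = handlerLock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        if (acquired) {
            handlerLock.unlock();
        }
        t.interrupt();                       // break the modelled deadlock so the demo ends
        t.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("agent acquired handlerLock: " + agentCanTakeHandlerLock(200));
    }
}
```

In this model the agent never gets handlerLock because nothing can clear `resumeeSuspended` without it. The timeout and the interrupt exist only so the demonstration terminates; the real agent blocks indefinitely.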

Stacks on Deadlock
------------------

### Stack of Thread T

#0  futex_wait_cancelable
#1  __pthread_cond_wait_common
#2  __pthread_cond_wait
#3  os::PlatformEvent::park
#4  JvmtiRawMonitor::simple_wait
#5  JvmtiRawMonitor::raw_wait
#6  JvmtiEnv::RawMonitorWait
#7  debugMonitorWait
#8  blockOnDebuggerSuspend
#9  handleAppResumeBreakpoint
#10 event_callback
#11 cbBreakpoint
#12 JvmtiExport::post_raw_breakpoint
#13 InterpreterRuntime::_breakpoint

### JDWP Agent Stack

#0  futex_wait_cancelable
#1  __pthread_cond_wait_common
#2  __pthread_cond_wait
#3  os::PlatformEvent::park
#4  JvmtiRawMonitor::simple_enter
#5  JvmtiRawMonitor::raw_enter
#6  JvmtiEnv::RawMonitorEnter
#7  debugMonitorEnter
#8  eventHandler_lock
#9  threadControl_resumeThread
#10 resume
#11 debugLoop_run
#12 connectionInitiated
#13 attachThread
#14 JvmtiAgentThread::call_start_function
#15 JavaThread::thread_main_inner
#16 Thread::call_run
#17 thread_native_entry
#18 start_thread
#19 clone

See attachment for jtreg reproducer.

Case 2: Deadlock on JDWP Dispose command
========================================

We see sporadic timeouts when running
test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003 because
the debuggee main thread and the JDWP agent thread deadlock with the following
stacks:

### Debuggee Main Thread "M"

#0  futex_wait_cancelable 
#1  __pthread_cond_wait_common 
#2  __pthread_cond_wait 
#3  os::PlatformEvent::park 
#4  JvmtiRawMonitor::simple_wait 
#5  JvmtiRawMonitor::raw_wait 
#6  JvmtiEnv::RawMonitorWait 
#7  debugMonitorWait 
#8  blockOnDebuggerSuspend 
#9  handleAppResumeBreakpoint 
#10 event_callback 
#11 cbBreakpoint 
#12 JvmtiExport::post_raw_breakpoint 
#13 InterpreterRuntime::_breakpoint 

### JDWP Agent Thread "A"

#0  futex_wait_cancelable 
#1  __pthread_cond_wait_common 
#2  __pthread_cond_wait 
#3  os::PlatformEvent::park 
#4  JvmtiRawMonitor::simple_enter 
#5  JvmtiRawMonitor::raw_enter 
#6  JvmtiEnv::RawMonitorEnter 
#7  debugMonitorEnter 
#8  eventHandler_free 
#9  threadControl_onDisconnect 
#10 debugLoop_run 
#11 connectionInitiated 
#12 attachThread 
#13 JvmtiAgentThread::call_start_function 
#14 JavaThread::thread_main_inner 
#15 Thread::call_run 
#16 thread_native_entry 
#17 start_thread 
#18 clone

#### How to reproduce

The deadlock will likely be reached with the following patch. Apply and run dispose003.

--- a/src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c
+++ b/src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c
@@ -180,6 +180,9 @@ debugLoop_run(void)
             shouldListen = !lastCommand(cmd);
         }
     }
+    /* Sleep to trigger deadlock in test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003 */
+    fprintf(stderr, "debugLoop: sleep\n");
+    sleep(1);
     threadControl_onDisconnect();
     standardHandlers_onDisconnect();

#### Analysis

M hit the internal breakpoint in j.l.Thread.resume()[1]. The resumee
"testedThread" (named "thread2" in log output[2]) is currently suspended, so
M waits on threadLock until the resumee is no longer suspended, while still
owning handlerLock (acquired in event_callback)[3].

A should call threadControl_reset to resume all threads, including
"testedThread", so that M can continue. Before it gets there, however, it blocks
in eventHandler_free trying to enter handlerLock, which M owns.

Note that the vm.dispose() call by the debugger immediately returns. Resuming
all suspended threads is done asynchronously[4].

[1] M calls j.l.Thread.resume() and hits the internal breakpoint set by the JDWP agent
    https://github.com/openjdk/jdk/blob/32811026ce5ecb1d27d835eac33de9ccbd51fcbf/test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003a.java#L139

[2] "testedThread" is named "thread2" in log output.
    https://github.com/openjdk/jdk/blob/32811026ce5ecb1d27d835eac33de9ccbd51fcbf/test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003a.java#L137

[3] M calls `blockOnDebuggerSuspend()` when hitting the internal
    breakpoint in j.l.Thread.resume(). There it waits while the resumee is
    suspended by the debugger.
    https://github.com/openjdk/jdk/blob/32811026ce5ecb1d27d835eac33de9ccbd51fcbf/src/jdk.jdwp.agent/share/native/libjdwp/threadControl.c#L749

[4] vm.dispose() call by debugger returns immediately. Threads are resumed asynchronously.
    https://github.com/openjdk/jdk/blob/32811026ce5ecb1d27d835eac33de9ccbd51fcbf/test/hotspot/jtreg/vmTestbase/nsk/jdi/VirtualMachine/dispose/dispose003.java#L228
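
The ordering problem in the analysis can be sketched in plain Java. This is a model under stated assumptions, not the agent's code and not necessarily what the eventual fix does: the class `Case2OrderingSketch`, the method `mFinishes`, and the flag `releaseHandlerLockWhileWaiting` are hypothetical. Thread "A" first needs handlerLock (as eventHandler_free does) and only afterwards resumes the suspended thread (as threadControl_reset does). The sketch compares M waiting while owning handlerLock with one possible remedy, where M gives up handlerLock for the duration of the wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class Case2OrderingSketch {
    /** Runs one modelled disconnect; returns true if thread M finished before the timeout. */
    static boolean mFinishes(boolean releaseHandlerLockWhileWaiting, long timeoutMillis)
            throws InterruptedException {
        ReentrantLock handlerLock = new ReentrantLock();
        ReentrantLock threadLock = new ReentrantLock();
        Condition resumed = threadLock.newCondition();
        boolean[] suspended = {true};        // guarded by threadLock
        CountDownLatch mIsWaiting = new CountDownLatch(1);

        Thread m = new Thread(() -> {
            handlerLock.lock();              // models event_callback
            try {
                threadLock.lock();           // models blockOnDebuggerSuspend
                try {
                    while (suspended[0]) {
                        if (releaseHandlerLockWhileWaiting) handlerLock.unlock();
                        mIsWaiting.countDown();
                        try {
                            resumed.await(); // releases threadLock while waiting
                        } finally {
                            if (releaseHandlerLockWhileWaiting) {
                                // Reacquire in handlerLock -> threadLock order.
                                threadLock.unlock();
                                handlerLock.lock();
                                threadLock.lock();
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    // Interrupted by the demo to unwind a stuck run.
                } finally {
                    threadLock.unlock();
                }
            } finally {
                handlerLock.unlock();
            }
        }, "M");

        Thread a = new Thread(() -> {
            try {
                handlerLock.lockInterruptibly();   // models eventHandler_free
                handlerLock.unlock();
                threadLock.lock();                 // models threadControl_reset
                try {
                    suspended[0] = false;
                    resumed.signalAll();
                } finally {
                    threadLock.unlock();
                }
            } catch (InterruptedException e) {
                // Interrupted by the demo to unwind a stuck run.
            }
        }, "A");

        m.start();
        mIsWaiting.await();
        a.start();
        m.join(timeoutMillis);
        boolean finished = !m.isAlive();
        a.interrupt();                       // clean up whichever threads are still stuck
        m.interrupt();
        a.join();
        m.join();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("M owns handlerLock while waiting:     " + mFinishes(false, 200));
        System.out.println("M releases handlerLock while waiting: " + mFinishes(true, 2000));
    }
}
```

With `releaseHandlerLockWhileWaiting` set to false, M is expected to stay blocked until the demo interrupts it. With it set to true, A can pass through handlerLock, clear the suspended flag, and let M finish. Whether releasing handlerLock at that point is safe in the real agent depends on invariants this model does not capture.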

Comments
Fix request (11u) Applies cleanly except for one minor conflict because of the comments on the declarations of `current_ei` and `pendingStop` not present in jdk11. Otherwise just like 17u backport above. TODO: the included test needs to be adapted.
05-07-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1189 Date: 2022-06-30 06:53:10 +0000
01-07-2022

Fix request (17u) I would like to backport this to jdk17u to avoid the described issues. Applies cleanly. The fix passed CI testing at SAP. This includes JCK and JTREG tests on the standard platforms and also on Linux/PPC64le. I'd consider the risk of this change low. The code is only triggered if a thread calls j.l.Thread.resume(), which has been deprecated since JDK 1.2.
09-05-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/375 Date: 2022-04-29 08:12:39 +0000
09-05-2022

Changeset: ca2efb73 Author: Richard Reingruber <rrich@openjdk.org> Date: 2021-11-15 07:02:22 +0000 URL: https://git.openjdk.java.net/jdk/commit/ca2efb73f59112d9be2ec29db405deb4c58dd435
15-11-2021