JDK-8016304 : ThreadMXBean.getDeadlockedThreads reports bogus deadlocks on JDK 8
  • Type: Bug
  • Component: core-svc
  • Sub-Component: javax.management
  • Affected Version: 8
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2013-06-11
  • Updated: 2014-01-15
  • Resolved: 2013-06-17
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 8 Other
8Fixed hs25Fixed
Related Reports
Relates :  
Relates :  
Description
Hello,

we have started to use the method for deadlock detection but are getting numerous reports from people using JDK 8 preview builds that the method returns bogus results on JDK 8.

Please see one of the reports in our bug database with the thread dump generated from the information provided by the method:
https://netbeans.org/bugzilla/show_bug.cgi?id=230989

There are also other reports for the wrong behaviour. If you want to see the code that we are using, it is here:

http://hg.netbeans.org/main-silver/file/tip/deadlock.detector/src/org/netbeans/modules/deadlock/detector/Detector.java

If you need more information please ask for more specific details.

David
Comments
> Hole #1) when T1 tries to acquire an unlocked monitor: > > 398 EnterI (THREAD) ; > > there may be a pending safepoint that T1 will block for before T1 manages > to acquire the monitor. I do not see how this can happen. The thread state transition that will check for a safepoint occurs with the TBIVM at line 388. I don't see any further safepoint checks in EnterI. However because this thread is TBIVM it is in a safepoint-safe state. So if the monitor was released and this thread was woken up, then the monitor may now be unowned but this thread has not yet had a chance to Self->set_current_pending_monitor(NULL);
18-06-2013

Hi Staffan, I have also observed ThreadMXBean.findDeadlokThreads() returning bogus results in test/java/util/logging/DrainFindDeadlockTest.java - which is failing intermittently. Once it even returned an array containing a single thread. (JDK-8014218) I had found a workaround for DrainFindDeadlockTest.java (by double checking with the results of ThreadMXBean.getThreadInfo(long[])) - but maybe I should just wait for JDK-8016304 to get fixed before actually pushing anything...
17-06-2013

Thanks, Dan. I have verified that the suggested fix makes the problem in the reproducer go away.
17-06-2013

The following function: src/share/vm/services/threadService.cpp: 289 // Find deadlocks involving object monitors and concurrent locks if concurrent_locks is true 290 DeadlockCycle* ThreadService::find_deadlocks_at_safepoint(bool concurrent_locks) { <snip> 322 // heavyweight monitor a thread is waiting to lock. 323 waitingToLockMonitor = (ObjectMonitor*)jt->current_pending_monitor(); <snip> 327 while (waitingToLockMonitor != NULL || waitingToLockBlocker != NULL) { 328 cycle->add_thread(currentThread); 329 if (waitingToLockMonitor != NULL) { 330 currentThread = Threads::owning_thread_from_monitor_owner( 331 (address)waitingToLockMonitor->owner(), 332 false /* no locking needed */); 333 if (currentThread == NULL) { 334 // This function is called at a safepoint so the JavaThread 335 // that owns waitingToLockMonitor should be findable, but 336 // if it is not findable, then the previous currentThread is 337 // blocked permanently. We record this as a deadlock. 338 num_deadlocks++; 339 340 cycle->set_deadlock(true); 341 342 // add this cycle to the deadlocks list 343 if (deadlocks == NULL) { 344 deadlocks = cycle; 345 } else { 346 last->set_next(cycle); 347 } 348 last = cycle; 349 cycle = new DeadlockCycle(); 350 break; 351 } 352 } else { is making an assumption on this line: 329 if (waitingToLockMonitor != NULL) { and that assumption is that if waitingToLockMonitor, which was set by this: 323 waitingToLockMonitor = (ObjectMonitor*)jt->current_pending_monitor(); is non-NULL, then the Monitor must be owned. In the HotSpot naming model, the "at_safepoint" suffix means that this function is only called at a safepoint. Hmmm, usually there's an assert to verify that, but I don't see that. Would be good to add one... Let's see if that assumption that the Monitor must be owned actually holds... Here's the block of code that sets _current_pending_monitor for Java monitors: src/share/vm/runtime/objectMonitor.cpp: 316 void ATTR ObjectMonitor::enter(TRAPS) { <snip> 379 { // Change java thread status to indicate blocked on monitor enter. 380 JavaThreadBlockedOnMonitorEnterState jtbmes(jt, this); 381 382 DTRACE_MONITOR_PROBE(contended__enter, this, object(), jt); 383 if (JvmtiExport::should_post_monitor_contended_enter()) { 384 JvmtiExport::post_monitor_contended_enter(jt, this); 385 } 386 387 OSThreadContendState osts(Self->osthread()); 388 ThreadBlockInVM tbivm(jt); 389 390 Self->set_current_pending_monitor(this); 391 392 // TODO-FIXME: change the following for(;;) loop to straight-line code. 393 for (;;) { 394 jt->set_suspend_equivalent(); 395 // cleared by handle_special_suspend_equivalent_condition() 396 // or java_suspend_self() 397 398 EnterI (THREAD) ; 399 400 if (! ddddd(jt)) break ; 401 402 // 403 // We have acquired the contended monitor, but while we were 404 // waiting another thread suspended us. We don't want to enter 405 // the monitor while suspended because that would surprise the 406 // thread that suspended us. 407 // 408 _recursions = 0 ; 409 _succ = NULL ; 410 exit (false, Self) ; 411 412 jt->java_suspend_self(); 413 } 414 Self->set_current_pending_monitor(NULL); 415 } Pretty straight forward stuff. Set at: 390 Self->set_current_pending_monitor(this); and cleared at: 414 Self->set_current_pending_monitor(NULL); The idea is that T1 sets the field right before we try to acquire the monitor and if T1 blocks, then another thread can see what T1 is blocked on. And once T1 acquires the monitor, T1 clears the field because T1 is no longer blocked on that monitor. Nice theory, but it has some holes. Hole #1) when T1 tries to acquire an unlocked monitor: 398 EnterI (THREAD) ; there may be a pending safepoint that T1 will block for before T1 manages to acquire the monitor. In fact, that pending safepoint could be due to the request to check for a deadlock. T1 is trying to advertise when it is pending on a monitor, but it is actually advertising when it might be pending on a monitor in this case. On the other hand, ThreadService::find_deadlocks_at_safepoint() is presuming that �� non-NULL _current_pending_monitor field means that the monitor is owned. That is also not correct. Hole #2) after T2 has acquired the monitor, the _current_pending_monitor is still set until this line: 414 Self->set_current_pending_monitor(NULL); For a JavaThread that is going to be suspended right after acquiring (and releasing) the monitor, _current_pending_monitor field will still be set even though the target monitor may or may not be owned at that point in time. What this all means is that ThreadService::find_deadlocks_at_safepoint() is making a bad assumption here: 329 if (waitingToLockMonitor != NULL) { 330 currentThread = Threads::owning_thread_from_monitor_owner( 331 (address)waitingToLockMonitor->owner(), 332 false /* no locking needed */); 333 if (currentThread == NULL) { 334 // This function is called at a safepoint so the JavaThread 335 // that owns waitingToLockMonitor should be findable, but 336 // if it is not findable, then the previous currentThread is 337 // blocked permanently. We record this as a deadlock. Something like this is more correct: if (waitingToLockMonitor != NULL) { address currentOwner = (address)waitingToLockMonitor->owner(); if (currentOwner != NULL) { // the current_pending_monitor is actually owned at this time currentThread = Threads::owning_thread_from_monitor_owner( currentOwner), false /* no locking needed */); if (currentThread == NULL) { .... } } }
14-06-2013

I think this is caused by the fix for JDK-8007476: http://hg.openjdk.java.net/hsx/hotspot-rt/hotspot/rev/5ee250974db9 That fix added a case in the deadlock detection code for when we could not find the thread given a monitor's owner (threadService.cpp:333). When that happens, the new code considers this a deadlock. The full reasoning is given in a comment in the bug report: https://jbs.oracle.com/bugs/browse/JDK-8007476?focusedCommentId=13298744&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13298744 It looks like this is resulting in too many false positives - the sample in this bug report is one. I did some instrumentation of the code and for the sample in this bug report, the reason Threads::owning_thread_from_monitor_owner() returns NULL is that it is called with a NULL owner value: waitingToLockMonitor->owner() is NULL. I don't know how that can happen. Abbreviated output from when this happens looks like: Found one Java-level deadlock: ============================= "Thread-999": waiting to lock monitor 0x00007ff98b055d58 (object 0x0000000144b1dd20, a java.io.PrintStream), which is held by "Thread-292" Java stack information for the threads listed above: =================================================== "Thread-999": at java.io.PrintStream.println(PrintStream.java:806) - waiting to lock <0x0000000144b1dd20> (a java.io.PrintStream) at DDTest$Test.recurse(DDTest.java:45) at DDTest$Test.recurse(DDTest.java:41) at DDTest$Test.recurse(DDTest.java:41) .... "Thread-292" #300 prio=5 os_prio=31 tid=0x00007ff98b178000 nid=0x2a603 waiting for monitor entry [0x0000000170c79000] java.lang.Thread.State: BLOCKED (on object monitor) at java.io.PrintStream.println(PrintStream.java:806) - locked <0x0000000144b1dd20> (a java.io.PrintStream) at DDTest$Test.recurse(DDTest.java:45) at DDTest$Test.recurse(DDTest.java:41) at DDTest$Test.recurse(DDTest.java:41) .... One workaround could be to only add the deadlock if waitingToLockMonitor->owner() is non-null when the call to Threads::owning_thread_from_monitor_owner() is made. However, it would be good to understand why waitingToLockMonitor->owner() is null.
14-06-2013

Awesome work, David! We'll have a look at it.
12-06-2013

Passing the ball back.
12-06-2013

package org.netbeans.ddtest; import static java.lang.management.ManagementFactory.*; import java.lang.management.ThreadMXBean; import java.util.Random; public class DDTest { private static ThreadMXBean bean; private static boolean running = true; public static void main(String[] args) { bean = getThreadMXBean(); for (int i = 0; i < 1000; i++) { Test t = new Test(); Thread tt = new Thread(t); tt.start(); } try { Thread.sleep(10000); } catch (InterruptedException ex) { } running = false; } public static class Test implements Runnable { public void run() { Random r = new Random(); while (running) { try { synchronized (this) { wait(r.nextInt(1000)+1); } } catch (InterruptedException ex) { } recurse(2000); } if (bean.findDeadlockedThreads() != null) { System.out.println("FOUND!"); } } private void recurse(int i) { if (i>0) { recurse(i-1); } else { if (!running) { System.out.println("Hullo"); } } } } } This program should never print FOUND!. It does print it on my java version "1.8.0-ea" Java(TM) SE Runtime Environment (build 1.8.0-ea-b93) Java HotSpot(TM) 64-Bit Server VM (build 25.0-b34, mixed mode) running on MacOSX.
12-06-2013

Do you have reproducer? What I would like is a Java program that I can run "jcmd <pid> Thread.print" (or "kill -QUIT <pid>) against and get an incorrect deadlock reported.
11-06-2013