JDK-8072439 : Further refinement of the fix JDK-8047720 - Xprof hangs on Solaris
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2015-02-03
  • Updated: 2022-09-15
  • Resolved: 2015-03-03
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 8: 8u341 (Fixed)
JDK 9: 9 b55 (Fixed)
Description
The fix for the following bug may need more work:

JDK-8047720 Xprof hangs on Solaris

I've seen two separate e-mails about the fix for 8047720 in less than a week.

First was this e-mail sent directly to me:

On 1/28/15 12:28 PM, Alexander Garthwaite wrote:
> Subject: question about fix for PeriodicTask_lock/WatcherThread deadlock
> Dan,
>     Apologies for writing directly. I am backporting your fix to a version
> 7 copy of the OpenJDK used at Twitter. The fix (which looks good overall)
> is to try the lock and only wake the WatcherThread if we get it. It appears
> there is still a small race (which may not matter) if we have the following
> sequence:
>    *  the WatcherThread has the lock
>    *  some other thread, T, calls stop() and fails to get the lock
>    *  the WatcherThread unlocks the PeriodicTask_lock, still sees
> _should_stop as false and parks
>    *  T doesn't have the lock and so exits stop without doing an unpark
> This would allow the WatcherThread to park without seeing that it should stop.
> I was wondering if the following small change would fix that:
>  void WatcherThread::stop() {
> +  // Get the PeriodicTask_lock if we can. If we cannot, then the
> +  // WatcherThread is using it and we don't want to block on that lock
> +  // here because that might cause a safepoint deadlock depending on
> +  // what the current WatcherThread tasks are doing.
> +  bool have_lock = PeriodicTask_lock->try_lock();
> +
> +  _should_terminate = true;
> +  OrderAccess::fence();  // ensure WatcherThread sees update in main loop
>    // Retry the lock if we don't have it. This handles the case where
>    // the WatcherThread stopped using it and parked without seeing
>    // _should_terminate set to true.
>    if (!have_lock) {
>      have_lock = PeriodicTask_lock->try_lock();
>    }
> +  if (have_lock) {
>      WatcherThread* watcher = watcher_thread();
> +    if (watcher != NULL) {
> +      // If we managed to get the lock, then we should unpark the
> +      // WatcherThread so that it can see we want it to stop.
>         watcher->unpark();
> +    }
> +
> +    PeriodicTask_lock->unlock();
>    }
> (Ignore the leading '+'s. They're marking your patch.) Thanks!
>                                -- Alex

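To make the interleaving above concrete, here is a minimal standalone sketch of the lost wakeup in portable C++. This is an illustration only, not HotSpot code: std::mutex/std::condition_variable stand in for PeriodicTask_lock, and the hard-coded sleeps force the interleaving that Alex describes.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex              lock;                     // stands in for PeriodicTask_lock
std::condition_variable cv;
std::atomic<bool>       should_terminate{false};

void watcher() {                                  // the WatcherThread side
  std::unique_lock<std::mutex> ml(lock);          // "the WatcherThread has the lock"
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (!should_terminate) {                        // "still sees _should_stop as false"
    // wait_for() releases the lock and parks; the unpark (notify) was
    // skipped, so only the timeout gets us out. With an unbounded wait
    // this is the shutdown hang.
    cv.wait_for(ml, std::chrono::seconds(2));
    std::printf("watcher: woke by timeout; should_terminate=%d\n",
                (int)should_terminate.load());
  }
}

void stopper() {                                  // the thread T calling stop()
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  bool have_lock = lock.try_lock();               // fails: the watcher holds it
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  should_terminate = true;                        // set after the watcher parked
  if (have_lock) {                                // false, so the notify is skipped
    cv.notify_all();
    lock.unlock();
  }
  std::printf("stopper: exited stop() without an unpark\n");
}

int main() {
  std::thread w(watcher), t(stopper);
  w.join();
  t.join();
  return 0;
}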

Second was this e-mail sent on 2015.02.03 to serviceability-dev@openjdk.java.net:

On 2/3/15 12:00 PM, Carsten Varming wrote:
> Subject: Patch: Clean up fix for JDK-8047720
> Greetings all,
> I was recently introduced to the deadlock described in JDK-8047720 and fixed in JDK9. The fix seems a little messy to me, and it looks like it left some very short races in the code. So I decided to clean up the code. See attached diff.
> The change takes a step back and acquires PeriodicTask_lock in WatcherThread::stop (like before the fix in JDK9), but this time safepoints are allowed to proceed when acquiring PeriodicTask_lock, preventing the deadlock.
> Let me know what you think,
> Carsten

Carsten's patch attachment:

diff -r 8ff882030755 src/share/vm/runtime/thread.cpp
--- a/src/share/vm/runtime/thread.cpp	Wed Jan 28 16:45:35 2015 -0800
+++ b/src/share/vm/runtime/thread.cpp	Tue Feb 03 13:52:32 2015 -0500
@@ -1318,24 +1318,20 @@
 }
 
 void WatcherThread::stop() {
-  // Get the PeriodicTask_lock if we can. If we cannot, then the
-  // WatcherThread is using it and we don't want to block on that lock
-  // here because that might cause a safepoint deadlock depending on
-  // what the current WatcherThread tasks are doing.
-  bool have_lock = PeriodicTask_lock->try_lock();
-
   _should_terminate = true;
-  OrderAccess::fence();  // ensure WatcherThread sees update in main loop
-
-  if (have_lock) {
+  {
+    // Let safepoints happen to avoid the deadlock:
+    // The VM thread has Threads_lock and waits for Java threads to report back for a safepoint.
+    // The watcher thread has PeriodicTask_lock and tries to acquire Threads_lock.
+    // A Java thread executes WatcherThread::stop, tries to acquire PeriodicTask_lock,
+    // and holds up the VM thread by not reaching a safepoint.
+    MutexLocker mu(PeriodicTask_lock);
+
     WatcherThread* watcher = watcher_thread();
     if (watcher != NULL) {
-      // If we managed to get the lock, then we should unpark the
-      // WatcherThread so that it can see we want it to stop.
+      // Unpark the WatcherThread so that it can see that it should terminate.
       watcher->unpark();
     }
-
-    PeriodicTask_lock->unlock();
   }
 
   // it is ok to take late safepoints here, if needed

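The comment block in the new stop() states the deadlock cycle tersely. Here is a standalone sketch of the three-way cycle in portable C++, an illustration only and not HotSpot code: std::timed_mutex stands in for the VM mutexes, a counter plus condition variable models the safepoint check-in, and the timeouts exist only so the demo exits instead of hanging.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::timed_mutex threads_lock;                    // stands in for Threads_lock
std::timed_mutex periodic_task_lock;              // stands in for PeriodicTask_lock
std::mutex sp_mu;
std::condition_variable sp_cv;
int threads_at_safepoint = 0;                     // Java threads that have checked in
const int java_thread_count = 1;

void vm_thread() {                                // brings the VM to a safepoint
  std::lock_guard<std::timed_mutex> tl(threads_lock);
  std::unique_lock<std::mutex> ml(sp_mu);
  if (!sp_cv.wait_for(ml, std::chrono::seconds(2),
                      [] { return threads_at_safepoint == java_thread_count; })) {
    std::puts("VM thread: safepoint never reached - deadlock");
  }
}

void watcher_thread() {                           // holds PeriodicTask_lock, wants Threads_lock
  std::lock_guard<std::timed_mutex> pl(periodic_task_lock);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  if (!threads_lock.try_lock_for(std::chrono::milliseconds(1500))) {
    std::puts("watcher: stuck waiting for Threads_lock");
  } else {
    threads_lock.unlock();
  }
}

void java_thread() {                              // a Java thread in WatcherThread::stop()
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  // Blocks on PeriodicTask_lock without first checking in for the
  // safepoint - this models the no-safepoint-check acquisition, so the
  // check-in below is never reached and the VM thread waits forever.
  if (!periodic_task_lock.try_lock_for(std::chrono::seconds(1))) {
    std::puts("java thread: stuck on PeriodicTask_lock, never reaches the safepoint");
    return;                                       // still blocked in the real VM
  }
  periodic_task_lock.unlock();
  std::lock_guard<std::mutex> g(sp_mu);
  ++threads_at_safepoint;                         // check in; safepoint can complete
  sp_cv.notify_all();
}

int main() {
  std::thread a(vm_thread), b(watcher_thread), c(java_thread);
  a.join(); b.join(); c.join();
  return 0;
}

Making the PeriodicTask_lock acquisition safepoint-aware, as the patch does, breaks the cycle: the Java thread blocked in stop() counts as being at the safepoint, so the VM thread can proceed, and the watcher eventually gets Threads_lock and releases PeriodicTask_lock.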
Comments
URL: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/15b8ca9c2885 User: lana Date: 2015-03-18 00:54:25 +0000
18-03-2015

URL: http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/15b8ca9c2885 User: dcubed Date: 2015-03-03 05:07:14 +0000
03-03-2015

The fix for JDK-8047720 permits the watcher->unpark() call to be skipped if the thread that called WatcherThread::stop() failed to get the PeriodicTask_lock. That missed unpark() can theoretically cause a hang here:

src/share/vm/runtime/thread.cpp:

1198 int WatcherThread::sleep() const {
1199   MutexLockerEx ml(PeriodicTask_lock, Mutex::_no_safepoint_check_flag);
1200
1201   // remaining will be zero if there are no tasks,
1202   // causing the WatcherThread to sleep until a task is
1203   // enrolled
1204   int remaining = PeriodicTask::time_to_wait();
1205   int time_slept = 0;
1206
1207   // we expect this to timeout - we only ever get unparked when
1208   // we should terminate or when a new task has been enrolled
1209   OSThreadWaitState osts(this->osthread(), false /* not Object.wait() */ );
1210
1211   jlong time_before_loop = os::javaTimeNanos();
1212
1213   for (;;) {
1214     bool timedout = PeriodicTask_lock->wait(Mutex::_no_safepoint_check_flag, remaining);

On line 1204, if PeriodicTask::time_to_wait() returns zero (0), then the WatcherThread will get stuck on line 1214 in the PeriodicTask_lock->wait() call because of the skipped unpark() call.

It looks like PeriodicTask::time_to_wait() only returns zero (0) if there are no remaining PeriodicTasks. Since the WatcherThread is running, that means we had at least one PeriodicTask and in order to hang, we have to be down to none. I don't see a way to make that happen.

I created yet another batch of debug code to try and reproduce this hang, but was unable to do so until I simulated PeriodicTask::time_to_wait() returning zero (0). Here's the debug code:

$ hg diff --nodates
diff -r 53bc5f5b8378 src/share/vm/runtime/thread.cpp
--- a/src/share/vm/runtime/thread.cpp
+++ b/src/share/vm/runtime/thread.cpp
@@ -1211,7 +1211,27 @@ int WatcherThread::sleep() const {
   jlong time_before_loop = os::javaTimeNanos();
 
   for (;;) {
+
+if (UseNewCode) {
+// At this point, this thread is holding PeriodicTask_lock
+// Add a slight delay to improve the odds that the main
+// thread will skip the WatcherThread::unpark() call in
+// WatcherThread::stop() below.
+tty->print_cr("XXX - WatcherThread naked_short_sleep()");
+os::naked_short_sleep(999);
+tty->print_cr("XXX - WatcherThread::_should_terminate=%d", _should_terminate);
+if (_should_terminate && remaining > 0) {
+// simulate no Periodic tasks left
+remaining = 0;
+}
+}
+
     bool timedout = PeriodicTask_lock->wait(Mutex::_no_safepoint_check_flag, remaining);
+
+if (UseNewCode && _should_terminate) {
+tty->print_cr("XXX - WatcherThread::timedout=%d, remaining=%d", timedout, remaining);
+}
+
     jlong now = os::javaTimeNanos();
 
     if (remaining == 0) {
@@ -1322,7 +1342,19 @@ void WatcherThread::stop() {
   // WatcherThread is using it and we don't want to block on that lock
   // here because that might cause a safepoint deadlock depending on
   // what the current WatcherThread tasks are doing.
+
+if (UseNewCode) {
+// At this point, the WatcherThread is holding PeriodicTask_lock for an
+// extra second. Add a slight delay to improve the odds that this thread
+// will fail the try_lock() call which means the watcher->unpark() call
+// will also be skipped.
+os::naked_short_sleep(250);
+}
+
   bool have_lock = PeriodicTask_lock->try_lock();
+if (UseNewCode) {
+tty->print_cr("XXX - WatcherThread::stop() - have_lock=%d", have_lock);
+}
 
   _should_terminate = true;
   OrderAccess::fence();  // ensure WatcherThread sees update in main loop

Even though I cannot yet figure out a way to make this hang reproduce in the real world, I think we need to proceed with cleaning up the PeriodicTask_lock code. The version of WatcherThread::stop() introduced by me in:

JDK-8047720 Xprof hangs on Solaris

has caused enough headaches in folks trying to read and understand the code.
17-02-2015

I backed out the fix for JDK-8047720 and applied my current draft fix. I borrowed the debug code from JDK-8047720 that makes that hang much more reproducible:

$ diff 8072439.diffs.txt.00 8072439.diffs.debug.txt.00
0a1,17
> diff -r 53bc5f5b8378 src/share/vm/runtime/fprofiler.cpp
> --- a/src/share/vm/runtime/fprofiler.cpp
> +++ b/src/share/vm/runtime/fprofiler.cpp
> @@ -900,6 +900,13 @@ void FlatProfiler::record_thread_ticks()
>    // Suspend each thread. This call should just return
>    // for any threads that have already self-suspended
>    // Net result should be one safepoint
> +if (UseNewCode) {
> +// At this point, this thread is holding PeriodicTask_lock
> +// Add a slight delay to improve the odds that the main
> +// thread will block on PeriodicTask_lock while trying to
> +// terminate the VM.
> +os::naked_short_sleep(999);
> +}
>    for (int j = 0; j < suspendedthreadcount; j++) {
>      JavaThread *tp = threadsList[j];
>      if (tp) {
11c28
< @@ -422,9 +422,11 @@ void before_exit(JavaThread * thread) {
---
> @@ -422,9 +422,17 @@ void before_exit(JavaThread * thread) {
19a37,42
> +if (UseNewCode) {
> +// At this point, the Flat Profiler thread is holding PeriodicTask_lock
> +// for an extra second. Add a slight delay to improve the odds that this
> +// thread will line up in that window and block in WatcherThread::stop().
> +os::naked_short_sleep(250);
> +}
177c200
< @@ -1318,27 +1332,20 @@ void WatcherThread::make_startable() {
---
> @@ -1318,27 +1332,21 @@ void WatcherThread::make_startable() {
194c217,218
< +  MutexLocker ml(PeriodicTask_lock);
---
> +// MutexLocker ml(PeriodicTask_lock);
> +MutexLockerEx ml(PeriodicTask_lock, UseNewCode2 && Mutex::_no_safepoint_check_flag);
215c239
< @@ -1358,9 +1365,7 @@ void WatcherThread::stop() {
---
> @@ -1358,9 +1366,7 @@ void WatcherThread::stop() {
226c250
< @@ -3557,8 +3562,8 @@ jint Threads::create_vm(JavaVMInitArgs*
---
> @@ -3557,8 +3563,8 @@ jint Threads::create_vm(JavaVMInitArgs*

The '-XX:+UseNewCode' option enables the selective delays that make the hang reproducible. The '-XX:+UseNewCode2' option changes WatcherThread::stop() from the fixed lock grab back to the non-safepointing lock grab:

194c217,218
< +  MutexLocker ml(PeriodicTask_lock);
---
> +// MutexLocker ml(PeriodicTask_lock);
> +MutexLockerEx ml(PeriodicTask_lock, UseNewCode2 && Mutex::_no_safepoint_check_flag);

With both options ('-XX:+UseNewCode' and '-XX:+UseNewCode2'), the original hang in JDK-8047720 reproduces every time. With just '-XX:+UseNewCode', the revised code passed 3317 test runs in an overnight run on my Solaris X64 server. So the new code works as well as the original fix for JDK-8047720 using the original failing test. Still need to create a reproducer for the hang reported by Alex and Carsten.
17-02-2015
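The debug technique above generalizes: gate short sleeps behind a diagnostic flag so that a narrow race window becomes wide enough to hit on every run. A minimal sketch of the idiom in plain C++ (the environment variable is invented for the sketch, standing in for -XX:+UseNewCode; os::naked_short_sleep() is the HotSpot counterpart):

#include <chrono>
#include <cstdlib>
#include <thread>

// True when the diagnostic flag is set; plays the role of -XX:+UseNewCode.
static const bool use_new_code = std::getenv("USE_NEW_CODE") != nullptr;

// Drop one of these on each side of the suspected race: a long sleep while
// holding the contended lock, and a short one just before the other
// thread's try_lock(), mirroring the naked_short_sleep(999)/(250)
// placement in the debug patches above.
inline void widen_race_window(int ms) {
  if (use_new_code) {
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
  }
}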

Finished a crawl through all uses of PeriodicTask_lock and the surrounding code. Added various comments for things that I didn't think were obvious, added some asserts, changed some PeriodicTask_lock grabs into asserts that the caller already held the lock. Basically made a clean-up pass.

The key result of this analysis is that I concur with Alex and Carsten that my original fix for:

JDK-8047720 Xprof hangs on Solaris

is broken. The simplest fix for that bug is this:

 void WatcherThread::stop() {
   {
-    MutexLockerEx ml(PeriodicTask_lock, Mutex::_no_safepoint_check_flag);
+    MutexLocker ml(PeriodicTask_lock);

WatcherThread::stop() is called by another JavaThread so it should honor the safepoint protocol. Cross checking against Alex's and Carsten's e-mail threads.
17-02-2015

The PeriodicTask_lock was added by the following:

JDK-7127792 Add the ability to change an existing PeriodicTask's execution interval

When it was added, it looks like MutexLockerEx was consistently used because sometimes the mutex lock was optional, e.g.:

+int PeriodicTask::time_to_wait() {
+  MutexLockerEx ml(PeriodicTask_lock->owned_by_self() ?
+                     NULL : PeriodicTask_lock, Mutex::_no_safepoint_check_flag);

I suspect that there are non-open uses of PeriodicTask::time_to_wait() where the lock is not owned by the caller...

Update: Ignore that last sentence. I need to do more research about the PeriodicTask_lock.
13-02-2015

Internalizing the locking is better from an abstraction perspective, but then you have to watch for recursive locking (use of NULL with MutexLockerEx) because the WatcherThread externalizes the locking for its own purposes. In that case you may as well make the caller responsible for the locking - especially when there is only one caller, as is the case with PeriodicTask::time_to_wait.
13-02-2015
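To make the trade-off concrete, here is a plain C++ sketch of the two conventions. Everything in it is invented for the illustration: a hand-tracked owner id plays the part of Mutex::owned_by_self(), and 50 is a placeholder for the real computation.

#include <cassert>
#include <mutex>
#include <thread>

std::mutex      task_lock;               // stands in for PeriodicTask_lock
std::thread::id task_lock_owner;         // tracked by hand; fine for a demo

bool owned_by_self() { return task_lock_owner == std::this_thread::get_id(); }
void lock_task_lock()   { task_lock.lock(); task_lock_owner = std::this_thread::get_id(); }
void unlock_task_lock() { task_lock_owner = std::thread::id(); task_lock.unlock(); }

// Style 1 (JDK-7127792): internalized, optional locking. Skipping the lock
// when we already own it is the moral equivalent of passing NULL to
// MutexLockerEx, and is what keeps recursive callers (the WatcherThread) safe.
int time_to_wait_optional() {
  bool need_lock = !owned_by_self();
  if (need_lock) lock_task_lock();
  int t = 50;                            // placeholder for the real computation
  if (need_lock) unlock_task_lock();
  return t;
}

// Style 2 (where the review ended up): with a single caller, make that
// caller responsible for the lock and reduce the callee to an assert.
int time_to_wait_assert() {
  assert(owned_by_self() && "precondition: caller holds task_lock");
  return 50;                             // placeholder for the real computation
}

int main() {
  lock_task_lock();
  int a = time_to_wait_optional();       // no self-deadlock: lock is skipped
  int b = time_to_wait_assert();         // precondition holds
  unlock_task_lock();
  return (a == b) ? 0 : 1;
}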

Carsten's updated (better) patch:

diff --git a/src/share/vm/runtime/task.cpp b/src/share/vm/runtime/task.cpp
--- a/src/share/vm/runtime/task.cpp
+++ b/src/share/vm/runtime/task.cpp
@@ -74,8 +74,7 @@
 }
 
 int PeriodicTask::time_to_wait() {
-  MutexLockerEx ml(PeriodicTask_lock->owned_by_self() ?
-                     NULL : PeriodicTask_lock, Mutex::_no_safepoint_check_flag);
+  assert(PeriodicTask_lock->owned_by_self(), "precondition");
 
   if (_num_tasks == 0) {
     return 0; // sleep until shutdown or a task is enrolled
diff --git a/src/share/vm/runtime/thread.cpp b/src/share/vm/runtime/thread.cpp
--- a/src/share/vm/runtime/thread.cpp
+++ b/src/share/vm/runtime/thread.cpp
@@ -1198,6 +1198,10 @@
 int WatcherThread::sleep() const {
   MutexLockerEx ml(PeriodicTask_lock, Mutex::_no_safepoint_check_flag);
 
+  if (_should_terminate) {
+    return 0; // we did not sleep.
+  }
+
   // remaining will be zero if there are no tasks,
   // causing the WatcherThread to sleep until a task is
   // enrolled
@@ -1252,7 +1256,7 @@
   this->initialize_thread_local_storage();
   this->set_native_thread_name(this->name());
   this->set_active_handles(JNIHandleBlock::allocate_block());
-  while (!_should_terminate) {
+  while (true) {
     assert(watcher_thread() == Thread::current(), "thread consistency check");
     assert(watcher_thread() == this, "thread consistency check");
@@ -1288,6 +1292,10 @@
       }
     }
 
+    if (_should_terminate) {
+      break;
+    }
+
     PeriodicTask::real_time_tick(time_waited);
   }
 
@@ -1318,24 +1326,20 @@
 }
 
 void WatcherThread::stop() {
-  // Get the PeriodicTask_lock if we can. If we cannot, then the
-  // WatcherThread is using it and we don't want to block on that lock
-  // here because that might cause a safepoint deadlock depending on
-  // what the current WatcherThread tasks are doing.
-  bool have_lock = PeriodicTask_lock->try_lock();
-
   _should_terminate = true;
-  OrderAccess::fence();  // ensure WatcherThread sees update in main loop
-
-  if (have_lock) {
+  {
+    // Let safepoints happen to avoid the deadlock:
+    // The VM thread has Threads_lock and waits for Java threads to report back for a safepoint.
+    // The watcher thread has PeriodicTask_lock and tries to acquire Threads_lock.
+    // A Java thread executes WatcherThread::stop, tries to acquire PeriodicTask_lock,
+    // and holds up the VM thread by not reaching a safepoint.
+    MutexLocker mu(PeriodicTask_lock);
+
     WatcherThread* watcher = watcher_thread();
     if (watcher != NULL) {
-      // If we managed to get the lock, then we should unpark the
-      // WatcherThread so that it can see we want it to stop.
+      // Unpark the WatcherThread so that it can see that it should terminate.
       watcher->unpark();
     }
-
-    PeriodicTask_lock->unlock();
   }
 
   // it is ok to take late safepoints here, if needed
@@ -1358,9 +1362,7 @@
 }
 
 void WatcherThread::unpark() {
-  MutexLockerEx ml(PeriodicTask_lock->owned_by_self()
-                   ? NULL
-                   : PeriodicTask_lock, Mutex::_no_safepoint_check_flag);
+  assert(PeriodicTask_lock->owned_by_self(), "precondition");
 
   PeriodicTask_lock->notify();
 }
11-02-2015

Re: "// Terminate watcher thread - must before disenrolling any periodic task"

I think the only issue with disenrolling dynamically (while the WatcherThread has not been told to terminate) is that it won't wake up the WatcherThread if you are disenrolling the task it was waiting for. So there is no need for the WatcherThread to actually have terminated.
04-02-2015

Correction - it doesn't matter if we set _should_terminate before the try_lock(). If the try_lock fails then the WatcherThread owns it, and so it may still have gone past the point where it checks _should_terminate in WatcherThread::sleep. Really, _should_terminate should only be set if the lock is held.

Alex's suggestion doesn't work because we can still fail to get the lock the second time, and the WatcherThread may still have gone past the point where it checks _should_terminate. Though it will recheck once the wait time for the next periodic event has elapsed.

Seems to me there is a further problem here because we have:

  // Terminate watcher thread - must before disenrolling any periodic task
  if (PeriodicTask::num_tasks() > 0)
    WatcherThread::stop();

yet there is no guarantee at all that the WatcherThread has actually stopped before we then proceed to "disenroll" tasks. It isn't clear to me at the moment what may happen if we disenroll while the WatcherThread is still running.
04-02-2015
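One way to provide that guarantee is a termination handshake: the watcher signals as the very last thing it does, and stop() blocks until the signal arrives, so disenrolling afterwards is safe. A plain C++ sketch of the idea (illustrative only; in HotSpot a similar role is played by Terminator_lock and the watcher_thread() NULL check):

#include <condition_variable>
#include <mutex>

std::mutex              terminator_lock;  // plays the role of Terminator_lock
std::condition_variable terminator_cv;
bool                    watcher_alive = true;

// Last thing the watcher thread does on its way out of its run loop.
void watcher_signal_exit() {
  std::lock_guard<std::mutex> g(terminator_lock);
  watcher_alive = false;
  terminator_cv.notify_all();
}

// stop() returns only once the watcher is really gone, so the caller can
// safely disenroll periodic tasks afterwards.
void watcher_stop_and_wait() {
  // ... set the terminate flag and unpark the watcher first ...
  std::unique_lock<std::mutex> ml(terminator_lock);
  terminator_cv.wait(ml, [] { return !watcher_alive; });
}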

The race Alex points out is due to setting _should_terminate after trying to get the lock. The retry for the lock deals with that, but I think it can be dealt with more simply by reordering the original code:

  _should_terminate = true;
  bool have_lock = PeriodicTask_lock->try_lock();

Not clear if this is still needed somewhere:

  OrderAccess::fence();  // ensure WatcherThread sees update in main loop

Carsten's patch simply removes the no-safepoint-check from the original lock acquisition that led to the deadlock. This seems simple - so simple, why didn't we do that in the first place? :) So I have to think there are other reasons not to be blocking at a safepoint when in this method - but that would need investigation.
04-02-2015
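For reference, the reordering expressed in portable C++ atomics; this is an illustration, not the HotSpot code, and a seq_cst store plays the part of the plain store plus OrderAccess::fence(). Note the "Correction" comment above (posted later the same day): this narrows the window but does not close it, since the WatcherThread may already be past its check of _should_terminate.

#include <atomic>
#include <mutex>

std::atomic<bool> should_terminate{false};
std::mutex        periodic_task_lock;     // stands in for PeriodicTask_lock

void watcher_thread_stop() {
  // Publish the flag before probing the lock, so a watcher that releases
  // the lock after a failed try_lock() here still observes the flag.
  should_terminate.store(true, std::memory_order_seq_cst);
  if (periodic_task_lock.try_lock()) {
    // watcher->unpark() would go here so the watcher re-checks the flag
    periodic_task_lock.unlock();
  }
}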