JDK-6822370 : ReentrantReadWriteLock: threads hung when there are no threads holding onto the lock (Netra x4450)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: hs14,6u12,6u14
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris_10
  • CPU: x86
  • Submitted: 2009-03-26
  • Updated: 2013-01-11
  • Resolved: 2011-03-08
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 6: 6u18 (Fixed)
JDK 7: 7 (Fixed)
Other: hs16 (Fixed)
Description
Only happens on hardware: Netra x4450
OS = Solaris 10 (does not happen on Linux)
JDK = JDK 6.0 update 10 and update 12

Problem Description

Java threads are blocked waiting for a lock that is not held by any
thread, i.e. a lock that has already been released. The threads are
seen waiting on this synchronizer:
- parking to wait for  <0xfffffd72f0e32118> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

Troubleshooting results showing that the lock is not held by any thread
are available; see the comments for details.

This hang occurs only on the Netra x4450 and not on other platforms
such as the x2200, x4440, x4600, or Niagara-based systems like the 5440.

A test case and more details are available in the comments.
A similar problem was reported via the concurrency-interest mailing list and then on the core-libs-dev list.

http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-July/002040.html

The problem is not restricted to the x4450 but also shows up on 4- to 8-way Intel systems. Same basic scenario: the application hangs with a bunch of threads all inside a LinkedBlockingQueue trying to acquire the internal ReentrantLock that nobody seems to own. Again, -XX:+UseMembar avoids the problem.

I used the attached test program to reproduce the problem on an 8-way Intel machine that I have access to. However, when I tried to probe deeper by using modified classes that allowed me to examine the internal lock state when the hang occurs, it ceased to occur.

Two different hangs are possible with the test program:
a) use of the application's LinkedBlockingDeque
b) use of the ThreadPoolExecutor's LinkedBlockingQueue

Thanks to Ariel Weisburg and Ryan Betts for reporting the problem and working to get a small reproducible test case.
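
For readers without access to the attachment, a minimal sketch of the kind of producer/consumer stress test described above might look like the following. It is illustrative only and is not the attached test case; the class name, queue bound, thread count and reporting interval are arbitrary.

    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative stress test: many producer and consumer threads hammer a
    // LinkedBlockingQueue, whose internal ReentrantLock is where the hung
    // threads were observed parked. On an affected machine the reported
    // queue size eventually freezes even though no thread holds the lock.
    public class QueueHangSketch {
        public static void main(String[] args) throws Exception {
            final LinkedBlockingQueue<Integer> queue =
                new LinkedBlockingQueue<Integer>(1024);
            int threads = Runtime.getRuntime().availableProcessors();

            for (int i = 0; i < threads; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                queue.put(42);   // producers block when the queue fills
                            }
                        } catch (InterruptedException ignored) { }
                    }
                }, "producer-" + i).start();

                new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                queue.take();    // consumers block when the queue empties
                            }
                        } catch (InterruptedException ignored) { }
                    }
                }, "consumer-" + i).start();
            }

            // A hang shows up as the reported size freezing while every
            // worker thread sits parked on the queue's internal lock.
            while (true) {
                Thread.sleep(5000);
                System.out.println("queue size = " + queue.size());
            }
        }
    }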

Comments
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/95e9083cf4a7
23-12-2009

EVALUATION --
05-12-2009

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/95e9083cf4a7
03-12-2009

EVALUATION < wrong sub-CR >
02-12-2009

SUGGESTED FIX
*** 4654,4663 ****
--- 4654,4664 ----
  void Parker::park(bool isAbsolute, jlong time) {
    // Optional fast-path check:
    // Return immediately if a permit is available.
    if (_counter > 0) {
        _counter = 0 ;
+       OrderAccess::fence();
        return ;
    }
    Thread* thread = Thread::current();
    assert(thread->is_Java_thread(), "Must be JavaThread");

*** 4696,4705 ****
--- 4697,4707 ----
    int status ;
    if (_counter > 0)  { // no wait needed
      _counter = 0;
      status = pthread_mutex_unlock(_mutex);
      assert (status == 0, "invariant") ;
+     OrderAccess::fence();
      return;
    }
  #ifdef ASSERT
    // Don't catch signals while blocked; let the running threads have the signals.

*** 4735,4745 ****
    assert_status(status == 0, status, "invariant") ;
    // If externally suspended while waiting, re-suspend
    if (jt->handle_special_suspend_equivalent_condition()) {
      jt->java_suspend_self();
    }
! }
  void Parker::unpark() {
    int s, status ;
    status = pthread_mutex_lock(_mutex);
--- 4737,4747 ----
    assert_status(status == 0, status, "invariant") ;
    // If externally suspended while waiting, re-suspend
    if (jt->handle_special_suspend_equivalent_condition()) {
      jt->java_suspend_self();
    }
!   OrderAccess::fence();
  }
  void Parker::unpark() {
    int s, status ;
    status = pthread_mutex_lock(_mutex);
01-12-2009

EVALUATION The bug is caused by missing memory barriers in various Parker::park() paths that can result in lost wakeups and hangs. (Note that PlatformEvent::park, used by built-in synchronization, is not vulnerable to the issue.) -XX:+UseMembar constitutes a work-around because the membar in the thread state transition logic hides the problem in Parker::park(). (That is, there's nothing wrong with the -UseMembar mechanism, but +UseMembar hides the bug in Parker::park().) This is a day-one bug introduced with the addition of java.util.concurrent in JDK 5.0. I developed a simple C model of the failure and it seems more likely to manifest on modern AMD and Nehalem platforms, likely because of deeper store buffers that take longer to drain. I provided a tentative fix to Doug Lea for Parker::park which appears to eliminate the bug. I'll be delivering this fix to runtime. (I'll also augment the CR with additional test cases and a longer explanation.) This is likely a good candidate for back-ports. Dave Dice
30-11-2009
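
For illustration only, the following is a rough Java analogue of the park/unpark permit protocol described in the evaluation above; it is not HotSpot code, and the class and method names are hypothetical. In the C++ Parker, _counter is a plain field, so each path that consumes the permit must be followed by an explicit OrderAccess::fence() (the lines added in the suggested fix); in this sketch the AtomicInteger supplies the equivalent ordering.

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch of a park/unpark-style "parker": a single permit
    // that park() consumes and unpark() makes available. In HotSpot the
    // permit (_counter) is a plain int, so the paths that clear it need an
    // explicit fence; here getAndSet()/synchronized provide the ordering.
    class ParkerSketch {
        private final AtomicInteger permit = new AtomicInteger(0);
        private final Object mutex = new Object();

        void park() throws InterruptedException {
            // Fast path: consume an available permit and return immediately.
            // This is the analogue of the fast path that needed the fence.
            if (permit.getAndSet(0) > 0) {
                return;
            }
            synchronized (mutex) {
                while (permit.get() == 0) {
                    mutex.wait();            // guarded against spurious wakeups
                }
                permit.set(0);               // consume the permit before returning
            }
        }

        void unpark() {
            synchronized (mutex) {
                permit.set(1);               // at most one permit is retained
                mutex.notify();
            }
        }
    }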

WORK AROUND Ignore the other suggested workarounds. The true workaround here is to specify -XX:+UseMembar, as indicated in the public comments. This has the side effect of placing a memory barrier on the path from which it is currently missing.
26-11-2009

PUBLIC COMMENTS Note that although this was only reproduced on x86 Solaris with the 64-bit JVM, it is a generic bug and is not restricted to x86, Solaris or 64-bit.
26-11-2009

EVALUATION See the public comments - the bug, while a day-one bug in the JUC infrastructure, is much more likely to manifest with modern Intel and AMD processors.
25-11-2009

PUBLIC COMMENTS The bug is caused by missing memory barriers in various Parker::park() paths that can result in lost wakeups and hangs. (Note that PlatformEvent::park, used by built-in synchronization, is not vulnerable to the issue.) -XX:+UseMembar constitutes a work-around because the membar in the thread state transition logic hides the problem in Parker::park(). (That is, there's nothing wrong with the -UseMembar mechanism, but +UseMembar hides the bug in Parker::park().) This is a day-one bug. I developed a simple C model of the failure and it seems more likely to manifest on modern AMD and Nehalem platforms, likely because of deeper store buffers that take longer to drain. I provided a tentative fix to Doug Lea for Parker::park which appears to eliminate the bug. I'll be delivering this fix to runtime. (I'll also augment the CR with additional test cases and a longer explanation.) This is likely a good candidate for back-ports. Dave Dice
25-11-2009

WORK AROUND The workaround can be applied by grabbing ReentrantReadWriteLock.java from OpenJDK 6 and modifying the WriteLock.lock() method as follows:

    public void lock() {
        //sync.acquire(1);

        // To avoid a hang that seems to be caused by a lost-wakeup
        // we repeatedly use tryAcquire in a loop so that we can
        // poll the lock state
        long timeout = 1; // 1 second
        boolean locked = false;
        boolean interrupted = false;
        while (!locked) {
            try {
                locked = sync.tryAcquireNanos(1, TimeUnit.SECONDS.toNanos(timeout));
            } catch (InterruptedException ex) {
                interrupted = true;
            }
        }
        if (interrupted) {
            // re-assert interrupt state that occurred while we
            // were acquiring the lock
            Thread.currentThread().interrupt();
        }
    }

The presumed problem is a lost wakeup, so if that happens the thread will time out of the tryLock and then call tryLock again, at which point it will see that the lock is actually available and acquire it. I'm assuming that a 1-second timeout is long enough not to impact performance (assuming writes are normally shorter than 1 second, as are read sequences), while short enough not to cause undue delay if the problem occurs. Of course you can experiment with different values and even set it dynamically via a property. If the hang occurs on ReadLock.lock() then a similar workaround can be applied.
21-04-2009

PUBLIC COMMENTS Interestingly, the hang does not seem to occur with -XX:+UseMembar, but I suspect that is more due to changes in the timing than it being a root cause.
21-04-2009

PUBLIC COMMENTS The hang reproduces with -client and -server but not with -Xint. The hang does not seem to reproduce with only 1 or 2 processors, but does with higher numbers of processors. However, some combinations of processors do not reproduce the hang. This all suggests a race condition in the generated code.
14-04-2009

WORK AROUND One possible workaround for the customer is to replace the lock() calls with tryLock(timeout) in a loop - with a suitable timeout value. This will give the effect of not returning until the lock is available, but should avoid waiting indefinitely due to this bug.
14-04-2009
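
As an illustration of that suggestion, here is a hypothetical application-level pattern (not taken from the customer's code) that replaces a plain lock() with a timed tryLock() loop, so that a lost wakeup costs at most one timeout instead of hanging forever. The class name, method and 1-second timeout are arbitrary.

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical workaround at the application level: poll the lock with a
    // timed tryLock() instead of blocking indefinitely in lock(). If a wakeup
    // is lost, the thread times out, retries, and then sees the lock is free.
    class GuardedResource {
        private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

        void updateSafely(Runnable update) {
            boolean interrupted = false;
            // Acquire the write lock, retrying after each 1-second timeout.
            for (;;) {
                try {
                    if (rwl.writeLock().tryLock(1, TimeUnit.SECONDS)) {
                        break;
                    }
                } catch (InterruptedException ex) {
                    interrupted = true;      // remember the interrupt, keep trying
                }
            }
            try {
                update.run();
            } finally {
                rwl.writeLock().unlock();
                if (interrupted) {
                    Thread.currentThread().interrupt();   // restore interrupt status
                }
            }
        }
    }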

PUBLIC COMMENTS From investigations (by others) it seems that this hang only occurs:
- on the x4450
- running Solaris
- with the 64-bit JVM only
which makes it less likely that this is a flaw in the Java-level algorithms. Can we re-run the test case executing the VM in different-size processor sets (e.g. 1, 2, 4) to see if the problem reproduces under those conditions?
02-04-2009

PUBLIC COMMENTS Testing with a fix for 6801020 has not resolved this problem, so it looks like it is a separate issue.
28-03-2009

PUBLIC COMMENTS It's possible that this is the same AQS problem that underlies 6801020.
26-03-2009