Bug ID: JDK-6822370 ReentrantReadWriteLock: threads hung when there are no threads holding onto the lock (Netra x4450)

Details
Type:
Bug
Submit Date:
2009-03-26
Status:
Closed
Updated Date:
2013-01-11
Project Name:
JDK
Resolved Date:
2011-03-08
Component:
hotspot
OS:
solaris_10
Sub-Component:
runtime
CPU:
x86
Priority:
P2
Resolution:
Fixed
Affected Versions:
hs14,6u12,6u14
Fixed Versions:
hs17 (b06)


Description
Occurs only on hardware: Netra x4450
OS: Solaris 10 (does not happen on Linux)
JDK: JDK 6.0 update 10 and update 12

Problem Description

Java threads are blocked waiting for a lock that is not held by
any thread, i.e. a lock that has already been released. A thread is
seen waiting on this synchronizer:
- parking to wait for  <0xfffffd72f0e32118> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

Troubleshooting results are available showing that
the lock is not held by any thread. See details in comments.

This hang occurs only on the Netra x4450 and not on
other platforms such as the x2200, x4440, x4600, and Niagara-based systems like the 5440.

A testcase and more details are available in comments.
A similar problem was reported via the concurrency-interest mailing list and then on the core-libs dev list. 

http://mail.openjdk.java.net/pipermail/core-libs-dev/2009-July/002040.html

The problem is not restricted to the x4450 but also shows up on 4-8 way Intel systems. The same basic scenario: the application hangs with a number of threads all inside a LinkedBlockingQueue trying to acquire the internal ReentrantLock that no thread appears to own. Again, -XX:+UseMembar avoids the problem.

I used the attached test program to reproduce the problem on an 8-way Intel machine that I have access to. However, when I tried to probe deeper by using modified classes that allowed me to examine the internal lock state when the hang occurs, it ceased to occur.

Two different hangs are possible with the test program:
a) use of the application's LinkedBlockingDeque
b) use of the ThreadPoolExecutor's LinkedBlockingQueue

Thanks to Ariel Weisburg and Ryan Betts for reporting the problem and working to get a small reproducible test case.
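The attached test program is not reproduced here, but the failure scenario it exercises can be sketched as follows. This is a hypothetical minimal stress loop (the class name, method name, and parameters are my own, not those of the attached test): several producer/consumer pairs hammer one bounded LinkedBlockingQueue so that its internal ReentrantLock is heavily contended, the condition under which the lost wakeup was observed. On an affected JVM a run like this can hang with the threads parked; on a fixed JVM it completes.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueStress {
    // Runs 'pairs' producer threads and 'pairs' consumer threads against one
    // bounded LinkedBlockingQueue and returns the number of items consumed.
    static int stress(int pairs, int opsPerThread) throws InterruptedException {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>(64);
        AtomicInteger consumed = new AtomicInteger();
        Thread[] threads = new Thread[pairs * 2];
        for (int i = 0; i < pairs; i++) {
            threads[2 * i] = new Thread(() -> {
                try {
                    for (int n = 0; n < opsPerThread; n++) queue.put(n);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[2 * i + 1] = new Thread(() -> {
                try {
                    for (int n = 0; n < opsPerThread; n++) {
                        queue.take();
                        consumed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (Thread t : threads) t.start();
        // On an affected JVM these joins would hang, with threads parked on
        // the queue's internal ReentrantLock even though no thread holds it.
        for (Thread t : threads) t.join();
        return consumed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(stress(4, 10_000) + " items consumed");
    }
}
```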

                                    

Comments
PUBLIC COMMENTS

It's possible that this is the same AQS problem that underlies 6801020
                                     
2009-03-26
PUBLIC COMMENTS

Testing with a fix for 6801020 has not resolved this problem, so it looks like it is a separate issue.
                                     
2009-03-28
PUBLIC COMMENTS

From investigations (by others) it seems that this hang only occurs:

- on the x4450
- running Solaris
- with 64-bit JVM only

which makes it less likely that this is a flaw in the Java-level algorithms.

Can we re-run the test case executing the VM in different-size processor sets,
e.g. 1, 2, and 4 processors, to see if the problem reproduces under those conditions?
                                     
2009-04-02
PUBLIC COMMENTS

The hang reproduces with -client and -server but not -Xint.

The hang does not seem to reproduce with only 1 or 2 processors, but does with higher numbers of processors. However some combinations of processors do not reproduce the hang.

This all suggests a race-condition in the generated code.
                                     
2009-04-14
WORK AROUND

One possible workaround for the customer is to replace the lock() calls with tryLock(timeout) in a loop - with a suitable timeout value. This will give the effect of not returning until the lock is available, but should avoid waiting indefinitely due to this bug.
                                     
2009-04-14
PUBLIC COMMENTS

Interestingly the hang does not seem to occur with -XX:+UseMembar, but I suspect that is due more to changes in the timing than to the flag addressing a root cause.
                                     
2009-04-21
WORK AROUND

The workaround can be applied by grabbing ReentrantReadWriteLock.java from OpenJDK 6 and modifying the WriteLock.lock() method as follows:

        public void lock() {
            // sync.acquire(1);

            // To avoid a hang that seems to be caused by a lost wakeup,
            // we repeatedly use tryAcquire in a loop so that we can
            // poll the lock state.

            long timeout = 1; // 1 second
            boolean locked = false;
            boolean interrupted = false;

            while (!locked) {
                try {
                    locked = sync.tryAcquireNanos(1, TimeUnit.SECONDS.toNanos(timeout));
                } catch (InterruptedException ex) {
                    interrupted = true;
                }
            }

            if (interrupted) {
                // Re-assert the interrupt state that occurred while we
                // were acquiring the lock.
                Thread.currentThread().interrupt();
            }
        }

The presumed problem is a lost wakeup; if that happens, the thread will time out of the tryLock and then call tryLock again, at which point it will see that the lock is actually available and acquire it.

I'm assuming that a 1-second timeout is long enough not to impact performance (assuming writes, and read sequences, normally take less than 1 second), while short enough not to cause undue delay if the problem occurs. Of course you can experiment with different values and even set the value dynamically via a property.

If the hang occurs on the ReadLock.lock() then a similar workaround can be applied.
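For callers who cannot patch the JDK classes, that similar workaround can be sketched against the public API instead of by modifying ReentrantReadWriteLock.java. This is a hedged illustration (the class and method names are my own); it uses the standard timed tryLock(long, TimeUnit) so that a lost wakeup costs at most one timeout interval rather than hanging indefinitely:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockPolling {
    // Acquires the read lock by polling with a timed tryLock, so that a
    // lost wakeup is recovered from on the next poll instead of hanging.
    static void lockRead(ReentrantReadWriteLock rwl) {
        boolean locked = false;
        boolean interrupted = false;
        while (!locked) {
            try {
                locked = rwl.readLock().tryLock(1, TimeUnit.SECONDS);
            } catch (InterruptedException ex) {
                interrupted = true;
            }
        }
        if (interrupted) {
            // Re-assert any interrupt that arrived while polling.
            Thread.currentThread().interrupt();
        }
    }
}
```

The caller unlocks as usual with rwl.readLock().unlock(); the same shape works for the write lock via rwl.writeLock().tryLock(...).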
                                     
2009-04-21
PUBLIC COMMENTS

The bug is caused by missing memory barriers in various Parker::park() paths that can result in lost wakeups and hangs.  (Note that PlatformEvent::park, used by built-in synchronization, is not vulnerable to the issue.)  -XX:+UseMembar constitutes a work-around because the membar barrier in the state transition logic hides the problem in Parker::park().  (That is, there's nothing wrong with the -UseMembar mechanism, but +UseMembar hides the bug in Parker::park().)   This is a day-one bug.  I developed a simple C model of the failure and it seems more likely to manifest on modern AMD and Nehalem platforms, likely because of deeper store buffers that take longer to drain.  I provided a tentative fix to Doug Lea for Parker::park which appears to eliminate the bug.  I'll be delivering this fix to runtime.  (I'll also augment the CR with additional test cases and a longer explanation.)  This is likely a good candidate for back-ports.

Dave Dice
                                     
2009-11-25
EVALUATION

See the public comments - the bug, while a day-one bug in the JUC infrastructure, is much more likely to manifest with modern Intel and AMD processors.
                                     
2009-11-25
PUBLIC COMMENTS

Note that although this only reproduced on x86 Solaris with the 64-bit JVM it is a generic bug and is not restricted to x86, Solaris or 64-bit.
                                     
2009-11-26
WORK AROUND

Ignore the other suggested workarounds. The true workaround here is to specify -XX:+UseMembar, as indicated in the public comments. This has the side effect of placing a memory barrier on the path from which it is currently missing.
                                     
2009-11-26
EVALUATION

The bug is caused by missing memory barriers in various Parker::park() paths that can result in lost wakeups and hangs.  (Note that PlatformEvent::park, used by built-in synchronization, is not vulnerable to the issue.)  -XX:+UseMembar constitutes a work-around because the membar barrier in the state transition logic hides the problem in Parker::park().  (That is, there's nothing wrong with the -UseMembar mechanism, but +UseMembar hides the bug in Parker::park().)   This is a day-one bug introduced with the addition of java.util.concurrent in JDK 5.0.  I developed a simple C model of the failure and it seems more likely to manifest on modern AMD and Nehalem platforms, likely because of deeper store buffers that take longer to drain.  I provided a tentative fix to Doug Lea for Parker::park which appears to eliminate the bug.  I'll be delivering this fix to runtime.  (I'll also augment the CR with additional test cases and a longer explanation.)  This is likely a good candidate for back-ports.

Dave Dice
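The fence placement can be illustrated at the Java level with a hypothetical miniature of a parker (a sketch of the pattern only, not HotSpot's actual Parker; MiniParker and its members are invented names, and VarHandle.fullFence() is a Java 9+ API standing in for OrderAccess::fence()). The shape matches the fix: a permit counter with a fast path that consumes the permit without taking the mutex, followed by a full barrier. Without that barrier, the store clearing the permit can linger in the store buffer while the thread goes on to read shared synchronizer state, which is the lost-wakeup window described above.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;

public class MiniParker {
    private final AtomicInteger counter = new AtomicInteger(); // 0 or 1 permit
    private final Object mutex = new Object();

    public void park() throws InterruptedException {
        // Fast path: consume an available permit without blocking.
        if (counter.get() > 0) {
            counter.set(0);
            VarHandle.fullFence();   // stands in for the added OrderAccess::fence()
            return;
        }
        synchronized (mutex) {
            while (counter.get() == 0) {
                mutex.wait();        // blocks until unpark() grants a permit
            }
            counter.set(0);
        }
        VarHandle.fullFence();       // fence on the slow-path exit as well
    }

    public void unpark() {
        synchronized (mutex) {
            counter.set(1);          // at most one permit, LockSupport-style
            mutex.notify();
        }
    }
}
```

A park() after an unpark() returns immediately through the fast path; a park() with no permit blocks until the next unpark().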
                                     
2009-11-30
SUGGESTED FIX

*** 4654,4663 ****
--- 4654,4664 ----
  void Parker::park(bool isAbsolute, jlong time) {
    // Optional fast-path check:
    // Return immediately if a permit is available.
    if (_counter > 0) {
        _counter = 0 ;
+       OrderAccess::fence();
        return ;
    }
  
    Thread* thread = Thread::current();
    assert(thread->is_Java_thread(), "Must be JavaThread");
*** 4696,4705 ****
--- 4697,4707 ----
    int status ;
    if (_counter > 0)  { // no wait needed
      _counter = 0;
      status = pthread_mutex_unlock(_mutex);
      assert (status == 0, "invariant") ;
+     OrderAccess::fence();
      return;
    }
  
  #ifdef ASSERT
    // Don't catch signals while blocked; let the running threads have the signals.
*** 4735,4745 ****
    assert_status(status == 0, status, "invariant") ;
    // If externally suspended while waiting, re-suspend
    if (jt->handle_special_suspend_equivalent_condition()) {
      jt->java_suspend_self();
    }
! 
  }
  
  void Parker::unpark() {
    int s, status ;
    status = pthread_mutex_lock(_mutex);
--- 4737,4747 ----
    assert_status(status == 0, status, "invariant") ;
    // If externally suspended while waiting, re-suspend
    if (jt->handle_special_suspend_equivalent_condition()) {
      jt->java_suspend_self();
    }
!   OrderAccess::fence();
  }
  
  void Parker::unpark() {
    int s, status ;
    status = pthread_mutex_lock(_mutex);
                                     
2009-12-01
EVALUATION

< wrong sub-CR >
                                     
2009-12-02
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/95e9083cf4a7
                                     
2009-12-03
EVALUATION

--
                                     
2009-12-05
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/95e9083cf4a7
                                     
2009-12-23


