Bug ID: JDK-6807483 java.util.concurrent.locks.Condition.await(timeout, units) hangs forever

Type: Bug
Component: core-libs
Sub-Component: java.util.concurrent
Affected Version: 6u10

Priority: P4
Status: Closed
Resolution: Cannot Reproduce
OS: solaris_10
CPU: x86

Submitted: 2009-02-19
Updated: 2011-02-16
Resolved: 2009-06-09

FULL PRODUCT VERSION :
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) Server VM (build 11.2-b01, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
SunOS x2001 5.10 Generic_127128-11 i86pc i386 i86pc

EXTRA RELEVANT SYSTEM CONFIGURATION :
Solaris 10, AMD Opteron 2356 (2 CPU quad-core 2312 MHz)

A DESCRIPTION OF THE PROBLEM :
During a call to
java.util.concurrent.locks.Condition.await(timeout,
TimeUnit.MILLISECONDS), a thread hangs on Solaris/AMD x64 instead of resuming once the timeout has elapsed.

The environment is:
Java 1.6.0_12
Solaris 10
AMD Opteron 2356 2 CPU x Quad-Core 2312 MHz

The bug is NOT reproduced on the following platforms:
Solaris / UltraSPARC T1
Solaris / UltraSPARC T2+
Solaris / UltraSPARC IV+
Linux / Xeon


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the attached test application on a Solaris/AMD x64 platform.
Within 2-20 minutes the bug should reproduce, with the message "JVM Bug found in 8 threads !!!!" appearing in the log.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
After invocation of java.util.concurrent.locks.Condition.await(timeout,
TimeUnit.MILLISECONDS), a thread should resume once the timeout elapses.
ACTUAL -
a thread hangs forever on java.util.concurrent.locks.Condition.await(timeout,
TimeUnit.MILLISECONDS)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
// one thread:

                        lock.lock();
                        try {
                            try {
                                while (queueSize == 0) {
                                    if (condition.await(AWAIT, TimeUnit.MILLISECONDS)) {
                                        conditionCount++;
                                    }
                                }
                                awaitCount++;
                            } catch (InterruptedException e) {
                                e.printStackTrace();
                            }
                        } finally {
                            lock.unlock();
                        }

// another thread:

            lock.lock();
            try {
                bufferPos = (int) (eventsCount % LONG_DATA_COUNT);
                if (queueSize == 0 && Math.random() > SIGNAL_PROBABILITY) {
                    condition.signal();
                }
                eventsCount++;
                queueSize++;
            } finally {
                lock.unlock();
            }

---------- END SOURCE ----------
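
The expected behaviour can also be checked in isolation from the queueing logic. The following is a minimal sketch, not the attached test application: the class name TimedAwaitCheck, the method returnsWithin and the 5-second grace period are hypothetical, and the 100 ms timeout is taken from the evaluation comments. A watchdog joins the waiting thread and reports whether an unsignalled timed await came back in time.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical standalone check; not part of the attached test application.
class TimedAwaitCheck {

    // Returns true if an unsignalled timed await came back within
    // timeoutMillis + graceMillis, false if the waiter was still blocked
    // (or if the calling thread was interrupted while watching it).
    static boolean returnsWithin(final long timeoutMillis, long graceMillis) {
        final ReentrantLock lock = new ReentrantLock();
        final Condition condition = lock.newCondition();

        Thread waiter = new Thread(() -> {
            lock.lock();
            try {
                // Nobody signals this condition, so the wait should end
                // through the timeout (or, rarely, a spurious wakeup).
                condition.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        waiter.setDaemon(true); // a stuck waiter must not keep the JVM alive
        waiter.start();

        try {
            // Bounded join: the watchdog itself never waits indefinitely
            // as long as timeoutMillis + graceMillis is positive.
            waiter.join(timeoutMillis + graceMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !waiter.isAlive();
    }

    public static void main(String[] args) {
        boolean ok = returnsWithin(100, 5000);
        System.out.println(ok ? "await returned after timeout" : "await still blocked");
    }
}
```

A result of false on the affected machine would point at the timed wait itself rather than at the surrounding producer/consumer code.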

CUSTOMER SUBMITTED WORKAROUND :
there is no workaround

EVALUATION The originator has clarified that the initial version was S10u6 so while 6600939 seemed a likely cause, that would appear not to be the case.
10-06-2009
EVALUATION The originator reports that the problem disappeared after updating to Solaris 10 update 7. The previously used Solaris version was S10u5. It is possible that this was caused by 6600939, which was fixed in S10u6.
09-06-2009
PUBLIC COMMENTS With regard to the "incorrect synchronization", if there is only a single writer then no inconsistent value can be seen by the reader.
09-06-2009
PUBLIC COMMENTS The test program seems to be making some assumptions. The basic premise seems to be that within a REPORT_PERIOD (20 seconds) every EventRouter thread (there are 8) will be able to "tick", which requires either that an event is enqueued or that the timeout elapses. The timeout is 100ms. So under good conditions you expect all event routers to have ticked within 800ms, and you'd expect at worst around 24-25 ticks per reporting period. But that seems to overlook the effects of GC activity and even compilation activity, and the fact that scheduling need not be at all fair. Information on GC pauses would be useful to rule out GC interference. Note also that the test is incorrectly synchronized. A number of long values are read and updated without using a lock to protect them. This can lead to inconsistent values being read once the value is beyond the capacity of a 32-bit unsigned value.
18-05-2009
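
One way to take the unsynchronized long reads out of the picture is sketched below. It assumes the test's counters could be changed to java.util.concurrent.atomic.AtomicLong. The class EventCounters and its methods are hypothetical; only the name eventsCount comes from the submitted source.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter holder; only the name eventsCount appears in the
// submitted source, everything else is an assumption for illustration.
class EventCounters {
    // AtomicLong reads and writes are atomic even on a 32-bit VM, where a
    // plain non-volatile long may be accessed as two separate 32-bit halves.
    private final AtomicLong eventsCount;

    EventCounters(long initialValue) {
        this.eventsCount = new AtomicLong(initialValue);
    }

    // Called by the producing thread for every enqueued event.
    void recordEvent() {
        eventsCount.incrementAndGet();
    }

    // Called by the reporting thread; never observes a half-updated value.
    long eventsSoFar() {
        return eventsCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Start just below the 32-bit unsigned limit so the counter crosses it.
        final EventCounters counters = new EventCounters((1L << 32) - 1000);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 2000; i++) {
                counters.recordEvent();
            }
        });
        producer.start();
        producer.join();
        System.out.println(counters.eventsSoFar() == (1L << 32) + 1000);
    }
}
```

If there is only a single writer, declaring the field as volatile long would be a lighter alternative, since volatile long accesses are also atomic.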
PUBLIC COMMENTS Can we get a pstack, or "jstack -m", dump of the JVM process when the hang occurs, please? Also the version info shows the 32-bit VM, but the start script posted on the forum shows the 64-bit VM being invoked. Please clarify whether the problem exists only on 64-bit.
15-05-2009
PUBLIC COMMENTS I've been unable to reproduce this locally so far. I only have easy access to two machines, one of which runs fine and the other doesn't have enough memory to run the test in the given configuration. If this failure is so platform specific it may be an OS and/or hardware issue rather than a j.u.c or JVM issue.
23-02-2009
PUBLIC COMMENTS Can the submitter modify the program to use Object.wait() rather than Thread.sleep()? I'd like to make sure the problem comes from the await() calls not returning rather than the sleeps returning early and mis-reporting failures. Thanks, David Holmes
20-02-2009
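
The substitution requested in the comment above could look roughly like the following sketch. It is not the attached test's code: the class WaitBasedPause and the method pause are hypothetical names, and the loop re-checks a System.nanoTime() deadline so that an early or spurious return from Object.wait() cannot shorten the pause.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical replacement for a Thread.sleep(millis) call in a reporting
// thread; the names here are assumptions, not taken from the test program.
class WaitBasedPause {

    // Blocks for at least the given number of milliseconds using a timed
    // Object.wait() and returns the measured elapsed time in milliseconds.
    static long pause(long millis) {
        final Object monitor = new Object(); // private monitor, never notified
        final long start = System.nanoTime();
        final long deadline = start + TimeUnit.MILLISECONDS.toNanos(millis);
        boolean interrupted = false;

        synchronized (monitor) {
            long remaining;
            // Loop on the deadline: wait() may return early or spuriously.
            while ((remaining = deadline - System.nanoTime()) > 0) {
                try {
                    // Add 1 ms so the argument is never 0, because
                    // wait(0) would block until notified.
                    monitor.wait(TimeUnit.NANOSECONDS.toMillis(remaining) + 1);
                } catch (InterruptedException e) {
                    interrupted = true;
                    break;
                }
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("paused for " + pause(200) + " ms");
    }
}
```

Because the pause can only end once the deadline has passed (unless the thread is interrupted), a report of a stalled thread cannot be caused by the reporting thread waking up too early.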