Bug ID: JDK-6903249 array blocking queues hangs on new Intel processor (Nehalem)

Type: Bug
Component: core-libs
Sub-Component: java.util.concurrent
Affected Version: hs14

Priority: P3
Status: Closed
Resolution: Duplicate
OS: solaris_10
CPU: x86

Submitted: 2009-11-20
Updated: 2010-05-11
Resolved: 2009-11-25

Looking back at my message, I realize that I was very unclear.  This problem happens with the latest versions of Java 6.  I was testing with Java 5 just to see if it was something new, and it appears to have always been there, but you do not have to use an old version of Java.

Otherwise the only key thing is that you run it on a Nehalem chip -- I am using a "FIRE X4170 SERVER" with 'a "GenuineIntel Xeon(r) CPU           X5570  @" CPU 2.9 GHz processor.'  Running two copies at once helps, and letting it go for a day or so is sometimes necessary.  I do not think that I was running with 64 bits, although I am not certain.  I am not specifying any other command line parameters.

Thanks,
don

From: Aingworth, Donald

Sent: Friday, October 23, 2009 1:03 PM

Subject: Synchronization and Nehalem [was: RE: today's meeting]

I have some more details on the problem below, as well as a test case that we can give you.
1)  I have not had any luck reproducing it on Windows (Nehalem/Intel X55[67]0), but I have been able to reproduce it on Solaris, OpenSolaris, and Linux.

[Edited to add: Submitter later clarified that it only happened on Solaris, not on Linux.]

2)  I have not had any luck reproducing the bug on older hardware, but it does happen with some regularity on Nehalem.  (I _may_ have gotten it to happen once on Harpertown, and I have had no luck on Opteron.)
3)  The problem does not seem to be in array blocking queues proper, but something lower down (AbstractQueuedSynchronizer, ReentrantLock$NonfairSync, or lower).
4)  If you suspend and then resume the hung process, it is quite likely to get back into a good state.
5)  I have reproduced this problem in Java 5, and with Java 5 compiled code run with Java 6.  I have not tried earlier versions of Java.

The sample code (attached) creates a varying number of readers and writers, and passes some trivial information between them via the queue.  If it detects that there is no progress for too long, it starts to warn, and this seems to have the same underlying problem as I cited below.  It is not a particularly frequent problem -- the test code can run for over a day before it hangs -- but it is consistently reproducible.  I have had the best luck making this happen when doing the following things:
A)  Run with no more threads than you have cores available (the attached code does so).
B)  Run with hyperthreading turned off.
C)  Run two instances of the application at the same time.

Doing that, I eventually get something like the following:
    (rnd= 104, #rd= 1, #w= 1) ArrayBlockingQueue :           [417, 498, 262, 341, 327, 261, 314, 399] ms
    (rnd= 104, #rd= 1, #w= 2) ArrayBlockingQueue :           [1050, 983, 834, 1127, 623, 655, 875, 1729] ms
    (rnd= 104, #rd= 1, #w= 3) ArrayBlockingQueue :           [5105, 6719, 4328, 10299, 9871, 1939, 4759, 1469] ms
    (rnd= 104, #rd= 2, #w= 1) ArrayBlockingQueue :           [587, 225, 289, 666, 1440, 1184, 840, 2103] ms
    (rnd= 104, #rd= 2, #w= 2) ArrayBlockingQueue :           [4084, 4179, 3490, 4489, 3610, 5149, 1802, 3189] ms
    (rnd= 104, #rd= 3, #w= 1) ArrayBlockingQueue :           [1670, 2626, 2424, 332, 1365, 2552, 3261, 2619] ms
    (rnd= 105, #rd= 1, #w= 1) ArrayBlockingQueue :           [484, 466, 427, 392, 422, 379, 479, 383] ms
    Reader0 has had 1000000000 consecutive failed polls.  Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
    Reader0 has had 2000000000 consecutive failed polls.  Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
    Reader0 has had 3000000000 consecutive failed polls.  Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
    [...]
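
For readers without the attachment, the kind of test described above can be sketched roughly as follows. This is not the submitter's code: the class name QueueStallSketch, the method runRound, the queue capacity of 16, and the item counts are all assumptions made for illustration. Only the overall shape (writers calling put, readers polling and counting consecutive failed polls, a warning every 1000000000 failures) is taken from the description and the output above. It is not expected to reproduce the hang on its own.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class QueueStallSketch {
    // Warning interval taken from the sample output above.
    static final long WARN_EVERY = 1_000_000_000L;

    /** Runs one round with the given thread counts; returns the number of items consumed. */
    static long runRound(int readers, int writers, int itemsPerWriter) {
        // Capacity 16 is an assumption; the report does not state the queue size used.
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        final long total = (long) writers * itemsPerWriter;
        final AtomicLong consumed = new AtomicLong();
        Thread[] threads = new Thread[readers + writers];

        for (int w = 0; w < writers; w++) {
            threads[w] = new Thread(() -> {
                try {
                    for (int i = 0; i < itemsPerWriter; i++) {
                        queue.put(i); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "Writer" + w);
        }
        for (int r = 0; r < readers; r++) {
            threads[writers + r] = new Thread(() -> {
                long failedPolls = 0;
                while (consumed.get() < total) {
                    if (queue.poll() != null) { // non-blocking poll
                        consumed.incrementAndGet();
                        failedPolls = 0;
                    } else {
                        if (++failedPolls % WARN_EVERY == 0) {
                            System.err.println(Thread.currentThread().getName()
                                + " has had " + failedPolls + " consecutive failed polls."
                                + "  Queue size is " + queue.size() + ".");
                        }
                        Thread.yield();
                    }
                }
            }, "Reader" + r);
        }
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for round", e);
        }
        return consumed.get();
    }

    public static void main(String[] args) {
        for (int rd = 1; rd <= 2; rd++) {
            for (int w = 1; w <= 2; w++) {
                long start = System.nanoTime();
                long n = runRound(rd, w, 100_000);
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("(#rd= %d, #w= %d) ArrayBlockingQueue : %d items in %d ms%n",
                                  rd, w, n, ms);
            }
        }
    }
}
```

In a healthy run every round finishes and the warning never prints. A hang of the kind reported would show up as a reader warning repeatedly while the writers sit blocked in put.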
The environment is:
    > java -version
    java version "1.6.0_14"
    Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
    Java HotSpot(TM) Server VM (build 14.0-b16, mixed mode)

    > uname -a
    SunOS sxperf00.nyc.deshaw.com 5.10 Generic_141415-03 i86pc i386 i86pc Solaris

    > cat /etc/release 
                            Solaris 10 5/09 s10x_u7wos_08 X86
               Copyright 2009 Sun Microsystems, Inc.  All Rights Reserved.
                            Use is subject to license terms.
                                 Assembled 30 March 2009
Attached are a few jstacks from the hung process.  From the output and the jstacks you can infer that the writer threads are blocked even though the queue is empty.

Are you able to help here or refer our question along to the right place, please?  To get around this problem, we have had to move to our own implementations of blocking queues.  

Thanks,
don

PUBLIC COMMENTS The bug is caused by missing memory barriers in various Parker::park() paths that can result in lost wakeups and hangs. (Note that PlatformEvent::park, used by built-in synchronization, is not vulnerable to the issue.) -XX:+UseMembar constitutes a work-around because the membar barrier in the state transition logic hides the problem in Parker:: (that is, there's nothing wrong with the use of the -UseMembar mechanism, but +UseMembar hides the bug in Parker::). This is a day-one bug. I developed a simple C model of the failure, and it seems more likely to manifest on modern AMD and Nehalem platforms, likely because of deeper store buffers that take longer to drain. I provided a tentative fix to Doug Lea for Parker::park which appears to eliminate the bug. I'll be delivering this fix to runtime. (I'll also augment the CR with additional test cases and a longer explanation.) This is likely a good candidate for back-ports. Dave Dice

25-11-2009
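
To make the term "lost wakeup" concrete, here is a minimal Java-level sketch of the park/unpark handoff that code such as AbstractQueuedSynchronizer depends on. It is an illustration only, not the HotSpot Parker implementation and not the C model mentioned above; the class ParkHandoffSketch and its methods are hypothetical names. The sketch relies on LockSupport.unpark making a permit available so that a later or concurrent LockSupport.park returns. The defect described in the comment above is in the VM code underneath that guarantee, so even a correct caller like this one could stay parked after unpark had been called.

```java
import java.util.concurrent.locks.LockSupport;

class ParkHandoffSketch {
    private volatile boolean ready;  // condition the waiter is waiting for
    private volatile Thread waiter;  // published before the waiter parks

    /** Blocks the calling thread until signal() has been called. */
    void await() {
        waiter = Thread.currentThread();
        while (!ready) {             // recheck in a loop: park() may return spuriously
            LockSupport.park(this);
        }
        waiter = null;
    }

    /** Sets the condition, then wakes the waiter if one has registered. */
    void signal() {
        ready = true;                // volatile store happens before the unpark below
        Thread t = waiter;
        if (t != null) {
            LockSupport.unpark(t);   // grants the permit; a correct park() must see it
        }
    }

    /** Runs one handoff; returns true if the waiter woke up within five seconds. */
    static boolean demo() {
        ParkHandoffSketch h = new ParkHandoffSketch();
        Thread t = new Thread(h::await, "Waiter");
        t.start();
        h.signal();
        try {
            t.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("handoff completed: " + demo());
    }
}
```

Because both fields are volatile, either the waiter sees ready set before parking or the signaller sees the registered waiter and unparks it. The caller-side protocol is therefore sound, which is consistent with the submitter's observation that the problem lies below the queue and lock classes.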

PUBLIC COMMENTS This appears to be a duplicate of 6822370, but I need to clarify that the hangs only occur on 64-bit. If this is happening on 32-bit as well, then that is a new failure mode for this problem. Also, we have not previously seen the problem on Linux. Which version of Linux was the problem observed on?

20-11-2009