Looking back at my message, I realize that I was very unclear. This problem happens with the latest versions of Java 6. I was testing with Java 5 just to see if it was something new, and it appears to have always been there, but you do not have to use an old version of Java.
Otherwise the only key thing is that you run it on a Nehalem chip -- I am using a "FIRE X4170 SERVER" with 'a "GenuineIntel Xeon(r) CPU X5570 @" CPU 2.9 GHz processor.' Running two copies at once helps, and letting it go for a day or so is sometimes necessary. I do not think that I was running a 64-bit JVM, although I am not certain. I am not specifying any other command line parameters.
From: Aingworth, Donald:
Sent: Friday, October 23, 2009 1:03 PM
Subject: Synchronization and Nehalem [was: RE: today's meeting]
I have some more details on the problem below, as well as a test case that we can give you.
1) I have not had any luck reproducing it on Windows (Nehalem/Intel X550), but I have been able to reproduce it on Solaris, OpenSolaris, and Linux.
[Edited to add: Submitter later clarified that it only happened on Solaris not Linux.]
2) I have not had any luck reproducing the bug on older hardware, but it does happen with some regularity on Nehalem. (I _may_ have gotten it to happen once on Harpertown, and I have had no luck on Opteron.)
3) The problem does not seem to be in array blocking queues proper, but something lower down (AbstractQueuedSynchronizer, ReentrantLock$NonfairSync, or lower).
4) If you suspend and then resume the hung process, it is quite likely to get back into a good state.
5) I have reproduced this problem on Java 5, and with code compiled for Java 5 running on Java 6. I have not tried earlier versions of Java.
The sample code (attached) creates a varying number of readers and writers, and passes some trivial information between them via the queue. If it detects that there is no progress for too long, it starts to warn, and this seems to have the same underlying problem as I cited below. It is not a particularly frequent problem -- the test code can run for over a day before it hangs -- but it is consistently reproducible. I have had the best luck making this happen when doing the following things:
A) Run with no more threads than you have cores available (the attached code does so).
B) Run with hyperthreading turned off.
C) Run two instances of the application at the same time.
Doing that, I eventually get something like the following:
(rnd= 104, #rd= 1, #w= 1) ArrayBlockingQueue : [417, 498, 262, 341, 327, 261, 314, 399] ms
(rnd= 104, #rd= 1, #w= 2) ArrayBlockingQueue : [1050, 983, 834, 1127, 623, 655, 875, 1729] ms
(rnd= 104, #rd= 1, #w= 3) ArrayBlockingQueue : [5105, 6719, 4328, 10299, 9871, 1939, 4759, 1469] ms
(rnd= 104, #rd= 2, #w= 1) ArrayBlockingQueue : [587, 225, 289, 666, 1440, 1184, 840, 2103] ms
(rnd= 104, #rd= 2, #w= 2) ArrayBlockingQueue : [4084, 4179, 3490, 4489, 3610, 5149, 1802, 3189] ms
(rnd= 104, #rd= 3, #w= 1) ArrayBlockingQueue : [1670, 2626, 2424, 332, 1365, 2552, 3261, 2619] ms
(rnd= 105, #rd= 1, #w= 1) ArrayBlockingQueue : [484, 466, 427, 392, 422, 379, 479, 383] ms
Reader0 has had 1000000000 consecutive failed polls. Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
Reader0 has had 2000000000 consecutive failed polls. Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
Reader0 has had 3000000000 consecutive failed polls. Queue type is java.util.concurrent.ArrayBlockingQueue.Queue size is 0.
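The attachment is not reproduced here. As a rough illustration only, a test of the kind described above might look like the following sketch. It is not the attached code: the class name QueueStallSketch, the method runRound, the queue capacity, the item counts, and the WARN_EVERY threshold are all assumptions, and the use of poll() for readers and put() for writers is inferred from the warning text and the description of the jstacks.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, NOT the attached test case.
class QueueStallSketch {
    // Assumed threshold: warn each time this many polls in a row return nothing.
    static final long WARN_EVERY = 1000000000L;

    /** Pushes writers * itemsPerWriter integers through one queue; returns how many were consumed. */
    static long runRound(int readers, int writers, final int itemsPerWriter, int capacity) {
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(capacity);
        final long total = (long) writers * itemsPerWriter;
        final AtomicLong consumed = new AtomicLong();
        Thread[] threads = new Thread[readers + writers];

        for (int w = 0; w < writers; w++) {
            threads[w] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < itemsPerWriter; i++) {
                            queue.put(Integer.valueOf(i)); // blocks while the queue is full
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "Writer" + w);
        }
        for (int r = 0; r < readers; r++) {
            final String name = "Reader" + r;
            threads[writers + r] = new Thread(new Runnable() {
                public void run() {
                    long failed = 0;
                    while (consumed.get() < total) {
                        if (queue.poll() != null) { // non-blocking; null means the queue was empty
                            consumed.incrementAndGet();
                            failed = 0;
                        } else if (++failed % WARN_EVERY == 0) {
                            System.err.println(name + " has had " + failed
                                    + " consecutive failed polls. Queue type is "
                                    + queue.getClass().getName()
                                    + ". Queue size is " + queue.size() + ".");
                        }
                    }
                }
            }, name);
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the round", e);
        }
        return consumed.get();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int rounds = args.length > 0 ? Integer.parseInt(args[0]) : 3; // assumed default
        for (int round = 0; round < rounds; round++) {
            // Never use more threads than cores, as in condition A) above.
            for (int r = 1; r < cores; r++) {
                for (int w = 1; r + w <= cores; w++) {
                    long start = System.currentTimeMillis();
                    runRound(r, w, 100000, 8);
                    long elapsed = System.currentTimeMillis() - start;
                    System.out.println("(rnd= " + round + ", #rd= " + r + ", #w= " + w
                            + ") ArrayBlockingQueue : " + elapsed + " ms");
                }
            }
        }
    }
}
```

On a healthy JVM every round should finish and the warning should never print. A hang of the kind reported would show up as a round that never completes while a reader keeps reporting failed polls against an empty queue.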
The environment is:
> java -version
java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) Server VM (build 14.0-b16, mixed mode)
> uname -a
SunOS sxperf00.nyc.deshaw.com 5.10 Generic_141415-03 i86pc i386 i86pc Solaris
> cat /etc/release
Solaris 10 5/09 s10x_u7wos_08 X86
Copyright 2009 Sun Microsystems, Inc. All Rights Reserved.
Use is subject to license terms.
Assembled 30 March 2009
Attached are a few jstacks from the hung process. From the output and the jstacks you can infer that the writer threads are blocked even though the queue is empty.
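For anyone who wants the same thread-state information from inside the test process rather than from jstack, a minimal sketch using the standard java.lang.management API (available from Java 6) is shown below. The class name ThreadStateDump and the idea of calling it from a watchdog thread are assumptions and are not part of the attached test case.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the attached test case.
class ThreadStateDump {
    /** Returns one line per live thread: name, state, and the lock it is waiting on, if any. */
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip the (more expensive) locked-monitor and locked-synchronizer details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip any thread that has no info
            }
            sb.append(info.getThreadName()).append(" : ").append(info.getThreadState());
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

In a hang like the one described, the writer threads would be expected to appear as WAITING on a java.util.concurrent lock or condition object while the reported queue size stays at 0.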
Are you able to help here or refer our question along to the right place, please? To get around this problem, we have had to move to our own implementations of blocking queues.