Hang when using LockSupport.park() on heavily threaded systems
See Comments for more details.
reproducer.zip is attached. Compile it and run it using the command line given in any of the out-thr64-* files.
The code can be reduced further, depending on the constants used in the test. -DNO_OTHERQ_TEST=true disables a comparative test with the other queues.
The output, mostly from the Turnstile.awaitOn method, is saved in the out-thr64 files. It is generated only when a wait lasts longer than the limit given on the command line; in these runs the tests wait up to 30 seconds for a signal (-DTIMED_WAIT=30). Each thread that times out prints the state of the waiter list as it is at the time of printing, starting from the head of the list as it existed when that thread parked.
In out-thr64-nomembar, scroll to the lines:
waiting for: 908
nextTicket: 909
last ticket seen: 886
checked ticket: 1 times
[1, notified by: 0]
[907, notified by: 908]
*[908, notified by: 908]
head->[909, notified by: 909]
[910, blocked]
...
The asterisk before [908, ...] means this output was generated by the thread that waited for ticket 908 (waiting for: 908). At the time of wake-up it observes that nextTicket is 909, a value that would not have blocked it had it been seen on entry to awaitOn. The last ticket seen is 886, which justifies parking the thread waiting for 908. "checked ticket: 1 times" indicates that the thread never woke up at all: it went through the code path that reads the ticket only once, just before parking.
The lines in square brackets show the state of the waiter list, one node per line. The first node is a sentinel node created when PartialOrderSet was constructed. At the time of printing, the head of the list points beyond the waiter list node [908, ...] that corresponds to this thread, and "notified by:" indicates that the node was visited by another thread that was meant to unblock and wake up every waiter with a ticket value of 908 or less (see Turnstile.release, which sets the value of h.notifier before calling LockSupport.unpark).
Lines like [910, blocked] mean that those waiter list nodes were never visited and their threads remain parked. This is expected in this test once other threads have failed to wake up.
The problem is that the thread with ticket 908 was notified but failed to wake up until the whole 30 seconds had expired.
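To make the dump format easier to read, here is a minimal sketch of a waiter list whose release step stamps each node before unparking its thread. It is not the code from reproducer.zip: the class name WaiterListSketch, the NOT_NOTIFIED marker, the method names and the use of synchronized are all assumptions made for this illustration. Only the notifier-before-unpark order and the shape of the printed lines are taken from the description above.

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch; NOT the Turnstile/PartialOrderSet code from reproducer.zip.
class WaiterListSketch {
    // Assumed marker for a node that no releaser has visited yet ("blocked").
    static final long NOT_NOTIFIED = Long.MIN_VALUE;

    static final class Node {
        final long ticket;
        final Thread waiter;      // thread parked on this node; may be null
        volatile long notifier;   // ticket of the releaser that visited the node
        volatile Node next;

        Node(long ticket, Thread waiter, long notifier) {
            this.ticket = ticket;
            this.waiter = waiter;
            this.notifier = notifier;
        }
    }

    final Node sentinel;          // first node, created with the list
    volatile Node head;           // last node that was notified
    Node tail;

    WaiterListSketch(long sentinelTicket) {
        sentinel = new Node(sentinelTicket, null, 0L);
        head = sentinel;
        tail = sentinel;
    }

    // synchronized only to keep the sketch simple; an assumption, not the real design.
    synchronized Node enqueue(long ticket, Thread waiter) {
        Node n = new Node(ticket, waiter, NOT_NOTIFIED);
        tail.next = n;
        tail = n;
        return n;
    }

    // Visit every waiter with ticket <= released: record the notifier first,
    // then unpark, then advance head to the last node visited.
    synchronized void release(long released) {
        Node h = head;
        for (Node n = h.next; n != null && n.ticket <= released; n = n.next) {
            n.notifier = released;
            LockSupport.unpark(n.waiter);   // documented no-op when waiter is null
            h = n;
        }
        head = h;
    }

    // Print nodes starting at 'from'; "head->" marks head, "*" marks the caller's node.
    String dump(Node from, Node self) {
        StringBuilder sb = new StringBuilder();
        Node h = head;
        for (Node n = from; n != null; n = n.next) {
            if (n == h) sb.append("head->");
            if (n == self) sb.append('*');
            sb.append('[').append(n.ticket).append(", ");
            if (n.notifier == NOT_NOTIFIED) sb.append("blocked");
            else sb.append("notified by: ").append(n.notifier);
            sb.append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        WaiterListSketch list = new WaiterListSketch(1);
        list.enqueue(907, null);            // null waiters: nothing is really parked here
        Node self = list.enqueue(908, null);
        list.enqueue(909, null);
        list.enqueue(910, null);
        list.release(908);
        list.release(909);
        System.out.print(list.dump(list.sentinel, self));
    }
}
```

In this sketch a node that shows "notified by:" has had unpark called for its thread, so a thread that stays parked behind such a line is the anomaly; a node that shows "blocked" was simply never reached by a releaser.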
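The diagnostic fields can be related to a wait loop along the following lines. This is a sketch under assumptions, not the Turnstile.awaitOn implementation: the class TimedAwaitSketch, the Report type and the use of an AtomicLong for nextTicket are invented for the example. It counts how many times the ticket is read before parking and gives up once the deadline passes without re-reading, which is how a report with "checked ticket: 1 times" and a newer nextTicket could arise.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch; NOT the Turnstile.awaitOn code from reproducer.zip.
class TimedAwaitSketch {
    static final class Report {
        final long waitingFor, nextTicket, lastTicketSeen;
        final int checks;
        final boolean timedOut;

        Report(long waitingFor, long nextTicket, long lastTicketSeen,
               int checks, boolean timedOut) {
            this.waitingFor = waitingFor;
            this.nextTicket = nextTicket;
            this.lastTicketSeen = lastTicketSeen;
            this.checks = checks;
            this.timedOut = timedOut;
        }

        @Override public String toString() {
            return "waiting for: " + waitingFor + "\n"
                 + "nextTicket: " + nextTicket + "\n"
                 + "last ticket seen: " + lastTicketSeen + "\n"
                 + "checked ticket: " + checks + " times";
        }
    }

    // Park until nextTicket reaches 'ticket' or the timeout elapses.
    static Report awaitOn(AtomicLong nextTicket, long ticket, long timeoutNanos) {
        final long deadline = System.nanoTime() + timeoutNanos;
        int checks = 0;
        long lastSeen;
        do {
            lastSeen = nextTicket.get();    // the read that justifies parking
            checks++;
            if (lastSeen >= ticket) {
                return new Report(ticket, lastSeen, lastSeen, checks, false);
            }
            // Returns on unpark, on timeout, or spuriously; non-positive
            // durations return immediately.
            LockSupport.parkNanos(deadline - System.nanoTime());
        } while (System.nanoTime() - deadline < 0);
        // Deadline passed: report nextTicket as it is now, next to the value
        // that was last seen before parking.
        return new Report(ticket, nextTicket.get(), lastSeen, checks, true);
    }

    public static void main(String[] args) {
        AtomicLong next = new AtomicLong(886);
        Thread waiter = Thread.currentThread();
        Thread releaser = new Thread(() -> {
            LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(100));
            next.set(909);                  // publish the new ticket first
            LockSupport.unpark(waiter);     // then wake the waiter
        });
        releaser.start();
        System.out.println(awaitOn(next, 908, TimeUnit.SECONDS.toNanos(30)));
    }
}
```

When unpark is delivered, the waiter in this sketch returns well before the deadline and typically reports two or more checks. A report that shows a timeout, a single check and a nextTicket at or past the awaited ticket corresponds to the symptom described above: the wake-up was issued but the parked thread never ran again until the timeout.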
The same test was run with a modified version of HotSpot provided by Dave Dice. The output for that run can be found in out-thr64-nomembar-newjdk7. Look for the lines:
waiting for: 1127
nextTicket: 1127
last ticket seen: 1106
checked ticket: 1 times
[-1, notified by: 0]
head->*[1127, notified by: 1127]
which indicate exactly the same symptoms: the node was notified by the correct peer ticket holder, but its thread failed to wake up until the entire 30 seconds had expired. (head pointing to this node is a correct condition: head always points to the last waiter list node that was notified.) "last ticket seen" and "checked ticket" indicate exactly the same problem: the thread parked legitimately and did not wake up between the first call to park and the expiry of the 30-second timeout.