Bug ID: JDK-8029453 java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java failed by timeout

JDK 8	JDK 9
8u72Fixed	9 b81Fixed

noreg-other: existing tests are sufficient to exercise this change ie java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java
18-08-2015
Note: this is not a P2 in 7u or 8u as the workaround is enabled by default.
11-08-2015
The bug was introduced with the fix for JDK-6900441, but unless someone disabled the workaround (as done in JDK-8130728) they would not see it. Nevertheless this should be backported to all releases that contain 6900441 - 9, 8u and 7u.
11-08-2015
In the PlatformParker code there is a bug in the unlock-before-signal path where we read the value of _cur_index after releasing the lock, which allows the waiting thread to reset it to -1. We then call pthread_cond_signal(_cond[-1]) and so treat "random" memory as a condition variable. This actually happens "all the time" but we only rarely see that piece of memory contain values that lead to this hang.
11-08-2015
The observed failure mode is consistent with using a condition variable that has been destroyed, but we do not ever destroy them (with the workaround disabled). The ParkEvent/PlatformParker objects that encapsulate the condition and mutex are "immortal".
11-08-2015
I was able to gcore a hung process and run jstack -m against the core file. Unfortunately this showed nothing interesting as most of the app threads simply report: ----------------- 22608 ----------------- 0x00007fa0ee7a1d84 __pthread_cond_wait + 0xc4 0x00007fa0eda096bd Unsafe_Park + 0x11d 0x00007fa0dd0d3fba <Unknown compiled code> Only interesting thread stack: ----------------- 22576 ----------------- 0x00007fa0ee7a489c __lll_lock_wait + 0x1c 0x00007fa0dd0d37cc <Unknown compiled code> 0x00007fa0d5c9e944 * java.util.concurrent.locks.LockSupport.unpark(java.lang.Thread) bci:8 line:141 (Compiled frame) * java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:80 line:662 (Compiled frame) * java.util.concurrent.locks.AbstractQueuedSynchronizer.cancelAcquire(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:145 line:779 (Interpreted frame) gdb also shows: (gdb) where #0 0x00007fa0ee7a489c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00007fa0ee7a25bc in pthread_cond_signal@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x00007fa0eda067d9 in Unsafe_Unpark () at /export/users/dh198349/jdk9-hs-rt/hotspot/src/share/vm/prims/unsafe.cpp:1170 So it appears the pthread_cond_signal, which happens after the pthread_mutex_unlock is hanging trying to acquire the cond vars internal lock. There is no other thread inside the futex implementation.
11-08-2015
Currently testing the two different parts to the workaround separately. This uses a modified VM that enables the workaround in two parts. Setting: WorkAroundNPTLTimedWaitHang=0 Action: a) skip condition variable destroy and reinitialization b) perform signal after unlock Result: Hang on iteration 354, Setting: WorkAroundNPTLTimedWaitHang=1 Action: a) skip condition variable destroy and reinitialization b) perform signal before unlock Result: 5000 iterations - no hang Setting: WorkAroundNPTLTimedWaitHang=2 Action: a) perform condition variable destroy and reinitialization b) perform signal before unlock Result: 5000 iterations - no hang
10-08-2015
Problem is seen with: GNU C Library stable release version 2.12, by Roland McGrath et al. Copyright (C) 2010 Free Software Foundation, Inc. and also GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10) stable release version 2.15, by Roland McGrath et al. Copyright (C) 2012 Free Software Foundation, Inc.
10-08-2015
Confirmed that "kill -STOP" followed by "kill -CONT" unblocks the hang. This is reminiscent of a futex bug we saw on the real-time linux kernel many years ago.
10-08-2015
Note that the current b75 related failures are completely unrelated to the original failures for which this bug was opened. Not sure it was a good idea to re-open such an old bug.
10-08-2015
Without JDK-8130728, no hung happen. Tested at two Linux machines, run TimeoutLockLoops.java directly. hostA: * b75, hung reproducible within 100 runs * b75, run with special vm options "-XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1", 10,000 runs completed, all pass, no hung hostB: * With latest dev, hung reproducible within 100 runs * In the same repo, backout JDK-8130728, 10,000 runs completed, all pass, no hung
07-08-2015
Amy: can you re-run your tests adding " -XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1" and see if you can reproduce any failures, please. There were two parts to the workaround and I suspect it may be the second part (signal after unlock) that might either lead to bug in the VM or expose a bug in linux libraries.
07-08-2015
As much as it pains me, this could be related to: JDK-8130728: Disable WorkAroundNPTLTimedWaitHang by default The problem I saw on b74 was with the work around disabled, so it looked like a b75 build would.
07-08-2015
First 1000 runs ok. Somewhere in the next 1000 thousand: Threads: 1 0.200908044s run time Threads: 2 0.200394306s run time Threads: 3 0.200419979s run time Threads: 5 0.200544475s run time Threads: 8 0.200756379s run time Threads: 12 0.200976226s run time Threads: 18 0.201126043s run time Threads: 27 0.201516376s run time Threads: 41 Top shows the process is inactive but massive heap usage! 28562 daholme 20 0 39.4g 188m 10m S 0 0.1 0:02.29 java pstack tool fails to works: 28562: /scratch/dh198349/jdk9-hs-rt/build/b00/se-linux-x64-internal/images/jdk/bin/java -cp JTwork/lasses/:JTwork/classes/java/util... (No symbols found in ) (No symbols found in /lib/x86_64-linux-gnu/libdl.so.2) (No symbols found in /lib/x86_64-linux-gnu/libc.so.6) (No symbols found in /lib64/ld-linux-x86-64.so.2) (No symbols found in /lib/x86_64-linux-gnu/libm.so.6) (No symbols found in /lib/x86_64-linux-gnu/librt.so.1) (No symbols found in /lib/x86_64-linux-gnu/libnss_compat.so.2) (No symbols found in /lib/x86_64-linux-gnu/libnsl.so.1) (No symbols found in /lib/x86_64-linux-gnu/libnss_nis.so.2) (No symbols found in /lib/x86_64-linux-gnu/libnss_files.so.2) 0x7ffd34239148: _fini + 0x7ffcffff7d98 (0, 1000, 0, 100000, 0, 0) + fffffffdc957bcfc 0x00000000: ???? (850000004c25048b, 252c8944645174c0, 24548b480000004c, 1c74d28548c03110, 38b480374db8548, 452b60f08420348) + 1824848e307dd194 crawl: Input/output error Error tracing through process 28562 0x646475c085000000: _fini + 0x646475c050dbec50 gdb showed all thread stacks - but nothing helpful. As you might guess all threads are "waiting" but we can't see what for exactly. Once gdb detached the test continued: Threads: 41 2106.267683571s run time
07-08-2015
I tried such an experiment locally yesterday and I had one case, on b74, where the test hung for a long period of time. During that time I could also not get a stack dump (ctrl-/) but eventually I got the dump and saw the test complete: Threads: 12^\ 172.847054257s run time Threads: 18 2015-08-06 03:32:40 Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.9.0-ea-b74 mixed mode): ... However I was running over NFS so that may have contributed to the problem. I will try with a fresh local build.
07-08-2015
The Socket read timeout from the agent seems unrelated to the test. What code is failing? Looks more like an infrastructure issue.
06-08-2015
Also failed recently on linux-i586 test run: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-i586-failuresHistory/Failure_History/
29-07-2015
This test, java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java, has been failing intermittently only with linux-x64 builds on mach5 tier1 testing since this set of changes: http://java.se.oracle.com/mach5/job/9-dev-prebuild/604/ Failure history is here: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-x64-failuresHistory/Failure_History/ Note that test run #2321 corresponds to prebuild #604(noted 2 lines above). The .jfr has no more useful info than, "execStatus=Error. Agent communication error\: java.net.SocketTimeoutException\: Read timed out; check console log for any additional details"
22-07-2015

Relates :	JDK-8133588 - Place TimeoutLockLoops.java on the problem list
Relates :	JDK-6900441 - PlatformEvent.park(millis) on Linux could still be affected by changes to the time-of-day clock
Relates :	JDK-8133611 - Remove java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java from problem list
Relates :	JDK-8130728 - Disable WorkAroundNPTLTimedWaitHang by default
Relates :	JDK-8133231 - Mark TimeoutLockLoops.java as failing intermittently