JDK-8029453 : java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java failed by timeout
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 7u91, 8u66, 9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2013-12-03
  • Updated: 2017-08-17
  • Resolved: 2015-08-14
Fix Versions: 8u72 (Fixed), 9 b81 (Fixed)
TimeoutLockLoops.java failed by timeout on linux 32-bit, jdk8 b118.

The test has never failed before, so it could be an incidental failure rather than a bug.
The jtr is attached. Some quotes:

warning: [options] bootstrap class path not set in conjunction with -source 1.5
warning: [options] source value 1.5 is obsolete and will be removed in a future release
warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
3 warnings
result: Passed. Compilation successful

command: build TimeoutLockLoops
reason: Named class compiled on demand
elapsed time (seconds): 0.0
result: Passed. All files up to date

command: main TimeoutLockLoops
reason: User specified action: run main TimeoutLockLoops 
elapsed time (seconds): 960.008
Threads: 1	 0.251106581s run time
Threads: 2	 0.202150581s run time
Threads: 3	 0.206090516s run time
Threads: 5	 0.210550358s run time
Threads: 8	 1.457507195s run time
Threads: 12	 153.525323368s run time
Threads: 18	 0.233047631s run time
Threads: 27	 0.213040926s run time
Threads: 41	 695.364596734s run time

noreg-other: existing tests are sufficient to exercise this change, i.e. java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java

Note: this is not a P2 in 7u or 8u as the workaround is enabled by default.

The bug was introduced with the fix for JDK-6900441, but unless someone disabled the workaround (as done in JDK-8130728) they would not see it. Nevertheless, this should be backported to all releases that contain JDK-6900441: 9, 8u and 7u.

In the PlatformParker code there is a bug in the unlock-before-signal path where we read the value of _cur_index after releasing the lock, which allows the waiting thread to reset it to -1. We then call pthread_cond_signal(_cond[-1]) and so treat "random" memory as a condition variable. This actually happens "all the time" but we only rarely see that piece of memory contain values that lead to this hang.

The observed failure mode is consistent with using a condition variable that has been destroyed, but we do not ever destroy them (with the workaround disabled). The ParkEvent/PlatformParker objects that encapsulate the condition and mutex are "immortal".

I was able to gcore a hung process and run jstack -m against the core file. Unfortunately this showed nothing interesting, as most of the app threads simply report:

----------------- 22608 -----------------
0x00007fa0ee7a1d84	__pthread_cond_wait + 0xc4
0x00007fa0eda096bd	Unsafe_Park + 0x11d
0x00007fa0dd0d3fba	<Unknown compiled code>

The only interesting thread stack:

----------------- 22576 -----------------
0x00007fa0ee7a489c	__lll_lock_wait + 0x1c
0x00007fa0dd0d37cc	<Unknown compiled code>
0x00007fa0d5c9e944	* java.util.concurrent.locks.LockSupport.unpark(java.lang.Thread) bci:8 line:141 (Compiled frame)
* java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:80 line:662 (Compiled frame)
* java.util.concurrent.locks.AbstractQueuedSynchronizer.cancelAcquire(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:145 line:779 (Interpreted frame)

gdb also shows:

(gdb) where
#0  0x00007fa0ee7a489c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa0ee7a25bc in pthread_cond_signal@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa0eda067d9 in Unsafe_Unpark () at /export/users/dh198349/jdk9-hs-rt/hotspot/src/share/vm/prims/unsafe.cpp:1170

So it appears the pthread_cond_signal, which happens after the pthread_mutex_unlock, is hanging trying to acquire the cond var's internal lock. There is no other thread inside the futex implementation.

Currently testing the two different parts of the workaround separately. This uses a modified VM that enables the workaround in two parts.

Setting: WorkAroundNPTLTimedWaitHang=0
Action: a) skip condition variable destroy and reinitialization; b) perform signal after unlock
Result: hang on iteration 354

Setting: WorkAroundNPTLTimedWaitHang=1
Action: a) skip condition variable destroy and reinitialization; b) perform signal before unlock
Result: 5000 iterations, no hang

Setting: WorkAroundNPTLTimedWaitHang=2
Action: a) perform condition variable destroy and reinitialization; b) perform signal before unlock
Result: 5000 iterations, no hang

Problem is seen with:

GNU C Library stable release version 2.12, by Roland McGrath et al. Copyright (C) 2010 Free Software Foundation, Inc.

and also:

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10) stable release version 2.15, by Roland McGrath et al. Copyright (C) 2012 Free Software Foundation, Inc.

Confirmed that "kill -STOP" followed by "kill -CONT" unblocks the hang. This is reminiscent of a futex bug we saw on the real-time linux kernel many years ago.

Note that the current b75 related failures are completely unrelated to the original failures for which this bug was opened. Not sure it was a good idea to re-open such an old bug.

Without JDK-8130728, no hang occurs. Tested on two Linux machines, running TimeoutLockLoops.java directly.

hostA:
* b75: hang reproducible within 100 runs
* b75 with the special VM options "-XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1": 10,000 runs completed, all pass, no hang

hostB:
* With latest dev: hang reproducible within 100 runs
* In the same repo with JDK-8130728 backed out: 10,000 runs completed, all pass, no hang

Amy: can you re-run your tests adding "-XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1" and see if you can reproduce any failures, please. There were two parts to the workaround, and I suspect it may be the second part (signal after unlock) that either hits a bug in the VM or exposes a bug in the Linux libraries.

As much as it pains me, this could be related to JDK-8130728 ("Disable WorkAroundNPTLTimedWaitHang by default"). The problem I saw on b74 was with the workaround disabled, so it behaved as a b75 build would.

First 1000 runs ok. Somewhere in the next 1000:

Threads: 1	0.200908044s run time
Threads: 2	0.200394306s run time
Threads: 3	0.200419979s run time
Threads: 5	0.200544475s run time
Threads: 8	0.200756379s run time
Threads: 12	0.200976226s run time
Threads: 18	0.201126043s run time
Threads: 27	0.201516376s run time
Threads: 41

top shows the process is inactive but with massive heap usage:

28562 daholme   20   0 39.4g 188m  10m S    0  0.1   0:02.29 java

The pstack tool fails to work:

28562: /scratch/dh198349/jdk9-hs-rt/build/b00/se-linux-x64-internal/images/jdk/bin/java -cp JTwork/lasses/:JTwork/classes/java/util...
(No symbols found in )
(No symbols found in /lib/x86_64-linux-gnu/libdl.so.2)
(No symbols found in /lib/x86_64-linux-gnu/libc.so.6)
(No symbols found in /lib64/ld-linux-x86-64.so.2)
(No symbols found in /lib/x86_64-linux-gnu/libm.so.6)
(No symbols found in /lib/x86_64-linux-gnu/librt.so.1)
(No symbols found in /lib/x86_64-linux-gnu/libnss_compat.so.2)
(No symbols found in /lib/x86_64-linux-gnu/libnsl.so.1)
(No symbols found in /lib/x86_64-linux-gnu/libnss_nis.so.2)
(No symbols found in /lib/x86_64-linux-gnu/libnss_files.so.2)
0x7ffd34239148: _fini + 0x7ffcffff7d98 (0, 1000, 0, 100000, 0, 0) + fffffffdc957bcfc
0x00000000: ???? (850000004c25048b, 252c8944645174c0, 24548b480000004c, 1c74d28548c03110, 38b480374db8548, 452b60f08420348) + 1824848e307dd194
crawl: Input/output error
Error tracing through process 28562
0x646475c085000000: _fini + 0x646475c050dbec50

gdb showed all thread stacks, but nothing helpful. As you might guess, all threads are "waiting", but we can't see what for exactly. Once gdb detached, the test continued:

Threads: 41	2106.267683571s run time

I tried such an experiment locally yesterday, and I had one case, on b74, where the test hung for a long period of time. During that time I could also not get a stack dump (ctrl-\), but eventually I got the dump and saw the test complete:

Threads: 12^\	172.847054257s run time
Threads: 18
2015-08-06 03:32:40
Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.9.0-ea-b74 mixed mode):
...

However, I was running over NFS, so that may have contributed to the problem. I will try with a fresh local build.

The Socket read timeout from the agent seems unrelated to the test. What code is failing? Looks more like an infrastructure issue.

Also failed recently on linux-i586 test run: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-i586-failuresHistory/Failure_History/

This test, java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java, has been failing intermittently, only with linux-x64 builds, on mach5 tier1 testing since this set of changes: http://java.se.oracle.com/mach5/job/9-dev-prebuild/604/ Failure history is here: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-x64-failuresHistory/Failure_History/ Note that test run #2321 corresponds to prebuild #604 (noted two lines above). The .jtr has no more useful info than: "execStatus=Error. Agent communication error\: java.net.SocketTimeoutException\: Read timed out; check console log for any additional details"