JDK-8029453 : java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java failed by timeout
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 7u91,8u66,9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2013-12-03
  • Updated: 2017-08-17
  • Resolved: 2015-08-14
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 8 JDK 9
8u72Fixed 9 b81Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
TimeoutLockLoops.java failed on linux 32  jdk8 b118 by timeout


The test has never failed before. So, it's could be an incidental failure, not a bug.
The jtr is attached. Some quotes:

----------System.err:(4/264)----------
warning: [options] bootstrap class path not set in conjunction with -source 1.5
warning: [options] source value 1.5 is obsolete and will be removed in a future release
warning: [options] To suppress warnings about obsolete options, use -Xlint:-options.
3 warnings
result: Passed. Compilation successful

#section:build
----------messages:(3/99)----------
command: build TimeoutLockLoops
reason: Named class compiled on demand
elapsed time (seconds): 0.0
result: Passed. All files up to date

#section:main
----------messages:(3/121)----------
command: main TimeoutLockLoops
reason: User specified action: run main TimeoutLockLoops 
elapsed time (seconds): 960.008
----------System.out:(9/314)----------
Threads: 1	 0.251106581s run time
Threads: 2	 0.202150581s run time
Threads: 3	 0.206090516s run time
Threads: 5	 0.210550358s run time
Threads: 8	 1.457507195s run time
Threads: 12	 153.525323368s run time
Threads: 18	 0.233047631s run time
Threads: 27	 0.213040926s run time
Threads: 41	 695.364596734s run time


Comments
noreg-other: existing tests are sufficient to exercise this change ie java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java
18-08-2015

Note: this is not a P2 in 7u or 8u as the workaround is enabled by default.
11-08-2015

The bug was introduced with the fix for JDK-6900441, but unless someone disabled the workaround (as done in JDK-8130728) they would not see it. Nevertheless this should be backported to all releases that contain 6900441 - 9, 8u and 7u.
11-08-2015

In the PlatformParker code there is a bug in the unlock-before-signal path where we read the value of _cur_index after releasing the lock, which allows the waiting thread to reset it to -1. We then call pthread_cond_signal(_cond[-1]) and so treat "random" memory as a condition variable. This actually happens "all the time" but we only rarely see that piece of memory contain values that lead to this hang.
11-08-2015

The observed failure mode is consistent with using a condition variable that has been destroyed, but we do not ever destroy them (with the workaround disabled). The ParkEvent/PlatformParker objects that encapsulate the condition and mutex are "immortal".
11-08-2015

I was able to gcore a hung process and run jstack -m against the core file. Unfortunately this showed nothing interesting as most of the app threads simply report: ----------------- 22608 ----------------- 0x00007fa0ee7a1d84 __pthread_cond_wait + 0xc4 0x00007fa0eda096bd Unsafe_Park + 0x11d 0x00007fa0dd0d3fba <Unknown compiled code> Only interesting thread stack: ----------------- 22576 ----------------- 0x00007fa0ee7a489c __lll_lock_wait + 0x1c 0x00007fa0dd0d37cc <Unknown compiled code> 0x00007fa0d5c9e944 * java.util.concurrent.locks.LockSupport.unpark(java.lang.Thread) bci:8 line:141 (Compiled frame) * java.util.concurrent.locks.AbstractQueuedSynchronizer.unparkSuccessor(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:80 line:662 (Compiled frame) * java.util.concurrent.locks.AbstractQueuedSynchronizer.cancelAcquire(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node) bci:145 line:779 (Interpreted frame) gdb also shows: (gdb) where #0 0x00007fa0ee7a489c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00007fa0ee7a25bc in pthread_cond_signal@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x00007fa0eda067d9 in Unsafe_Unpark () at /export/users/dh198349/jdk9-hs-rt/hotspot/src/share/vm/prims/unsafe.cpp:1170 So it appears the pthread_cond_signal, which happens after the pthread_mutex_unlock is hanging trying to acquire the cond vars internal lock. There is no other thread inside the futex implementation.
11-08-2015

Currently testing the two different parts to the workaround separately. This uses a modified VM that enables the workaround in two parts. Setting: WorkAroundNPTLTimedWaitHang=0 Action: a) skip condition variable destroy and reinitialization b) perform signal after unlock Result: Hang on iteration 354, Setting: WorkAroundNPTLTimedWaitHang=1 Action: a) skip condition variable destroy and reinitialization b) perform signal before unlock Result: 5000 iterations - no hang Setting: WorkAroundNPTLTimedWaitHang=2 Action: a) perform condition variable destroy and reinitialization b) perform signal before unlock Result: 5000 iterations - no hang
10-08-2015

Problem is seen with: GNU C Library stable release version 2.12, by Roland McGrath et al. Copyright (C) 2010 Free Software Foundation, Inc. and also GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10) stable release version 2.15, by Roland McGrath et al. Copyright (C) 2012 Free Software Foundation, Inc.
10-08-2015

Confirmed that "kill -STOP" followed by "kill -CONT" unblocks the hang. This is reminiscent of a futex bug we saw on the real-time linux kernel many years ago.
10-08-2015

Note that the current b75 related failures are completely unrelated to the original failures for which this bug was opened. Not sure it was a good idea to re-open such an old bug.
10-08-2015

Without JDK-8130728, no hung happen. Tested at two Linux machines, run TimeoutLockLoops.java directly. hostA: * b75, hung reproducible within 100 runs * b75, run with special vm options "-XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1", 10,000 runs completed, all pass, no hung hostB: * With latest dev, hung reproducible within 100 runs * In the same repo, backout JDK-8130728, 10,000 runs completed, all pass, no hung
07-08-2015

Amy: can you re-run your tests adding " -XX:+UnlockExperimentalVMOptions -XX:WorkAroundNPTLTimedWaitHang=1" and see if you can reproduce any failures, please. There were two parts to the workaround and I suspect it may be the second part (signal after unlock) that might either lead to bug in the VM or expose a bug in linux libraries.
07-08-2015

As much as it pains me, this could be related to: JDK-8130728: Disable WorkAroundNPTLTimedWaitHang by default The problem I saw on b74 was with the work around disabled, so it looked like a b75 build would.
07-08-2015

First 1000 runs ok. Somewhere in the next 1000 thousand: Threads: 1 0.200908044s run time Threads: 2 0.200394306s run time Threads: 3 0.200419979s run time Threads: 5 0.200544475s run time Threads: 8 0.200756379s run time Threads: 12 0.200976226s run time Threads: 18 0.201126043s run time Threads: 27 0.201516376s run time Threads: 41 Top shows the process is inactive but massive heap usage! 28562 daholme 20 0 39.4g 188m 10m S 0 0.1 0:02.29 java pstack tool fails to works: 28562: /scratch/dh198349/jdk9-hs-rt/build/b00/se-linux-x64-internal/images/jdk/bin/java -cp JTwork/lasses/:JTwork/classes/java/util... (No symbols found in ) (No symbols found in /lib/x86_64-linux-gnu/libdl.so.2) (No symbols found in /lib/x86_64-linux-gnu/libc.so.6) (No symbols found in /lib64/ld-linux-x86-64.so.2) (No symbols found in /lib/x86_64-linux-gnu/libm.so.6) (No symbols found in /lib/x86_64-linux-gnu/librt.so.1) (No symbols found in /lib/x86_64-linux-gnu/libnss_compat.so.2) (No symbols found in /lib/x86_64-linux-gnu/libnsl.so.1) (No symbols found in /lib/x86_64-linux-gnu/libnss_nis.so.2) (No symbols found in /lib/x86_64-linux-gnu/libnss_files.so.2) 0x7ffd34239148: _fini + 0x7ffcffff7d98 (0, 1000, 0, 100000, 0, 0) + fffffffdc957bcfc 0x00000000: ???? (850000004c25048b, 252c8944645174c0, 24548b480000004c, 1c74d28548c03110, 38b480374db8548, 452b60f08420348) + 1824848e307dd194 crawl: Input/output error Error tracing through process 28562 0x646475c085000000: _fini + 0x646475c050dbec50 gdb showed all thread stacks - but nothing helpful. As you might guess all threads are "waiting" but we can't see what for exactly. Once gdb detached the test continued: Threads: 41 2106.267683571s run time
07-08-2015

I tried such an experiment locally yesterday and I had one case, on b74, where the test hung for a long period of time. During that time I could also not get a stack dump (ctrl-/) but eventually I got the dump and saw the test complete: Threads: 12^\ 172.847054257s run time Threads: 18 2015-08-06 03:32:40 Full thread dump Java HotSpot(TM) 64-Bit Server VM (1.9.0-ea-b74 mixed mode): ... However I was running over NFS so that may have contributed to the problem. I will try with a fresh local build.
07-08-2015

The Socket read timeout from the agent seems unrelated to the test. What code is failing? Looks more like an infrastructure issue.
06-08-2015

Also failed recently on linux-i586 test run: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-i586-failuresHistory/Failure_History/
29-07-2015

This test, java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java, has been failing intermittently only with linux-x64 builds on mach5 tier1 testing since this set of changes: http://java.se.oracle.com/mach5/job/9-dev-prebuild/604/ Failure history is here: http://java.se.oracle.com/mach5/view/All/job/9-dev-tier1-linux-x64-failuresHistory/Failure_History/ Note that test run #2321 corresponds to prebuild #604(noted 2 lines above). The .jfr has no more useful info than, "execStatus=Error. Agent communication error\: java.net.SocketTimeoutException\: Read timed out; check console log for any additional details"
22-07-2015