JDK-7011859 : java/util/concurrent/Semaphore/RacingReleases.java failing
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 7,8
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2011-01-12
  • Updated: 2013-10-25
  • Resolved: 2013-10-10
  • Fix Version: JDK 8 b113 (Fixed)
Description
Test 
java/util/concurrent/Semaphore/RacingReleases.java
fails with
java.lang.Error: Semaphore stuck: permits 1, thread waiting true

Reproduced with fastdebug only (after several iterations).

See http://sqeweb.sfbay/nfs/results/vm/gtee/HSX/PIT/VM/hs20/06/jdk7b125/fastdebug/javase/solaris-sparc/client/mixed/solaris-sparc_javase__client_mixed_JT_JDK_java_util/analysis.html

JDK under test: 
java version "1.7.0-ea-fastdebug"
Java(TM) SE Runtime Environment (build 1.7.0-ea-fastdebug-b124)
Java HotSpot(TM) Server VM (build 20.0-b06-internal-201101080836.et151817.hs20b06-ci-fastdebug, mixed mode)


Tail of log is:

----------System.err:(16/819)----------
java.lang.Error: Semaphore stuck: permits 1, thread waiting true

	at RacingReleases.test(RacingReleases.java:92)
	at RacingReleases.main(RacingReleases.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:613)
	at com.sun.javatest.regtest.MainWrapper$MainThread.run(MainWrapper.java:94)
	at java.lang.Thread.run(Thread.java:732)

Comments
Thanks for the empirical check. I committed this update to the jsr166 repository. We are still not sure how the situation leading to this can legitimately occur, but in any case it is now covered.
08-10-2013

So I'm running the code on the machine that failed most for me... it works so far (3 hours). It obviously addresses the case shown by the trace. Will see if someone from the core-libs team can take this further...
08-10-2013

Updated with a little further instrumentation (queue state):

Ring buffer (currentIndex=23) + RingBuffer@4b67cf4d
    0 : CurrThread: 1867 DO ACQUIRE ENTER-INTR
    1 : CurrThread: 1866 DO ACQUIRE ENTER-INTR
    2 : CurrThread: 1867 SET HEAD (DEQ) NodeId:963 CAS success
    3 : CurrThread: 1867 SET TAIL (ENQ) Expected NodeId: 963 -> Update NodeId: 962 CAS success
    4 : CurrThread: 1866 SET HEAD (DEQ) NodeId:965 CAS fail
    5 : CurrThread: 1867 SET WAIT STAT NodeId:963 Node.Thread 0 Expected: Unset(0) -> Update: SIGNAL(-1) CAS success
    6 : CurrThread: 1866 SET TAIL (ENQ) Expected NodeId: 962 -> Update NodeId: 964 CAS success
    7 : CurrThread: 1868 SET STATE NodeId:0 EX 0 -> UP 1 CAS success
->  8 : CurrThread: 1867 SET STATE NodeId:0 EX 1 -> UP 0 CAS success
    9 : CurrThread: 1868 DO RELEASE ENTER
   10 : CurrThread: 1866 SET WAIT STAT NodeId:962 Node.Thread 1867 Expected: Unset(0) -> Update: SIGNAL(-1) CAS success
   11 : CurrThread: 1868 SET WAIT STAT NodeId:963 Node.Thread 0 Expected: SIGNAL(-1) -> Update: Unset(0) CAS success
   12 : CurrThread: 1869 SET STATE NodeId:0 EX 0 -> UP 1 CAS success
   13 : CurrThread: 1869 DO RELEASE ENTER
   14 : CurrThread: 1869 SET WAIT STAT NodeId:963 Node.Thread 0 Expected: Unset(0) -> Update: PROPAGATE(-3) CAS success
   15 : CurrThread: 1869 DO RELEASE EXIT
   16 : CurrThread: 1868 SET WAIT STAT NodeId:963 Node.Thread 0 Expected: PROPAGATE(-3) -> Update: Unset(0) CAS success
   17 : CurrThread: 1868 UNPARK NodeId:962 Node.Thread 1867
   18 : CurrThread: 1868 DO RELEASE EXIT
-> 19 : CurrThread: 1867 SET HEAD (DEQ) NodeId:962 volatile store
-> 20 : CurrThread: 1867 PROPAGATE NodeId:962 H NodeId: 963 H waitStatus 0 propagate 0
   21 : CurrThread: 1867 DO ACQUIRE EXIT
#  22 : CurrThread: 1866 PARK ENTER

The acquire operation in thread 1867, at step 8, leaves tryAcquireShared in doAcquireSharedInterruptibly with r == 0; two release operations come and go before 1867 enters setHeadAndPropagate(node, 0) (step 19) and fails the propagate test (step 20). Finally, thread 1866 is left behind.
07-10-2013
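
For readers mapping the trace above onto the source: step 8 is thread 1867 taking the permit inside tryAcquireShared (called from doAcquireSharedInterruptibly with result r == 0), and steps 19-20 are its later call to setHeadAndPropagate(node, 0). Below is a simplified, annotated excerpt of the pre-fix setHeadAndPropagate logic from the JDK 7/8-era AbstractQueuedSynchronizer (abridged for illustration; the annotations are editorial, the code itself matches the check quoted in the comments below):

    // setHeadAndPropagate, pre-fix:
    Node h = head;   // old head (NodeId 963), read before setHead
    setHead(node);   // step 19: NodeId 962 becomes the new head
    // Step 20: the test below is false: propagate == 0, and the old head's
    // waitStatus has already been CAS'd back to 0 by the releasing threads
    // (steps 11 and 16), so doReleaseShared() is skipped and the remaining
    // waiter (thread 1866, NodeId 964) is never unparked.
    if (propagate > 0 || h == null || h.waitStatus < 0) {
        Node s = node.next;
        if (s == null || s.isShared())
            doReleaseShared();
    }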

See 7011859/notes.txt
07-10-2013

Thanks for the extra trace. The situation here is not expected, but the additional race can be covered by performing a re-read of head. Could someone please try to replicate with the following. (I have never seen this on 5 different test machines, so I cannot check.)

--- a/src/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java	Thu Sep 19 09:37:13 2013 -0700
+++ b/src/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java	Mon Oct 07 08:37:50 2013 -0400
@@ -723,7 +723,8 @@
          * racing acquires/releases, so most need signals now or soon
          * anyway.
          */
-        if (propagate > 0 || h == null || h.waitStatus < 0) {
+        if (propagate > 0 || h == null || h.waitStatus < 0 ||
+            (h = head) == null || h.waitStatus < 0) {
             Node s = node.next;
             if (s == null || s.isShared())
                 doReleaseShared();

Or in full:

    private void setHeadAndPropagate(Node node, int propagate) {
        Node h = head; // Record old head for check below
        setHead(node);
        /*
         * Try to signal next queued node if:
         *   Propagation was indicated by caller,
         *     or was recorded (as h.waitStatus) by a previous operation
         *     (note: this uses sign-check of waitStatus because
         *     PROPAGATE status may transition to SIGNAL.)
         * and
         *   The next node is waiting in shared mode,
         *     or we don't know, because it appears null
         *
         * The conservatism in both of these checks may cause
         * unnecessary wake-ups, but only when there are multiple
         * racing acquires/releases, so most need signals now or soon
         * anyway.
         */
        if (propagate > 0 || h == null || h.waitStatus < 0 ||
            (h = head) == null || h.waitStatus < 0) {
            Node s = node.next;
            if (s == null || s.isShared())
                doReleaseShared();
        }
    }
07-10-2013
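
A brief reading of why the re-read of head covers the race (an interpretation, not part of the original comment): after setHead(node), head refers to the node that has just acquired, so the added clause re-checks that node's own waitStatus; if a racing release marked it SIGNAL or PROPAGATE between the first read of h and setHead, doReleaseShared() now runs. The same code as in the proposed change, with explanatory comments added here for illustration:

    Node h = head;                            // old head, read before setHead(node)
    setHead(node);                            // from here on, head == node
    if (propagate > 0
        || h == null || h.waitStatus < 0      // a signal was recorded on the old head
        || (h = head) == null                 // re-read: head is now the node just installed;
        || h.waitStatus < 0) {                // a racing release may have marked it
        Node s = node.next;                   // SIGNAL/PROPAGATE after the first read of h,
        if (s == null || s.isShared())        // so conservatively wake shared successors
            doReleaseShared();
    }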

This also failed on Solaris 11 in the jdk8/tl nightly. This failure was with a product build (not a fastdebug build).

#Test Results (version 2)
#Wed Aug 15 21:19:43 GMT-08:00 2012
#checksum:47132cb07591db33
#-----testdescription-----
$file=/export/home/aurora/sandbox/testbase/test/java/util/concurrent/Semaphore/RacingReleases.java
$root=/export/home/aurora/sandbox/testbase/test
keywords=bug6801020 bug6803402
run=ASSUMED_ACTION main RacingReleases\n
source=RacingReleases.java
title=Try to tickle race conditions in AbstractQueuedSynchronizer "shared" code
#-----environment-----
#-----testresult-----
description=file\:/export/home/aurora/sandbox/testbase/test/java/util/concurrent/Semaphore/RacingReleases.java
elapsed=60386 0\:01\:00.386
end=Wed Aug 15 21\:19\:43 GMT-08\:00 2012
environment=regtest
execStatus=Failed. Execution failed\: `main' threw exception\: java.lang.Error\: Semaphore stuck\: permits 1, thread waiting true
hostname=sc14160169.us.oracle.com
javatestOS=SunOS 5.11 (sparc)
javatestVersion=4.4
jtregVersion=jtreg 4.1 fcs b04
script=com.sun.javatest.regtest.RegressionScript
sections=script_messages build compile main
start=Wed Aug 15 21\:18\:42 GMT-08\:00 2012
test=java/util/concurrent/Semaphore/RacingReleases.java
user.name=aurora
work=/export/home/aurora/sandbox/gresults/testoutput/jdk_util/JTwork/java/util/concurrent/Semaphore

#section:script_messages
----------messages:(4/226)----------
JDK under test: (/export/home/aurora/sandbox/jdk)
java version "1.8.0-ea"
Java(TM) SE Runtime Environment (build 1.8.0-ea-langtools-nightly-h302-20120814-b52-b00)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b19, mixed mode)

#section:build
----------messages:(3/99)----------
command: build RacingReleases
reason: Named class compiled on demand
elapsed time (seconds): 0.121
result: Passed. Build successful

#section:compile
----------messages:(3/189)----------
command: compile /export/home/aurora/sandbox/testbase/test/java/util/concurrent/Semaphore/RacingReleases.java
reason: .class file out of date or does not exist
elapsed time (seconds): 0.12
result: Passed. Compilation successful

#section:main
----------messages:(3/128)----------
command: main RacingReleases
reason: Assumed action based on file name: run main RacingReleases
elapsed time (seconds): 60.262
----------System.err:(14/669)----------
java.lang.Error: Semaphore stuck: permits 1, thread waiting true
    at RacingReleases.test(RacingReleases.java:94)
    at RacingReleases.main(RacingReleases.java:114)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:474)
    at com.sun.javatest.regtest.MainAction$SameVMRunnable.run(MainAction.java:698)
    at java.lang.Thread.run(Thread.java:722)
JavaTest Message: Test threw exception: java.lang.Error
JavaTest Message: shutting down test

result: Failed. Execution failed: `main' threw exception: java.lang.Error: Semaphore stuck: permits 1, thread waiting true

test result: Failed. Execution failed: `main' threw exception: java.lang.Error: Semaphore stuck: permits 1, thread waiting true
05-10-2013

While we can't rule out a Java-level error, the use of "h" here is intentional, to cover a secondary race during cancellation, so it is not itself the problem. I will re-review this in context, though, and see if there is a more conservative construction that avoids triggering this still-undiagnosed problem.
05-10-2013

I've added Doug Lea as a watcher of this issue, so he should see this new comment. Doug: can you assess David Simms' comments and suggested fix? Thanks. David: thanks for the detailed analysis.
03-10-2013

Instrumented source code, output and notes
02-10-2013

So after messing around with "jdk/test/java/util/concurrent/Semaphore/RacingReleases.java" on a number of platforms, I discovered the bug occurs on more than one platform: Solaris/SPARC32 and Linux x86_64. See attached files in 7011859.tbz

*** Reproducer command

I copied RacingReleases and rewrote it slightly as "MyRace":

    while ../jdk8/build/linux-x86_64-normal-server-fastdebug/jdk/bin/java -client -XX:+UseMembar -XX:+ShowMessageBoxOnError MyRace; do echo OK; done

*** Summary

There is one shared Semaphore, initial state/permit = 0. Two threads attempt "acquire" at the same time; two threads attempt "release". The test ends with one of the acquire threads stuck in park, whilst all other threads completed.

*** Runtime versus Java code bug

I don't believe the flags ("-client -XX:+UseMembar") are the cause of the bug. I could of course be wrong, but it seems the worse the code runs (i.e. -client = interpreted/C1, -XX:+UseMembar = full memory fences), plus more and more instrumentation, the easier it is to reproduce. Since this bug occurs on multiple platforms, I find it less likely that the flags have anything to do with this. They simply affect the timing of a concurrency bug.

The change made in JDK-6801020 is indeed (as pointed out by David Holmes) directly related to this problem.
- previous fix: http://hg.openjdk.java.net/jdk7/tl/jdk/rev/5303aece2068

*** Results

For the full results and a summary of the instrumentation, see 7011859.tbz:7011859/notes.txt. To summarize: I'm pretty sure setHeadAndPropagate has a bug, where it fails to propagate:

    if (propagate > 0 || h == null || h.waitStatus < 0) {
        Node s = node.next;
        if (s == null || s.isShared())
            doReleaseShared();
    }

I see the case where propagate = 0, head.waitStatus = -1. But the code uses "h", the old head, which was just removed and is kind of uninteresting. I assume the meaning was the current head's waitStatus, i.e. "head.waitStatus", but I can't be sure that changing "h.waitStatus" to "head.waitStatus" or "node.waitStatus" is totally correct.
02-10-2013
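
For reference, a minimal sketch of the racing pattern described above (an illustration only, not the actual RacingReleases.java or MyRace source; the class name, iteration count, and timeouts are invented):

    import java.util.concurrent.Semaphore;

    public class RaceSketch {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 1_000_000; i++) {
                final Semaphore sem = new Semaphore(0);              // no permits initially
                Thread a1 = new Thread(sem::acquireUninterruptibly); // two racing acquirers
                Thread a2 = new Thread(sem::acquireUninterruptibly);
                Thread r1 = new Thread(sem::release);                // two racing releasers
                Thread r2 = new Thread(sem::release);
                a1.start(); a2.start(); r1.start(); r2.start();
                r1.join(); r2.join();
                // Two permits were released, so both acquirers must eventually return.
                a1.join(10_000); a2.join(10_000);
                if (a1.isAlive() || a2.isAlive())
                    throw new Error("Semaphore stuck: permits " + sem.availablePermits()
                                    + ", thread waiting true");
            }
        }
    }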

Instrumented AbstractQueuedSynchronizer, output and notes. SEE "notes.txt"
02-10-2013

I completed my part (reproduction help) and am assigning this back to Mary to further reassign to a developer. If any more help is needed on this bug, please let me know.
04-09-2013

First read the test sources; if they fail with java.lang.Error on a timeout, please change the test to print a message and hang instead. I would recommend the following VM flags (which should be available in fastdebug): -XX:+ShowMessageBoxOnError -XX:+TraceExceptions -XX:AbortVMOnException=java.lang.RuntimeException. This way we can see a stack trace if we are throwing a runtime exception, which may mean this is a duplicate of the earlier bug.
27-08-2013
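
One possible shape for the suggested test change (a sketch only; the class, method, and variable names here are hypothetical, not taken from RacingReleases.java):

    import java.util.concurrent.Semaphore;

    class StuckReport {
        // Instead of throwing java.lang.Error when the semaphore appears stuck,
        // print the diagnostics and hang, so the wedged state stays available for
        // jstack, a debugger, or -XX:+ShowMessageBoxOnError on a fastdebug VM.
        static void reportStuckAndHang(Semaphore sem, Thread waiter)
                throws InterruptedException {
            System.err.println("Semaphore stuck: permits " + sem.availablePermits()
                               + ", thread waiting " + waiter.isAlive());
            Thread.sleep(Long.MAX_VALUE);   // hang instead of throwing
        }
    }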

Karen, as I noted way above ... Reproduces with b122 fastdebug (which puts it at odds with 7011862 which is not reproducible on b122)
19-07-2013

Please check first to see if this could possibly be a duplicate of 7011862, i.e. a deadlock in java.util.concurrent potentially caused by a runtime error. If so, we should fix it in jdk9 under PLAN-171. 7011862 should not have been closed, but we will revisit a joint effort in jdk9. If this is a different problem it probably needs fixing in 8.
18-07-2013

PUBLIC COMMENTS This test was introduced with 6801020 to test that fix. My suspicion now is that the fix for 6801020 is flawed - that there is still a race possible that can allow a "lost wakeup". We only see it on certain systems and for client fastdebug VM but it is definitely there.
07-06-2011

PUBLIC COMMENTS But does (sometimes) reproduce on a non-Niagara system:

    rialto.ireland : uname -a
    SunOS rialto 5.10 Generic_127127-11 sun4u sparc SUNW,Sun-Fire-V245

which is only a 2-processor system.
17-01-2011

PUBLIC COMMENTS Doesn't reproduce on a non-Niagara system:

    SunOS graham 5.10 Generic_118833-36 sun4u sparc SUNW,Sun-Fire-V890
14-01-2011

PUBLIC COMMENTS Reproduces with b122 fastdebug (which puts it at odds with 7011862, which is not reproducible on b122). Also reproduces with a fastdebug VM containing the fix for 7010665 (C1 membar placement for unsafe ops). Also reproduces with b99. So this seems not to be a regression, which raises the question of why we have not seen this fail before. Is this a new test, or a new config (client fastdebug) on a new machine?
14-01-2011

PUBLIC COMMENTS This only seems to reproduce on fastdebug using -client. It continues to occur with -XX:+UseMembar
14-01-2011