JDK-8151501 : LockSupport/ParkLoops.java: AssertionError: lost unpark
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 9
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2016-03-09
  • Updated: 2016-06-13
  • Resolved: 2016-04-07
JDK 9
9 b114 Fixed
Description
JDK9, VM PIT testing, Linux x86 only.
java/util/concurrent/locks/LockSupport/ParkLoops.java: AssertionError: lost unpark


The same issue was described in JDK-8149708, but it was closed:
---
No failures post 2016-02-11 once JDK-8149697 was fixed. Closing as a duplicate. 
---
Now it appears again.
Comments
I checked the total execution time using jtreg's -verbose:time flag. On a garden-variety Linux box I see results like this: main: 0.382 seconds. It would be good to see how much time it takes on a slow system. Maybe Mac or Windows is much worse? Maybe architectures with expensive volatile reads are much slower? How large an expansion factor should we provide in such a test? 10x is clearly too small. 1000x should be good enough for everybody! Of course, the use of a PRNG here can theoretically introduce arbitrary delays, but in practice I don't think it can be a problem. The law of large numbers applies. The bottleneck should be the many park/unpark cycles.
10-03-2016

Could the use of SplittableRandom lead to those rare entropy pauses? The randomness in the test makes me wonder whether, once in a blue moon, we might encounter a pathology where either we get a lot of CAS contention, or we get a long sequence where a particular index doesn't get selected for unpark. Seems unlikely but not impossible. There are two other cases of this test failing on Feb 11. Both were on linux-x64 running the server VM with -Xcomp.
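The handshake under discussion can be sketched roughly as follows. This is a simplified illustration and not the ParkLoops source: the class name ParkUnparkSketch, the SLOTS and ITERS values, and the single parker/unparker pair are assumptions made for brevity. The parker publishes itself in a randomly chosen slot and parks until the slot is cleared; the unparker picks random slots, clears any occupied one with a CAS, and unparks the owner. A long run of unlucky index choices would only delay a parker, while a genuinely lost unpark would leave it parked until the timeout.

```java
import java.util.SplittableRandom;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.LockSupport;

public class ParkUnparkSketch {
    // Hypothetical sizes, chosen only to keep the sketch fast.
    static final int SLOTS = 4;
    static final int ITERS = 1_000;

    /** Returns true if the parker finished all iterations before the timeout. */
    static boolean run(long timeoutSeconds) {
        AtomicReferenceArray<Thread> slots = new AtomicReferenceArray<>(SLOTS);
        CountDownLatch done = new CountDownLatch(1);

        Thread parker = new Thread(() -> {
            SplittableRandom rnd = new SplittableRandom();
            Thread self = Thread.currentThread();
            for (int i = 0; i < ITERS; i++) {
                int j = rnd.nextInt(SLOTS);
                slots.set(j, self);               // publish: "wake me up"
                while (slots.get(j) == self) {    // loop guards against spurious wakeups
                    LockSupport.park();
                }
            }
            done.countDown();
        });

        Thread unparker = new Thread(() -> {
            SplittableRandom rnd = new SplittableRandom();
            while (done.getCount() > 0) {
                int j = rnd.nextInt(SLOTS);
                Thread t = slots.get(j);
                // Clear the slot first, then unpark; the permit is retained
                // even if the parker has not reached park() yet.
                if (t != null && slots.compareAndSet(j, t, null)) {
                    LockSupport.unpark(t);
                }
            }
        });

        parker.setDaemon(true);
        unparker.setDaemon(true);
        parker.start();
        unparker.start();
        try {
            return done.await(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (!run(60)) {
            throw new AssertionError("lost unpark");
        }
        System.out.println("completed");
    }
}
```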
10-03-2016

--- ParkLoops.java	27 Feb 2016 21:15:57 -0000	1.11
+++ ParkLoops.java	10 Mar 2016 03:44:46 -0000
@@ -10,9 +10,11 @@
  * @summary Stress test looks for lost unparks
  * @library /lib/testlibrary/
  * @modules java.management
+ * @run main/timeout=1200 ParkLoops
  */
 
 import static java.util.concurrent.TimeUnit.MILLISECONDS;
+import static java.util.concurrent.TimeUnit.SECONDS;
 
 import java.lang.management.ManagementFactory;
 import java.lang.management.ThreadInfo;
@@ -26,6 +28,7 @@
 import jdk.testlibrary.Utils;
 
 public final class ParkLoops {
+    static final long TEST_TIMEOUT_SECONDS = Utils.adjustTimeout(1000);
     static final long LONG_DELAY_MS = Utils.adjustTimeout(10_000);
     static final int THREADS = 4;
     static final int ITERS = 30_000;
@@ -103,7 +106,7 @@
             pool.submit(unparker);
         }
         try {
-            if (!done.await(LONG_DELAY_MS, MILLISECONDS)) {
+            if (!done.await(TEST_TIMEOUT_SECONDS, SECONDS)) {
                 dumpAllStacks();
                 throw new AssertionError("lost unpark");
             }
10-03-2016

Looking at the changes to ParkLoops more closely, they were more aggressive than they should have been. There should be a greater margin of safety, although I have not observed any failures myself. We will restore a longer timeout.
10-03-2016

It is hard for me to determine whether JDK-8150523 was present in the failing test case. However, after 8150523 the timeout was significantly shortened: the base timeout is now 10 seconds, and with the timeout factor of 3 applied to the test run that becomes 30 seconds. That said, under unloaded conditions this test takes 1-2 seconds to run (32-bit -client, mixed mode). Will need to wait and see if this reproduces.
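The arithmetic behind those numbers can be sketched as below. The helper here is a hypothetical stand-in that takes the factor as an explicit argument; it is not the test library's Utils.adjustTimeout, and the class name TimeoutScaling is likewise an assumption. It shows the 10-second base scaling to 30 seconds with a factor of 3, and the headroom that leaves over a 2-second unloaded run.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutScaling {
    // Hypothetical helper: scales a base timeout by an explicit factor.
    static long adjustTimeout(long baseTimeout, double factor) {
        return Math.round(baseTimeout * factor);
    }

    // Ratio of the allowed time to the typical (unloaded) run time.
    static double headroom(long timeoutMs, long typicalRunMs) {
        return (double) timeoutMs / typicalRunMs;
    }

    public static void main(String[] args) {
        long baseMs = 10_000;      // base timeout from the comment: 10 seconds
        double factor = 3.0;       // timeout factor applied to the test run
        long adjustedMs = adjustTimeout(baseMs, factor);

        // prints "30 seconds"
        System.out.println(TimeUnit.MILLISECONDS.toSeconds(adjustedMs) + " seconds");

        // With a 2-second unloaded run, the margin is 15x.
        System.out.println(headroom(adjustedMs, 2_000) + "x headroom");
    }
}
```

A 15x margin sits close to the 10x figure that another comment in this thread calls clearly too small for a stress test on a loaded machine.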
10-03-2016

We recently fiddled with timeout handling in this jtreg test, and it's easy to believe we made a mistake. If this is reproducible, determine whether 8150523 is at fault.
10-03-2016

This failure mode occurs upon a timeout. The timeout is 1000 seconds. The test failed on a 4-core machine. The stack dump in the jtr files shows everything apparently executing as expected. I suspect this is just a case of a slow or heavily loaded machine. Will run some comparative experiments.
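For context, a stack dump of this kind can be produced with the standard java.lang.management API. The following is a minimal sketch and not necessarily how the test's own dumpAllStacks() is written; the class name StackDumpSketch is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class StackDumpSketch {
    // Collects a textual dump of all live threads, including locked
    // monitors and ownable synchronizers where the JVM supports them.
    static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        ThreadInfo[] infos =
            ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : infos) {
            // ThreadInfo.toString() includes the thread name, state,
            // and a truncated stack trace.
            sb.append(info);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.err.print(dumpAllStacks());
    }
}
```

In a dump taken at timeout, a lost unpark would typically show a parker thread sitting in LockSupport.park with no unparker making progress, whereas a merely slow run shows threads still cycling.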
10-03-2016