Bug ID: JDK-8253794 TestAbortVMOnSafepointTimeout never timeouts

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 16

Priority: P2
Status: Resolved
Resolution: Fixed

Submitted: 2020-09-29
Updated: 2024-12-04
Resolved: 2020-10-05

JDK 16
16 b19Fixed

[2020-09-29T13:28:55.921199800Z] Gathering output for process 1500948
[2020-09-29T13:29:23.458191300Z] Waiting for completion for process 1500948
[2020-09-29T13:29:23.458191300Z] Waiting for completion finished for process 1500948
----------System.err:(22/1261)----------
 stdout: [CompileCommand: compileonly TestAbortVMOnSafepointTimeout.test_loop
This message would occur after some time with result 715827882
];
 stderr: [Java HotSpot(TM) 64-Bit Server VM warning: Temporarily processing option UseBiasedLocking; support is scheduled for removal in 16.0
]
 exitValue = 0

java.lang.RuntimeException: 'Safepoint sync time longer than' missing from stdout/stderr 

Update: This failure has happened in Tier1 so I'm bumping this
bug from P4 -> P2.

Changeset: c9d0407e Author: Robbin Ehn <rehn@openjdk.org> Date: 2020-10-05 19:16:11 +0000 URL: https://git.openjdk.java.net/jdk/commit/c9d0407e
05-10-2020
> As far as I can tell, the VMThread does not have a timeout mechanism for doing > a handshake poll if the thread never, ever gets to be safe. Does it need one? Yes it obviously does. We know it shouldn't loop forever but that is the whole point of the timeout mechanism - to detect things that should "never" happen.
01-10-2020
The is similar to JDK-8227528, except that now the responsible handshake seems to be HandshakeForDeflation instead of RevokeOneBias.
01-10-2020
> The VMThread should only ever be polling for a thread to become safe if currently in-Java. And that is what is happening. The target thread is in-Java running a loop that intentionally does not pause for a safepoint. If that thread never, ever gets to be safe, then the VMThread never finishes polling (if I understand things correctly). As far as I can tell, the VMThread does not have a timeout mechanism for doing a handshake poll if the thread never, ever gets to be safe. Does it need one? I'm not sure because that is not supposed to happen. In the case of this test, we are (artificially) making it happen to verify that the VM does abort. Robbin has changed the test to cause the safepoint timeout in a different way and that fix is out for review.
01-10-2020
So do you mean that the VMThread is getting "hung" trying to handshake this looping thread (to do what?) before the actual attempted safepoint that would trigger the timeout? If so we are missing a handshake timeout mechanism. The VMThread should only ever be polling for a thread to become safe if currently in-Java.
01-10-2020
To be able to test the safepoint timeout we need a JavaThread that do not reach a safe state. The author of the test choose to create a very very long loop without polls, thus the JavaThread would be stuck unsafe in java state. But now we have handshakes; so an ongoing handshake driven by the VM thread can't finish either since this thread remains unsafe. Thus we never schedule a safepoint since we rely on SafepointALot which is driven by VM thread. So even if we armed the timeout when requester schedules the safepoint this test could fail. The safepoint timeout is started when we do SS::begin() in the check of SafepointTimeout. I have several ideas how to improve stuff here, but let's discuss that in: 8198730.
01-10-2020
[~dholmes] - You mention 8220763 in your comment above, but I think you mean 8212107 since the test failure frequency has increased after that fix was integrated. The TestAbortVMOnSafepointTimeout test is intentionally causing a safepoint timeout by causing the JavaThread to run in a loop that has no safepoint polls. The test is invoked with -XX:-UseCountedLoopSafepoints intentionally. Here's the (crazy) set of options: -XX:+UnlockDiagnosticVMOptions -XX:-UseBiasedLocking -XX:+SafepointTimeout -XX:+SafepointALot -XX:+AbortVMOnSafepointTimeout -XX:SafepointTimeoutDelay=500 -XX:GuaranteedSafepointInterval=500 -XX:-TieredCompilation -XX:-UseCountedLoopSafepoints -XX:LoopStripMiningIter=0 -XX:LoopUnrollLimit=0 -XX:CompileCommand=compileonly,TestAbortVMOnSafepointTimeout::test_loop -Xcomp -XX:-CreateCoredumpOnCrash -Xms64m
01-10-2020
Yes it seems like we speculated correctly: [0.684s][warning][handshake ] Thread 0x000002a88363b1a0 has not cleared its handshake op JavaThread state: _thread_in_Java Java Threads: ( => current thread ) 0x000002a88363b1a0 JavaThread "main" [_thread_in_Java, id=1861576, stack(0x000000ba15400000,0x000000ba15500000)] The "main" is the thread running the very very long loop without polls. In this case the handshake is: HandshakeForDeflation So this issue is unrelated to 8220763 (not considering timing changes). It's just that the more handshakes we add the bigger chance of this happening. I will go head and fix this test. But we should discuss timeouts, I have an old issue here: https://bugs.openjdk.java.net/browse/JDK-8198730
01-10-2020
Yesterday this scenario was discussed: Compiler will generate a very very long loop without polls (safepoint/handshake). Once a JavaThread start executing that loop we can no longer perform any handshakes on that thread nor safepoints. If a handshake all operation (executed by the VM thread) is started after a JavaThread enters this loop it can never complete. Any safepoint scheduled will not be started, since VM thread is waiting to handshaking this JavaThread, which can never happen. The safepoint timeout is armed at the start of a safepoint, but if a safepoint is never started the timeout is never armed. This test would in this case run until jtreg times out, which is happening here.
01-10-2020
-XX:+UseCountedLoopSafepoints ? But it seems a strange coincidence that this starts failing reliably after 8220763 Also the timeout should be covering the time it takes to get to the safepoint - not the time it executes the safepoint VM operation. Have we inadvertently changed that at some point?
01-10-2020
I found an identical but older (older than the integration of 8212107) and still recent test failure in our systems. So I don't suspect it is related to 8212107.
29-09-2020
ILW = MLM = P4
29-09-2020
Not sure if this is the same as 8220763 .
29-09-2020

Relates :	JDK-8220763 - TestAbortVMOnSafepointTimeout.java fails due to "'SIGILL' missing from stdout/stderr"
Relates :	JDK-8198730 - Improve detection of failure to reach a safepoint
Relates :	JDK-8212107 - VMThread issues and cleanup