Bug ID: JDK-8269523 runtime/Safepoint/TestAbortOnVMOperationTimeout.java failed when expecting 'VM operation took too long'

JDK-8269523 : runtime/Safepoint/TestAbortOnVMOperationTimeout.java failed when expecting 'VM operation took too long'

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 11.0.4,17

Priority: P4
Status: Resolved
Resolution: Fixed
OS: os_x
CPU: aarch64

Submitted: 2021-06-28
Updated: 2022-01-11
Resolved: 2021-07-30

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 17	JDK 18
17.0.3-oracleFixed	18 b09Fixed

Related Reports

Relates :	JDK-8181143 - Introduce diagnostic flag to abort VM on too long VM operations
Relates :	JDK-8271931 - Make AbortVMOnVMOperationTimeout more resilient to OS scheduling
Relates :	JDK-8271609 - Misleading message for AbortVMOnVMOperationTimeoutDelay

Description

Test arguments: '-XX:-CreateCoredumpOnCrash -XX:-TieredCompilation';
Build profile: 'macosx-aarch64'.

The test failed with '-XX:AbortVMOnVMOperationTimeoutDelay=5', the largest value stipulated by the test. A testbug / unexpectedly good performance?

Unfortunately, there is no output file for the last process executed, only the test log is available:

...
Command line: [${WORKDIR}/jib-master/install/jdk-17+26-2439/macosx-aarch64.jdk/jdk-17.jdk/Contents/Home/bin/java -cp $WORKDIR/slaves/$SL1/frameworks/$FW1/executors/$EX1/runs/$RUN1/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_runtime/classes/3/runtime/Safepoint/TestAbortOnVMOperationTimeout.d:$WORKDIR/jib-master/install/jdk-17+26-2439/src.full/open/test/hotspot/jtreg/runtime/Safepoint:$WORKDIR/slaves/$SL1/frameworks/$FW1/executors/$EX1/runs/$RUN1/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_runtime/classes/3/test/lib:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/javatest.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/jtreg.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/junit.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/hamcrest.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/testng.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/jcommander.jar:$WORKDIR/jib-master/install/jtreg/6/1/bundles/jtreg-6+1.zip/jtreg/lib/guice.jar -XX:+UnlockDiagnosticVMOptions -XX:+AbortVMOnVMOperationTimeout -XX:AbortVMOnVMOperationTimeoutDelay=5 -Xmx256m -XX:+UseSerialGC -XX:-CreateCoredumpOnCrash TestAbortOnVMOperationTimeout foo ]
[2021-06-11T18:07:17.417343Z] Gathering output for process 43060
[2021-06-11T18:07:17.605063Z] Waiting for completion for process 43060
[2021-06-11T18:07:17.605109Z] Waiting for completion finished for process 43060
----------System.err:(19/977)----------
 stdout: [];
 stderr: []
 exitValue = 0

java.lang.RuntimeException: 'VM operation took too long' missing from stdout/stderr

        at jdk.test.lib.process.OutputAnalyzer.shouldMatch(OutputAnalyzer.java:340)
        at TestAbortOnVMOperationTimeout.testWith(TestAbortOnVMOperationTimeout.java:77)
        at TestAbortOnVMOperationTimeout.main(TestAbortOnVMOperationTimeout.java:57)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
        at java.base/java.lang.Thread.run(Thread.java:833)

Comments

Fix request [17u] I backport this for parity with 17.0.3-oracle. No risk, only a test change. Clean backport. SAP nightly tests passed.
11-01-2022
A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/69 Date: 2022-01-10 10:27:45 +0000
10-01-2022
Changeset: 4f42eb66 Author: David Holmes <dholmes@openjdk.org> Date: 2021-07-30 04:03:11 +0000 URL: https://git.openjdk.java.net/jdk/commit/4f42eb6601c3b6011d3c2b30af6b2be264ff7c0e
30-07-2021
The test is simply exercising the timeout mechanism to show that it is capable of working, it isn't trying to demonstrate that it is a precise or accurate mechanism. We could have defined a dummy VMOp for test purposes that would trigger a given (large) timeout, but instead we tried to use existing GC VMOps with a very short timeout and hope that the VMOp takes longer than that. Given the scheduling granularity this test strategy has been shown to be flawed, but I think restricting to the 0 and 1ms cases will suffice to re-stabilise the test. If necessary in the future we can try a different approach.
30-07-2021
> I'm not sure it is worthwhile expending time and effort trying to make the timeout mechanism more precise/accurate. For values of `AbortVMOnVMOperationTimeoutDelay` <10, even if the VM-op takes more than the timeout, the abort can still be non-deterministic due to the low precision & scheduling. I think that's exactly the reason why this test fails from time to time. > Also the issue with the WatcherThread may be that its other tasks are taking far too long ... True; regardless of the root cause, it's still beneficial to report when a task can't carry its work in a timely manner, IMO. Outside of the scope of this JBS ticket, ofc.
28-07-2021
I'm not sure it is worthwhile expending time and effort trying to make the timeout mechanism more precise/accurate. Certainly not as part of this test adjustment. Also the issue with the WatcherThread may be that its other tasks are taking far too long ...
27-07-2021
> The timeout mechanism is somewhat crude/inaccurate. Maybe for small timeouts (<10ms), we can ask VM thread to check it instead; somewhere around `disarm()` is called. > it shows the WatcherThread is not getting enough CPU time as the timeout task should have executed at least 5 times during this 57ms interval, but it didn't! Indeed; this instance of TimeoutTask misses its deadline by a significant margin (>4x its intended interval). Maybe it's good to have some kind of self-checking within TimeoutTask to detect&report this, as the system is not behaving properly.
27-07-2021
> fatal error: VM operation took too long: 2 ms (timeout: 0 ms) This statement can be a bit misleading. Some might interpret it as: the VM operation took 2ms to complete and its duration is greater than the timeout (0ms), so a fatal error is reported. However, the correct interpretation is: 2ms has passed since the VM operation started, but it's unknown if that operation is still running or has completed. Nonetheless, a fatal error is inevitable (2ms > 0ms). Possibly, a preexisting problem regarding the timeout value though: From `VMThread::create`: ``` size_t interval = (size_t)AbortVMOnVMOperationTimeoutDelay / 10; interval = interval / PeriodicTask::interval_gran * PeriodicTask::interval_gran; interval = MAX2<size_t>(interval, PeriodicTask::min_interval); interval = MIN2<size_t>(interval, PeriodicTask::max_interval); _timeout_task = new VMOperationTimeoutTask(interval); ``` The minimal gap btw two timeout checks is `PeriodicTask::min_interval` (10ms), so it doesn't provide enough precision for timeout below 10ms. Therefore, a VM-op taking 9ms could still go undetected even with a 5ms timeout. (Ofc, this depends on the scheduling.)
27-07-2021
[~ayang] yes your analysis is correct - thank you. The timeout mechanism is somewhat crude/inaccurate. - The TimeoutTask will be executed (modulo scheduling) at most once every 10ms due to the PeriodicTask::min_interval. - The Timeout task is armed once the VMThread has brought the VM to a safepoint - that takes a snapshot of the current time in nanos. - The timeout task could execute at any point relative to that arming point, and it will evaluate the current elapsed time since the task was armed. If that is > timeout then we have the fatal error. - The VMThread will disarm the task once the VMOperation evaluation has completed. So the VMOp will timeout when two conditions are met: 1. The timeout task is released within the arming window; and 2. The elapsed time exceeds the timeout value. So yes with a timeout < 10ms we can easily miss that because the task may not be released in the arming window, or may be release before the timeout has elapsed. And the reported time is simply how much time has elapsed since arming, which does not tell you whether the VMop is still running or not. Perhaps the most interesting thing about this analysis is that when we see something like: # fatal error: VM operation took too long: 57 ms (timeout: 5 ms) then it shows the WatcherThread is not getting enough CPU time as the timeout task should have executed at least 5 times during this 57ms interval, but it didn't!
27-07-2021
// These should fail: Serial is not very fast. Traversing 10M objects in 5 ms // means less than 0.5 ns per object, which is not doable. for (int delay : new int[]{0, 1, 5}) { testWith(delay, false); } But the logic does: Object[] arr = new Object[10_000_000]; for (int i = 0; i < arr.length; i++) { arr[i] = new Object(); } so we will only traverse 10M objects if the GC happens as we near the end of the loop. The actual number of traversals will depend on how soon GC actually triggers. That said you would expect this to be fairly constant/predictable so the large variance in GC VMOp time is not readily explainable.
25-07-2021
I think the analysis in this test about how fast/slow SerialGC may be needs some adjusting. If we call reportDiagnosticSummary so we can see the cases that do abort as expected we can see how long the GC VMOperation is taking, and while many are much much larger than the timeout e.g on macos-x64. # fatal error: VM operation took too long: 57 ms (timeout: 5 ms) some are much shorter e.g on linux-x64. # fatal error: VM operation took too long: 2 ms (timeout: 0 ms) Obviously if the above had been the 5ms timeout case then this test would have failed. It is interesting to see the result of the 1ms and 5ms cases: # fatal error: VM operation took too long: 11 ms (timeout: 1 ms) # fatal error: VM operation took too long: 12 ms (timeout: 5 ms) there is obviously a massive variance in execution times possible for SerialGC. I'm unsure how to try and fix the test - perhaps just delete the 5ms case as it isn't long enough.
25-07-2021
ILW = MLM = p4
29-06-2021