JDK-8260062 : runtime/handshake/HandshakeDirectTest.java timed out
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 16,17
  • Priority: P4
  • Status: Resolved
  • Resolution: Duplicate
  • OS: windows
  • CPU: x86_64
  • Submitted: 2021-01-20
  • Updated: 2021-11-17
  • Resolved: 2021-11-17
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17
17Resolved
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
The following test timed out in the JDK17 CI:

runtime/handshake/HandshakeDirectTest.java

Here's a snippet from the log file:

#section:main
----------messages:(6/732)----------
command: main -XX:+UnlockDiagnosticVMOptions -XX:+UseBiasedLocking -XX:GuaranteedSafepointInterval=10 -XX:+HandshakeALot -XX:+SafepointALot -XX:BiasedLockingDecayTime=100000000 -XX:BiasedLockingBulkRebiasThreshold=1000000 -XX:BiasedLockingBulkRevokeThreshold=1000000 HandshakeDirectTest
reason: User specified action: run main/othervm -XX:+UnlockDiagnosticVMOptions -XX:+UseBiasedLocking -XX:GuaranteedSafepointInterval=10 -XX:+HandshakeALot -XX:+SafepointALot -XX:BiasedLockingDecayTime=100000000 -XX:BiasedLockingBulkRebiasThreshold=1000000 -XX:BiasedLockingBulkRevokeThreshold=1000000 HandshakeDirectTest 
Mode: othervm [/othervm specified]
Timeout information:
--- Timeout information end.
elapsed time (seconds): 621.018
----------configuration:(0/0)----------
----------System.out:(1/27)----------
Timeout refired 480 times
----------System.err:(4/626)----------
Java HotSpot(TM) 64-Bit Server VM warning: Option UseBiasedLocking was deprecated in version 15.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option BiasedLockingDecayTime was deprecated in version 15.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option BiasedLockingBulkRebiasThreshold was deprecated in version 15.0 and will likely be removed in a future release.
Java HotSpot(TM) 64-Bit Server VM warning: Option BiasedLockingBulkRevokeThreshold was deprecated in version 15.0 and will likely be removed in a future release.
----------rerun:(49/6402)*----------

<snip>

result: Error. Program `c:\ade\mesos\work_dir\jib-master\install\jdk-17+6-348\windows-x64-debug.jdk\jdk-17\fastdebug\bin\java' timed out (timeout set to 480000ms, elapsed time including timeout handling was 620987ms).

The test task has these JVM args:

-XX:+CreateCoredumpOnCrash -XX:+UseZGC

I'm starting this bug in hotspot/gc for initial triage.
Comments
I'll close as dup of JDK-8270085.
17-11-2021

This could still be the same underlying problem as JDK-8270085 where a thread suspends while holding a lock, except in this case we don't get a deadlock, just an extreme delay. We would need stacktraces of all threads to confirm this of course.
12-07-2021

Created https://bugs.openjdk.java.net/browse/JDK-8270085 for the Linux suspend/deadlock issue. Re-linked the mach5 failed test to 8270085.
08-07-2021

We are calling deopt all marked with the Compile_lock held, thus asking the VM thread to handshake all threads. While we wait for that in blocked state, the JavaThread gets suspended.
02-07-2021

> It seem like one JavaThread is suspended while holding Compile_lock. Just to state the obvious we should never allow a thread to suspend whilst holding any internal VM resources. Is this a case of nested locks?
02-07-2021

We had one on Linux and got some data. It seem like one JavaThread is suspended while holding Compile_lock. The JavaThread that is suppose to resume the suspended thread is waiting for Compile_lock. We use to avoid the suspend flag when in block_trans when leaving safepoint. Since the Windows timeouts are older than suspend changes, I believe those were infra problems. Linux one is the issue with suspend while holding Compile_lock.
01-07-2021

I logged in to one of these VM.Standard.E2.4 Windows machines a couple of weeks ago but couldn't reproduce the timeout. Looking in mach5 from the past 3 months(mach5-one jobs), the maximum running time of this test on linux and macos is around 2min. For windows, after mid April is around 4min. Before mid April there are runs that took up to 16 minutes. I'm not sure if something changed in the infrastructure after that date. If we make the test shorter on Windows running half the amount of handshakes could be an option.
01-06-2021

I think we should just run the test shorter this very slow OS.
27-05-2021

I get many failures if I add a 20 seconds safepoint timeout (if safepoint synchronization takes more then 20 seconds we call fatal) and a handshake timeout of 20 seconds: https://mach5.us.oracle.com/mdash/jobs/rehn-dev-jdk-20210212-1123-18325945/results?search=status%3Afailed%20AND%20-state%3Ainvalid runtime/CommandLine/OptionsValidation/ can be ignored since they are 'special'. All other test failure indicate huge thread CPU runtime starvation. If e.g. jcmd/jstack is executed the 'thread' gets un-stuck. (might indicate an issue with OS/Hypervisor) Also calling fatal un-stucks the VM. (looking at the mdmp the handshake is already performed when we actually abort) This spring I could run with 3 seconds safepoint/handshake timeout with no unexpected failures. (some test removes safepoint polls etc...) All issues I have looked is also on Windows Server 2016 10.0 (amd64). E.g. Windows Server 2019 10.0 (amd64) doesn't seem to timeout ? All above plus no issues found in VM indicate the problem is outside the VM.
18-02-2021

[~rehn] Above you say: > With a such unreasonably over-provisioned machine it's expected that tests will timeout. but I don't see any stats or info to show that the Win* machines are over provisioned. Please explain.
16-02-2021

If a test has minimum resource requirements, you could express it with "@requires": E.g.: @requires os.maxMemory > 6G @requires os.processors > 4 Alternatively, you could increase timeout on a test.
16-02-2021

There are number of timeouts on windows, everything I have seen indicates no issues in the VM. The threads just get so starved that synchronization for a safepoint often takes more than 20 seconds. With a such unreasonably over-provisioned machine it's expected that tests will timeout. On my big-machine running Linux during startup I see one very long safepoint synchronization at: 0.016 s (the average for the test is: 0.0004 s) and the test takes about 14 seconds to finish. If we can't even safepoint in over 20 seconds which is factor of over a thousand against local worse, that the test didn't finish in 480 seconds sounds reasonable under such circumstance. If we can't find some information that indicate a problem with the VM I like to close as works a intended.
15-02-2021

For ILW, since this is a timeout and not a crash or severe loss of functionality, changing Impact to M: ILW = MLH = P4
26-01-2021

I had a look at what this test does. It runs a bunch of threads that continuously revoke each others locks, while yet another thread is shooting suspend/resume requests at the threads. During the execution of the test, not a single GC happens. Therefore, I think it seems rather unlikely it has anything to do with the GC, yet much more likely it could have something to do with either the locking code, handshake code, or suspend/resume code, which are all in the runtime. So I'm sending this one over to runtime.
22-01-2021