JDK-8267484 : misc vmTestbase/gc/gctests timed out on a particular machine
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17
  • Priority: P4
  • Status: Closed
  • Resolution: Cannot Reproduce
  • OS: windows
  • CPU: x86_64
  • Submitted: 2021-05-20
  • Updated: 2024-05-20
  • Resolved: 2024-05-20
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 23
23Resolved
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
The following two tests timed out in the JDK17 CI w/ ZGC:

vmTestbase/gc/gctests/StringInternSyncWithGC2/TestDescription.java
vmTestbase/gc/gctests/StringIntern/StringIntern.java

Here's snippets from the StringInternSyncWithGC2 test log:

#section:compile
----------messages:(8/388)*----------
command: compile C:\\ade\\mesos\\work_dir\\jib-master\\install\\jdk-17+23-2075\\src.full\\open\\test\\hotspot\\jtreg\\vmTestbase\\gc\\gctests\\StringInternSyncWithGC\\StringInternSyncWithGC.java
reason: .class file out of date or does not exist
Mode: agentvm
Agent id: 4
Timeout refired 480 times
Timeout information:
--- Timeout information end.
elapsed time (seconds): 3332.359
----------configuration:(11/941)*----------
Boot Layer (javac runtime environment)
  class path: C:\\ade\\mesos\\work_dir\\jib-master\\install\\jtreg\\5.1\\b01\\bundles\\jtreg_bin-5.1.zip\\jtreg\\lib\\javatest.jar 
              C:\\ade\\mesos\\work_dir\\jib-master\\install\\jtreg\\5.1\\b01\\bundles\\jtreg_bin-5.1.zip\\jtreg\\lib\\jtreg.jar 
  patch:      java.base T:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\\patches\\java.base

javac compilation environment
  source path: C:\\ade\\mesos\\work_dir\\jib-master\\install\\jdk-17+23-2075\\src.full\\open\\test\\hotspot\\jtreg\\vmTestbase
               C:\\ade\\mesos\\work_dir\\jib-master\\install\\jdk-17+23-2075\\src.full\\open\\test\\lib
  class path:  T:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\\classes\\2\\vmTestbase
               T:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\\classes\\2\\test\\lib

----------rerun:(41/5920)*----------

Please note that this test timed out in the *compile* phase.


Here's a snippet from the StringIntern test's log:

#section:main
----------messages:(6/261)----------
command: main gc.gctests.StringIntern.StringIntern
reason: User specified action: run main/othervm gc.gctests.StringIntern.StringIntern 
Mode: othervm [/othervm specified]
Timeout information:
--- Timeout information end.
elapsed time (seconds): 3804.369
----------configuration:(0/0)----------
----------System.out:(39/2841)----------
Timeout refired 480 times
Stress time: 120 seconds
Stress iterations factor: 1
Stress threads factor: 1
Stress runs factor: 1
Max memory: 2136997888
Sleep time: 500
Iterations: 0
Number of threads: 24
Run GC thread: false
Run mem diag thread: false
Run forever: false
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@39bb71dd,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@3116a6b3,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@53411cd5,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@14f15a56,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@70fbfbe2,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@74dce870,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@2fa0f58a,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@319d29c2,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@1e16f0f8,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@d1a0422,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@25bebcae,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@7d3552e2,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@540dfc8c,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@96bc93b,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@59e04f74,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@e95a4e8,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@35c7c02a,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@6232f5b8,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@39f29d65,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@1a63dc57,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@11ae99a1,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@64442119,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@49709141,5,MainThreadGroup]
Starting Thread[gc.gctests.StringIntern.StringIntern$StringGenerator@4d756d5c,5,MainThreadGroup]
For random generator using seed: 5747435418235436495
To re-run test with same seed value please add "-Djdk.test.lib.random.seed=5747435418235436495" to command line.
No unexpected exceptions/errors are thrown
----------System.err:(1/16)----------
STATUS:Passed.
----------rerun:(40/6324)*----------

<snip>

result: Error. Program `c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\java' timed out (timeout set to 480000ms, elapsed time including timeout handling was 3372493ms).


Please note that this test passed after the timeout handler
started running.

The really strange part of this failure is that the timeout handler
kicked in at 480 seconds which makes sense since the default
timeout value is 120 seconds and the default timeoutFactor
value is 4 so the total timeout is 480 seconds/8 minutes.

However, the elapsed time for the test is 3804.369 seconds
which is 63 minutes. What the heck happened that caused
the test to take another 55 minutes after the 8 minute timeout.
Comments
The failure points to a problem with a specific machine and we haven't seen this issue in a long time. Closing this as CNR.
20-05-2024

There's a second jtreg issue that I think we need to resolve: https://github.com/openjdk/jtreg/pull/4 Tests can declare that they want to run "exclusively", but that only excludes other tests that also specify that they want to run exclusively. Other tests can still run at the same time. Fixing this will at least rule out any kind of starvation problems happening because of some other test running concurrently.
21-05-2021

Also not the time difference between the end of one timeout handler tool invocation and the start of the next: [2021-05-20 17:15:27] exit code: -2 time: 56011 ms [2021-05-20 17:18:21] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2059\windows-x64-debug.jdk\jdk-17\fastdebug\bin\jcmd.exe, 48672, Compiler.codelist] timeout=20000 It takes almost 3 minutes for something that later becomes instantaneous: [2021-05-20 09:29:24] exit code: 0 time: 272 ms [2021-05-20 09:29:24] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\jcmd.exe, 35504, GC.class_histogram] timeout=20000 This is not a JVM specific issue. The entire machine goes down on its knees.
21-05-2021

We've seen a new hang on this machine with a G1 test. The timeout handler shows the same odd pattern that the driver JVM is blocked for a long time (note, not the test JVM) and then suddenly it becomes responsive again. See how it seems to start to accept jcmd requests after ~23 mins: [2021-05-20 17:06:27] [C:\cygwin\bin\bash.exe, -c, wmic process where ParentProcessId=48672 get ProcessId | tail -n+2] timeout=20000 ---------------------------------------- WARNING: tool timed out: killed process after 20000 ms ... stringtable ---------------------------------------- [2021-05-20 17:28:53] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2059\windows-x64-debug.jdk\jdk-17\fastdebug\bin\jcmd.exe, 48672, VM.stringtable] timeout=20000 ---------------------------------------- WARNING: tool timed out: killed process after 20000 ms ---------------------------------------- [2021-05-20 17:29:19] exit code: -2 time: 26305 ms ---------------------------------------- symboltable ---------------------------------------- [2021-05-20 17:29:43] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2059\windows-x64-debug.jdk\jdk-17\fastdebug\bin\jcmd.exe, 48672, VM.symboltable] timeout=20000 ---------------------------------------- 48672: SymbolTable statistics: Number of buckets : 32768 = 262144 bytes, each 8 Number of entries : 47151 = 754416 bytes, each 16 Number of literals : 47151 = 1879264 bytes, avg 39.000 Total footprint : = 2895824 bytes Average bucket size : 1.439 Variance of bucket size : 1.433 Std. dev. of bucket size: 1.197 Maximum bucket size : 8 ---------------------------------------- [2021-05-20 17:29:43] exit code: 0 time: 655 ms
21-05-2021

The other JVM running StringIntern test was also hanging during this time. From timeout handler of that process: [2021-05-20 08:41:24] [C:\cygwin\bin\bash.exe, -c, wmic process where ParentProcessId=43184 get ProcessId | tail -n+2] timeout=20000 <nothing here> [2021-05-20 09:29:23] [C:\WINDOWS\System32\Wbem\wmic.exe, process, where, processId=43184, list, full] timeout=20000 It's unclear if there's some kind of synchronization between the two timeout handlers that make only one of them execute and the other waiting, or if they actually are both hanging at the the same time.
20-05-2021

I somehow got assigned this bug by mistake...
20-05-2021

It is interesting that the timeout handler ran a number of jcmd against the G1 process and they timed out after 20s. One example: --- uptime ---------------------------------------- [2021-05-20 09:20:21] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\jcmd.exe, 35504, VM.uptime] timeout=20000 ---------------------------------------- WARNING: tool timed out: killed process after 20000 ms ---------------------------------------- [2021-05-20 09:20:49] exit code: -2 time: 28477 ms --- but then suddenly they started to work at around this point: [2021-05-20 09:29:00] [c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\jcmd.exe, 35504, VM.system_properties] timeout=20000 and just a few seconds after that the ~2880 second timeout handler completes. Why is this G1 process hanging?
20-05-2021

I looked at the process.html of vmTestbase/gc/gctests/StringInternSyncWithGC2/TestDescription.java. It says that it has been running different timeout handler tools from 8:42 until 9:29. But the weird thing is that the process that it queries is a G1 process. From jstack: Full thread dump Java HotSpot(TM) 64-Bit Server VM (17-ea+23-LTS-2075 mixed mode): ... "G1 Refine#2" os_prio=2 cpu=0.00ms elapsed=5932.36s tid=0x000002609e3e0e30 nid=0xcbc runnable and jcmd VM.system_properties: sun.java.command=com.sun.javatest.regtest.agent.AgentServer -id 4 -logfile T\:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\\jtData\\agentServer.4.trace -allowSetSecurityManager -port 50437 -timeoutFactor 4.0 And windows | native | info indicates that it is a java.exe process and not javac.exe: CommandLine=c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\java -classpath C:\ade\mesos\work_dir\jib-master\install\jtreg\5.1\b01\bundles\jtreg_bin-5.1.zip\jtreg\lib\javatest.jar;C:\ade\mesos\work_dir\jib-master\install\jtreg\5.1\b01\bundles\jtreg_bin-5.1.zip\jtreg\lib\jtreg.jar -XX:MaxRAMPercentage=4.16667 -Djava.io.tmpdir=t:\testoutput\test-support\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\tmp --patch-module java.base=T:\testoutput\test-support\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\patches\java.base -Djava.security.policy=file:/t:/testoutput/test-support/jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc/jtreg.policy com.sun.javatest.regtest.agent.AgentServer -id 4 -logfile T:\testoutput\test-support\jtreg_open_test_hotspot_jtreg_vmTestbase_vm_gc_misc\jtData\agentServer.4.trace -allowSetSecurityManager -port 50437 -timeoutFactor 4.0 So, is this hang really a ZGC hang, or is the process running G1 the one hanging?
20-05-2021

But something else is weird. Look at the output from the first timeout: [2021-05-20T09:29:24,320Z] TEST RESULT: Error. Program `c:\ade\mesos\work_dir\jib-master\install\jdk-17+23-2075\windows-x64.jdk\jdk-17\bin\java' timed out (timeout set to 480000ms, elapsed time including timeout handling was 3372493ms). [2021-05-20T09:29:24,320Z] -------------------------------------------------- and check the timestamp of the subsequent javac timestamps: [2021-05-20T09:29:30,293Z] TEST: vmTestbase/gc/gctests/StringInternSyncWithGC2/TestDescription.java The difference is 6 seconds, but the timeout was 480 seconds. Either the javac ran at the same time, or something is weird with the timestamps.
20-05-2021

So, vmTestbase/gc/gctests/StringIntern/StringIntern.java times out and runs for a long time, and directly after that the next javac compilation for vmTestbase/gc/gctests/StringInternSyncWithGC2/TestDescription.java hangs for 8min. That indicates to me that the same thing that caused the first to take a long time also affected the subsequent javac invocation.
20-05-2021

Remembering a little bit more about the durations. I think this one is correct: (timeout set to 480000ms, elapsed time including timeout handling was 3372493ms). but this one is wrong: TIME: 3804.369 seconds So, yes, we why did we spend 3372s here?
20-05-2021

I think you are dismissing the info I gave you too quickly. Look at the succeeding tests as well: vmTestbase/gc/gctests/StringInternSyncWithGC/StringInternSyncWithGC.java 1h 07m 15s vmTestbase/gc/gctests/StringInternSync2/TestDescription.java 1h 07m 15s vmTestbase/gc/gctests/StringInternSync/StringInternSync.java 1h 05m 16s vmTestbase/gc/gctests/StringInternGC/StringInternGC.java 1h 05m 15s vmTestbase/gc/gctests/StringIntern/StringIntern.java 1h 03m 24s You can see the distinct pattern of increments of 120s. But jtreg/mdash still reports > 1h of work. This doesn't show that we didn't timeout, but it shows that you can't trust the >1h durations. It's interesting that javac failed. That indicates that it is not the tests that are causing the timeouts, but probably how loaded the machine is. It's very unlikely that we both get a hang in the test and in a task that compiles the test, when we don't see any other hangs in other tests / OSes.
20-05-2021

[~stefank] - That explanation doesn't cover the javac timeout in StringInternSyncWithGC2. And for the StringIntern, maybe that's an issue, but the test task's concurrency is: -concurrency:6 -timeoutFactor:4 so I don't think that's what's happening here either. Also there are these timestamps in the StringIntern log file: start=Thu May 20 08\:25\:59 GMT 2021 end=Thu May 20 09\:29\:24 GMT 2021 so I believe that test also ran for that amount of time. The MDash page is also reporting that test ran for 1h 03m 24s.
20-05-2021

We have been tracking other strange test timeouts with ZGC on windows machines: JDK-8263909 misc tests timed out on a particular test machine JDK-8263618 misc tests timeout on a particular test machine with ZGC JDK-8260427 ZGC: misc vmTestbase/gc/gctests timed out The last one (JDK-8260427) was closed as not reproducible because the failure mode for which the bug was originally filed were no longer reproducible. However, the failures linked to JDK-8260427 since April 26th were caused by this bug: JDK-8266432 ZGC: GC allocation stalls can trigger deadlocks
20-05-2021

[~dcubed] I think the reported time is incorrect. I've created a jtreg fix for it, but haven't had time to follow-through with that patch. See: https://github.com/openjdk/jtreg/pull/3
20-05-2021