JDK-8248912 : ZGC: serviceability/sa/TestJhsdbJstackMixed.java timed out
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc-agent
  • Affected Version: 15,17
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86_64
  • Submitted: 2020-07-07
  • Updated: 2023-11-17
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8271507 :  
Description
The following test failed due to timeout in the JDK15 CI:

serviceability/sa/TestJhsdbJstackMixed.java

Here's a snippet from the log file:

#section:main
----------messages:(6/222)----------
command: main TestJhsdbJstackMixed
reason: User specified action: run main/othervm TestJhsdbJstackMixed 
Mode: othervm [/othervm specified]
Timeout information:
--- Timeout information end.
elapsed time (seconds): 542.313
----------configuration:(0/0)----------
----------System.out:(2823/83351)----------
For random generator using seed: 6987791652654824436
To re-run test with same seed value please add "-Djdk.test.lib.random.seed=6987791652654824436" to command line.
Command line: ['/opt/mach5/mesos/work_dir/jib-master/install/jdk-15+31-1493/linux-x64-debug.jdk/jdk-15/fastdebug/bin/java' '-XX:MaxRAMPercentage=6' '-XX:+CreateCoredumpOnCrash' '-XX:+UseParallelGC' '-XX:+UseNUMA' '-Djava.library.path=/opt/mach5/mesos/work_dir/jib-master/install/jdk-15+31-1493/linux-x64-debug.test/hotspot/jtreg/native' '-cp' '/opt/mach5/mesos/work_dir/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S997/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/a09de496-f095-49a3-aeb2-7e67191ae33a/runs/01764858-c14e-4d09-aff2-a2de5f71b4c0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_serviceability/classes/1/serviceability/sa/TestJhsdbJstackMixed.d:/opt/mach5/mesos/work_dir/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S997/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/a09de496-f095-49a3-aeb2-7e67191ae33a/runs/01764858-c14e-4d09-aff2-a2de5f71b4c0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_serviceability/classes/1/test/lib' 'LingeredAppWithNativeMethod' '2809b1dd-5940-454a-8e88-08a0963d1a5d.lck']
Started LingeredApp with pid 21901
[2020-07-07T04:05:27.596129414Z] Gathering output for process 21948
Timeout refired 480 times
Attaching to process ID 21901, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 15-ea+31-1493
Deadlock Detection:

No deadlocks found.

----------------- 21943 -----------------
"NoFramePointerJNIFib" #13 prio=5 tid=0x00007fb9482a9d00 nid=0x55b7 runnable [0x00007fb90f7f0000]
   java.lang.Thread.State: RUNNABLE
   JavaThread state: _thread_in_native
0x00007fb94c03b6b0	fib + 0x30
----------------- 21935 -----------------
"Common-Cleaner" #12 daemon prio=8 tid=0x00007fb94821b910 nid=0x55af in Object.wait() [0x00007fb90fbf9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
   JavaThread state: _thread_blocked
0x00007fb951730de2	__pthread_cond_timedwait + 0x132
----------------- 21934 -----------------
0x00007fb9508241ae	MutableSpaceUsedHelper::take_sample() + 0xe
0x00007fb9506e12df	PerfLongVariant::sample() + 0x1f
0x00007fb9482182d0		????????
0x00007fb9482182d0		????????

<snip>

0x00007fb9482182d0		????????
0x00007fb9482182d0		????????
0x00007fb9482182d0		????????

DEBUG: [0x00007fb94c03b6b0	fib + 0x30]
DEBUG: Test triggered interesting condition.
DEBUG: Test PASSED.
Test Completed
 LingeredApp stdout: [For random generator using seed: -1655933452054152999
To re-run test with same seed value please add "-Djdk.test.lib.random.seed=-1655933452054152999" to command line.
fib(40) = 102334155
fib(46) = 1836311903
fib(46) = ];
 LingeredApp stderr: []
 LingeredApp exitValue = 0
----------System.err:(2/16)----------

STATUS:Passed.
----------rerun:(35/6088)*----------

Please note that the test did pass while the JTREG timeout
handler was running. It could be that the test just needs a
slightly longer default timeout value. The default timeout is
120 seconds/2 minutes and the timeoutFactor is 4 for a
total timeout value of 480 seconds/8 minutes. This test
executed in

elapsed time (seconds): 542.313
Comments
There was a serviceability/sa/ClhsdbPstack.java#id1 timeout where the clhsdb main thread seemed to be stuck walking the stack. Each of the 5 samples has a different stack trace, but they all show CLHSDB.run() as being the bottom most active frame (its line number is different for each sample). Here's one sample: "main" #1 prio=5 os_prio=0 cpu=4774213.57ms elapsed=4793.81s allocated=295G defined_classes=505 tid=0x00007fa83002c3a0 nid=2440476 runnable [0x00007fa835ce2000] java.lang.Thread.State: RUNNABLE Thread: 0x00007fa83002c3a0 [0x253d1c] State: _at_safepoint _at_poll_safepoint 0 JavaThread state: _thread_blocked at sun.jvm.hotspot.types.basic.BasicTypeDataBase.findDynamicTypeForAddress(jdk.hotspot.agent@18-ea/BasicTypeDataBase.java:272) at sun.jvm.hotspot.runtime.VirtualBaseConstructor.instantiateWrapperFor(jdk.hotspot.agent@18-ea/VirtualBaseConstructor.java:104) at sun.jvm.hotspot.utilities.GrowableArray.at(jdk.hotspot.agent@18-ea/GrowableArray.java:63) at sun.jvm.hotspot.code.CodeCache.contains(jdk.hotspot.agent@18-ea/CodeCache.java:78) at sun.jvm.hotspot.tools.PStack.run(jdk.hotspot.agent@18-ea/PStack.java:139) at sun.jvm.hotspot.CommandProcessor$30.doit(jdk.hotspot.agent@18-ea/CommandProcessor.java:1123) at sun.jvm.hotspot.CommandProcessor.executeCommand(jdk.hotspot.agent@18-ea/CommandProcessor.java:2089) at sun.jvm.hotspot.CommandProcessor.executeCommand(jdk.hotspot.agent@18-ea/CommandProcessor.java:2059) at sun.jvm.hotspot.CommandProcessor.run(jdk.hotspot.agent@18-ea/CommandProcessor.java:1930) at sun.jvm.hotspot.CLHSDB.run(jdk.hotspot.agent@18-ea/CLHSDB.java:112) at sun.jvm.hotspot.CLHSDB.main(jdk.hotspot.agent@18-ea/CLHSDB.java:44) at sun.jvm.hotspot.SALauncher.runCLHSDB(jdk.hotspot.agent@18-ea/SALauncher.java:281) at sun.jvm.hotspot.SALauncher$$Lambda$4/0x0000000801001800.accept(jdk.hotspot.agent@18-ea/Unknown Source) at sun.jvm.hotspot.SALauncher.main(jdk.hotspot.agent@18-ea/SALauncher.java:500)
30-07-2021

It looks like we need to problem list the following 3 tests on ZGC since it is reproducing quite often: serviceability/sa/TestJhsdbJstackMixed.java serviceability/sa/ClhsdbPstack.java#id1 serviceability/sa/ClhsdbPstack.java#id0
29-07-2021

This is not a simple test timeout issue due to the test not having been given enough time. If you look at the log it contains a whole bunch of the following lines in the stack trace. 0x00007fb9482182d0 ???????? The log starts with this and ends with this. The test is suppose to produce 20 stack traces (and of course each stack trace has multiple threads), yet all we are seeing is the first thread: ----------------- 21943 ----------------- "NoFramePointerJNIFib" #13 prio=5 tid=0x00007fb9482a9d00 nid=0x55b7 runnable [0x00007fb90f7f0000] java.lang.Thread.State: RUNNABLE JavaThread state: _thread_in_native 0x00007fb94c03b6b0 fib + 0x30 ----------------- 21935 ----------------- "Common-Cleaner" #12 daemon prio=8 tid=0x00007fb94821b910 nid=0x55af in Object.wait() [0x00007fb90fbf9000] java.lang.Thread.State: TIMED_WAITING (on object monitor) JavaThread state: _thread_blocked 0x00007fb951730de2 __pthread_cond_timedwait + 0x132 ----------------- 21934 ----------------- 0x00007fb9508241ae MutableSpaceUsedHelper::take_sample() + 0xe 0x00007fb9506e12df PerfLongVariant::sample() + 0x1f 0x00007fb9482182d0 ???????? 0x00007fb9482182d0 ???????? 0x00007fb9482182d0 ???????? <1000's of lines of the same> 0x00007fb9482182d0 ???????? 0x00007fb9482182d0 ???????? 0x00007fb9482182d0 ???????? DEBUG: [0x00007fb94c03b6b0 fib + 0x30] DEBUG: Test triggered interesting condition. DEBUG: Test PASSED. So SA seems to have gotten stuck in an infinite loop dumping thread 21934, which for some reason doesn't even have it's thread info dumped along with it. I do also see this thread dumped in passing test runs, but it only appears as follows: ----------------- 3158 ----------------- 0x00007f954ae22de2 __pthread_cond_timedwait + 0x132 Based on the MutableSpaceUsedHelper::take_sample() frame it looks like this is the thread used for doing PerfData samples, so I assume it's not a JavaThread and therefore no thread name or state. I think the test ran jstack while the PerfData sampler thread was taking a sample, and for some reason this led to an infinite loop dumping the stack walk. The jstack process eventually times out, and the harness provides a stack dump for it. Here's what it's busy doing: "main" #1 prio=5 os_prio=0 cpu=363617.91ms elapsed=490.41s tid=0x00007f6acc032c50 nid=0x55bd runnable [0x00007f6ad47ed000] java.lang.Thread.State: RUNNABLE Thread: 0x00007f6acc032c50 [0x55bd] State: _at_safepoint _at_poll_safepoint 1 JavaThread state: _thread_blocked at sun.jvm.hotspot.types.basic.BasicTypeDataBase.findDynamicTypeForAddress(jdk.hotspot.agent@15-ea/BasicTypeDataBase.java:328) at sun.jvm.hotspot.runtime.VirtualBaseConstructor.instantiateWrapperFor(jdk.hotspot.agent@15-ea/VirtualBaseConstructor.java:104) at sun.jvm.hotspot.utilities.GrowableArray.at(jdk.hotspot.agent@15-ea/GrowableArray.java:63) at sun.jvm.hotspot.code.CodeCache.contains(jdk.hotspot.agent@15-ea/CodeCache.java:77) at sun.jvm.hotspot.tools.PStack.run(jdk.hotspot.agent@15-ea/PStack.java:130) at sun.jvm.hotspot.tools.PStack.run(jdk.hotspot.agent@15-ea/PStack.java:58) at sun.jvm.hotspot.tools.PStack.run(jdk.hotspot.agent@15-ea/PStack.java:53) at sun.jvm.hotspot.tools.JStack.run(jdk.hotspot.agent@15-ea/JStack.java:67) at sun.jvm.hotspot.tools.Tool.startInternal(jdk.hotspot.agent@15-ea/Tool.java:262) at sun.jvm.hotspot.tools.Tool.start(jdk.hotspot.agent@15-ea/Tool.java:225) at sun.jvm.hotspot.tools.Tool.execute(jdk.hotspot.agent@15-ea/Tool.java:118) at sun.jvm.hotspot.tools.JStack.runWithArgs(jdk.hotspot.agent@15-ea/JStack.java:90) at sun.jvm.hotspot.SALauncher.runJSTACK(jdk.hotspot.agent@15-ea/SALauncher.java:300) at sun.jvm.hotspot.SALauncher$$Lambda$3/0x0000000800bed4c0.accept(jdk.hotspot.agent@15-ea/Unknown Source) at sun.jvm.hotspot.SALauncher.main(jdk.hotspot.agent@15-ea/SALauncher.java:483) This looks a lot like JDK-8231635, which was fixed in 14. However, it was for regular jstack java frame walking code. In this case jstack is being used with --mixed, which does mixed native and java frame walking, and triggers different code to do so. I think that code has the same issue. PStack uses CFrame.sender(), which has different subclass variants depending on the platform. We could fix this in a way similar to JDK-8231635. Basically make sure sender() is not returning the same frame, or a frame with a higher address. But it's unclear to me why sender() would ever fail to return a valid frame. For JDK-8231635, the issue was determining the top java frame, which is done by retrying a whole bunch of stack pointers until one eventually can be used to safely walk the stack to the entry frame. So bad stack pointers and the potential of a frame pointing to itself is possible and needs to be defended against. However, in the case of the pstack mixed stack walking code, we assume the stack register points to a valid native frame. All pstack does is call LinuxCDebugger.topFrameForThread(), which does: AMD64ThreadContext context = (AMD64ThreadContext) thread.getContext(); Address pc = context.getRegisterAsAddress(AMD64ThreadContext.RIP); if (pc == null) return null; return LinuxAMD64CFrame.getTopFrame(dbg, pc, context); So there is no "trying to find" the top frame. It just grabs it based on the thread's context and assumes it's valid. PStack then walks the frames in a loop with no checks: CFrame f = cdbg.topFrameForThread(th); while (f != null) { <print frame> f = f.sender(th); } Now if this code had failed on the first frame, I would just assume it tried grabbing the frame while the thread was in some inconsistent state. But it appears the first two frames are valid because you see: 0x00007fb9508241ae MutableSpaceUsedHelper::take_sample() + 0xe 0x00007fb9506e12df PerfLongVariant::sample() + 0x1f 0x00007fb9482182d0 ???????? So problems don't turn up until the 3rd frame is reached. So either the stack walking code is just plain wrong, or the first frame was not actually valid, but just looks valid. Maybe these two frames are remnants from a no longer active stack. Possibly we need to do some sort of sanity check in LinuxCDebugger.topFrameForThread(), although I'm not to sure what to do. I guess we could just walk the stack like pstack does and makes sure it doesn't get stuck in an infinite loop, or possibly it should be pstacks job to do that.
04-08-2020

One thing I forgot to explain is why the test times out, but also prints out that it PASSED. I think the test harness timeout handler is getting triggered and starts killing processes. It kills the jstack process first, but before it has a chance to kill the main test process, it has woken up in response to the termination of the jstack process, and manages to look at the jstack output and print PASSED before the harness has a chance to kill it. This is just a guess though. I'm not well versed in the internal workings of the test harness and how timeouts are handled. Note the test does not check the result of Process.waitFor(): jhsdb.waitFor(); If it did, it probably would see that is is not 0 because the harness killed off the process, and at that point the test should be throwing an exception, not passing.
31-07-2020