JDK-8247762 : [aarch64] Timeout in .../HeapDumpTestWithActiveProcess.java due to inf. loop in AARCH64CurrentFrameGuess.run()
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc-agent
  • Affected Version: 15,16
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: aarch64
  • Submitted: 2020-06-17
  • Updated: 2020-07-13
  • Resolved: 2020-07-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 16
16 b05Fixed
Related Reports
Relates :  
Relates :  
Description
Multiple (2) core files produced.

----------System.out:(27/1276)----------
Starting Jshell
[JShell]:/scratch/opt/mach5/mesos/work_dir/jib-master/install/2020-06-16-0942336.patric.hedlin.jdk/linux-aarch64-debug.jdk/jdk-16/fastdebug/bin/jshell
[JShell] |  Welcome to JShell -- Version 16-internal
[JShell] |  For an introduction type: /help intro
Jshell Started in 5066ms
Starting jmap against 29249
[2020-06-17T09:50:43.610755275Z] Gathering output for process 29309
Timeout refired 960 times
[JShell] |  State engine terminated.
[2020-06-17T10:07:50.249749Z] Waiting for completion for process 29309
[2020-06-17T10:07:50.250105500Z] Waiting for completion finished for process 29309
Output and diagnostic info for process 29309 was saved into 'pid-29309-output.log'
jhsdb jmap stdout:
Attaching to process ID 29249, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 16-internal+0-2020-06-16-0942336.patric.hedlin.jdk

jhsdb jmap stderr:

###### End of all output:
[2020-06-17T10:07:50.259806750Z] Waiting for completion for process 29309
[2020-06-17T10:07:50.260026600Z] Waiting for completion finished for process 29309
[2020-06-17T10:07:50.260282800Z] Waiting for completion for process 29309
[2020-06-17T10:07:50.260470150Z] Waiting for completion finished for process 29309
Destroying jshell
Jshell destroyed
----------System.err:(31/1594)----------
 stdout: [Attaching to process ID 29249, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 16-internal+0-2020-06-16-0942336.patric.hedlin.jdk
];
 stderr: []
 exitValue = 134

java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Expected to get exit value of [0]

	at JShellHeapDumpTest.launch(JShellHeapDumpTest.java:82)
	at JShellHeapDumpTest.launch(JShellHeapDumpTest.java:97)
	at JShellHeapDumpTest.testHeapDump(JShellHeapDumpTest.java:121)
	at JShellHeapDumpTest.main(JShellHeapDumpTest.java:169)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.RuntimeException: Expected to get exit value of [0]

	at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:457)
	at JShellHeapDumpTest.launch(JShellHeapDumpTest.java:80)
	... 9 more
open/test/jdk/sun/tools/jhsdb/HeapDumpTestWithActiveProcess.java

JavaTest Message: Test threw exception: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Expected to get exit value of [0]

JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Expected to get exit value of [0]
Comments
URL: https://hg.openjdk.java.net/jdk/jdk/rev/26fe38b4025c User: phedlin Date: 2020-07-08 10:16:42 +0000
08-07-2020

AARCH64Frame.sender() can make the check.
03-07-2020

Certainly, avoiding an infinite loop in JavaVFrame.sender() is just damage control. However, I suspect the intention is a platform agnostic traversal of the stack and that sort of prohibits pushing platform knowledge (such as stack order) into this context. We can certainly leave this for another bug report but it should be addressed one way or another.
03-07-2020

JavaVFrame.sender() has a potentially infinite loop, currently unable to handle anything but the expected scenario. A simple (max) counter should make this more robust.
03-07-2020

That will help prevent the timeout, but not all the other failures described in JDK-8247641. The root cause of all these problems is a not understood bug that causes stack walking of compiled frames to walk into a bad frame. Once garbage is accessed, there are many potential ways to fail, including getting stuck in an infinite loop. HeapDumpTestWithActiveProcess.java doesn't currently detect the failure modes described in JDK-8247641, and naively passes because it does not look for the exception in the output, but in the case of the infinite loop you are seeing it fails because of the timeout. Rather than a max counter, it could just check that AARCH64Frame.sender() always progresses in the proper direction, and throw an exception when it does not. However, this is really just hiding the root bug and not fixing it. The situation here is different than the similar fix you applied in AARCH64CurrentFrameGuess.run(). In that case we aren't sure if we are looking at a valid frame, so have to error check. In this case the frame is suppose to be valid and no error checking needed. However, I'm now wondering if possibly the frame is in the process of being pushed, and is not yet fully initialized, making walking the frame potentially dangerous. We would need a way to check for a (current) frame in this stack, similar to what I'm currently doing when an interpreter frame is being pushed. See JDK-8244383. I'm not yet sure how to do that.
29-06-2020

I think the failure immediately above is probably related to JDK-8247641, which documents numerous failures as ThreadStackTrace.dumpStack() calls VFrame.javaSender(). In those cases eventually a bad address is accessed, leading to an exception. In this case I think the bad address is leading to getting a VFrame that we've already seen, and thus the stack linkage is recursive. I still think it would be best to have a separate bug though since from the point of view of a user looking at the failure, a timeout is very different than an exception.
26-06-2020

The stack trace for the above failure is different: "main" #1 prio=5 os_prio=0 cpu=997554.48ms elapsed=1010.46s tid=0x0000fffe94031eb0 nid=0x62da runnable [0x0000fffe989ad000] java.lang.Thread.State: RUNNABLE Thread: 0x0000fffe94031eb0 [0x62da] State: _at_safepoint _at_poll_safepoint 1 JavaThread state: _thread_blocked at sun.jvm.hotspot.debugger.DebuggerUtilities.byteSwap(jdk.hotspot.agent@16-internal/DebuggerUtilities.java:280) at sun.jvm.hotspot.debugger.DebuggerUtilities.dataToCInteger(jdk.hotspot.agent@16-internal/DebuggerUtilities.java:157) at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.readCInteger(jdk.hotspot.agent@16-internal/LinuxDebuggerLocal.java:568) at sun.jvm.hotspot.debugger.DebuggerBase.readAddressValue(jdk.hotspot.agent@16-internal/DebuggerBase.java:462) at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.readAddress(jdk.hotspot.agent@16-internal/LinuxDebuggerLocal.java:502) at sun.jvm.hotspot.debugger.linux.LinuxAddress.getAddressAt(jdk.hotspot.agent@16-internal/LinuxAddress.java:74) at sun.jvm.hotspot.runtime.aarch64.AARCH64Frame.getSenderPC(jdk.hotspot.agent@16-internal/AARCH64Frame.java:448) at sun.jvm.hotspot.runtime.aarch64.AARCH64Frame.sender(jdk.hotspot.agent@16-internal/AARCH64Frame.java:294) at sun.jvm.hotspot.runtime.Frame.sender(jdk.hotspot.agent@16-internal/Frame.java:207) at sun.jvm.hotspot.runtime.Frame.realSender(jdk.hotspot.agent@16-internal/Frame.java:212) at sun.jvm.hotspot.runtime.VFrame.sender(jdk.hotspot.agent@16-internal/VFrame.java:119) at sun.jvm.hotspot.runtime.VFrame.javaSender(jdk.hotspot.agent@16-internal/VFrame.java:148) at sun.jvm.hotspot.runtime.ThreadStackTrace.dumpStack(jdk.hotspot.agent@16-internal/ThreadStackTrace.java:54) at sun.jvm.hotspot.utilities.HeapHprofBinWriter.dumpStackTraces(jdk.hotspot.agent@16-internal/HeapHprofBinWriter.java:718) at sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(jdk.hotspot.agent@16-internal/HeapHprofBinWriter.java:439) - locked <0x00000000e0d29598> (a sun.jvm.hotspot.utilities.HeapHprofBinWriter) at sun.jvm.hotspot.tools.JMap.writeHeapHprofBin(jdk.hotspot.agent@16-internal/JMap.java:182) at sun.jvm.hotspot.tools.JMap.run(jdk.hotspot.agent@16-internal/JMap.java:97) at sun.jvm.hotspot.tools.Tool.startInternal(jdk.hotspot.agent@16-internal/Tool.java:262) at sun.jvm.hotspot.tools.Tool.start(jdk.hotspot.agent@16-internal/Tool.java:225) at sun.jvm.hotspot.tools.Tool.execute(jdk.hotspot.agent@16-internal/Tool.java:118) at sun.jvm.hotspot.tools.JMap.main(jdk.hotspot.agent@16-internal/JMap.java:176) at sun.jvm.hotspot.SALauncher.runJMAP(jdk.hotspot.agent@16-internal/SALauncher.java:331) at sun.jvm.hotspot.SALauncher$$Lambda$4/0x0000000100066040.accept(jdk.hotspot.agent@16-internal/Unknown Source) at sun.jvm.hotspot.SALauncher.main(jdk.hotspot.agent@16-internal/SALauncher.java:483) It appears to be stuck in this loop ThreadStackTrace.dumpStack(): while (f != null) { if (f.isJavaFrame()) { return (JavaVFrame) f; } f = f.sender(imprecise); <--- here } So there might be a bug in the AARCH64 VFrame walking that is similar to the one mentioned above in AARCH64CurrentFrameGuess.run() where while stack walking we end up visiting a frame we've already been at, and therefore we are in an infinite loop. I suggest you file a different bug for that and still push the fix for AARCH64CurrentFrameGuess.run().
26-06-2020

The suggested patch does not seem to solve the issue, which can still be reproduced: --- ProcessLog --- cmd: /home/phedlin/hg/jdk/build/linux-aarch64/images/jdk/bin/jhsdb -J-Xmx512m -J-XX:MaxRAMPercentage=3 -J-XX:-UseNotificationThread -J-ea -J-esa jmap --binaryheap --dumpfile=/home/phedlin/hg/jdk/build/linux-aarch64/test-support/jtreg_open_test_jdk_sun_tools_jhsdb_HeapDumpTestWithActiveProcess_java/scratch/0/jhsdb.jmap.heap.1593136082286.hprof --pid=1329898 exitvalue: 134 stderr: stdout: Attaching to process ID 1329898, please wait... Debugger attached successfully. Server compiler detected. JVM version is 16-internal+0-2020-06-23-0831394.phedlin...
26-06-2020

Problem is manifested in approximately two percent (~2%) of the runs over longer series (100).
24-06-2020

Although there are core files, I don't think there was a crash since there are no hs_err files. It looks like these are forced cores due to the timeout. I found one process with the following stack trace: "main" #1 prio=5 os_prio=0 cpu=993971.52ms elapsed=1006.44s tid=0x0000fffbcc032100 nid=0x727f runnable [0x0000fffbd07fd000] java.lang.Thread.State: RUNNABLE Thread: 0x0000fffbcc032100 [0x727f] State: _at_safepoint _at_poll_safepoint 1 JavaThread state: _thread_blocked at sun.jvm.hotspot.types.basic.BasicTypeDataBase.findDynamicTypeForAddress(jdk.hotspot.agent@16-internal/BasicTypeDataBase.java:330) at sun.jvm.hotspot.runtime.VirtualBaseConstructor.instantiateWrapperFor(jdk.hotspot.agent@16-internal/VirtualBaseConstructor.java:104) at sun.jvm.hotspot.utilities.GrowableArray.at(jdk.hotspot.agent@16-internal/GrowableArray.java:63) at sun.jvm.hotspot.code.CodeCache.findBlobUnsafe(jdk.hotspot.agent@16-internal/CodeCache.java:104) at sun.jvm.hotspot.code.CodeCache.findBlob(jdk.hotspot.agent@16-internal/CodeCache.java:87) at sun.jvm.hotspot.runtime.Frame.cb(jdk.hotspot.agent@16-internal/Frame.java:119) at sun.jvm.hotspot.runtime.aarch64.AARCH64Frame.adjustUnextendedSP(jdk.hotspot.agent@16-internal/AARCH64Frame.java:332) at sun.jvm.hotspot.runtime.aarch64.AARCH64Frame.<init>(jdk.hotspot.agent@16-internal/AARCH64Frame.java:126) at sun.jvm.hotspot.runtime.aarch64.AARCH64Frame.sender(jdk.hotspot.agent@16-internal/AARCH64Frame.java:294) at sun.jvm.hotspot.runtime.Frame.sender(jdk.hotspot.agent@16-internal/Frame.java:207) at sun.jvm.hotspot.runtime.aarch64.AARCH64CurrentFrameGuess.run(jdk.hotspot.agent@16-internal/AARCH64CurrentFrameGuess.java:141) at sun.jvm.hotspot.runtime.linux_aarch64.LinuxAARCH64JavaThreadPDAccess.getCurrentFrameGuess(jdk.hotspot.agent@16-internal/LinuxAARCH64JavaThreadPDAccess.java:97) at sun.jvm.hotspot.runtime.JavaThread.getCurrentFrameGuess(jdk.hotspot.agent@16-internal/JavaThread.java:267) at sun.jvm.hotspot.runtime.JavaThread.getLastJavaVFrameDbg(jdk.hotspot.agent@16-internal/JavaThread.java:229) at sun.jvm.hotspot.runtime.ThreadStackTrace.dumpStack(jdk.hotspot.agent@16-internal/ThreadStackTrace.java:54) at sun.jvm.hotspot.utilities.HeapHprofBinWriter.dumpStackTraces(jdk.hotspot.agent@16-internal/HeapHprofBinWriter.java:718) at sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(jdk.hotspot.agent@16-internal/HeapHprofBinWriter.java:439) - locked <0x00000000e0169fe8> (a sun.jvm.hotspot.utilities.HeapHprofBinWriter) at sun.jvm.hotspot.tools.JMap.writeHeapHprofBin(jdk.hotspot.agent@16-internal/JMap.java:182) at sun.jvm.hotspot.tools.JMap.run(jdk.hotspot.agent@16-internal/JMap.java:97) at sun.jvm.hotspot.tools.Tool.startInternal(jdk.hotspot.agent@16-internal/Tool.java:262) at sun.jvm.hotspot.tools.Tool.start(jdk.hotspot.agent@16-internal/Tool.java:225) at sun.jvm.hotspot.tools.Tool.execute(jdk.hotspot.agent@16-internal/Tool.java:118) at sun.jvm.hotspot.tools.JMap.main(jdk.hotspot.agent@16-internal/JMap.java:176) at sun.jvm.hotspot.SALauncher.runJMAP(jdk.hotspot.agent@16-internal/SALauncher.java:331) at sun.jvm.hotspot.SALauncher$$Lambda$4/0x0000000100066040.accept(jdk.hotspot.agent@16-internal/Unknown Source) at sun.jvm.hotspot.SALauncher.main(jdk.hotspot.agent@16-internal/SALauncher.java:483) The key frame here is AARCH64CurrentFrameGuess.run(). This is likely the same bug as JDK-8231635, which was fixed in AMD64CurrentFrameGuess.run(). The bug causes the code to get into an infinite loop as it checks each stack address to see if it represents a valid frame. Eventually you get to a stack address that is not a valid frame, but looks a bit like one, and while walking that stack eventually frame.sender(map) returns a frame you already looked at, so you end up in an infinite loop. The fix is quite simple. You just need to make sure when calling frame.sender(map), you don't get back the same frame or one that is in the wrong direction (a lower address than the current frame rather than a higher address). This is the AMD64 version of the fix: + Frame oldFrame = frame; frame = frame.sender(map); + if (frame.getSP().lessThanOrEqual(oldFrame.getSP())) { + // Frame points to itself or to a location in the wrong direction. + // Break the loop and move on to next offset. + if (DEBUG) { + System.out.println("AMD64CurrentFrameGuess.run: frame <= oldFrame: " + frame); + } + break; + }
17-06-2020