JDK-8249150 : SA core file tests sometimes time out on OSX with "java.io.IOException: App waiting timeout"
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc
  • Affected Version: 16
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: os_x
  • CPU: x86_64
  • Submitted: 2020-07-09
  • Updated: 2021-05-19
  • Resolved: 2020-08-04
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 16
16 b09Fixed
Related Reports
Relates :  
Relates :  
Description
The following test failed in the JDK16 CI:

serviceability/sa/ClhsdbFindPC.java#id3
serviceability/sa/ClhsdbFindPC.java#id1

Both tests report a SIGSEGV in Java_jdk_test_lib_apps_LingeredApp_crash
but I think that crash is intentionally caused by the test.

Here's a snippet from the ClhsdbFindPC.java#id3 log file:

#section:main
----------messages:(4/189)----------
command: main ClhsdbFindPC false true
reason: User specified action: run main/othervm/timeout=480 ClhsdbFindPC false true 
Mode: othervm [/othervm specified]
elapsed time (seconds): 445.85
----------configuration:(0/0)----------
----------System.out:(9/1828)----------
Starting the ClhsdbFindPC test
Command line: ['/scratch/mesos/jib-master/install/jdk-16+6-168/macosx-x64-debug.jdk/jdk-16/fastdebug/bin/java' '-XX:MaxRAMPercentage=12' '-Djava.io.tmpdir=/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/tmp' '-Xint' '-XX:+CreateCoredumpOnCrash' '-Djava.library.path=/scratch/mesos/jib-master/install/jdk-16+6-168/macosx-x64-debug.test/hotspot/jtreg/native' '-cp' '/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/classes/0/serviceability/sa/ClhsdbFindPC_id3.d:/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/classes/0/test/lib' 'LingeredAppWithTrivialMain' '3bf7de6b-8edf-411d-8bf7-7014944a9bd0.lck' 'forceCrash']
[2020-07-09T08:13:08.986673Z] Gathering output for process 13944
[2020-07-09T08:13:09.053042Z] Waiting for completion for process 13944
[2020-07-09T08:13:09.053225Z] Waiting for completion finished for process 13944
Output and diagnostic info for process 13944 was saved into 'pid-13944-output.log'
[2020-07-09T08:13:09.247708Z] Waiting for completion for process 13944
[2020-07-09T08:13:09.247843Z] Waiting for completion finished for process 13944
Run test with ulimit -c: unlimited
----------System.err:(21/1339)----------
java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
	at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:155)
	at ClhsdbFindPC.main(ClhsdbFindPC.java:167)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.io.IOException: App waiting timeout
	at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:264)
	at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:294)
	at jdk.test.lib.apps.LingeredApp.startAppExactJvmOpts(LingeredApp.java:416)
	at jdk.test.lib.apps.LingeredApp.startApp(LingeredApp.java:431)
	at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:90)
	... 7 more

JavaTest Message: Test threw exception: java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
----------rerun:(33/6540)*----------


Here's a snippet from the ClhsdbFindPC.java#id1 log file:

#section:main
----------messages:(4/188)----------
command: main ClhsdbFindPC true true
reason: User specified action: run main/othervm/timeout=480 ClhsdbFindPC true true 
Mode: othervm [/othervm specified]
elapsed time (seconds): 457.343
----------configuration:(0/0)----------
----------System.out:(9/1829)----------
Starting the ClhsdbFindPC test
Command line: ['/scratch/mesos/jib-master/install/jdk-16+6-168/macosx-x64-debug.jdk/jdk-16/fastdebug/bin/java' '-XX:MaxRAMPercentage=12' '-Djava.io.tmpdir=/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/tmp' '-Xcomp' '-XX:+CreateCoredumpOnCrash' '-Djava.library.path=/scratch/mesos/jib-master/install/jdk-16+6-168/macosx-x64-debug.test/hotspot/jtreg/native' '-cp' '/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/classes/1/serviceability/sa/ClhsdbFindPC_id1.d:/scratch/mesos/slaves/52628e90-e5e7-4ef3-af97-10d8776d10db-S992/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1ca523ef-9cef-4c75-97f6-121eaacfa189/runs/130452c2-8592-4adc-9789-a0fa066e52a0/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier1_serviceability/classes/1/test/lib' 'LingeredAppWithTrivialMain' '87856dbf-33fd-454d-b89e-01435a05e80c.lck' 'forceCrash']
[2020-07-09T08:12:57.448369Z] Gathering output for process 13928
[2020-07-09T08:12:57.580371Z] Waiting for completion for process 13928
[2020-07-09T08:12:57.580672Z] Waiting for completion finished for process 13928
Output and diagnostic info for process 13928 was saved into 'pid-13928-output.log'
[2020-07-09T08:12:57.657282Z] Waiting for completion for process 13928
[2020-07-09T08:12:57.657499Z] Waiting for completion finished for process 13928
Run test with ulimit -c: unlimited
----------System.err:(21/1339)----------
java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
	at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:155)
	at ClhsdbFindPC.main(ClhsdbFindPC.java:167)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.io.IOException: App waiting timeout
	at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:264)
	at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:294)
	at jdk.test.lib.apps.LingeredApp.startAppExactJvmOpts(LingeredApp.java:416)
	at jdk.test.lib.apps.LingeredApp.startApp(LingeredApp.java:431)
	at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:88)
	... 7 more

JavaTest Message: Test threw exception: java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Test ERROR java.io.IOException: App waiting timeout
----------rerun:(33/6541)*----------

A curious observation: both tests specify /timeout=480
and both tests timeout before 480 seconds. That's weird
and that timeout value should have been multiplied by
the timeoutFactor which is usually 4.
Comments
URL: https://hg.openjdk.java.net/jdk/jdk/rev/15b9b4b21054 User: cjplummer Date: 2020-08-04 00:40:51 +0000
04-08-2020

I think I understand what is going on here. We already know that sometimes it takes a very long time to create a core file on OSX. Usually it is fast, but for some unknown reason it will occasionally take a very long time. See JDK-8214797 which was fixed by giving the test a 32 minute timeout. However, a longer test timeout won't work with these failures since the issue is with the timeout in waitAppReady() which is 100 seconds times the timeoutfactor, so normally 400 seconds for our testing. This code wills still timeout after 400 seconds no matter what the test timeout is set to. One thing I wasn't so sure of is why with JDK-8214797 the test did not time out until after the core file was finished dumping (no matter how long the core file dump took), whereas in these failures the timeout happens before the core file is finished dumping, always reliably timing out after 400 seconds. I think the answer is that for this failure waitAppReady() does periodic checks for the timeout, in a loop that does short Thread.sleep() calls. So it doesn't have to wait for the core file to finish dumping before throwing an exception. In the case of JDK-8214797, it's just launching a process that should crash, and waiting for it to exit, but the exit does not officially happen until after the core file is dumped, so you end up waiting beyond the timeout set for the test. I think we can make waitAppReady() wait for the core file to dump and not bother with the checking for timeout, or at the very least make the timeout exceedingly long, like 30 minutes. This way it would never throw "java.io.IOException: App waiting timeout " while waiting for the core dump. However, that will then leave the overall test timeout as the issue, just like with JDK-8214797. For it the ClhsdbJmapCoreMetaspace.java timeout was bumped to 480 seconds, but it's a special case that uses a lot of memory before crashing. ClhsdbJmapCore.java is a similar test and sets the timeout to 240 seconds and never seems to have a timeout issue, so this is a good baseline to start with for all our tests that generate core file. ClhsdbFindPC.java already uses a timeout of 480 seconds. I'm not sure why and that seems excessive, but that should be good enough to protect against timeouts due to long core dump times. That leaves ClhsdbPstack.java and ClhsdbPmap.java. They do not set the timeout, so should probably also use 240 seconds.
30-07-2020

I ran 50 times on random hosts. 48 of the runs were 1-3 minutes, with lower times being more common. There were two that were around 5 minutes. All passes. I ran 5 times on the host in question. Times were 9-12 minutes, so something is going on with this host that makes it run this test slowly. However, all the runs passed. I'm not sure why they didn't see the same issue as described in this CR.
09-07-2020

Regarding this part of the stack trace: Caused by: java.io.IOException: App waiting timeout at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:264) at jdk.test.lib.apps.LingeredApp.waitAppReady(LingeredApp.java:294) waitAppReady() is passed a timeout of 100s, which is adjusted by the timeout factor of 4x for a total of 400s, or a bit less than 7 minutes. These tests are running about 460s, so that is inline with the timeout. waitAppReady() is sitting in a loop waiting for the spawned LingeredApp process to indicate it's readiness by touching the lock file. That is never going to happen when the LingeredApp "forceCrash" feature is used since LingeredApp.main() forces the crash before touching the lock file: if (forceCrash) { System.loadLibrary("LingeredApp"); // location of native crash() method crash(); } while (Files.exists(path)) { // Touch the lock to indicate our readiness setLastModified(theLockFileName, epoch()); Thread.sleep(spinDelay); } This is on purpose because waitAppReady() is suppose to bail once the app terminates for any reason, including crashing: // Make sure process didn't already exit if (!appProcess.isAlive()) { if (forceCrash) { return; // This is expected. Just return. } else { throw new IOException("App exited unexpectedly with " + appProcess.exitValue()); } } So after 7 minutes of running, appProcess.isAlive() is still returning true even though the app has crashed. My conclusion is that this is due to the periodic issue we see on macs with core file generation taking a long time. The process probably is not marked as no longer alive until the core dump concludes. Since this happened to both the #id1 and #id3 runs of this test, it looks like an issue with the specific host rather than just an occasional random problem. I'll try more runs on this host and see what the results are. If we need a longer timeout, setting the test "timeout" value won't help since in this case we are dealing with a specific timeout used by LingeredApp. Possibly we just need to bump it when generating core files, especially on OSX.
09-07-2020