JDK-8188872 : runtime/ErrorHandling/TimeoutInErrorHandlingTest.java fails intermittently
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 10,11,12,13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_11
  • CPU: x86_64,sparc_64
  • Submitted: 2017-10-06
  • Updated: 2019-08-15
  • Resolved: 2019-06-05
Fixed in: JDK 13 (13 b24), JDK 14
Description
I have seen the following failure for this test:

----------System.out:(3/958)----------
Command line: [/work/shared/bug_hunt/8167108/SMR_prototype_10/build/solaris-x86_64-normal-server-slowdebug/images/jdk/bin/java -cp /work/shared/bug_hunt/8167108/SMR_prototype_10/JTwork_slowdebug/hotspot_jtreg_0/classes/41/runtime/ErrorHandling/TimeoutInErrorHandlingTest.d:/work/shared/bug_hunt/8167108/SMR_prototype_10/open/test/hotspot/jtreg/runtime/ErrorHandling:/work/shared/bug_hunt/8167108/SMR_prototype_10/JTwork_slowdebug/hotspot_jtreg_0/classes/41/test/lib:/work/shared/bug_hunt/8167108/SMR_prototype_10/open/test/lib:/java/re/jtreg/4.2/promoted/latest/binaries/jtreg/lib/javatest.jar:/java/re/jtreg/4.2/promoted/latest/binaries/jtreg/lib/jtreg.jar -XX:+UnlockDiagnosticVMOptions -Xmx100M -XX:ErrorHandlerTest=14 -XX:+TestUnresponsiveErrorHandler -XX:ErrorLogTimeout=16 -XX:-CreateCoredumpOnCrash -version ]
Found hs_err file. Scanning...
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s..
----------System.err:(13/916)----------
java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)
        at TimeoutInErrorHandlingTest.main(TimeoutInErrorHandlingTest.java:127)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:115)
        at java.base/java.lang.Thread.run(Thread.java:844)

JavaTest Message: Test threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)


The test failed because the hs_err_pid file was empty:

$ ls -l JTwork_slowdebug/hotspot_jtreg_0/runtime/ErrorHandling/TimeoutInErrorHandlingTest
total 586381
-rw-------   1 dcubed   green    300308990 Oct  5 12:15 core
-rw-r--r--   1 dcubed   green          0 Oct  5 12:14 hs_err_pid70.log


The test was invoked with '-XX:-CreateCoredumpOnCrash',
but there is a core file. Notice that the timestamp on the
core file is older than the timestamp on the hs_err_pid file.

For this particular failure, core file creation took too long
and hs_err_pid output did not happen.

So I see two problems here:

1) The '-XX:-CreateCoredumpOnCrash' option is ignored for some reason.
2) The hs_err_pid file may not get ANY output and we have to decide
     what that means in the context of this test.

Update: Now that I'm getting into analyzing the bug, I can see errors
in the above comments. These lines:

Found hs_err file. Scanning...
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s..

show that the hs_err file was found and that the first expected output
line was found. I have no explanation for the empty hs_err_pid file:

-rw-r--r--   1 dcubed   green          0 Oct  5 12:14 hs_err_pid70.log

Also this line:

> Notice that the timestamp on the
> core file is older than the timestamp on the hs_err_pid file.

is backwards. The core file is newer than the hs_err_pid file.
Since the core file is generated after the WatcherThread
detects that hs_err generation is taking too long, it makes
sense that the core file is newer than the hs_err_pid.
Comments
David's comment above got me thinking about the error reporting thread and whether it was really in infinite_sleep()'s call to ::sleep(), so I added one more trace line. This failure was reproduced by running TimeoutInErrorHandlingTest.java in a loop in one window and running my do_jtreg_all_configs_tests.ksh script in another window. The failure happened at loop #75.

# CreateCoredumpOnCrash turned off, no core file dumped
#
XXX - reporting_start_time_l=328136030147196
XXX - now=328136080148241
XXX - step_start_time_l=328136030337363
XXX - no time out
XXX - now=328137079228948
XXX - no time out
XXX - now=328138078270010
XXX - no time out
XXX - now=328139077304601
XXX - no time out
XXX - now=328140076339639
XXX - step timeout after 4 seconds; cur_ts=328140076357890
XXX - now=328141075382825
XXX - step timeout after 5 seconds; cur_ts=328141075403449
XXX - now=328142074423059
XXX - step timeout after 6 seconds; cur_ts=328142074441186
XXX - now=328143073465346
XXX - step timeout after 7 seconds; cur_ts=328143073483495
XXX - now=328144072513265
XXX - step timeout after 8 seconds; cur_ts=328144072542463
[timeout occurred during error reporting in step "calling expand_and_open(default_pattern)"] after 8 s.
# An error report file with more information is saved as:
# /tmp/hs_err_pid17807.log
XXX - calling infinite_sleep()
XXX - now=328145071574816
XXX - step_start_time_l=328144533085817
XXX - no time out
XXX - now=328146070641528
XXX - no time out
XXX - now=328147069682174
XXX - no time out
XXX - now=328148068723086
XXX - no time out
XXX - now=328149067762537
XXX - step timeout after 4 seconds; cur_ts=328149067796677
XXX - calling infinite_sleep()
XXX - now=328150066822713
XXX - step_start_time_l=328149067864031
XXX - no time out
XXX - now=328151065861925
XXX - no time out
XXX - now=328152064910644
XXX - global timeout after 16 seconds; cur_ts=328152064932767
# [ timer expired, abort... ]
<end cmd output>

Found hs_err file. Scanning...
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s..

<begin hs_err contents>
#
# A fatal error has been detected by the Java Runtime Environment:
[timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s.
------ Timeout during error reporting after 16 s. ------
<end hs_err contents>

The WatcherThread's detections of step timeouts are marked with "step timeout after N seconds" and 5 step timeouts are detected _before_ the following output:

[timeout occurred during error reporting in step "calling expand_and_open(default_pattern)"] after 8 s.
# An error report file with more information is saved as:
# /tmp/hs_err_pid17807.log

This is the "open/create" failure for an hs_err_pid file in the current working directory. This is the first time I've seen it without a build running. And after that message we get our first:

XXX - calling infinite_sleep()

The error reporting thread hadn't reached the steps we were trying to test, so the calls to interrupt_reporting_thread(), which calls pthread_kill(), were not "landing" in the code we expected. Oops. So we only managed to execute a single "step timeout" in the place we were targeting, and the total timeout hit, which caused us to fail the test.
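As an aside, here is a minimal standalone model of the watcher loop whose "XXX" tracing is shown above, with hypothetical names and simplified logic (the real code is VMError::check_timeout() driven from WatcherThread::run()): the reporting thread publishes a step-start timestamp, and the watcher samples it about once a second and flags a step that has exceeded its budget.

// Hypothetical sketch, not HotSpot code: models the step/global timeout checks.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<long long> step_start_ms{-1};   // written by the reporting thread at each STEP

static long long now_ms() {
  using namespace std::chrono;
  return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

static void watcher(long long step_timeout_ms, long long global_timeout_ms) {
  const long long reporting_start = now_ms();
  for (;;) {
    std::this_thread::sleep_for(std::chrono::seconds(1));   // sample about once a second
    const long long now = now_ms();
    if (now - reporting_start > global_timeout_ms) {
      std::printf("global timeout after %lld seconds\n", (now - reporting_start) / 1000);
      return;   // the real WatcherThread prints "# [ timer expired, abort... ]" and calls os::die()
    }
    const long long start = step_start_ms.load();
    if (start != -1 && now - start > step_timeout_ms) {
      std::printf("step timeout after %lld seconds\n", (now - start) / 1000);
      // the real code calls interrupt_reporting_thread() -> pthread_kill(..., SIGILL) here
    } else {
      std::printf("no time out\n");
    }
  }
}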
09-05-2019

I should have been more clear. I have not seen the failure where we stall in open() outside of my recent attempts to reproduce the failure by using a full build as the stresser. I haven't nailed it down yet, but there is something in the build that is causing stalls to filesystem access in general. Could be something I did in my RAID-ZFS setup. For the general failure that I see periodically in my lab and for the Solaris SPARC failures that we see in Mach5, the failure mode is that the WatcherThread's call to VMError::check_timeout() which calls interrupt_reporting_thread() which calls pthread_kill() is not interrupting infinite_sleep()'s call to ::sleep().
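For reference, this is the behavior the WatcherThread is relying on, shown as a small standalone POSIX demo (my own sketch, not HotSpot code, and not Solaris-specific): once a handler is installed for the signal and SA_RESTART is not set, a pthread_kill() directed at the sleeping thread should make its sleep() return early.

// Standalone sketch: a caught, directed signal normally interrupts sleep().
#include <csignal>
#include <cstdio>
#include <pthread.h>
#include <unistd.h>

static void noop_handler(int) { /* only purpose: interrupt the blocking call */ }

static void* sleeper(void*) {
  unsigned left = ::sleep(100);                 // should return early once SIGILL is caught
  std::printf("sleep() returned with %u s left\n", left);
  return nullptr;
}

int main() {
  struct sigaction sa;
  sa.sa_handler = noop_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;                              // no SA_RESTART: let the signal interrupt calls
  ::sigaction(SIGILL, &sa, nullptr);

  pthread_t tid;
  ::pthread_create(&tid, nullptr, sleeper, nullptr);
  ::sleep(2);                                   // give the sleeper time to block
  ::pthread_kill(tid, SIGILL);                  // directed signal, like interrupt_reporting_thread()
  ::pthread_join(tid, nullptr);
  return 0;
}

If Solaris sometimes fails to deliver or act on that directed signal while the target is deep in ::sleep(), the reporting thread stays wedged exactly as described above.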
09-05-2019

If the original open times out for some reason, and the SIGILL hits while still in system code then we may be reentering the system code from the signal handler - which is not guaranteed to work. We do a lot of stuff in our error handler that is not async-signal-safe so we should not be surprised that things may not work.
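To illustrate the async-signal-safety point with a deliberately broken, hypothetical pattern (not actual HotSpot code): if a signal interrupts a thread that is inside non-reentrant library code and the handler re-enters that same code, the handler can deadlock on a lock the interrupted frame already holds.

// Deliberately unsafe sketch: do NOT do this in a real signal handler.
#include <csignal>
#include <cstdlib>

static void unsafe_handler(int) {
  void* p = std::malloc(64);   // malloc() is not async-signal-safe: it may try to
  std::free(p);                // re-acquire an allocator lock held by the interrupted code
}

int main() {
  std::signal(SIGSEGV, unsafe_handler);
  // If SIGSEGV arrives while this (or another) thread is inside malloc()/free(),
  // the handler above re-enters non-reentrant library code and may hang or corrupt state.
  return 0;
}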
09-05-2019

During testing, I also observed a timeout with this set of traces:

# CreateCoredumpOnCrash turned off, no core file dumped
#
XXX - reporting_start_time_l=255928736565372
XXX - now=255928786645807
XXX - step_start_time_l=255928736820080
XXX - no time out
XXX - now=255929785772258
XXX - no time out
XXX - now=255930784871783
XXX - no time out
XXX - now=255931783944213
XXX - no time out
XXX - now=255932783032107
XXX - step timeout after 4 seconds; cur_ts=255932783068422
XXX - now=255933782139385
XXX - step timeout after 5 seconds; cur_ts=255933782177223
XXX - now=255934781248083
XXX - step timeout after 6 seconds; cur_ts=255934781284137
XXX - now=255935780351619
XXX - step timeout after 7 seconds; cur_ts=255935780390400
XXX - now=255936779465794
XXX - step timeout after 8 seconds; cur_ts=255936779503449
[timeout occurred during error reporting in step ""] after 8 s.
# An error report file with more information is saved as:
# /tmp/hs_err_pid705.log

"# CreateCoredumpOnCrash turned off, no core file dumped" is the last line printed by this report call:

    // print to screen
    if (!out_done) {
      report(&out, false);

We stall... and then print:

[timeout occurred during error reporting in step ""] after 8 s.
# An error report file with more information is saved as:
# /tmp/hs_err_pid705.log

The timeout line is printed by VMError::report_and_die() when it sees that a step timed out. The last two lines are printed by:

    // print to error log file
    if (!log_done) {
      // see if log file is already open
      if (!log.is_open()) {
        // open log file
        fd_log = prepare_log_file(ErrorFile, "hs_err_pid%p.log", buffer, sizeof(buffer), &_current_step_info);
        if (fd_log != -1) {
          out.print_raw("# An error report file with more information is saved as:\n# ");
          out.print_raw_cr(buffer);

so the prepare_log_file() call was unable to create an hs_err_pid file in the current directory and created one in /tmp instead... And somehow that process timed out at least once... I found a total of three /tmp/hs_err_pid files from my recent testing over the last two days.

Update: Ran into another of these very early timeouts:

# CreateCoredumpOnCrash turned off, no core file dumped
#
XXX - reporting_start_time_l=269403627818155
XXX - now=269403677807621
XXX - step_start_time_l=269403627983222
XXX - no time out
XXX - now=269404676921100
XXX - no time out
XXX - now=269405676018685
XXX - no time out
XXX - now=269406675093220
XXX - no time out
XXX - now=269407674168370
XXX - step timeout after 4 seconds; cur_ts=269407674191072
XXX - now=269408673252131
XXX - step timeout after 5 seconds; cur_ts=269408673272935
XXX - now=269409672329035
XXX - step timeout after 6 seconds; cur_ts=269409672350918
XXX - now=269410671417848
XXX - step timeout after 7 seconds; cur_ts=269410671455967
XXX - now=269411670513487
XXX - step timeout after 8 seconds; cur_ts=269411670536042
XXX - now=269412669611783
XXX - step timeout after 9 seconds; cur_ts=269412669644513
XXX - now=269413668704185
XXX - step timeout after 10 seconds; cur_ts=269413668726433
[timeout occurred during error reporting in step "calling expand_and_open(default_pattern)"] after 10 s.
# An error report file with more information is saved as:
# /tmp/hs_err_pid18992.log

With the additional debug info, the timeout happened:

    // Either user didn't specify, or the user's location failed,
    // so use the default name in the current directory
    if (fd == -1) {
      *info_p = "calling os::get_current_directory()";
      const char* cwd = os::get_current_directory(buf, buflen);
      if (cwd != NULL) {
        size_t pos = strlen(cwd);
        int fsep_len = jio_snprintf(&buf[pos], buflen-pos, "%s", os::file_separator());
        pos += fsep_len;
        if (fsep_len > 0) {
          *info_p = "calling expand_and_open(default_pattern)";
          fd = expand_and_open(default_pattern, buf, buflen, pos);

in the expand_and_open() call. I suspect that it is this call:

    // the O_EXCL flag will cause the open to fail if the file exists
    fd = open(buf, O_RDWR | O_CREAT | O_EXCL, 0666);

and for some reason, file creation took 10 seconds before it was interrupted, failed and then /tmp/hs_err_pid18992.log was successfully created. Puzzling...
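The create-then-fall-back behavior described above can be summarized with a small sketch (a hypothetical helper, not the actual prepare_log_file()/expand_and_open() code): try to create the log exclusively in the preferred directory and, if that open() fails for any reason, retry under /tmp.

// Hypothetical sketch of the fallback pattern; not the HotSpot implementation.
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

static int create_error_log(char* path, std::size_t path_len) {
  std::snprintf(path, path_len, "./hs_err_pid%d.log", (int) ::getpid());
  int fd = ::open(path, O_RDWR | O_CREAT | O_EXCL, 0666);   // O_EXCL: fail if the file exists
  if (fd == -1) {
    // The current directory failed (existing file, permissions, slow/remote filesystem, ...),
    // so fall back to /tmp -- the same path that produced /tmp/hs_err_pid18992.log above.
    std::snprintf(path, path_len, "/tmp/hs_err_pid%d.log", (int) ::getpid());
    fd = ::open(path, O_RDWR | O_CREAT | O_EXCL, 0666);
  }
  return fd;   // -1 if both locations failed
}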
09-05-2019

Here's the hs_err_pid for a recent failure:

<begin hs_err contents>
#
# A fatal error has been detected by the Java Runtime Environment:
[timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s.
<end hs_err contents>

It shows that we had a single step timeout at 4 seconds. The debugging messages in VMError::check_timeout() reveal the story:

XXX - reporting_start_time_l=245015758986237
XXX - now=245015808989488
XXX - step_start_time_l=245015759520830
XXX - no time out
XXX - now=245016808122759
XXX - no time out
XXX - now=245017807218580
XXX - no time out
XXX - now=245018806299283
XXX - no time out
XXX - now=245019805394603
XXX - step timeout after 4 seconds; cur_ts=245019805421291
// This is where the WatcherThread calls interrupt_reporting_thread()
// to interrupt the "hanging" step. This one generated output!
XXX - now=245020804510094
XXX - step_start_time_l=245019805656098
XXX - no time out
XXX - now=245021803690227
XXX - no time out
XXX - now=245022802853206
XXX - no time out
XXX - now=245023801953123
XXX - no time out
XXX - now=245024801042030
XXX - step timeout after 4 seconds; cur_ts=245024801067933
// This is where the WatcherThread calls interrupt_reporting_thread()
// to interrupt the "hanging" step. This one DID NOT generate output!
XXX - now=245025800100660
XXX - step_start_time_l=245019805656098
XXX - step timeout after 5 seconds; cur_ts=245025800138625
// Uh oh... the step_start_time_l value HAS NOT changed so we're
// detecting that the same step timed out at 5 seconds...
XXX - now=245026799193295
XXX - step_start_time_l=245019805656098
XXX - step timeout after 6 seconds; cur_ts=245026799232170
// The same step has timed out at 6 seconds...
XXX - now=245027798295895
XXX - step_start_time_l=245019805656098
XXX - step timeout after 7 seconds; cur_ts=245027798326840
// The same step has timed out at 7 seconds...
XXX - now=245028797379893
XXX - step_start_time_l=245019805656098
XXX - step timeout after 8 seconds; cur_ts=245028797411119
// The same step has timed out at 8 seconds...
XXX - now=245029796480457
XXX - step_start_time_l=245019805656098
XXX - step timeout after 9 seconds; cur_ts=245029796512194
// The same step has timed out at 9 seconds...
XXX - now=245030795589489
XXX - step_start_time_l=245019805656098
XXX - step timeout after 10 seconds; cur_ts=245030795634719
// The same step has timed out at 10 seconds...
XXX - now=245031794709258
XXX - global timeout after 16 seconds; cur_ts=245031794730553
// We've reached the global hs_err_pid generation timeout at 16 seconds.

The step start time is managed by this code:

src/hotspot/share/utilities/vmError.cpp:

volatile jlong VMError::_step_start_time = -1;
<snip>
void VMError::record_step_start_time() {
  const jlong now = get_current_timestamp();
  Atomic::store(now, &_step_start_time);
}

jlong VMError::get_step_start_time() {
  return Atomic::load(&_step_start_time);
}
<snip>
# define STEP(s) } if (_current_step < __LINE__) { _current_step = __LINE__; _current_step_info = s; \
  record_step_start_time(); _step_did_timeout = false;

So the lack of a new _step_start_time value strongly indicates that we never finished the second STEP macro call:

#define TIMEOUT_TEST_STEP STEP("test unresponsive error reporting step") \
  if (_verbose && TestUnresponsiveErrorHandler) { os::infinite_sleep(); }
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP

Hmmm... that tends to indicate the interrupt_reporting_thread() call that is made by each timeout that's detected in check_timeout() did not successfully interrupt the os::infinite_sleep(). Based on the debug output above, we made a total of 7 interrupt_reporting_thread() calls that didn't work (4th -> 10th seconds), so making more calls isn't going to unwedge the stuck error reporting step.

On Solaris, os::infinite_sleep() is defined like this:

src/hotspot/os/solaris/os_solaris.cpp:

// Sleep forever; naked call to OS-specific sleep; use with CAUTION
void os::infinite_sleep() {
  while (true) {            // sleep forever ...
    ::sleep(100);           // ... 100 seconds at a time
  }
}

and interrupt_reporting_thread() is defined like this:

src/hotspot/os/posix/vmError_posix.cpp:

void VMError::interrupt_reporting_thread() {
  // We misuse SIGILL here, but it does not really matter. We need
  // a signal which is handled by crash_handler and not likely to
  // occurr during error reporting itself.
  ::pthread_kill(reporter_thread_id, SIGILL);
}

I find myself wondering if Solaris' os::infinite_sleep() would recognize the pthread_kill() earlier if the sleep was shorter... Time for an experiment...
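The experiment hinted at above could look something like this (an assumption on my part, not a committed change): sleep in short slices so that, even if one slice is restarted or a signal is missed, the loop returns to a known point within about a second instead of being parked in a 100-second ::sleep().

// Hypothetical variant of os::infinite_sleep() with short slices; a sketch only.
#include <ctime>

void infinite_sleep_short_slices() {
  while (true) {
    struct timespec ts;
    ts.tv_sec = 1;                 // 1-second slices instead of ::sleep(100)
    ts.tv_nsec = 0;
    ::nanosleep(&ts, nullptr);     // a caught signal makes this return early with EINTR
  }
}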
08-05-2019

I've updated the test to include more diagnostic output when it fails:
- dump the cmd output as needed
- dump the hs_err_pid file as needed

I've also included support for recognizing '-Dverbose=true' to dump both the cmd output and the hs_err_pid file so that passing runs can also be examined.

I'm also instrumenting VMError::check_timeout() with some temporary debug output to determine where the slow down is occurring on my Solaris-X64 server. In one of my failure reproductions, I observed the first step timeout occur at 15 seconds. The step timeouts are supposed to take 1/4 of the ErrorLogTimeout value, which is 16 for this test. So this step timeout should have happened after 4 seconds (not 15). It is not yet clear if the delay is occurring in the WatcherThread's call to VMError::check_timeout(), which calls interrupt_reporting_thread(), or in the error reporting thread's call to infinite_sleep():

// TestUnresponsiveErrorHandler: We want to test both step timeouts and global timeout.
// Step to global timeout ratio is 4:1, so in order to be absolutely sure we hit the
// global timeout, let's execute the timeout step five times.
// See corresponding test in test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
#define TIMEOUT_TEST_STEP STEP("test unresponsive error reporting step") \
  if (_verbose && TestUnresponsiveErrorHandler) { os::infinite_sleep(); }
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP
TIMEOUT_TEST_STEP

I've discovered that the test is more likely to fail when I'm running JTREG tests via my do_jtreg_tests.ksh script (CONCURRENCY set to 8) or when I'm running a full build via my do_java_bld script. Both of those scripts tend to swamp my Solaris-X64 server.
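For concreteness, the timing the test expects works out as below (a back-of-the-envelope sketch using the 4:1 ratio quoted above; the constant names are mine):

#include <cstdio>

int main() {
  const int error_log_timeout_s = 16;                  // -XX:ErrorLogTimeout=16
  const int step_timeout_s = error_log_timeout_s / 4;  // 4:1 step-to-global ratio -> 4 s
  std::printf("global timeout: %d s, per-step timeout: %d s\n",
              error_log_timeout_s, step_timeout_s);
  // Five TIMEOUT_TEST_STEPs at 4 s each (20 s) guarantee the 16 s global timeout
  // fires first, which is why a first step timeout at 15 s is suspicious.
  return 0;
}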
08-05-2019

I have seen several failure modes for this test on Solaris-X64:

Missing the second pattern in the hs_err_pid file:

Found hs_err file. Scanning...
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 4 s..
----------System.err:(13/916)----------
java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)
        at TimeoutInErrorHandlingTest.main(TimeoutInErrorHandlingTest.java:127)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:835)

JavaTest Message: Test threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 1)

Missing the first pattern in the hs_err_pid file:

Found hs_err file. Scanning...
----------System.err:(13/916)----------
java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 0)
        at TimeoutInErrorHandlingTest.main(TimeoutInErrorHandlingTest.java:127)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:835)

JavaTest Message: Test threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 0)
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: hs-err file incomplete (first missing pattern: 0)

Timing out during the test, but passing during JTREG timeout processing:

Timeout refired 720 times
Found hs_err file. Scanning...
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 7 s..
Found: [timeout occurred during error reporting in step "test unresponsive error reporting step"] after 5 s..
OK.
----------System.err:(1/15)----------
STATUS:Passed.
----------rerun:(31/3974)*----------

Cannot find the hs_err_file in the test's output:

----------System.out:(1/1080)----------
Command line: [/work/shared/bug_hunt/thread_SMR_stress/jdk13_exp/build/solaris-x86_64-normal-server-slowdebug/images/jdk/bin/java -cp /work/shared/bug_hunt/thread_SMR_stress/jdk13_exp/build/solaris-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_hotspot_tier2_runtime/classes/7/runtime/ErrorHandling/TimeoutInErrorHandlingTest.d:/work/shared/bug_hunt/thread_SMR_stress/jdk13_exp/open/test/hotspot/jtreg/runtime/ErrorHandling:/work/shared/bug_hunt/thread_SMR_stress/jdk13_exp/build/solaris-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_hotspot_tier2_runtime/classes/7/test/lib:/work/shared/bug_hunt/thread_SMR_stress/jdk13_exp/open/test/lib:/work/local/jib-dcubed/install/java/re/jtreg/4.2/promoted/all/b14/bundles/jtreg_bin-4.2.zip/jtreg/lib/javatest.jar:/work/local/jib-dcubed/install/java/re/jtreg/4.2/promoted/all/b14/bundles/jtreg_bin-4.2.zip/jtreg/lib/jtreg.jar -XX:+UnlockDiagnosticVMOptions -Xmx100M -XX:ErrorHandlerTest=14 -XX:+TestUnresponsiveErrorHandler -XX:ErrorLogTimeout=16 -XX:-CreateCoredumpOnCrash -version ]
----------System.err:(15/875)----------
java.lang.RuntimeException: Did not find hs-err file in output.
        at TimeoutInErrorHandlingTest.main(TimeoutInErrorHandlingTest.java:91)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127)
        at java.base/java.lang.Thread.run(Thread.java:835)

JavaTest Message: Test threw exception: java.lang.RuntimeException: Did not find hs-err file in output.
JavaTest Message: shutting down test

STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Did not find hs-err file in output.
----------rerun:(31/3975)*----------
06-05-2019

I spotted this failure in my stress testing for 8153224 on Solaris-X64:

$ unzip -l 8153224.v2.00_solx64.8188872.zip
Archive:  8153224.v2.00_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14440  03-27-2019 14:33   8153224.v2.00_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    14440                     1 file
28-03-2019

Spotted in my jdk-13+13 stress testing on Solaris-X64:

$ unzip -l jdk-13+13_solx64.8188872.zip
Archive:  jdk-13+13_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14347  03-22-2019 14:44   jdk-13+13_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
    14329  03-22-2019 16:28   jdk-13+13_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
    14334  03-23-2019 08:54   jdk-13+13_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    43010                     3 files
24-03-2019

Spotted in my jdk-13+11 stress testing on Solaris-X64:

$ unzip -l jdk-13+11_solx64.8188872.zip
Archive:  jdk-13+11_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14345  03-09-2019 15:07   jdk-13+11_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    14345                     1 file
11-03-2019

Spotted in my jdk-13+10 stress testing on Solaris-X64:

$ unzip -l jdk-13+10_solx64.8188872.zip
Archive:  jdk-13+10_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14345  03-01-2019 16:58   jdk-13+10_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
    14346  03-02-2019 12:03   jdk-13+10_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    28691                     2 files
03-03-2019

Spotted in my jdk-12+32 stress testing on Solaris-X64:

$ unzip -l jdk-12+32_solx64.8188872.zip
Archive:  jdk-12+32_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14134  02-14-2019 22:51   jdk-12+32_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
    14132  02-15-2019 18:52   jdk-12+32_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    28266                     2 files
17-02-2019

Spotted in my jdk-12+31 stress testing on Solaris-X64:

$ unzip -l jdk-12+31_solx64.8188872.zip
Archive:  jdk-12+31_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14120  02-08-2019 21:54   jdk-12+31_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
    14119  02-09-2019 18:13   jdk-12+31_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    28239                     2 files
10-02-2019

Spotted in my jdk-12+30 stress testing on Solaris-X64:

$ unzip -l jdk-12+30_solx64.8188872.zip
Archive:  jdk-12+30_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14131  02-02-2019 05:12   jdk-12+30_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
    14119  02-03-2019 03:06   jdk-12+30_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    28250                     2 files
04-02-2019

Spotted in my jdk-12+27 stress testing on Solaris-X64:

$ unzip -l jdk-12+27_solx64.8188872.zip
Archive:  jdk-12+27_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14121  01-11-2019 20:35   jdk-12+27_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    14121                     1 file
13-01-2019

Spotted in my jdk-12+26 stress testing on Solaris-X64:

$ unzip -l jdk-12+26_solx64.8188872.zip
Archive:  jdk-12+26_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14134  01-04-2019 04:15   jdk-12+26_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
    14118  01-04-2019 07:23   jdk-12+26_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    28252                     2 files
07-01-2019

Spotted in my jdk-12+25 stress testing on Solaris-X64:

$ unzip -l jdk-12+25_solx64.8188872.zip
Archive:  jdk-12+25_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14119  12-21-2018 03:32   jdk-12+25_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
    14134  12-21-2018 00:24   jdk-12+25_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    28253                     2 files
23-12-2018

Spotted in my jdk-12+24 stress testing on Solaris-X64:

$ unzip -l jdk-12+24_solx64.8188872.zip
Archive:  jdk-12+24_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14119  12-15-2018 02:22   jdk-12+24_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    14119                     1 file
17-12-2018

Spotted in my jdk-12+23 Thread-SMR stress testing on Solaris-X64:

$ unzip -l jdk-12+23_solx64.8188872.zip
Archive:  jdk-12+23_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14121  12-06-2018 21:51   jdk-12+23_1/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
    14132  12-07-2018 18:34   jdk-12+23_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    28253                     2 files
10-12-2018

Spotted in my jdk-12+22 Thread-SMR stress testing on Solaris-X64:

$ unzip -l jdk-12+22_solx64.8188872.zip
Archive:  jdk-12+22_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14115  11-30-2018 21:42   jdk-12+22_2/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.fastdebug
---------                     -------
    14115                     1 file
02-12-2018

Spotted in my jdk-12+21 Thread-SMR stress testing on Solaris-X64:

$ unzip -l jdk-12+21_solx64.8188872.zip
Archive:  jdk-12+21_solx64.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14127  11-24-2018 14:44   jdk-12+21_3/failures.solaris-x86_64/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    14127                     1 file
26-11-2018

Here are the logs for my jdk11+19 sightings:

$ unzip -l jdk11_migrate.8188872.zip
Archive:  jdk11_migrate.8188872.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    14488  06-27-2018 11:24   jdk11_migrate_3/TimeoutInErrorHandlingTest.jtr.fastdebug
    14488  06-27-2018 07:08   jdk11_migrate_3/TimeoutInErrorHandlingTest.jtr.slowdebug
---------                     -------
    28976                     2 files
19-07-2018

This platform is no longer supported.
10-07-2018

Here's an analysis of the interesting frames:

[24] JVM_handle_solaris_signal(sig = 11, info = 0x7ffffc55e980, ucVoid = 0x7ffffc55e2d0, abort_if_unrecognized = 1), line 778 in "os_solaris_x86.cpp"

Pretty normal here in JVM_handle_solaris_signal:

src/hotspot/os_cpu/solaris_x86/os_solaris_x86.cpp:

=>  VMError::report_and_die(t, sig, pc, info, ucVoid);
    ShouldNotReachHere();

[23] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0), line 1220 in "vmError.cpp"

Just a wrapper here in VMError::report_and_die:

src/hotspot/share/utilities/vmError.cpp

=>  report_and_die(thread, sig, pc, siginfo, context, "%s", "");

[22] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, detail_fmt = 0x7fffff81648e "%s", ...), line 1214 in "vmError.cpp"

Just a bit more wrapper here in VMError::report_and_die:

src/hotspot/share/utilities/vmError.cpp

    va_list detail_args;
    va_start(detail_args, detail_fmt);
=>  report_and_die(sig, NULL, detail_fmt, detail_args, thread, pc, siginfo, context, NULL, 0, 0);

[21] VMError::report_and_die(id = 11, message = (nil), detail_fmt = 0x7fffff81648e "%s", detail_args = 0x7ffffc55de20, thread = 0x440000, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, filename = (nil), lineno = 0, size = 0), line 1394 in "vmError.cpp"

Now we're into the meat of it:

src/hotspot/share/utilities/vmError.cpp

    void VMError::report_and_die(int id, const char* message, const char* detail_fmt, va_list detail_args,
                                 Thread* thread, address pc, void* siginfo, void* context, const char* filename,
                                 int lineno, size_t size)
    <snip>
      // print to error log file
      if (!log_done) {
        // see if log file is already open
        if (!log.is_open()) {
          // open log file
          int fd = prepare_log_file(ErrorFile, "hs_err_pid%p.log", buffer, sizeof(buffer));
          if (fd != -1) {
            out.print_raw("# An error report file with more information is saved as:\n# ");
            out.print_raw_cr(buffer);
            log.set_fd(fd);
          } else {
            out.print_raw_cr("# Can not save log file, dump to screen..");
            log.set_fd(defaultStream::output_fd());
            /* Error reporting currently needs dumpfile.
             * Maybe implement direct streaming in the future.*/
            transmit_report_done = true;
          }
        }
=>      report(&log, true);

So the out.print_* calls before report() explain the few lines that we have in the hs_err_pid file.

[20] VMError::report(st = 0x7fffffafcc00, _verbose = true), line 453 in "vmError.cpp"

So this is where report() is stuck:

src/hotspot/share/utilities/vmError.cpp

    // TestUnresponsiveErrorHandler: We want to test both step timeouts and global timeout.
    // Step to global timeout ratio is 4:1, so in order to be absolutely sure we hit the
    // global timeout, let's execute the timeout step five times.
    // See corresponding test in test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
    #define TIMEOUT_TEST_STEP STEP("test unresponsive error reporting step") \
      if (_verbose && TestUnresponsiveErrorHandler) { os::infinite_sleep(); }
=>  TIMEOUT_TEST_STEP

[19] os::infinite_sleep(), line 2949 in "os_solaris.cpp"

Ummm... I should have looked at the name of the test more closely... TimeoutInErrorHandlingTest.java... I'm wondering how this test doesn't race with the timeout in the WatcherThread on a more regular basis... Yikes.
26-03-2018

I'm really curious about this part of the stack:

---- called from signal handler with signal 4 (SIGILL) ------
  [16] __nanosleep(0x7ffffc55d720, 0x7ffffc55d730, 0x7fffffa3d508, 0xefd1f8456bf29, 0x7fffff9f70c1, 0xefd1f8456bff8), at 0x7fffffd3d91a
  [17] nanosleep(), at 0x7fffffd254ac
  [18] sleep(), at 0x7fffffd25a6e
  [19] os::infinite_sleep(), line 2949 in "os_solaris.cpp"
  [20] VMError::report(st = 0x7fffffafcc00, _verbose = true), line 453 in "vmError.cpp"
  [21] VMError::report_and_die(id = 11, message = (nil), detail_fmt = 0x7fffff81648e "%s", detail_args = 0x7ffffc55de20, thread = 0x440000, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, filename = (nil), lineno = 0, size = 0), line 1394 in "vmError.cpp"
  [22] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, detail_fmt = 0x7fffff81648e "%s", ...), line 1214 in "vmError.cpp"
  [23] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0), line 1220 in "vmError.cpp"
  [24] JVM_handle_solaris_signal(sig = 11, info = 0x7ffffc55e980, ucVoid = 0x7ffffc55e2d0, abort_if_unrecognized = 1), line 778 in "os_solaris_x86.cpp"
  [25] signalHandler(sig = 11, info = 0x7ffffc55e980, ucVoid = 0x7ffffc55e2d0), line 3699 in "os_solaris.cpp"
  [26] __sighndlr(), at 0x7fffffd35586
  [27] call_user_handler(), at 0x7fffffd28a0a
  [28] sigacthandler(), at 0x7fffffd28eb3
---- called from signal handler with signal 11 (SIGSEGV) ------
  [29] crash_with_segfault(), line 1625 in "vmError.cpp"
  [30] VMError::controlled_crash(how = 14), line 1690 in "vmError.cpp"
  [31] VMError::test_error_handler(), line 1630 in "vmError.cpp"

We're nicely executing our test code, crashing the VM with a SIGSEGV, calling into JVM_handle_solaris_signal() and then into VMError::report_and_die(), which all looks very normal for this test, but then we do this:

---- called from signal handler with signal 4 (SIGILL) ------
  [16] __nanosleep(0x7ffffc55d720, 0x7ffffc55d730, 0x7fffffa3d508, 0xefd1f8456bf29, 0x7fffff9f70c1, 0xefd1f8456bff8), at 0x7fffffd3d91a
  [17] nanosleep(), at 0x7fffffd254ac
  [18] sleep(), at 0x7fffffd25a6e
  [19] os::infinite_sleep(), line 2949 in "os_solaris.cpp"
  [20] VMError::report(st = 0x7fffffafcc00, _verbose = true), line 453 in "vmError.cpp"

so something went wrong in the error reporting process and we dropped into an infinite sleep, which is why the WatcherThread timed out...
26-03-2018

$ pargs TimeoutInErrorHandlingTest/core
core 'TimeoutInErrorHandlingTest/core' of 3825:  /work/shared/bug_hunt/8167108/SMR_prototype_10/build/solaris-x86_64-normal-serv
argv[0]: /work/shared/bug_hunt/8167108/SMR_prototype_10/build/solaris-x86_64-normal-server-slowdebug/images/jdk/bin/java
argv[1]: -cp
argv[2]: /work/shared/bug_hunt/8167108/SMR_prototype_10/build/solaris-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_hotspot_tier2_runtime/classes/13/runtime/ErrorHandling/TimeoutInErrorHandlingTest.d:/work/shared/bug_hunt/8167108/SMR_prototype_10/open/test/hotspot/jtreg/runtime/ErrorHandling:/work/shared/bug_hunt/8167108/SMR_prototype_10/build/solaris-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_hotspot_tier2_runtime/classes/13/test/lib:/work/shared/bug_hunt/8167108/SMR_prototype_10/open/test/lib:/java/re/jtreg/4.2/promoted/latest/binaries/jtreg/lib/javatest.jar:/java/re/jtreg/4.2/promoted/latest/binaries/jtreg/lib/jtreg.jar
argv[3]: -XX:+UnlockDiagnosticVMOptions
argv[4]: -Xmx100M
argv[5]: -XX:ErrorHandlerTest=14
argv[6]: -XX:+TestUnresponsiveErrorHandler
argv[7]: -XX:ErrorLogTimeout=16
argv[8]: -XX:-CreateCoredumpOnCrash
argv[9]: -version

So '-XX:-CreateCoredumpOnCrash' is supposed to mean that the test does not dump a core file. However, because the WatcherThread timed out, it called abort() which generated the core file.
26-03-2018

I have another sighting of this failure mode on my Solaris-X64 server:

$ ls -ltr TimeoutInErrorHandlingTest/* TimeoutInErrorHandlingTest.jtr
-rw-r--r--   1 dcubed   green           70 Mar 26 13:47 TimeoutInErrorHandlingTest/hs_err_pid3825.log
-rw-------   1 dcubed   green    305766638 Mar 26 13:48 TimeoutInErrorHandlingTest/core
-rw-r--r--   1 dcubed   green        13036 Mar 26 13:48 TimeoutInErrorHandlingTest.jtr
-rw-r--r--   1 dcubed   green       533055 Mar 26 13:49 TimeoutInErrorHandlingTest/environment.html
-rw-r--r--   1 dcubed   green          195 Mar 26 13:49 TimeoutInErrorHandlingTest/environment.log

$ cat -n TimeoutInErrorHandlingTest/hs_err_pid3825.log
     1  #
     2  # A fatal error has been detected by the Java Runtime Environment:
     3

Here's the aborting thread:

THREAD t@79
t@79(l@79) stopped in __lwp_sigqueue at 0x7fffffd3e75a
0x00007fffffd3e75a: __lwp_sigqueue+0x000a: jae __lwp_sigqueue+0x18 [ 0x7fffffd3e768, .+0xe ]
current thread: t@79
  [1] __lwp_sigqueue(0x4f, 0x6, 0xffffc1c086b0e7e0, 0xffffffff, 0x0, 0x7fffd86ebb60), at 0x7fffffd3e75a
  [2] thr_kill(), at 0x7fffffd32f1c
  [3] raise(), at 0x7fffffce29dd
  [4] abort(), at 0x7fffffcb7ff2
=>[5] os::die(), line 1355 in "os_solaris.cpp"
  [6] WatcherThread::run(this = 0xbd1800), line 1394 in "thread.cpp"
  [7] thread_native_entry(thread_addr = 0xbd1800), line 772 in "os_solaris.cpp"
  [8] _thrp_setup(), at 0x7fffffd35221
  [9] _lwp_start(), at 0x7fffffd354c0

Here's the VMThread:

THREAD t@2
t@2(l@2) stopped in __write at 0x7fffffd3ea6a
0x00007fffffd3ea6a: __write+0x000a: jae __write+0x16 [ 0x7fffffd3ea76, .+0xc ]
current thread: t@2
  [1] __write(0x6, 0x7fffff6f6574, 0xffffc1c03775d3e0, 0x7ffffc55c510, 0x1, 0x7fffffafcc00), at 0x7fffffd3ea6a
  [2] write(), at 0x7fffffd24d28
=>[3] fdStream::write(this = 0x7fffffafcc00, s = 0x7fffff6f6574 "\n", len = 1U), line 574 in "ostream.cpp"
  [4] outputStream::cr(this = 0x7fffffafcc00), line 208 in "ostream.cpp"
  [5] VMError::report_and_die(id = 11, message = (nil), detail_fmt = 0x7fffff81648e "%s", detail_args = 0x7ffffc55c510, thread = (nil), pc = 0x7fffff13cf50 "M\x8b@^PL\x89E���L\x8bE��I\x83��", siginfo = (nil), context = 0x7ffffc55c720, filename = (nil), lineno = 0, size = 0), line 1334 in "vmError.cpp"
  [6] VMError::report_and_die(thread = (nil), sig = 11U, pc = 0x7fffff13cf50 "M\x8b@^PL\x89E���L\x8bE��I\x83��", siginfo = (nil), context = 0x7ffffc55c720, detail_fmt = 0x7fffff81648e "%s", ...), line 1214 in "vmError.cpp"
  [7] VMError::report_and_die(thread = (nil), sig = 11U, pc = 0x7fffff13cf50 "M\x8b@^PL\x89E���L\x8bE��I\x83��", siginfo = (nil), context = 0x7ffffc55c720), line 1220 in "vmError.cpp"
  [8] crash_handler(sig = 11, info = (nil), ucVoid = 0x7ffffc55c720), line 130 in "vmError_posix.cpp"
  [9] __sighndlr(), at 0x7fffffd35586
  [10] call_user_handler(), at 0x7fffffd28a0a
  [11] sigacthandler(), at 0x7fffffd28eb3
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [12] crash_handler(sig = 4, info = (nil), ucVoid = 0x7ffffc55cfb0), line 122 in "vmError_posix.cpp"
  [13] __sighndlr(), at 0x7fffffd35586
  [14] call_user_handler(), at 0x7fffffd28a0a
  [15] sigacthandler(), at 0x7fffffd28eb3
  ---- called from signal handler with signal 4 (SIGILL) ------
  [16] __nanosleep(0x7ffffc55d720, 0x7ffffc55d730, 0x7fffffa3d508, 0xefd1f8456bf29, 0x7fffff9f70c1, 0xefd1f8456bff8), at 0x7fffffd3d91a
  [17] nanosleep(), at 0x7fffffd254ac
  [18] sleep(), at 0x7fffffd25a6e
  [19] os::infinite_sleep(), line 2949 in "os_solaris.cpp"
  [20] VMError::report(st = 0x7fffffafcc00, _verbose = true), line 453 in "vmError.cpp"
  [21] VMError::report_and_die(id = 11, message = (nil), detail_fmt = 0x7fffff81648e "%s", detail_args = 0x7ffffc55de20, thread = 0x440000, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, filename = (nil), lineno = 0, size = 0), line 1394 in "vmError.cpp"
  [22] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0, detail_fmt = 0x7fffff81648e "%s", ...), line 1214 in "vmError.cpp"
  [23] VMError::report_and_die(thread = 0x440000, sig = 11U, pc = 0x7fffff13ba82 "A\x88", siginfo = 0x7ffffc55e980, context = 0x7ffffc55e2d0), line 1220 in "vmError.cpp"
  [24] JVM_handle_solaris_signal(sig = 11, info = 0x7ffffc55e980, ucVoid = 0x7ffffc55e2d0, abort_if_unrecognized = 1), line 778 in "os_solaris_x86.cpp"
  [25] signalHandler(sig = 11, info = 0x7ffffc55e980, ucVoid = 0x7ffffc55e2d0), line 3699 in "os_solaris.cpp"
  [26] __sighndlr(), at 0x7fffffd35586
  [27] call_user_handler(), at 0x7fffffd28a0a
  [28] sigacthandler(), at 0x7fffffd28eb3
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [29] crash_with_segfault(), line 1625 in "vmError.cpp"
  [30] VMError::controlled_crash(how = 14), line 1690 in "vmError.cpp"
  [31] VMError::test_error_handler(), line 1630 in "vmError.cpp"
  [32] JNI_CreateJavaVM_inner(vm = 0x7ffffc55ef60, penv = 0x7ffffc55ef58, args = 0x7ffffc55eec0), line 3963 in "jni.cpp"
  [33] JNI_CreateJavaVM(vm = 0x7ffffc55ef60, penv = 0x7ffffc55ef58, args = 0x7ffffc55eec0), line 4012 in "jni.cpp"
  [34] InitializeJVM(pvm = 0x7ffffc55ef60, penv = 0x7ffffc55ef58, ifn = 0x7ffffc55ef68), line 1478 in "java.c"
  [35] JavaMain(_args = 0xffff80ffbfffe890), line 411 in "java.c"
  [36] _thrp_setup(), at 0x7fffffd35221
  [37] _lwp_start(), at 0x7fffffd354c0

So we have the VMThread trying to write out an hs_err_pid file, and we have the WatcherThread calling os::die()... Here's that part of the WatcherThread:

src/hotspot/share/runtime/thread.cpp:

void WatcherThread::run() {
  if (VMError::is_error_reported()) {
    // A fatal error has happened, the error handler(VMError::report_and_die)
    // should abort JVM after creating an error log file. However in some
    // rare cases, the error handler itself might deadlock. Here periodically
    // check for error reporting timeouts, and if it happens, just proceed to
    // abort the VM.

    // This code is in WatcherThread because WatcherThread wakes up
    // periodically so the fatal error handler doesn't need to do anything;
    // also because the WatcherThread is less likely to crash than other
    // threads.

    for (;;) {
      // Note: we use naked sleep in this loop because we want to avoid using
      // any kind of VM infrastructure which may be broken at this point.
      if (VMError::check_timeout()) {
        // We hit error reporting timeout. Error reporting was interrupted and
        // will be wrapping things up now (closing files etc). Give it some more
        // time, then quit the VM.
        os::naked_short_sleep(200);
        // Print a message to stderr.
        fdStream err(defaultStream::output_fd());
        err.print_raw_cr("# [ timer expired, abort... ]");
        // skip atexit/vm_exit/vm_abort hooks
        os::die();
      }

      // Wait a second, then recheck for timeout.
      os::naked_short_sleep(999);
    }
  }

So the VMThread was trying to generate an hs_err_pid file, took too long, and the WatcherThread timed out and killed the VM... There is no mention of "timer expired" in the .jtr file so I'm going to guess that something in the test or test harness swallowed that error message.
26-03-2018

[~dcubed], if you determine that this failure isn't due to SMR changes, please move it to 11 if it seems like a timing error in the test.
09-10-2017

Of course this could be due to an SMR-specific problem as well.
08-10-2017

Sorry, I already started a new run for this weekend...
07-10-2017

This suggests to me that attempting to write the hs_err log triggered an abort that was processed by the OS directly. Can you get a stack from the core file?
06-10-2017