Bug ID: JDK-8250637 UseOSErrorReporting times out (on Mac and Linux)

JDK-8250637 : UseOSErrorReporting times out (on Mac and Linux)

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 14

Priority: P4
Status: Resolved
Resolution: Fixed

Submitted: 2020-07-27
Updated: 2024-12-09
Resolved: 2020-11-05

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 16
16 b24Fixed

Related Reports

Blocks :	JDK-8237727 - Mac: after we handle a crash, Apple's crash reporter is left with incorrect state
CSR :	JDK-8255571 - UseOSErrorReporting times out (on Mac and Linux)
Cloners :	JDK-8250782 - UseOSErrorReporting causes stack space exception (Win)
Relates :	JDK-6794885 - Improve Windows unhandled structured exception reporting - UseOSErrorReporting
Relates :	JDK-8251438 - Issues with our POSIX set_signal_handler()
Relates :	JDK-8237727 - Mac: after we handle a crash, Apple's crash reporter is left with incorrect state
Relates :	JDK-4997835 - RFE: crash dump will only be created when running w/ -XX:+ShowMessageBoxOnError
Relates :	JDK-6227246 - Improve Windows unhandled structured exception reporting

Description

Turning UseOSErrorReporting halts the VM process during crash report, eventually times out with message: "[ timer expired, abort... ]"

# java -XX:+UseOSErrorReporting -cp /Volumes/Work/snippets Crasher
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010c115e64, pid=52840, tid=9731
#
# JRE version: Java(TM) SE Runtime Environment (14.0.1+7) (build 14.0.1+7)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000110d15e64, pid=53519, tid=5891
#
# JRE version: Java(TM) SE Runtime Environment (14.0.1+7) (build 14.0.1+7)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.0.1+7, mixed mode, sharing, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# V  [libjvm.dylib+0x715e64]  _ZN16RawAccessBarrierILy128EE14store_internalILy128EiEEN8EnableIfIXsr12HasDecoratorIXT_EXL_ZL11MO_VOLATILEEEEE5valueEvE4typeEPvT0_+0x4
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/gerard/hs_err_pid53519.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
# [ timer expired, abort... ]

Comments

Changeset: ba2ff3a6 Author: Gerard Ziemski <gziemski@openjdk.org> Date: 2020-11-05 16:15:27 +0000 URL: https://github.com/openjdk/jdk/commit/ba2ff3a6
05-11-2020
I understand what you are trying to do. I still think returning from VMError::report_and_die() is inherently dangerous in ways in which our normal signal handler shenanigans are not. But to make this easier I'll continue the discussion in the code review for the PR https://github.com/openjdk/jdk/pull/813. (BTW very nice analysis writeup, that helped a lot understanding this).
24-10-2020
Nothing is guaranteed with signal handlers. This is the best effort to let the OS catch the crash and create its own crash log - see https://bugs.openjdk.java.net/browse/JDK-8237727 for more relevant details. Summary: 1. it appears to works just fine when I was investigating JDK-8237727 2. core dump is not the point here, the native crash log is a) using abort() will create core dump, but it exits the process with error code 0, which makes the OS think it exited cleanly and will not produce native crash log b) using exit() will cease the process at a different stack frame and will look confusing in the native crash log c) letting the process crash again may not be guaranteed to work, but in my experiments it works (on macOS) and is best effort, that in my opinion is worth it 3. The fix presented is pretty small and not a big effort, but with a big payoff, i.e. native crash log that includes more and different details than out hs_err log (for example back trace of all native threads on macOS)
23-10-2020
I am not convinced that this switch is really useful in its current form (writing the hs-err file, then attempting to re-crash at the same point) or worth the code complexity. First, there is no guarantee that the crash actually happens second time around. The crash may have been timing sensitive. It happened ages ago, and the VM state may be a completely different now. Another thing, if the crash does not happen a second time, now we continue to run with a corrupted VM, which I dislike. I certainly would not use this in production. Also, we now run without any signal handlers, which will give this VM a limited lifetime. Second, presumably the point of this would be to get a core dump at the crash site in cases where we believe to be better served with a core dump close to the crash site than with a hs-err file. If this is the case, why not just abort()? What may be more useful is to bypass error reporting altogether and just crash fast and hard as close to the crash site as possible. After the regular handler ran, but before error handling. Since error handling is very extensive and and may muddle the picture. We actually have exactly such a switch in our internal JVMs and I have used it quite often.
22-10-2020
It turns out that we have a similar infinite loop issue on Windows too. The difference is that Windows process gets signaled with "out of stack space" exception after around 280 loops and terminates soon after. Otherwise it would keep spinning for 2 minutes and be interrupted by the thread watcher, just like macOS/Linux The good news then is that the same fix will then apply to all 3 architectures.
20-10-2020
On Windows OS we have: LONG WINAPI crash_handler(struct _EXCEPTION_POINTERS* exceptionInfo) { DWORD exception_code = exceptionInfo->ExceptionRecord->ExceptionCode; VMError::report_and_die(NULL, exception_code, NULL, exceptionInfo->ExceptionRecord, exceptionInfo->ContextRecord); return EXCEPTION_CONTINUE_SEARCH; } where EXCEPTION_CONTINUE_SEARCH means "The process is being debugged, so the exception should be passed (as second chance) to the application's debugger." and since "VMError::report_and_die()" does NOT die if "UseOSErrorReporting==true", it seems the control is returned to the "debugger", which in this case I assume is the OS itself, which then reports on the crash itself. On Mac and Linux we also do not die in "VMError::report_and_die()", because "UseOSErrorReporting==true", but we return the control to the process itself, which will crash again in same spot and we are stuck in infinite loop...
25-09-2020
On Windows it probably works like this: #1 catch signal in our handler #2 generate hs_err log #3 forward the crash to Windows asking it to take over the crash Need to verify how #3 is actually done on Windows. On Mac/Linux it would look more like this: #1 catch signal in our handler #2 generate hs_err log #3 turn off our signal handler #4 continue the process normally, allowing it to crash again in the same spot, with the same signal being generated So we would not generate a secondary (different from the original) signal, but we would generate a second "same" signal, which now would be handled by the OS. That seems to be working just fine as proved by JDK-8237727 (at least on Mac). That "The behavior of a process is undefined after it returns normally from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(), sigqueue(), or raise()." is worrying, but after we generate hs_err log, we are done, so everything after that is our best effort (as is our signal handler tbh) and if we can get a nice native crash log then great, if not, however, then we don't loose anything.
16-09-2020
I'm unclear on the intent of VMError::reset_signal_handlers, but it does seem to reset all signals to our crash handler, which is not what we would want in the UseOSErrorReporting case, if we actually have to trigger a secondary signal. But triggering a secondary signal would result in that signal being processed by the OS, not the original, so this doesn't seem like it can work. (In contrast on Windows if we catch the exception and process it and then allow it to continue, that will will involve the Windows error reporting functionality.) If VMError::report_and_die() returns due to +UseOSErrorReporting, then our CrashHandler signal-handling function will also return and we enter undefined-behaviour territory: "The behavior of a process is undefined after it returns normally from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(), sigqueue(), or raise()." I think signal chaining also complicates this story.
30-07-2020
From JDK-6227246 "I implemented Windows-only flag -XX+UseOSErrorReporting which allows us instead of running of our crash handler and dying, forward exception handling to the OS in case of actual crash." But it seems that disabled hs_err generation and so a later bug wanted it added back - JDK-6794885 - which was closed as a dup of JDK-4997835. It is unclear to me why this was ever allowed on non-Windows systems, and even less clear if it ever actually did anything useful there. But the current situation seems akin to the original motivation for adding the flag on Windows - we want the hs_err reporting to take place but we also want to allow the native OS crash reporting to get involved.
30-07-2020
We should skip ShouldNotReachHere() with UseOSErrorReporting, that's the point of the flag - to exit the signal handler and let the native crash reporter catch that signal the next time around. That's one (minor) bug here. The major bug is with unregistering our signal handler - if that's what reset_signal_handlers() is supposed to do (though the name makes it sounds like it "rearms" the signal handler), then yes that's where the problem is.
29-07-2020
Is it a bug that we hit ShouldNotReachHere() with UseOSErrorReporting? Is it a bug in reset_signal_handlers() ?
29-07-2020
Here is what happens currently from what I can tell: #1 crash #2 JVM_handle_bsd_signal() catches it, calls VMError::report_and_die() #3 VMError::report_and_die() reports it #3.1 sets _first_error_tid (creates log, ...) #3.2 calls reset_signal_handlers() and returns (skipping abort() and exit() because UseOSErrorReporting) #4 JVM_handle_bsd_signal() reaches ShouldNotReachHere(), which calls report_vm_error(), which calls VMError::report_and_die() #5 VMError::report_and_die() reports it (again) #5.1 checks _first_error_tid and since it is set and UseOSErrorReporting returns #6 JVM_handle_bsd_signal() returns #7 crash_handler() now catches the signal, calls VMError::report_and_die() #8 VMError::report_and_die() reports it (again) #8.1 checks _first_error_tid and since it is set and UseOSErrorReporting returns #9 crash_handler() returns #10 crash_handler() now catches the signal, calls VMError::report_and_die() #11 VMError::report_and_die() reports it (again) #11.1 checks _first_error_tid and since it is set and UseOSErrorReporting returns #12 crash_handler() returns and we get stuck here in the loop signal handled, thrown, caught ... Whatever we do (if anything, reset_signal_handlers?) to unset the crash handler does not seem to be working as we keep re-catching the signal by crash_handler() after JVM_handle_bsd_signal() initially caught it.
28-07-2020
The fix is to unregister our signal handlers, i.e. instead of: if (UseOSErrorReporting && log_done) return; we need something like: if (UseOSErrorReporting && log_done) { struct sigaction info, newhandler; sigemptyset(&(newhandler.sa_mask)); newhandler.sa_handler = SIG_DFL; for (int sig=SIGHUP; sig<=SIGUSR2; sig++) { sigaction(sig, &newhandler, &info); } return; } No idea why the current code works correctly on Windows, but on macOS and Linux it was probably broken from the start.
28-07-2020
ILW = MLM = P4
28-07-2020
Also times out on Linux. On Windows 10 it takes a few seconds longer than otherwise, but I don't see any other differences - I thought that the option would involve the WER (Windows Error Reporting) The option looks to be useless to me, hangs on Mac/Linux, doesn't do anything useful on Windows.
28-07-2020
I thought UseOSErrorReporting simply allowed report_and_die to return rather than calling exit/abort. So control would return to the signal handler and I'm not sure what happens after that when the signal handler completes ... possibly we re-fault and so re-enter report_and_die, but now a previous error has already been reported so we might end up in one of the paths that does os::infinite_sleep, which then allows the watchdog thread to do the abort when it times out.
27-07-2020
Found out while investigating how UseOSErrorReporting affects my fix for 8237727 (thank you to David Holmes for pointing the flag out)
27-07-2020