Bug ID: JDK-8328305 Move the existing error handling code out of process

Type: Enhancement
Component: hotspot
Sub-Component: runtime

Priority: P4
Status: Closed
Resolution: Won't Fix
OS: os_x
CPU: aarch64

Submitted: 2024-03-16
Updated: 2024-03-28
Resolved: 2024-03-19

According to "man sigaction":

"All functions not in the above lists are considered to be unsafe with respect to signals. That is to say, the behaviour of such functions when called from a signal handler is undefined. In general though, signal handlers should do little more than set a flag; most other actions are not safe."
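
The "set a flag" pattern the man page recommends can be sketched in C. The handler only stores the signal number in a `volatile sig_atomic_t`, and all real work happens later in ordinary code, where stdio, malloc and locks are usable again. This is a minimal illustration, not HotSpot code. The names `pending_signal`, `install_flag_handler` and `process_pending_signal` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

// volatile sig_atomic_t is the type the C standard guarantees can be
// safely assigned from inside a signal handler.
static volatile sig_atomic_t pending_signal = 0;

// The handler does nothing except record which signal arrived.
static void flag_only_handler(int sig) {
  pending_signal = sig;
}

// Installs the flag-only handler for one signal; returns 0 on success.
static int install_flag_handler(int sig) {
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = flag_only_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(sig, &sa, NULL);
}

// Called later from normal (non-signal) context. Returns the signal that
// was pending, or 0 if none, and clears the flag.
static int process_pending_signal(void) {
  int sig = pending_signal;
  if (sig != 0) {
    pending_signal = 0;
    printf("handling signal %d outside the handler\n", sig);
  }
  return sig;
}
```

This works for asynchronous signals such as SIGUSR1 or SIGTERM. It does not fit a synchronous crash signal like SIGSEGV or SIGBUS. Returning from such a handler typically re-executes the faulting instruction, so there is no "later" in which to do the work. That is why crash reporting ends up running inside the handler.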

So most of our existing error handling code is out of compliance here.

An additional consequence of moving it out of process is that other parts of the VM would benefit, such as our current WX (write XOR execute) memory protection mechanism, which requires a non-null thread.

For WX memory protection, we could instead use a statically created pthread TLS mechanism to keep track of the per-thread state.
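
One reading of "statically created pthread TLS" is a single process-wide `pthread_key_t`, created once, with each thread's WX mode stored in its own slot. No VM-level `Thread` object is then needed. The sketch below makes several hypothetical choices: `wx_mode`, `wx_set_mode`, `wx_current_mode` and the `WX_EXEC` default are all illustrative. The sketch only records the mode and does not change any memory protection. Note also that `pthread_getspecific` is not on the POSIX async-signal-safe list.

```c
#include <pthread.h>
#include <stdint.h>

// Hypothetical per-thread WX state; not HotSpot's actual types.
typedef enum { WX_WRITE = 1, WX_EXEC = 2 } wx_mode;

static pthread_key_t wx_key;
static pthread_once_t wx_key_once = PTHREAD_ONCE_INIT;

static void create_wx_key(void) {
  // No destructor: the stored value is a small integer, not heap memory.
  (void)pthread_key_create(&wx_key, NULL);
}

// Records the calling thread's mode. A real implementation would also flip
// the hardware state here (e.g. pthread_jit_write_protect_np on macOS/aarch64).
static void wx_set_mode(wx_mode mode) {
  pthread_once(&wx_key_once, create_wx_key);
  pthread_setspecific(wx_key, (void *)(uintptr_t)mode);
}

// Returns the calling thread's recorded mode; an unset slot reads as NULL,
// which this sketch treats as WX_EXEC (an assumed default).
static wx_mode wx_current_mode(void) {
  pthread_once(&wx_key_once, create_wx_key);
  uintptr_t v = (uintptr_t)pthread_getspecific(wx_key);
  return v == 0 ? WX_EXEC : (wx_mode)v;
}

// Demo helper: reports the mode that a freshly started thread observes.
static void *report_mode_in_new_thread(void *out) {
  *(wx_mode *)out = wx_current_mode();
  return NULL;
}
```

Because each thread has its own slot, switching one thread to `WX_WRITE` leaves other threads unaffected. The state is also available even for threads the VM never attached.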

Every time someone tries to expand the error reporting, there is pushback against it. There are restrictions on what can be done in the crash handler (e.g. no memory allocation, no TLS, no multithreaded processing), so I had to rearrange code to make sure that we report the most important information first, in case the error reporting dies. We also can't optimize using native threads, for example by processing parts of the log in parallel. In the case of SIGBUS, even if we were the parent process of the Java process, we still couldn't stop it, but at least we could report back to the user that it happened. In my opinion, there are reasons to think about this more.
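
The constraint of reporting the most important information first, and carrying on if a step dies, can be illustrated with an ordered list of report steps. A step index survives re-entry. This is a simplified, single-threaded sketch, not HotSpot's implementation, and the step names are hypothetical. Here `longjmp` only simulates a step dying; in a real crash handler, the equivalent would be a secondary signal re-entering the reporting code.

```c
#include <setjmp.h>
#include <stdio.h>

typedef void (*report_step)(void);

// Re-entry point used to simulate "error reporting died in this step".
static jmp_buf reenter_reporting;
static int next_step = 0;        // static, so it keeps its value across longjmp
static int steps_completed = 0;

static void print_signal_info(void) { puts("signal info"); steps_completed++; }
static void print_registers(void)   { puts("registers");   steps_completed++; }
static void print_native_stack(void) {
  // Simulated secondary failure: abandon this step without finishing it.
  longjmp(reenter_reporting, 1);
}
static void print_heap_summary(void) { puts("heap summary"); steps_completed++; }

// Ordered from most to least important, so a failure late in the list
// still leaves the essential information in the log.
static const report_step steps[] = {
  print_signal_info, print_registers, print_native_stack, print_heap_summary,
};

static void report_error(void) {
  (void)setjmp(reenter_reporting);   // after a (simulated) failure, resume here
  while (next_step < (int)(sizeof steps / sizeof steps[0])) {
    // Advance the index *before* running the step, so a step that fails
    // is skipped on re-entry instead of being retried forever.
    report_step step = steps[next_step++];
    step();
  }
}
```

Calling `report_error()` prints the signal info, registers and heap summary. The stack step is abandoned and skipped. The key design choice is incrementing `next_step` before running a step, which means one bad step cannot stop the rest of the report.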
28-03-2024
We're not going to rewrite how we report errors via the hs_err_pid file so that it is delegated to an external process. We have the ServiceabilityAgent, which can read core files, and we are actively working on a replacement for it due to its limitations. This issue is closed as WNF (Won't Fix).
28-03-2024
Starting a second process for every JVM is simply not feasible/practical. Nor is writing something that acts as a "custom debugger". We do not have a problem with crash handling that would warrant such an investigation and potential re-architecture of the VM.
28-03-2024
I don't mean using ShowMessageBoxOnError (which, by the way, does not work on macOS) to supplement the error reporting. I mean: have another process with debugger-like functionality and use it, instead of the current crash-gathering mechanism, to produce the hs_err* log file. The entire point here is to reduce the amount of work we currently have to do inside the signal handler (the handling of real VM-bug crashes). We would start the debugger process at the same time as the actual Java process, in parallel, so as not to affect the startup performance of the user process. I would like us to have a real conversation about this and actually consider the effort, the benefits, and the hows.
27-03-2024
That sounds like what ShowMessageBoxOnError attempts to do with the platform debugger, but that happens at the end of error reporting. And of course there is absolutely no guarantee that you will be able to exec a debugger from the crash handler.
27-03-2024
A debugger like gdb or lldb can read a process's registers without a problem. Our out-of-process reporter could be a custom, debugger-like process that collects all the needed information and reports it on behalf of the crashed Java process. We could even use it to execute code in the crashed VM to collect data such as NMT info. I think this can and should be done.
26-03-2024
I don't see how this is really feasible either. At the point in the signal handler where we would initiate crash reporting we would have to somehow "freeze" the crashing process, communicate to the other process that we have crashed, and that other process then has to interrogate the crashed process to get the information for the hs_err log. How can it do all that in a safe and reliable manner?
20-03-2024
Our hs_err reporting collects information that is available in the process at the time of the crash: registers, stack trace, etc. Trying to do this from another process, which then has to read the failing process, is not worth the risk of not getting an error report at all. We are careful about what hs_err reporting collects. Every now and then a step fails, but there is mitigation for this in the error reporting code so that as much useful information as possible is printed. There are user-defined commands that one can run during error reporting, but most of error reporting needs to be in-process to be reliable:

  product(ccstrlist, OnError, "",                                           \
          "Run user-defined commands on fatal error; see VMError.cpp "      \
          "for examples")                                                   \
                                                                            \
  product(ccstrlist, OnOutOfMemoryError, "",                                \
          "Run user-defined commands on first java.lang.OutOfMemoryError "  \
          "thrown from JVM")                                                \
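
For context, the `java` launcher documentation describes `-XX:OnError` as taking one or more semicolon-separated commands, where `%p` stands for the current process ID. A value such as `"gcore %p;gdb -p %p"` therefore yields two commands. The hypothetical helper below sketches only that splitting and substitution step. It is not HotSpot's parser and does not run anything. Skipping empty entries and silently truncating over-long commands are assumptions of this sketch.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: copies the next ';'-separated command from *cursor
// into buf, replacing each "%p" with pid. Returns 1 if a command was
// produced, 0 when the list is exhausted.
static int next_on_error_command(const char **cursor, long pid,
                                 char *buf, size_t buflen) {
  const char *p = *cursor;
  while (*p == ';') p++;                      // skip empty entries
  if (*p == '\0' || buflen == 0) { *cursor = p; return 0; }
  size_t out = 0;
  while (*p != '\0' && *p != ';') {
    if (p[0] == '%' && p[1] == 'p') {
      int n = snprintf(buf + out, buflen - out, "%ld", pid);
      if (n < 0 || (size_t)n >= buflen - out) break;  // would not fit: stop
      out += (size_t)n;
      p += 2;
    } else {
      if (out + 1 >= buflen) break;           // keep room for the terminator
      buf[out++] = *p++;
    }
  }
  buf[out] = '\0';
  while (*p != '\0' && *p != ';') p++;        // drop whatever did not fit
  *cursor = p;
  return 1;
}
```

Each returned command would then be handed to a shell by the VM. That part is deliberately left out here.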
19-03-2024
Ideally, I imagine, anything involved in producing our hs_err crash log file. It may not be possible to do all of it, but maybe we could split off at least some of the more dangerous information-gathering code? By "out of process" I imagine that, when the main Java process starts, we could start a background Java process watcher with some sort of communication channel (a pipe?) that "handles" the process interrogation and crash log production on behalf of the crashed user Java process.
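
The watcher-plus-pipe idea can be sketched in POSIX C, but only the notification and "freeze" part. The handler writes one byte to a pipe and then blocks until the watcher acknowledges; both `write()` and `read()` are on the POSIX async-signal-safe list. How the watcher would actually read the crashed process's registers and memory is left as a comment. All names are hypothetical. `SIGUSR1` stands in for a real crash signal so that the sketch can run safely.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int notify_fd = -1;  // VM -> watcher: "crashed with this signal"
static int ack_fd = -1;     // watcher -> VM: "report written, continue dying"

// Uses only write() and read(), which are async-signal-safe.
static void crash_handler(int sig) {
  unsigned char msg = (unsigned char)sig;
  if (write(notify_fd, &msg, 1) == 1) {
    unsigned char ack;
    // Blocking here "freezes" the crashing thread while the watcher works.
    (void)read(ack_fd, &ack, 1);
  }
}

// Watcher body; always exits. A real watcher would inspect the parent here
// (e.g. via ptrace or Mach task APIs) and write the hs_err file.
static void watcher_main(int in_fd, int out_fd) {
  unsigned char msg;
  if (read(in_fd, &msg, 1) != 1) _exit(0);  // parent exited without crashing
  unsigned char ack = 1;
  (void)write(out_fd, &ack, 1);
  _exit(msg);  // exit status carries the reported signal (for the demo)
}

// Forks the watcher once at startup; returns its pid, or -1 on error.
static pid_t start_watcher(void) {
  int to_watcher[2], from_watcher[2];
  if (pipe(to_watcher) != 0 || pipe(from_watcher) != 0) return -1;
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    close(to_watcher[1]);
    close(from_watcher[0]);
    watcher_main(to_watcher[0], from_watcher[1]);
    _exit(0);  // not reached: watcher_main always exits
  }
  close(to_watcher[0]);
  close(from_watcher[1]);
  notify_fd = to_watcher[1];
  ack_fd = from_watcher[0];
  return pid;
}

static int install_crash_handler(int sig) {
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = crash_handler;
  sigemptyset(&sa.sa_mask);
  return sigaction(sig, &sa, NULL);
}
```

Forking the watcher at startup means the crash path never has to create a process. The sketch does not answer the harder question raised in the discussion: how the watcher would reliably interrogate the frozen process.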
19-03-2024
What is "out of process" error handling?
19-03-2024
There would be no need to raise any concerns if it were done out of process, and everyone would benefit from more flexible crash reports.
18-03-2024
We are very aware that a lot of what we do in signal handlers with regard to error reporting is not guaranteed to work, and indeed it often doesn't. This is why I raise concerns every time we add more stuff to hs_err files. But the general philosophy there is that it is better to get additional crash information most of the time, even if we can't always get it. And if something is discovered to be too problematic for crash reporting, then we change it.
17-03-2024

Relates :	JDK-8301403 - Eliminate memory allocations during signal handling
Relates :	JDK-8327860 - Java processes get killed, leaving no hs_err/stack trace on macOS 14.4