JDK-8252533 : Signal handlers should run with synchronous error signals unblocked
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 16
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2020-08-29
  • Updated: 2020-11-10
  • Resolved: 2020-11-03
JDK 16: 16 b23 (Fixed)
Related Reports
Blocks :  
Relates :  
Relates :  
Relates :  
Description
This is a continuation of JDK-8065895 [1].

When a signal which cannot be deferred (SIGFPE, SIGILL, SIGSEGV, SIGBUS) is generated while its delivery is blocked, bad things happen. This is undefined territory, and we have observed the following cases:

- on Linux, the default handler is invoked instead of the user handler, which in the case of error signals causes the process to core immediately.
- on both AIX and PASE, the process just becomes unresponsive and hangs.
- on HPUX, one of our internal platforms, the process just vanishes without a trace.

I did not test other platforms but would guess similar things happen there.
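
The Linux observation can be reproduced outside of HotSpot with a small C program. This is only a sketch under stated assumptions, not HotSpot code: run_fault_child and the exit codes are made-up names, and the fault is produced by writing to a PROT_NONE page obtained from mmap. With the error signals unblocked, the user handler is expected to run; with them blocked, Linux is expected to apply the default action and kill the child. Since the blocked case is undefined by POSIX, it should only be tried on Linux.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical exit codes, chosen only for this sketch. */
enum { EXIT_HANDLER_RAN = 42, EXIT_NO_FAULT = 44 };

/* User handler: if it is ever reached, leave with a recognizable exit code. */
static void on_fault(int sig) {
    (void)sig;
    _exit(EXIT_HANDLER_RAN);
}

/*
 * Hypothetical helper: forks a child that installs on_fault for SIGSEGV and
 * SIGBUS, optionally blocks both, then triggers a genuine memory fault.
 * Returns the wait status of the child.
 */
static int run_fault_child(int block_error_signals) {
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_fault;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);

        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, SIGSEGV);
        sigaddset(&set, SIGBUS);
        sigprocmask(block_error_signals ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);

        /* An inaccessible page: any store to it is a synchronous fault. */
        volatile char *page = mmap(NULL, 4096, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) _exit(EXIT_NO_FAULT);
        page[0] = 1;
        _exit(EXIT_NO_FAULT); /* only reached if the store did not fault */
    }
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    (void)waited;
    return status;
}
```

With run_fault_child(0) the status should report a normal exit with code 42. With run_fault_child(1) on Linux the status should report termination by SIGSEGV, i.e. the user handler was bypassed.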

The POSIX documentation [4] states:
"If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated while they are blocked, the result is undefined, unless the signal was generated by the kill() function, the sigqueue() function, or the raise() function."

At the moment, undeferrable error signals are unblocked outside the signal handler (see hotspot sigmask) and, since JDK-8065895, inside the error handler (see crash_handler setup). This leaves a window in which the hotspot signal handler is running but has not yet decided to invoke fatal error handling. Inside that window, error signals are still blocked on every platform except AIX. So any crash inside that window tears down the VM immediately without giving us a useful hs-err file.

On AIX they are not blocked because a while ago we added an AIX-only patch which unblocks them at the entrance of the AIX signal handler. This was before we contributed the port to OpenJDK, so there is no history for it in the official repos. But that behavior makes sense for all POSIX platforms.
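
What unblocking at the handler entrance buys us can be sketched in plain C. This is an illustration under assumptions, not the actual HotSpot or AIX code: all names (handler, run_nested_fault_child, the g_ variables, the exit codes) are hypothetical. The handler is installed with a full sa_mask, so every signal is blocked while it runs. It then crashes a second time on purpose. If it first unblocks the synchronous error signals, the second fault re-enters the handler; if not, on Linux the process is expected to be killed by the default action.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical exit codes, chosen only for this sketch. */
enum { EXIT_NESTED_HANDLED = 43, EXIT_NO_FAULT = 44 };

static volatile char *g_fault_page;   /* inaccessible page used to crash */
static sigset_t g_sync_signals;       /* SIGFPE, SIGILL, SIGSEGV, SIGBUS */
static int g_unblock_at_entry;        /* behavior under test */
static volatile sig_atomic_t g_depth; /* handler nesting depth */

static void handler(int sig) {
    (void)sig;
    /* Second entry: the nested fault was delivered to us. */
    if (++g_depth >= 2) _exit(EXIT_NESTED_HANDLED);
    /* First entry: optionally unblock the synchronous error signals. */
    if (g_unblock_at_entry) sigprocmask(SIG_UNBLOCK, &g_sync_signals, NULL);
    g_fault_page[0] = 1;              /* simulated bug inside the handler */
    _exit(EXIT_NO_FAULT);             /* only reached if nothing faulted */
}

/* Forks a child that crashes once outside and once inside the handler. */
static int run_nested_fault_child(int unblock_at_entry) {
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        g_unblock_at_entry = unblock_at_entry;
        sigemptyset(&g_sync_signals);
        sigaddset(&g_sync_signals, SIGFPE);
        sigaddset(&g_sync_signals, SIGILL);
        sigaddset(&g_sync_signals, SIGSEGV);
        sigaddset(&g_sync_signals, SIGBUS);
        sigprocmask(SIG_UNBLOCK, &g_sync_signals, NULL);

        g_fault_page = mmap(NULL, 4096, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (g_fault_page == MAP_FAILED) _exit(EXIT_NO_FAULT);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = handler;
        sigfillset(&sa.sa_mask);      /* block everything inside the handler */
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGBUS, &sa, NULL);

        g_fault_page[0] = 1;          /* first fault, outside the handler */
        _exit(EXIT_NO_FAULT);
    }
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    (void)waited;
    return status;
}
```

run_nested_fault_child(1) should report a normal exit with code 43, meaning the nested fault reached the handler. run_nested_fault_child(0) on Linux should report termination by a signal, which corresponds to the window described above.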

For more details see discussion from Nov 2014 [2][3].

(Side note: these effects only show up for truly synchronous error signals. You cannot create such a scenario artificially, e.g. by raising SIGSEGV with kill.)

[1] https://bugs.openjdk.java.net/browse/JDK-8065895
[2] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2014-November/013346.html
[3] https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2015-January/013718.html
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/sigprocmask.html

Comments
Changeset: e7a2d5c8 Author: Thomas Stuefe <stuefe@openjdk.org> Date: 2020-11-03 07:16:45 +0000 URL: https://github.com/openjdk/jdk/commit/e7a2d5c8
03-11-2020

Thanks for digging up the various POSIX references. It makes no sense to ever block synchronous signals.
26-10-2020

Found the relevant comment in the Posix documentation to pthread_sigmask (https://pubs.opengroup.org/onlinepubs/009695399/functions/sigprocmask.html): <quote> If any of the SIGFPE, SIGILL, SIGSEGV, or SIGBUS signals are generated while they are blocked, the result is undefined, unless the signal was generated by the kill() function, the sigqueue() function, or the raise() function. </quote>
23-10-2020

> I'm not clear what these "undeferrable signals" are as it seems a term not used by POSIX.
POSIX uses the term "deferring" to describe signals which are to be delivered in a delayed fashion (e.g. after sighold()). It does not talk about "undeferrable". What I mean is signals which cannot be delivered in a delayed fashion. Do you know a better word?
POSIX does, though, treat synchronous error signals specially throughout its documentation, see e.g. https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04_03
<quote> SIG_IGN: Delivery of the signal shall have no effect on the process. The behavior of a process is undefined after it ignores a SIGFPE, SIGILL, SIGSEGV, or SIGBUS signal that was not generated by kill(), sigqueue(), or raise(). </quote>
which to me is similar to blocking such a signal. Or:
<quote> The behavior of a process is undefined after it returns normally from a signal-catching function for a SIGBUS, SIGFPE, SIGILL, or SIGSEGV signal that was not generated by kill(), sigqueue(), or raise(). </quote>
> I'm also unclear what you expect to happen by unblocking these signals and thus allowing recursive entry to the signal handler function (which would seem to explain why AIX hangs).
AIX hangs if I keep the signal *blocked*. Unblocking these signals is what prevents that.
> You said yourself it is a small window, so within that window wouldn't the occurrence of a synchronous signal indicate a condition that is very likely to repeat should we start executing the handler again?
Not necessarily, since the path taken at the first invocation of the signal handler is different from that of the follow-up invocation. The first invocation is triggered by a signal inside hotspot, which may be a semantically non-error signal like a polling page access. We handle that in the signal handler, before invoking the error handling. Let's say we now trigger a second crash in that signal handler.
Now, if the crash signal was unblocked, we re-enter the signal handler with a different signal info and probably bypass all that stuff and go straight to VMError::report_and_die(). Yes, recursion may still happen, but for platforms != Linux which do stupid things like hang or vanish, this would still be better. Recursive signal handler invocation would probably trigger a stack overflow at some point and give us a normal core file.
23-10-2020

I'm not clear what these "undeferrable signals" are as it seems a term not used by POSIX. I'm also unclear what you expect to happen by unblocking these signals and thus allowing recursive entry to the signal handler function (which would seem to explain why AIX hangs). You said yourself it is a small window, so within that window wouldn't the occurrence of a synchronous signal indicate a condition that is very likely to repeat should we start executing the handler again? I don't object to unblocking them as by then we are so far out on a limb it doesn't really matter, but I'm not seeing why you expect this to somehow be better?
23-10-2020

Okay, rewrote the description to be more precise and shorter. Hope that is clearer.
23-10-2020

@dholmes: No, this is not restricted to AIX. I could not find any standard documentation for the handling of undeferrable signals yet. I tested this on Linux. Crashing with an undeferrable signal when that signal is blocked invokes the default signal handler, so for error signals we core immediately. This behavior is a bit more sane and reasonable than what I have seen on AIX (process gets unresponsive and hangs) and HPUX (process just vanishes), but it still prevents us from getting an hs-err file. This is regardless of whether we are inside a signal handler or not. Since we unblock error signals in normal state and, since JDK-8065895, also inside the error handler, the section left is the standard signal handler. There we only unblock on AIX, but we should unblock on all platforms.
23-10-2020

Is the synopsis of this issue incorrect: "Signal handlers should run with synchronous error signals blocked" should be: "Signal handlers should NOT run with synchronous error signals blocked"? It seems to me that we are in OS defined behaviour territory here either way, and I don't see anything that suggests that using the same approach as on AIX will actually make anything better on non-AIX.
23-10-2020

> when a signal handler is entered with a signal mask blocking all signals
When do we do that? Ah now I see it: sigfillset(&(sigAct.sa_mask));
31-08-2020