JDK-8227275 : Within native OOM error handling, assertions may hang the process
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 10,11,12,13,14
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-07-04
  • Updated: 2021-02-11
  • Resolved: 2019-07-11
Fix versions:
  • JDK 11: 11.0.10 (Fixed)
  • JDK 13: 13 (Resolved)
  • JDK 14: 14 b06 (Fixed)
Description
Summary: on native OOM, we may fail to disarm the assertion poison page; this can lead to an endless loop during error handling if an assertion fires in a native OOM scenario.

--

When an assert happens, we touch a poison page to receive the current ucontext for error analysis. That works like this:

assert ->
touch assertion poison page (immediately, in the same frame, with as little code as possible running after evaluating the assert condition) ->
bang! enter signal handler ->
in signal handling, copy ucontext ->
and disable poison page ->
return from signal handler, which brings us back to the same load that triggered the original fault ->
touch the poison page again. It is disarmed now, so this is a no-op ->
continue handling the assertion.
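
The flow above can be sketched in a few lines of C++ for Linux. This is a minimal illustration under stated assumptions, not HotSpot's code: every name in it (g_poison_page, poison_handler, arm_poison_page, touch_poison_page) is hypothetical, the touch is modelled as a load, and the handler ignores the mprotect result on purpose because that is the behaviour this report is about. Calling mprotect from a signal handler is not guaranteed async-signal-safe by POSIX; the sketch assumes it works on Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <signal.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

// All names below are hypothetical; they are not HotSpot identifiers.
static char* g_poison_page = nullptr;
static size_t g_page_size = 0;
static ucontext_t g_saved_context;               // context captured at the assert
static volatile sig_atomic_t g_poison_hits = 0;  // how often the handler ran

static void poison_handler(int sig, siginfo_t* info, void* ucontext) {
  char* addr = static_cast<char*>(info->si_addr);
  if (addr >= g_poison_page && addr < g_poison_page + g_page_size) {
    // Copy the ucontext for later error analysis (shallow copy).
    std::memcpy(&g_saved_context, ucontext, sizeof(g_saved_context));
    g_poison_hits = g_poison_hits + 1;
    // Disarm the page. The result is ignored here, mirroring the flow above.
    (void)mprotect(g_poison_page, g_page_size, PROT_READ | PROT_WRITE);
    return;  // the faulting load is re-executed and now succeeds
  }
  // Not the poison page: restore the default action so the retried
  // instruction terminates the process like a real crash.
  signal(sig, SIG_DFL);
}

// Map one inaccessible page and install the SIGSEGV handler.
bool arm_poison_page() {
  g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, g_page_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;
  g_poison_page = static_cast<char*>(p);

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = poison_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

// What an assert would do right after its condition failed.
void touch_poison_page() {
  volatile char* p = g_poison_page;
  char c = *p;  // faults while the page is armed, plain load afterwards
  (void)c;
}
```

Calling arm_poison_page() once and touch_poison_page() twice should enter the handler exactly once: the first touch faults and disarms the page, the second is an ordinary load.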

In case of a native OOM, this may fail: the mprotect call used to disarm the poison page may return ENOMEM (this depends on the OS, but can happen e.g. on Linux when switching from PROT_NONE to PROT_READ|PROT_WRITE). That leaves the poison page armed.
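
A small helper makes the failure visible instead of dropping it. The sketch below is an assumption-laden illustration: try_disarm is a hypothetical name, and the ENOMEM shown in the usage note is provoked through an unmapped range (the case POSIX documents), because the memory-pressure case from this report is hard to reproduce deterministically.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: try to make the page accessible again.
// Returns 0 on success, otherwise the errno value set by mprotect.
int try_disarm(void* page, size_t len) {
  if (mprotect(page, len, PROT_READ | PROT_WRITE) == 0) {
    return 0;
  }
  return errno;  // e.g. ENOMEM; the page is still armed in that case
}
```

On Linux, try_disarm returns 0 for a freshly mapped PROT_NONE page and ENOMEM once that page has been unmapped with munmap. A caller that sees a non-zero result knows the poison page is still armed and must not simply return to the faulting instruction.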

The chance of this happening in a normal assertion scenario (an OOM hitting out of the blue just when we hit an assert and attempt to disarm the poison page) is astronomically small.

However, this may happen as a result of an OOM elsewhere, which could trigger a follow-up assertion. Then this happens:

... OOM! ...
...
assert -> 
touch assert poison page -> 
bang! enter signal handler ->
in signal handling, copy ucontext ->
and disable poison page - but that fails! ->
current code does not care, returns to asserting code, to the same opcode -> 
again touch assert poison. -> 
enter signal handler -> 
repeat...
...

This is an endless loop: since it consumes no stack space, it can go on forever, and since we effectively disable signal handling, the error handler timeout does not seem to work either. The process hangs.

Most native OOM situations in HotSpot are handled cleanly: they are either handled explicitly by the caller or they enter error handling via VMError::report_vm_out_of_memory(). This means that an assertion following a native OOM most likely happens during error handling. This slightly changes the picture above:

... OOM! ...
...
assert -> 
touch assert poison page -> 
bang! enter secondary signal handler (crash_handler() in vmError_posix.cpp) -> 
in signal handling, copy ucontext ->
and disable poison page - but that fails! ->
current code does not care, returns to asserting code, to the same opcode -> 
again touch assert poison. -> 
enter secondary signal handler (crash_handler() in vmError_posix.cpp) -> 
repeat...
...


One simple fix could be to just switch off the assertion poison page after entering VMError::report_and_die(). We do not need it from that point on, since we do not care (much) about secondary asserts or asserts happening in parallel threads.

Also, when we fail to disarm the poison page, we should not just return from the signal handler. Since we cannot do much else, we should proceed as if this were a real crash. This will "hide" the assert behind a SIGSEGV, which can be confusing if one does not closely examine the call stack, but it is still better than the process hanging.
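
Both proposals can be sketched as follows. This is not the pushed patch; it is an illustration with hypothetical names (PoisonPage, protect_fn, handle_poison_fault, enter_error_reporting). The protection call is passed in as a function pointer only so that a failing mprotect can be simulated, and redirecting the touch target to ordinary memory is just one assumed way of switching the poison page off without calling mprotect at all.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical signature matching mprotect(addr, len, prot).
using protect_fn = int (*)(void*, size_t, int);

// Hypothetical bookkeeping for the assertion poison page.
struct PoisonPage {
  char* base;          // start of the poison page
  size_t size;         // its length in bytes
  char* touch_target;  // address that asserts touch
  bool armed;          // true while touching is expected to fault
};

// Called from the signal handler. Returns true only if the fault hit the
// armed poison page AND the page was disarmed, so returning from the
// handler is safe. Returns false if the caller should proceed as for a
// real crash, which includes the case where disarming failed.
bool handle_poison_fault(PoisonPage& pp, const void* fault_addr,
                         protect_fn protect) {
  const char* a = static_cast<const char*>(fault_addr);
  if (!pp.armed || a < pp.base || a >= pp.base + pp.size) {
    return false;  // not an assertion touch
  }
  if (protect(pp.base, pp.size, PROT_READ | PROT_WRITE) != 0) {
    return false;  // e.g. ENOMEM: do not return to the faulting opcode
  }
  pp.armed = false;
  return true;
}

// Called once when error reporting starts. From here on asserts touch
// ordinary writable memory, so no further poison faults can occur.
void enter_error_reporting(PoisonPage& pp, char* dummy) {
  pp.touch_target = dummy;
  pp.armed = false;
}
```

A real handler would pass mprotect itself as the protect argument. When handle_poison_fault returns false for an address inside the poison page, the assert shows up as a SIGSEGV in the error report, which is the trade-off described above.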


Comments
Fix request (13u) Requesting backport to 13u for parity with 11u. The patch applies cleanly. Tested with tier1.
11-02-2021

I assume patch applies cleanly.
30-11-2020

Fix Request:
Important to fix because: prevents possible process hangs on assert
Nature of fix: disarms assertion poison page when an error happens
Low risk because: disabling the assertion poison page is harmless
Testing done: Fix has been active in head and in our 11u builds since July 19
30-11-2020

URL: https://hg.openjdk.java.net/jdk/client/rev/3243c42d737d User: psadhukhan Date: 2019-07-24 07:24:52 +0000
24-07-2019

URL: https://hg.openjdk.java.net/jdk/jdk/rev/3243c42d737d User: stuefe Date: 2019-07-11 04:58:02 +0000
Fix was pushed while main bug was targeted to '13'. Reset the main bug to fixed in '14' and copied the hgupdater entry here.
11-07-2019

Hi David, I saw it happen on Ubuntu 16.4. I limited the process space with ulimit -d and ran against it. I believe the OS swaps out protected pages to save memory? See e.g. os::uncommit_memory on Linux, where we set PROT_NONE for what I believe is the same reason. One thing I could try is just write-protecting the page instead of using PROT_NONE; that may have a different effect. But all in all, I think I will disable assertion poison pages once we enter error handling. I even have nagging doubts about whether this feature is all that useful. Some of our compiler devs like it, but OTOH it is confusing to see a SIGSEGV pop up when you debug an assert...
05-07-2019

I'm not familiar with low-level OS memory management but find it hard to understand how we can get ENOMEM when changing the protection bits of a page that already exists and is mapped and protected? We're not trying to allocate memory or even map more memory, we're just changing the bits on an existing page. ???
From the POSIX mprotect docs:
The mprotect() function shall fail if:
[ENOMEM] Addresses in the range [addr,addr+len) are invalid for the address space of a process, or specify one or more pages which are not mapped.
[ENOMEM] The prot argument specifies PROT_WRITE on a MAP_PRIVATE mapping, and it would require more space than the system is able to supply for locking the private pages, if required.
---
I don't see how either of those conditions apply here? That said, regardless of what should happen if Linux does actually do this then of course we have this problem.
04-07-2019