Bug ID: JDK-8283935 Parallel: Crash during pretouch after large pages allocation failure

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 18,19

Priority: P2
Status: Resolved
Resolution: Fixed

Submitted: 2022-03-30
Updated: 2025-06-04
Resolved: 2022-04-06

Versions (Unresolved/Resolved/Fixed)

JDK 19: 19 b17 (Fixed)

Related Reports

Duplicate :	JDK-8259496 - Parallel GC insists on using large pages although it failed to use them
Relates :	JDK-8259496 - Parallel GC insists on using large pages although it failed to use them
Relates :	JDK-8346005 - Parallel: Incorrect page size calculation with UseLargePages
Relates :	JDK-8272807 - Permit use of memory concurrent with pretouch
Relates :	JDK-8298642 - ParallelGC -XX:+UseNUMA eden spaces allocated on wrong node
Relates :	JDK-8324817 - Parallel GC does not pre-touch all heap pages when AlwaysPreTouch enabled and large page disabled

Description

Seen this in many configurations in current testing:

$ build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+AlwaysPreTouch -Xmx2g -XX:+UseLargePages -XX:LargePageSizeInBytes=1g -XX:+UseParallelGC
OpenJDK 64-Bit Server VM warning: Failed to reserve and commit memory. req_addr: 0x0000000080000000 bytes: 2147483648 page size: 1073741824 (errno = 12).
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f7f557f7cda, pid=3987099, tid=3987101
#
# JRE version:  (19.0) (fastdebug build )
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 19-internal+0-adhoc.shade.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, parallel gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x1600cda]  os::pretouch_memory(void*, void*, unsigned long)+0x1ea

Bisection points to JDK-8272807.

I believe this is due to the large pages allocation failure. The new pretouching code still rounds the start address down to the requested large page size, even though the heap ended up backed by smaller pages, so it touches memory outside the heap bounds. This looks like a problem with Parallel GC (and maybe other collectors), but not G1.

See, adding this assert:

    for ( ; true; cur += page_size) {
      assert(cur >= start, "sanity: " PTR_FORMAT " in " PTR_FORMAT " ... " PTR_FORMAT, p2i(cur), p2i(start), p2i(end));
      Atomic::add(reinterpret_cast<int*>(cur), 0, memory_order_relaxed);
      if (cur >= last) break;
    }

...crashes with:

#  Internal Error (/home/shade/trunks/jdk/src/hotspot/share/runtime/os.cpp:1766), pid=4062942, tid=4062944
#  assert(cur >= start) failed: sanity: 0x00000000c0000000 in 0x00000000d5600000 ... 0x00000000f5800000
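
The addresses in the assert message can be checked by hand. The following is a minimal standalone sketch, not HotSpot code: `align_down` and `first_touch_in_bounds` are hypothetical helpers written for illustration, and the only values taken from this report are the range start and the 1g page size. It shows that rounding the start down to 1 GiB yields exactly the out-of-range address in the message.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t M = 1024ULL * 1024ULL;  // 1 MiB
constexpr std::uint64_t G = 1024ULL * M;        // 1 GiB

// Round addr down to a multiple of alignment (alignment must be a power of two).
constexpr std::uint64_t align_down(std::uint64_t addr, std::uint64_t alignment) {
  return addr & ~(alignment - 1);
}

// Hypothetical check: does rounding `start` down to `page_size` stay at or above `start`?
constexpr bool first_touch_in_bounds(std::uint64_t start, std::uint64_t page_size) {
  return align_down(start, page_size) >= start;
}

// Range start copied from the assert message in this report.
constexpr std::uint64_t kReportedStart = 0xd5600000ULL;

// Rounding down to 1 GiB gives the `cur` value printed by the assert.
static_assert(align_down(kReportedStart, G) == 0xc0000000ULL,
              "1 GiB rounding lands below the range start");
static_assert(!first_touch_in_bounds(kReportedStart, G),
              "a 1 GiB page size puts the first touch out of bounds");
// The start is 2 MiB aligned, so a 2 MiB page size would have stayed in bounds.
static_assert(first_touch_in_bounds(kReportedStart, 2 * M),
              "a 2 MiB page size keeps the first touch in bounds");
```

The checks are compile-time only, so the file compiles cleanly exactly when the arithmetic holds.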

Comments

It seems the problem with Parallel using the wrong page size in some places is already known: JDK-8259496.
08-04-2022
Changeset: b56df280
Author: Thomas Schatzl <tschatzl@openjdk.org>
Date: 2022-04-06 08:01:47 +0000
URL: https://git.openjdk.java.net/jdk/commit/b56df2808d79dcc1e2d954fe38dd84228c683e8b
06-04-2022
A pull request was submitted for review.
URL: https://git.openjdk.java.net/jdk/pull/8090
Date: 2022-04-04 11:07:09 +0000
04-04-2022
Other collectors are fine, and so is Parallel GC in a NUMA setup. As indicated by the related bug, this is a JDK 19 issue only (verified).
04-04-2022
Some more reproduction details:

$ java -XX:+AlwaysPreTouch -Xms2g -Xmx2g -XX:+UseLargePages -XX:LargePageSizeInBytes=1g -Xlog:pagesize,gc+heap=debug -XX:+UseParallelGC Hello

This was run with no 1g large pages allocated, but with a few 2m pages (fewer than needed to cover the heap). If enough 2m pages are available, the wrong rounding shown above happens not to crash because the heap is aligned properly (by chance); this may be an artifact of the machine used, though.
04-04-2022
When reserving the heap (via a ReservedSpace object), if large page allocation fails then it falls back to the default page size. The actual page size used is recorded in the ReservedSpace object. But when ParallelGC uses the ReservedSpace object to initialize its generations, it doesn't use the page_size from the ReservedSpace object. Instead it directly checks UseLargePages and the corresponding page size. That's wrong; it should be getting the page size from the ReservedSpace. Spot-checking other collectors, I think they are properly getting the actual page size for the heap from a ReservedSpace object.
01-04-2022
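
The fallback described in the comment above can be illustrated with a small model. This is a sketch under stated assumptions, not HotSpot's `ReservedSpace` API: `Reservation`, `reserve`, `page_size_from_flags`, and `page_size_from_reservation` are hypothetical names, and the `large_pages_available` parameter simply stands in for the OS accepting or refusing the large page request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a reserved heap range; not HotSpot's ReservedSpace.
struct Reservation {
  std::size_t size;
  std::size_t page_size;  // page size actually backing the range
};

// Try large pages first; fall back to the default page size if the request fails.
// `large_pages_available` models the OS outcome (the report shows errno = 12).
inline Reservation reserve(std::size_t size, std::size_t large_page_size,
                           std::size_t default_page_size, bool large_pages_available) {
  if (large_pages_available) {
    return Reservation{size, large_page_size};
  }
  return Reservation{size, default_page_size};
}

// Problematic pattern: derive the page size from what was requested on the command line.
inline std::size_t page_size_from_flags(bool use_large_pages, std::size_t large_page_size,
                                        std::size_t default_page_size) {
  return use_large_pages ? large_page_size : default_page_size;
}

// Safer pattern: ask the reservation which page size it ended up with.
inline std::size_t page_size_from_reservation(const Reservation& r) {
  return r.page_size;
}
```

With a failed 1g request and an assumed 4 KiB default page size, `page_size_from_flags` still reports 1 GiB while `page_size_from_reservation` reports 4 KiB, which is the mismatch the comment describes.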
That's exactly what's happening. The caller of pretouch_memory is passing in the 1G page_size from -XX:LargePageSizeInBytes, even though the page size that was actually used is something smaller. pretouch_memory could be more defensive and avoid this crash, but it seems wrong that it should need to. And if it were more careful about the boundaries, it would still end up touching only one in every page_size_arg / page_size_actual pages.
01-04-2022
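
To put a number on the "one in every page_size_arg / page_size_actual pages" point from the comment above, here is a rough standalone calculation. It is a sketch only: `touch_count` is a hypothetical helper, the 2 GiB range comes from the -Xmx2g reproduction, and the 4 KiB actual page size is an assumption (the fallback could equally be 2m pages).

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t K = 1024ULL;
constexpr std::uint64_t G = K * K * K;  // 1 GiB

// Number of addresses written when walking [start, end) in steps of `stride`,
// beginning at `start`.
constexpr std::uint64_t touch_count(std::uint64_t start, std::uint64_t end,
                                    std::uint64_t stride) {
  return (end - start + stride - 1) / stride;
}

constexpr std::uint64_t kHeapStart = 0x80000000ULL;      // req_addr from the warning in this report
constexpr std::uint64_t kHeapEnd = kHeapStart + 2 * G;   // 2 GiB heap (-Xmx2g)
constexpr std::uint64_t kActualPage = 4 * K;             // assumption: 4 KiB pages after fallback
constexpr std::uint64_t kClaimedPage = G;                // -XX:LargePageSizeInBytes=1g

// Touching every actual page takes 524288 writes ...
static_assert(touch_count(kHeapStart, kHeapEnd, kActualPage) == 524288,
              "2 GiB / 4 KiB");
// ... but a bounds-careful walk with the claimed 1 GiB stride performs only 2.
static_assert(touch_count(kHeapStart, kHeapEnd, kClaimedPage) == 2,
              "2 GiB / 1 GiB");
// That is one touched page in every 262144.
static_assert(kClaimedPage / kActualPage == 262144, "1 GiB / 4 KiB");
```

So a more defensive pretouch would avoid the crash but leave almost the entire heap untouched, which supports fixing the caller instead.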
I'm not sure how to reproduce this (is it sufficient to just request large pages even though there aren't any? I'll try that.), but it seems like the problem here might be the caller passing the wrong page size to pretouch. If the caller only allocated small pages starting around "start", but tells pretouch the page size is large and "start" is in the middle of a large page, well, that's lying to pretouch. An hs_err file or stack trace to give more context might help.
31-03-2022