JDK-8279241 : G1 Full GC does not always slide memory to bottom addresses
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 18
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2021-12-23
  • Updated: 2022-08-04
  • Resolved: 2022-01-25
JDK 19
19 b07 Fixed
Related Reports
Relates :  
Description
The following test failed in the JDK18 CI:

runtime/cds/appcds/cacheObject/HeapFragmentationTest.java

Here's a snippet from the log file:

[0.625s][info ][gc,heap] GC(4) Eden regions: 0->0(2)
[0.625s][info ][gc,heap] GC(4) Survivor regions: 0->0(5)
[0.625s][info ][gc,heap] GC(4) Old regions: 1->1
[0.625s][info ][gc,heap] GC(4) Archive regions: 2->2
[0.625s][info ][gc,heap] GC(4) Humongous regions: 0->0
[0.625s][info ][gc     ] GC(4) Pause Full (G1 Compaction Pause) 1M->1M(10M) 71.992ms
Aborting due to java.lang.OutOfMemoryError: Java heap space
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/debug.cpp:362
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/ff806ead-2cac-495d-9cbc-62116f99bf14-S13740/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/ab608f54-4d8f-478e-83ac-d19b2e323bdb/runs/0b06b07a-ea51-4623-af6d-c2d4ba24de47/workspace/open/src/hotspot/share/utilities/debug.cpp:362), pid=17819, tid=17823
#  fatal error: OutOfMemory encountered: Java heap space
#
# JRE version: Java(TM) SE Runtime Environment (18.0+30) (fastdebug build 18-ea+30-2014)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 18-ea+30-2014, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/ff806ead-2cac-495d-9cbc-62116f99bf14-S14144/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/394d1899-59ec-4632-9797-dfcc5aae96c5/runs/57a2ffc5-f654-472a-8132-39ff1cdd676a/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_cds_relocation/scratch/3/core.17819)
#
# An error report file with more information is saved as:
# /opt/mach5/mesos/work_dir/slaves/ff806ead-2cac-495d-9cbc-62116f99bf14-S14144/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/394d1899-59ec-4632-9797-dfcc5aae96c5/runs/57a2ffc5-f654-472a-8132-39ff1cdd676a/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_cds_relocation/scratch/3/hs_err_pid17819.log
];
 stderr: [java version "18-ea" 2022-03-22
Java(TM) SE Runtime Environment (fastdebug build 18-ea+30-2014)
Java HotSpot(TM) 64-Bit Server VM (fastdebug build 18-ea+30-2014, mixed mode, sharing)
]
 exitValue = 134

java.lang.RuntimeException: 'array.length = 1048576000' missing from stdout/stderr 

	at jdk.test.lib.process.OutputAnalyzer.shouldContain(OutputAnalyzer.java:221)
	at jdk.test.lib.cds.CDSTestUtils.checkMatches(CDSTestUtils.java:491)
	at jdk.test.lib.cds.CDSTestUtils$Result.assertNormalExit(CDSTestUtils.java:194)
	at HeapFragmentationTest.main(HeapFragmentationTest.java:85)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
	at java.base/java.lang.Thread.run(Thread.java:833)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test

result: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: 'array.length = 1048576000' missing from stdout/stderr

Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x00007f98a0037620):  JavaThread "main" [_thread_in_vm, id=17823, stack(0x00007f98a841f000,0x00007f98a8520000)]

Stack: [0x00007f98a841f000,0x00007f98a8520000],  sp=0x00007f98a851e2e0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1a1bf00]  VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x1a0
V  [libjvm.so+0xaf4d5b]  report_fatal(VMErrorType, char const*, int, char const*, ...)+0x12b
V  [libjvm.so+0xaf500d]  report_java_out_of_memory(char const*)+0xed
V  [libjvm.so+0x14b1c1a]  MemAllocator::Allocation::check_out_of_memory()+0xda
V  [libjvm.so+0x14b48e1]  MemAllocator::allocate() const+0x191
V  [libjvm.so+0x1968c86]  TypeArrayKlass::allocate_common(int, bool, JavaThread*)+0x2e6
V  [libjvm.so+0x15ea06f]  oopFactory::new_typeArray(BasicType, int, JavaThread*)+0xff
V  [libjvm.so+0xeed4d6]  InterpreterRuntime::newarray(JavaThread*, BasicType, int)+0xb6
j  HeapFragmentationApp.main([Ljava/lang/String;)V+20
v  ~StubRoutines::call_stub
V  [libjvm.so+0xf07434]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x504
V  [libjvm.so+0x1040b75]  jni_invoke_static(JNIEnv_*, JavaValue*, _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, JavaThread*) [clone .constprop.1]+0x375
V  [libjvm.so+0x10441b5]  jni_CallStaticVoidMethod+0x1c5
C  [libjli.so+0x4797]  JavaMain+0xd37
C  [libjli.so+0x7aa9]  ThreadJavaMain+0x9

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  HeapFragmentationApp.main([Ljava/lang/String;)V+20
v  ~StubRoutines::call_stub
Comments
Changeset: 295b263f Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2022-01-25 09:13:50 +0000 URL: https://git.openjdk.java.net/jdk/commit/295b263fa951b9d51bfa92c04e49b2a17a62bd6f
25-01-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7106 Date: 2022-01-17 11:00:04 +0000
19-01-2022

Defer request for JDK 18 approved. After the bug is fixed and tested in JDK 19, consider backporting it into JDK 18u (update) in addition to JDK 17u.
18-01-2022

Deferral Request: We suggest deferral for this because:
* Complexity of the fix: it is fairly big for a change this late (see https://github.com/openjdk/jdk/pull/7106/files).
* Likelihood of the crash: given our (complete) understanding of this issue, the likelihood of it occurring is very small; this is a race that happens because threads can get arbitrarily paused for a relatively long time. We have seen only two instances of it, one just before JDK 17 and one now (before 18), and the JDK 17 fix (JDK-8267703) already makes it even less likely to occur. We have no end-user reports of it occurring.
* This is not a regression in JDK 18; it was introduced earlier.
While the complexity is probably manageable, we think the likelihood argument outweighs the improvement this change gives. We will consider backporting this change to JDK 17.
18-01-2022

The suggested change to guarantee sliding all live memory to lower addresses is to split the existing "Prepare Compaction" task into two parts (see the sketch after this comment):
* a serial part that spreads the given compaction target regions (i.e. the regions we are going to compact into) across the compaction point queues in a round-robin fashion;
* a parallel part that covers the remaining work: forwarding the live objects to their respective future locations (every thread using its precalculated compaction queue), and clearing and rebuilding metadata, i.e. clearing metadata that becomes invalid after the full gc and rebuilding metadata that is not otherwise created (the BOT for young-gen regions staying in place).
Measurements on a prototype show that even the absolute time such a serial phase takes (small per-region work) is negligible (<= 0.05 ms for > 1000 regions), and even more so compared to the actual work. On very large heaps the region size can be increased if the number of regions ever becomes a concern (allowed by JDK-8275056). Future work could, for example, precalculate target regions for humongous objects fairly easily with that change too (for a maximally compacting last-resort gc); doing this in parallel in a single pass would be fairly hard anyway, as humongous objects require multiple consecutive regions. Another option is to balance the compaction queues based on the amount of live data, to balance the compaction (copying) phase.
13-01-2022
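
As a rough illustration of the serial round-robin distribution described in the comment above, here is a minimal Java sketch; the class, method and variable names are invented for illustration and are not HotSpot code (the actual change is in G1's full-gc preparation code, see the pull request above).

import java.util.ArrayList;
import java.util.List;

// Illustration only: region i goes to queue (i % numWorkers), so every worker's
// queue starts near the bottom of the heap and the combined fill order always
// proceeds from the lowest region indices upwards.
class CompactionQueueDistribution {

    // Serial step: spread the compaction target regions across the per-worker
    // compaction point queues in round-robin order.
    static List<List<Integer>> distributeRoundRobin(int numTargetRegions, int numWorkers) {
        List<List<Integer>> queues = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            queues.add(new ArrayList<>());
        }
        for (int region = 0; region < numTargetRegions; region++) {
            queues.get(region % numWorkers).add(region);
        }
        return queues;
    }

    public static void main(String[] args) {
        // Parallel step (not shown): each worker would then forward live objects
        // into the regions of its own precomputed queue, in queue order.
        List<List<Integer>> queues = distributeRoundRobin(10, 3);
        for (int w = 0; w < queues.size(); w++) {
            System.out.println("worker " + w + " compacts into regions " + queues.get(w));
        }
        // worker 0 compacts into regions [0, 3, 6, 9]
        // worker 1 compacts into regions [1, 4, 7]
        // worker 2 compacts into regions [2, 5, 8]
    }
}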

We had this problem at the end of 17 as well. Looks like that fix didn't fix all cases.
11-01-2022

That single Old region at index 400-something is a Survivor region: G1 allocates young-gen regions from the end of the heap downwards, as opposed to real old-gen regions created by promotion, which are allocated from the bottom of the heap upwards; this is a simple measure to avoid region-level fragmentation (a small sketch of this policy follows below).
10-01-2022
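
To make the policy in the comment above concrete, here is a conceptual Java sketch assuming a simple ordered set of free region indices; the class and method names are invented and do not correspond to HotSpot's free-region list implementation.

import java.util.TreeSet;

// Illustration only: young-gen regions are taken from the highest free indices
// (end of the heap, growing downwards), regions for promoted old objects from
// the lowest (bottom of the heap, growing upwards), keeping the two kinds of
// regions at opposite ends of the heap and avoiding region-level fragmentation.
class RegionPicker {
    private final TreeSet<Integer> freeRegions = new TreeSet<>();

    RegionPicker(int numRegions) {
        for (int i = 0; i < numRegions; i++) freeRegions.add(i);
    }

    int allocateYoung() { return freeRegions.pollLast(); }   // eden/survivor: top end
    int allocateOld()   { return freeRegions.pollFirst(); }  // promotion target: bottom end

    public static void main(String[] args) {
        RegionPicker heap = new RegionPicker(1400);
        System.out.println("young region: " + heap.allocateYoung()); // 1399
        System.out.println("young region: " + heap.allocateYoung()); // 1398
        System.out.println("old region:   " + heap.allocateOld());   // 0
    }
}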

This is a pre-existing issue: G1 full gc does not guarantee any particular compaction order, nor that it always compacts into lower addresses. In the "Prepare for Compaction Phase", threads iterate over all regions in parallel to build their list(s) of regions they can compact into, obviously competing for these regions. If CPU time assignment is unfortunate, the thread that later actually does the compaction can end up with that region 400-something as its first destination region. In my tests I did not manage to reproduce the failure itself, but with three threads, instead of starting to compact into regions 0, 1 and 2 respectively, I sometimes got starting regions like 32 (this can only be observed with additional debug code). A sketch of this claiming race follows below.
10-01-2022
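
To illustrate the claiming race described in the comment above, here is a minimal Java sketch; the claim counter, worker count and region count are invented, and real G1 workers claim preparation work differently in detail. The point is only that the first destination region a worker ends up with depends on scheduling.

import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: workers race to claim region indices from a shared cursor,
// so a worker that is descheduled briefly at startup may get 32, 435, ... as its
// first claimed region instead of 0, 1 or 2.
class PrepareCompactionRace {
    static final int NUM_REGIONS = 1400;
    static final AtomicInteger claimCursor = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            int first = -1;
            int r;
            while ((r = claimCursor.getAndIncrement()) < NUM_REGIONS) {
                if (first < 0) first = r;  // first region this worker claimed
                // ... decide whether region r becomes a compaction destination ...
            }
            System.out.println(Thread.currentThread().getName()
                    + (first < 0 ? " claimed nothing" : " first claimed region " + first));
        };
        Thread[] workers = new Thread[3];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(worker, "GC worker " + i);
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }
}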

In this failure we have:
GC(0) Pause Young (Concurrent Start) (G1 Humongous Allocation)
GC(1) Concurrent Undo Cycle
GC(2) Pause Young (Normal) (G1 Humongous Allocation)
GC(3) Pause Full (G1 Compaction Pause)
GC(4) Pause Full (G1 Compaction Pause)
All appear to be triggered by the single humongous allocation. It seems reasonable to have a single old region after all that. But it does seem strange and not useful that it's in the middle of the heap, rather than "compacted" to the end.
25-12-2021

Reassigned to GC team for evaluation. Question for GC team -- is G1 supposed to move the Old Gen region to satisfy the humongous allocation?
23-12-2021

The test assumes this:
// We run with a 1400m heap, so we should be able to allocate a 1000m buffer, regardless
// of the heap size chosen at dump time.
However, G1 puts an Old Gen region in the middle of the heap. This makes it impossible to allocate a 1000MB humongous region. (A minimal sketch of the test's allocation follows this comment.)
Heap:
 garbage-first heap   total 1035264K, used 1116K [0x0000000600000000, 0x0000000657800000)
  region size 1024K, 0 young (0K), 0 survivors (0K)
 Metaspace       used 183K, committed 384K, reserved 1114112K
  class space    used 2K, committed 128K, reserved 1048576K
Heap Regions: E=young(eden), S=young(survivor), O=old, HS=humongous(starts), HC=humongous(continues), CS=collection set, F=free, OA=open archive, CA=closed archive, TAMS=top-at-mark-start (previous, next)
|   0|0x0000000600000000, 0x0000000600000000, 0x0000000600100000|  0%| F|  |TAMS 0x0000000600000000, 0x0000000600000000| Untracked
|   1|0x0000000600100000, 0x0000000600100000, 0x0000000600200000|  0%| F|  |TAMS 0x0000000600100000, 0x0000000600100000| Untracked
|   2|0x0000000600200000, 0x0000000600200000, 0x0000000600300000|  0%| F|  |TAMS 0x0000000600200000, 0x0000000600200000| Untracked
.... all free regions here .... (~450MB)
| 432|0x000000061b000000, 0x000000061b000000, 0x000000061b100000|  0%| F|  |TAMS 0x000000061b000000, 0x000000061b000000| Untracked
| 433|0x000000061b100000, 0x000000061b100000, 0x000000061b200000|  0%| F|  |TAMS 0x000000061b100000, 0x000000061b100000| Untracked
| 434|0x000000061b200000, 0x000000061b200000, 0x000000061b300000|  0%| F|  |TAMS 0x000000061b200000, 0x000000061b200000| Untracked
| 435|0x000000061b300000, 0x000000061b32b3e0, 0x000000061b400000| 16%| O|  |TAMS 0x000000061b300000, 0x000000061b300000| Untracked
| 436|0x000000061b400000, 0x000000061b400000, 0x000000061b500000|  0%| F|  |TAMS 0x000000061b400000, 0x000000061b400000| Untracked
| 437|0x000000061b500000, 0x000000061b500000, 0x000000061b600000|  0%| F|  |TAMS 0x000000061b500000, 0x000000061b500000| Untracked
.... all free regions here .... (~600MB)
|1007|0x000000063ef00000, 0x000000063ef00000, 0x000000063f000000|  0%| F|  |TAMS 0x000000063ef00000, 0x000000063ef00000| Untracked
|1008|0x000000063f000000, 0x000000063f000000, 0x000000063f100000|  0%| F|  |TAMS 0x000000063f000000, 0x000000063f000000| Untracked
[??? unmapped spaces (~400MB)]
|1398|0x0000000657600000, 0x0000000657676000, 0x0000000657700000| 46%|OA|  |TAMS 0x0000000657676000, 0x0000000657600000| Untracked
|1399|0x0000000657700000, 0x0000000657776000, 0x0000000657800000| 46%|CA|  |TAMS 0x0000000657700000, 0x0000000657776000| Untracked
23-12-2021
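
For reference, here is a hedged approximation of what HeapFragmentationApp does, inferred only from the expected output "array.length = 1048576000" and the 1400m heap mentioned above; the class name and exact allocation below are assumptions, not the actual test source.

// Run with roughly -Xmx1400m -XX:+UseG1GC: the ~1000 MB byte array must be
// allocated as a humongous object spanning ~1000 contiguous 1 MB regions, so a
// single old region stranded in the middle of the heap makes it fail with OOME.
public class HeapFragmentationAppSketch {
    public static void main(String[] args) {
        byte[] array = new byte[1024 * 1024 * 1000];  // 1048576000 bytes
        System.out.println("array.length = " + array.length);
    }
}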