Bug ID: JDK-8137099 G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 9

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2015-09-24
Updated: 2019-07-12
Resolved: 2018-01-11

Versions (Unresolved/Resolved/Fixed)

JDK 11
11 b01 Fixed

Related Reports

Blocks :	JDK-8194877 - Clean up code in G1CollectedHeap::attempt_allocation_slow
Duplicate :	JDK-8165150 - G1 sometimes performs one or more young gcs with zero sized eden after evacuation failure before issuing a full gc
Relates :	JDK-8179226 - gc/stress/gclocker/TestGCLockerWithG1.java: fails with OOME Java heap space
Relates :	JDK-8192647 - GClocker induced GCs can starve threads requiring memory leading to OOME

Description

We regularly see OutOfMemoryErrors with G1 in our stress tests. Running the same tests with the same heap size under ParallelGC and CMS does not show the problem.

The stress tests are based on real world application code with a lot of threads.

Scenario:
We have an application with a lot of threads that spend time in critical native sections.

1. An evacuation failure happens during a GC.
2. After clean-up work, the safepoint is left.
3. Another thread can't allocate and triggers a new incremental GC.
4. A thread that can't allocate after an incremental GC triggers a full GC. However, the full GC doesn't start because another thread
    has started an incremental GC, the GC locker is active, or the GCLocker-initiated GC has not yet been performed.
    If an incremental GC doesn't succeed due to the GC locker, and this happens more often than GCLockerRetryAllocationCount (=2), an OOME is thrown.
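
The retry limit in step 4 can be illustrated with a toy model. This is only a sketch, not HotSpot code: the class AllocationRetryModel, its allocate method, and the Outcome enum are hypothetical names invented for illustration. Only the flag name and its default of 2 come from the description above. The model also assumes that any GC that is not blocked frees enough memory for the allocation.

```java
import java.util.function.IntPredicate;

/** Toy model of a bounded GC-locker retry count. Hypothetical names; not HotSpot code. */
public class AllocationRetryModel {

    /** Mirrors the default value quoted in the report. */
    public static final int GC_LOCKER_RETRY_ALLOCATION_COUNT = 2;

    public enum Outcome { ALLOCATED, OUT_OF_MEMORY }

    /**
     * Simulates one thread's allocation slow path.
     *
     * @param gcBlockedOnAttempt returns true if the GC requested on the given
     *                           0-based attempt is blocked by the GC locker
     * @param retryLimit         number of blocked attempts that are tolerated
     */
    public static Outcome allocate(IntPredicate gcBlockedOnAttempt, int retryLimit) {
        int lockerStalls = 0;
        for (int attempt = 0; ; attempt++) {
            if (!gcBlockedOnAttempt.test(attempt)) {
                // Assumption: a GC that actually runs frees enough memory.
                return Outcome.ALLOCATED;
            }
            lockerStalls++;
            if (lockerStalls > retryLimit) {
                // Blocked more often than the limit allows: give up with an OOME.
                return Outcome.OUT_OF_MEMORY;
            }
        }
    }

    public static void main(String[] args) {
        // GC locker blocks every request: the thread gives up.
        System.out.println(allocate(attempt -> true, GC_LOCKER_RETRY_ALLOCATION_COUNT));
        // GC locker blocks only the first two requests: the third one succeeds.
        System.out.println(allocate(attempt -> attempt < 2, GC_LOCKER_RETRY_ALLOCATION_COUNT));
    }
}
```

In this model the first call ends in OUT_OF_MEMORY even though a later GC might have freed memory, which is the behavior the report complains about.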

Without critical native code, we would try to trigger a full gc until we succeed. In this case there is just a performance issue, but not an OOME.

The reason is that only G1 splits the "upgrade" from young GC to full GC into multiple VM operations. Between those operations, the GC locker state can change and prevent the full GC.

The problem can be reproduced with the attached program.
The parameters might vary depending on the system.

java -Xmx64m -XX:+UseG1GC -XX:+PrintGC -XX:MaxGCPauseMillis=10 -XX:+UnlockExperimentalVMOptions -XX:-G1ForceFullGCAfterEvacuationFailure -XX:-PrintAdaptiveSizePolicy TestEvacFailureThreaded 10 10000000 10000 10000 10000 10 0.7

A snippet of the output:

#2539: [GC pause (G1 Evacuation Pause) (young) 62M->62M(64M), 0.0062519 secs]
#2540: [GC pause (G1 Evacuation Pause) (young) 62M->62M(64M), 0.0050967 secs]
#2538: [GC concurrent-mark-end, 0.0193436 secs]
#2538: [GC remark, 0.0048717 secs]
#2538: [GC cleanup 62M->62M(64M), 0.0016663 secs]
#2541: [GC pause (GCLocker Initiated GC) (young) 62M->62M(64M), 0.0061165 secs]
#2542: [GC pause (G1 Evacuation Pause) (mixed)-- 62M->62M(64M), 0.0063998 secs]
#2543: [GC pause (G1 Evacuation Pause) (mixed)-- 62M->62M(64M), 0.0066795 secs]
#2544: [GC pause (GCLocker Initiated GC) (mixed)-- 62M->62M(64M), 0.0082145 secs]
#2545: [GC pause (G1 Evacuation Pause) (mixed)-- 62M->62M(64M), 0.0102476 secs]
#2546: [GC pause (GCLocker Initiated GC) (mixed)-- 62M->62M(64M), 0.0142916 secs]
#2547: [GC pause (G1 Evacuation Pause) (mixed)-- 62M->62M(64M), 0.0108066 secs]
#2548: [GC pause (G1 Evacuation Pause) (young) 62M->62M(64M), 0.0065968 secs]
#2549: [Full GC (Allocation Failure)  62M->23M(64M), 0.0483837 secs]
java.lang.OutOfMemoryError: Java heap space
        at TestEvacFailureThreaded.runTest(TestEvacFailureThreaded.java:75)
        at TestEvacFailureThreaded$2.run(TestEvacFailureThreaded.java:138)

Comments

Since backporting this to OpenJDK 8 seems like a lot of work, here is a much simpler way to prevent this OOM behavior: change the value of GCLockerRetryAllocationCount from 2 to a large number (e.g. 10000) in "globals.hpp". FYI, here is an alternative patch that makes the number of retry attempts infinite, plus a Java-only unit test that does not define any additional native methods: http://cr.openjdk.java.net/~bmathiske/8137099/webrev.00/ This may not be an ideal solution, but it is perhaps better than having crashes.
01-11-2018
The reason for the bug (and the suggested fix indicates it) is that in G1, and only in G1, full GCs determined by ergonomics are executed using two safepoints. Between the first (young GC request) and the second (full GC request), the GC locker can "lock out" the second so that it never occurs. The other collectors issue a full GC after a young GC within the same pause, so that everything possible is done atomically to free memory and this lock-out can't happen: if the first GC has not been locked out, the second can't be either, because the GCLocker can't be entered while a safepoint is in progress.
29-11-2017
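
The lock-out described in the comment above can be made concrete with a small deterministic model. It is a sketch under stated assumptions, not HotSpot code: SafepointUpgradeModel, GcLocker, and the point at which the mutator enters its critical section are all hypothetical. The one rule taken from the comment is that the GC locker cannot be entered while a safepoint is in progress. The young GC is assumed to free nothing, so a full GC is always wanted.

```java
/** Toy model of the young-to-full GC "upgrade". Hypothetical names; not HotSpot code. */
public class SafepointUpgradeModel {

    /** Stand-in for the GC locker: critical sections cannot be entered during a safepoint. */
    static final class GcLocker {
        private boolean active;
        private boolean inSafepoint;

        boolean tryEnterCritical() {
            if (inSafepoint) {
                return false; // mutator threads are stopped, so nobody can enter
            }
            active = true;
            return true;
        }

        boolean isActive() { return active; }
        void beginSafepoint() { inSafepoint = true; }
        void endSafepoint() { inSafepoint = false; }
    }

    /** Young GC and full GC run as two separate VM operations. Returns true if the full GC ran. */
    public static boolean upgradeAcrossTwoSafepoints() {
        GcLocker locker = new GcLocker();

        locker.beginSafepoint();   // first VM operation: young GC, assumed to free nothing
        locker.endSafepoint();

        locker.tryEnterCritical(); // in the gap, a mutator enters a critical native section

        if (locker.isActive()) {
            return false;          // second VM operation (full GC) is locked out
        }
        locker.beginSafepoint();   // full GC would run here
        locker.endSafepoint();
        return true;
    }

    /** Young GC is upgraded to a full GC inside the same safepoint. Returns true if the full GC ran. */
    public static boolean upgradeWithinOneSafepoint() {
        GcLocker locker = new GcLocker();

        locker.beginSafepoint();   // young GC, assumed to free nothing
        locker.tryEnterCritical(); // rejected: the safepoint is still in progress
        boolean fullGcRan = !locker.isActive();
        locker.endSafepoint();
        return fullGcRan;
    }

    public static void main(String[] args) {
        System.out.println("two safepoints: full GC ran = " + upgradeAcrossTwoSafepoints());
        System.out.println("one safepoint:  full GC ran = " + upgradeWithinOneSafepoint());
    }
}
```

In the model, only the two-safepoint variant leaves a gap in which a thread can enter a critical section, which is why performing the upgrade within the safepoint closes the window.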
JDK-8179226 is similar but not the same, although this same issue also occurs in JDK-8179226. Changing JDK-8179226 to be about the other bug (a test bug), and this one to fix the missing "upgrade" of young collections to full ones.
29-11-2017
Looks very similar to JDK-8179226, i.e. gclocker in combination with full and young gcs.
03-11-2017
Mikael, I can still reproduce the problem.
20-11-2015
Axel, do you still see this OOME after the fix for JDK-8130265? I just remembered this bug, and it seems that fix resolves or hides this problem for me.
19-11-2015