JDK-8192647 : GClocker induced GCs can starve threads requiring memory leading to OOME
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9,10,11,14,16,17,20,21
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2017-11-29
  • Updated: 2024-05-08
JDK 23 : Unresolved
Description
GCLocker-induced GCs can starve threads of memory, which typically leads to OOME in the affected threads. This works as follows:

1. Thread A tries to allocate memory as normal and, failing that, tries to start a GC; the GCLocker is active, so the thread is stalled waiting for the GC
2. The GCLocker-induced GC executes and frees some memory
3. Thread A does not get any of that memory; other threads that were also waiting for memory consume it instead
4. Go to 1 until the GCLocker retry count has been reached, at which point thread A gives up and throws OOME (see the sketch below)
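
Below is a minimal, self-contained C++ model of this retry loop. It is only an illustration: try_allocate, gc_locker_is_active, stall_until_gc_locker_clear and collect_garbage_and_allocate are made-up stubs, not the actual HotSpot functions, and the constant merely mirrors the intent of -XX:GCLockerRetryAllocationCount.

#include <cstddef>

static const int kGCLockerRetryAllocationCount = 2;              // illustrative default

// --- stubs standing in for the VM; real logic omitted ---
static void* try_allocate(std::size_t)                  { return nullptr; } // allocation keeps failing
static bool  gc_locker_is_active()                      { return true;    } // JNI critical regions held
static void  stall_until_gc_locker_clear()              {}                  // wait for the GCLocker-induced GC
static void* collect_garbage_and_allocate(std::size_t)  { return nullptr; } // normal GC path

// Slow-path allocation as described in steps 1-4 above.
void* allocate_slow(std::size_t size) {
  int gclocker_retries = 0;
  while (true) {
    if (void* p = try_allocate(size)) {
      return p;                                   // some other stalled thread usually wins this race (step 3)
    }
    if (gc_locker_is_active()) {
      stall_until_gc_locker_clear();              // step 1: wait for the GCLocker-induced GC
      // Step 2 has happened by the time we wake up, but none of the freed memory
      // was reserved for this thread.
      if (++gclocker_retries >= kGCLockerRetryAllocationCount) {
        return nullptr;                           // step 4: give up; the caller throws OutOfMemoryError
      }
      continue;                                   // back to step 1
    }
    return collect_garbage_and_allocate(size);    // GCLocker not active: run a GC on our own behalf
  }
}

int main() { return allocate_slow(64) == nullptr ? 0 : 1; }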
Comments
With region pinning, G1 no longer has a GCLocker and therefore no longer has that problem, so JDK-8308507 has been closed as WNF. If needed, JDK-8308507 could be reactivated for JDK 21 and (probably) earlier.
23-01-2024

G1 changes moved to JDK-8308507.
22-05-2023

After JDK-7129164, the precise number of threads in critical regions is not tracked all the time. Otherwise, one could probably stall until the GCLocker is clear before attempting the VM GC operation. (The fact that the VM GC operation can fail due to an active GCLocker requires workarounds in many places and complicates the logic flow significantly, IMO.)
18-04-2023
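A sketch of the alternative hinted at in the comment above, under the assumption that the VM again kept a precise count of threads inside JNI critical regions (which JDK-7129164 stopped doing). CriticalRegionGate and its methods are illustrative, not HotSpot code; the point is that an allocating thread could wait for the count to reach zero before it submits the GC VM operation, so the operation can no longer fail because the GCLocker is active.

#include <condition_variable>
#include <mutex>

class CriticalRegionGate {
  std::mutex              _lock;
  std::condition_variable _cv;
  int                     _in_critical = 0;   // precise count, updated on every enter/exit
public:
  void enter_critical() {                     // entered via JNI GetPrimitiveArrayCritical & friends
    std::lock_guard<std::mutex> g(_lock);
    ++_in_critical;
  }
  void exit_critical() {                      // left via JNI ReleasePrimitiveArrayCritical & friends
    std::lock_guard<std::mutex> g(_lock);
    if (--_in_critical == 0) {
      _cv.notify_all();                       // wake allocators waiting for the regions to drain
    }
  }
  // Called by an allocating thread *before* it attempts the GC VM operation.
  void stall_until_clear() {
    std::unique_lock<std::mutex> g(_lock);
    _cv.wait(g, [this] { return _in_critical == 0; });
  }
};

int main() {
  CriticalRegionGate gate;
  gate.enter_critical();
  gate.exit_critical();
  gate.stall_until_clear();   // returns immediately: no thread is inside a critical region
  return 0;
}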

Another option, which incurs more GCs, would be to only consider the GCLocker retry count for a thread if that thread has already had an opportunity to satisfy its request (by issuing a GC itself). That may cause extra GCs, but may be less complicated to implement than an allocation request queue as suggested elsewhere in this issue.
21-03-2023
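One way to read this suggestion, sketched below as a self-contained model (try_allocate, gc_locker_is_active, stall_until_gc_locker_clear and collect_garbage_and_allocate are illustrative stubs, not HotSpot code): a GCLocker stall never causes an OOME by itself; the thread only gives up once a GC it issued itself has actually run and still failed to provide memory, at the price of potentially more GCs.

#include <cstddef>

// --- stubs standing in for the VM; chosen only so the example terminates ---
static void* try_allocate(std::size_t)              { return nullptr; }
static bool  gc_locker_is_active()                  { static int calls = 0; return ++calls <= 3; }
static void  stall_until_gc_locker_clear()          {}
// Runs a GC for this thread and retries the allocation; gc_ran reports whether the
// GC actually executed or was prevented by the GCLocker becoming active again.
static void* collect_garbage_and_allocate(std::size_t, bool* gc_ran) { *gc_ran = true; return nullptr; }

void* allocate_slow(std::size_t size) {
  while (true) {
    if (void* p = try_allocate(size)) return p;
    if (gc_locker_is_active()) {
      // Unlike today, exhausting a stall budget here does not throw OOME, because
      // this thread has not yet had a GC of its own.
      stall_until_gc_locker_clear();
      continue;
    }
    bool gc_ran = false;
    if (void* p = collect_garbage_and_allocate(size, &gc_ran)) return p;
    if (gc_ran) {
      return nullptr;   // the thread had its opportunity and the GC did not help: real OOME
    }
    // The GC was prevented by the GCLocker; loop and retry, possibly causing extra GCs.
  }
}

int main() { return allocate_slow(64) == nullptr ? 0 : 1; }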

Here's a log file snippet from the jdk-20+11-623-tier7 sighting:

vmTestbase/gc/lock/jni/jnilock003/TestDescription.java
#section:main
----------messages:(4/305)----------
command: main -XX:-UseGCOverheadLimit gc.lock.LockerTest -gp1 interned(randomString) -lockers jni
reason: User specified action: run main/othervm/native -XX:-UseGCOverheadLimit gc.lock.LockerTest -gp1 interned(randomString) -lockers jni
Mode: othervm [/othervm specified]
elapsed time (seconds): 301.497
----------configuration:(0/0)----------
----------System.out:(35/2014)----------
Stress time: 300 seconds
Stress iterations factor: 1
Stress threads factor: 1
Stress runs factor: 1
Max memory: 1038090240
Sleep time: 500
Iterations: 0
Number of threads: 24
Run GC thread: false
Run mem diag thread: false
Run forever: false
Starting Thread[#58,gc.lock.LockerTest$Worker@251dc6e4,5,MainThreadGroup]
Starting Thread[#59,gc.lock.LockerTest$Worker@60f6433b,5,MainThreadGroup]
Starting Thread[#60,gc.lock.LockerTest$Worker@763cf82e,5,MainThreadGroup]
Starting Thread[#61,gc.lock.LockerTest$Worker@3d5d3aeb,5,MainThreadGroup]
Starting Thread[#62,gc.lock.LockerTest$Worker@6ea2bfb,5,MainThreadGroup]
Starting Thread[#63,gc.lock.LockerTest$Worker@54575f86,5,MainThreadGroup]
Starting Thread[#64,gc.lock.LockerTest$Worker@453b20aa,5,MainThreadGroup]
Starting Thread[#65,gc.lock.LockerTest$Worker@1165d084,5,MainThreadGroup]
Starting Thread[#66,gc.lock.LockerTest$Worker@468794a0,5,MainThreadGroup]
Starting Thread[#67,gc.lock.LockerTest$Worker@28c25899,5,MainThreadGroup]
Starting Thread[#68,gc.lock.LockerTest$Worker@34e0035b,5,MainThreadGroup]
Starting Thread[#69,gc.lock.LockerTest$Worker@b3e04b3,5,MainThreadGroup]
Starting Thread[#70,gc.lock.LockerTest$Worker@7166a3e0,5,MainThreadGroup]
Starting Thread[#71,gc.lock.LockerTest$Worker@4bda2bc5,5,MainThreadGroup]
Starting Thread[#72,gc.lock.LockerTest$Worker@7f94033e,5,MainThreadGroup]
Starting Thread[#73,gc.lock.LockerTest$Worker@4f49be3,5,MainThreadGroup]
Starting Thread[#74,gc.lock.LockerTest$Worker@4049513e,5,MainThreadGroup]
Starting Thread[#75,gc.lock.LockerTest$Worker@5a57def1,5,MainThreadGroup]
Starting Thread[#76,gc.lock.LockerTest$Worker@585bb76,5,MainThreadGroup]
Starting Thread[#77,gc.lock.LockerTest$Worker@258e03a6,5,MainThreadGroup]
Starting Thread[#78,gc.lock.LockerTest$Worker@2f447688,5,MainThreadGroup]
Starting Thread[#79,gc.lock.LockerTest$Worker@4b8095c7,5,MainThreadGroup]
Starting Thread[#80,gc.lock.LockerTest$Worker@6376d9ca,5,MainThreadGroup]
Starting Thread[#81,gc.lock.LockerTest$Worker@36307531,5,MainThreadGroup]
----------System.err:(2/126)----------
java.lang.OutOfMemoryError: Java heap space
STATUS:Failed.`main' threw exception: java.lang.OutOfMemoryError: Java heap space
----------rerun:(37/7570)*----------
<snip>
result: Failed. Execution failed: `main' threw exception: java.lang.OutOfMemoryError: Java heap space
16-08-2022

Would it be worth introducing a new exception class, distinct from OOME, to indicate this situation? OOME does not look like the best choice from a support perspective: we need to get logs (which is rarely possible) before we can suggest that customers increase the GCLockerRetryAllocationCount option, use another GC implementation, or just refactor their code.
03-02-2022

- gc/stress/TestJNIBlockFullGC/TestJNIBlockFullGC.java fails w/ UnsatisfiedLinkError: int gc.stress.TestJNIBlockFullGC.TestJNIBlockFullGC.TestCriticalArray0(int[]), which I'm going to fix via JDK-8249681;
- vmTestbase/gc/lock/jni/jnilock002/TestDescription.java had another problem, JDK-8208243, which I've fixed, but unfortunately only after we had run the problem-listed tests, so I don't have enough information about its stability.
17-07-2020

[~kbarrett], the reason you don't see failures associated w/ this bug in our testing is that the tests are problem-listed.
17-07-2020

No failures associated with this bug from Oracle CI since 2019-06-20, so removed maintainer-pain label. Not sure why the failures stopped; perhaps some infrastructure or test framework change?
03-12-2019

The preferred fix from Oracle for G1 is to implement object pinning via region pinning. Nobody is working on that though, and a short search in JBS did not show a specific CR for that.
12-11-2019

We see premature OOME due to GCLocker/JNI in production with JDK 11 + G1. Increasing GCLockerRetryAllocationCount could work around the problem, but we are concerned that this could cause long stalls in worst-case scenarios. Is anyone working on this bug, or does anyone have a prototype fix?
11-11-2019

[~stefank] - I've filed a new bug in hotspot/gc for the sightings that I added to this bug report: JDK-8226536
20-06-2019

I'm not sure that the failures in the last few comments are actually related to this bug. I looked at one of the logs and it was using ZGC, which does not throw OOME because of the GC locker.
20-06-2019

The test failed in ATR w/ AOT'ed java.base; it seems to be the same issue as this one. If someone thinks otherwise, please open a new bug:

java.lang.OutOfMemoryError: Java heap space
    at TestJNIBlockFullGC.runTest(TestJNIBlockFullGC.java:99)
    at TestJNIBlockFullGC$2.run(TestJNIBlockFullGC.java:168)
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at TestJNIBlockFullGC.runTest(TestJNIBlockFullGC.java:90)
    at TestJNIBlockFullGC$2.run(TestJNIBlockFullGC.java:168)
java.lang.RuntimeException: Experienced an OoME during execution.
    at TestJNIBlockFullGC.main(TestJNIBlockFullGC.java:178)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:115)
    at java.base/java.lang.Thread.run(Thread.java:834)
16-07-2018

[~pliden]: In the normal case there can be no starvation problem, as the thread will always enqueue and wait for the result of a VM operation (assuming the VM operation queue does not drop VM operations; of course this is not very efficient, and the GC locker can effectively cancel VM operations too...). The problem really is the GCLocker preventing the enqueuing and eventual execution of the allocation request. (Random idea:) So one option to fix this is that after waiting for a GC locker, _all_ threads, and not just any single one, are guaranteed to enqueue and wait for a VM operation (or to successfully allocate before that VM operation, obviating the need for it), and the GC locker is not allowed to cancel these operations, i.e., as suggested, no thread may enter a critical section again until then. Not sure if this is easier to guarantee, or desirable, as this may lock out the threads executing JNI code for a long time.
04-12-2017

[~tschatzl] Agree, the above-mentioned solution only fixes the GC-locker part of the problem. There still needs to be some sort of allocation queue for the allocating threads to solve the other starvation problem (which in theory can also happen without the GC-locker being involved).
04-12-2017

[~pliden]:" Anyways, working on an alternative fix. The problem with the GC locker is that the last thread out of the JNI critical region is the one doing a GC. This means that the allocating threads (which are stalled due to GC locker being active) will just wait and then retry the allocation when the GC locker performed GC has completed. However, this is where information is lost. If the GC doesn't actually free up any memory there's no way for the allocating threads to know this and report the OOM condition. Instead they will end up trying to do another GC, which in turn is likely to fail because the GC locker is active again, so they stall and wait. This cycle goes on and on. The fix I'm working on will move the responsibility of doing a GC out of the GC locker. Instead, the last thread out of a JNI ciritical region will signal to *any stalled allocator to continue and do a GC. No thread is allowed to enter a JNI critical region until a GC has been performed*. With this approach, the GC is initiated by the allocating thread (i.e. the normal way) and so if the GC fails to free up memory we have a real OOM and the allocating thread can throw this error. " I do not think this will help as described (and of course it depends on the implementation of that fix): when threads are stalled *all* of them need to be serviced with that GC, regardless of who initiates it, otherwise there is still opportunity for one thread to be starved of memory (never getting a chance to actually allocate). Consider the situation when multiple threads are waiting, only a random one gets the opportunity to allocate, and the others will try to repeat the allocation. Given that during that time new threads might require allocation, there is no guarantee that a particular thread will ever be able to successfully allocate, looping until the threads give up. Alternatively all threads need schedule a VM operation to allocate.
04-12-2017
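A sketch of the "every stalled thread schedules a VM operation to allocate" alternative mentioned at the end of the comment above: the allocation is retried inside the safepoint on behalf of the requesting thread, so no other allocator can snatch the freed memory away. VM_Operation and VM_CollectForAllocation are heavily simplified mocks of the HotSpot classes of the same names, and run_full_gc/allocate_after_gc are placeholders.

#include <cstddef>

// Mock of a VM operation executed at a safepoint by the VM thread.
class VM_Operation {
public:
  virtual ~VM_Operation() = default;
  virtual void doit() = 0;                    // runs with all Java threads stopped
};

static void  run_full_gc()                  {}                  // placeholder GC
static void* allocate_after_gc(std::size_t) { return nullptr; } // placeholder post-GC allocation

class VM_CollectForAllocation : public VM_Operation {
  std::size_t _size;
  void*       _result = nullptr;
public:
  explicit VM_CollectForAllocation(std::size_t size) : _size(size) {}
  void doit() override {
    run_full_gc();
    // Because this runs inside the safepoint, the requesting thread gets first shot
    // at the freed memory before any other allocator resumes.
    _result = allocate_after_gc(_size);
  }
  void* result() const { return _result; }    // null here means a genuine OOME for the requester
};

int main() {
  VM_CollectForAllocation op(64);
  op.doit();                                  // in HotSpot the VM thread would execute this
  return op.result() == nullptr ? 0 : 1;
}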

Another option would be to collect pending memory allocation requests at the start of a safepoint (typically more than a single thread requires new memory at the start of a GC) and satisfy all of them in the safepoint.
29-11-2017
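A sketch of this idea: allocation requests that could not be satisfied are queued, and the GC services every queued request inside the safepoint before the requesting threads resume, so no stalled thread can be starved by faster allocators. AllocationRequest, AllocationRequestQueue and allocate_from_freed_space are illustrative names, not HotSpot code.

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

struct AllocationRequest {
  std::size_t size;
  void*       result = nullptr;
  bool        done   = false;
};

class AllocationRequestQueue {
  std::mutex                      _lock;
  std::condition_variable         _cv;
  std::vector<AllocationRequest*> _pending;
public:
  // Called by an allocating thread that failed the fast path: park until a GC has
  // serviced this exact request.
  void* enqueue_and_wait(AllocationRequest* req) {
    std::unique_lock<std::mutex> g(_lock);
    _pending.push_back(req);
    _cv.wait(g, [req] { return req->done; });
    return req->result;
  }
  // Called inside the safepoint after the GC has freed memory: satisfy every pending
  // request, then wake all requesters. allocate_from_freed_space stands in for the
  // real post-GC allocation.
  template <typename AllocFn>
  void satisfy_all(AllocFn allocate_from_freed_space) {
    std::lock_guard<std::mutex> g(_lock);
    for (AllocationRequest* req : _pending) {
      req->result = allocate_from_freed_space(req->size);   // may still be null: a real OOME
      req->done   = true;
    }
    _pending.clear();
    _cv.notify_all();
  }
};

int main() {
  AllocationRequestQueue queue;
  queue.satisfy_all([](std::size_t) -> void* { return nullptr; });  // nothing pending: no-op
  return 0;
}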

Potential fix idea from [~pliden]:

The change introduced in JDK-7014552 can cause a premature OOME if the allocation pressure is very high and GC_locker is active for long periods (which it is in this test). Without the fix made in JDK-7014552 there's a possibility for a live-lock, which is probably what happened in the original bug. However, the jnilock* tests are kind of questionable. Keeping the GC locker active for long periods goes against the (JNI) specification, so sub-par performance should more or less be expected here.

Anyways, working on an alternative fix. The problem with the GC locker is that the last thread out of the JNI critical region is the one doing a GC. This means that the allocating threads (which are stalled due to the GC locker being active) will just wait and then retry the allocation when the GC-locker-performed GC has completed. However, this is where information is lost. If the GC doesn't actually free up any memory there's no way for the allocating threads to know this and report the OOM condition. Instead they will end up trying to do another GC, which in turn is likely to fail because the GC locker is active again, so they stall and wait. This cycle goes on and on.

The fix I'm working on will move the responsibility of doing a GC out of the GC locker. Instead, the last thread out of a JNI critical region will signal to any stalled allocator to continue and do a GC. No thread is allowed to enter a JNI critical region until a GC has been performed. With this approach, the GC is initiated by the allocating thread (i.e. the normal way), and so if the GC fails to free up memory we have a real OOM and the allocating thread can throw this error. This should simplify the whole GC locker, which makes me wonder why it wasn't designed this way from the start. Hope I'm not missing something.

However, this change also has some side effects, such as GCCause::_gc_locker going away (the GC locker doesn't do GC anymore) and the option GCLockerInvokesConcurrent becoming irrelevant.
29-11-2017
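A sketch of this fix idea, as I read it: the last thread leaving a JNI critical region no longer performs a GC itself, it only wakes the stalled allocators, and further critical-region entries are held back until one of those allocators has driven a GC. GCLockerModel and its methods are an illustrative, heavily simplified model, not the proposed patch.

#include <condition_variable>
#include <mutex>

class GCLockerModel {
  std::mutex              _lock;
  std::condition_variable _cv;
  int                     _in_critical = 0;
  bool                    _gc_needed   = false;   // allocators stalled while regions were held
public:
  void enter_critical() {
    std::unique_lock<std::mutex> g(_lock);
    // New rule: no thread may (re)enter a critical region while a GC requested by
    // stalled allocators is still outstanding.
    _cv.wait(g, [this] { return !_gc_needed; });
    ++_in_critical;
  }
  void exit_critical() {
    std::lock_guard<std::mutex> g(_lock);
    if (--_in_critical == 0 && _gc_needed) {
      _cv.notify_all();             // last thread out: signal the allocators, do NOT run a GC here
    }
  }
  // Allocating thread that failed because critical regions were held: record that a GC
  // is needed and wait for the regions to drain.
  void stall_for_critical_regions() {
    std::unique_lock<std::mutex> g(_lock);
    _gc_needed = true;
    _cv.wait(g, [this] { return _in_critical == 0; });
  }
  // Called by the allocating thread after it has run the GC itself (the normal GC path);
  // if that GC freed nothing, the thread now has a definitive answer and can throw a real OOME.
  void gc_performed() {
    std::lock_guard<std::mutex> g(_lock);
    _gc_needed = false;
    _cv.notify_all();               // critical-region entries may proceed again
  }
};

int main() {
  GCLockerModel locker;
  locker.enter_critical();
  locker.exit_critical();
  return 0;
}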