JDK-8179226 : gc/stress/gclocker/TestGCLockerWithG1.java: fails with OOME Java heap space
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 10
  • Priority: P3
  • Status: Resolved
  • Resolution: Duplicate
  • Submitted: 2017-04-24
  • Updated: 2018-01-09
  • Resolved: 2018-01-09
  • Fix Version: JDK 10 (Resolved)
Related Reports
  • Duplicate: JDK-8180622
  • Relates:
  • Relates:
Sub Tasks
  • JDK-8185874
Description
gc/stress/gclocker/TestGCLockerWithG1.java: fails with OOME Java heap space
Failures seen on Linux hosts only so far.

The log tail:
...
[11.528s][info][gc,start      ] GC(43) Pause Young (G1 Evacuation Pause)
[11.528s][info][gc,task       ] GC(43) Using 2 workers of 2 for evacuation
[11.530s][info][gc,phases     ] GC(43)   Pre Evacuate Collection Set: 0.3ms
[11.530s][info][gc,phases     ] GC(43)   Evacuate Collection Set: 1.2ms
[11.530s][info][gc,phases     ] GC(43)   Post Evacuate Collection Set: 0.2ms
[11.530s][info][gc,phases     ] GC(43)   Other: 0.1ms
[11.530s][info][gc,heap       ] GC(43) Eden regions: 0->0(75)
[11.530s][info][gc,heap       ] GC(43) Survivor regions: 0->0(10)
[11.530s][info][gc,heap       ] GC(43) Old regions: 744->744
[11.530s][info][gc,heap       ] GC(43) Humongous regions: 756->756
[11.530s][info][gc,metaspace  ] GC(43) Metaspace: 6790K->6790K(8192K)
[11.530s][info][gc            ] GC(43) Pause Young (G1 Evacuation Pause) 1500M->1500M(1500M) 2.022ms
[11.530s][info][gc,cpu        ] GC(43) User=0.00s Sys=0.00s Real=0.00s
[11.530s][info][gc,start      ] GC(44) Pause Full (Allocation Failure)
[11.534s][info][gc,phases,start] GC(44) Phase 1: Mark live objects
[11.693s][info][gc,stringtable ] GC(44) Cleaned string and symbol table, strings: 3731 processed, 196 removed, symbols: 27674 processed, 0 removed
[11.693s][info][gc,phases      ] GC(44) Phase 1: Mark live objects 158.999ms
[11.693s][info][gc,phases,start] GC(44) Phase 2: Compute new object addresses
[11.813s][info][gc,phases      ] GC(44) Phase 2: Compute new object addresses 119.267ms
[11.813s][info][gc,phases,start] GC(44) Phase 3: Adjust pointers
[11.893s][info][gc,phases      ] GC(44) Phase 3: Adjust pointers 80.090ms
[11.893s][info][gc,phases,start] GC(44) Phase 4: Move objects
[11.992s][info][gc,phases      ] GC(44) Phase 4: Move objects 98.811ms
[12.064s][info][gc,task        ] GC(44) Using 2 workers of 2 to rebuild remembered set
[12.145s][info][gc,heap        ] GC(44) Eden regions: 0->0(75)
[12.145s][info][gc,heap        ] GC(44) Survivor regions: 0->0(10)
[12.145s][info][gc,heap        ] GC(44) Old regions: 744->172
[12.145s][info][gc,heap        ] GC(44) Humongous regions: 756->180
[12.145s][info][gc,metaspace   ] GC(44) Metaspace: 6790K->6734K(8192K)
[12.146s][info][gc             ] GC(44) Pause Full (Allocation Failure) 1500M->343M(1500M) 615.231ms
[12.146s][info][gc,cpu         ] GC(44) User=0.47s Sys=0.04s Real=0.61s
[12.148s][info][gc,marking     ] GC(35) Concurrent Mark From Roots 934.707ms
[12.148s][info][gc,marking     ] GC(35) Concurrent Mark Abort
[12.148s][info][gc             ] GC(35) Concurrent Cycle 937.392ms
[12.177s][info][gc,heap,exit   ] Heap
[12.177s][info][gc,heap,exit   ]  garbage-first heap   total 1536000K, used 352000K [0x00007fb4f2400000, 0x00007fb4f2502ee0, 0x00007fb550000000)
[12.177s][info][gc,heap,exit   ]   region size 1024K, 1 young (1024K), 0 survivors (0K)
[12.177s][info][gc,heap,exit   ]  Metaspace       used 6992K, capacity 7222K, committed 7424K, reserved 8192K
----------System.err:(6/303)----------
java.lang.OutOfMemoryError: Java heap space
	at Filler.<init>(TestGCLocker.java:50)
	at MemoryUser.load(TestGCLocker.java:150)
	at MemoryUser.run(TestGCLocker.java:164)
	at java.base/java.lang.Thread.run(Thread.java:844)
STATUS:Failed.`main' threw exception: java.lang.OutOfMemoryError: Java heap space
Comments
Closed as a duplicate of JDK-8180622; any remaining problem seems to be related to the test itself being bad, which is handled there.
09-01-2018

There is also a test bug: the only memory consumer thread holds on to its memory for at least 500 ms before releasing it. This may be too long, i.e. there may be more than the allowed number of GC-locker-induced garbage collections during this time. This is why the test fails.
29-11-2017
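A minimal sketch of the consumer pattern that comment describes, under the assumption of a simplified single-threaded shape; the class and constant names (MemoryConsumerSketch, HOLD_MS, CHUNK_BYTES) are invented for illustration and do not come from the actual TestGCLocker source.

public class MemoryConsumerSketch {
    private static final long HOLD_MS = 500;              // the "at least 500 ms" hold window
    private static final int CHUNK_BYTES = 8 * 1024 * 1024;

    public static void main(String[] args) throws InterruptedException {
        Thread consumer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                byte[] chunk = new byte[CHUNK_BYTES];      // acquire memory ...
                try {
                    // ... keep it live for at least 500 ms; any GC-locker-induced young
                    // collections that happen in this window cannot reclaim it ...
                    Thread.sleep(HOLD_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                chunk = null;                              // ... and only then release it
            }
        });
        consumer.start();
        Thread.sleep(5_000);                               // run briefly, then stop the consumer
        consumer.interrupt();
        consumer.join();
    }
}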

I think I understand the exact sequence of what is happening: the VM is in a state where no young GC yields space for any allocation. One thread A then goes into G1CollectedHeap::attempt_allocation_slow() while the GC locker is *not* active and does *not* need a GC. This means that in G1CollectedHeap::attempt_allocation_slow() we go into the "if (should_try_gc)" part and issue an incremental collection that "succeeds" (which just means that it has not been prevented from running by a GC locker); since we are already in a state where that garbage collection does not yield any space, the VM operation (VM_G1IncCollectionPause) succeeds but returns NULL. There is then code that simply passes the NULL on, with the following comment:

// If we get here we successfully scheduled a collection which
// failed to allocate. No point in trying to allocate
// further. We'll just return NULL.
MutexLockerEx x(Heap_lock);
*gc_count_before_ret = total_collections();
return NULL;

Now the caller of the caller of attempt_allocation_slow() (G1CollectedHeap::mem_allocate(), via attempt_allocation()) would typically schedule a Full GC. However, that does not succeed because the GC locker is active from another thread B; thread B then manages to schedule a regular GC-locker-induced GC, which afterwards clears the flag saying that we are waiting for a GC-locker-induced GC. This repeats GCLockerRetryAllocationCount times in thread A, which ultimately returns NULL, giving the OOME. The Full GCs in the log could be caused by other threads in the same situation managing to schedule them before thread A returns with the failed allocation.
24-11-2017
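To make the thread-A retry behaviour in the 24-11-2017 comment above easier to follow, here is a heavily simplified, hypothetical Java model. It is not HotSpot code: the real logic is C++ in G1CollectedHeap::mem_allocate() and attempt_allocation_slow(), and every method on HeapModel below (attemptAllocationSlow, fullCollectionSucceeded, waitForGcLockerInducedGc) is an invented stand-in for the VM-internal operation it models.

interface HeapModel {
    Object attemptAllocationSlow();      // the young GC "succeeds" but frees nothing, so this returns null
    boolean fullCollectionSucceeded();   // false while thread B keeps the GC locker active
    void waitForGcLockerInducedGc();     // stall until thread B's GC-locker-induced young GC has run
}

final class MemAllocateModel {
    // Stand-in for -XX:GCLockerRetryAllocationCount (assumed here to be 2).
    static final int GC_LOCKER_RETRY_ALLOCATION_COUNT = 2;

    // Thread A's view: every retry ends in a collection that frees no usable space,
    // so once the retry budget is exhausted null is returned and the Java caller sees
    // java.lang.OutOfMemoryError even though a Full GC could have reclaimed the heap.
    static Object memAllocate(HeapModel heap) {
        for (int retry = 0; retry <= GC_LOCKER_RETRY_ALLOCATION_COUNT; retry++) {
            Object result = heap.attemptAllocationSlow();
            if (result != null) {
                return result;
            }
            if (!heap.fullCollectionSucceeded()) {
                // The fallback Full GC is blocked by thread B's critical section; all
                // thread A can do is wait for the GC-locker-induced young GC and retry.
                heap.waitForGcLockerInducedGc();
            }
        }
        return null; // surfaces as java.lang.OutOfMemoryError: Java heap space
    }
}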

[~ehelin] Thanks for being so thorough in your analysis about why this bug isn't an integration_blocker!
15-08-2017

Ok, so I got some time to analyze this bug. As I suspected, it is not a new bug, but essentially the same as JDK-8137099. The failure pattern is slightly different here, but the main issue is still the same. A Java thread wants to satisfy a failed allocation with the help of a full collection, but the GCLocker is active (and sets needs_gc). When the GCLocker finally runs Universe::heap()->collect(GCCause::_gc_locker), G1CollectedHeap::collect runs an "incremental" (young or mixed) collection, not a full one. If all old regions are full and there is no liveness data, the incremental collection will be young and won't be able to collect any garbage. To summarize, the gist of the problem is that the GCLocker "blocks" a full GC and G1 then runs a young collection instead of a full GC.
15-08-2017

[~dcubed] there have been a couple of failures due to this new stress test (added by me). The test itself is fine, it just provokes some old, dusty bugs that have been lying around sleeping :) It should definitely be quarantined for G1 until we solve JDK-8137099 (which you already have done, thank you!). Despite Thomas' comment in JDK-8180622, I don't think JDK-8180622 is a dup. In JDK-8180622 it looks like Parallel GC throws the OOM because the heap actually is full. This most likely happened because the stress test was run in a new environment. This is always finicky with stress tests: I tried to tune the test to stress the JVM, but not too much. If the test is then run on a host that simultaneously runs 10 other stress tests, the stress might be too much. I have also seen a few bugs from some of the other stress tests, but those have been more infra/system related (can't allocate enough heap at startup, etc.). I _think_ the sibling tests are fine; we only need to quarantine the G1 version (which you already have done).
15-08-2017

[~ehelin] I'm okay with removing the integration_blocker label, but we have to do something about the noise in the nightly. The above comments say that this only happens on Linux X64, but if you look at the history for this test you see OOME failures in just about all configs. This test is new in JDK10 so I think we need to quarantine it before we push JDK10/hs -> JDK10/jdk10. Above Kim mentions seeing similar failures in sibling tests like TestGCLockerWithParallel.java. Should we quarantine the entire group of new tests before pushing JDK10/hs -> JDK10/jdk10? Update: I guess I've quarantined too many tests lately. I see that I already quarantined this one via JDK-8185874. So what should we do about the sibling tests?
14-08-2017

[~dcubed] I'm 99% sure this is the same problem as JDK-8137099, does this still makes this an integration blocker?
14-08-2017

Moving the integration_blocker label to JDK-8185874. Update: This test is only present in JDK10-hs. That makes this issue an integration_blocker, so I am putting the label back.
04-08-2017

The bug that added these tests is JDK-8177968, and that bug is only fixed in JDK10-hs. That makes this issue an integration_blocker.
04-08-2017

I've seen failures that look very much like this one, but for TestGCLockerWithParallel.java. Always on Linux-x64.
12-05-2017

btw, this is not an integration blocker; it is a known existing bug (see my comment above). We can probably close this as a duplicate.
25-04-2017

This seems very similar to https://bugs.openjdk.java.net/browse/JDK-8137099. It is suspicious that we throw an OOME when the heap only uses 350 MB out of 1200 MB.
24-04-2017