JDK-7000559 : G1: assertion failure !outer || (full_collections_started == _full_collections_completed + 1)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 7
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2010-11-16
  • Updated: 2013-09-18
  • Resolved: 2011-03-07
JDK 6: 6u25 (Fixed)
JDK 7: 7 (Fixed)
Other: hs20 (Fixed)
Related Reports
Relates :  
Description
G1 failure in nightly testing.

#  Internal Error (C:\temp\jprt\P1\B\185041.ysr\source\src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp:1780), pid=5588, tid=1280
#  assert(!outer || (full_collections_started == _full_collections_completed + 1)) failed: for outer caller: full_collections_started = 13 is inconsistent with _full_collections_completed = 11

Test was

nsk/stress/except/except010

Comments
EVALUATION http://hg.openjdk.java.net/jdk7/build/hotspot/rev/fb712ff22571
25-12-2010

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/fb712ff22571
16-12-2010

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/fb712ff22571
15-12-2010

SUGGESTED FIX We'll go with fix #1:

--- a/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
+++ b/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
@@ -277,7 +277,9 @@
     // completed. This will also notify the FullGCCount_lock in case a
     // Java thread is waiting for a full GC to happen (e.g., it
     // called System.gc() with +ExplicitGCInvokesConcurrent).
+    _sts.join();
     g1->increment_full_collections_completed(true /* outer */);
+    _sts.leave();
   }
   assert(_should_terminate, "just checking");
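For context, here is a minimal, self-contained model of what the two added calls buy (all names below are invented for illustration; this is not the HotSpot SuspendibleThreadSet): a safepoint cannot begin while any thread is between join() and leave(), so the completion update can no longer race with a Full GC pause.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

class SuspendibleSet {                    // toy stand-in, not HotSpot's _sts
  std::mutex              _m;
  std::condition_variable _cv;
  int                     _joined = 0;
  bool                    _safepoint = false;
public:
  void join() {                           // called by a concurrent GC thread
    std::unique_lock<std::mutex> l(_m);
    _cv.wait(l, [this] { return !_safepoint; });   // never join mid-safepoint
    ++_joined;
  }
  void leave() {
    std::lock_guard<std::mutex> l(_m);
    --_joined;
    _cv.notify_all();                     // a pending safepoint may now start
  }
  void begin_safepoint() {                // "VM thread" starting a Full GC pause
    std::unique_lock<std::mutex> l(_m);
    _safepoint = true;
    _cv.wait(l, [this] { return _joined == 0; });  // wait for joined threads
  }
  void end_safepoint() {
    std::lock_guard<std::mutex> l(_m);
    _safepoint = false;
    _cv.notify_all();
  }
};

int main() {
  SuspendibleSet sts;
  int completed = 0;                      // stand-in for _full_collections_completed

  std::thread cm([&] {                    // stand-in for the concurrent mark thread
    sts.join();                           // the fix: do the update while suspendible
    ++completed;                          // cannot overlap begin/end_safepoint below
    sts.leave();
  });

  sts.begin_safepoint();                  // stand-in for the Full GC pause
  // No thread is between join() and leave() while we are here.
  sts.end_safepoint();

  cm.join();
  std::cout << "completed = " << completed << std::endl;
  return 0;
}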
11-12-2010

EVALUATION I should add that the failure is benign in the product. The mutator thread that is waiting for the notification from the concurrent mark thread is actually in native, and to resume executing Java code it will need to transition back to VM mode. So, even if it wakes up during the Full GC, it will stall until the safepoint is done. (Many thanks to Ramki for the explanation!) I'm lowering the priority to P4.
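A rough model of why this is benign (all names invented; this is not the actual HotSpot thread-state machinery): a thread returning from native has to pass a safepoint check before it can run VM/Java code again, so a premature notification during the Full GC pause only moves it to this blocking point.

#include <condition_variable>
#include <mutex>
#include <thread>

struct SafepointState {                   // toy stand-in for the safepoint protocol
  std::mutex              m;
  std::condition_variable cv;
  bool                    active = false;

  void transition_native_to_vm() {        // what the woken-up mutator must do
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return !active; });  // stalls until the safepoint ends
  }
  void begin() {                          // VM thread: safepoint (Full GC) starts
    std::lock_guard<std::mutex> l(m);
    active = true;
  }
  void end() {                            // VM thread: safepoint is over
    { std::lock_guard<std::mutex> l(m); active = false; }
    cv.notify_all();
  }
};

int main() {
  SafepointState sp;
  sp.begin();                             // Full GC pause in progress

  std::thread mutator([&] {
    // Even if the CM thread notifies this thread "too early", it cannot
    // resume Java execution: it blocks here until the safepoint is done.
    sp.transition_native_to_vm();
  });

  sp.end();                               // pause over; the mutator proceeds
  mutator.join();
  return 0;
}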
10-12-2010

SUGGESTED FIX If the issue is as described in Evaluation Note #1, then there are a couple of quick fixes we can do (a toy model of the second option is sketched below):
- The CM thread should join the suspendible thread set before calling increment_full_collections_completed(). (Maybe this is my preferred one.)
- The VM thread should take the FullGCCount_lock for the duration of the Full GC.
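A toy model of the second option (the counter and lock names are stand-ins, not the real HotSpot code): if the VM thread holds the count lock for the whole Full GC, the CM thread's outer completion call cannot observe the window between "started" and "completed", so the nesting the assert checks for is preserved.

#include <cassert>
#include <mutex>
#include <thread>

std::mutex full_gc_count_lock;            // stand-in for FullGCCount_lock
unsigned   started   = 0;                 // stand-in for _total_full_collections
unsigned   completed = 0;                 // stand-in for _full_collections_completed

void initial_mark_pause() {               // starts a concurrent marking cycle
  std::lock_guard<std::mutex> l(full_gc_count_lock);
  ++started;
}

void full_gc() {                          // quick fix #2: hold the lock throughout
  std::lock_guard<std::mutex> l(full_gc_count_lock);
  ++started;
  /* ... the actual full collection would run here ... */
  ++completed;                            // the "inner" completion increment
}

void concurrent_cycle_end() {             // the "outer" completion increment
  std::lock_guard<std::mutex> l(full_gc_count_lock);
  assert(started == completed + 1);       // the nesting check that failed
  ++completed;
}

int main() {
  initial_mark_pause();                   // started = 1, completed = 0

  // With the lock held across full_gc(), the CM thread either runs before it
  // (sees 1 == 0 + 1) or after it (sees 2 == 1 + 1); the assert always holds.
  std::thread vm(full_gc);
  std::thread cm(concurrent_cycle_end);
  vm.join();
  cm.join();

  assert(started == completed);
  return 0;
}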
09-12-2010

EVALUATION I have not been able to reproduce it, but I browsed through the code and I have a theory on how this failure can happen. The failure is related to my changes for 6944166: G1: explicit GCs are not always handled correctly. The failing assert is in G1CollectedHeap::increment_full_collections_completed(), which is called either at the end of a Full GC (from do_collection()) or at the end of a concurrent marking phase (at the end of the loop in ConcurrentMarkThread::run()).

Here's some info from the hs_err file from the failure:

Other Threads:
  0x00000000054ab000 VMThread [stack: 0x0000000005540000,0x0000000005640000] [id=652]
=>0x0000000001c9f800 (exited) ConcurrentGCThread [stack: 0x0000000004e20000,0x0000000004f20000] [id=1280]

VM state: at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread: ([mutex/lock_event])
[0x0000000001c35068] FullGCCount_lock - owner thread: 0x0000000001c9f800
[0x0000000001c372d8] Threads_lock - owner thread: 0x00000000054ab000
[0x0000000001c37af8] Heap_lock - owner thread: 0x0000000001c39800

So, the thread that hit the assert seems to be the ConcurrentGCThread, therefore the call to increment_full_collections_completed() should have come from the ConcurrentMarkThread::run() method (note that the same thread is holding the FullGCCount_lock, which is taken in that method). Also note that the VM currently seems to be at a safepoint. I assume this safepoint is for a Full GC, since it's the only explanation of how the failure can occur.

This is how the loop in ConcurrentMarkThread::run() works:

void ConcurrentMarkThread::run() {
  [snip]
  while (!_should_terminate) {
    // wait until started is set.
    sleepBeforeNextCycle();
    [snip]
    // we now want to allow clearing of the marking bitmap to be
    // suspended by a collection pause.
    _sts.join();
    _cm->clearNextBitmap();
    _sts.leave();
  }
  g1->increment_full_collections_completed(true /* outer */);
}

The marking thread will not normally wait for a safepoint unless it joins the suspendible thread set (_sts.join()) and does an explicit yield call (which is how it does the clearing of the next marking bitmap above). What I think happened is that a safepoint for a Full GC was initiated without the concurrent GC thread having joined the suspendible thread set. The Full GC then incremented the number of full GCs that had been started, but the concurrent GC thread carried on working through the safepoint and called increment_full_collections_completed() before the Full GC did. This breaks the assumed nesting (it's assumed that the Full GC will call increment_full_collections_completed() before the concurrent mark thread), which is what the assert checks for. The values full_collections_started = 13 and _full_collections_completed = 11 that we get from the assert are also consistent with the above scenario.

Here's a timeline to make it clearer:

_total_full_collections == 11
_full_collections_completed == 11

VM Thread                                 Concurrent Mark Thread
safepoint start (initial-mark pause)
_total_full_collections++ (== 12)
notify CM thread to start
safepoint end
                                          concurrent marking start
safepoint start (Full GC)
_total_full_collections++ (== 13)
                                          increment_full_collections_completed(true /* outer */);
                                          FullGCCount_lock->lock()
                                          *BOOM* assert fails
// instead, the VM thread should have reached here and called
// increment_full_collections_completed() before the CM thread:
increment_full_collections_completed(false /* outer */);
safepoint end

I'm not surprised it doesn't fail often. It's a very hard scenario to reproduce.
Maybe I can artificially cause it by adding a delay in the CM thread before the call to increment_full_collections_completed().
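Along those lines, here is a toy interleaving that forces the described race with sleeps (plain C++ with invented names mirroring the HotSpot counters; not the actual reproducer): the "VM thread" bumps the started count for the Full GC while the "CM thread", which never joined the suspendible set, runs straight through the pause and performs the outer completion check in the middle of it.

#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

std::atomic<unsigned> total_full_collections(11);      // stand-in for _total_full_collections
std::atomic<unsigned> full_collections_completed(11);  // stand-in for _full_collections_completed

int main() {
  // Initial-mark pause starts a concurrent cycle.
  ++total_full_collections;                            // == 12

  std::thread cm([] {                                  // "concurrent mark thread"
    // The delay stands in for the end of the marking cycle (the kind of
    // artificial delay mentioned above to provoke the race).
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    // Outer increment_full_collections_completed(): by now the Full GC below
    // has bumped the started count to 13 but has not yet completed, so the
    // nesting check fires with 13 vs 11, just like in the bug report.
    assert(total_full_collections == full_collections_completed + 1);
    ++full_collections_completed;
  });

  // "VM thread": a Full GC at a safepoint the CM thread never joined,
  // so the CM thread keeps running right through it.
  ++total_full_collections;                            // == 13
  std::this_thread::sleep_for(std::chrono::milliseconds(50));  // the Full GC itself
  ++full_collections_completed;                        // inner completion increment

  cm.join();                                           // the assert above fails first
  return 0;
}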
09-12-2010