Bug ID: JDK-6909756 G1: guarantee(!G1CollectedHeap::heap()->mark_in

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: hs17

Priority: P3
Status: Resolved
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2009-12-11
Updated: 2013-09-18
Resolved: 2010-04-21

JDK 6	JDK 7	Other
6u21Fixed	7Fixed	hs17Fixed

Nightly testing is hitting this error:

#  Internal Error (/tmp/jprt/P1/B/231722.ysr/source/src/share/vm/gc_implementation/g1/concurrentMark.cpp:671), pid=31416, tid=2840730512
#  Error: guarantee(!G1CollectedHeap::heap()->mark_in_progress(),"Precondition.")

EVALUATION http://hg.openjdk.java.net/hsx/hsx17/master/rev/41643676408b
09-04-2010
EVALUATION http://hg.openjdk.java.net/hsx/hsx17/baseline/rev/41643676408b
08-04-2010
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/23b1b27ac76c
07-04-2010
PUBLIC COMMENTS I'll also piggy-back a simple change to do the "GC locker active" check earlier in the G1 pauses and before we do the "will this pause be an initial-mark" decision. It turns out that during testing I hit a case when we'd start doing an initial-mark pause, only to abandon it because the GC locker was active. At the beginning of the next pause I'd hit an assert because the "we are during an initial-mark pause" flag was set and it should only be set for the duration of a pause and not outside one (the start of a pause should always find it to be clear). Additionally, I'm changing the "is GC locker active" test from GC_locker::is_active() to GC_locker::check_active_before_gc(), since the latter will block more threads from entering the critical region and, as a result, the GC_locker will become inactive quicker and not starve.
06-04-2010
PUBLIC COMMENTS In the fix I'll piggy-back a couple of minor fixes: removal of the _conc_mark_initiated parameter from G1CollectorPolicy which doesn't seem to be used, as well as minor reformatting.
26-03-2010
SUGGESTED FIX Quick note: we don't need an extra active flag. We can use started() \|\| in_progress().
25-03-2010
SUGGESTED FIX I have come up with a simple fix that will eliminate the assertion. The idea is not to let a new marking cycle start before the CMThread completes the previous cycle (the previous cycle completes after the next marking bitmap has been cleared). So, the CMThread maintains an active flag that's set as soon as the cycle starts and cleared just after the next marking bitmap has been cleared. If a GC pause notices that it needs to start a marking cycle by piggy-backing initial-mark and the CMThread is active, then it will not do the initial-mark (and hence start a marking cycle) and postpone it for a future pause to eventually do it (after the CMThread has become inactive). The assumption is that marking cycles will not happen back-to-back in the normal operation of G1. But the above solution does have the potential to delay the start of marking cycles. However, first: having two marking cycles overlap, as it can happen now, is just wrong and second: if an application initiates back-to-back marking cycles to keep up then the cycle starting late is the least of its performance issues. An alternative would be to ensure that the marking bitmap is cleared during the initial-mark pause (and notify the CMThread that it doesn't have to do it / bail out of doing it). But, this is not entirely satisfactory either given it could cause a very long initial-mark pause.
25-03-2010
SUGGESTED FIX The fix will be tricky. We cannot wait during the IM pause for the CMThread to finish clearing the bitmap, as the CMThread will have yielded waiting for the pause to complete (i.e., we'll deadlock). Also, we cannot clear the bitmap in the beginning of the marking cycle (which starts after the IM pause), as we need it to be clear during the IM pause so that we can mark the roots on it.
24-03-2010
EVALUATION I poked around the code to see how this can be caused. It seems to me that it's not necessarily the back-to-back Full GCs that are the main cause for this as I had originally thought. The concurrent marking (CM) code assumes that the concurrent marking thread (CMThread) will have finished its work during a cycle (which completes with clearing the next marking bitmap) before a new cycle is initiated during an initial-mark (IM) pause. However, nothing guards against a new cycle being initiated before the CMThread clears the next marking bitmap. So, the steps that can cause this are the following: - a marking cycle reaches and does cleanup / aborts (we've only seen the issue after a cycle aborted, I think we might also be able to hit it in the case where a cycle gracefully completes) - the CMThread is somehow starved between observing the completion of the cycle and clearing the next marking bitmap - meanwhile a new marking cycle is initiated and the mark_in_progress() flag is set to true - finally, the CMThread wakes up and tries to clear the bitmap, but fails in the guarantee that mark_in_progress() should be false More worrisome is the fact that, while clearing the next marking bitmap, the CMThread actually yields every now and then. What can maybe happen is that the next IM pause happens while the CMThread is yielding, then the marking cycle will start using the next marking bitmap, while the CMThread is happily zeroing the same marking bitmap. And this could go unnoticed (until another failure happens).
24-03-2010
PUBLIC COMMENTS I attached the GC log from a failing test.
11-12-2009

Relates :	JDK-6867250 - G1: java.lang.ArrayIndexOutOfBoundsException in Win64
Relates :	JDK-6666145 - G1: marking verification fails