Bug ID: JDK-7129892 G1: explicit marking cycle initiation might fail to initiate a marking cycle

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: hs23

Priority: P4
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2012-01-13
Updated: 2013-09-18
Resolved: 2012-03-24

JDK 7	JDK 8	Other
7u4 Fixed	8 Fixed	hs23 Fixed

While doing a code review for a G1 change I think I spotted a subtle race in the code. Consider two mutator threads (A and B) doing the following concurrently:

A: it attempts an allocation, the allocation fails, it schedules a GC VM op in an attempt to free up space

B: it needs to explicitly start a concurrent marking cycle (say: System.gc() with -XX:+ExplicitGCInvokesConcurrent), it calls collect(Cause cause) which schedules a GC VM op with the should_initiate_conc_mark flag set to true.

Currently, one of the GC VM ops will "win" and do the GC; the other will observe that a GC took place between the time it was scheduled and the time it was executed, and will do nothing else. If A's VM op "wins", then B's VM op will not do the GC and, as a result, the concurrent marking cycle will not start.
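
The race described above can be modeled with a short, single-threaded Java sketch. This is only an illustration under stated assumptions, not HotSpot code: the class, the `GcOp` type, and the collection counter are hypothetical names chosen for the example. The one behavior it takes from the description is that a VM op which sees that a GC happened after it was scheduled does nothing.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the race; none of these names exist in HotSpot.
class SkippedInitialMarkModel {
    static int totalCollections;      // stand-in for the heap's GC counter
    static int markingCyclesStarted;  // how many marking cycles were initiated

    static final class GcOp {
        final int collectionsAtSchedule;      // counter value when the op was scheduled
        final boolean shouldInitiateConcMark; // mirrors the flag named in the report

        GcOp(int collectionsAtSchedule, boolean shouldInitiateConcMark) {
            this.collectionsAtSchedule = collectionsAtSchedule;
            this.shouldInitiateConcMark = shouldInitiateConcMark;
        }

        void execute() {
            // Another GC ran since this op was scheduled: do nothing.
            if (totalCollections != collectionsAtSchedule) {
                return;
            }
            totalCollections++;
            if (shouldInitiateConcMark) {
                markingCyclesStarted++;
            }
        }
    }

    // Schedules both ops against the same counter value, then executes
    // them in the chosen order. Returns the number of cycles started.
    static int run(boolean allocationOpFirst) {
        totalCollections = 0;
        markingCyclesStarted = 0;
        GcOp a = new GcOp(totalCollections, false); // thread A: allocation failure
        GcOp b = new GcOp(totalCollections, true);  // thread B: explicit cycle request
        Queue<GcOp> vmOpQueue = new ArrayDeque<>();
        if (allocationOpFirst) {
            vmOpQueue.add(a);
            vmOpQueue.add(b);
        } else {
            vmOpQueue.add(b);
            vmOpQueue.add(a);
        }
        for (GcOp op : vmOpQueue) {
            op.execute();
        }
        return markingCyclesStarted;
    }

    public static void main(String[] args) {
        System.out.println("A's op first: " + run(true));  // prints "A's op first: 0"
        System.out.println("B's op first: " + run(false)); // prints "B's op first: 1"
    }
}
```

When A's op runs first, B's request is dropped without any marking cycle being started, which is the failure this report is about.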

The mechanisms that use collect(Cause cause) to explicitly start a concurrent marking cycle, and that should therefore be affected by this issue, are:

-XX:+ExplicitGCInvokesConcurrent
-XX:+GCLockerInvokesConcurrent

and the recent changes for
6976060: G1: humongous object allocations should initiate marking cycles when necessary 

I should point out that, as far as I know, we haven't come across this issue during testing so we should first reproduce it with a test to prove that the race can indeed happen.

EVALUATION http://hg.openjdk.java.net/lambda/lambda/hotspot/rev/caa4652b4414
22-03-2012
EVALUATION http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/caa4652b4414
18-02-2012
EVALUATION http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/caa4652b4414
14-02-2012
PUBLIC COMMENTS Interestingly, if collect(Cause cause) is called because a concurrent humongous allocation pushes the non-young capacity over the initiating threshold, and the attempted VM op fails because another thread managed to schedule another GC, then retrying the GC may not be necessary. The GC that succeeded should notice that the non-young capacity has gone over the threshold and should have started the cycle initiation procedure anyway. However, the GC should be re-attempted in the case of ExplicitGCInvokesConcurrent and GCLockerInvokesConcurrent.
16-01-2012
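The cause-dependent retry decision in the comment above could be sketched as follows. This is a hypothetical illustration only: the `Cause` enum and `shouldRetry` are invented names and do not reflect the identifiers used in the actual fix.

```java
// Hypothetical sketch of a cause-dependent retry policy; names are invented.
class RetryPolicySketch {
    // One value per mechanism mentioned in this report.
    enum Cause { EXPLICIT_GC, GC_LOCKER, HUMONGOUS_ALLOCATION }

    // Decides whether the initiating thread should re-attempt its VM op.
    static boolean shouldRetry(Cause cause, boolean opSucceeded) {
        if (opSucceeded) {
            return false; // the op did the GC, nothing to retry
        }
        switch (cause) {
            case EXPLICIT_GC:
            case GC_LOCKER:
                // The competing GC knows nothing about the explicit request.
                return true;
            default:
                // For a humongous allocation, the competing GC is expected to
                // notice the threshold crossing and start the cycle itself.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(Cause.EXPLICIT_GC, false));          // prints "true"
        System.out.println(shouldRetry(Cause.HUMONGOUS_ALLOCATION, false)); // prints "false"
    }
}
```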
PUBLIC COMMENTS I have now proven that this issue exists. I wrote a small test (see attached) in which one thread (the "System GC Thread") does back-to-back System.gc() calls (which initiate concurrent marking cycles if -XX:+ExplicitGCInvokesConcurrent is set) while other threads (the "Load Threads") concurrently allocate short-lived objects. If the test is run with a small-ish young gen, the Load Threads run out of young gen often and compete with the System GC Thread in scheduling a GC. I ran it with:

java -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCTimeStamps -verbosegc -Xms1g -Xmx1g -Xmn32m MarkingCycleInitTest

Around 5-10 marking cycles (out of around 1,300 or so) fail to start within the 2 min window the test runs for. I count the number of initial-mark pauses that appear in the generated GC log and compare it with the number of cycles that were supposed to have been initiated.

I don't think it's worth trying to reproduce the issue with the other ways an explicit concurrent marking cycle can be initiated (currently: -XX:+GCLockerInvokesConcurrent and the changes for 6976060), given that they all use the same underlying mechanism.
16-01-2012
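The counting step described above (initial-mark pauses in the GC log versus cycles requested) might look roughly like the sketch below. It is not the attached test. It assumes that each initial-mark pause is logged on a line containing the text "(initial-mark)", and the sample log lines in `main` are made up for illustration rather than taken from a real run.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; assumes initial-mark pauses are logged with "(initial-mark)".
class InitialMarkCounter {
    static final String MARKER = "(initial-mark)";

    // Counts the log lines that report an initial-mark pause.
    static long countInitialMarks(List<String> gcLogLines) {
        long count = 0;
        for (String line : gcLogLines) {
            if (line.contains(MARKER)) {
                count++;
            }
        }
        return count;
    }

    // Number of requested cycles that never produced an initial-mark pause.
    static long missedCycles(long requestedCycles, List<String> gcLogLines) {
        return requestedCycles - countInitialMarks(gcLogLines);
    }

    public static void main(String[] args) {
        // Made-up sample lines, for illustration only.
        List<String> sample = Arrays.asList(
            "0.512: [GC pause (young) (initial-mark) ...]",
            "0.730: [GC pause (young) ...]",
            "0.951: [GC pause (young) (initial-mark) ...]");
        System.out.println(missedCycles(3, sample)); // prints "1"
    }
}
```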
PUBLIC COMMENTS There are a couple of possible ways to address this:

- Instead of attaching the should_initiate_conc_mark condition to the VM op, we could set a global flag so that any subsequent GC will start a concurrent cycle. This might require some code re-organization though, given that in the case of +ExplicitGCInvokesConcurrent the initiating thread waits for the concurrent cycle to finish in the epilogue of the VM op.
- Alternatively, we could keep attempting the "initial mark" VM op in a loop until it succeeds. In extreme cases, this might cause the initiating thread to starve though.
13-01-2012
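The two options in the comment above can be contrasted in a small model. This is a sketch under stated assumptions, not a proposal for the actual change: all names are hypothetical, and the `maxAttempts` bound is an addition made here to keep the example finite and to make the starvation concern visible.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the two proposed fixes; names are invented.
class InitiationFixesModel {
    // Option 1: a pending-request flag that any later GC pause consumes.
    private boolean markRequested = false;
    private int cyclesStarted = 0;

    void requestMarkingCycle() {
        markRequested = true;
    }

    // Any GC pause, whoever scheduled it, honors a pending request.
    void doGcPause() {
        if (markRequested) {
            markRequested = false;
            cyclesStarted++;
        }
    }

    int cyclesStarted() {
        return cyclesStarted;
    }

    // Option 2: retry the "initial mark" op until it succeeds.
    // Returns the attempt number that succeeded, or -1 if the bound was hit
    // (an unbounded loop could starve the initiating thread).
    static int initiateWithRetry(BooleanSupplier tryInitialMarkOp, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryInitialMarkOp.getAsBoolean()) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        InitiationFixesModel heap = new InitiationFixesModel();
        heap.requestMarkingCycle(); // thread B asks for a cycle
        heap.doGcPause();           // thread A's pause picks up the request
        heap.doGcPause();           // a later pause finds nothing pending
        System.out.println(heap.cyclesStarted()); // prints "1"

        int[] lossesLeft = {2}; // the op loses to a competing GC twice
        int attempts = initiateWithRetry(() -> lossesLeft[0]-- <= 0, 10);
        System.out.println(attempts); // prints "3"
    }
}
```

With the flag, the request survives even when a competing op wins. With the loop, the request is eventually honored, but the initiating thread pays for every lost attempt.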
PUBLIC COMMENTS Another way for an initial-mark pause not to succeed might be the GC locker: if it's active it might prevent the "initial mark" VM op from succeeding.
13-01-2012