JDK-7000559 : G1: assertion failure !outer || (full_collections_started == _full_collections_completed + 1)

Details
Type: Bug
Submit Date: 2010-11-16
Status: Closed
Updated Date: 2011-03-07
Project Name: JDK
Resolved Date: 2011-03-07
Component: hotspot
OS: generic
Sub-Component: gc
CPU: generic
Priority: P4
Resolution: Fixed
Affected Versions: 7
Fixed Versions: hs20 (b04)

Description
G1 failure in nightly testing.

#  Internal Error (C:\temp\jprt\P1\B\185041.ysr\source\src\share\vm\gc_implementation\g1\g1CollectedHeap.cpp:1780), pid=5588, tid=1280
#  assert(!outer || (full_collections_started == _full_collections_completed + 1)) failed: for outer caller: full_collections_started = 13 is inconsistent with _full_collections_completed = 11

Test was

nsk/stress/except/except010


Comments
SUGGESTED FIX

We'll go with fix #1:

--- a/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
+++ b/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
@@ -277,7 +277,9 @@
     // completed. This will also notify the FullGCCount_lock in case a
     // Java thread is waiting for a full GC to happen (e.g., it
     // called System.gc() with +ExplicitGCInvokesConcurrent).
+    _sts.join();
     g1->increment_full_collections_completed(true /* outer */);
+    _sts.leave();
   }
   assert(_should_terminate, "just checking");
2010-12-11
EVALUATION

I have not been able to reproduce it but I browsed through the code and I have a theory on how this failure can happen.

The failure is related to my changes for 6944166: G1: explicit GCs are not always handled correctly. The failing assert is in method G1CollectedHeap::increment_full_collections_completed() which is called either at the end of a Full GC (from do_collection()) or at the end of a concurrent marking phase (at the end of the loop in ConcurrentMarkThread::run()).
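For orientation, here is a simplified reconstruction of that method and of the invariant the assert encodes. It is pieced together from the assert message, the comment quoted in the suggested fix, and the timeline below; it is not the exact HotSpot code:

// Simplified reconstruction, not the actual source.
void G1CollectedHeap::increment_full_collections_completed(bool outer) {
  // The real method takes FullGCCount_lock here (see below).
  FullGCCount_lock->lock();

  unsigned int full_collections_started = _total_full_collections;

  // The "outer" caller is the concurrent mark thread finishing a cycle.
  // By the time it gets here, the pause that started the cycle has already
  // bumped _total_full_collections, and any Full GC that interrupted the
  // cycle has already recorded its own (inner) completion, so the two
  // counters are expected to differ by exactly one.
  assert(!outer || (full_collections_started == _full_collections_completed + 1),
         "inconsistent full collection counters");

  _full_collections_completed += 1;

  // Wake up any Java thread waiting for a full GC to happen (e.g., it
  // called System.gc() with +ExplicitGCInvokesConcurrent).
  FullGCCount_lock->notify_all();
  FullGCCount_lock->unlock();
}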

Here's some info from the hs_err file from the failure:

Other Threads:
  0x00000000054ab000 VMThread [stack: 0x0000000005540000,0x0000000005640000] [id=652]

=>0x0000000001c9f800 (exited) ConcurrentGCThread [stack: 0x0000000004e20000,0x0000000004f20000] [id=1280]

VM state:at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread:  ([mutex/lock_event])
[0x0000000001c35068] FullGCCount_lock - owner thread: 0x0000000001c9f800
[0x0000000001c372d8] Threads_lock - owner thread: 0x00000000054ab000
[0x0000000001c37af8] Heap_lock - owner thread: 0x0000000001c39800

So, the thread that hit the assert seems to be the ConcurrentGCThread, which means the call to increment_full_collections_completed() should have come from the ConcurrentMarkThread::run() method (note that the same thread is holding the FullGCCount_lock, which is taken in that method). Also note that the VM currently seems to be at a safepoint. I assume this safepoint is for a Full GC, since that is the only explanation for how the failure can occur.

This is how the loop in ConcurrentMarkThread::run() works:

void ConcurrentMarkThread::run() {
  [snip]

  while (!_should_terminate) {
    // wait until started is set.
    sleepBeforeNextCycle();

    [snip]

      // we now want to allow clearing of the marking bitmap to be
      // suspended by a collection pause.
      _sts.join();
      _cm->clearNextBitmap();
      _sts.leave();
    }

    g1->increment_full_collections_completed(true /* outer */);
  }

The marking thread will not normally wait for a safepoint unless it joins the suspendible thread set (_sts.join()) and does an explicit yield call (which is how it does the clearing of the next marking bitmap above).
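Condensing the loop above, the window being described looks like this (the calls are the ones already shown; the comments are mine):

  _sts.join();                 // a safepoint now suspends / waits for this thread
  _cm->clearNextBitmap();      // yields to pending safepoints while it works
  _sts.leave();                // safepoints no longer synchronize with this thread

  // ...

  // Outside a join()/leave() pair the thread runs straight through a
  // safepoint, including a Full GC safepoint:
  g1->increment_full_collections_completed(true /* outer */);

Suggested fix #1 simply wraps this last call in _sts.join()/_sts.leave(), as shown in the diff above, closing that window.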

What I think happened is that a safepoint for a Full GC was initiated without the concurrent GC thread having joined the suspendible thread set. Then the Full GC incremented the number of full GCs that had been started but the concurrent GC thread carried on working through the safepoint and called increment_full_collections_completed() before the Full GC did. This breaks the assumed nesting (it's assumed that the Full GC will call increment_full_collections_completed() before the concurrent mark thread), which is what the assert checks for. The values for full_collections_started = 13 and _full_collections_completed = 11 that we get from the assert are also consistent with the above scenario.

Here's a timeline to make it clearer:

_total_full_collections == 11
_full_collections_completed == 11

VM Thread                           Concurrent Mark Thread


safepoint start
(initial-mark pause)

_total_full_collections++
    (== 12)
notify CM thread to start
safepoint end


                                    concurrent marking start

safepoint start
(Full GC)

_total_full_collections++
    (== 13)


                                    increment_full_
                                      collections_completed(true /* outer */);
                                    FullGCCount_lock->lock()
                                    *BOOM* assert fails

// instead, the VM thread should have reached here and called 
// increment_full_collections_completed() before the CM thread:

increment_full_
  collections_completed(false /* outer */);
safepoint end

I'm not surprised it doesn't fail often. It's a very hard scenario to reproduce. Maybe I can artificially cause it by adding a delay in the CM thread before the call to increment_full_collections_completed().
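If one wanted to force the window, the kind of delay meant here could look like this (purely hypothetical instrumentation, never part of any change; the os::sleep() signature is quoted from memory and should be treated as an assumption):

  // Hypothetical: widen the race window so a Full GC safepoint can start
  // between the end of marking and the "completed" increment.
  os::sleep(Thread::current(), 1000 /* ms */, false /* not interruptible */);
  g1->increment_full_collections_completed(true /* outer */);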
2010-12-09
SUGGESTED FIX

If the issue is as described in Evaluation Note #1 then there are a couple of quick fixes we can do:

- The CM thread should join the suspendible thread set before calling increment_full_collections_completed(). (Maybe this is my preferred one.)
- The VM thread should take the FullGCCount_lock for the duration of the Full GC (a rough sketch follows).
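For illustration only, a rough sketch of what the second option could look like. It assumes a hypothetical lock-held variant of increment_full_collections_completed() (the normal method acquires FullGCCount_lock itself), and it is not the fix that was taken:

  // Hypothetical sketch of option #2, somewhere in the Full GC path
  // (do_collection()); not an actual change.
  FullGCCount_lock->lock();        // held across the whole Full GC, so the CM
                                   // thread cannot record a completed cycle
                                   // in the middle of it
  // ... increment _total_full_collections, perform the Full GC ...
  increment_full_collections_completed_locked(false /* outer */);  // hypothetical
  FullGCCount_lock->unlock();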
2010-12-09
EVALUATION

I should add that the failure is benign in the product. The mutator thread that is waiting for the notification from the concurrent mark thread is actually in native code, and to resume execution it will need to transition back to VM mode. So, even if it wakes up during the Full GC, it will stall until the safepoint is done. (Many thanks to Ramki for the explanation!) I'm lowering the priority to P4.
2010-12-10
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/fb712ff22571
2010-12-15
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/fb712ff22571
2010-12-16
EVALUATION

http://hg.openjdk.java.net/jdk7/build/hotspot/rev/fb712ff22571
2010-12-25
