JDK-6666145 : G1: marking verification fails
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 7
  • Priority: P2
  • Status: Closed
  • Resolution: Cannot Reproduce
  • OS: generic
  • CPU: generic
  • Submitted: 2008-02-21
  • Updated: 2013-09-18
  • Resolved: 2010-11-23

Other
  • hs18 : Resolved
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Description
During testing for 6659663 I came across failures that look like this (they were discovered during heap verification before GC):

Field 0xdb97e138 of live obj 0xd35ff238 points to dead obj 0xd35ff234.
Live obj:
[Ljava.util.Date; 
 - klass: 'java/util/Date'[]
 - length: 5
 -   0 : a 'java/util/Date'
 -   1 : a 'java/util/Date'
 -   2 : a 'java/util/Date'
 -   3 : a 'java/util/Date'
 -   4 : a 'java/util/Date'
Bad referent:
java.util.Date 
 - klass: 'java/util/Date'
fields:
   - private transient 'fastTime' 'J' @8 
   - private transient 'cdate' 'Lsun/util/calendar/BaseCalendar$Date;' @16 

These are not the "standard" missing-remset-entry failures. They actually indicate that a live object is pointing to a garbage one. I ran with -XX:+G1VerifyConcMark and got the same failure at the end of marking (marking verification happens at the end of cleanup).

I can reproduce this issue relatively easily, but only with the heavy instrumentation I added to the VM while investigating 6659663. I went back through my error logs for previous instances of this failure and could find only one (whereas it happens very easily with the added instrumentation).

I have reproduced it with jbb2005 on Niagara and Tigerton.
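The check that fires here can be pictured as a walk over the live objects, verifying that every reference field points to a marked object. A minimal toy sketch of that idea (illustrative only, not HotSpot source; the object model and field iteration are heavily simplified):

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Toy object: a mark bit plus a list of reference fields.
struct Obj {
    bool marked;                 // set by (concurrent) marking
    std::vector<Obj*> fields;    // reference fields of this object
};

// Returns false (and prints a note) if a live object references an
// object that marking failed to reach -- the failure mode in this CR.
bool verify_marking(const std::vector<Obj*>& heap) {
    bool ok = true;
    for (Obj* o : heap) {
        if (!o->marked) continue;              // dead objects are not scanned
        for (Obj* ref : o->fields) {
            if (ref != nullptr && !ref->marked) {
                std::printf("Field of live obj %p points to dead obj %p\n",
                            (void*)o, (void*)ref);
                ok = false;
            }
        }
    }
    return ok;
}
```

In the failures above, verification finds exactly this: an object the marking pass considers live holding a pointer to one it considers dead.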

Comments
EVALUATION Additionally, CR 6940310 can definitely explain some of these marking failures, as it can cause a region pop to remove an entry from the stack but return the wrong information, "hiding" a region from the marking threads.
05-04-2010

EVALUATION We should also note that, even though it's not very likely to happen, the problem in CR 6909756 can also cause marking to be incomplete due to the concurrent marking thread accidentally zeroing parts of the marking bitmap (see that CR for more details).
25-03-2010

EVALUATION A little bit more info on the remaining marking bug. It happens very rarely: out of 3,000+ jbb2005 runs on three different machines, I hit it only 3 times, and only on one of the three machines. I was running with 8 warehouses, and the machine which hit the failure did longer jbb2005 runs than the other two (it has 8 CPUs and went into the longer timing run; the others have 16 and 32 CPUs and only did shorter warm-up runs). So, maybe, longer runs will help uncover the issue quicker than shorter ones. All three failures have the same symptoms: a small number of live objects on a region R point to a small number of dead objects on the same region R. Further, R was an old to-space region during the marking cycle (i.e., during a collection between, and excluding, the initial-mark and remark pauses).
22-03-2010

EVALUATION It turns out that after 3,000 runs of jbb2005 I hit a single marking failure. So there are still outstanding marking issues, even though they now seem very infrequent / hard to reproduce. I'll keep this CR open to carry on the investigation, and I opened 6935821 for the issue of SATB queues not being activated properly.
17-03-2010

EVALUATION We investigated the issue for some time and all the indications were that the failure was being caused by a missing write barrier. By getting heap snapshots at the start and end of the marking cycle we could see that:

  start of marking:  a.f == b
  end of marking:    a.f == c
                     x.g == b   (x was allocated after marking started)

x was implicitly live (since it was allocated after the start of marking) and a was explicitly live. It was the assignment a.f = c that should have caused b to be queued up and marked, as objects allocated since the start of marking do not have to be visited by the marking threads (according to the SATB invariants).

We could also reproduce it with -Xint, so John Cuthbertson checked the interpreter code for missing / incorrect write barriers and all looked good. It was John who finally spotted the problem: the SATB queues of newly-created threads were created with their active field set to false, irrespective of whether the thread was created during a marking cycle or not. So, updates were ignored for threads that were created during a marking cycle.

The way I ran jbb2005, with short timing intervals, was a good way to reproduce this bug, given that jbb2005 creates new worker threads for each timing interval. If the marking cycle started during, say, warehouse setup and continued into the subsequent timing interval, the bug would surface. And surface it did. A big well done to John for spotting the issue.
16-03-2010

SUGGESTED FIX Still in progress.
16-03-2010

EVALUATION There seem to be some fences missing when shared fields are accessed by concurrent marking threads. This might be one reason for the marking failures. I've opened 6673883 to track this.
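The general hazard behind missing fences can be illustrated with C++11 atomics (a hedged sketch of the pattern only; HotSpot uses its own ordering primitives, and this is not the actual 6673883 fix): without release/acquire ordering, a marking thread can observe a published flag before the data it guards.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                   // shared field written by one thread
static std::atomic<bool> published{false};

void writer() {
    payload = 42;                                      // write the data first
    published.store(true, std::memory_order_release);  // then publish; release
                                                       // orders the data before
                                                       // the flag
}

int reader() {
    // Acquire pairs with the writer's release: once the flag is seen,
    // the payload write is guaranteed to be visible too.
    while (!published.load(std::memory_order_acquire)) { /* spin */ }
    return payload;
}
```

With plain non-atomic accesses in place of the release/acquire pair, the reader could legally observe the flag set but `payload` still 0, which is the shape of bug 6673883 is tracking.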
11-03-2008

EVALUATION Hmmm... In one of the runs I got the following guarantee failure:

  # To suppress the following error report, specify this argument
  # after -XX: or in .hotspotrc:  SuppressErrorAt=/concurrentMark.cpp:1941
  #
  # An unexpected error has been detected by Java Runtime Environment:
  #
  # Internal Error (/java/east/u2/ap31282/hs-g1-debugging/src/share/vm/gc_implementation/g1/concurrentMark.cpp:1941), pid=29659, tid=20
  # Error: guarantee(satb_mq_set.completed_buffers_num() == 0,"invariant")

This basically says that we might not have processed at least one SATB buffer by the end of the remark phase, which would definitely explain why some objects were not marked. But I wonder whether this guarantee failure is caused by the same issue as the marking failure. If it is, why haven't we seen the failure before? I'll treat them as the same for now; I'll open a new CR if I prove that the two issues are different.
06-03-2008

EVALUATION Here's the latest on the investigation. I print out the contents of the SATB buffers when they are processed. It looks as if some (but not all; 2 out of 12 in one example) of the reachable objects that were not marked do appear in the SATB buffers. So the problem doesn't seem to be a barrier issue. With this evidence, I'd guess there's a race in the marking code.
27-02-2008

EVALUATION I got a marking verification failure with server too. Interestingly, the live objects that point to the unmarked ones were non-arrays in the case of server, whereas they seem to be arrays in the case of client. I'll do a few more runs to see whether this is a recurring pattern.
23-02-2008

EVALUATION One of the scenarios, which I can reproduce on a Niagara box, is the following. Marking verification fails after the third marking cycle for that VM. The (simplified) picture is:

  live object a points to garbage object x
  live object b points to garbage object y
  live object c points to garbage object z
  live object d points to garbage object z

a, b, c, and d all reside on the same region, which is still young (it was allocated before the remark pause and there were no evacuation pauses after that). Objects x, y, and z reside on different regions, which were all allocated as to-space of some previous GC (all such GCs took place before the current marking cycle). The regions containing x, y, and z look 100% used. In this case a, b, c, and d are object arrays, but I have seen cases where they are not.

So this looks like a "classic" concurrent marking bug: either there's a race in the marking code and we don't correctly mark the objects, or we're missing a pre-barrier. It can be reproduced on Tigerton too. I think I ran with -client on both machines, for some reason; I'll try with -server to see what happens. Interestingly, x, y, and z all seem to be biased, as their lock word is 0x5. I don't know if this tells us anything. I'm attaching a (very large) log that contains the failure, as well as a full heap dump afterwards.
22-02-2008