JDK-6666145 : G1: marking verification fails
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 7
  • Priority: P2
  • Status: Closed
  • Resolution: Cannot Reproduce
  • OS: generic
  • CPU: generic
  • Submitted: 2008-02-21
  • Updated: 2013-09-18
  • Resolved: 2010-11-23

Other
  • hs18 : Resolved
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Description
During testing for 6659663 I came across failures that look like this (they were discovered during heap verification before GC):

Field 0xdb97e138 of live obj 0xd35ff238 points to dead obj 0xd35ff234.
Live obj:
[Ljava.util.Date; 
 - klass: 'java/util/Date'[]
 - length: 5
 -   0 : a 'java/util/Date'
 -   1 : a 'java/util/Date'
 -   2 : a 'java/util/Date'
 -   3 : a 'java/util/Date'
 -   4 : a 'java/util/Date'
Bad referent:
java.util.Date 
 - klass: 'java/util/Date'
fields:
   - private transient 'fastTime' 'J' @8 
   - private transient 'cdate' 'Lsun/util/calendar/BaseCalendar$Date;' @16 

These are not the "standard" missing-remset-entry failures. They actually indicate that a live object is pointing to a garbage one. I ran with -XX:+G1VerifyConcMark and got the same failure at the end of marking (marking verification happens at the end of cleanup).

I can reproduce this issue relatively easily, but only with the heavy instrumentation I added to the VM while investigating 6659663. I went back through my error logs for previous instances of this failure and could find only one (whereas it happens very easily with the added instrumentation).

I have reproduced it with jbb2005 on Niagara and Tigerton.
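The check that fires here can be pictured as a walk over the live objects, verifying that every reference field points to a marked object. A minimal toy sketch of that idea (illustrative only, not HotSpot source; the object model and field iteration are heavily simplified):

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Toy object: a mark bit plus a list of reference fields.
struct Obj {
    bool marked;                 // set by (concurrent) marking
    std::vector<Obj*> fields;    // reference fields of this object
};

// Returns false (and prints a note) if a live object references an
// object that marking failed to reach -- the failure mode in this CR.
bool verify_marking(const std::vector<Obj*>& heap) {
    bool ok = true;
    for (Obj* o : heap) {
        if (!o->marked) continue;              // dead objects are not scanned
        for (Obj* ref : o->fields) {
            if (ref != nullptr && !ref->marked) {
                std::printf("Field of live obj %p points to dead obj %p\n",
                            (void*)o, (void*)ref);
                ok = false;
            }
        }
    }
    return ok;
}
```

In the failures above, verification finds exactly this: an object the marking pass considers live holding a pointer to one it considers dead.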

Comments
EVALUATION Additionally, CR 6940310 can definitely explain some of these marking failures, as it can cause a region pop to remove an entry from the stack but return the wrong information, "hiding" a region from the marking threads.
05-04-2010

EVALUATION We should also note that, even though it's not very likely to happen, the problem in CR 6909756 can also cause marking to be incomplete due to the concurrent marking thread accidentally zeroing parts of the marking bitmap (see that CR for more details).
25-03-2010

EVALUATION A little bit more info on the remaining marking bug. It happens very rarely: out of 3,000+ jbb2005 runs on three different machines, I hit it only 3 times, and only on one of the three machines. I was running with 8 warehouses, and the machine which hit the failure did longer jbb2005 runs than the other two (it has 8 CPUs and went into the longer timing run; the others have 16 and 32 CPUs and only did shorter warm-up runs). So, maybe, longer runs will help uncover the issue quicker than shorter ones. All three failures have the same symptoms: a small number of live objects on a region R point to a small number of dead objects on the same region R. Further, R was an old to-space region during the marking cycle (i.e., during a collection between, and excluding, the initial-mark and remark pauses).
22-03-2010

EVALUATION It turns out that after 3,000 runs of jbb2005 I hit a single marking failure. So there are still outstanding marking issues, even though they now seem very infrequent / hard to reproduce. I'll keep this CR open to carry on the investigation, and I opened 6935821 for the issue of SATB queues not being activated properly.
17-03-2010

EVALUATION We investigated the issue for some time and all the indications were that the failure was being caused by a missing write barrier. By getting heap snapshots at the start and end of the marking cycle we could see that:

  start of marking:  a.f == b
  end of marking:    a.f == c
                     x.g == b   (x was allocated after marking started)

x was implicitly live (since it was allocated after the start of marking) and a was explicitly live. It was the assignment a.f = c that should have caused b to be queued up and marked, as objects allocated since the start of marking do not have to be visited by the marking threads (according to the SATB invariants).

We could also reproduce it with -Xint, so John Cuthbertson checked the interpreter code for missing / incorrect write barriers and all looked good. It was John who finally spotted the problem: the SATB queues of newly-created threads were created with their active field set to false, irrespective of whether the thread was created during a marking cycle or not. So, updates were ignored for threads that were created during a marking cycle.

The way I ran jbb2005, with short timing intervals, was a good way to reproduce this bug, given that jbb2005 creates new worker threads for each timing interval. If the marking cycle started during, say, warehouse setup and continued into the subsequent timing interval, the bug would surface. And surface it did. A big well done to John for spotting the issue.
16-03-2010

SUGGESTED FIX Still in progress.
16-03-2010

EVALUATION There seem to be some fences missing when shared fields are accessed by concurrent marking threads. This might be one reason for the marking failures. I've opened 6673883 to track this.
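The general hazard behind missing fences can be illustrated with C++11 atomics (a hedged sketch of the pattern only; HotSpot uses its own ordering primitives, and this is not the actual 6673883 fix): without release/acquire ordering, a marking thread can observe a published flag before the data it guards.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int payload = 0;                   // shared field written by one thread
static std::atomic<bool> published{false};

void writer() {
    payload = 42;                                      // write the data first
    published.store(true, std::memory_order_release);  // then publish; release
                                                       // orders the data before
                                                       // the flag
}

int reader() {
    // Acquire pairs with the writer's release: once the flag is seen,
    // the payload write is guaranteed to be visible too.
    while (!published.load(std::memory_order_acquire)) { /* spin */ }
    return payload;
}
```

With plain non-atomic accesses in place of the release/acquire pair, the reader could legally observe the flag set but `payload` still 0, which is the shape of bug 6673883 is tracking.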
11-03-2008

EVALUATION Hmmm... In one of the runs I got the following guarantee failure:

  # To suppress the following error report, specify this argument
  # after -XX: or in .hotspotrc:  SuppressErrorAt=/concurrentMark.cpp:1941
  #
  # An unexpected error has been detected by Java Runtime Environment:
  #
  # Internal Error (/java/east/u2/ap31282/hs-g1-debugging/src/share/vm/gc_implementation/g1/concurrentMark.cpp:1941), pid=29659, tid=20
  # Error: guarantee(satb_mq_set.completed_buffers_num() == 0,"invariant")

This basically says that we might not have processed at least one SATB buffer by the end of the remark phase, which would definitely explain why some objects were not marked. But I wonder whether this guarantee failure is caused by the same issue as the marking failure. If it is, why haven't we seen the failure before? I'll treat them as the same for now; I'll open a new CR if I prove that the two issues are different.
06-03-2008

EVALUATION Here's the latest on the investigation. I print out the contents of the SATB buffers when they are processed. It looks as if some (but not all; 2 out of 12 in one example) of the reachable objects that were not marked do appear in the SATB buffers. So the problem doesn't seem to be a barrier issue. With this evidence, I'd guess there's a race in the marking code.
27-02-2008

EVALUATION I got a marking verification failure with server too. Interestingly, the live objects that point to the unmarked ones were non-arrays in the case of server, whereas they seem to be arrays in the case of client. I'll do a few more runs to see whether this is a recurring pattern.
23-02-2008

EVALUATION One of the scenarios, which I can reproduce on a Niagara box, is the following. Marking verification fails after the third marking cycle for that VM. The (simplified) picture is:

  live object a points to garbage object x
  live object b points to garbage object y
  live object c points to garbage object z
  live object d points to garbage object z

a, b, c, and d all reside on the same region, which is still young (it was allocated before the remark pause and there were no evacuation pauses after that). Objects x, y, and z reside on different regions, which were all allocated as to-space of some previous GC (all such GCs took place before the current marking cycle). The regions containing x, y, and z look 100% used. In this case a, b, c, and d are object arrays, but I have seen cases where they are not.

So this looks like a "classic" concurrent marking bug: either there's a race in the marking code and we don't correctly mark the objects, or we're missing a pre-barrier. It can be reproduced on Tigerton too. I think I ran with -client on both machines, for some reason; I'll try with -server to see what happens. Interestingly, x, y, and z all seem to be biased, as their lock word is 0x5. I don't know if this tells us anything. I'm attaching a (very large) log that contains the failure, as well as a full heap dump afterwards.
22-02-2008