Bug ID: JDK-7033292 G1: nightly failure: Non-dirty cards in region that should be dirty

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: hs21

Priority: P3
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2011-04-01
Updated: 2013-09-18
Resolved: 2011-04-25

JDK 7: 7 (Fixed)
Other: hs21 (Fixed)

We see nightly failures that look like this:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/tmp/jprt/P1/B/143634.ap31282/source/src/share/vm/memory/cardTableModRefBS.cpp:724), pid=3488, tid=10
#  guarantee(blk.result()) failed: Non-dirty cards in region that should be dirty
#
# JRE version: 7.0-b135
# Java VM: Java HotSpot(TM) Server VM (21.0-b05-internal-201103301436.ap31282.hotspot-g1-push-fastdebug mixed mode solaris-x86 )

This failure started appearing after I pushed a changeset that includes the fixes for 7023069, 7023151, and 7018286. The failing tests are:
gc/gctests/SoftReference/soft004
gc/gctests/WeakReference/weak004
gc/gctests/WeakReference/weak006
gc/memory/UniThread/Circular3
gc/memory/UniThread/Linear3
gc/vector/FloatArrayHigh
gc/vector/FloatArrayLow
gc/vector/ObjectArrayHigh
nsk/regression/b4493566

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/c84ee870e0b9
04-04-2011
EVALUATION The failure turned out to be caused by a small bug in the card cache. All the epochs in the card cache array are initialized to 0, and the current epoch variable (_n_periods) is also initialized to 0. So, until the first GC happens (when the current epoch is incremented), all default card cache entries look valid. When a thread "claimed" one of them, it materialized the claimed card which, by default, ended up being card 0, i.e., the one corresponding to the bottom of the heap. The fix is straightforward: initialize _n_periods to 1.
There are a few reasons why this was not uncovered before:
a) We were not checking that all the young cards were dirtied properly at the start of each GC (we do now, which is how we caught it).
b) The refinement code is very robust with respect to processing cards in regions it does not expect (in this case, young regions), given that it might have to deal with out-of-date information.
c) The bug only happens if we try to add a new card to the cache (since doing so tries to evict the existing one). Given that we do not process cards on young regions, any application that only allocates young regions until the first GC will never hit it. The only way currently to generate cards for processing before the first GC is to allocate humongous regions before the first GC (as was the case for the failing test).
Many thanks to John Cuthbertson for his help with this.
04-04-2011
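The epoch problem described in the evaluation above can be illustrated with a small model. This is only a sketch under stated assumptions: apart from _n_periods, every name here (CacheEntry, CardCache, insert, start_new_period) is hypothetical and does not reflect the actual HotSpot card cache code. It shows how default entries look valid when the current epoch starts at 0, and how starting at 1 avoids evicting a bogus card 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of an epoch-tagged card cache.
// Only the name _n_periods comes from the bug report.
struct CacheEntry {
  std::size_t card = 0;  // default card index: 0, i.e. the bottom of the heap
  unsigned epoch = 0;    // default epoch: 0
};

class CardCache {
 public:
  CardCache(std::size_t size, unsigned initial_epoch)
      : _entries(size), _n_periods(initial_epoch) {}

  // An entry is treated as valid when its epoch matches the current epoch.
  bool entry_is_valid(std::size_t slot) const {
    return _entries[slot].epoch == _n_periods;
  }

  // Stores new_card in the slot. If the previous entry looked valid, it is
  // evicted: its card is written to *evicted_card and true is returned.
  bool insert(std::size_t slot, std::size_t new_card,
              std::size_t* evicted_card) {
    bool evicted = entry_is_valid(slot);
    if (evicted) {
      *evicted_card = _entries[slot].card;
    }
    _entries[slot].card = new_card;
    _entries[slot].epoch = _n_periods;
    return evicted;
  }

  // In this model the current epoch is advanced at each GC.
  void start_new_period() { ++_n_periods; }

 private:
  std::vector<CacheEntry> _entries;
  unsigned _n_periods;
};
```

With an initial epoch of 0, the first insert into any slot "evicts" card 0 even though nothing was ever cached there. With an initial epoch of 1, the untouched entries fail the validity check and nothing is evicted.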
EVALUATION dbx watchpoints to the rescue! The problem happens because the card seems to be concurrently refined and cleaned as part of the concurrent refinement process. This is of course strange given that we should be filtering out cards on young regions and should not be refining them, even if they are somehow enqueued on an update buffer. I'll try to work out how this can happen. Having said that, given this diagnosis, the issue does not seem to be a serious correctness issue. In the worst case, we'll just refine a few cards we don't need to, but I don't foresee any failures apart from the failing assert. I would have reduced its priority to P4. The only reason I'll leave it as a P3 is that it's failing in the nightlies and it will be good to clean those failures up.
01-04-2011
EVALUATION It's hard to immediately judge how severe this issue is. There are two ways to see it:
1) The reason we dirty cards on young regions is to avoid taking the slow path of the post-write barrier, since we never need to process updates on them. So, if that card is clean, in the worst case we'll dirty it again and do a tiny amount of extra work.
BUT:
2) Even though the card becoming clean in this particular scenario is not a big deal, the issue could be a big deal if the region were not an eden region. A card becoming clean might hide an update we need to process, which could cause a very subtle and rare bug in the future.
So, we really need to work out why the card becomes clean and take it from there.
01-04-2011
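Point 1) above can be sketched with a toy card table. This is a simplified illustration, not the G1 barrier itself: the names CardTable, dirty_range, post_write_barrier and update_buffer, as well as the card values, are assumptions made for the example. It shows that a pre-dirtied card keeps the barrier on the fast path, while a card that has unexpectedly become clean is simply dirtied again and enqueued.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative card values (assumed for this sketch).
constexpr std::uint8_t kDirty = 0;
constexpr std::uint8_t kClean = 0xFF;

// Hypothetical toy card table with an update buffer for refinement.
struct CardTable {
  std::vector<std::uint8_t> cards;
  std::vector<std::size_t> update_buffer;  // cards enqueued for refinement

  explicit CardTable(std::size_t n) : cards(n, kClean) {}

  // Pre-dirty every card in [first, last], as is done for young regions.
  void dirty_range(std::size_t first, std::size_t last) {
    for (std::size_t i = first; i <= last; ++i) {
      cards[i] = kDirty;
    }
  }

  // Returns true if the slow path was taken (card dirtied and enqueued).
  bool post_write_barrier(std::size_t card) {
    if (cards[card] == kDirty) {
      return false;  // fast path: nothing to do
    }
    cards[card] = kDirty;
    update_buffer.push_back(card);
    return true;
  }
};
```

In this model, a write to a pre-dirtied young card costs one comparison. If the card has been cleaned behind our back, the only consequence is one extra enqueue, which matches the "tiny amount of extra work" noted above.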
EVALUATION I can reproduce this failure very easily on my workstation with a workspace (call it WS) that includes the changeset mentioned in the Description. If I back out that changeset (call that workspace WS'), the failure does not happen. BUT:
a) The changeset does include extra verification code which goes over the young regions before a collection and confirms that all their cards have been dirtied correctly. This verification code was not there before. I added some extra instrumentation and it looks as if verification finds that the first card of the first eden region that's allocated is clean at the beginning of the first GC.
b) More instrumentation shows that i) we do call the method that does the dirtying (dirty_young_block()) for an allocated block that includes that card and ii) all eden regions (including the "offending" one) seem to have their cards correctly dirtied at the point when they are retired. So it looks as if the "offending" card becomes clean between the time the containing region is retired and the start of the first GC.
c) If I add the same verification code to workspace WS', I get exactly the same failure.
So, it looks as if the failure is not caused by the last changeset; it was most likely uncovered by the additional verification code that was added with that changeset.
01-04-2011
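The kind of check described in a) above can be sketched as a scan over a card range. This is a minimal illustration under assumptions: the function name find_non_dirty_card, the flat std::vector card table, and the card values are all hypothetical and are not the verification code from the changeset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative card values (assumed for this sketch).
constexpr std::uint8_t kDirtyCard = 0;
constexpr std::uint8_t kCleanCard = 0xFF;

// Scans cards in [first, last]. Returns true and stores the index of the
// first non-dirty card in *bad_card if one is found; returns false if
// every card in the range is dirty.
bool find_non_dirty_card(const std::vector<std::uint8_t>& cards,
                         std::size_t first, std::size_t last,
                         std::size_t* bad_card) {
  for (std::size_t i = first; i <= last && i < cards.size(); ++i) {
    if (cards[i] != kDirtyCard) {
      *bad_card = i;
      return true;
    }
  }
  return false;
}
```

Running such a scan over each young region before a collection would flag the situation reported here, where the first card of the first eden region is found clean.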

Relates :	JDK-7035144 - G1: nightly failure: Non-dirty cards in region that should be dirty (failures still exist...)
Relates :	JDK-7023151 - G1: refactor the code that operates on _cur_alloc_region to be re-used for allocs by the GC threads
Relates :	JDK-7023069 - G1: Introduce symmetric locking in the slow allocation path
Relates :	JDK-7018286 - G1: humongous allocation attempts should take the GC locker into account