JDK-7035144 : G1: nightly failure: Non-dirty cards in region that should be dirty (failures still exist...)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs21
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2011-04-08
  • Updated: 2013-09-18
  • Resolved: 2011-05-16
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 7 Other
7Fixed hs21Fixed
Related Reports
Relates :  
Description
We've seen one more failure like this:

#  Internal Error (C:\temp\jprt\P1\B\195806.jc234399\source\src\share\vm\memory\cardTableModRefBS.cpp:724), pid=16232, tid=3564
#  guarantee(blk.result()) failed: Non-dirty cards in region that should be dirty

in nightly testing, even the push of the fix that was supposed to eliminate such failures (see CR 7033292: G1: nightly failure: Non-dirty cards in region that should be dirty).

Comments
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/063382f9b575
30-04-2011

SUGGESTED FIX The (expanded) fix is described in detail in Evaluation Note #2. The summary is: 1) During concurrent refinement only clean cards that we commit to scan. 2) Regions that are reclaimed during cleanup should have their cards cleaned. 3) During STW RSet scanning only claim cards that we commit to scan. 4) In addition, I revamped the "verify CT region is clean / dirty" to print out detailed information when this verification fails.
20-04-2011

EVALUATION It turned out that the simple fix described on Note #1 in Evaluation actually uncovered more issues, which in turn uncovered more issues, etc. So the final fix ended up being more involved than I had originally thought. Here's a quick summary, to have everything in one place: 1) The reason the failure happened is as described already: during concurrent refinement we'd clean the card first and then decide whether to scan it or not. This way, if we ended up with a stale card on a young region we'd always clean it. This is not a correctness issue but it is a (minor) performance issue (the card might get redirtied by mutator threads, maybe more than once) and it does break the "all young cards are dirty" invariant which is good to maintain. 2) Unfortunately, the simple fix of only cleaning the card if we commit to scanning it introduced new failures (missing RSet-type failures). The reason here is very surprising. It turns out that, when we reclaimed regions during cleanup, we would not clean their cards (this was not an issue for CSet regions reclaimed during GC pauses, just for regions reclaimed during cleanup). So we relied on concurrent refinement cleaning any cards that happened to be dirty on those regions (!!!), even if those cards were over the region's top and we didn't have to scan them. So, the fix that stopped cleaning those cards actually left dirty cards on some regions, which caused the write barrier to ignore and writes on those cards, which ultimately caused RSets not to be updated appropriately. Ouch. So, in addition to only cleaning cards after we commit to scanning them, we also have to make sure that regions reclaimed during cleanup also have their cards cleaned. 3) The above eliminated regions having dirty cards when they were allocated (the reclamation mechanisms ensured that). However, I was still coming across regions that had claimed (!!!) cards when they were allocated. I don't think this really is a correctness issue (the write barrier checks for dirty, not non-clean) but it was perplexing given that we only claim regions during GCs and we were definitely cleaning the card table of all CSet regions at the end of a collection. And maybe it might have caused issues later so I wanted to eliminate this issue too. The reason behind this ended up being very subtle. Consider old region A having references to old region B. During partially-young GC #N we add A, but not B, to the CSet and at the end of the GC we reclaim A (and clean its card table). At partially-young GC #N+1 we add B to the CSet and scan its RSet which has (stale) entries to region A. When we scan those entries we, again, claim the cards before deciding whether to scan them or not (i.e., it's a similar issue to 1). This is how A was ending up with claimed cards. The solution is again similar to 1: only claim a card when we commit to scanning it.
20-04-2011

EVALUATION This bug does indeed have a different cause to 7033292 and is also minor (i.e., it should not affect the product apart from causing a few more cards to be scanned whose overhead will be next to nothing). The bug is in the following code in G1RemSet::concurrentRefineOneCard_impl(): ... // Undirty the card. *card_ptr = CardTableModRefBS::clean_card_val(); // We must complete this write before we do any of the reads below. OrderAccess::storeload(); // And process it, being careful of unallocated portions of TLAB's. ... bool filter_young = true; HeapWord* stop_point = r->oops_on_card_seq_iterate_careful(dirtyRegion, &filter_then_update_rs_oop_cl, filter_young); oops_on_card_seq_iterate_careful() does indeed bail out of the iteration if the card is on a young region. But, by that time we have already undirtied it.
08-04-2011

SUGGESTED FIX We should undirty cards when we make the decision that the card is not on a young region i.e., in oops_on_card_seq_iterate_careful(), not before.
08-04-2011