This bug does indeed have a different cause from 7033292, and it is also minor: it should not affect the product apart from causing a few more cards to be scanned, at next to no overhead.
The bug is in the following code in G1RemSet::concurrentRefineOneCard_impl():
// Undirty the card.
*card_ptr = CardTableModRefBS::clean_card_val();
// We must complete this write before we do any of the reads below.
// And process it, being careful of unallocated portions of TLAB's.
bool filter_young = true;
HeapWord* stop_point =
oops_on_card_seq_iterate_careful() does indeed bail out of the iteration if the card is on a young region, but by that time we have already undirtied the card.
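The ordering problem can be illustrated with a small single-threaded toy model. This is only a sketch, not HotSpot code: ToyCardTable, make_table(), refine_clean_first() and refine_decide_first() are hypothetical names, the young-region check is reduced to a boolean per card, and the memory-ordering requirement mentioned in the comments above is ignored entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every name here is hypothetical, none of it is HotSpot code.
enum class Card : std::uint8_t { Clean, Dirty };

struct ToyCardTable {
    std::vector<Card> cards;            // one entry per card
    std::vector<bool> in_young_region;  // true if the card lies in a young region
    int scanned;                        // number of cards actually scanned
};

// Card 0 is a stale dirty card on a young region, card 1 is a dirty card on an old region.
ToyCardTable make_table() {
    ToyCardTable t;
    t.cards = {Card::Dirty, Card::Dirty};
    t.in_young_region = {true, false};
    t.scanned = 0;
    return t;
}

// Ordering described above: undirty first, then decide whether to scan.
void refine_clean_first(ToyCardTable& t, std::size_t i) {
    t.cards[i] = Card::Clean;           // cleaned unconditionally
    if (t.in_young_region[i]) return;   // bail out, but the card is already clean
    ++t.scanned;
}

// Alternative ordering: decide first, clean only when committing to the scan.
void refine_decide_first(ToyCardTable& t, std::size_t i) {
    if (t.in_young_region[i]) return;   // young card stays dirty
    t.cards[i] = Card::Clean;
    ++t.scanned;
}
```

In this model both orderings scan the same single old card, but only refine_decide_first() leaves the young card dirty.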
It turned out that the simple fix described in Note #1 of the Evaluation actually uncovered more issues, which in turn uncovered more issues, etc. So the final fix ended up being more involved than I had originally thought.
Here's a quick summary, to have everything in one place:
1) The failure happened for the reason already described: during concurrent refinement we'd clean the card first and then decide whether to scan it or not. This way, if we ended up with a stale card on a young region we'd always clean it. This is not a correctness issue but it is a (minor) performance issue (the card might get redirtied by mutator threads, maybe more than once) and it does break the "all young cards are dirty" invariant, which is good to maintain.
2) Unfortunately, the simple fix of only cleaning the card if we commit to scanning it introduced new failures (missing-RSet-type failures). The reason here is very surprising. It turns out that, when we reclaimed regions during cleanup, we would not clean their cards (this was not an issue for CSet regions reclaimed during GC pauses, just for regions reclaimed during cleanup). So we relied on concurrent refinement cleaning any cards that happened to be dirty on those regions (!!!), even if those cards were above the region's top and we didn't have to scan them. So, the fix that stopped cleaning those cards actually left dirty cards on some regions, which caused the write barrier to ignore any writes on those cards, which ultimately caused RSets not to be updated appropriately. Ouch.
So, in addition to only cleaning cards after we commit to scanning them, we also have to make sure that regions reclaimed during cleanup also have their cards cleaned.
3) The above eliminated regions having dirty cards when they were allocated (the reclamation mechanisms ensured that). However, I was still coming across regions that had claimed (!!!) cards when they were allocated. I don't think this really is a correctness issue (the write barrier checks for dirty, not non-clean) but it was perplexing given that we only claim cards during GCs and we were definitely cleaning the card table of all CSet regions at the end of a collection. It might also have caused issues later, so I wanted to eliminate it too.
The reason behind this ended up being very subtle. Consider old region A having references to old region B. During partially-young GC #N we add A, but not B, to the CSet and at the end of the GC we reclaim A (and clean its card table). At partially-young GC #N+1 we add B to the CSet and scan its RSet, which has (stale) entries pointing into region A. When we scan those entries we, again, claim the cards before deciding whether to scan them or not (i.e., it's a similar issue to 1). This is how A was ending up with claimed cards. The solution is again similar to 1: only claim a card when we commit to scanning it.
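The stale-entry scenario in 3) can be sketched with the same kind of toy model. Again this is an illustration under simplifying assumptions, not the HotSpot implementation: ToyRegion, ToyRSetEntry, make_regions(), make_rset_of_b() and the two scan functions are hypothetical names, and "deciding whether to scan" is reduced to checking a reclaimed flag on the region the entry points into.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only: every name here is hypothetical, none of it is HotSpot code.
enum class CardState : std::uint8_t { Clean, Dirty, Claimed };

struct ToyRegion {
    bool reclaimed;                // true once the region was freed by an earlier GC
    std::vector<CardState> cards;  // the region's slice of the card table
};

// One remembered-set entry: card `card` of region `from` may hold a reference into the owner.
struct ToyRSetEntry {
    std::size_t from;
    std::size_t card;
};

// Region 0 plays the role of A (reclaimed at GC #N, cards cleaned); region 1 is still live.
std::vector<ToyRegion> make_regions() {
    std::vector<ToyRegion> regions(2);
    regions[0].reclaimed = true;
    regions[0].cards = {CardState::Clean, CardState::Clean};
    regions[1].reclaimed = false;
    regions[1].cards = {CardState::Clean, CardState::Clean};
    return regions;
}

// RSet of B: one stale entry pointing into A and one valid entry.
std::vector<ToyRSetEntry> make_rset_of_b() {
    return {ToyRSetEntry{0, 1}, ToyRSetEntry{1, 0}};
}

// Ordering described above: claim the card, then decide whether to scan it.
int scan_rset_claim_first(std::vector<ToyRegion>& regions,
                          const std::vector<ToyRSetEntry>& rset) {
    int scanned = 0;
    for (const ToyRSetEntry& e : rset) {
        ToyRegion& r = regions[e.from];
        r.cards[e.card] = CardState::Claimed;  // claimed before the decision
        if (r.reclaimed) continue;             // stale entry skipped, claim mark left behind
        ++scanned;
    }
    return scanned;
}

// Alternative ordering: claim a card only when committing to scan it.
int scan_rset_decide_first(std::vector<ToyRegion>& regions,
                           const std::vector<ToyRSetEntry>& rset) {
    int scanned = 0;
    for (const ToyRSetEntry& e : rset) {
        ToyRegion& r = regions[e.from];
        if (r.reclaimed) continue;             // stale entry: card is not touched
        r.cards[e.card] = CardState::Claimed;
        ++scanned;
    }
    return scanned;
}

// True if any card of the region is still marked as claimed.
bool has_claimed_cards(const ToyRegion& r) {
    for (CardState c : r.cards) {
        if (c == CardState::Claimed) return true;
    }
    return false;
}
```

With the claim-first ordering the reclaimed region ends up with a claimed card even though nothing on it was scanned; with the decide-first ordering it stays fully clean, and the same single valid entry is scanned in both cases.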