JDK-8058209 : Race in G1 card scanning could allow scanning of memory covered by PLABs
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u40
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2014-09-11
  • Updated: 2015-06-03
  • Resolved: 2014-11-18
  • Fix Versions: JDK 8 — 8u40 b16 (Fixed); JDK 9 — 9 (Fixed)
Description
G1 GC crash:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd519bc2f65, pid=10969, tid=140552886540032
#
# JRE version: Java(TM) SE Runtime Environment (8.0_40-b03) (build 1.8.0_40-ea-b03)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.40-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x605f65]  HeapRegionDCTOC::walk_mem_region(MemRegion, HeapWord*, HeapWord*)+0x135


Stack: [0x00007fd504cfd000,0x00007fd504dfe000],  sp=0x00007fd504df7d90,  free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x605f65]  HeapRegionDCTOC::walk_mem_region(MemRegion, HeapWord*, HeapWord*)+0x135
V  [libjvm.so+0x9addc3]  DirtyCardToOopClosure::do_MemRegion(MemRegion)+0x113
V  [libjvm.so+0x5aaec0]  ScanRSClosure::doHeapRegion(HeapRegion*)+0x330
V  [libjvm.so+0x57ea10]  G1CollectedHeap::collection_set_iterate_from(HeapRegion*, HeapRegionClosure*)+0x60
V  [libjvm.so+0x5a9dbb]  G1RemSet::scanRS(OopsInHeapRegionClosure*, CodeBlobToOopClosure*, unsigned int)+0xcb
V  [libjvm.so+0x5aa1fd]  G1RemSet::oops_into_collection_set_do(OopsInHeapRegionClosure*, CodeBlobToOopClosure*, unsigned int)+0x13d
V  [libjvm.so+0x587fb7]  G1CollectedHeap::g1_process_roots(OopClosure*, OopClosure*, OopsInHeapRegionClosure*, CLDClosure*, CLDClosure*, CodeBlobClosure*, unsigned int)+0x3b7
V  [libjvm.so+0x593149]  G1ParTask::work(unsigned int)+0x429
V  [libjvm.so+0xaa393f]  GangWorker::loop()+0xcf
V  [libjvm.so+0x8f3638]  java_start(Thread*)+0x108


Comments
noreg-hard justification: This is an extremely rare race deep inside the G1 GC code; reliably triggering it is close to impossible.
28-11-2014

Reading the region timestamp as being lower than the global timestamp suggests that the card scanning thread, T2, read the timestamp before it could see the store of the allocating thread, T1, in record_top_and_timestamp. Since T1's setting of the timestamp happens before any allocations (which set top), it must be the case that T2 can see the higher top value after having read the old value of the timestamp. The write of the timestamp is visible to other threads, but we cannot in any way guarantee that the timestamp will not be incremented to match the current GC time stamp the instant after we read it.

The way to resolve this race is to first load the top field and then the timestamp, ensuring that the top field we read was valid at the point we read the timestamp. This works on x86 because it has TSO and guarantees StoreStore and LoadLoad ordering. The card scanning thread should never be allowed to see top values updated during the current garbage collection. The root cause is that the top value can be changed after the timestamp has been observed to be less than the current GC's timestamp; the way to solve that race is to read the top value before checking the timestamp.

T1 performs three writes which must be stored in this order:

  w(_saved_mark_word)
  w(_gc_time_stamp)
  w(_top)

T2, which reads all these values, must read them in the reverse order to be sure that it has observed a valid state combination:

  r(_top)
  r(_gc_time_stamp)
  r(_saved_mark_word)

Seeing the latest top value guarantees that our reads of the earlier written fields are correct.
10-11-2014
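
To make the ordering described in the comment above concrete, here is a minimal, self-contained sketch of the same pattern using standard C++11 atomics. The names (RegionSketch, *_sketch) are illustrative only, and the standard fences merely approximate HotSpot's OrderAccess::storestore()/loadload():

  #include <atomic>
  #include <cstdint>

  struct RegionSketch {
    std::atomic<std::uintptr_t> saved_mark_word{0};
    std::atomic<unsigned>       gc_time_stamp{0};
    std::atomic<std::uintptr_t> top{0};
  };

  // T1 (allocating thread): the writes must become visible in this order.
  void record_top_and_timestamp_sketch(RegionSketch& r, unsigned curr_gc_ts,
                                       std::uintptr_t new_top) {
    r.saved_mark_word.store(r.top.load(std::memory_order_relaxed),
                            std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // ~ OrderAccess::storestore()
    r.gc_time_stamp.store(curr_gc_ts, std::memory_order_relaxed);
    r.top.store(new_top, std::memory_order_relaxed);       // allocation raises top
  }

  // T2 (card scanning thread): read in the reverse order, top first, so that the
  // top value returned was already valid when the timestamp was read.
  std::uintptr_t saved_mark_word_sketch(const RegionSketch& r, unsigned global_gc_ts) {
    std::uintptr_t local_top = r.top.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);   // ~ OrderAccess::loadload()
    if (r.gc_time_stamp.load(std::memory_order_relaxed) < global_gc_ts) {
      return local_top;                                    // region untouched in this GC
    }
    return r.saved_mark_word.load(std::memory_order_relaxed);
  }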

It appears that my theory from the previous comment is at least plausible. By inserting a short sleep between the read of _gc_time_stamp and _top (for some free regions):

  if (static_cast<const HeapRegion*>(this)->is_free() && os::random() % 1000 == 0)
    os::naked_short_sleep(5);

I was able to reproduce a crash similar to the ones seen by the Coherence team. I suspect that the cause of the bug is that the read of _top is done after the read of _gc_time_stamp; by reading _top first and adding a loadload() barrier I have so far been unable to reproduce the crash.

When taking a GC alloc region several stores are performed where store ordering needs to be enforced:

  ST(_saved_mark_word)
  #StoreStore
  ST(_gc_time_stamp)
  ST(_top)          // satisfying alloc request
  ST(_alloc_region) // publishing to other gc workers
  ST(_top)          // potential further allocations

When we inspect a region during remembered set scanning we need to ensure that these values are read in the reverse order, with load ordering enforced, to be sure that we see a consistent view. The current code does:

  LD(_gc_time_stamp)
  LD(_top)

I would argue that this is incorrect, since the value of _top could have been changed after we read the time stamp to determine that we should return _top instead of _saved_mark_word. My suggested fix is to load _top first:

  HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
    G1CollectedHeap* g1h = G1CollectedHeap::heap();
    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
    HeapWord* local_top = top();
    OrderAccess::loadload();
    if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
      return local_top;
    } else {
      return Space::saved_mark_word();
    }
  }
05-11-2014

The most recent core file shows a similar crash: a worker thread is scanning cards in a region which is used as an old GC alloc region, and sees incomplete objects. The worker thread should not be looking at this region whatsoever; the used_region_at_save_marks call should return an empty MemRegion for all old GC alloc regions allocated during this GC.

We have a theory about where the race in the code is. When a GC worker calls used_region_at_save_marks to get the sm_region it ends up in G1OffsetTableContigSpace::saved_mark_word:
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/jdk9-b37/src/share/vm/gc_implementation/g1/heapRegion.cpp#l1000

  HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
    G1CollectedHeap* g1h = G1CollectedHeap::heap();
    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
    if (_gc_time_stamp < g1h->get_gc_time_stamp())
      return top();
    else
      return Space::saved_mark_word();
  }

If thread T1 reads the gc_time_stamp and it is not current, then it assumes that all objects up to _top are safe to look at. If another thread, T2, allocates this region from the free list after T1 has read _gc_time_stamp but before T1 has read _top, and T2 has managed to modify _top to point to _end before T1 reads _top, then T1 could see the region as a candidate for scanning even though that is incorrect.

I need to do some further testing and thinking on this, and I'll provide a new build for Lina to try later this week.
04-11-2014
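
For illustration, a simplified timeline of the suspected interleaving, using the thread naming from the comment above (T1 = card scanning worker, T2 = allocating GC thread):

  T1 (card scanning worker)                 T2 (GC allocating thread)
  -------------------------                 -------------------------
  reads _gc_time_stamp (stale, < current)
                                            takes the region from the free list
                                            record_top_and_timestamp():
                                              _saved_mark_word = top()
                                              _gc_time_stamp   = current
                                            allocates, moving _top towards _end
  reads _top (already raised by T2)
  => used_region_at_save_marks() returns [_bottom, _top) and the region is
     scanned while T2 is still filling it with objects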

I have verified that the new crash files indeed show the same problematic condition as previously seen.

There is an option to get separate GC logs for G1PrintHeapRegions: -Xloggc accepts a %p which expands to the pid of the running JVM, so if you could ask Lina to run with -Xloggc:gclog.%p that would hopefully get us untangled GC log output.

The problem still seems to be that calling used_region_at_save_marks returns a MemRegion which spans the entire HeapRegion, and when scanning a card inside that region we crash on a broken object header. The object header looks correct in the core file, but may have been concurrently written to by another thread. When I inspect the HeapRegion instance it looks like used_region_at_save_marks should return an empty MemRegion, since the region's _gc_time_stamp is equal to the global _gc_time_stamp and _saved_mark_word is equal to _bottom. For some reason sm_region is (_bottom, _end).

There are two possibilities for this to occur:
- Either the local _gc_time_stamp was read as below the global one, and _top was equal to _end. In that case there should not be any threads allocating into the region and no races should occur.
- Or the local _gc_time_stamp was not below the global one, but the read of _saved_mark_word returned _end instead of what should have been set at the point of calling record_top_and_timestamp.

For now I suggest that we request runs with -Xloggc:gclog.%p and G1PrintHeapRegions enabled to figure out whether this region was the retained alloc region or not.
24-10-2014
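
For reference, the suggested run configuration would look roughly like the following (an illustrative sketch only; application-specific options are omitted, and if G1PrintHeapRegions is a diagnostic flag in the build being used, -XX:+UnlockDiagnosticVMOptions would be needed as well):

  java -XX:+UseG1GC -XX:+G1PrintHeapRegions -Xloggc:gclog.%p ...

Each JVM process then writes to its own gclog.<pid> file, so concurrently started JVMs no longer interleave their GC output in a single log.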

Additionally, the java stack trace for the pid944 crash has the following line:

  j  com.oracle.common.internal.net.ipclw.mql.Unsafe.abort()V+0

They are calling their "Unsafe.abort()" function and the JVM crashes in some strange and spectacular way. I'm not surprised.
23-10-2014

No, OrderAccess::storestore() does not do any kind of flush on x86; it just forces the C++ compiler not to move memory operations around. The original failure looked like it was caused by gcc moving the write of _gc_time_stamp to before _saved_mark_word was written; I verified this by disassembling the record_top_and_timestamp function. In my fixed build this is resolved by editing the code of OrderAccess::release() (which is called by storestore() on x86).

Lina suggested in the bugdb bug that they are still seeing the original crash. I'd like to look at a core file which crashed in the same way, in HeapRegionDCTOC::walk_mem_region around the scanCard code.

The crash creating hs_err_pid944.log and core.944 is completely unrelated. It is caused by some thread sending a SIGBUS signal through the tgkill(2) syscall, which is most likely due to a bug in _their_ code. Perhaps we should ask Coherence about their usage of tkill/tgkill in their InfiniBand interface code.
23-10-2014

My fix/workaround is:

diff -r 38539608359a src/os_cpu/linux_x86/vm/orderAccess_linux_x86.inline.hpp
--- a/src/os_cpu/linux_x86/vm/orderAccess_linux_x86.inline.hpp
+++ b/src/os_cpu/linux_x86/vm/orderAccess_linux_x86.inline.hpp
@@ -49,7 +49,12 @@
 inline void OrderAccess::release() {
   // Avoid hitting the same cache-line from
   // different threads.
-  volatile jint local_dummy = 0;
+  volatile intptr_t local_dummy;
+#ifdef AMD64
+  __asm__ volatile ("movq $0, %0" : "=m" (local_dummy) : : "memory");
+#else
+  __asm__ volatile ("movl $0, %0" : "=m" (local_dummy) : : "memory");
+#endif // AMD64
 }
22-10-2014
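
For context on why the change above helps: the "memory" clobber on the inline asm acts as a compiler barrier for gcc, so the compiler may not reorder or cache memory accesses across OrderAccess::release(), which is the compiler-level StoreStore guarantee that record_top_and_timestamp relies on. No hardware fence is needed on x86, since TSO already keeps stores in program order; the original dummy volatile store was evidently not enough to stop gcc from moving the _gc_time_stamp store ahead of the _saved_mark_word store.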

Encountered one more crash, and this time with -XX:+G1PrintHeapRegions. The heap region being scanned, which is also being allocated into by another thread, is:

(gdb) print (*(HeapRegion*)0x00007f2ab40fad00)
$5 = {<G1OffsetTableContigSpace> = {<CompactibleSpace> = {<Space> = {<CHeapObj<mtGC>> = {<No data fields>},
        _vptr.Space = 0x7f2acbfccdf0 <vtable for HeapRegion+16>, _bottom = 0xc6900000, _end = 0xc6a00000,
        _saved_mark_word = 0xc6900000, _preconsumptionDirtyCardClosure = 0x0,
        _par_seq_tasks = {<StackObj> = {<No data fields>}, _n_tasks = 0, _n_claimed = 0, _n_threads = 0, _n_completed = 0}},
      _compaction_top = 0xc6900000, _next_compaction_space = 0x0, _first_dead = 0x0, _end_of_live = 0x0},
    _top = 0xc6a00000,
    _offsets = {<G1BlockOffsetArray> = {<G1BlockOffsetTable> = {
          _vptr.G1BlockOffsetTable = 0x7f2acbfc91d0 <vtable for G1BlockOffsetArrayContigSpace+16>,
          _bottom = 0xc6900000, _end = 0xc6a00000}, _array = 0x7f2ab40603c0, _gsp = 0x7f2ab40fad00,
        _init_to_zero = true, _unallocated_block = 0xc6900000},
      _next_offset_threshold = 0xc6a00000, _next_offset_index = 217088},
    _par_alloc_lock = {<Monitor> = {<CHeapObj<mtInternal>> = {<No data fields>},
        _LockWord = {FullWord = 0, Address = 0x0, Bytes = "\000\000\000\000\000\000\000"}, _owner = 0x0,
        _EntryList = 0x0, _OnDeck = 0x0, _WaitLock = {0}, _WaitSet = 0x0, _snuck = false, NotifyCount = 0,
        _name = "OffsetTableContigSpace par alloc lock", '\000' <repeats 26 times>}, <No data fields>},
    _gc_time_stamp = 7, _pre_dummy_top = 0xc69ffff0},
  _rem_set = 0x7f2ab40faef0, _hrm_index = 105, _allocation_context = 0 '\000',
  _type = {_tag = HeapRegionType::OldTag}, _humongous_start_region = 0x0, _orig_end = 0xc6a00000,
  _in_collection_set = false, _evacuation_failed = false, _next_in_special_set = 0x0, _next_young_region = 0x0,
  _next_dirty_cards_region = 0x7f2ab433d300, _next = 0x0, _prev = 0x0, _claimed = 0,
  _prev_marked_bytes = 0, _next_marked_bytes = 0, _gc_efficiency = 147451.17665432399,
  _young_index_in_cset = -1, _surv_rate_group = 0x0, _age_index = -1,
  _prev_top_at_mark_start = 0xc6900000, _next_top_at_mark_start = 0xc6900000,
  _recorded_rs_length = 0, _predicted_elapsed_time_ms = 0, _predicted_bytes_to_copy = 0,
  static LogOfHRGrainBytes = 20, static LogOfHRGrainWords = 17, static GrainBytes = 1048576,
  static GrainWords = 131072, static CardsPerRegion = 2048}

(gdb) print sm_region
$6 = {_start = 0xc6900000, _word_size = 65535}

g1h->_gc_time_stamp is 7.

From the GC log file, the entries related to this region are (their line numbers in the log are included to show where they appear):

G1HR CLEANUP 0x00000000c6900000      // line 30368
G1HR ALLOC(Old) 0x00000000c6900000   // line 30471
G1HR CSET 0x00000000c6900000         // line 31867
G1HR ALLOC(Old) 0x00000000c6900000   // line 32110
G1HR CSET 0x00000000c6900000         // line 34092
2014-10-19T19:14:49.161-0400: 187.257: [GC pause (G1 Evacuation Pause) (mixed)
G1HR #StartGC 346                    // line 34320
G1HR ALLOC(Old) 0x00000000c6900000   // line 34427
G1HR #EndGC 417                      // line 34519
2014-10-19T19:15:36.551-0400: 7733.322: [GC pause (G1 Evacuation Pause) (mixed)
G1HR #StartGC 1709                   // line 47955
G1HR #EndGC 1709                     // line 48275
G1HR #StartFullGC 1710               // line 48437
G1HR #EndFullGC 1710                 // line 48478

From the hs_err log file:

AC 0 O TS 7 PTAMS 0x00000000c6900000 NTAMS 0x00000000c6900000 space 1024K, 99% used [0x00000000c6900000, 0x00000000c6a00000)

So, the region with _bottom=0xc6900000 is an Old GC alloc region and it looks like it was retained across GCs. Shouldn't there have been an entry showing that this region was retained and reused?
20-10-2014

Thanks for the link. However, I don't think these two issues are the same since _saved_mark_word = 0xc1200000 is set to the same value as _bottom = 0xc1200000. We really shouldn't be scanning this HeapRegion at all.
10-10-2014

There was a similar issue fixed in 8u31: https://bugs.openjdk.java.net/browse/JDK-8044406
09-10-2014

It looks like the bug happens because we're scanning cards for a HeapRegion that another thread is allocating into. That should not happen, and we have code that tries to prevent this:

  // Set the "from" region in the closure.
  _oc->set_region(r);
  HeapWord* card_start = _bot_shared->address_for_index(index);
  HeapWord* card_end = card_start + G1BlockOffsetSharedArray::N_words;
  Space *sp = SharedHeap::heap()->space_containing(card_start);
  MemRegion sm_region = sp->used_region_at_save_marks();
  MemRegion mr = sm_region.intersection(MemRegion(card_start,card_end));
  if (!mr.is_empty() && !_ct_bs->is_card_claimed(index)) {

sp->used_region_at_save_marks() should return an empty region, but it doesn't:

  sm_region = { _start = 0xc1200000, _word_size = 131072 }

Interesting fields from the heap region:

  (const HeapRegion * const) 0x7f86580a94a0
  _bottom = 0xc1200000,
  _end = 0xc1300000,
  _saved_mark_word = 0xc1200000,
  _gc_time_stamp = 5,
  _prev_top_at_mark_start = 0xc1200000,
  _next_top_at_mark_start = 0xc1200000,

The current gc time stamp is 5:

  (gdb) p G1CollectedHeap::_g1h->_gc_time_stamp
  $65 = 5

So, according to the following functions we should have returned a MemRegion going from [bottom(), bottom()), but we get a region from [bottom(), top()):

  MemRegion used_region_at_save_marks() const {
    return MemRegion(bottom(), saved_mark_word());
  }

and

  HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
    G1CollectedHeap* g1h = G1CollectedHeap::heap();
    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
    if (_gc_time_stamp < g1h->get_gc_time_stamp())
      return top();
    else
      return Space::saved_mark_word();
  }

  void G1OffsetTableContigSpace::record_top_and_timestamp() {
    G1CollectedHeap* g1h = G1CollectedHeap::heap();
    unsigned curr_gc_time_stamp = g1h->get_gc_time_stamp();
    if (_gc_time_stamp < curr_gc_time_stamp) {
      // The order of these is important, as another thread might be
      // about to start scanning this region. If it does so after
      // set_saved_mark and before _gc_time_stamp = ..., then the latter
      // will be false, and it will pick up top() as the high water mark
      // of region. If it does so after _gc_time_stamp = ..., then it
      // will pick up the right saved_mark_word() as the high water mark
      // of the region. Either way, the behavior will be correct.
      Space::set_saved_mark_word(top());
      OrderAccess::storestore();
      _gc_time_stamp = curr_gc_time_stamp;

      // No need to do another barrier to flush the writes above. If
      // this is called in parallel with other threads trying to
      // allocate into the region, the caller should call this while
      // holding a lock and when the lock is released the writes will be
      // flushed.
    }
  }

  virtual HeapWord* saved_mark_word() const { return _saved_mark_word; }
  void set_saved_mark_word(HeapWord* p)     { _saved_mark_word = p; }

The current hypothesis is that this happens because the _gc_time_stamp of this newly evacuated old region has not been cleared, and therefore we get an incorrect time stamp, which in turn leads to a race condition between the card scanning code and another evacuating GC thread.
09-10-2014