JDK-8069367 : Eagerly reclaimed humongous objects left on mark stack
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u40,9
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2015-01-20
  • Updated: 2017-07-26
  • Resolved: 2015-04-15
Fix versions:
  • JDK 8: 8u60 (Fixed)
  • JDK 9: 9 b64 (Fixed)
Description
#  Internal Error (hotspot/src/share/vm/gc_implementation/g1/concurrentMark.cpp:3408), pid=2457, tid=9
#  assert(_nextMarkBitMap->isMarked((HeapWord*) obj)) failed: invariant

Stack:
V  [libjvm.so+0x15f53f8]  void VMError::report_and_die()+0x700;;  __1cHVMErrorOreport_and_die6M_v_+0x700
V  [libjvm.so+0xa7cf98]  void report_vm_error(const char*,int,const char*,const char*)+0x70;;  __1cPreport_vm_error6Fpkci11_v_+0x70
V  [libjvm.so+0xa1a7a4]  void CMTask::drain_local_queue(bool)+0x5c4;;  __1cGCMTaskRdrain_local_queue6Mb_v_+0x5c4
V  [libjvm.so+0xa1baa8]  void CMTask::do_marking_step(double,bool,bool)+0x748;;  __1cGCMTaskPdo_marking_step6Mdbb_v_+0x748
V  [libjvm.so+0xa21948]  void CMConcurrentMarkingTask::work(unsigned)+0x398;;  __1cXCMConcurrentMarkingTaskEwork6MI_v_+0x398
V  [libjvm.so+0x1659bcc]  void GangWorker::loop()+0x294;;  __1cKGangWorkerEloop6M_v_+0x294
V  [libjvm.so+0x12abb98]  java_start+0x378;;  java_start+0x378

Comments
Bugs found by nightly testing. Verified by passed nightly.
26-07-2017

ILW = High (crash), Medium (a few occurrences), Medium (disable eager reclamation) = P2
25-02-2015

A less intrusive fix than cleaning the mark stacks/SATB queues would be to simply check whether the object we are trying to scan is above the next top-at-mark-start (NTAMS). This should automatically filter out large objects that were eagerly reclaimed. (This is also what we do when we regularly drain the SATB queues.)
03-02-2015
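
The filter proposed in the comment above can be illustrated with a toy model. This is a minimal sketch, not HotSpot code: `Region`, `needs_scan` and `drain` are hypothetical names, and the sketch assumes that a freed and re-allocated region has its NTAMS reset to the region bottom, so that every address in it compares at or above NTAMS.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (hypothetical, not the HotSpot data structures).
struct Region {
    std::uintptr_t bottom;  // first address of the region
    std::uintptr_t ntams;   // next top-at-mark-start for this region
};

// Objects at or above NTAMS were allocated after marking started; they are
// treated as implicitly live and are not scanned by the marker.
bool needs_scan(const Region& r, std::uintptr_t obj) {
    return obj < r.ntams;
}

// Drain a mark stack and return only the entries that would be scanned.
// Entries at or above NTAMS (including stale references into a region that
// was reclaimed and reused, under the assumption above) are dropped.
std::vector<std::uintptr_t> drain(const Region& r,
                                  std::vector<std::uintptr_t>& stack) {
    std::vector<std::uintptr_t> scanned;
    while (!stack.empty()) {
        std::uintptr_t obj = stack.back();
        stack.pop_back();
        if (needs_scan(r, obj)) {
            scanned.push_back(obj);
        }
    }
    return scanned;
}
```

With `ntams == bottom`, a stale entry pointing at the region start is dropped instead of reaching the `isMarked` assertion.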

Still working on reproducing the case described in an earlier comment, i.e. when we reclaim a large object that is still referenced in the marking stack(s). Note that the observation that there is a valid humongous object at that particular location does not invalidate the argument that eager reclaim causes this failure. Consider the following sequence of events:

1. Large object O1 is allocated into region R.
2. Concurrent marking pushes the location r of large object O1 on its marking stack, marking O1 in the process. (In more detail: in ConcurrentMark::do_marking_step() line 4145 we drain the SATB buffers, marking and pushing r on the mark stack. Then during drain_satb_buffers() and before drain_local_queue(), it yields to the GC request, "aborting" the marking.)
3. A young GC occurs, reclaiming O1 and removing the mark on O1 (r is still in the local queue).
4. The mutator allocates another large object O2 into R (possibly putting r in the SATB buffers again, but that does not matter).
5. Concurrent marking processes r on the local mark stack (in do_marking_step()) and finds that r is not marked --> assertion failure.

This explains the situation and seems most likely, because the test applications it fails with allocate humongous objects like crazy. It also depends on the exact timing above. My explanation for why this only occurs now is that with the change JDK-8048179 there is more reclamation going on in these tests, but the problem has existed since the first implementation of eager reclaim.

As for impact: contrary to what I mentioned earlier, it is not possible to encounter an uncommitted page here, because we only shrink the heap at full gc, and that aborts marking. We scan the first object in the region needlessly. The problem arises if that region does not contain a valid object, which _is_ possible, i.e. it might contain the tail of a humongous object.

Workaround/short term fix: disable eager reclaim completely during concurrent marking (beginning from the initial mark GC).
Proposed fix that keeps eager reclaim on at any time: when reclaiming regions, also clean the mark stacks and the SATB queues of references to that region.
30-01-2015
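
The event sequence and the proposed fix in the comment above can be replayed on a toy model. This is a minimal sketch under loud assumptions: `ToyMarker` and its members are hypothetical, the mark bitmap is modelled as a set of addresses, and the `purge_stack` flag stands in for "also clean the mark stacks of references to that region". It does not reflect the actual G1 data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model of a marker with a mark bitmap and a local mark stack
// (hypothetical, not the G1 implementation).
struct ToyMarker {
    std::set<std::uintptr_t> mark_bitmap;     // addresses currently marked
    std::vector<std::uintptr_t> mark_stack;   // addresses still to be scanned

    // Step 2: mark the object and push it if it was not marked before.
    void mark_and_push(std::uintptr_t obj) {
        if (mark_bitmap.insert(obj).second) {
            mark_stack.push_back(obj);
        }
    }

    // Step 3: eager reclaim of [start, end) clears the marks in the region.
    // With purge_stack == true it also removes stack entries that point
    // into the region, which is the proposed fix.
    void reclaim_region(std::uintptr_t start, std::uintptr_t end,
                        bool purge_stack) {
        mark_bitmap.erase(mark_bitmap.lower_bound(start),
                          mark_bitmap.lower_bound(end));
        if (purge_stack) {
            mark_stack.erase(
                std::remove_if(mark_stack.begin(), mark_stack.end(),
                               [start, end](std::uintptr_t p) {
                                   return p >= start && p < end;
                               }),
                mark_stack.end());
        }
    }

    // Step 5: drain the stack and count entries that are no longer marked,
    // i.e. the entries that would trip the isMarked invariant.
    int drain_local_queue() {
        int violations = 0;
        while (!mark_stack.empty()) {
            std::uintptr_t obj = mark_stack.back();
            mark_stack.pop_back();
            if (mark_bitmap.count(obj) == 0) {
                ++violations;
            }
        }
        return violations;
    }
};
```

Calling `mark_and_push(r)`, then `reclaim_region(...)` without purging, then `drain_local_queue()` reports one violation; with purging it reports none. The linear scan in `reclaim_region` also hints at why cleaning every mark stack is considered expensive elsewhere in this thread.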

Also note that I never managed to reproduce the problem even once. This is pure deduction from the end results and source code analysis.
30-01-2015

We do not necessarily need to clear the SATB queues; we could also just not push references to free regions or humongous tails onto the mark stack.
30-01-2015
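
The alternative in the comment above, refusing to push references into free regions or humongous tails, could look roughly like the following. This is only a sketch: `RegionType`, `ToyHeap`, `type_of` and `may_push` are hypothetical names for illustration, and the region lookup by index is an assumption about how such a check might be wired up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical region classification for the toy model.
enum class RegionType { Free, Young, Old, StartsHumongous, ContinuesHumongous };

// Toy heap: equally sized regions starting at `base`.
struct ToyHeap {
    std::uintptr_t base;
    std::size_t region_size;
    std::vector<RegionType> regions;

    // Look up the type of the region containing addr.
    // Throws std::out_of_range if addr lies past the last region.
    RegionType type_of(std::uintptr_t addr) const {
        return regions.at((addr - base) / region_size);
    }

    // A reference may be pushed onto the mark stack only if it does not
    // point into a free region or into the tail of a humongous object.
    bool may_push(std::uintptr_t addr) const {
        RegionType t = type_of(addr);
        return t != RegionType::Free && t != RegionType::ContinuesHumongous;
    }
};
```

The check runs at push time, so stale SATB entries can stay in their queues and are simply ignored when they are processed.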

Some results from the core file from the Humongous_Arrays test:

The offending oop is at 0xa6400000, a humongous byte array (klass name: [B, size 1003834 elements, 1M heap region size). The previous gc was a young gc (from the hs_err log), so the object was potentially eagerly reclaimed. That region's remembered set is empty too.

On the other hand, the region type of this region is still humongous and it looks like a valid humongous object (it should not after reclaim). This basically rules out eager reclaim as the problem though. It is strange that the failing thread is thread 1 though - concurrent marking threads typically use higher thread numbers.
28-01-2015

Tried reproducing locally with no result.
28-01-2015

You can selectively disable this particular feature with the experimental option G1EagerReclaimHumongousObjectsWithStaleRefs btw.
23-01-2015

Most likely the recently added change for early reclaim of humongous objects with a few references (JDK-8048179) introduced this problem. Do we know if the referenced obj is a large object? (It should be enough to check whether its address is region aligned.) Looking at the heap layout in the hs_err file this is very likely, so it is probably not even worth checking.

I think that while the remembered set entry for that region existed before the gc, during eager reclaim we noticed that there is no actual reference any more and removed that remembered set entry (and the entire object with it). However, marking still had this reference on the marking stack and wants to scan it.

A quick fix would be to disable this kind of early reclaim during marking. A longer term, clean fix would need to somehow remove these references from the mark stack quickly - with the current structure of the mark stack this is rather expensive, as we would need to scan every mark stack *and* the overflow stack. Another alternative would be to just record that this particular region has been reclaimed since the last mark start, and to skip any reference found in the mark stack that points into such a recently reclaimed region.

The problem does not go away with just fixing the assert. G1 might not only have reclaimed the region, but also uncommitted it in the meantime. JDK-8048179 is not in 8u40 btw.
23-01-2015
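
Two ideas from the comment above can be sketched in a few lines: the region-alignment check for "is this a humongous object start", and the alternative of recording regions reclaimed since the last mark start. All names here (`is_region_aligned`, `ReclaimedSinceMarkStart` and its methods) are hypothetical, the region size is assumed to be a power of two, and the address and region size used in the usage note are the ones reported from the core file in a later comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// True if addr lies exactly on a region boundary.
// Assumes region_size is a power of two.
constexpr bool is_region_aligned(std::uintptr_t addr, std::size_t region_size) {
    return (addr & (region_size - 1)) == 0;
}

// Hypothetical record of regions eagerly reclaimed since the last mark start.
class ReclaimedSinceMarkStart {
public:
    explicit ReclaimedSinceMarkStart(std::size_t region_size)
        : region_size_(region_size) {}

    // A new marking cycle starts with an empty record.
    void on_mark_start() { reclaimed_.clear(); }

    // Remember the region that contains region_start.
    void on_eager_reclaim(std::uintptr_t region_start) {
        reclaimed_.insert(region_start / region_size_);
    }

    // Mark stack entries that point into a recorded region are skipped.
    bool should_skip(std::uintptr_t obj) const {
        return reclaimed_.count(obj / region_size_) != 0;
    }

private:
    std::size_t region_size_;
    std::unordered_set<std::uintptr_t> reclaimed_;  // region indices
};
```

For the reported oop, `is_region_aligned(0xa6400000, 1 << 20)` is true, which is consistent with it being the start of a humongous object. Skipping by region would also skip a new object allocated into the same region during the same cycle; the sketch assumes that is acceptable because such an object was allocated after mark start.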

I = High (crash, nightly), L = High (twice in nightlies), W = High (nightly) = P1
21-01-2015