JDK-8077144 : Concurrent mark initialization takes too long
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2015-04-07
  • Updated: 2018-06-21
  • Resolved: 2016-04-06
JDK 9
9 b116 : Fixed
Description
Concurrent mark needs to clear its internal data structures (in ConcurrentMark::clear_all_count_data()) at the start of every marking cycle.

These data structures can get really huge (80+ GB on a 3 TB heap with 800 threads), so processing them with a single thread takes a long time.

On such machines, even startup is delayed by roughly 15 minutes because of this.
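For a sense of scale, here is a minimal sketch (hypothetical names and granularity, not the HotSpot code) of why this clear scales so badly: the count data is replicated per marking worker, so the memory to clear grows with heap size times worker count, yet a single thread clears all of it.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified stand-in for G1's per-worker liveness count data: one
// card-granular buffer per marking worker. Names and the exact
// granularity are assumptions for illustration only.
struct CountData {
    std::vector<std::vector<char>> per_worker;

    CountData(size_t heap_bytes, size_t workers, size_t card_bytes = 512)
        // One bit per card per worker: heap_bytes / card_bytes / 8 bytes each.
        : per_worker(workers,
                     std::vector<char>(heap_bytes / card_bytes / 8)) {}

    // Single-threaded clear, as in ConcurrentMark::clear_all_count_data():
    // the work done is proportional to heap size * number of workers.
    void clear_all() {
        for (std::vector<char>& buf : per_worker) {
            std::memset(buf.data(), 0, buf.size());
        }
    }
};
```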
Comments
There is a better way to solve this issue. Current G1 uses per-mark-thread liveness mark bitmaps that span the entire heap, ultimately to create card-granular information about which areas of the heap contain any live objects. This information is needed for scrubbing remembered sets later. Basically, in addition to updating the prev bitmap required for SATB, the marking threads also, for every live object, mark all bits corresponding to the area the object covers on their per-thread liveness mark bitmap. During the remark pause, this information is aggregated into (two) global bitmaps ("Liveness Count Data"), then in the cleanup pause augmented with some more liveness information, and then used for scrubbing the remembered sets.

The main problems with that solution:
- The per-mark-thread data structures take up a lot of space. E.g. with 64 mark threads, this data structure has the same size as the Java heap. And when you need that many mark threads, the heap is big; at those heap sizes, needing that much extra memory hurts a lot.
- Management of these additional data structures is costly: it takes a long time to initialize them and to regularly clear them. The increased startup time is what actually prompted this issue.
- It takes a significant amount of time to aggregate this data in the remark pause.
- It slows down marking: the combined bitmap update (the prev bitmap plus these per-thread bitmaps) is slower than doing these phases separately.

The proposed solution removes the per-thread additional mark bitmaps and recreates this information from the (complete) prev bitmap in an extra concurrent phase after the Remark pause (see the sketch below). This can be done because the prev bitmap no longer changes after Remark. In total, this separation of tasks is faster (it lowers concurrent cycle time) than doing all the work at once, for the following reasons:
- I did not observe any throughput regressions with this change; throughput of some large applications even increases (not taking into account that you could now increase the heap size, since much less memory is taken up by these additional bitmaps).
- The concurrent phase that prepares for the next marking is much shorter now, since we no longer need to clear lots of memory.
- The remark pause can be much faster (I have measurements showing a decrease of an order of magnitude on large applications, where this aggregation phase dominates the remark pause).
- Startup time and footprint naturally decrease significantly.

As a nice side effect, the change removes a significant amount of LOC.
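A minimal sketch of the rebuild idea under simplified assumptions (a flat bitmap with one bit per 8-byte word, 512-byte cards, and a hypothetical object-size callback; the actual HotSpot code differs): walk the prev bitmap once after Remark and mark every card that a live object overlaps.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t kBytesPerWord = 8;   // one mark bit per 8-byte heap word
constexpr size_t kBytesPerCard = 512; // card size

// Hypothetical callback: in HotSpot the object size would be read from
// the object header at the marked address.
using ObjectSizeFn = size_t (*)(size_t heap_offset);

// Rebuild card-granular liveness from a complete (prev) mark bitmap.
// Safe to run concurrently after Remark, because the prev bitmap is
// immutable from that point until the next marking cycle.
std::vector<bool> rebuild_card_liveness(const std::vector<bool>& prev_bitmap,
                                        size_t heap_bytes,
                                        ObjectSizeFn object_size) {
    std::vector<bool> live_cards(heap_bytes / kBytesPerCard, false);
    for (size_t bit = 0; bit < prev_bitmap.size(); ++bit) {
        if (!prev_bitmap[bit]) continue;           // no object starts here
        size_t start = bit * kBytesPerWord;        // object start offset
        size_t end   = start + object_size(start); // one past the object
        // Mark every card the live object overlaps.
        for (size_t card = start / kBytesPerCard;
             card <= (end - 1) / kBytesPerCard; ++card) {
            live_cards[card] = true;
        }
    }
    return live_cards;
}
```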
14-03-2016

Updated patch
22-02-2016

The same issue shows up on small applications: the prototype reduces footprint and startup time significantly there too, so I am changing the title.
20-11-2015

It would be best to make the initial clear part of the concurrent phase (at the moment it is not; its time is not even tracked). The problem is that the current code initializes this data structure before marking threads are available to use it (and whether worker threads are actually usable at that point probably needs to be looked into more carefully). Initialization could instead be delegated to the OS by always allocating this data with manual virtual memory allocation, which guarantees that the memory is zero-filled when first used, so the initial zeroing during startup can be skipped completely (see the sketch below).
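A minimal sketch of that idea using POSIX mmap (HotSpot would go through its own os:: layer, so this illustrates the mechanism, not the actual patch): anonymous mappings are guaranteed to read as zero, and the kernel provides the zeroed pages lazily on first touch, so no explicit memset is needed.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Allocate a large, zero-filled buffer without touching it up front.
// MAP_ANONYMOUS pages read as zero on first use, so the "clearing" is
// done lazily by the OS, page by page, on demand.
static void* alloc_zeroed(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

int main() {
    const size_t bytes = size_t(1) << 33; // e.g. 8 GB of count data
    char* data = static_cast<char*>(alloc_zeroed(bytes));
    if (data == nullptr) {
        perror("mmap");
        return 1;
    }
    // No memset needed: startup does not pay for zeroing 8 GB here.
    printf("first byte = %d\n", data[0]); // prints 0
    munmap(data, bytes);
    return 0;
}
```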
21-04-2015

This can also be seen when running GCOld on a large heap with many threads; GCOld could be used as a test application during development.
14-04-2015

ILW = Limited performance issue = P2
08-04-2015