JDK-8027295 : Free CSet takes ~50% of young pause time
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs25,8
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2013-10-25
  • Updated: 2014-07-29
  • Resolved: 2014-03-17
  • Fixed in JDK 8: 8u20
  • Fixed in JDK 9: 9 b08
Description
In the refworkload/wls_webapp_atomic JSP-HTTPS-POST-3 benchmark, freeing the collection set takes ~50% of the total pause time with G1.

Command line to run:

-showversion -Xmx8g -Xms8g -XX:+UseLargePages -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=29 -XX:+UseAdaptiveSizePolicy -XX:+AlwaysPreTouch

See the attached figure wls_webapp_atomic-long-free-cset.png, which shows a stacked graph of the pause time consumers. The gray area represents the Free CSet time.

The problematic CSet component is most likely the cleanup of the sparse table (but this needs to be verified).
Comments
Yeah, sorry, was not sure which issue to use. Not much else I can say on this anyway! :-)
14-02-2014

I agree on all your points, particularly on your conclusion. This problem just needs to be taken into consideration when implementing that - it was just a mention of a possible pain point. Do you mind moving over to JDK-8034873 when discussing concurrent cset freeing? I will copy your helpful comments over manually this time, because otherwise I am sure they will be forgotten here :) Thanks.
14-02-2014

"Problem seems that 500ms is already quite long, and may cause this work to spill over into the next pause (if done only by a single thread)." A couple of things: First, if CSet freeing takes 500ms, I assume the CSet is quite large, hence the eden is also quite large. So, you're probably not going to fill it up very quickly to require another GC after 500ms. Second, note that the concurrent freeing process at the end of marking will actually work through a GC pause to make regions available if they are required by the GC (look at how the secondary free list is used). So, if a GC does happen it doesn't necessarily have to wait for the concurrent process to finish (of course, as long as it doesn't happen often; the concurrent process will eventually fall behind and it might have to do a safepoint to catch up). Speeding up the freeing operation is of course always good (whether it's done at a safepoint or concurrently)! But, IMHO, the only way to totally eliminate this overhead is to do the work concurrently.
14-02-2014

On TOPLINK, with only the serial optimizations mentioned above, Free CSet time drops from 4.33ms to 2.81ms on average.
13-02-2014

Splitting this CR into serial optimizations (this CR) and parallelization efforts (JDK-8034842).
13-02-2014

Linked JDK-6868854, same observation, with some measurements on an unknown application.
13-02-2014

The prototype changes do not take the lock any more in this case. An early prototype parallelizes Hot Card Cache cleanup, RSet size counting and RSet clearing, keeping the rest in a serial phase. On the machine mentioned above this reduces CSet freeing from the ~500ms mentioned to 240ms avg (334ms max) with only the tweaking mentioned above, and to 40ms avg (51ms max) with 26 worker threads (this number is derived from some magic constants in the prototype; the problem is that the cost of starting and synchronizing all 200+ threads must be considered). There needs to be some ergonomics for that. Also, the serial phase and the parallel phase(s) can be done in parallel, i.e. one thread performs the serial work while the other threads start doing the parallelized work. This could get the Free CSet time down to ~12ms. Going to see what work, if any, can be done concurrently - thanks for the idea. The problem seems to be that 500ms is already quite long, and may cause this work to spill over into the next pause (if done only by a single thread). Concurrent operation sounds particularly attractive for the cases where parallelizing does not make sense (due to long worker task synchronization time) but Free CSet still takes a significant part of the pause.
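
A rough sketch of overlapping the serial phase with the parallel work (illustrative only; in HotSpot this would go through the existing worker gang, plain threads are used here just to show the structure):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

static void serial_phase() { /* e.g. policy updates, timing, statistics */ }
static void free_region_work(size_t /*region_idx*/) { /* hot card cache + RSet clearing */ }

void free_cset_parallel(size_t num_regions, unsigned num_workers) {
  std::atomic<size_t> next_region{0};
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_workers; i++) {
    workers.emplace_back([&, i] {
      if (i == 0) serial_phase();                       // serial part overlaps with the loop below
      size_t idx;
      while ((idx = next_region.fetch_add(1)) < num_regions) {
        free_region_work(idx);                          // parallelized per-region work
      }
    });
  }
  for (std::thread& t : workers) t.join();
}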
06-02-2014

I'd recommend that you look into doing the CSet freeing work concurrently after the pause has completed, instead of in parallel at the end of the pause. There's already a mechanism in G1 to free regions concurrently (look at how the CM cleanup phase frees regions concurrently and makes them available on the secondary free list). It should be possible to re-use a similar mechanism to free the collection set regions too. Note that doing the RSet freeing work in parallel might not work all that well if free() takes a single lock.
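
A minimal sketch of that direction (hypothetical and much simplified compared to the CM cleanup mechanism): at the end of the pause the collection set regions are only handed over to a background thread, which then performs the per-region cleanup outside the safepoint.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct RegionStub { int index; };

class ConcurrentCSetFreer {
  std::vector<RegionStub*>  _pending;   // regions handed over at the end of a pause
  std::mutex                _lock;
  std::condition_variable   _cv;
  bool                      _stop = false;
  std::thread               _worker;

  static void cleanup_region(RegionStub*) { /* RSet clear, hot card cache, accounting */ }

  void run() {
    std::unique_lock<std::mutex> l(_lock);
    while (!_stop) {
      _cv.wait(l, [this] { return _stop || !_pending.empty(); });
      std::vector<RegionStub*> batch;
      batch.swap(_pending);
      l.unlock();
      for (RegionStub* r : batch) cleanup_region(r);   // runs concurrently with the mutators
      l.lock();
    }
  }

public:
  ConcurrentCSetFreer() : _worker([this] { run(); }) {}

  ~ConcurrentCSetFreer() {
    { std::lock_guard<std::mutex> g(_lock); _stop = true; }
    _cv.notify_one();
    _worker.join();
  }

  // Called at the very end of the evacuation pause: only enqueue, do no cleanup.
  void hand_over_cset(const std::vector<RegionStub*>& cset) {
    { std::lock_guard<std::mutex> g(_lock); _pending.insert(_pending.end(), cset.begin(), cset.end()); }
    _cv.notify_one();
  }
};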
05-02-2014

Parts of the CSet freeing should be parallelized - on really big machines (460GB heap, 320GB young gen) freeing the CSet takes 500ms. The only operation that does not seem to be parallelizable is recolouring the young gen region as old gen during evacuation failure. Everything else is self-contained and only ever accesses the region's private data structures, or is easily parallelizable (e.g. separate free region lists that can be merged later). Also, instead of going through the linked list of collection set regions, workers should just claim on the region index.
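
A small sketch of the claim-on-the-region-index idea (hypothetical names, not HotSpot code): workers grab chunks of collection set indices via an atomic counter instead of walking a shared linked list.

#include <atomic>
#include <cstddef>
#include <vector>

struct CSetRegionStub { bool freed = false; };

static const size_t ChunkSize = 8;

void worker_free_cset(std::atomic<size_t>& claim, std::vector<CSetRegionStub>& cset) {
  for (;;) {
    size_t start = claim.fetch_add(ChunkSize);    // claim a chunk of region indices
    if (start >= cset.size()) return;             // nothing left to do
    size_t end = start + ChunkSize;
    if (end > cset.size()) end = cset.size();
    for (size_t i = start; i < end; i++) {
      cset[i].freed = true;                       // stands in for the per-region cleanup
    }
  }
}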
24-01-2014

In refworkload/TOPLINK the main contributor to pause time is also the Free CSet phase (~30%).
03-12-2013

At a higher level, i.e. in G1CollectedHeap::free_evacuation_set() in this case, roughly:
- 70% of the time is spent actually freeing the region (G1CollectedHeap::free_region())
- 20% is spent counting the number of occupied cards of the freed region (e.g. HeapRegionRemSet::occupied())
- 10% in statistics

Of that (after tweaking the card cache cleanup slightly):
- ~50% code roots clearing
- ~17% sparse remembered set clear
- ~10% coarse remembered set clear
- ~7% fine remembered set clear
- ~15% card cache clear

(Note that particularly for the operations taking less time, the overhead of taking the times should be noticeable; yes, these numbers do not add up exactly.)

Which basically means that we should take a look at how to improve the code cache root data structure management.
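
As a side note on the 20% spent counting occupied cards: a minimal sketch (simplified, hypothetical types, not the actual HeapRegionRemSet) of maintaining a running count at insertion time, so that occupied() becomes a constant-time read during Free CSet.

#include <cstddef>
#include <set>

class RemSetStub {
  std::set<size_t> _cards;        // stand-in for the sparse/fine/coarse tables
  size_t           _occupied = 0;
public:
  void add_card(size_t card_index) {
    if (_cards.insert(card_index).second) {
      _occupied++;                // maintained on insertion...
    }
  }
  size_t occupied() const {       // ...so the pause only reads a counter
    return _occupied;
  }
  void clear() {
    _cards.clear();
    _occupied = 0;
  }
};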
15-11-2013

Attached measurements (ods + image) including a breakdown of the time consumers (accumulated) for the Free CSet phase. Some observations:
- Cleaning up the from card cache takes most of the time. That's unexpected.
- Code cache root data structure cleanup takes most of the time: we probably should not always call the C memory manager to deallocate (and later allocate) the data.
- The sparse table is third; this is expected as each region does not have many remembered set entries. It takes a disproportionate amount of time though - probably we have every region referencing many other regions, each with only a few remembered set entries.
- The fine and coarse tables do not take much time to clean as they are mostly unused.
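
A minimal sketch of the "do not always call the C memory manager" idea (illustrative only, hypothetical type): a code roots table whose clear keeps its backing storage, so freeing and later reusing a region does not go through free()/malloc() every time.

#include <cstddef>
#include <cstdlib>

class CodeRootsStub {
  void** _entries  = nullptr;
  size_t _length   = 0;
  size_t _capacity = 0;
public:
  void add(void* nmethod) {
    if (_length == _capacity) {
      size_t new_cap = (_capacity == 0) ? 4 : _capacity * 2;
      void** grown = static_cast<void**>(std::realloc(_entries, new_cap * sizeof(void*)));
      if (grown == nullptr) return;   // sketch: ignore allocation failure
      _entries = grown;
      _capacity = new_cap;
    }
    _entries[_length++] = nmethod;
  }

  // Called when the region is freed: drop the contents but keep the allocation,
  // instead of free()-ing now and malloc()-ing again when the region is reused.
  void clear_retaining_storage() { _length = 0; }

  ~CodeRootsStub() { std::free(_entries); }
};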
14-11-2013