Bug ID: JDK-8230187 Throughput post-write barrier for G1

Summary
-------
Have the G1 garbage collector use a throughput-optimized write barrier when the user disables concurrent refinement. This trades latency for better throughput on workloads that are not latency sensitive.

Non-Goals
---------
- Let the VM determine, either at startup or at runtime, when to disable concurrent refinement and optimize its barriers.
- Changes to the general garbage collection cycle, i.e. G1 will stay generational, and alternate between a young-only phase and a space reclamation phase that incrementally reclaims space in the old generation.
- Changes to G1’s throughput or features when concurrent refinement is enabled.
- Limit the availability of other G1 features like string deduplication, AppCDS, and eager reclamation of humongous objects when concurrent refinement is disabled.
- Additional ergonomics changes to improve throughput or better meet pause time goals when concurrent refinement is disabled.
- Match the performance of G1 to Parallel GC when concurrent refinement is disabled.

Motivation
----------
The G1 garbage collector has a more complicated post-reference-write barrier (referred to below simply as the write barrier) than more traditional collectors such as the Parallel or the Concurrent Mark-Sweep collector. This complexity is largely due to support for concurrent refinement, which moves some scanning work out of the collection pause and performs it concurrently with the application.

The current mechanism is as follows: the write barrier adds newly dirtied cards to a local per-thread dirty card queue. When that queue is full, the write barrier either hands the full queue to a global dirty card queue set and receives a new empty queue to fill, or is told to process the entries in the queue itself. Concurrent refinement threads also pick up dirty card queues from the global dirty card queue set and process them. In either case, this processing determines whether a given dirty card needs to be scanned in the next collection.
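
The flow described above can be sketched as a toy model in Java. This is an illustration only, not HotSpot code: the class names, the queue capacity, and the backlog threshold that decides when a mutator must process its own queue are all hypothetical, and "refinement" is reduced to recording which cards were looked at.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model of the queue-based refinement flow; all names and sizes are hypothetical.
final class DirtyCardQueueSketch {
    static final int QUEUE_CAPACITY = 4; // illustrative, not HotSpot's buffer size

    // Stand-in for the global dirty card queue set shared by all threads.
    static final class GlobalQueueSet {
        private final Queue<List<Integer>> fullQueues = new ArrayDeque<>();
        final List<Integer> refinedCards = new ArrayList<>();
        int mutatorRefinementThreshold = 2; // hypothetical backlog limit

        // Returns true if the full queue was accepted; false tells the
        // mutator to process the queue itself.
        synchronized boolean offer(List<Integer> full) {
            if (fullQueues.size() >= mutatorRefinementThreshold) {
                return false;
            }
            fullQueues.add(full);
            return true;
        }

        // Called by a concurrent refinement thread: take one queue and process it.
        synchronized boolean refineOne() {
            List<Integer> queue = fullQueues.poll();
            if (queue == null) {
                return false;
            }
            refine(queue);
            return true;
        }

        // Real refinement decides whether each card must be scanned in the
        // next collection; this sketch only records that the card was seen.
        synchronized void refine(List<Integer> cards) {
            refinedCards.addAll(cards);
        }

        synchronized int pending() {
            return fullQueues.size();
        }
    }

    // Per-thread state used by the write barrier.
    static final class MutatorThread {
        private List<Integer> localQueue = new ArrayList<>();
        private final GlobalQueueSet global;
        int refinedByMutator = 0;

        MutatorThread(GlobalQueueSet global) {
            this.global = global;
        }

        // Barrier slow path: enqueue a newly dirtied card.
        void enqueueDirtyCard(int cardIndex) {
            localQueue.add(cardIndex);
            if (localQueue.size() < QUEUE_CAPACITY) {
                return;
            }
            if (!global.offer(localQueue)) {
                // The global set refused the queue: refine inline on this thread.
                global.refine(localQueue);
                refinedByMutator += localQueue.size();
            }
            localQueue = new ArrayList<>(); // continue with a fresh empty queue
        }
    }
}
```

Even in this reduced form, every dirtied card passes through a queue and is later touched again by some thread, which is the per-card cost the rest of this section discusses.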

As a result, the refinement mechanism incurs noticeable overhead in several places: the G1 write barrier is significantly larger and more complicated than others, so it takes more execution resources and has a larger code cache footprint. The larger write barrier may also negatively affect compiler decisions, such as inlining, during code generation. Additionally, the concurrent dirty card processing, whether done inline or by additional threads, takes additional CPU cycles.

We have observed in the past that concurrent refinement offers limited benefit for certain types of workloads. Examples include throughput-oriented workloads where latency is not the primary concern, and workloads that are tuned to minimize old-generation collections. For these cases, G1 could perform better if concurrent refinement could be disabled to allow the use of a simpler write barrier.

Currently, concurrent refinement cannot be disabled completely. By default, G1 creates concurrent GC worker threads to do the refinement work. The user can specify e.g. `-XX:G1ConcRefinementThreads=0` to disable these worker threads, but the direct processing of dirty card queues by mutator threads cannot be disabled, and so the write barrier cannot be simplified.

Description
-----------
We propose a new JVM flag `-XX:-G1UseConcRefinement` to turn off concurrent refinement and allow G1 to use a new "throughput post-write barrier". By default, `G1UseConcRefinement` is enabled. If the user specifies `-XX:-G1UseConcRefinement`, the compilers and interpreter will issue a simplified post-barrier for a given reference write `p.f = q`:

if (p and q in same region) -> exit
if (q is NULL) -> exit
if (*card(p) == DIRTY) -> exit
*card(p) = DIRTY

To ensure correctness under `-XX:-G1UseConcRefinement`, during a collection pause G1 now scans all dirty cards mapped to regions outside the collection set, in addition to the remembered sets for regions in the collection set.
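
As a rough illustration, the barrier and the extra pause-time scan can be modelled in Java as below. This is a sketch under stated assumptions, not the generated barrier code: addresses are plain `long` values, `card(p)` is taken to be the card covering the written field, the 512-byte card size, 1 MiB region size, and `CLEAN`/`DIRTY` encodings are chosen for illustration, and remembered-set scanning is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the proposed throughput post-write barrier; not HotSpot code.
final class ThroughputBarrierSketch {
    static final int CARD_SHIFT = 9;    // 512-byte cards (assumed)
    static final int REGION_SHIFT = 20; // 1 MiB regions (assumed)
    static final byte CLEAN = 0;        // hypothetical card encodings
    static final byte DIRTY = 1;
    static final long NULL_REF = 0L;

    final byte[] cardTable;
    int cardStores = 0; // how often the barrier actually wrote to the card table

    ThroughputBarrierSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)]; // all CLEAN
    }

    // Post-barrier for the write p.f = q; fieldAddr is the address of p.f.
    void postWriteBarrier(long fieldAddr, long q) {
        if ((fieldAddr >>> REGION_SHIFT) == (q >>> REGION_SHIFT)) {
            return; // p and q in same region
        }
        if (q == NULL_REF) {
            return; // q is NULL
        }
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        if (cardTable[card] == DIRTY) {
            return; // card already dirty
        }
        cardTable[card] = DIRTY;
        cardStores++;
    }

    // At a pause: collect the dirty cards that lie in regions outside the
    // collection set. Remembered sets are not modelled here.
    List<Integer> dirtyCardsToScan(boolean[] inCollectionSet) {
        List<Integer> result = new ArrayList<>();
        for (int card = 0; card < cardTable.length; card++) {
            int region = card >>> (REGION_SHIFT - CARD_SHIFT);
            if (cardTable[card] == DIRTY && !inCollectionSet[region]) {
                result.add(card);
            }
        }
        return result;
    }
}
```

Note that the barrier never enqueues anything: all deferred work is found by walking the card table at the pause, which is where the cost moves when concurrent refinement is disabled.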

`-XX:-G1UseConcRefinement` will improve G1's throughput and reduce overall CPU usage. The simplified write barrier is much shorter, which also improves the instruction cache hit rate. This mode reduces both the total work needed to handle a dirty card and the compilation work for the JIT compilers. In addition, it reduces memory footprint by shrinking remembered sets and by not using per-thread dirty card queues.

Alternatives
------------
Performance testing of alternative throughput post-write barriers has been conducted, ranging from using the same barrier as Parallel GC to more complicated variants.

We found that the first two lines (`if (p and q in same region) -> exit` and `if (q is NULL) -> exit`) of the proposed write barrier above effectively filter out writes that never need a dirty card. Those cards then do not have to be processed in the GC pause, which reduces GC pause times without impacting throughput. The filter in the third line (`if (*card(p) == DIRTY) -> exit`) has been kept because:

- it does not have noticeable impact on throughput;
- it corresponds to conditional card marking other collectors already optionally do (via `-XX:+UseCondCardMark`) to reduce coherency traffic on larger machines;
- by keeping this filter the proposed barrier can be a complete prefix of the default write barrier. This simplifies, or makes possible, further enhancements such as dynamically switching between the default and throughput write barriers in the future.
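
To make the effect of the individual filters concrete, the following Java sketch replays one small, made-up sequence of reference writes through three barrier variants: an unconditional card mark (loosely modelled on the Parallel GC style of barrier), a conditional card mark (in the spirit of `-XX:+UseCondCardMark`), and the proposed barrier. The trace, the card and region sizes, and all names are hypothetical, and the counts only illustrate the filtering logic; they say nothing about real benchmark behavior.

```java
// Toy comparison of barrier variants on the same sequence of reference writes.
final class BarrierVariantsSketch {
    enum Variant { UNCONDITIONAL, CONDITIONAL, PROPOSED }

    static final int CARD_SHIFT = 9;    // 512-byte cards (assumed)
    static final int REGION_SHIFT = 20; // 1 MiB regions (assumed)

    final Variant variant;
    final boolean[] dirty;
    int cardStores = 0; // writes to the card table (coherency traffic)

    BarrierVariantsSketch(Variant variant, long heapBytes) {
        this.variant = variant;
        this.dirty = new boolean[(int) (heapBytes >>> CARD_SHIFT)];
    }

    // Post-barrier for p.f = q; fieldAddr is the address of p.f, 0 means null.
    void postWriteBarrier(long fieldAddr, long q) {
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        switch (variant) {
            case UNCONDITIONAL:
                break; // always store to the card table
            case CONDITIONAL:
                if (dirty[card]) return;
                break;
            case PROPOSED:
                if ((fieldAddr >>> REGION_SHIFT) == (q >>> REGION_SHIFT)) return;
                if (q == 0L) return;
                if (dirty[card]) return;
                break;
        }
        dirty[card] = true;
        cardStores++;
    }

    // Number of cards the pause would have to look at.
    int dirtyCardCount() {
        int count = 0;
        for (boolean d : dirty) {
            if (d) count++;
        }
        return count;
    }

    // Replays a fixed, made-up write trace; returns {cardStores, dirtyCards}.
    static int[] replay(Variant variant) {
        BarrierVariantsSketch b = new BarrierVariantsSketch(variant, 4L << REGION_SHIFT);
        long mb = 1L << REGION_SHIFT;
        long[][] writes = {
            { 1 * mb + 64, 1 * mb + 4096 },   // same region
            { 1 * mb + 64, 0L },              // null store
            { 1 * mb + 64, 2 * mb + 8 },      // cross-region
            { 1 * mb + 100, 3 * mb },         // cross-region, same card as above
            { 2 * mb + 1024, 2 * mb + 2048 }, // same region
        };
        for (long[] w : writes) {
            b.postWriteBarrier(w[0], w[1]);
        }
        return new int[] { b.cardStores, b.dirtyCardCount() };
    }
}
```

On this trace the dirty-card check alone reduces card table stores but leaves the same set of dirty cards, while the same-region and null filters also reduce the number of dirty cards left for the pause.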

Some of the overhead of concurrent refinement may be removed by better handling in the compiler and by improved scheduling of the concurrent refinement threads. We expect these changes to yield significantly smaller throughput gains than simplifying the barrier as proposed here, because some refinement work that decreases throughput will always remain. Such an effort would be orthogonal to this change.

Testing
-------
- To verify the correctness of the new write barrier, existing test cases must pass with `-XX:-G1UseConcRefinement`, covering the case where concurrent refinement is disabled.
- To assess performance, we intend to compare benchmark scores between `-XX:-G1UseConcRefinement` and `-XX:+G1UseConcRefinement` for several well-known benchmarks.

Risks and Assumptions
---------------------
For certain workloads, it will be harder to meet small pause time goals with concurrent refinement disabled. Examples include workloads with large heaps and a considerable proportion of long-lived objects. We suggest that on such a workload the user should keep concurrent refinement enabled and use the default write barrier.

This proposal has been superseded by JDK-8340827.
23-04-2025
[~Thomas]: Thanks for the revision, looks good. I revised the sentence, which means "-XX:-G1UseConcRefinement" is not a silver bullet to put G1 in a throughput mode.
30-08-2019
[~manc]: I do not completely understand the "Automatically tune G1 for throughput when concurrent refinement is disabled." statement, can you clarify?
30-08-2019
Agreed and revised. I also took over JDK-8134303 and it could be part of the implementation of this JEP.
29-08-2019
I can understand [~kbarrett] here, although I assume the setting described here is very unusual. We could use the suggestion from JDK-8134303 here.
29-08-2019
I don't think repurposing -XX:G1ConcRefinementThreads=0 to request this completely different behavior is appropriate. This is a semantic change to an existing product option. That option value already has a well-defined meaning with concurrent refinement; having no separate refinement threads is just an edge case. It means that all concurrent refinement must be done by conscripted mutator threads. Also, (a not implemented feature) if the concurrent refinement threads are unable to keep up, such that mutator threads are being conscripted to also perform refinement, the mutator threads could process more than one buffer if that's what's needed to keep the unprocessed logs under control. I would prefer a new option to request this new behavior, with incompatible option errors if it is enabled and any of the existing concurrent refinement controlling options are also explicitly specified. (And I say this as someone who strongly believes we have way too many options and should be trying hard to reduce that number.) Perhaps -XX:G1UseConcurrentRefinement or -XX:G1EnableConcurrentRefinement, defaulting true but can be set false?
27-08-2019

Relates :	JDK-8340827 - G1: Improve Application Throughput with a More Efficient Write-Barrier
Relates :	JDK-8132233 - Provide the option to disable conditional card marking in G1
Relates :	JDK-8134303 - Introduce -XX:-G1UseConcRefinement
Relates :	JDK-8226197 - Reduce G1’s CPU cost with simplified write post-barrier and disabling concurrent refinement