JDK-8340827 : G1: Improve Application Throughput with a More Efficient Write-Barrier
  • Type: JEP
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Submitted
  • Resolution: Unresolved
  • Submitted: 2024-09-24
  • Updated: 2025-03-17
Description
Summary
-------

Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers.

Goals
-----

 - Reduce the overall synchronization overhead necessary to operate the G1 garbage collector.
 - Reduce the size of the generated code for the G1 barriers.
 - Keep the overall behavior of the G1 garbage collector the same: it continues to alternate between collections of only the young generation and collections of both the young and old generations, reclaiming memory incrementally.


Non-Goals
---------

 - Match the throughput of the G1 garbage collector with that of any other garbage collector.

Motivation
----------

Java Virtual Machine (JVM) overhead is always a concern, especially in cloud-based Java deployments where billing is based on actual resource usage. One component that can significantly impact CPU usage is the garbage collector (GC). The default GC in the HotSpot JVM, the Garbage-First (G1) collector, is tuned to keep a balance between latency and throughput for Java applications. However, keeping this balance can sometimes incur considerable processing overhead compared to throughput-oriented garbage collectors such as Parallel GC and Serial GC.

The main reason for this difference is that G1 performs some tasks concurrently with the application to meet latency goals. Other garbage collectors handle these tasks during the collection pause; concurrent execution, however, requires additional synchronization between the garbage collector and the application, leading to substantial overhead. Specifically, G1 uses a relatively large amount of code, called the write barrier, to track every modification to the object graph in the Java heap. For every reference write in the Java application, the G1 write barrier adds about 50 x64 instructions [[Protopopovs23]](https://epub.jku.at/obvulihs/download/pdf/8800715), [[JEP475]](https://openjdk.org/jeps/475), including very expensive cross-CPU memory synchronization instructions, compared to 3 x64 instructions for the Parallel GC write barrier.


Depending on the Java application, this difference can result in 10-20% throughput regression compared to the Parallel collector in the current JDK (as noted in [[JDK-8062128]](https://bugs.openjdk.org/browse/JDK-8062128), [[JDK-8253230]](https://bugs.openjdk.org/browse/JDK-8253230), [[JDK-8132937]](https://bugs.openjdk.org/browse/JDK-8132937), and [[deSmet21]](https://www.optaplanner.org/blog/2021/09/15/HowMuchFasterIsJava17.html)).

Reducing this overhead is essential for increasing the adoption of G1 in the cloud and for minimizing the need for users to switch to a different garbage collection algorithm.


Description
-----------

G1 is an incremental garbage collector: it selectively evacuates parts of the Java heap in every garbage collection. G1 must track locations of references that point into the selected parts of the heap from areas not selected, as inspecting the entire heap to find such references during a garbage collection pause would take too long. To solve this problem, G1 keeps track of locations that may contain such references of interest. Tracking reference locations directly would impose a significant memory overhead; instead, G1 employs a technique called card marking [[Hölzle93]](https://bibliography.selflanguage.org/_static/write-barrier.pdf).

Card marking divides the heap into fixed-size chunks called cards. Every card is represented by a byte in a separate fixed-size array called the card table. A card is either marked or unmarked; a mark indicates the potential presence of interesting references in the part of the Java heap corresponding to the card.

When the application modifies an object reference, the G1 write barrier intercepts the modification and marks the corresponding card in the card table.
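The address-to-card mapping behind this marking step can be sketched as follows. This is a simplified model, not HotSpot's actual code; the names are illustrative, and the 512-byte card size matches G1's default `GCCardSizeInBytes`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified card-marking model: one card-table byte summarizes
// 512 bytes (2^9) of Java heap, matching G1's default card size.
constexpr size_t kCardShift = 9;
constexpr uint8_t kDirtyCard = 0;     // HotSpot marks a card by storing 0
constexpr uint8_t kCleanCard = 0xFF;  // all-ones means unmarked

// Index of the card covering a given heap address.
size_t card_index(uintptr_t heap_base, uintptr_t addr) {
  return (addr - heap_base) >> kCardShift;
}

// The marking step of the write barrier: dirty the card that covers
// the reference field just written to.
void mark_card(uint8_t* card_table, uintptr_t heap_base, uintptr_t field_addr) {
  card_table[card_index(heap_base, field_addr)] = kDirtyCard;
}
```

For example, a write to a field 600 bytes past the heap base dirties card 1, while card 0 stays clean.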


<a href="https://bugs.openjdk.org/secure/attachment/113251/write%20barrier%20assignment.png">
  <img src="https://bugs.openjdk.org/secure/attachment/113251/write%20barrier%20assignment.png" width="700" />
</a>

During garbage collection pauses, G1 examines the Java heap contents corresponding to the marked cards. G1 garbage collection pause times therefore depend on the number of changes made to the object graph during the Java application's execution. Because of the large number of cards to examine, this examination can take up the majority of the total garbage collection pause time.

To reduce pause times and meet the pause time goal, G1 tries to minimize the number of cards to examine during a garbage collection pause by re-examining, unmarking, and classifying cards according to the areas they reference, concurrently with the application.

 - Re-examining a card may reveal that it no longer contains any interesting references.
 - Classifying cards according to their referents helps limit the number of cards examined during garbage collection to only those that refer to evacuated areas.

These measures reduce the cards to inspect to those referring to areas selected for garbage collection, plus any cards not yet examined concurrently.

Currently, the G1 write barrier implements the following tasks:

 - Mark the card corresponding to the reference location written to, if some pre-filtering determines that the mark is necessary (e.g. the value written is not null).
 - Synchronize card table memory with the concurrent re-examination threads for correctness, as they may modify the same card. This requires expensive memory synchronization instructions.
 - Store the recently marked card location in a buffer so that concurrent re-examination threads can quickly find it again.

This explains the large size of the G1 write barrier.
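Put together, the current barrier's work can be sketched in C++-flavored pseudocode. This is a simplified model; all names are illustrative rather than HotSpot identifiers, and the real barrier is emitted as compiled machine code:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Buffer of recently marked cards, consumed by re-examination threads.
struct DirtyCardQueue { std::vector<uint8_t*> cards; };

// Simplified model of the current G1 post-write barrier for `obj.field = val`.
void g1_post_write_barrier(uint8_t* card_table, uintptr_t heap_base,
                           uintptr_t field_addr, uintptr_t new_value,
                           bool same_region, DirtyCardQueue& queue) {
  // Pre-filtering: null stores and same-region references need no mark.
  if (new_value == 0 || same_region) return;
  uint8_t* card = &card_table[(field_addr - heap_base) >> 9];
  if (*card == 0) return;  // already marked, nothing to do
  // The expensive part: a full memory fence to synchronize with
  // concurrent re-examination threads that may touch the same card.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  *card = 0;                    // mark the card
  queue.cards.push_back(card);  // remember it for re-examination
}
```

The fence and the queue maintenance at the end are exactly the parts this JEP removes from the barrier.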

This JEP changes G1's management of marked cards, switching from the current fine-grained memory synchronization in the write barrier to a much less frequent but more intrusive coarse-grained approach. G1 also adopts a different data structure for storing to-be-examined card marks. This enables a much smaller write barrier: this change keeps only the pre-filters mentioned above, which may avoid card marks completely and thereby lessen concurrent re-examination work. The write barrier shrinks by about 20 instructions on the x86-64 architecture.

Overall this approach results in significantly higher throughput: the smaller write barrier allows better compiled code quality and smaller code size, and the reduced synchronization increases concurrent re-examination performance.

The main idea is to segregate the activities of the application threads and the garbage collection threads onto distinct card tables. This eliminates the need for memory synchronization between the application marking cards and the concurrent garbage collection threads examining them. The application marks cards on one (regular, existing) card table, while the garbage collection threads inspect a second, initially empty, refinement (card) table. When G1 heuristically detects that processing the card marks on the card table in the garbage collection pause would start to exceed the pause time goal, the HotSpot VM atomically swaps the card and refinement tables. The Java application resumes marking on the formerly empty refinement table, and the garbage collection threads process marks from the previous regular card table without any synchronization. G1 repeats this process as necessary to keep the overall number of cards to examine during garbage collection below a certain threshold.
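The double-table scheme can be sketched as follows. This is a minimal model with hypothetical names; in HotSpot the swapped pointer is propagated to each thread via a thread-local handshake rather than read from one shared location:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Two equally sized card tables: application threads mark on `current`
// without fences, refinement threads exclusively scan `shadow`.
struct CardTables {
  uint8_t* current;  // regular card table, written by the application
  uint8_t* shadow;   // refinement table, read by garbage collection threads
};

// Invoked when G1 predicts that the pending marks would exceed the
// pause time goal. After the swap (and, in the real VM, a handshake
// that makes every thread observe the new `current`), refinement can
// process the old marks without any per-card synchronization.
void swap_card_tables(CardTables& tables) {
  std::swap(tables.current, tables.shadow);
}
```

Because each table is touched by only one side at a time, neither the write barrier nor the refinement threads need memory fences; the synchronization cost is paid once per swap instead of once per reference write.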

Performance
------------

With the proposed change, concurrent operations typically consume fewer CPU resources, because the refinement table is very cache-friendly to access compared to the previous data structure holding card locations, which required random memory accesses. There is also a small improvement in garbage collection pause times, because previously required garbage collection phases can be simplified or removed entirely.

However, the main benefit is increased throughput of Java applications due to less memory synchronization in the write barrier, particularly for Java applications that heavily modify the object graph in the Java heap and therefore frequently execute the write barrier in full. Depending on the CPU architecture and the Java application, we measured throughput increases of 5-15% or more due to decreased memory synchronization and concurrent examination.

Even throughput-oriented applications that trigger very little or no concurrent examination at all may benefit from the decreased size of the barrier code improving compiled code performance, showing throughput increases of around 5%.

Because G1 sizes the Java heap based on garbage collection CPU usage, the impact of this change often manifests as a decreased Java heap footprint rather than increased throughput.

Native Memory Footprint
------------

This change allocates an extra card table of the same size as the regular card table. By default, a card table corresponds to around 0.2% of the Java heap size, i.e. 2 MB of native memory per 1 GB of Java heap.
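The 0.2% figure follows directly from the default card size, one table byte per 512 heap bytes, as this small calculation illustrates (the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Native memory needed for one card table: one byte per card,
// with G1's default 512-byte cards (GCCardSizeInBytes).
uint64_t card_table_bytes(uint64_t heap_bytes, uint64_t card_size = 512) {
  return heap_bytes / card_size;
}
// 1 GB of heap / 512 = 2 MB per table; the refinement table doubles that.
```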

Removing the previous data structures that track concurrent examination of cards offsets approximately half of that additional native memory, unless the application performs almost no object graph modifications. The optimization of keeping the interesting cards of always-collected regions on the card table also saves a significant amount of native memory.

Overall, some applications may even use less native memory than before.

Alternatives
------------

Several alternatives to reduce the throughput difference between G1 and other collectors due to complex write-barriers have been proposed and prototyped before:

 - Use operating system-dependent synchronization mechanisms to implement the coarse-grained synchronization between application and garbage collection threads.
This made the synchronization dependent on the operating system and even on the operating system version. Some platforms do not implement this functionality directly (for example OSX-aarch64, or old versions of linux-riscv) or implement it only in a too-generic way (older Linux kernels), so workarounds, for example based on debugging APIs, would have been needed. This led to very poor performance of this functionality on some platforms, so the current technique using Thread-Local Handshakes [[JEP312]](https://openjdk.org/jeps/312) would have been needed anyway as a workaround for these deficiencies.
 - Keep existing data structures to store where marked cards on the card table are instead of using the refinement table.
This would have the advantage of a smaller change; however, the possible reduction of the write barrier size would be much smaller and limited to avoiding the memory synchronization, saving only five instructions instead of twenty. Removing the old data structures also reduces work in the garbage collection pause, reducing garbage collection pause times.
Management of the refinement table is much less complex. The prototype in this proposal removes around 1000 lines of code from the JVM sources, mostly related to this data structure, significantly reducing code complexity.
 - Use the same throughput-optimized write barrier as Parallel GC, disable all concurrent examination of cards, and have G1 do all work in the garbage collection pauses. The end user chooses one or the other “garbage collection mode” based on their service-level preferences.
A master’s thesis [[Protopopovs23]](https://epub.jku.at/obvulihs/download/pdf/8800715) showed that the pause time impact can be negligible in throughput-oriented applications. The drawback is that this proposal would give G1 two distinct “modes” of operation. This adds significant complexity to the code base, increases the test surface significantly, and is inconvenient for end users, as they need to know in advance which mode to select.
Alternatively, the end user may as well select Parallel GC instead, which would still be slightly faster than G1 on pure throughput loads, as shown in the thesis.
 - Modify the original write barrier to batch the memory barrier execution, removing the per-object-graph-modification memory synchronization. The cost of this approach is much more enqueuing work in some applications, decreasing throughput compared to the current G1 garbage collector. The approach proposed here shows no such performance regressions.

Overall we think that the trade-off between code complexity, maintenance overhead, and performance characteristics is most favorable for this proposal.


Risks and Assumptions
---------------------

We assume that the refinement table slightly increases the native memory footprint of G1. We also assume that a significant portion of this increase will be compensated by the removal of the data structures for tracking marked cards to be examined, and by optimizations such as keeping the interesting cards of always-collected regions on the card table.

This is an intrusive change to critical components of G1's interaction with application threads; there is a non-negligible risk of bugs that may cause failures or introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.

Comments
Reviewed.
02-12-2024

Overall this version is much better. Thank you for cleaning it up.
26-11-2024

Small inconsistency, maybe it is about a different set of instructions and not just barriers: in Motivation: "G1 GC barrier adds about 50 x64 instructions"; in Description: "12 instructions instead of 32 on the x86-64 architecture".
26-11-2024

What about Serial GC?: "Match the throughput of the Parallel garbage collector with the G1 garbage collector."
26-11-2024

Maybe simplify the Summary from: "Increase the application throughput of the G1 garbage collector by reducing the impact of its barriers, which record information about object graph changes, exchanging per-reference memory synchronization with overall lower-cost virtual machine-wide synchronization." to: "Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers." The rest is just details which you already have in the Description section.
26-11-2024

Okay, the additional memory usage for the second table is not a big issue. I agree. Now, what about performance? Any visible change? I don't need exact numbers, just a general value.
29-10-2024

[~kvn]: Kim already explained most of your questions, I would like to amend them a bit:

About the second card table size: A single card table takes 1/GCCardSizeInBytes amount of additional memory, so typically, with the default 512-byte card size, this is 0.2% of the Java heap. So the second card table will take up to 0.2% of the heap; that's two MB per GB of heap. For comparison, the existing bitmap alone takes 1.5% of the Java heap, plus two existing card-table-sized data structures (the primary card table, plus the block offset table) in total taking 0.4%. (Note that we only removed a third card-table-sized data structure in JDK 21, so anyone coming from before that won't notice that we re-added one.)

On the upside, there are two larger memory consumers that go away with this change:

* young generation remembered sets; their memory consumption depends, so it is hard to give numbers. There are some applications where this counters the extra memory usage for the second card table, but not throughput-y applications where the second card table is practically unused.
* Dirty Card Queue data structures; measurements showed that even in applications that do not use a lot of the DCQ machinery (like SPECjbb2015), around half of the second card table can be saved by removing them.

There are some optimizations that have not been implemented in this JEP regarding secondary card table memory usage: e.g. as long as there is not much necessity to do refinement (most "throughput" applications) we do not need to commit the secondary card table, but could do so lazily just before card table swapping (and uncommit again in phases of no activity). I will add future extensions/RFEs we already thought of later.

> The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.

Maybe I am misreading this, but with the old approach we always need to commit the card table for all regions (even if only to mark its cards as young). In the current implementation this is also true for the second card table, but see above for optimizations.
29-10-2024

Thank you, Kim, for the explanation. I still think that technical details should be moved into the "Implementation" subtask. The JEP text should be high level. The mechanism of table swapping should be described there (in the Implementation subtask). Yes, it would be nice to re-do the experiments with the order of prefiltering checks.
28-10-2024

Note that many of the linked issues are proposed enhancements to the old mechanism that may be rendered moot by this new mechanism, in which case they'll be closed as not an issue.
27-10-2024

[~kvn] Regarding the footprint cost for the additional card table, a card table uses 1 byte per GCCardSizeInBytes bytes of Java heap (default 512, can be 128 or 1024). That's the reservation size. The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.
27-10-2024

[~kvn] Regarding swapping the tables. The thread-local reference to the card table is written by the refinement controller thread without synchronization. This means the affected thread may use either the old or the new value. The refinement controller thread then synchronizes (using thread-local handshakes or STS) to ensure the affected threads now definitely are using the new value. Only after that has been done for all threads is refinement on the former card table allowed to proceed. A probably more performance intrusive alternative would be to update all the threads and then issue a safepoint.
27-10-2024

[~kvn] Regarding the order of the prefiltering checks. The same-region check tends to filter out a lot, making it amenable to hardware branch prediction. The null check is often optimized away (compile-time known null => eliminate the barrier entirely, compile-time known not null => eliminate the check in the barrier), with the residue requiring a check in the barrier being relatively small. Anecdotal evidence suggests that residue isn't so amenable to hardware branch prediction. But I think the real reason for now is that the current barrier does them in that order, based on some old measurements, and that isn't the focus of this change. It might be worth revisiting that as a followup.
27-10-2024

My biggest complaint is that this JEP uses a lot of very technical GC (G1) specific terms which are hard to understand for people not familiar with them. As you know, JEPs are used for advertisement of Java improvements and should be clear for people not familiar with the details of G1 GC.
25-10-2024

There are no performance numbers to show the benefits of this new design.
25-10-2024

"having two card tables increases the native memory footprint for G1" - what size are you talking about for average usage?
25-10-2024

"using a handshake" - it is not clear to me what that means. "SuspendibleThreadSet" - does it mean that all GC threads are suspended while the card table pointer is updated?
25-10-2024

"Swap the card tables". Does it use some synchronization to do that? Or is an atomic swap enough? Is it possible to have a small concurrent window when mutator threads and GC threads see the same table?
25-10-2024

I have a question: why not do the `if (y == NULL) -> exit` check first? It should be cheaper than checking regions.
25-10-2024

JDK-8342382 is the implementation CR.
16-10-2024