JDK-8340827 : JEP 522: G1 GC: Improve Throughput by Reducing Synchronization
  • Type: JEP
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Candidate
  • Resolution: Unresolved
  • Submitted: 2024-09-24
  • Updated: 2025-08-25
Description
Summary
-------

Increase application throughput when using the G1 garbage collector by reducing the amount of synchronization required between application threads and GC threads.

Goals
-----

 - Reduce the G1 garbage collector’s synchronization overhead.

 - Reduce the size of the injected code for G1’s write barriers.

 - Maintain the overall architecture of G1, with no changes to user interaction.

Non-Goals
---------

 - It is not a goal to match the throughput of the G1 garbage collector with that of any other garbage collector in the HotSpot JVM.

Motivation
----------

The Garbage-First collector (G1), which is the default garbage collector of the HotSpot JVM, is designed to balance latency and throughput. However, achieving this balance can sometimes impact application performance compared to throughput-oriented collectors such as the Parallel and Serial collectors.

Relative to Parallel, G1 performs more of its work concurrently with the application, reducing the duration of GC pauses and thus improving latency. Unavoidably, this means that application threads must share the CPU with GC threads, and coordinate with them. This synchronization both lowers throughput and increases latency.

Description
-----------

We propose to improve both throughput and latency by reducing the amount of synchronization required between application threads and GC threads.

### Background

G1 reclaims memory by copying live objects in the heap into new memory regions, making the regions previously occupied by those objects available for the allocation of new objects. References to the original objects, stored in the fields of other objects in other regions, must somehow be updated to point to their new copies.

To find the references that need to be updated, G1 does not scan the entire heap, which would be inefficient. Instead, G1 keeps track of cross-region object references in a data structure called a _card table_, which is updated every time an object reference is stored in a field. These updates are performed by small fragments of code called _write barriers_, which G1 injects into the application in cooperation with the JIT. During a GC pause, G1 only needs to scan the card table to find the objects containing fields that require updating.

Scanning the card table is efficient. However, some applications frequently allocate new objects and store references in the fields of those objects. In such applications the card table may grow so large that scanning it causes G1 to exceed its pause-time goal. To avoid that, G1 optimizes the card table in the background via separate optimizer threads. For that to work, however, the optimizer threads must synchronize with application threads to avoid conflicting updates to the card table. Accordingly, the write-barrier code injected into application threads is complex and therefore slow, and so is the code that optimizes the card table.


### Proposal

We propose to introduce a second card table so that optimizer and application threads can run freely. The write barriers in application threads update the first card table without any synchronization, making them simpler and faster. Meanwhile, the optimizer threads work on the second, initially empty, card table.

When G1 detects that scanning the first card table during a GC pause would likely exceed the pause-time goal, it atomically swaps the card tables. Application threads resume updating the empty, formerly second table, while optimizer threads work on the full, formerly first table without any further synchronization. G1 repeats this process as necessary to keep the amount of work on the active card table under control.

### Performance

This change reduces overhead both while the application is running and during garbage collection pauses.

The main benefit comes from the reduced synchronization between application and optimizer threads. In applications that heavily modify object-reference fields, we have observed throughput gains in the range of 5–15%.

Some additional benefit comes from the write barriers being simpler and faster. For example, on x64, write barriers are reduced from around 50 instructions to just 12. Owing to this, we have observed throughput gains of up to 5% even in applications that do not heavily modify object-reference fields.

The second card table is more efficient than the auxiliary data structure that previously kept track of modified references, so garbage collection pause times also decrease slightly.

### Native memory footprint

The second card table has the same capacity as the first, and uses the same amount of additional native memory. Each card table requires 0.2% of Java heap capacity, corresponding to an additional 2MB of native memory usage per 1GB of Java heap capacity.

The second card table replaces the auxiliary data structure that previously kept track of modified references. In some cases the second card table is smaller than that data structure would have been. Even in cases where the second card table is larger, however, resulting in greater usage of native memory, this should not be a significant concern. In [JDK 20](https://bugs.openjdk.org/browse/JDK-8210708) and [JDK 21](https://bugs.openjdk.org/browse/JDK-8225409) we removed other large data structures from G1 that in total used more than eight times the size of the second card table.

Alternatives
------------

We previously prototyped several other approaches to improving the throughput of G1:

 - Use operating-system-dependent synchronization mechanisms to atomically swap the card table. This required operating-system-specific code or even operating-system-version-specific code, increasing code complexity extraordinarily without providing the expected benefit. Here we have avoided this problem by using generic thread-local handshakes ([JEP 312](https://openjdk.org/jeps/312)).

 - Keep the existing auxiliary data structure for tracking modified references in order to decrease the amount of code change. This approach limited the degree to which write barriers could be simplified, thereby limiting the potential performance improvement.

 - Have two separate G1 garbage collector modes: One mode would work as before, while a new mode would use the simpler write barriers of the Parallel garbage collector and [disable background card-table optimization](https://openjdk.org/jeps/8230187). Supporting two modes of operation for G1 would add significant complexity to the code base. It would also be inconvenient for users, who would have to select the mode at JVM startup. Adoption could, furthermore, be limited since users who choose the new mode might as well just use the Parallel collector, which would still be [slightly faster than G1 on pure throughput workloads](https://epub.jku.at/obvulihs/download/pdf/8800715).

We made other attempts to decrease synchronization between application threads and optimizer threads by modifying the write barriers. These approaches exhibited severe throughput regressions in some situations with only modest gains otherwise.

Overall, we think that the current proposal has the best trade-off between code complexity and performance.

Risks and Assumptions
---------------------

This is an intrusive change to critical components of the G1 collector’s interaction with application threads. There is, therefore, a non-negligible risk of bugs that may cause failures and introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.

We assume there is no need to provide additional user control over the optimization threads beyond what is already provided. The user may continue to completely disable the optimization threads via `-XX:-G1UseConcRefinement` or limit their number via `-XX:G1ConcRefinementThreads=<number>`. We assume that these two options continue to cover all necessary use cases not otherwise better handled by internal heuristics.

We assume that the throughput gains offered by this change, combined with the aforementioned native memory savings in earlier releases, justify the additional memory usage of the second card table.

Comments
Reviewed.
02-12-2024

Overall this version is much better. Thank you for cleaning it up.
26-11-2024

Small inconsistency; maybe it is about a different set of instructions and not just the barriers. In Motivation: "G1 GC barrier adds about 50 x64 instructions"; in Description: "12 instructions instead of 32 on the x86-64 architecture".
26-11-2024

What about Serial GC?: "Match the throughput of the Parallel garbage collector with the G1 garbage collector."
26-11-2024

Maybe simplify the Summary from "Increase the application throughput of the G1 garbage collector by reducing the impact of its barriers, which record information about object graph changes, exchanging per-reference memory synchronization with overall lower-cost virtual machine-wide synchronization." to "Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers." The rest is just details which you already have in the Description section.
26-11-2024

Okay, the additional memory usage for the second table is not a big issue; I agree. Now, what about performance? Any visible change? I don't need exact numbers, just a general figure.
29-10-2024

[~kvn]: Kim already explained most of your questions; I would like to amend them a bit.

About the second card table size: a single card table takes 1/GCCardSizeInBytes of the heap in additional memory, so typically, with the default 512-byte card size, this is 0.2% of the Java heap. So the second card table will take up to 0.2% of the heap; that's two MB per GB of heap. For comparison, the existing bitmap alone takes 1.5% of the Java heap, plus two existing card-table-sized data structures (the primary card table plus the block offset table) in total taking 0.4%. (Note that we only removed a third card-table-sized data structure in JDK 21, so anyone coming from before that won't notice that we re-added one.)

On the upside, there are two larger memory consumers that go away with this change:

 * Young-generation remembered sets: their memory consumption varies, so it is hard to give numbers. There are some applications where this counters the extra memory usage for the second card table, but not throughput-y applications where the second card table is practically unused.
 * Dirty Card Queue data structures: measurements showed that even in applications that do not use a lot of the DCQ machinery (like SPECjbb2015), around half of the size of the second card table can be saved by removing them.

There are some optimizations regarding secondary card table memory usage that have not been implemented in this JEP: e.g., as long as there is not much necessity to do refinement (most "throughput" applications) we do not need to commit the secondary card table, but could do so lazily just before card-table swapping (and uncommit again in phases of no activity). I will add future extensions/RFEs we already thought of later.

> The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.

Maybe I am misreading this, but with the old approach we always need to commit the card table for all regions (even if only to mark its cards as young). In the current implementation this is also true for the second card table, but see above for optimizations.
29-10-2024

Thank you, Kim, for the explanation. I still think that technical details should be moved into the "Implementation" subtask; the JEP text should be high-level. The mechanism of table swapping should be described in the Implementation subtask. Yes, it would be nice to re-do the experiments with the order of the prefiltering checks.
28-10-2024

Note that many of the linked issues are proposed enhancements to the old mechanism that may be rendered moot by this new mechanism, in which case they'll be closed as not an issue.
27-10-2024

[~kvn] Regarding the footprint cost for the additional card table, a card table uses 1 byte per GCCardSizeInBytes bytes of Java heap (default 512, can be 128 or 1024). That's the reservation size. The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.
27-10-2024

[~kvn] Regarding swapping the tables. The thread-local reference to the card table is written by the refinement controller thread without synchronization. This means the affected thread may use either the old or the new value. The refinement controller thread then synchronizes (using thread-local handshakes or the STS) to ensure that the affected threads are now definitely using the new value. Only after that has been done for all threads is refinement on the former card table allowed to proceed. A probably more performance-intrusive alternative would be to update all the threads and then issue a safepoint.
27-10-2024

[~kvn] Regarding the order of the prefiltering checks. The same-region check tends to filter out a lot, making it amenable to hardware branch prediction. The null check is often optimized away (compile-time known null => eliminate the barrier entirely, compile-time known not null => eliminate the check in the barrier), with the residue requiring a check in the barrier being relatively small. Anecdotal evidence suggests that this residue isn't so amenable to hardware branch prediction. But I think the real reason for now is that the current barrier does them in that order, based on some old measurements, and that isn't the focus of this change. It might be worth revisiting that as a followup.
27-10-2024

My biggest complaint is that this JEP uses a lot of very technical, G1-specific GC terms which are hard to understand for people not familiar with them. As you know, JEPs are used to advertise Java improvements and should be clear to people not familiar with the details of the G1 GC.
25-10-2024

There are no performance numbers to show the benefits of this new design.
25-10-2024

"having two card tables increases the native memory footprint for G1" - what size are you talking about for average usage?
25-10-2024

"using a handshake" - it is not clear to me what this means. "SuspendibleThreadSet" - does it mean that all GC threads are suspended while the card table pointer is updated?
25-10-2024

"Swap the card tables". Does it use some synchronization to do that, or is an atomic swap enough? Is it possible to have a small concurrency window when mutator threads and GC threads see the same table?
25-10-2024

I have a question: why not do the `if (y == NULL) -> exit` check first? It should be cheaper than checking regions.
25-10-2024

JDK-8342382 is the implementation CR.
16-10-2024