JDK-8340827 : G1: Improve Application Throughput with a More Efficient Write-Barrier
  • Type: JEP
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Submitted
  • Resolution: Unresolved
  • Submitted: 2024-09-24
  • Updated: 2025-05-20
Related Reports
Description
Summary
-------

Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers.

Goals
-----

 - Reduce the overall synchronization overhead necessary to operate the G1 garbage collector.
 - Reduce the size of the injected code for the G1 write barriers.
 - Maintain the overall architecture of the G1 garbage collector, with no changes to user interaction.

Non-Goals
---------

 - Match the throughput of the G1 garbage collector with any other garbage collector.

Motivation
----------

The default garbage collector in the HotSpot JVM, the Garbage-First (G1) collector, is designed to balance latency and throughput for Java applications. However, achieving this balance can sometimes impact the application's performance compared to more throughput-oriented garbage collectors such as Parallel and Serial GC. 

Relative to Parallel, G1 performs more of its work concurrently with the application, reducing the duration of GC pauses and thus improving the latency of the application. Unavoidably, this means that application threads need to share the CPU with special-purpose GC threads, and coordinate with them.  This synchronization of the application threads with the GC threads lowers application throughput and adds latency.

In this JEP, we substantially reduce the total synchronization overhead without any fundamental change in the architecture of G1. Application threads and GC threads still coordinate as the Java heap evolves. Instead of synchronizing with the GC threads at every object reference modification, the application threads can run freely most of the time. The JVM enforces consistency by forcing them to synchronize with the GC threads when necessary. The end result is lower total synchronization overhead which results in higher application throughput and improved latency.

Description
-----------
G1 reclaims memory by copying Java objects in the heap into new memory areas and making the memory previously occupied by those objects available for the application to allocate new objects. This means that references to the original objects, stored in the fields of other objects, must be updated to point to new locations. 

To find the references that need to be updated, G1 does not scan the entire heap; rather, it frontloads the work by arranging for every field assignment to update a data structure called a _card table_. Code called a _write barrier_ is injected into the application to perform the update. Later, during GC pauses, G1 only needs to scan the card table to find the objects in the Java heap whose fields need updating. 
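The card table maps each fixed-size chunk of the heap (a "card", 512 bytes by default) to one byte. The following is a minimal sketch of the card-marking idea only, with illustrative names and constants; it is not HotSpot's actual barrier code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants; 512-byte cards match G1's default GCCardSizeInBytes.
constexpr std::uintptr_t kCardShift = 9;   // log2(512)
constexpr std::uint8_t   kDirty     = 0;
constexpr std::uint8_t   kClean     = 1;

struct CardTable {
    std::uintptr_t heap_base;
    std::vector<std::uint8_t> bytes;   // one byte per 512-byte card of heap

    CardTable(std::uintptr_t base, std::size_t heap_size)
        : heap_base(base), bytes(heap_size >> kCardShift, kClean) {}

    // The essence of a post-write barrier: after the application stores a
    // reference into a field, mark the card covering that field so the GC
    // later knows which parts of the heap to re-scan.
    void mark(std::uintptr_t field_addr) {
        bytes[(field_addr - heap_base) >> kCardShift] = kDirty;
    }
};
```

A store such as `obj.f = other` is then followed by `mark(address_of(obj.f))` in the injected barrier code.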

Even though scanning the card table is efficient, a dynamic Java application might frequently allocate new objects and assign to their fields, marking so many card table entries that examining them all within G1's pause-time goal becomes unsustainable. To avoid lengthy GC pauses, G1 optimizes the card table in the background. However, this means that optimizer threads must synchronize with application threads to avoid racy updates to the card table. Accordingly, the injected write-barrier code in application threads is complex, as are the auxiliary mechanisms introduced to optimize the card table.

We propose to introduce a _second_ card table that will let optimizer and application threads run freely. The write barriers in application threads update the first card table without any synchronization or auxiliary overhead, making them smaller and simpler. Meanwhile, optimizer threads work on the second, initially empty, card table. 

When G1 detects that processing the first card table during a GC pause would likely exceed the pause time goal, the JVM atomically swaps the card tables. Application threads resume updating on the empty, formerly second table, while optimizer threads work on the full, formerly first table without any synchronization with the application. G1 repeats this process as necessary to keep the number of card table entries examined during GC pauses under a certain threshold.
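The swap itself can be pictured as exchanging a pointer between two same-sized tables. The following is a simplified, illustrative sketch (names are invented); in HotSpot the new pointer is published to every application thread via a thread-local handshake rather than a single atomic store:

```cpp
#include <atomic>
#include <cstdint>

// Two card tables: application write barriers mark whichever table
// 'current' points at; refinement (optimizer) threads only ever read the
// other one, so neither side needs per-write synchronization.
struct DoubleCardTable {
    std::uint8_t* tables[2];
    std::atomic<std::uint8_t*> current;  // table the write barrier uses
    int active;

    DoubleCardTable(std::uint8_t* a, std::uint8_t* b)
        : tables{a, b}, current(a), active(0) {}

    // Called when pending card-table work nears the pause-time budget.
    // In HotSpot the new value is pushed to each thread with a handshake;
    // the atomic store here stands in for that publication step.
    std::uint8_t* swap_for_refinement() {
        std::uint8_t* full = tables[active];
        active ^= 1;
        current.store(tables[active], std::memory_order_release);
        return full;  // refinement now processes this table undisturbed
    }
};
```

After the swap, the refiner drains the returned table back to a clean state, so it is empty again by the time it becomes the application-facing table.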

Performance
------------

The proposed change reduces garbage collection overhead both while the application is running and during garbage collection pauses.

The main benefit comes from less synchronization between the optimizer and the application. In applications that heavily modify object references, we observe throughput gains in the range of 5-15%. Other applications, which benefit only from the better code generation enabled by the smaller and simpler write barriers, gain up to 5%. For example, on x86, the number of injected instructions per write barrier is reduced from around 50 to 12.

The second card table incurs less management overhead than the previous auxiliary data structure that kept track of modified references, so garbage collection pause times also decrease slightly.

Native Memory Footprint
------------

The second card table has the same capacity as the first and uses an equivalent amount of additional native memory. Each card table requires 0.2% of Java heap capacity, corresponding to an additional 2MB of native memory per 1GB of Java heap. The second card table replaces the original dynamically sized data structures used to track modified references, which offsets this expense; in some cases these savings are larger than the cost of the second card table.
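The footprint figures follow directly from the card size: one table byte per 512-byte card (the default `GCCardSizeInBytes`) is 1/512, roughly 0.2%, of heap capacity. A back-of-the-envelope check, as a small sketch:

```cpp
// One card-table byte per card (default GCCardSizeInBytes = 512),
// so each table costs heap_bytes / 512, i.e. about 0.2% of the heap.
constexpr long long card_table_bytes(long long heap_bytes,
                                     int card_size = 512) {
    return heap_bytes / card_size;
}

// For a 1 GiB heap: 1 GiB / 512 = 2 MiB per table, 4 MiB for both tables.
static_assert(card_table_bytes(1LL << 30) == (2LL << 20),
              "one card table costs 2 MiB per GiB of heap");
```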

However, in cases where the new card table results in higher native memory usage, we argue that this should not be a concern. In JDK 20 ([JDK-8210708](https://bugs.openjdk.org/browse/JDK-8210708)) and JDK 21 ([JDK-8225409](https://bugs.openjdk.org/browse/JDK-8225409)), we removed large data structures that in total used more than eight times the size of the second card table.

Alternatives
------------

In the past, several attempts to decrease the throughput difference between G1 and the other collectors have been proposed and prototyped:

 - Use operating-system-dependent synchronization mechanisms to atomically swap the card table, in the hope of better performance. However, this required operating-system-specific, or even operating-system-version-specific, code, increasing code complexity extraordinarily without providing the expected benefit. The current implementation instead uses generic Thread-Local Handshakes [[JEP312]](https://openjdk.org/jeps/312), which avoid this problem.
 - Keep the existing auxiliary data structures that store reference update locations, to decrease the amount of change. This approach limited the potential reduction in write-barrier complexity, and thus the achievable performance improvement. The current use of the second card table requires no particular synchronization between optimizer and application.
 - Have two separate G1 garbage collector modes, with the active one chosen at JVM startup. One mode works as before, and the other uses the same write barrier as the Parallel garbage collector, disabling background card table optimization as suggested by [JDK-8230187](https://bugs.openjdk.org/browse/JDK-8230187). Having two distinct modes of operation for G1 adds significant complexity to the code base, increases maintenance overhead, and is inconvenient for end users, who need to select a mode in advance. Adoption could be limited because end users might as well use the Parallel garbage collector instead, as it would still be slightly faster than G1 on pure throughput workloads, as shown in a master's thesis [[Protopopovs23]](https://epub.jku.at/obvulihs/download/pdf/8800715).
 - There were other attempts to decrease the synchronization between the application and the optimizer through changes to the write barrier. These approaches exhibited severe throughput regressions in some situations, with only very modest gains otherwise.

Overall, we think that the trade-off between code complexity, maintenance overhead, and performance characteristics is most favorable for this proposal.

Risks and Assumptions
---------------------

We assume there is no need to provide additional end user control for the optimization threads beyond what is already provided by the VM. The end user may continue to opt to completely disable the optimization threads via `-XX:-G1UseConcRefinement` or limit their number via `-XX:G1ConcRefinementThreads=<number>`. These two options are assumed to continue to cover all necessary use cases not otherwise better handled by the internal heuristics. 

Further, we believe that the throughput gains offered by this change, combined with the native memory savings in earlier JDK releases described in the Native Memory Footprint section, justify the additional memory usage of the second card table.

This is an intrusive change to critical components of G1's interaction with application threads; there is a non-negligible risk of bugs that may cause failures, and of performance regressions when adapting existing heuristics to the new auxiliary data structure. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.

Comments
Reviewed.
02-12-2024

Overall this version is much better. Thank you for cleaning it up.
26-11-2024

Small inconsistency; maybe it is about a different set of instructions and not just barriers. In Motivation: "G1 GC barrier adds about 50 x64 instructions"; in Description: "12 instructions instead of 32 on the x86-64 architecture".
26-11-2024

What about Serial GC?: "Match the throughput of the Parallel garbage collector with the G1 garbage collector."
26-11-2024

Maybe simplify the Summary from: "Increase the application throughput of the G1 garbage collector by reducing the impact of its barriers, which record information about object graph changes, exchanging per-reference memory synchronization with overall lower-cost virtual machine-wide synchronization." to: "Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers." The rest is just details which you already have in the Description section.
26-11-2024

Okay, the additional memory usage for the second table is not a big issue. I agree. Now, what about performance? Any visible change? I don't need exact numbers, just a general value.
29-10-2024

[~kvn]: Kim already explained most of your questions; I would like to amend them a bit.

About the second card table size: a single card table takes 1/GCCardSizeInBytes of the heap in additional memory, so typically, with the default 512-byte cards, this is 0.2% of the Java heap. The second card table will take up to 0.2% of the heap; that's two MB per GB of heap. For comparison, the existing bitmap alone takes 1.5% of the Java heap, plus two existing card-table-sized data structures (the primary card table, plus the block offset table) in total taking 0.4%. (Note that we only removed a third card-table-sized data structure in JDK 21, so anyone coming from before that won't notice that we re-added one.)

On the upside, there are two larger memory consumers that go away with this change:

 * Young-generation remembered sets; their memory consumption varies, so it is hard to give numbers. There are some applications where this counters the extra memory usage for the second card table, but not throughput-y applications, where the second card table is practically unused.
 * Dirty Card Queue data structures; measurements showed that even in applications that do not use a lot of the DCQ machinery (like SPECjbb2015), around half of the second card table can be saved by removing them.

There are some optimizations regarding secondary card table memory usage that have not been implemented in this JEP: e.g., as long as there is not much need to do refinement (most "throughput" applications), we do not need to commit the secondary card table, but could do so lazily just before card table swapping (and uncommit again in phases of no activity). I will add future extensions/RFEs we already thought of later.

> The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.

Maybe I am misreading this, but with the old approach we always need to commit the card table for all regions (even if only to mark its cards as young). In the current implementation this is also true for the second card table, but see above for optimizations.
29-10-2024

Thank you, Kim, for the explanation. I still think that technical details should be moved into the "Implementation" subtask; the JEP text should be high level. The mechanism of table swapping should be in the description (in the Implementation subtask). Yes, it would be nice to re-do the experiments with the order of prefiltering checks.
28-10-2024

Note that many of the linked issues are proposed enhancements to the old mechanism that may be rendered moot by this new mechanism, in which case they'll be closed as not an issue.
27-10-2024

[~kvn] Regarding the footprint cost for the additional card table, a card table uses 1 byte per GCCardSizeInBytes bytes of Java heap (default 512, can be 128 or 1024). That's the reservation size. The old approach (with the young card filter) lets us only commit memory for cards associated with oldgen regions, but only if we ensure young card values are set before the region is used (which has some cost during region allocation). I _think_ the current implementation of this new mechanism doesn't do that, and commits all of the memory for both tables. That could be revisited in a followup.
27-10-2024

[~kvn] Regarding swapping the tables. The thread-local reference to the card table is written by the refinement controller thread without synchronization. This means the affected thread may use either the old or the new value. The refinement controller thread then synchronizes (using thread-local handshakes or STS) to ensure the affected threads now definitely are using the new value. Only after that has been done for all threads is refinement on the former card table allowed to proceed. A probably more performance intrusive alternative would be to update all the threads and then issue a safepoint.
27-10-2024

[~kvn] Regarding the order of the prefiltering checks. The same-region check tends to filter out a lot, making it amenable to hardware branch prediction. The null check is often optimized away (compile-time known null => eliminate the barrier entirely, compile-time known not null => eliminate the check in the barrier), with the residue requiring a check in the barrier being relatively small. Anecdotal evidence suggests that residue isn't so amenable to hardware branch prediction. But I think the real reason for now is that the current barrier does them in that order, based on some old measurements, and that isn't the focus of this change. It might be worth revisiting that as a followup.
27-10-2024

My biggest complaint is that this JEP uses a lot of very technical GC (G1) specific terms which are hard to understand for people not familiar with them. As you know, JEPs are used to advertise Java improvements and should be clear to people not familiar with the details of G1 GC.
25-10-2024

There are no performance numbers to show the benefits of this new design.
25-10-2024

"having two card tables increases the native memory footprint for G1" - what size are you talking about for average usage?
25-10-2024

"using a handshake" - it is not clear to me what it means. "SuspendibleThreadSet" - does it mean that all GC threads are suspended while the card table pointer is updated?
25-10-2024

"Swap the card tables". Does it use some synchronization to do that? Or is an atomic swap enough? Is it possible to have a small concurrent window when mutator threads and GC threads see the same table?
25-10-2024

I have a question: why not do the `if (y == NULL) -> exit` check first? It should be cheaper than checking regions.
25-10-2024

JDK-8342382 is the implementation CR.
16-10-2024