JDK-8276094 : JEP 423: Region Pinning for G1
  • Type: JEP
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Closed
  • Resolution: Delivered
  • Fix Versions: 22
  • Submitted: 2021-10-28
  • Updated: 2024-02-05
  • Resolved: 2024-02-05
Sub Tasks
JDK-8277542 :  
JDK-8322403 :  
Description
Summary
-------

Reduce latency by implementing region pinning in G1, so that garbage collection need not be disabled during Java Native Interface (JNI) critical regions.


Goals
-----

  - No stalling of threads due to JNI critical regions.

  - No additional latency to start a garbage collection due to JNI critical regions.

  - No regressions in GC pause times when no JNI critical regions are active.

  - Minimal regressions in GC pause times when JNI critical regions are active.


Motivation
----------

For interoperability with unmanaged programming languages such as C and C++, [JNI][jni] defines [functions to get and then release direct pointers to Java objects][get-rel-critical]. These functions must always be used in pairs: First, get a pointer to an object (e.g., via `GetPrimitiveArrayCritical`); then, after using the object, release the pointer (e.g., via `ReleasePrimitiveArrayCritical`). Code within such function pairs is considered to run in a _critical region_, and the Java object available for use during that time is a _critical object_.

When a Java thread is in a critical region, the JVM must take care not to move the associated critical object during garbage collection. It can do this by _pinning_ such objects to their locations, essentially locking them in place as the GC moves other objects. Alternatively, it can simply disable GC whenever a thread is in a critical region.

The default GC, [G1][g1], takes the latter approach, [disabling GC][tschatzl] during every critical region. This has a significant impact on latency: If a Java thread triggers a GC then it must wait until no other threads are in critical regions. The severity of the impact depends upon the frequency and duration of critical regions. In the worst cases users report critical regions [blocking their entire application for minutes][jira], unnecessary out-of-memory conditions due to [thread starvation][starve], and even premature VM shutdown. Due to these problems, the maintainers of some Java libraries and frameworks have chosen not to use critical regions by default (e.g., [JavaCPP][javacpp]) or even at all (e.g., [Netty][netty]), even though doing so can adversely affect throughput.

With the change that we propose here, Java threads will never wait for a G1 GC operation to complete.

[jni]: https://docs.oracle.com/en/java/javase/21/docs/specs/jni/index.html
[get-rel-critical]: https://docs.oracle.com/en/java/javase/21/docs/specs/jni/functions.html#getprimitivearraycritical-releaseprimitivearraycritical
[g1]: https://docs.oracle.com/en/java/javase/21/gctuning/garbage-first-g1-garbage-collector1.html#GUID-ED3AB6D3-FD9B-4447-9EDF-983ED2F7A573
[tschatzl]: https://tschatzl.github.io/2021/06/28/evacuation-failure.html
[jira]: https://confluence.atlassian.com/jirakb/jira-running-out-of-memory-due-to-gc-allocation-race-condition-957122851.html
[starve]: https://bugs.openjdk.java.net/browse/JDK-8192647
[javacpp]: https://github.com/bytedeco/javacpp/issues/16#issuecomment-267885319
[netty]: https://netty.io/news/2019/03/08/4-1-34-Final.html


Description
-----------

### Background

[G1][g1] partitions the heap into fixed-size _memory regions_ (not to be confused with _critical_ regions). G1 is a generational collector, so any non-empty region is a member of either the young generation or the old generation. In any particular collection operation, objects are _evacuated_ (i.e., moved) from only a subset of the regions to some other subset.

If G1 is unable to find space to evacuate an object during a minor (i.e., young-generation) collection then it leaves the object in place and marks both it and its containing region as having _failed evacuation_. After evacuation, G1 fixes up the failed regions by promoting them from the young generation to the old generation, potentially keeping them ready for subsequent evacuation.
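
The evacuation-failure behavior described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the names `EvacuationFailureSketch`, `Obj`, `YoungRegion`, and `evacuate` are hypothetical and do not correspond to HotSpot code, and free space is modeled as a single integer budget rather than as destination regions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of evacuation failure; all names are hypothetical, not HotSpot's. */
final class EvacuationFailureSketch {

    /** Stand-in for a heap object; only its size matters here. */
    static final class Obj {
        final String name;
        final int size;
        boolean failedEvacuation = false;

        Obj(String name, int size) {
            this.name = name;
            this.size = size;
        }
    }

    /** Stand-in for a young-generation region. */
    static final class YoungRegion {
        final List<Obj> objects = new ArrayList<>();
        boolean failedEvacuation = false;
        boolean promotedToOld = false;
    }

    /**
     * Tries to move every object out of the region into a destination that
     * has freeSpace units left. Objects that do not fit stay in place and
     * are marked, together with the region, as having failed evacuation.
     * Returns the objects that were moved.
     */
    static List<Obj> evacuate(YoungRegion region, int freeSpace) {
        List<Obj> moved = new ArrayList<>();
        List<Obj> leftInPlace = new ArrayList<>();
        int remaining = freeSpace;
        for (Obj o : region.objects) {
            if (o.size <= remaining) {
                remaining -= o.size;
                moved.add(o);
            } else {
                o.failedEvacuation = true;
                region.failedEvacuation = true;
                leftInPlace.add(o);
            }
        }
        region.objects.clear();
        region.objects.addAll(leftInPlace);
        // Fix-up step: a failed young region is promoted to the old generation.
        if (region.failedEvacuation) {
            region.promotedToOld = true;
        }
        return moved;
    }
}
```

In this model a region holding objects of sizes 4 and 8, evacuated with a budget of 5, moves the first object, leaves the second in place, and ends up marked as failed and promoted.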

G1 is already capable of pinning objects to their memory locations during major (i.e., full) collection operations, simply by not evacuating the regions that contain them. For example, G1 pins _humongous_ regions, which contain large objects. It also pins, for the duration of a single collection, any region that exceeds a specified liveness threshold.

G1 cannot pin arbitrary regions during minor collection operations, though it does exclude humongous regions from such collections.

### Pinning regions during minor collection operations

We aim to achieve the above goals by extending G1 to pin arbitrary regions during both major and minor collection operations, as follows:

  - Maintain a count of the number of critical objects in each region: Increment it when a critical object in that region is obtained, and decrement it when that object is released. When the count is zero then garbage-collect the region normally; when the count is non-zero, consider the region to be pinned.

  - During a major collection, do not evacuate any pinned region.

  - During a minor collection, treat pinned regions in the young generation as having failed evacuation, thus promoting them to the old generation. Do not evacuate existing pinned regions in the old generation.

Once we have done this then we can implement JNI critical regions — without disabling GC — by pinning regions that contain critical objects and continuing to collect garbage in unpinned regions.
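
The per-region counting scheme listed above can be sketched as follows. This is a minimal model, not the HotSpot implementation: `Region`, `RegionPinningSketch`, `pin`, `unpin`, `minorCollection`, and `majorCollection` are hypothetical names chosen for illustration, and "evacuating" a region is reduced to reporting its id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model of a G1 heap region; not HotSpot's actual data structure. */
final class Region {
    enum Generation { YOUNG, OLD }

    final int id;
    Generation generation;
    // Number of critical objects currently obtained from this region.
    private final AtomicInteger pinCount = new AtomicInteger();

    Region(int id, Generation generation) {
        this.id = id;
        this.generation = generation;
    }

    /** Called when a critical object in this region is obtained. */
    void pin() {
        pinCount.incrementAndGet();
    }

    /** Called when that critical object is released. */
    void unpin() {
        if (pinCount.decrementAndGet() < 0) {
            throw new IllegalStateException("release without matching get");
        }
    }

    boolean isPinned() {
        return pinCount.get() != 0;
    }
}

final class RegionPinningSketch {

    /** Minor collection: returns the ids of the regions that were evacuated. */
    static List<Integer> minorCollection(List<Region> collectionSet) {
        List<Integer> evacuated = new ArrayList<>();
        for (Region r : collectionSet) {
            if (!r.isPinned()) {
                evacuated.add(r.id);
            } else if (r.generation == Region.Generation.YOUNG) {
                // Treated like a failed evacuation: promoted in place.
                r.generation = Region.Generation.OLD;
            }
            // A pinned old region is simply left alone.
        }
        return evacuated;
    }

    /** Major collection: pinned regions are never evacuated. */
    static List<Integer> majorCollection(List<Region> regions) {
        List<Integer> evacuated = new ArrayList<>();
        for (Region r : regions) {
            if (!r.isPinned()) {
                evacuated.add(r.id);
            }
        }
        return evacuated;
    }

    public static void main(String[] args) {
        Region pinned = new Region(0, Region.Generation.YOUNG);
        Region free = new Region(1, Region.Generation.YOUNG);
        pinned.pin(); // a thread enters a critical region
        System.out.println("evacuated: " + minorCollection(List.of(pinned, free)));
        System.out.println("region 0 generation: " + pinned.generation);
        pinned.unpin(); // the thread leaves the critical region
    }
}
```

The sketch uses a counter rather than a flag because several critical objects in the same region may be held at once; the region becomes collectible again only when the last one is released.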

Alternatives
------------

The JNI specification suggests two other ways to implement critical regions:

  - At the start of a critical region, copy the critical object to the C heap, where it will not be moved; at the end of the critical region, copy it back.

    This is very inefficient in both time and space. In G1 we could do this only for critical objects in regions that cannot be pinned. Those regions are in the young generation, however, in which most object use and modification typically occurs, so we do not expect that this would help much.

  - Pin objects individually.

    G1 can only evacuate whole regions, so a single pinned object in a region would prevent the collection of that region. The end result would be little different from what we propose above except that it would have higher overhead, since tracking individual pinned objects is more costly than maintaining per-region counts of critical objects.


Testing
-------

Aside from functionality tests, we will do benchmarking and performance measurements to collect the performance data necessary to ensure that our goals are met.


Risks and Assumptions
---------------------

We assume that there will be no changes to the expected usage of JNI critical regions: They will continue to be used sparingly, and they will be short in duration.

There is a risk of heap exhaustion when an application pins many regions at the same time. We have no solution for this, but the fact that the Shenandoah GC pins memory regions during JNI critical regions and does not have this problem suggests that it will not be a problem for G1.

Comments
Removed the section about performance problems with evacuation failure regions. This has been completely fixed outside of this JEP and evacuation failure is on par with regular collection.
17-11-2023

Removed the note about pinned archive regions - there are no explicit archive regions any more since JDK-8298048; humongous objects are not permanently pinned either since JDK-8302215.
27-04-2023

Thanks Mark for your review and rewriting, it looks much better. :) I just made some minor modifications. Let me wait for a while to see if Thomas and Vladimir would like to make any changes, then I will assign it to you.
27-01-2022

I’ve rewritten this fairly heavily to improve the overall flow, omit unnecessary detail, and make it easier to understand by readers who are not GC experts. Please review the text and make any necessary corrections, then assign the issue to me and I’ll move it to Candidate.
26-01-2022

Thanks Vladimir for your review. :) Thanks Thomas for updating the content. :)
02-12-2021

Yes, Thomas, it is a good idea to add this to Motivation.
02-12-2021

[~kvn]: just as an addendum about performance data: the main improvement is that after this change there will be *no* delay at all caused by blocking Java thread progress. The performance improvements provided by [~mli] are improvements to handling these objects (which are admittedly necessary). Probably we should emphasize again at the end of the Motivation section that there will be no more garbage collection/Java thread delays because of critical regions. Thanks.
02-12-2021

Thank you for the performance data. Very impressive. Since the text on the preview page is good, it is fine. Reviewed. You can submit it.
02-12-2021

Thanks Vladimir. Here is a list of performance data for several (done/in progress) sub-tasks:

  - 8274191: Improve g1 evacuation failure injector performance
      - (Pause time measurement): https://bugs.openjdk.java.net/secure/attachment/96560/20210923-evac-fail-pause-times.png

  - JDK-8254167: Record regions where evacuation failed to provide targeted iteration
      - (end-to-end time measurement): https://bugs.openjdk.java.net/secure/attachment/96208/speed.up.iterate.regions.png

  - JDK-8254739: Optimize evacuation failure for regions with few failed objects
      - (end-to-end time measurement): https://bugs.openjdk.java.net/secure/attachment/96200/speed.up.iterate.objs.png
      - (`Remove Self Forwards` time decreasing): https://bugs.openjdk.java.net/secure/attachment/96321/20210902-remove-self-forwards-improvement.png, "`Remove Self Forwards` time decreases from 80-180ms to 4-5ms..."

  - JDK-8256265: Improve parallelism in regions that failed evacuation (in progress)
      - (end-to-end time measurement): https://bugs.openjdk.java.net/secure/attachment/97228/parallel.evac.failure-threads.128.png
      - (end-to-end time measurement): https://bugs.openjdk.java.net/secure/attachment/97227/parallel.evac.failure-threads.32.png
02-12-2021

The improvements to the evacuation failure mechanism are there to support region pinning. With region pinning supported, users can write more efficient Java/JNI programs by using the critical functions GetXxxCritical/ReleaseXxxCritical to avoid copying, without worrying much about the risk that these functions stall GC for an arbitrary time (depending on how they are used).
02-12-2021

Yes, the text looks strange. I tried to fix it, but it seems to be a bug in JBS, as the Preview page looks good, and it also looks good at http://openjdk.java.net/jeps/8276094.
02-12-2021

Can you add some performance improvement data? Especially for reported cases. We should advertise how much it will help. Text for link to Java doc looks strange. I don't think it should look like this: <code class="prettyprint" data-shared-secret="1638393791455-0.18479000805431045">GetXXXCritical</code> and <code class="prettyprint" data-shared-secret="1638393791455-0.18479000805431045">ReleaseXXXCritical</code>
01-12-2021

Appeared on reddit: https://old.reddit.com/r/java/comments/qilv4v/jep_draft_region_pinning_in_g1/ The suggested improvement seems to mostly be about improving the situation with potential long running JNI critical regions. While an interesting idea, this seems out of scope for this JEP (mitigating issues with long running JNI critical regions).
08-11-2021