Bug ID: JDK-8245511 G1 adaptive IHOP does not account for reclamation of humongous objects by young GC

Type: Enhancement
Component: hotspot
Sub-Component: gc
Affected Version: 11,14.0.1,15

Priority: P3
Status: Resolved
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2020-05-20
Updated: 2024-11-20
Resolved: 2020-08-21

Versions (Unresolved/Resolved/Fixed)

JDK 11: 11.0.12 (Fixed)
JDK 16: 16 b13 (Fixed)

Related Reports

Blocks :	JDK-8246274 - G1 old gen allocation tracking is not in a separate class
Relates :	JDK-8240556 - Abort concurrent mark after effective eager reclamation of humongous objects
Relates :	JDK-8027959 - Early reclamation of large objects in G1
Relates :	JDK-8136677 - Adaptive sizing for IHOP in G1
Relates :	JDK-8238163 - Improve G1 Adaptive IHOP heuristics

Description

Filed on behalf of Ziyi Luo, luoziyi@amazon.com.

After every young-only GC, G1's Adaptive IHOP calculation involves the old generation allocation rate. This rate is calculated as: old_gen_allocated_bytes_since_last_gc / allocation_time_in_sec.

The value old_gen_allocated_bytes_since_last_gc is supposed to refer to all allocations into old regions, including humongous regions. This does not account for the possibility that humongous objects, at least primitive arrays, can be reclaimed by young collections (see JDK-8027959, G1ReclaimDeadHumongousObjectsAtYoungGC). This discrepancy can become problematic in applications which churn through a lot of rather short-lived humongous objects.
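
To make the discrepancy concrete, here is a small arithmetic sketch of the rate formula above. The class name, method name, and sample numbers are hypothetical and are not taken from HotSpot. It only illustrates how humongous allocations that a young GC later reclaims still inflate the computed rate.

```java
// Hypothetical illustration; names and numbers are not from HotSpot.
final class OldGenRateExample {
    static final long MB = 1024L * 1024L;

    // old_gen_allocated_bytes_since_last_gc / allocation_time_in_sec
    static double allocationRate(long oldGenAllocatedBytesSinceLastGc,
                                 double allocationTimeInSec) {
        return oldGenAllocatedBytesSinceLastGc / allocationTimeInSec;
    }

    public static void main(String[] args) {
        long humongousBytes = 190 * MB; // assumed: short-lived humongous arrays, reclaimed by young GC
        long otherOldGenBytes = 10 * MB; // assumed: old gen allocation that actually stays
        double seconds = 1.0;

        // All humongous allocations are counted, even those already reclaimed.
        double counted = allocationRate(humongousBytes + otherOldGenBytes, seconds);
        // Rate if only the retained allocation were counted.
        double retained = allocationRate(otherOldGenBytes, seconds);

        System.out.printf("rate counted: %.0f MB/s%n", counted / MB);
        System.out.printf("rate retained: %.0f MB/s%n", retained / MB);
    }
}
```

With these assumed inputs, the counted rate is twenty times the rate at which the old generation actually grows.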

A real-world example experienced in a production service is shown in Fig 1. The three vertical dashed lines represent initial marking phases at the beginning of concurrent marking runs. Adaptive IHOP takes control after the third of these. Here, most humongous objects are in fact collected during young GC, yet their allocations still count towards the old generation allocation rate in Adaptive IHOP, which erroneously reaches ~200 MB/s. As a result, the estimated IHOP is pushed down.

Even though there is no immediate need for concurrent marking and mixed collections at this point, they now occur back-to-back. Further down this path, frequent inefficient young GC, high promotion rates, and high CPU usage ensue. By 24:00 the high old gen allocation rate is compounded by high promotion rates, and CPU usage jumps to above 90%.

A fix, shown in the webrev in the comments below, works as follows. In each young-only collection cycle, record these numbers of humongous regions:
A) present after the last GC,
B) newly allocated since the last GC,
C) present after this GC.
Estimate the number of humongous regions reclaimed by this GC as:
(A > C) ? B : A + B - C
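
The estimate above can be sketched in a few lines of Java. This is a hypothetical illustration, not the HotSpot code from the webrev: the class and method names are made up, and the second method assumes that the estimate is subtracted from the old generation bytes allocated since the last GC, which the description implies but does not spell out.

```java
// Hypothetical sketch of the estimate described above; not the actual HotSpot code.
final class HumongousReclaimEstimate {
    // a: humongous regions present after the last GC
    // b: humongous regions newly allocated since the last GC
    // c: humongous regions present after this GC
    static long estimateReclaimedRegions(long a, long b, long c) {
        return (a > c) ? b : a + b - c;
    }

    // Assumption: the reclaimed regions are deducted from the bytes that
    // feed the old generation allocation rate, never going below zero.
    static long adjustedOldGenAllocatedBytes(long allocatedBytes,
                                             long reclaimedRegions,
                                             long regionSizeBytes) {
        return Math.max(0L, allocatedBytes - reclaimedRegions * regionSizeBytes);
    }

    public static void main(String[] args) {
        long regionSize = 1024L * 1024L; // matches -XX:G1HeapRegionSize=1m
        long reclaimed = estimateReclaimedRegions(10, 5, 12);
        System.out.println(reclaimed);
        System.out.println(estimateReclaimedRegions(10, 5, 4));
        System.out.println(
            adjustedOldGenAllocatedBytes(6 * regionSize, reclaimed, regionSize) / regionSize);
    }
}
```

Note that the expression is equivalent to min(B, A + B - C): whenever A > C, the difference A + B - C exceeds B, so the estimate never credits a GC with reclaiming more regions than were newly allocated since the last one.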

Reproduction:

Run the attached standalone program "AdaptiveIHOPIssueRepro.java" to approximate the allocation pattern that triggered this issue in our service. The necessary JVM options are listed in a code comment at the top.
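
The attached program is not reproduced here. As a rough idea of what churning through short-lived humongous objects can look like, the following is a hypothetical sketch and is not the attached AdaptiveIHOPIssueRepro.java. The class name, array size, and loop structure are assumptions. With 1 MB regions, a 768 KB primitive array is larger than half a region and is therefore allocated as a humongous object.

```java
// Hypothetical sketch; this is NOT the attached AdaptiveIHOPIssueRepro.java.
final class HumongousChurnSketch {
    // Allocates `count` primitive arrays of `sizeBytes` each (sizeBytes >= 1)
    // and drops every reference right away, so each array is garbage by the
    // next iteration. Returns the total number of bytes requested.
    static long churn(int count, int sizeBytes) {
        long total = 0;
        for (int i = 0; i < count; i++) {
            byte[] shortLived = new byte[sizeBytes];
            shortLived[0] = 1; // touch the array so the allocation is used
            total += shortLived.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Run time in seconds; the default of 5 is an arbitrary choice.
        int seconds = args.length > 0 ? Integer.parseInt(args[0]) : 5;
        long deadline = System.nanoTime() + seconds * 1_000_000_000L;
        long bytes = 0;
        while (System.nanoTime() < deadline) {
            bytes += churn(100, 768 * 1024); // assumed size: 768 KB per array
        }
        System.out.println("requested MB: " + bytes / (1024 * 1024));
    }
}
```

A sketch like this only produces the humongous churn. It does not model the rest of the service's allocation pattern, so it may not trigger the issue on its own.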

Fig 2 shows test results from 120-second runs with and without the proposed fix, which we invoked like this:

java -Xmx512m -Xms512m -XX:G1HeapRegionSize=1m -XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=30 -XX:G1NewSizePercent=30 -Xlog:gc*=debug:file=gc-%p-%t.log AdaptiveIHOPIssueRepro 120

Each dot in Fig 2 represents a young GC invocation. There are 44 of these with the fix and 5254 without. The predicted old generation allocation rate of the fixed version is around 1/8 of the rate of the unfixed version.

Comments

Perhaps this solves the problem that JDK-8240556 also tries to solve? Our production experience in JDK 11 suggests that -XX:-G1UseAdaptiveIHOP -XX:InitiatingHeapOccupancyPercent=60 could also avoid back-to-back concurrent marks caused by humongous allocations. I haven't got the time to dig into the root cause. Thank you for backporting it to 11u. I will definitely check if this solves the problem we are experiencing. If not, I'll try backporting JDK-8240556 to our internal JDK as well.
30-04-2021
Fix Request (11u) Webrevs and review: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2021-April/005926.html
30-04-2021
URL: https://hg.openjdk.java.net/jdk/jdk/rev/d2eafcf20079 User: tschatzl Date: 2020-08-21 09:58:29 +0000
21-08-2020
Some more comments about the background of adaptive IHOP in the thread starting at https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-May/029824.html ; particularly this email https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-May/029882.html.
27-05-2020
I have attached an updated repro program ("AdaptiveIHOPIssueRepro2.java") from Ziyi that evokes the problem on JDK 15 as well. The main difference is that each large object now spans more than the space needed for a G1 region. Apparently, JDK 15 is able to squeeze region-sized objects in without making them humongous, whereas JDK 14 and before insist on half a region or less as non-humongous size. See also https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-May/029866.html.
26-05-2020
I have been looking at the issue in more detail. While I am not completely convinced the analysis is entirely correct, incorporating humongous reclaim in the allocation rate calculation certainly helps avoid pushing down IHOP in such cases. Compare with latest jdk15, which here does not show this problem at all, unlike earlier jdks (you tagged 15 as affected, but I can't reproduce with latest; might be a local fluke though). On latest jdk15 you can see the normal concurrent mark start->prepare mixed->mixed gc cycle, while with earlier jdks most concurrent marks are started by humongous allocations (i.e. gc cause: humongous allocation). These start before adaptive IHOP kicks in, but adaptive IHOP certainly does not improve the situation. I do not know what change causes the difference: jdk15 stabilizes regardless of the initial IHOP value, while earlier jdks become unstable sooner the lower the initial IHOP is. I need to look at this in more detail.
26-05-2020
Hi Ziyi! Thanks for your contribution. Note that the proper review channels are the mailing lists, hotspot_gc_dev@openjdk.java.net in this case, so please send out a review request if you haven't done so. Here are some very initial comments after briefly looking at the change; I will look at it in more detail in the official review thread.
- it would be nice if the test you gave could be added as a stress test alongside the other ones, like in test/hotspot/jtreg/gc/stress. Maybe some fairly stable success/failure metric can be derived somehow.
- one other improvement could be to put the old gen allocation tracking into its own class (e.g. G1OldGenAllocationTracker, which G1Policy has an instance of) to not add too much detail to the IHOP code itself. Then, in G1IHOPControl::update_allocation_info(), pass the expected allocation rate as experienced by/relevant to the IHOP control as it is now, in its respective components. That would keep the IHOP calculation itself and how to determine the IHOP related old gen allocation rate separate. The "additional_buffer_size" that is passed to update_allocation_info() might need to be adjusted with some measure of the "recent" maximum allocated humongous region count during a mutator phase; or might not, if you assume that the g1 heap reserve will cover that. I.e. additional_buffer_size is/should be representative of the memory reserve that is needed to cover the time between concurrent mark start and the first mixed gc, since, as you can see, it is directly subtracted from the target heap size.
- there should be some logging about the old gen allocation tracking, at least log statements, but some JFR events are appreciated as well.
25-05-2020
Here is the webrev for the fix mentioned above: http://cr.openjdk.java.net/~bmathiske/8245511/webrev.00/
21-05-2020