JDK-8316226 : GenShen: Consider forcing auto-tenure age to be greater than 1
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: repo-shenandoah
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-09-13
  • Updated: 2023-12-02
  • Resolved: 2023-11-29
Related Reports
Relates :  
Relates :  
Description
The census for age-0 objects means something different than the census for all other ages.  In particular, everything allocated during a particular cycle is considered to have age 0 at the end of that cycle.  We cannot reclaim any of these objects until the next GC cycle.

Normally, we take the census at the end of concurrent marking.  But age-0 objects will continue to accumulate during evacuation and update-refs.

The auto-tenure-age selection algorithm considers the delta between census at age 0 during cycle N and census at age 1 during cycle N+1.  If this delta is sufficiently small, it concludes that we should promote objects of age 1, because these cohorts have demonstrated low mortality.
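
The selection rule described above can be sketched as follows. This is a minimal illustration, not the HotSpot implementation: the function name, the `mortality_threshold` value, and the `min_age`/`max_age` parameters are assumptions made for the example. It compares each cohort's population at age a-1 in cycle N with its population at age a in cycle N+1 and returns the youngest age whose cohort shows low mortality.

```python
def select_tenuring_age(prev_census, curr_census, mortality_threshold=0.1,
                        min_age=1, max_age=15):
    """Return the youngest age at which a cohort shows low mortality.

    prev_census[a] is the population (e.g. in bytes) of age-a objects at cycle N;
    curr_census[a] is the population of age-a objects at cycle N+1.
    All names and default values here are hypothetical.
    """
    for age in range(min_age, max_age):
        before = prev_census.get(age - 1, 0)  # cohort at age-1 during cycle N
        after = curr_census.get(age, 0)       # same cohort one cycle later
        if before == 0:
            continue                          # no baseline to compare against
        mortality = 1.0 - after / before
        if mortality < mortality_threshold:
            return age                        # cohort looks stable: promote at this age
    return max_age                            # no low-mortality cohort found


# Hypothetical census tables (population by age), not data from this ticket.
prev = {0: 1000, 1: 400, 2: 390}
curr = {0: 1200, 1: 950, 2: 200, 3: 385}
print(select_tenuring_age(prev, curr))             # floor of 1: age 1 is eligible
print(select_tenuring_age(prev, curr, min_age=2))  # floor of 2: age 1 is skipped
```

In this sketch, raising `min_age` to 2 removes age 1 from consideration entirely, which is the second option discussed in this ticket. Note that if the age-0 baseline is undercounted, `mortality` comes out too low (or even negative), and age 1 is selected more readily.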

There is a suspicion that this choice causes undesirable premature promotion.

Either we need to understand and document this as a desirable behavior, or we want to disable auto-tenuring-age-selection from choosing age 1 as the promotion age.

A test workload has been demonstrated to generate almost 3x the amount of old-gen garbage between the GenShen version on 6/15 and the GenShen version on 9/11.  
Comments
Thanks for the incremental software update on this issue. I'm revisiting some of my analysis and perspectives on the auto tenuring. Here's some ASCII art to "clarify" my understanding. Is this accurate?
1. At the end of marking, or at the end of update, depending on when we take our census, we will take a measurement of age 0 population.
2. Following the measurement, these objects will be known as age 1, because they will be subsequently evacuated and their age incremented.
3. I'm less confident I understand what is meant by the age-0 census if we take it at the end of evacuation. Age-0 at that point includes any objects allocated following the start of evac, but it also implicitly includes all objects allocated before the start of the next evac. We do not have a baseline for comparing the population of the age-1 census at the point of the next measurement.
4. When we compare the two census measurements, it may look like low mortality even though it is high mortality.
5. If our measurement would give special consideration to age 1 calculation, by adding to the initial census all objects allocated following that census, the analysis might be more reliable. See below.
Example:
     census 0                                                                      census 1
         v                                                                             v
|--------|---------|----------|----------------------------------------------|---------|---------|----------|
  mark0     evac0    update0                       idle1                        mark1     evac1    update1
Some questions to be clarified:
1. At census 0, all objects allocated after evac(-1) are considered to be age 0. Any object allocated before evac(-1) should have age 1 at the time of census 0. Are objects created after mark0 (i.e. above TAMS) counted in census 0?
2. At census 1, the tally of age 0 objects will represent all surviving objects allocated after evac0, and the tally of age 1 objects will represent the tally of all surviving objects allocated before evac(-1).
02-12-2023
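
The concern in the comment above, that an undercounted age-0 baseline can make high mortality look like low mortality, can be illustrated with a short calculation. The numbers are hypothetical, and the sketch assumes that objects allocated after the census (during evacuation and update-refs) end up in the same cohort that is counted as age 1 in the next cycle, which is exactly the point the comment asks to have confirmed.

```python
def apparent_mortality(census_age0, census_age1_next):
    """Mortality inferred from the two census values alone."""
    return 1.0 - census_age1_next / census_age0


def adjusted_mortality(census_age0, allocated_after_census, census_age1_next):
    """Mortality when objects allocated after the census are added
    to the age-0 baseline, as the comment suggests."""
    return 1.0 - census_age1_next / (census_age0 + allocated_after_census)


# Hypothetical volumes in MB, not measurements from this ticket.
census_age0 = 100             # age-0 population counted at the end of marking
allocated_after_census = 400  # age-0 objects allocated during evac/update-refs
census_age1_next = 95         # age-1 population counted in the next cycle

print(apparent_mortality(census_age0, census_age1_next))
print(adjusted_mortality(census_age0, allocated_after_census, census_age1_next))
```

With these made-up inputs the census values alone suggest roughly 5% mortality, while the adjusted baseline gives roughly 81%, so a cohort that mostly dies could still pass a low-mortality test.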

Follow up in ticket: https://bugs.openjdk.org/browse/JDK-8321041
29-11-2023

For some reason, the PR and commit didn't get linked here by the Skara infra. Here they are for future reference:
1. https://github.com/openjdk/shenandoah/pull/359
2. https://github.com/openjdk/shenandoah/commit/2618aa699475a7ad62335daa5d3f48c98526f376
29-11-2023

TLDR of performance results so far: "it's complicated"
1. There is no performance difference or appreciable difference in tenuring behavior observed between reference and specimen with default settings for either SPECjbb or Extremem.
2. There is similarly no performance difference or appreciable difference in tenuring behavior observed with MinTenuringAge set to 2 with either SPECjbb or Extremem.
3. However, with the Extremem workload, performance seems to be superior (lower gc overhead, fewer gc cycles) with min=max=7 compared with the default. This is consistent with what was observed by Kelvin's experiment above. [More on that later.]
4. Similarly, min=max=15 performs better than default (lower gc overhead, fewer gc cycles), although worse than min=max=7.
5. The age census of objects shows very little mortality after age 2; however, promoting these objects (as the adaptive algorithm would recommend) appears not to help; these objects usually die all at once in subsequent epochs. Presumably promoting them to old sooner, as the default settings might try to do, results in more promotion, increasing the size of the old generation, shrinking the young generation, and causing more frequent minor cycles, which also tend to identify relatively younger objects for promotion. This in turn also produces pressure on the old generation causing old cycles. These compound to increase GC overhead.
6. It is possible that tweaking the cohort survivor volume threshold and the cohort mortality threshold might help us perform better, but this is far from clear based on the experiments so far and looking at the GC age census tables.
TBD: Collect similar figures for SPECjbb as well (in progress).
I think at this point that we have to go back to the drawing board to rethink the tenuring policy/algorithm and determine a better strategy to find the optimal tenuring threshold.
Plan: I would like to check in the correctness fixes and improvements already in the PR, and pursue improvements to the basic algorithm in a separate PR. I suspect that work will entail, at a minimum, characterizing the specific configurations of our generational system in which earlier promotions consume space in the old generation, shrinking the headroom for the young generation and increasing minor GC frequency. This calls for a strategy that is cognizant not only of the low mortality of a cohort identified for promotion, but also of the cost of that promotion, since the promoted memory is stranded until a future major collection cycle. In heap-constrained scenarios, the resulting increase in major and minor collections can easily overwhelm the time saved by marking fewer objects in a minor cycle. I would like to model this system at a high level and determine what an analytical model says about the population that should be tenured at each cycle, based on an assumed simple lifetime distribution of objects; that population is likely to be a function of the instantaneous sizes and occupancies of the old and young generations. A separate ticket will be filed for that modeling and experimentation exercise.
22-11-2023
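
The trade-off described in the plan above (promotion strands memory in old, which shrinks young and makes minor cycles more frequent) can be made concrete with a toy simulation. This is not the analytical model the comment proposes. It assumes a fixed heap, a constant allocation rate, a fixed promoted volume per cycle, and no old collection in the simulated window, and every name in it is hypothetical.

```python
def minor_cycles(heap_mb, old_live_mb, promoted_per_cycle_mb,
                 alloc_rate_mb_s, duration_s):
    """Count minor cycles in a fixed-size heap over duration_s seconds.

    Toy assumptions: a minor cycle starts whenever young is full, each cycle
    promotes a fixed volume into old, and promoted memory stays stranded in
    old (no old collection happens in the simulated window).
    """
    elapsed = 0.0
    cycles = 0
    old_mb = old_live_mb
    while True:
        young_mb = heap_mb - old_mb
        if young_mb <= 0:
            break                              # young exhausted: an old cycle would be needed
        elapsed += young_mb / alloc_rate_mb_s  # time needed to fill young
        if elapsed > duration_s:
            break
        cycles += 1
        old_mb += promoted_per_cycle_mb        # promotion shrinks young headroom
    return cycles


# Hypothetical heap: 1000 MB total, 200 MB live in old, 100 MB/s allocation, 80 s window.
print(minor_cycles(1000, 200, 0, 100, 80))   # nothing promoted
print(minor_cycles(1000, 200, 50, 100, 80))  # 50 MB promoted per cycle
```

In this toy setup the run that promotes 50 MB per cycle performs more minor cycles in the same window and eventually exhausts young entirely, which mirrors the compounding effect described in the performance summary.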

After correcting some errors in the computation of the tenuring threshold that previously caused premature promotion, I am rerunning the experiment with the default min tenuring age set to 2, as suggested, to see if it makes any difference, especially with the Extremem workload referenced in this bug report. Will update this space with results when available.
20-11-2023

I'll check in a couple of small corrections to the current implementation that fix a few edge cases under this ticket, so I'll keep it open for those changes. I'll open a PR for them shortly.
30-10-2023

I am closing this because, to the extent of my measurements with SPECjbb and Extremem, the change in the default made no difference to the tenuring behavior.
30-10-2023

I am investigating this, and will have an update soon. The investigation has revealed a few small (edge case) bugs in the implementation, but raising the floor has so far not been effective. Will update once I have completed this investigation in the next week or so.
28-09-2023

A quick and dirty workaround is to set `-XX:ShenandoahGenerationalMinTenuringAge=2` while we look into this issue more carefully.
13-09-2023

[~kdnilsen] please do not use "client-libs" Component for hotspot bugs
13-09-2023

I've added logs from the 6/15 configuration and the 9/11 configuration.
13-09-2023

Reproducer:
echo Run OpenJDKTip GenShen GC with memory size 54g with 6s customer period >&2
echo Run OpenJDKTip GenShen GC with memory size 54g with 6s customer period
~/github/shenandoah.experiments.9-11-2023/build/linux-x86_64-server-release/jdk/bin/java \
  -XX:+UnlockExperimentalVMOptions \
  -XX:-ShenandoahPacing \
  -XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms54g -Xmx54g \
  -XX:+UseShenandoahGC \
  -XX:ShenandoahGCMode=generational \
  -Xlog:"gc*=info,ergo" \
  -XX:+UnlockDiagnosticVMOptions \
  -jar ~/github/heapothesys/Extremem/target/extremem-1.0-SNAPSHOT.jar \
  -dInitializationDelay=45s -dDictionarySize=16000000 -dNumCustomers=28000000 \
  -dNumProducts=64000 -dCustomerThreads=2000 -dCustomerPeriod=6s -dCustomerThinkTime=1s \
  -dKeywordSearchCount=4 -dServerThreads=5 -dServerPeriod=5s -dProductNameLength=10 \
  -dBrowsingHistoryQueueCount=5 \
  -dSalesTransactionQueueCount=5 \
  -dProductDescriptionLength=64 -dProductReplacementPeriod=25s -dProductReplacementCount=5 \
  -dCustomerReplacementPeriod=30s -dCustomerReplacementCount=1000 -dBrowsingExpiration=1m \
  -dPhasedUpdates=true \
  -dPhasedUpdateInterval=60s \
  -dSimulationDuration=20m -dResponseTimeMeasurements=100000
Build:
commit 69aa97198c4e71051492f368e37aa50fc75834ee (HEAD -> experiments-9-11, origin/master, origin/HEAD, master)
Author: William Kemper <wkemper@openjdk.org>
Date:   Fri Sep 8 17:40:46 2023 +0000
    8315875: GenShen: Remove heap mode check from ShenandoahInitLogger
    Reviewed-by: kdnilsen, shade, ysr
13-09-2023