JDK-8060697 : G1: Investigate heap sizing heuristics
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-10-15
  • Updated: 2018-06-21
  • Resolved: 2015-12-09
  • Fix Version: JDK 9 b99 (Fixed)
Description
There is a trade-off in the heap sizing heuristics that is difficult to get right. The more aggressively we increase the heap, the better throughput we get, but this comes at the cost of memory footprint.

Currently the heap sizing is based on GCTimeRatio, which means that it is affected by how long the GC pauses are. Any change that reduces or increases the GC pauses will also affect the heap sizing. This in turn means that we get throughput or footprint regressions reported on startup benchmarks even if the only change we have made is to the GC pauses. See for example JDK-8056957.

To mitigate this problem, it seems like a good idea to investigate how we want the heap sizing to behave. If we can come up with a clearer policy, it will be easier to foresee the effects of changes and to evaluate whether a change in behavior is a bug or just expected.
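For context, GCTimeRatio encodes a GC-time budget as a fraction of total time. A minimal sketch of that mapping follows; the helper name is illustrative and this is not the HotSpot source, but the mapping matches the percentages quoted in the comments below (e.g. GCTimeRatio=9 -> 10%, GCTimeRatio=12 -> ~7.7%).

    // Illustrative only: the GC-time budget implied by GCTimeRatio.
    // GC time is allowed to be roughly 1/(1+GCTimeRatio) of total time.
    static double gc_overhead_percent(double gc_time_ratio) {
      return 100.0 * (1.0 / (1.0 + gc_time_ratio));
    }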
Comments
Since I noticed a performance change with the updated base code, I re-ran the benchmarks in the charts with both the new base and the new base plus my changes. I've attached ScaledHeapGrowth_NewBase.ods (spreadsheet) and HeapGrowthSamplesLargeSys_rebase.odt (growth charts) for the updated versions. A few are a bit higher, a few are a bit lower. (Note: I also used an explicit -Xmx on the small system this time.) One thing I noticed this time is that for Compress on the small system, the new code is not a factor - the heap grows only due to allocation failures, not due to crossing GCTimeRatio. It is a factor on the large system, though. That explains why there is still an outlying heap size on the small system.
07-12-2015

>> Even going to 14 didn't help JBB on the small system, so I want to rerun both versions there again to make sure the baseline hadn't just 'gotten lucky' there somehow.

Apparently the baseline runs on the small system *had* gotten lucky - there was very little variation. Running another set of 10, I get a few similar results (scores around 89K), but also some around 32K and one at 8K, with heap sizes ranging from 2500m to 6100m (RSD = 38%). The score average is 63K, so the 87K the new version consistently got looks good. Interestingly, when I refreshed my base code yesterday, I now get higher scores in this config both with and without my changes. Without my changes, the scores range from 33K to 100K (average 70K) and heaps from 2100m to 6100m (RSD 32%). With my changes, the average score is 96K, and the average heap is about 2500m, with only 3% RSD. So that's all good. I completed the runs with GCTimeRatio bumped to 14, and don't think the difference it makes is worth it. A couple of benchmark scores increase slightly, but variability also seems to increase slightly. I'm going to leave it at 12 for now.
03-12-2015

Now having looked at pause times, I don't think there are any surprises. Heaps are sometimes smaller, sometimes larger now, and as you'd expect, total pause time is lower for larger heaps. The old version has a larger heap in some instances because it grew more than necessary to meet the GCTimeRatio goal. Earlier I said: "It's possible that lowering the GCTimeRatio one more notch (now at 8%) would eliminate some of those, at the expense of making the heap larger in several cases." I've tried that experiment with the jvm/jbb tests I normally run, bumping GCTimeRatio to 13 (~7%, versus about 7.7% now). A couple of benchmarks did improve slightly, and for some the heap got slightly larger. Even going to 14 didn't help JBB on the small system, so I want to rerun both versions there again to make sure the baseline hadn't just 'gotten lucky' there somehow. But I'm re-running the jvm tests with 14 in the background now. I also tried some different scaling settings (shifting the scaling points and amounts) and didn't find an alternative consistently better than the current values in terms of keeping variation low, so I will leave those as-is for now.
02-12-2015

No, I haven't specifically looked at that. Only the relative sizes and benchmark scores. I assume you mean pause time effects from different heap sizes, not the additional time spent "doing the math" in expansion_amount itself. I'll look at some samples of that.
01-12-2015

Have you done any comparisons of pause times with the new sizing vs. the original?
25-11-2015

I've attached HeapGrowthSamplesLargeSys.odt and ...SmallSys.odt to show examples of heap growth over time with some of the JVM benchmarks. The old versions look nicer in the graphs, with smooth curves from growing at every GC, while the new versions look more jagged since it takes a minimum of 4 GCs to grow. These demonstrate how scaling up the increment size can at least partially compensate for the potential delay of the growth.
24-11-2015

I've attached the spreadsheet ScaledHeapGrowth.ods. It shows before and after results for a subset of SPECjvm'08 tests on a 32-core system and a 12-core system; I'd been primarily focused on the former. The main thing to look for is the improvement in the relative standard deviation of heap sizes between the Base and New versions. For example, for crypto.signverify it goes from 41.23% down to 5.59% (see line 50) on the large system. There are also some improvements in scores, the largest by far being in Compress on the large system (see line 22). JDK-8132077 reported a 40% degradation in that benchmark, because the heap went to about half the size it used to be due to G1 performance changes. With the new code, you can see that the steps tend to be smaller, such as between 1100 and 1400 in this case. On the small system, on the other hand, Compress loses about 2.5% on average due to a slightly smaller heap. There are a few other degradations as well: SPECjbb'05 lost about 1% on the large system and 2.5% on the small system. It's possible that lowering the GCTimeRatio one more notch (now at 8%) would eliminate some of those, at the expense of making the heap larger in several cases. This is something to consider, especially if in the future we allow the heap to shrink more readily when it has grown too large.
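(For reference, the relative standard deviation in the spreadsheet is just the sample standard deviation of the per-run heap sizes divided by their mean. A small illustrative helper, hypothetical and not part of any benchmark harness:)

    #include <cmath>
    #include <vector>

    // Illustrative helper: relative standard deviation, in percent, of the heap
    // sizes recorded across the runs of one benchmark (assumes at least 2 runs).
    static double relative_std_dev_percent(const std::vector<double>& heap_sizes) {
      double sum = 0.0;
      for (double v : heap_sizes) sum += v;
      const double mean = sum / heap_sizes.size();
      double sq = 0.0;
      for (double v : heap_sizes) sq += (v - mean) * (v - mean);
      const double stddev = std::sqrt(sq / (heap_sizes.size() - 1));
      return 100.0 * stddev / mean;
    }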
24-11-2015

I've been running with experimental changes to the heap sizing code, which I'm now preparing for review. More accurately, the change is to the heap growth code based on GCTimeRatio. There is no change to the code that shrinks the heap, which is still worth looking at. The main problems with the existing code are the large step-wise nature of the growth and the inaccurate tracking of the time ratio. These cause a lot of run-to-run variability, and sometimes result in a performance degradation when G1 has actually improved, because the heap size suddenly takes a large reduction. In other cases, there's excess growth beyond what's needed to maintain the ratio.

In the existing code, the time ratio is tracked by means of the _recent_gc_times_ms vector (a TruncatedSeq), which maintains the elapsed time for the last 10 GC pauses, and the _recent_prev_end_times_for_all_gcs_sec vector, which is used to calculate how much application time has passed. Currently, the average of the last 10 (or fewer, if there aren't 10 yet) entries is used to calculate the GC time to application time ratio. This has two problems. First, if a resize has occurred in that range, the vector includes times that reflect the old heap size. Second, it means one very slow GC can cause more than one heap expansion, as long as that time is still recorded in the vector. Clearing out the vector when an expansion occurs would solve that. The proposed change uses the vector in a slightly different way, described later.

In the existing code, the heap is always expanded by either doubling the current size or by adding 20% of the remaining uncommitted size, whichever is smaller. So if the average ratio exceeds the threshold in one run and not another, the heap might be twice as large, or larger if that happens more than once. Using SPECjvm'08 as an example, I ran 10 instances of each benchmark and computed the relative standard deviation of the heap sizes. Several had RSDs in the 40-60% range. For example, Sunflow heaps ranged from 1024 to 4608. Fortunately, in most cases even that kind of heap size variation has only a small impact on score, but this is the sort of thing behind JDK-8132077. I'll attach a spreadsheet showing the variation before and after the proposed changes.

The second major change is to scale the heap expansion amount according to how much the desired ratio has been exceeded, rather than using a fixed amount. So if one run gets an expansion because the ratio barely crosses the threshold, a second run where the threshold is not crossed will have a much more modest difference in heap size. The starting expansion sizes are as they were before, but those values may then be scaled up or down depending on the average ratio during a measurement window.

In the old code, an expansion is triggered when the ratio of the average of recent pause times to application time exceeds the threshold. In the new code, a measurement window begins when the *current* pause has exceeded the threshold. This ratio is computed not by comparing the current GC pause time against the time since the last pause, but by comparing it to the entire time range represented by the _recent_prev_end_times_for_all_gcs_sec vector (hence it must be multiplied by the number of entries in the vector). This has the effect of smoothing out the impact of a small number of GCs that happen more quickly but don't reflect a real change in occupancy. For each subsequent GC pause, if the threshold is exceeded, a counter is incremented.
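As a rough illustration of the two mechanisms just described (the old fixed-step growth rule and the new per-pause ratio), here is a sketch; the names, signatures, and units are illustrative assumptions, not the actual HotSpot code.

    #include <algorithm>
    #include <cstddef>

    // Old growth rule (sketch): when the average pause-time ratio over the last
    // 10 GCs exceeds the budget, grow by the smaller of doubling the committed
    // size or 20% of the remaining uncommitted space.
    size_t old_expansion_amount(size_t committed_bytes, size_t reserved_bytes,
                                double recent_avg_pause_time_ratio,
                                double threshold) {
      if (recent_avg_pause_time_ratio <= threshold) {
        return 0;                                     // within budget, no growth
      }
      size_t uncommitted = reserved_bytes - committed_bytes;
      size_t by_doubling = committed_bytes;           // i.e. grow to 2x current
      size_t by_fraction = (size_t)(0.20 * (double) uncommitted);
      return std::min(by_doubling, by_fraction);
    }

    // New per-pause ratio (sketch): compare this pause not to the time since the
    // previous pause, but to the whole span covered by the last N recorded pause
    // end times, multiplied back up by N so it stays a per-interval ratio.
    double last_pause_time_ratio(double pause_time_ms,
                                 double oldest_prev_end_time_sec,
                                 double now_sec,
                                 int num_entries) {
      double span_ms = (now_sec - oldest_prev_end_time_sec) * 1000.0;
      return (pause_time_ms * num_entries) / span_ms;
    }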
A resize is triggered when either:
- we've refilled the _recent_gc_times_ms vector and the average of the entries still exceeds the threshold, or
- the number of GCs that exceeded the threshold has reached MinOverThresholdForGrowth.

MinOverThresholdForGrowth is a new constant (currently == 4, just under half the size of the vectors) which sets the minimum number of times the threshold must be crossed before a resize is triggered. This count must be reached before we reach NumPrevPausesForHeuristics (the number of elements in the vectors). Using the count means we don't have to wait for the entire vector to fill before triggering a resize, and it is by far the more common case when the heap needs to grow aggressively. But even one or two very slow GCs can still result in an expansion, if the average is high enough when the window ends. (Earlier experiments didn't allow this and used only the count, which could mean some application memory usage patterns might never get a needed expansion.) If the end of the window is reached (i.e., enough GCs to fill the vector have occurred) without a resize being triggered, the counter is cleared. Another GC above the threshold will be needed to start the sequence again.

When a resize is triggered, the amount by which the threshold has been exceeded (on average) is used to compute a scale factor. If we hit MinOverThresholdForGrowth entries that exceeded the threshold, the amount is the average of the last_gc_overhead values that exceeded the threshold. Otherwise, if we filled the entire vector and still have an average over the limit, recent_avg_pause_time_ratio is used. Several constants are declared in the code that control the scale factor behavior, which should make it obvious what to change to try alternatives. With the current settings, the amount will be reduced if the ratio exceeds the threshold by less than the threshold amount itself. It may also be expanded (up to 2x) if the ratio exceeds the threshold by more than 1.5x the threshold value. This increase is scaled across a range of 2x the threshold; in other words, it will take until the ratio is at 3.5x the threshold before the increment is fully doubled.

The scaling-up of the heap increment compensates for the fact that, compared to the old mechanism, the new code might delay the heap expansion slightly under high stress. With the old code, two consecutive pauses could both expand the heap, while with the new code it will take four pauses before the second expansion can occur. For triggering the first expansion after increased heap usage, the two are probably about the same: with the old code, the average of the last 10 pauses has to cross the threshold; with the new code, we start the window with the first pause that crosses the threshold, so the expansion could happen in as few as 4 pauses - plus we'll have the data from those 4 pauses to more accurately set the expansion size. I will attach some charts showing heap expansion rates for some benchmarks to illustrate.

In the past, the threshold was set at 10% (GCTimeRatio=9). With these changes, it is lowered to about 8% (GCTimeRatio=12). There is no one perfect setting of this (or the scaling knobs described above) that will give ideal scaling for all applications - all are a compromise based on iterative testing. We could consider making a subset of the scaling knobs, or MinOverThresholdForGrowth, available as experimental tunables, but it's probably better not to. GCTimeRatio is still there to force faster or slower growth.
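A compact sketch of the trigger logic described above; the two constants are as described, but the structure, names, and the point where the window is reset are illustrative assumptions rather than the actual patch, and the scaling of the increment is only noted in a comment.

    // Sketch of the new expansion trigger (illustrative; not the actual patch).
    const int NumPrevPausesForHeuristics = 10;  // entries in the tracking vectors
    const int MinOverThresholdForGrowth  = 4;   // over-budget pauses that trigger

    struct SizingWindow {
      int  pauses_seen = 0;
      int  pauses_over_threshold = 0;
      bool open = false;
    };

    // Called after each pause; returns true if the heap should grow. 'pause_ratio'
    // is the per-pause ratio sketched earlier, 'avg_ratio' the average over the
    // refilled vector. The expansion amount itself would then be scaled by how
    // far the relevant ratio is above 'threshold'.
    bool should_expand(SizingWindow& w, double pause_ratio, double avg_ratio,
                       double threshold) {
      bool over = pause_ratio > threshold;
      if (!w.open) {
        if (!over) return false;      // a window only opens on an over-budget pause
        w.open = true;
      }
      w.pauses_seen++;
      if (over) w.pauses_over_threshold++;

      if (w.pauses_over_threshold >= MinOverThresholdForGrowth) {
        w = SizingWindow();           // expand now; start fresh afterwards
        return true;
      }
      if (w.pauses_seen >= NumPrevPausesForHeuristics) {
        // Window filled without enough over-budget pauses: expand only if the
        // average is still over (e.g. one or two very slow GCs), else clear.
        bool expand = avg_ratio > threshold;
        w = SizingWindow();
        return expand;
      }
      return false;
    }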
Another aspect of the change is to scale the threshold itself when the heap is at the small end of the reserved size. If the current heap size is less than half of the reserve, the threshold is scaled down, to as low as 1%. This encourages more heap growth when the heap is small, with the goal of "not leaving performance on the table." The attached charts will show that this doesn't result in excessive heap growth. The maximum sizes the heap grows to when using UseParallelGC are shown for comparison.

One more adjustment to sizing affects the growth amount when the current size is much smaller than the Initial (not maximum) heap size. This primarily comes into play if no -Xms value is specified and a System.gc() is performed before much data is in the heap, as is done at the start of many benchmarks. If the current size is less than 1/4 of the Initial heap size, the growth value used is half the delta between the Initial size and the current size, so the heap grows back more quickly. (If -Xms was specified, the heap won't shrink below that value.)
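A sketch of the two small-heap adjustments just described; the 1% floor and the 1/4 and half-delta rules come from the text above, but the linear ramp shape and all names are assumptions for illustration, not the actual patch.

    #include <cstddef>

    // Sketch: scale the GC-overhead threshold down (toward 1%) while the
    // committed heap is below half of the reserved size, to encourage growth.
    // The linear ramp between 1% and the base threshold is an assumption.
    double scaled_threshold(double base_threshold_pct,
                            size_t committed_bytes, size_t reserved_bytes) {
      double half = (double) reserved_bytes / 2.0;
      if ((double) committed_bytes >= half) {
        return base_threshold_pct;
      }
      double fraction = (double) committed_bytes / half;      // 0.0 .. 1.0
      const double min_threshold_pct = 1.0;
      return min_threshold_pct + fraction * (base_threshold_pct - min_threshold_pct);
    }

    // Sketch: if the heap has shrunk (e.g. after an early System.gc()) to below
    // 1/4 of the Initial heap size, grow back by half the gap to the Initial size.
    size_t regrow_toward_initial(size_t committed_bytes, size_t initial_bytes) {
      if (committed_bytes < initial_bytes / 4) {
        return (initial_bytes - committed_bytes) / 2;
      }
      return 0;
    }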
23-11-2015