JDK-8166500 : Adaptive sizing for IHOP causes excessively long mixed GC pauses
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux_ubuntu
  • CPU: x86_64
  • Submitted: 2016-09-21
  • Updated: 2016-09-28
  • Resolved: 2016-09-28
Related Reports
Duplicate :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
FULL PRODUCT VERSION :
java version "9-ea"
Java(TM) SE Runtime Environment (build 9-ea+131)
Java HotSpot(TM) 64-Bit Server VM (build 9-ea+131, mixed mode)

FULL OS VERSION :
Linux 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


A DESCRIPTION OF THE PROBLEM :
We ran our application on JDK 9-ea+131 and generated a workload similar to the one we have in production. We noticed unacceptably long mixed GC pauses - more than 3 seconds long. In fact, all mixed GC pauses were much longer than on JDK 1.8.0_102 under identical conditions.
We used the following flags with JDK 9:
-XX:+DisableExplicitGC -XX:+PerfDisableSharedMem  -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+UseG1GC -XX:MaxGCPauseMillis=40 -Xms64g -Xmx64g
It was found that JDK-8136677 causes this, since disabling adaptive sizing for IHOP resolves the issue.
Attaching a part of the debug GC log.
Reproducible on 9-ea+135

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Did not try

REGRESSION.  Last worked in version 8u102

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Part of the debug GC log:
https://www.dropbox.com/s/ahmbutoai2j3ehh/gc_log_07_09_16.log.zip?dl=0
Long pauses can be found near the 44320, 48900, 54470 and 64900 second marks.

REPRODUCIBILITY :
This bug can be reproduced always.

CUSTOMER SUBMITTED WORKAROUND :
-XX:-G1UseAdaptiveIHOP brings results on par with JDK8 
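For reference, a minimal sketch of a launch line applying this workaround to the flag set given in the description (the application class is a placeholder, not part of the report):

java -XX:+DisableExplicitGC -XX:+PerfDisableSharedMem -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+UseG1GC -XX:MaxGCPauseMillis=40 -Xms64g -Xmx64g -XX:-G1UseAdaptiveIHOP <application>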


Comments
Added logs provided in the email thread at http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2016-September/002493.html (thanks!)
28-09-2016

This is a known issue with bad defaults for some global options. The existing JDK-8159697 (or, more likely, sub-tasks of it) is supposed to cover such a change.
28-09-2016

I do not think the allocation rate spikes are problematic in this particular run; if they were a problem, G1 would have GCs that mention something like "to-space exhausted", and I have not found any in the logs. Also, the only allocation rate that matters for keeping up with marking is the allocation rate into the old generation: the gc,ihop "Basic information" log messages show that, and they show a maximum allocation spike of ~1.3GB/s - much lower than what you suggest. The sustained allocation rate over longer intervals seems to be ~273MB/s (the gc,ihop "Adaptive IHOP information" messages show that), both an order of magnitude lower than suggested. So the ~12.8GB of headroom left by the 80% IHOP for the ~10s maximum marking length observed seems more than enough to cover these.

[You could give the adaptive IHOP additional buffer for these allocations by increasing G1ReservePercent (default 10) if you want: this setting tells G1 exactly how much space it should reserve for allocation spikes. The 10% means 6.4GB, so this would already be lower than what the adaptive IHOP suggests. The 10% is another heuristic set to a "random" value that fits most applications, but not all: if you know more than G1, you need to tell it. It cannot predict the future, and does not know all applications. Setting G1ReservePercent to a very high value by default to accommodate this particular application would significantly impair other applications, as this takes away heap to use during regular operation.]

[Note that the marking length has not stabilized yet in that run, so it might be longer in the future, but adaptive IHOP is already catching up to that; also, there are still some marking improvements pending that should land in a JDK 9 EA build, making it faster.]

To me the problem is that after waiting until the 80% IHOP and the following marking has completed, the composition of regions is different: there seem to be mostly "expensive" regions left to collect at that point. Due to heuristic 1) mentioned in the other comment, G1 is forced to take lots of them (actually, many mixed GCs only contain such regions), breaking the pause time goal by a lot.

As there is plenty of space, I recommend increasing G1HeapWastePercent (so that G1 does not try to reclaim "everything"), increasing G1MixedGCCountTarget (so that the minimum number of these expensive regions to take per mixed GC is smaller), and potentially lowering G1NewSizePercent so that G1 has more time left for any remaining expensive regions during mixed GCs (try in that order). As mentioned, I think 8u suffers from the same issues and could benefit from the same tuning. Or just disable adaptive IHOP to get the old behavior if you like that better for now, as I do not see a particular problem with adaptive IHOP here, but with the heuristic defaults (which were merely surfaced by the higher IHOP values selected by the adaptiveness algorithm; you can also increase G1ReservePercent to make the maximum adaptive IHOP lower).

If you have time, please continue trying to create a reproducer. It would help us tremendously in improving the heuristics and out-of-box performance for applications with such profiles. It would also be very nice to share logs of the application running on 8u (or 9 with adaptive IHOP disabled) to analyze this issue a bit further (please consider sending an email to the hotspot-gc-use mailing list though; it's nicer to discuss there than in the bug tracker).
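Purely as an illustration of the suggestions above, a tuning experiment might start from flags along these lines; the concrete values (8, 16, 2, 15) are assumptions to iterate on, not values prescribed in this comment, and G1NewSizePercent is an experimental flag, so it may additionally need -XX:+UnlockExperimentalVMOptions:

-XX:G1HeapWastePercent=8 -XX:G1MixedGCCountTarget=16 -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=2 -XX:G1ReservePercent=15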
23-09-2016

Received additional response from submitter == Right now I can reproduce it only in an environment similar to production, under a certain workload. Sadly, it is hardly possible to share this exact thing. I will try to create a simple test case targeted specifically at reproducing this problem, but it is going to take some time. In the meantime, I can add that under this particular workload our application experiences significant allocation rate spikes - up to 10GB/s, while the steady allocation rate is 3-4GB/s. So the allocation rate is really unpredictable in this case, and I think this fact makes the adaptive IHOP algorithm choose wrong values for InitiatingHeapOccupancyPercent at runtime - up to 80% - hence the long pauses ==
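No reproducer is available yet; as a starting point, a minimal sketch of a bursty allocator along the lines the submitter describes (a steady allocation rate with periodic spikes, and a small surviving fraction that reaches the old generation) could look like the following. All sizes, rates and the survivor structure are assumptions for illustration, not taken from the actual application:

import java.util.ArrayDeque;
import java.util.concurrent.ThreadLocalRandom;

// Bursty-allocation sketch: mostly short-lived garbage with periodic
// allocation spikes, plus a bounded rolling set of survivors so that
// some data is promoted to the old generation and mixed GCs have work.
public class BurstyAllocator {
    public static void main(String[] args) throws InterruptedException {
        ArrayDeque<byte[]> survivors = new ArrayDeque<>();
        long iteration = 0;
        while (true) {
            // Every ~1000 iterations, simulate an allocation spike by
            // allocating a much larger batch of temporary objects.
            int chunks = (iteration % 1000 == 0) ? 4096 : 512;
            for (int i = 0; i < chunks; i++) {
                byte[] tmp = new byte[64 * 1024];        // short-lived garbage
                if (ThreadLocalRandom.current().nextInt(100) < 2) {
                    survivors.add(tmp);                  // ~2% survive for a while
                }
            }
            // Keep the survivor set bounded so old gen occupancy grows and
            // shrinks instead of leaking.
            while (survivors.size() > 50_000) {
                survivors.poll();
            }
            iteration++;
            Thread.sleep(1);                             // pace the steady rate
        }
    }
}

Run with a large heap and the flags from the description; the spike size, survivor fraction and sleep interval would need tuning to approach the reported 3-4GB/s steady and ~10GB/s peak rates.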
23-09-2016

Submitter claims that "We ran our application on JDK 9-ea+131 and generated a workload similar to the one we have in production." Requested an application/test case to reproduce this issue. == Could you please share the application/test case to reproduce this issue? ==
22-09-2016

This issue is only observed in 9. Possibly JDK-8136677 could be the reason for the long GC pauses. The attachment contains the debug GC logs.
22-09-2016

The actual problem seems to be a combination of very high pause time expectations (40ms) on a large heap (64G) and too aggressive default values for some heuristics, not necessarily the enabling of adaptive IHOP. Adaptive IHOP only made the problem noticeable: from what I can see, none of the mixed GCs keeps the pause time goal. E.g., taking GC 20029:

1) GC(20029) Added expensive regions to CSet (old CSet region num not reached min). old: 152 regions, expensive: 152 regions, min: 152 regions, remaining time: 0.00ms
2) GC(20029) Finish choosing CSet. old: 152 regions, predicted old region time: 2062.64ms, time remaining: 0.00
3) GC(20029) continue mixed GCs (candidate old regions available). candidate old regions: 150 reclaimable: 4391289584 (6.39) threshold: 5

I.e. line 3 tells you that, while trying to meet the expected heap wastage of 5% in the candidate collection set to complete incremental cleanup, G1 knowingly took "expensive regions" (line 1) because settings/heuristics told it to do so.

[G1 always tries to reclaim the "easy" regions first (i.e. the "garbage first" principle); the more that has been reclaimed, the more expensive the individual regions get, particularly when reclaiming the last few percent to reach the end of the old generation reclamation goal. To maintain some forward progress in collecting the old gen heap, there is a minimum number of old gen regions G1 will always take every GC.]

The reason I said the settings are too aggressive for this application is that even the first mixed GC of any mixed GC phase (which can use the least expensive regions) already takes in "expensive regions" and is expected to blow the pause time goal (e.g. GC 20023).

There are several options here (to mitigate or work around the issue):
- Do not bother with reclaiming the last few percent of wastage; there seems to be enough free space to operate within anyway, as adaptive IHOP increased the IHOP to ~80% instead of the previous default of 45%. (Which should already have translated into much better memory usage and significantly higher throughput compared to JDK 8u, btw.) One suggestion is to increase G1HeapWastePercent from the default 5% to something like 7 or 8.
- Increase G1MixedGCCountTarget (default 8) to lower the minimum number of old gen regions to take per mixed GC, resulting in more (but shorter) mixed GCs.
- Another generic option to give G1 more time for old gen regions during (mixed) GCs is to decrease G1NewSizePercent (default 5) to something lower than 2 or 3 on such large heaps with this time goal.

Some tuning of these suggested options might make the issue disappear with increased throughput compared to 8u (it might be nice to provide logs for a comparable 8u run too). Of course, disabling adaptive IHOP as suggested might be another viable workaround.
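As a quick sanity check on line 3 above, the 6.39 figure appears to be the reclaimable space in the remaining candidate regions expressed as a percentage of the 64G heap:

4391289584 bytes / (64 * 1024^3 bytes) ≈ 4.09G / 64G ≈ 6.39%

which is still above the 5% threshold (G1HeapWastePercent), so G1 keeps scheduling mixed GCs even though only expensive candidate regions remain.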
22-09-2016