On a customer test system we see increased fragmentation with 6u21 and later,
leading to promotion failures, which we could not observe with 6u20 and earlier.
I am attaching two log files that show the issue. Here is the interesting
section:
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 475074
Max Chunk Size: 254
Number of Blocks: 8584
Av. Block Size: 55
free=105894392 frag=1,0000 <<<<<<<<<<<
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
Statistics for IndexedFreeLists:
--------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
free=0 frag=0,0000
9421,892: [ParNew (promotion failed) <<<<<<<<<<<<<<<<<<
Desired survivor size 163840 bytes, new threshold 0 (max 0)
: 48512K->48512K(48832K), 0,2599337 secs]9422,152: [CMS (concurrent mode failure)size[2]
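For reference, the frag value printed by -XX:PrintFLSStatistics=1 is a fragmentation score between 0 (all free space in one contiguous chunk) and 1 (free space scattered across many small blocks); the comma in frag=1,0000 is just locale-dependent decimal formatting, i.e. 1.0000. Below is a minimal sketch of how such a score can be computed, assuming the usual form of one minus the sum of squared free-chunk sizes over the square of the total free space; the class name and example values are mine, with the block counts mirroring the indexed-free-list stats above.

// Illustrative only: a sketch of a fragmentation score of the form
// frag = 1 - (sum of squared free-chunk sizes) / (total free space)^2,
// which is 0 for one contiguous chunk and approaches 1 for many small chunks.
public class FlsFragSketch {
    static double frag(long[] freeChunkSizes) {
        double total = 0.0;
        double sumOfSquares = 0.0;
        for (long size : freeChunkSizes) {
            total += size;
            sumOfSquares += (double) size * (double) size;
        }
        return total > 0.0 ? 1.0 - sumOfSquares / (total * total) : 0.0;
    }

    public static void main(String[] args) {
        // All free space in a single chunk: no fragmentation.
        System.out.println(frag(new long[] { 475_074 }));   // 0.0
        // 8584 blocks averaging 55 units, as in the indexed-free-list stats above.
        long[] scattered = new long[8_584];
        java.util.Arrays.fill(scattered, 55);
        System.out.println(frag(scattered));                 // ~0.9999: essentially fully fragmented
    }
}

Read that way, frag=1,0000 in the excerpt above indicates that the free space is essentially completely scattered into small blocks; the indexed free lists alone hold 8584 blocks averaging 55 units, with a max chunk of only 254.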
A second customer reports an issue with 6u22, running with the following JVM options:
-XX:+CMSConcurrentMTEnabled -XX:CMSInitiatingOccupancyFraction=60 -XX:CMSMaxAbortablePrecleanTime=3600000 -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled -XX:+CMSScavengeBeforeRemark -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=32212254720 -XX:MaxHeapSize=32212254720 -XX:MaxNewSize=2147483648 -XX:MaxTenuringThreshold=4 -XX:NewSize=2147483648 -XX:ParallelCMSThreads=8 -XX:PermSize=67108864 -XX:+PrintCommandLineFlags -XX:PrintFLSStatistics=1 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+UseBiasedLocking -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseLargePages -XX:+UseMembar -XX:+UseNUMA -XX:+UseParNewGC
After 14 CMS cycles with 6u22, they report a promotion failure that happened when the tenured space had filled to only 15G. This is strange, as the heap usually fills to 20G before a CMS cycle is triggered.
Here is the pertinent section from the GC log:
2010-11-18T19:30:38.703-0600: [GC Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 2004318869
Max Chunk Size: 57061
Number of Blocks: 711866
Av. Block Size: 2815
Tree Height: 90
Before GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 1786368
Max Chunk Size: 1786368
Number of Blocks: 1
Av. Block Size: 1786368
Tree Height: 1
[ParNew (promotion failed)
Desired survivor size 107347968 bytes, new threshold 1 (max 4)
- age 1: 152613256 bytes, 152613256 total
: 1861587K->1887488K(1887488K), 1.7214606 secs][CMSCMS: Large block 0xfffffd797ec6d6a0
: 13801830K->2261237K(29360128K), 46.1976857 secs] 15542483K->2261237K(31247616K), [CMS Perm : 47619K->47461K(79492K)]After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: -826325716
Max Chunk Size: -826325716
Number of Blocks: 1
Av. Block Size: -826325716
Tree Height: 1
After GC:
Statistics for BinaryTreeDictionary:
------------------------------------
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
, 47.9259930 secs] [Times: user=50.62 sys=0.02, real=47.93 secs]
Engineering reported that this almost looks like some kind of bug in CMS allocation, because there is plenty of free space (and, by comparison, not that much promotion) when the promotion failure occurs, although full GC logs would be needed before one could be confident of that assessment. It would therefore be worthwhile to investigate this closely and see why it is happening. I do not think this is a tuning issue, but something else.
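To put rough numbers on that observation, and assuming PrintFLSStatistics reports sizes in heap words (8 bytes on a 64-bit JVM), the pre-GC BinaryTreeDictionary stats above translate to about 16 GB of free space in the CMS generation with no contiguous chunk larger than roughly 450 KB, while the tenuring distribution shows only about 150 MB of age-1 survivor data to promote. A quick back-of-the-envelope check (values copied from the log, the unit conversion is my assumption):

// Back-of-the-envelope conversion of the pre-GC FLS statistics in the log above.
// Assumption: the statistics are reported in heap words, 8 bytes each on a 64-bit JVM.
public class PromotionFailureMath {
    public static void main(String[] args) {
        final long BYTES_PER_WORD = 8;            // assumed word size
        long totalFreeWords = 2_004_318_869L;     // "Total Free Space" before GC
        long maxChunkWords  = 57_061L;            // "Max Chunk Size" before GC
        long survivorBytes  = 152_613_256L;       // "age 1: 152613256 bytes" from the tenuring distribution

        System.out.printf("total free ~ %.1f GB%n", totalFreeWords * BYTES_PER_WORD / 1e9); // ~16.0 GB
        System.out.printf("max chunk  ~ %.0f KB%n", maxChunkWords  * BYTES_PER_WORD / 1e3); // ~456 KB
        System.out.printf("to promote ~ %.0f MB%n", survivorBytes / 1e6);                   // ~153 MB
    }
}

Those figures are also consistent with the occupancy shown in the GC line itself (13801830K used of 29360128K in the CMS generation), i.e. the failure happened with far more free space than promotion demand but no sufficiently large contiguous chunks, which supports the fragmentation reading rather than a simple lack of space.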
Also, in 6u21 we integrated 6631166, which is aimed at reducing fragmentation.
If at all possible, could you down-rev to 6u20 (i.e. pre-6u21) and see
whether the onset of fragmentation changes in any manner?
The customer repeated the test with 6u20: there were no promotion failures, even after 7000 cycles, and CMS cycles were always triggered after the tenured generation had filled to 20G.