Bug ID: JDK-6456806 Excessively long CMS preclean cycle eventually leads to concurrent mode failure

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 1.4.2_11
Priority: P2
Status: Resolved
Resolution: Fixed
OS: solaris_8
CPU: sparc
Submitted: 2006-08-03
Updated: 2014-02-27
Resolved: 2007-01-22
Other
1.4.2_14 b03Fixed
Customer has a voice over IP application running 1.4.2_11 that after a period of time
(usually 14-24 hours)
They have a pre-clean that takes quite some time.  We've seen from a few hundred seconds up to 6000+ 

GC options: -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
-XX:MaxNewSize=128m -XX:NewSize=128m -Xms1G -Xmx1G -XX:SurvivorRatio=6
-XX:MaxTenuringThreshold=4 -XX:CMSInitiatingOccupancyFraction=60
-Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE
-Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE
-XX:+DisableExplicitGC -XX:+UseTLAB -verbose:gc
-XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -Xloggc:gc

an example of a 6000+ preclean:
90401.397: [GC  {Heap before GC invocations=6436:
Heap
par new generation   total 114688K, used 107474K [0xb5400000, 0xbd400000, 0xbd400000)
 eden space 98304K, 100% used [0xb5400000, 0xbb400000, 0xbb400000)
 from space 16384K,  55% used [0xbb400000, 0xbbcf4b70, 0xbc400000)
 to   space 16384K,   0% used [0xbc400000, 0xbc400000, 0xbd400000)
concurrent mark-sweep generation total 917504K, used 761505K [0xbd400000, 0xf5400000, 0xf5400000)
concurrent-mark-sweep perm gen total 51336K, used 30768K [0xf5400000, 0xf8622000, 0xf9400000)
90401.397: [ParNew (promotion failed): 107474K->107474K(114688K),
3.4757402 secs]90404.873: [CMS95947.013: [CMS-concurrent-preclean: 6187.143/6199.426 secs]
(concurrent mode failure): 763375K->507358K(917504K), 5560.5453822
secs] 868980K->507358K(1032192K) Heap after GC invocations=6437:
Heap
par new generation   total 114688K, used 0K [0xb5400000, 0xbd400000, 0xbd400000)
 eden space 98304K,   0% used [0xb5400000, 0xb5400000, 0xbb400000)
 from space 16384K,   0% used [0xbb400000, 0xbb400000, 0xbc400000)
 to   space 16384K,   0% used [0xbc400000, 0xbc400000, 0xbd400000)
concurrent mark-sweep generation total 917504K, used 507358K
[0xbd400000, 0xf5400000, 0xf5400000)
concurrent-mark-sweep perm gen total 51336K, used 30758K [0xf5400000, 0xf8622000, 0xf9400000) } , 5564.0254377 secs] 


The pattern in one of the gc logs running with -XX:PrintCMSStatistics=2 
shows the re-scan timings spiking.
At the time of the long prelcean there are 833857 cards rescaned.

113684.645: [ParNew (promotion failed): 107519K->107519K(114688K), 1.0264022 secs]113685.671: [CMS (cardTable: 0 cards, re-scanned 833857 cards, 3 iterations) 

Looks like a lot of cards re-scanned and just after it the number drops substantially.The one before it is ~4000.

I can see the ramp up where pre-clean was taking 2-8 seconds then starts to go up to the high teens and then some 40-60 seconds and then this one.

113684.644: [GC  {Heap before GC invocations=5683:
Heap
par new generation   total 114688K, used 107519K [0xb5400000, 0xbd400000, 0xbd400000)
 eden space 98304K, 100% used [0xb5400000, 0xbb400000, 0xbb400000)
 from space 16384K,  56% used [0xbc400000, 0xbccffed0, 0xbd400000)
 to   space 16384K,   0% used [0xbb400000, 0xbb400000, 0xbc400000)
concurrent mark-sweep generation total 917504K, used 809255K [0xbd400000, 0xf5400000, 0xf5400000)
concurrent-mark-sweep perm gen total 52888K, used 31705K [0xf5400000, 0xf87a6000, 0xf9400000)
113684.645: [ParNew (promotion failed): 107519K->107519K(114688K), 1.0264022 secs]113685.671: [CMS (cardTable: 0 cards, re-scanned 833857 cards, 3 iterations)
117418.012: [CMS-concurrent-preclean: 4482.648/4494.823 secs]
(CMS-concurrent-preclean yielded 0 times)
(concurrent mode failure): 815888K->451828K(917504K), 3748.8971119 secs] 916774K->451828K(1032192K) Heap after GC invocations=5684:
Heap
par new generation   total 114688K, used 0K [0xb5400000, 0xbd400000, 0xbd400000)
 eden space 98304K,   0% used [0xb5400000, 0xb5400000, 0xbb400000)
 from space 16384K,   0% used [0xbc400000, 0xbc400000, 0xbd400000)
 to   space 16384K,   0% used [0xbb400000, 0xbb400000, 0xbc400000)
concurrent mark-sweep generation total 917504K, used 451828K [0xbd400000, 0xf5400000, 0xf5400000)
concurrent-mark-sweep perm gen total 52888K, used 31693K [0xf5400000, 0xf87a6000, 0xf9400000)} , 3749.9312617 secs]


There is a jump from ~15 seconds to 57 to 0 then back to 15 then 4482 seconds between the CMS events.
The cards re-scanned are 156931 for the 57 second one
Drop to 1209
then the ramp up to the eventual 833857

112761.630: [CMS-concurrent-preclean: 14.293/14.561 secs]
(CMS-concurrent-preclean yielded 0 times)
112776.001: [CMS-concurrent-preclean-start]
112834.097: [CMS-concurrent-preclean: 57.002/58.097 secs]
(CMS-concurrent-preclean yielded 0 times)
112848.226: [CMS-concurrent-preclean-start]
112849.126: [CMS-concurrent-preclean: 0.650/0.900 secs]
(CMS-concurrent-preclean yielded 0 times)
112863.932: [CMS-concurrent-preclean-start]
112878.684: [CMS-concurrent-preclean: 14.478/14.751 secs]
(CMS-concurrent-preclean yielded 0 times)
112892.655: [CMS-concurrent-preclean-start]
112908.947: [CMS-concurrent-preclean: 15.788/16.292 secs]
(CMS-concurrent-preclean yielded 0 times)
112923.189: [CMS-concurrent-preclean-start]
117418.012: [CMS-concurrent-preclean: 4482.648/4494.823 secs]
(CMS-concurrent-preclean yielded 0 times)
117442.657: [CMS-concurrent-preclean-start]
117497.231: [CMS-concurrent-preclean: 54.569/54.574 secs]
(CMS-concurrent-preclean yielded 0 times)
117509.801: [CMS-concurrent-preclean-start]
117510.480: [CMS-concurrent-preclean: 0.679/0.680 secs]
(CMS-concurrent-preclean yielded 0 times)
117521.990: [CMS-concurrent-preclean-start]
117522.484: [CMS-concurrent-preclean: 0.493/0.494 secs]
(CMS-concurrent-preclean yielded 0 times)
117533.450: [CMS-concurrent-preclean-start]
117533.943: [CMS-concurrent-preclean: 0.493/0.494 secs]

The rate of card rescans, there's a big drop
here. Before that, it's typically (less than) 1 second per
1000 dirty cards scanned. Then suddenly for this particular
event it jumps up to about 5 times that value, some 5 seconds
per 1000 cards scanned.

Customer has sent in SAR data and there doesn't appear to be 
a burden on the system and paging/swap doesn't seem to
play a role in the impact of this behavior.
EVALUATION Backporting specific treatment for arrays in ScanMarkedObjectsAgainCarefullyClosure::do_object_careful from 5.0 to 1.4.2.
23-11-2006