JDK-6442774 : CMS cycle blocks a direct old gen allocation and is slow
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 6
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2006-06-23
  • Updated: 2010-04-04
  • Resolved: 2006-07-20
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.
JDK 6
6 b92 Fixed
Related Reports
Relates :  
Relates :  
Description
The scenario is that a CMS marking cycle starts; not only does it block the application, but the cycle itself is also somehow very slow. I can easily reproduce it with jbb2005 on an 8-way SPARC with the following command-line parameters:

set heap_size = 64m
-XX:ParallelGCThreads=8
-XX:+DisableExplicitGC
-Xms${heap_size}
-Xmx${heap_size}
-XX:TLABSize=32k
-XX:+UseConcMarkSweepGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:PrintCMSStatistics=2
-XX:+CMSScavengeBeforeRemark
-XX:+PrintClassHistogram

Here's the output:

Loading Warehouse 1...
1.550: [GC 1.550: [ParNew: 13184K->1600K(14784K), 0.0214458 secs] 13184K->3095K(63936K), 0.0217076 secs]
1.805: [GC 1.805: [ParNew: 14784K->1600K(14784K), 0.0301589 secs] 16279K->6038K(63936K), 0.0303027 secs]
2.082: [GC 2.082: [ParNew: 14784K->1600K(14784K), 0.0476443 secs] 19222K->11914K(63936K), 0.0477794 secs]
2.359: [GC 2.359: [ParNew: 14784K->1600K(14784K), 0.0633704 secs] 25098K->19846K(63936K), 0.0635313 secs]
2.642: [GC 2.643: [ParNew: 14784K->1600K(14784K), 0.0599469 secs] 33030K->27771K(63936K), 0.0600886 secs]
2.703: [GC [1 CMS-initial-mark: 26171K(49152K)] 28015K(63936K), 0.0132906 secs]
2.717: [CMS-concurrent-mark-start]

=> the app gets stuck here for a long time, but eventually it gets unstuck

Finished cms space scanning in 1th thread: 28.661 sec
Finished cms space scanning in 0th thread: 28.662 sec
Finished perm space scanning in 1th thread: 0.052 sec
Finished perm space scanning in 0th thread: 0.054 sec
Finished work stealing in 1th thread: 0.003 sec
Finished work stealing in 0th thread: 0.000 sec
31.433: [CMS-concurrent-mark: 28.338/28.716 secs]
 (CMS-concurrent-mark yielded 591546 times)
31.433: [CMS-concurrent-preclean-start]

Notice the very long marking cycle (28.6 sec!!! for a 64MB heap) and the high number of yields (almost 600,000).

For comparison, the normal behavior is this (with -XX:ParallelCMSThreads=1):

Loading Warehouse 1...
1.554: [GC 1.555: [ParNew: 13184K->1600K(14784K), 0.0255298 secs] 13184K->3094K(63936K), 0.0257375 secs]
1.815: [GC 1.815: [ParNew: 14784K->1600K(14784K), 0.0349030 secs] 16278K->6038K(63936K), 0.0350438 secs]
2.096: [GC 2.096: [ParNew: 14784K->1600K(14784K), 0.0644119 secs] 19222K->11914K(63936K), 0.0645546 secs]
2.383: [GC 2.383: [ParNew: 14784K->1600K(14784K), 0.0737211 secs] 25098K->19846K(63936K), 0.0738495 secs]
2.673: [GC 2.673: [ParNew: 14784K->1600K(14784K), 0.0755298 secs] 33030K->27771K(63936K), 0.0756629 secs]
2.749: [GC [1 CMS-initial-mark: 26171K(49152K)] 28015K(63936K), 0.0136178 secs]
2.763: [CMS-concurrent-mark-start]


3.107: [CMS-concurrent-mark: 0.329/0.344 secs]
 (CMS-concurrent-mark yielded 6538 times)
3.107: [CMS-concurrent-preclean-start]
 (cardTable: 971 cards, re-scanned 971 cards, 1 iterations)
3.116: [CMS-concurrent-preclean: 0.007/0.009 secs]
 (CMS-concurrent-preclean yielded 225 times)
3.116: [CMS-concurrent-abortable-preclean-start]
 (cardTable: 191 cards, re-scanned 191 cards, 1 iterations)
 (cardTable: 2 cards, re-scanned 2 cards, 1 iterations)

I managed to hit this bug with SPECjvm98 too. These are the command-line parameters I used.

java -XX:+UseConcMarkSweepGC -XX:NewSize=8m -XX:+DisableExplicitGC -XX:+PrintGCDetails -Xms32m -Xmx32m -XX:CMSInitiatingOccupancyFraction=10 SpecApplication -g _201_compress _202_jess _209_db _213_javac _222_mpegaudio _227_mtrt _228_jack

Notice that I start a CMS cycle very early, to increase the probability of a CMS cycle taking place during class loading and, hence, during direct allocation.

This is what I get with the original JVM:

[CMS-concurrent-mark: 2.613/2.647 secs]
  (CMS-concurrent-mark yielded 53247 times)

...and this is the output with the fix:

[CMS-concurrent-mark: 0.011/0.096 secs]
  (CMS-concurrent-mark yielded 9 times)

Again, this is very reproducible.

Comments
EVALUATION A few more observations. For the bug to happen, direct allocation needs to take place during a CMS cycle. It seems that this is not a common event. When I managed to reproduce it, direct allocation happened to the perm gen during class loading and resolution. I had to start a CMS cycle early (either by setting the initiating occupancy to something very low or by using a very small heap) so that it started before most classes were loaded. Additionally, a single direct allocation during a CMS cycle might not uncover the bug, as it will slow down the cycle but not by much. It's when several direct allocations take place in succession (e.g., when a class is loaded) that the bug is uncovered and its severity multiplied.
29-06-2006

SUGGESTED FIX A quick fix is to sleep in the yield code (between releasing the bitmap lock and reacquiring it) while ConcurrentMarkSweepThread::should_yield() returns true, to give the mutator thread a chance to wake up, get the lock, do its work, and unset the "should yield" flag. Since it's close to the Mustang code freeze date, I also suggest that we only do a bounded number of sleeps (N) before reacquiring the lock, to avoid getting into an infinite loop. We should really use the wait/notify pattern for the interaction between the threads in question, but that is a riskier fix, so we can do it for Dolphin and a Mustang update release. I'll open a second bug for this.
29-06-2006
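A stand-alone Java sketch of this suggested fix (hypothetical names; this is not HotSpot code): on a yield request, the marking thread releases the lock and then sleeps in short increments, bounded by MAX_SLEEPS, while the request is pending. This gives the requesting thread a realistic window in which to acquire the lock and lower the flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the suggested fix (not HotSpot code): after
// releasing the lock, the marker sleeps while the yield request is
// pending, bounded by MAX_SLEEPS so it cannot loop forever.
public class YieldWithSleep {
    static final ReentrantLock bitmapLock = new ReentrantLock();
    static final AtomicBoolean yieldRequested = new AtomicBoolean(false);
    static final AtomicBoolean done = new AtomicBoolean(false);
    static final int MAX_SLEEPS = 100;   // bound on the number of sleeps

    // Called by a marking thread that currently holds bitmapLock.
    static void yieldIfRequested() {
        if (!yieldRequested.get()) return;
        bitmapLock.unlock();             // give the lock up...
        int sleeps = 0;
        while (yieldRequested.get() && sleeps++ < MAX_SLEEPS) {
            try {
                Thread.sleep(1);         // ...and let the requester run
            } catch (InterruptedException e) {
                break;                   // not expected in this sketch
            }
        }
        bitmapLock.lock();               // resume marking
    }

    public static void main(String[] args) throws Exception {
        Thread marker = new Thread(() -> {
            bitmapLock.lock();           // marking holds the bitmap lock
            try {
                while (!done.get()) yieldIfRequested();
            } finally {
                bitmapLock.unlock();
            }
        });
        marker.start();
        Thread.sleep(10);                // let the marker take the lock
        yieldRequested.set(true);        // request a yield...
        bitmapLock.lock();               // ...and acquire promptly this time
        try {
            System.out.println("mutator acquired the lock");
        } finally {
            yieldRequested.set(false);   // work done: lower the request
            done.set(true);
            bitmapLock.unlock();
        }
        marker.join();
    }
}
```

The bound on sleeps mirrors the comment's concern about an infinite loop: even if the requester never shows up, the marker reacquires the lock after at most MAX_SLEEPS short waits and continues marking.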

EVALUATION The problem seems to be the following. An application thread A wants to do a direct allocation (into the perm gen). It signals a yield request and then tries to take the bitmap lock. The marking threads M1 and M2 notice the yield request and yield to the coordinator thread, which unlocks and then relocks the bitmap lock. Unfortunately, A fails to take the bitmap lock in the short period of time during which it was released by the coordinator, so it remains blocked. Additionally, it still has a pending yield request raised, which forces M1 and M2 to actually do a yield every time they try.
23-06-2006
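The interaction described above can be sketched as a small stand-alone Java program (hypothetical names; this is not HotSpot code). Because the marker "yields" by releasing and immediately reacquiring the lock, the blocked requester gets almost no window to acquire it, and the still-raised request forces a yield on every iteration, mirroring the ~600,000 yields in the log above.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-alone sketch of the buggy interaction (not HotSpot
// code): while the yield request stays raised, the marker releases and
// immediately reacquires the lock over and over, yielding on every
// loop iteration instead of letting the requester make progress.
public class YieldStorm {
    static final ReentrantLock bitmapLock = new ReentrantLock();
    static final AtomicBoolean yieldRequested = new AtomicBoolean(false);
    static final AtomicBoolean done = new AtomicBoolean(false);

    public static void main(String[] args) throws Exception {
        final long[] yields = {0};
        Thread marker = new Thread(() -> {
            bitmapLock.lock();           // marking holds the bitmap lock
            try {
                while (!done.get()) {
                    if (yieldRequested.get()) {
                        yields[0]++;
                        bitmapLock.unlock();  // "yield": release...
                        bitmapLock.lock();    // ...and immediately relock
                    }
                }
            } finally {
                bitmapLock.unlock();
            }
        });
        marker.start();
        Thread.sleep(10);                // let the marker take the lock
        yieldRequested.set(true);        // a thread raises a yield request...
        Thread.sleep(50);                // ...but loses the lock race for a while
        done.set(true);
        marker.join();
        // The request stayed raised the whole time, so the marker yielded
        // on every loop iteration during those 50 ms.
        System.out.println("marker yielded " + yields[0] + " times");
    }
}
```

Even over a 50 ms window the yield count reaches the thousands, which is the same pathology as the "yielded 591546 times" line in the log.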