Bug ID: JDK-6722116 CMS: Incorrect overflow handling when using parallel concurrent marking

Versions (Unresolved/Resolved/Fixed)

JDK 6	JDK 7	Other
6u12 (Fixed)	7 (Fixed)	hs10 (Fixed)

Here is a description from the Evaluation field of 6578335, during the
investigation of which this problem was first discovered:

There was a third bug found which relates to the handling of
"second ring overflow" when using parallel concurrent marking
-- the overflow of the global overflow stack (which itself handles
the overflow from the local work queues). The intention was
that this second ring overflow should use the "restart mechanism"
to restart marking from the least overflown address.
That mechanism was not completely extended to the parallel
concurrent marking case: the restart_addr was not pushed
all the way through to the task that controls parallel
concurrent marking. Because the state of that task was only
partially updated, we can, and often will, end up missing the
scan of some of the addresses at the higher extremes of the
CMS-collected generations. Because second-ring overflow is a
very rare event in practice, this appears not to have been
detected before (or at least not until the first two bugs
mentioned above were moved out of our way).
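
The restart mechanism described above can be illustrated with a small
single-threaded toy model. This is only a sketch and not HotSpot code:
the class RestartMarkingSketch, its fields, and the representation of
the heap as an int[][] of references are all hypothetical, and it
models neither the per-thread work queues nor the parallel task. It
shows only the idea that an object dropped on marking-stack overflow
stays marked, that the lowest dropped address is remembered, and that
marking is then restarted from that address.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Toy model (hypothetical names): addresses are array indices and
// refs[a] lists the addresses referenced by the object at address a.
class RestartMarkingSketch {
    static final int NO_RESTART = Integer.MAX_VALUE;

    final int[][] refs;
    final BitSet marked = new BitSet();
    final int stackLimit;          // capacity of the bounded marking stack
    int restartAddr = NO_RESTART;  // lowest address dropped on overflow

    RestartMarkingSketch(int[][] refs, int stackLimit) {
        this.refs = refs;
        this.stackLimit = stackLimit;
    }

    // Marks everything reachable from the roots; returns the number of passes.
    int markFrom(int... roots) {
        for (int r : roots) marked.set(r);
        int start = 0;
        int passes = 0;
        do {
            restartAddr = NO_RESTART;
            scanFrom(start);       // the restart address is the starting point
            start = restartAddr;
            passes++;
        } while (start != NO_RESTART);
        return passes;
    }

    // Sweeps the mark bitmap upward from start, scanning each marked object.
    private void scanFrom(int start) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int finger = marked.nextSetBit(start); finger >= 0;
                 finger = marked.nextSetBit(finger + 1)) {
            stack.push(finger);
            while (!stack.isEmpty()) {
                int obj = stack.pop();
                for (int target : refs[obj]) {
                    if (marked.get(target)) continue;
                    marked.set(target);
                    if (target > finger) continue;   // the sweep reaches it later
                    if (stack.size() < stackLimit) {
                        stack.push(target);
                    } else {
                        // Overflow: drop the object but remember where to restart.
                        restartAddr = Math.min(restartAddr, target);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        int[][] refs = new int[13][];
        for (int i = 0; i < refs.length; i++) refs[i] = new int[0];
        refs[9] = new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8}; // overflows a stack of 2
        refs[3] = new int[] {11};
        refs[5] = new int[] {10};
        RestartMarkingSketch m = new RestartMarkingSketch(refs, 2);
        System.out.println("passes=" + m.markFrom(9) + " marked=" + m.marked);
    }
}
```

In this model, objects 10 and 11 are reachable only through objects
that were dropped on overflow, so they are found only because the
second pass begins at the recorded restart address and rescans the
dropped objects.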

The obvious workaround is to switch off parallel concurrent
marking via -XX:-CMSConcurrentMTEnabled.

SUGGESTED FIX
changeset: 301:ebeb6490b814
parent:    299:387a62b4be60
user:      ysr
date:      Tue Aug 26 14:54:48 2008 -0700
summary:   6722116: CMS: Incorrect overflow handling when using parallel concurrent marking
28-08-2008
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/ebeb6490b814
27-08-2008
SUGGESTED FIX The following is under testing and review: http://analemma.sfbay.sun.com/net/neeraja/export/ysr/cms_cmt/webrev/
22-08-2008
WORK AROUND Disable parallel concurrent marking with -XX:-CMSConcurrentMTEnabled. Otherwise, increasing the size of the marking stack via -XX:CMSMarkStackSize and -XX:CMSMarkStackSizeMax would reduce the probability of hitting this bug.
03-07-2008
EVALUATION This bug has existed since 6.0, when parallel concurrent marking was first introduced. Because it involves the second level of overflow rather than the first, it occurs much less often (other than under really high stress conditions), so customers are not likely to run into it very frequently (I think).
03-07-2008
SUGGESTED FIX The restart_addr should be pushed down into the state of the ConcurrentMarkingTask and, further, the do_scan_and_mark() methods should use that restart address as the starting point when re-starting their marking work following an overflow-and-restart (rather than the bottom of the space as currently done).
03-07-2008

Duplicate :	JDK-6859466 - Java 6 u13 (64-bit) crashes on RHEL 5.2 (64-bit) in CMS; Need analysis of core file
Relates :	JDK-6611406 - C2 Crash in JVM_ArrayCopy
Relates :	JDK-6681372 - 64-bit VM CompilerThread received SEGV in ciObjectFactory::find_non_perm()
Relates :	JDK-6578335 - CMS: BigApps failure with -XX:CMSInitiatingOccupancyFraction=1 -XX:+CMSMarkStackOverflowALot ...
Relates :	JDK-6752663 - (audit) apply HSX-11 fix for 6722116 to HSX-12 and HSX-13
Relates :	JDK-6697967 - Java core file from Global Server prdgc01a --- 64-bit java process