Bug ID: JDK-2150439 CMS crash following parallel work queue overflow

Type: Backport
Backport of: JDK-6558100
Component: hotspot
Sub-Component: gc

Priority: P2
Status: Resolved
Resolution: Fixed

Submitted: 2007-06-22
Updated: 2010-12-08
Resolved: 2007-07-24

JDK 6	Other
6u4 b02 Fixed	hs11 Fixed

EVALUATION Fixed in 6u3 b02. See parent CR for related CRs.

17-07-2007

SUGGESTED FIX

Event: putback-to
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/1.6/update3/baseline
    (jano2.sfbay:/export2/hotspot/ws/1.6/update3/baseline)
Child workspace: /net/prt-web.sfbay/prt-workspaces/20070717115529.ysr.hx3/workspace
    (prt-web:/net/prt-web.sfbay/prt-workspaces/20070717115529.ysr.hx3/workspace)
User: ysr

Comment:
---------------------------------------------------------
Job ID: 20070717115529.ysr.hx3
Original workspace: karachi:/net/spot/workspaces/ysr/hx3
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070717115529.ysr.hx3/
Webrev: http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070717115529.ysr.hx3/workspace/webrevs/webrev-2007.07.17/index.html

Fixed 6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses

http://analemma.sfbay/net/spot/workspaces/ysr/hx3/webrev

Approved for 6u3b02 by HotSpot P-Team (Penni Henry)

(6558100) When CMS marking (either during parallel rescan or parallel reference processing) runs out of space on the per-worker work queues, the overflown grey objects are tracked by chaining through their mark word. In this case, we had two bugs: firstly, the method that took a prefix of the overflow list was not re-attaching the intended suffix correctly (this affects all JVM's going back to 1.4.2_14); secondly, the parallel reference processing code was entirely neglecting to process the overflow list (this affects JVM's going back to 5.0).

The crucial debugging breakthrough came when Poonam used the SA to track down the objects that CMS remark was declaring as unreachable but unmarked, and found that they occurred in long chains linked via their mark word (but with the promoted bit not set, which helped distinguish them from the promoted chains that ParNew uses, and identified them as broken fragments of an erstwhile overflow list).
Many thanks to Poonam Bajaj and Thomas Viessmann for crucial debugging help. The customer has since run with a version of 6u2 with the fix (thanks Poonam) and verified that the previous crash does not reproduce in > 2 days (previously the crash would happen in about 4 hours).

Some debugging code was added as well as some asserts relaxed to allow for the possibility of examining an object lying at the end of the overflow list. This latter issue will be more thoroughly revisited and cleaned up under a separate bug id.

(6572569) When CMSScavengeBeforeRemark is set, we were assuming that a scavenge would have necessarily preceded a remark and that therefore the heap would already be in a parsable state. However, it is possible that the scavenge may not have been done because, for instance, a JNI critical section was held. The main CR here will need other work to deal with the issue found at the customer, but this is a fix for the problem with CMSScavengeBeforeRemark which is a temporary workaround to this customer's performance issue as described in the bug report. Thanks to Chris Phillips for testing and backport help with 5uXX where the problem manifested most readily.

Reviewed by: Jon Masamitsu & Andrey Petrusenko

Fix Verified: y

Verification Testing:
    6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
    6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces

Other testing:
    PRT (also with CMS stress options)
    refworkload, runThese -quick and -testbase

Note added in proof: Some late breaking big apps testing using the stress flags yesterday revealed an as-yet-undiagnosed issue when running Tomcat and ATG. Thanks to Ashwin for finding this issue, which is being tracked under CR 6578335.
Files:
    update: src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
    update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
    update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp

Examined files: 4209

Contents Summary:
    3 update
    4206 no action (unchanged)
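
Illustrative sketch (not part of the putback): the prefix-taking step described under 6558100 can be shown with a minimal, single-threaded C++ example. This is not the HotSpot code. The names Obj, OverflowList, take_prefix and chain_length are hypothetical, an ordinary next pointer stands in for the link that CMS threads through the mark word, and the atomics a parallel collector would need are omitted. It only shows that taking a prefix involves two splices: the list head must be pointed at the suffix, and the prefix must be terminated.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a grey object. In CMS the link is stored in the
// mark word; here it is a plain pointer field for clarity.
struct Obj {
    int id = 0;
    Obj* next = nullptr;
};

// Hypothetical overflow list (single-threaded sketch, no atomics).
struct OverflowList {
    Obj* head = nullptr;

    // Push one overflowed object onto the front of the list.
    void push(Obj* o) {
        assert(o != nullptr);
        o->next = head;
        head = o;
    }

    // Detach up to n objects from the front and return them as a
    // null-terminated chain. The remaining suffix stays reachable from head.
    Obj* take_prefix(std::size_t n) {
        if (head == nullptr || n == 0) return nullptr;
        Obj* prefix = head;
        Obj* last = head;
        for (std::size_t i = 1; i < n && last->next != nullptr; ++i) {
            last = last->next;
        }
        head = last->next;     // re-attach the suffix as the new list
        last->next = nullptr;  // terminate the prefix handed to the worker
        return prefix;
    }
};

// Count the objects in a null-terminated chain.
std::size_t chain_length(const Obj* o) {
    std::size_t n = 0;
    for (; o != nullptr; o = o->next) ++n;
    return n;
}
```

In this sketch, if the suffix were not stored back into head (or were stored only after the prefix had been terminated), the objects behind the prefix would no longer be reachable from the list. That is one way to end up with orphaned chain fragments of the kind described above, though the actual defect in the HotSpot method may have differed in detail.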

17-07-2007

EVALUATION See parent CR.

05-07-2007