Bug ID: JDK-6572569 CMS: consistently skewed work distribution indicated in (long) re-mark pauses

Details
Type:
Bug
Submit Date:
2007-06-21
Status:
Resolved
Updated Date:
2010-12-02
Project Name:
JDK
Resolved Date:
2010-10-07
Component:
hotspot
OS:
generic,solaris_10,linux_2.6
Sub-Component:
gc
CPU:
x86,sparc,generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
5.0u8,5.0u11,5.0u12,6u2
Fixed Versions:

Related Reports

Sub Tasks

Description
See comments section.

                                    

Comments
SUGGESTED FIX

(6572569)

When CMSScavengeBeforeRemark is set, we were assuming that a scavenge would have necessarily preceded a remark and that therefore the heap would already be in a parsable state. However, it is possible that the scavenge may not have been done because, for instance, a JNI critical section was held. The main CR here will need other work to deal with the issue found at the customer, but this is a fix for the problem with CMSScavengeBeforeRemark, which is a temporary workaround for this customer's performance issue as described in the bug report.

See:  http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev

(which also includes other fixes which you should elide in your reading for this CR).
                                     
2007-07-05
EVALUATION

The heap shape and workload are such that a CMS cycle starts and
finishes between two scavenges. Under these circumstances it
is possible for the Eden space parallelization to not work very
well. This can be partially worked around by means of
-XX:+CMSScavengeBeforeRemark.

Other heuristics to deal with this are also possible and will
be investigated while we await customer feedback on the efficacy
of +CMSScavengeBeforeRemark in their case.

SubCRs have been filed against releases earlier than 6.0 (in which
CMSScavengeBeforeRemark became a product flag) to make
CMSScavengeBeforeRemark a product flag in those releases as well.
See the subCRs for the relevant diffs (also with ###@###.###).

For the case of 7.0 a bug in CMSScavengeBeforeRemark
needed to be fixed. See the Suggested Fix section of the
details. That latter fix needs to be made in 7.0 and 6u3,
so an appropriate subCR for 6u3 has also been created.
                                     
2007-07-05
EVALUATION

See suggested fix section for the fix putback.

The CR is being kept open for remaining performance work
including heuristically determining the situation in
which CMSScavengeBeforeRemark is likely to help and/or
of dynamically toggling it as necessary.

That work will however happen at lower urgency, so the
priority of this bug will be lowered, based on some
preliminary performance numbers made available by the
customer that indicate the efficacy of this
flag as a workaround for the long remark problem.
                                     
2007-07-12
SUGGESTED FIX

Of the above, the diffs relevant to this bug are merely the following:

*** src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp-       Sun Jun 10 16:38:11 2007
--- src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp        Thu Jul 12 09:38:58 2007

*** 4688,4702 ****
    }
    assert(haveFreelistLocks(), "must have free list locks");
    assert_lock_strong(bitMapLock());
  
    if (!init_mark_was_synchronous) {
!     if (CMSScavengeBeforeRemark) {
!       // Heap already made parsable as a result of scavenge
!     } else {
        gch->ensure_parsability(false);  // fill TLAB's, but no need to retire them
-     }
      // Update the saved marks which may affect the root scans.
      gch->save_marks();
    
      {
        COMPILER2_PRESENT(DerivedPointerTableDeactivate dpt_deact;)
--- 4724,4746 ----
    }
    assert(haveFreelistLocks(), "must have free list locks");
    assert_lock_strong(bitMapLock());
  
    if (!init_mark_was_synchronous) {
!     // We might assume that we need not fill TLAB's when
!     // CMSScavengeBeforeRemark is set, because we may have just done
!     // a scavenge which would have filled all TLAB's -- and besides
!     // Eden would be empty. This however may not always be the case --
!     // for instance although we asked for a scavenge, it may not have
!     // happened because of a JNI critical section. We probably need
!     // a policy for deciding whether we can in that case wait until
!     // the critical section releases and then do the remark following
!     // the scavenge, and skip it here. In the absence of that policy,
!     // or of an indication of whether the scavenge did indeed occur,
!     // we cannot rely on TLAB's having been filled and must do
!     // so here just in case a scavenge did not happen.
      gch->ensure_parsability(false);  // fill TLAB's, but no need to retire them
      // Update the saved marks which may affect the root scans.
      gch->save_marks();
    
      {
        COMPILER2_PRESENT(DerivedPointerTableDeactivate dpt_deact;)
                                     
2007-07-12
SUGGESTED FIX

From  	View message header detail "Y. S. Ramakrishna" <###@###.###> 
Sent  	Thursday, July 12, 2007 11:36 am
To  	###@###.### 
Subject  	Code Manager notification (putback-to)

Event:            putback-to
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/main/gc_baseline
                  (jano2.sfbay:/export2/hotspot/ws/main/gc_baseline)
Child workspace:  /net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace
                  (prt-web:/net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace)
User:             ysr

Comment:

---------------------------------------------------------

Job ID:                 20070712093851.ysr.mustang
Original workspace:     neeraja:/net/jano2.sfbay/export2/hotspot/users/ysr/mustang
Submitter:              ysr
Archived data:          /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/
Webrev:                 http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html

Fixed   6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses

   http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev


(6558100)

When CMS marking (either during parallel rescan or parallel reference processing)
runs out of space on the per-worker work queues, the overflown grey objects
are tracked by chaining through their mark word. In this case, we had two
bugs: firstly, the method that took a prefix of the overflow list was not
re-attaching the intended suffix correctly (this affects all JVM's going
back to 1.4.2_14); secondly, the parallel reference processing code was
entirely neglecting to process the overflow list (this affects JVM's going
back to 5.0). The crucial debugging breakthrough came when Poonam used
the SA to track down the objects that CMS remark was declaring as
unreachable but unmarked, and found that they occurred in long chains
linked via their mark word (but with the promoted bit not set, which
helped distinguish them from the promoted chains that ParNew uses, and
identified them as broken fragments of an erstwhile overflow list).
Many thanks to Poonam Bajaj and Thomas Viessmann for crucial
debugging help. The customer has since run with a version of 6u2
with the fix (thanks Poonam) and verified that the previous crash
does not reproduce in > 2 days (previously the crash would happen in
about 4 hours).

Some debugging code was added as well as some asserts relaxed
to allow for the possibility of examining an object lying at the end
of the overflow list. This latter issue will be more thoroughly revisited
and cleaned up under a separate bug id.

(6572569)

When CMSScavengeBeforeRemark is set, we were assuming that a scavenge
would have necessarily preceded a remark and that therefore the heap
would already be in a parsable state. However, it is possible that
the scavenge may not have been done because, for instance, a JNI
critical section was held. The main CR here will need other work to
deal with the issue found at the customer, but this is a fix for
the problem with CMSScavengeBeforeRemark which is a temporary workaround
to this customer's performance issue as described in the bug report.
Thanks to Chris Phillips for testing and backport help with 5uXX where
the problem manifested most readily.

Reviewed by: Jon Masamitsu & Andrey Petrusenko

Fix Verified: y

Verification Testing:
 6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
 6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces

Other testing:
 PRT (also with CMS stress options)
 refworkload, runThese -quick and -testbase

Note added in proof: Some late breaking big apps testing using the
stress flags yesterday revealed an as-yet-undiagnosed issue when
running Tomcat and ATG. Thanks to Ashwin for finding this issue,
which is being tracked under CR 6578335.

Files:
update: src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp

Examined files: 3991

Contents Summary:
       3   update
    3988   no action (unchanged)
                                     
2007-07-12
SUGGESTED FIX

One simple approach towards fixing this problem is to not operate the
phase timeout until at least one scavenge has occurred during the phase,
i.e. something along the lines of:

    if (time_spent_in_phase > MAX(max_default, 2 * recent_inter_scavenge_duration)
        && at_least_one_scavenge_during_phase)
    then abort_phase.

We should see if one of the customers (or a suitable in-house configuration)
can test/verify the efficacy of such a heuristic across a range of
conditions.
                                     
2007-10-17
WORK AROUND

-XX:+CMSScavengeBeforeRemark is a partial workaround.
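A hypothetical invocation applying the workaround is shown below. Only the two CMS flags come from this CR; the application jar and the logging flag are illustrative placeholders:

```shell
# Illustrative command line; app.jar is a placeholder.
java -XX:+UseConcMarkSweepGC \
     -XX:+CMSScavengeBeforeRemark \
     -XX:+PrintGCDetails \
     -jar app.jar
```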
                                     
2009-11-30
EVALUATION

The balance of the work remaining to be done here has been transferred to
shadow CR 6990419  CMS: Remaining work for 6572569: consistently skewed work distribution in (long) re-mark pauses.

I am closing this as fixed in 7. I would have to do a bit of archeology to
determine the exact build of JDK 7 in which the fix for CMSScavengeBeforeRemark
integrated, but until then here is the JPRT archive link (and I am using
7b01 as the build we fixed in, which is almost certainly a lie):

http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html
                                     
2010-10-07


