JDK-6558100 : CMS crash following parallel work queue overflow
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version:
    1.4.2,1.4.2_04,1.4.2_05,1.4.2_06,1.4.2_14,1.4.2_15,5.0,5.0u9,5.0u2,5.0u10,5.0u12,5.0u11,5.0u7,5.0u8,6,6u1
  • Priority: P1
  • Status: Closed
  • Resolution: Fixed
  • OS:
    generic,linux_2.6,solaris,solaris_8,solaris_9,solaris_10
  • CPU: generic,x86,sparc
  • Submitted: 2007-05-16
  • Updated: 2011-03-09
  • Resolved: 2011-03-07
The Version table provides details about the release in which this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

Other                 JDK 6       JDK 7     Other
1.4.2_17,hs11 Fixed   6u4 Fixed   7 Fixed   hs11 Fixed
Description
When -XX:+ParallelRefProcEnabled is set, the JVM always crashes after
3-4 hours of running a customer application. Setting
-XX:-ParallelRefProcEnabled always prevents this crash.
Synopsis changed from:

CMS crash when -XX:+ParallelRefProcEnabled is set

to:

CMS crash following parallel work queue overflow

to better reflect the fact that this bug is not
limited to +ParallelRefProcEnabled, but rather
is of a much more severe nature.

Comments
EVALUATION Background:- 6558100 happens when there is task queue overflow in CMS parallel remark. In this case the overflowed objects are chained via their mark words. The bug is that we later forget to process these overflowed objects. In effect, the reachability closure is not computed past these objects, so any objects that are reachable only through these overflowed objects will not be marked and will be collected. These collected objects end up on the CMS free list. When the next collection starts, the marking phase might reach these now-freed objects, and the marker barfs trying to scan these free blocks as though they were objects.
Note:- The symptom of the collector thread barfing while marking could potentially come from a host of possible root causes in the VM, including missing card-marks, a bug in CMS precleaning or, in this case, a bug in CMS remark (or parallel reference processing).
Identifying an instance of 6263371 as a duplicate of 6558100:- The key to the identification is that, although the overflowed objects are not scanned (so the objects they point to are collected prematurely), the overflowed objects themselves are not collected, because they were marked before being placed on the overflow list. The marker barfs because it is trying to scan one of these prematurely collected objects, which is referenced by some field of an overflowed object. Connect to the core file using the SA, give it the address of the oop that we were trying to scan when we segv'd, and ask it to find all locations in the heap that contain that address. There are likely to be only a few (for reasons that will become clear below). Look at each of these locations in turn. Each will be a normal object, except that it will have a strange-looking mark word. The mark word will hold the address of another object, which will likewise be a normal-looking object with a strange-looking mark word, and so on. You have just found part of the overflow list from the previous remark, and have identified an instance of 6558100.
Epilogue:- The above is just one symptom of 6558100. There are likely to be others. In particular, since the mark word has, in effect, been clobbered and might have contained locking information or an identity hash code, any computation on an object that involves synchronization or hashcode use could return the wrong answer. Possibilities include IllegalMonitorStateException, possibly a biased locking malfunction (although I have not worked this through in detail), or other such weird behaviour.
14-08-2007
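The identification steps described above can be sketched on a toy heap. This is an illustrative model only, not HotSpot or SA code: ToyObj, kNormalMark, find_referrers and chain_length are hypothetical names, and the real mark word encoding is considerably more involved than a single "normal" value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy object (assumption: NOT HotSpot's oopDesc). `mark` stands in for the
// mark word, `fields` for the object's reference fields.
struct ToyObj {
    std::uintptr_t mark;
    std::vector<ToyObj*> fields;
};

// Hypothetical "ordinary" header value. It is odd, so it can never be
// mistaken for the address of an aligned object.
constexpr std::uintptr_t kNormalMark = 0x1;

// True if `addr` is the address of some object in the toy heap.
bool in_heap(const std::vector<ToyObj*>& heap, std::uintptr_t addr) {
    for (const ToyObj* o : heap)
        if (reinterpret_cast<std::uintptr_t>(o) == addr) return true;
    return false;
}

// Step 1: find every object with a field that holds `target`, i.e. the
// address the marker was scanning when it crashed.
std::vector<ToyObj*> find_referrers(const std::vector<ToyObj*>& heap,
                                    const ToyObj* target) {
    std::vector<ToyObj*> out;
    for (ToyObj* o : heap) {
        for (const ToyObj* f : o->fields) {
            if (f == target) { out.push_back(o); break; }
        }
    }
    return out;
}

// Step 2: starting at a referrer, follow mark words that look like the
// address of another heap object. A non-zero result suggests the object
// sits on a leftover overflow-list fragment. The walk is bounded by the
// heap size so a corrupt (cyclic) chain cannot loop forever.
std::size_t chain_length(const std::vector<ToyObj*>& heap, const ToyObj* start) {
    assert(start != nullptr);
    std::size_t links = 0;
    std::uintptr_t m = start->mark;
    while (links < heap.size() && in_heap(heap, m)) {
        ++links;
        m = reinterpret_cast<const ToyObj*>(m)->mark;
    }
    return links;
}
```

In this model, a referrer whose chain_length is greater than zero plays the role of the "normal object with a strange-looking mark word" from the evaluation above.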

WORK AROUND This CR includes two bugs: one in parallel reference processing (in 7, 6 and 5), for which the workaround is -XX:-ParallelRefProcEnabled; and another in parallel remark (in 7, 6, 5 and 1.4.2_14+), for which the workaround is -XX:-CMSParallelRemarkEnabled. Note that -ParallelRefProcEnabled is in fact the default, while +CMSParallelRemarkEnabled is the default. Turning off parallelism in either case can adversely affect CMS remark pauses.
08-08-2007

EVALUATION This bug is fixed (see suggested fix section). In the course of stress testing related to this bug, a new as-yet-undiagnosed bug came to light. That's being tracked under 6578335. This bug fix should be backported to 6, 5 and 1.4.2; see subCR's.
12-07-2007

SUGGESTED FIX
From: "Y. S. Ramakrishna" <###@###.###>
Sent: Thursday, July 12, 2007 11:36 am
To: ###@###.###
Subject: Code Manager notification (putback-to)
Event: putback-to
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/main/gc_baseline (jano2.sfbay:/export2/hotspot/ws/main/gc_baseline)
Child workspace: /net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace (prt-web:/net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace)
User: ysr
Comment:
---------------------------------------------------------
Job ID: 20070712093851.ysr.mustang
Original workspace: neeraja:/net/jano2.sfbay/export2/hotspot/users/ysr/mustang
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/
Webrev: http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html
Fixed 6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses
http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev
(6558100) When CMS marking (either during parallel rescan or parallel reference processing) runs out of space on the per-worker work queues, the overflowed grey objects are tracked by chaining through their mark word. In this case, we had two bugs: firstly, the method that took a prefix of the overflow list was not re-attaching the intended suffix correctly (this affects all JVMs going back to 1.4.2_14); secondly, the parallel reference processing code was entirely neglecting to process the overflow list (this affects JVMs going back to 5.0).
The crucial debugging breakthrough came when Poonam used the SA to track down the objects that CMS remark was declaring as unreachable but unmarked, and found that they occurred in long chains linked via their mark word (but with the promoted bit not set, which helped distinguish them from the promoted chains that ParNew uses, and identified them as broken fragments of an erstwhile overflow list). Many thanks to Poonam Bajaj and Thomas Viessmann for crucial debugging help. The customer has since run with a version of 6u2 with the fix (thanks Poonam) and verified that the previous crash does not reproduce in > 2 days (previously the crash would happen in about 4 hours).
Some debugging code was added, and some asserts were relaxed to allow for the possibility of examining an object lying at the end of the overflow list. This latter issue will be more thoroughly revisited and cleaned up under a separate bug id.
(6572569) When CMSScavengeBeforeRemark is set, we were assuming that a scavenge would have necessarily preceded a remark and that therefore the heap would already be in a parsable state. However, it is possible that the scavenge may not have been done because, for instance, a JNI critical section was held. The main CR here will need other work to deal with the issue found at the customer, but this is a fix for the problem with CMSScavengeBeforeRemark, which is a temporary workaround to this customer's performance issue as described in the bug report. Thanks to Chris Phillips for testing and backport help with 5uXX, where the problem manifested most readily.
Reviewed by: Jon Masamitsu & Andrey Petrusenko
Fix Verified: y
Verification Testing:
6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces
Other testing: PRT (also with CMS stress options) refworkload, runThese -quick and -testbase
Note added in proof: Some late breaking big apps testing using the stress flags yesterday revealed an as-yet-undiagnosed issue when running Tomcat and ATG. Thanks to Ashwin for finding this issue, which is being tracked under CR 6578335.
Files:
update: src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp
Examined files: 3991
Contents Summary: 3 update 3988 no action (unchanged)
12-07-2007
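The first of the two bugs described in this comment (taking a prefix of the overflow list without correctly re-attaching the suffix) can be illustrated with a minimal single-threaded sketch. It is not the HotSpot code: Node, OverflowList and take_prefix are hypothetical names, the `next` field stands in for the link stored in the mark word, and the atomic operations a parallel collector would need are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy node (assumption): `next` plays the role of the mark-word link.
struct Node {
    Node* next;
    int id;
};

struct OverflowList {
    Node* head = nullptr;

    // Chain an overflowed object onto the front of the list.
    void push(Node* n) {
        assert(n != nullptr);
        n->next = head;
        head = n;
    }

    // Detach up to `num` nodes from the front and return them for
    // processing. Whatever is left must remain reachable from `head`.
    std::vector<Node*> take_prefix(std::size_t num) {
        std::vector<Node*> taken;
        Node* cur = head;
        while (cur != nullptr && taken.size() < num) {
            Node* rest = cur->next;
            cur->next = nullptr;  // a real collector would restore the header here
            taken.push_back(cur);
            cur = rest;
        }
        // Re-attach the suffix. Getting this link wrong loses every
        // remaining grey object, which is the class of error described above.
        head = cur;
        return taken;
    }

    // Number of nodes still chained on the list.
    std::size_t size() const {
        std::size_t n = 0;
        for (const Node* p = head; p != nullptr; p = p->next) ++n;
        return n;
    }
};
```

The invariant worth checking in such a routine is that the number of nodes taken plus the number left on the list equals the number that were on the list before the call.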

EVALUATION A "day one" bug in the handling of the overflow list used in parallel rescan and in parallel reference processing has been found and fixed. This fix applies to CMS with parallel remark, even in the absence of parallel reference processing, and needs to be backported to 6.0, 5.0 and 1.4.2_XX (XX >= 14) as well.
03-07-2007

WORK AROUND Do not use -XX:+ParallelRefProcEnabled, or explicitly switch it off with -XX:-ParallelRefProcEnabled.
22-06-2007

EVALUATION Marking was incomplete when a parallel work queue overflowed during the parallel reference processing (marking) phase: the overflow list was being ignored. The fix is simple, but will need to be verified by the customer.
22-06-2007
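The effect of ignoring the overflow list can be shown with a small single-threaded model. This is only a sketch under stated assumptions: Obj, Marker and the process_overflow switch are hypothetical and exist purely to contrast the two behaviours; the actual collector uses per-worker queues and a shared overflow list chained through mark words.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy object (assumption): `overflow_next` stands in for the mark-word link.
struct Obj {
    bool marked = false;
    Obj* overflow_next = nullptr;
    std::vector<Obj*> refs;
};

struct Marker {
    std::size_t capacity;      // bound on the work queue
    std::vector<Obj*> queue;   // work queue of grey objects
    Obj* overflow = nullptr;   // head of the overflow list

    explicit Marker(std::size_t cap) : capacity(cap) { assert(cap > 0); }

    // Mark an object, then queue it for scanning. When the queue is full
    // the (already marked) object is chained onto the overflow list.
    void mark_and_push(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        if (queue.size() < capacity) {
            queue.push_back(o);
        } else {
            o->overflow_next = overflow;
            overflow = o;
        }
    }

    // Scan queued objects until the work queue is empty.
    void drain_queue() {
        while (!queue.empty()) {
            Obj* o = queue.back();
            queue.pop_back();
            for (Obj* r : o->refs) mark_and_push(r);
        }
    }

    // With process_overflow == false this models the bug: overflowed
    // objects stay marked but are never scanned, so their referents
    // remain unmarked.
    void mark_from(Obj* root, bool process_overflow) {
        mark_and_push(root);
        drain_queue();
        while (process_overflow && overflow != nullptr) {
            Obj* o = overflow;
            overflow = o->overflow_next;
            o->overflow_next = nullptr;
            for (Obj* r : o->refs) mark_and_push(r);
            drain_queue();
        }
    }
};
```

With a queue capacity of 1 and a root that references three objects, two of them overflow. If the overflow list is not processed, anything reachable only through those two objects is left unmarked, even though the overflowed objects themselves are marked.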

SUGGESTED FIX As in evaluation/comments section. Watch this space for diffs and regression test case. In light of this, one will probably also need to run some of the older performance tests anew to assess efficacy of the parallel reference processing code.
22-06-2007