JDK-6558100 : CMS crash following parallel work queue overflow

Details
Type:
Bug
Submit Date:
2007-05-16
Status:
Closed
Updated Date:
2011-03-09
Project Name:
JDK
Resolved Date:
2011-03-07
Component:
hotspot
OS:
solaris_9,solaris,solaris_8,generic,solaris_10,linux_2.6
Sub-Component:
gc
CPU:
x86,sparc,generic
Priority:
P1
Resolution:
Fixed
Affected Versions:
1.4.2,1.4.2_04,1.4.2_05,1.4.2_06,1.4.2_14,1.4.2_15,5.0,5.0u2,5.0u7,5.0u8,5.0u9,5.0u10,5.0u11,5.0u12,6,6u1
Fixed Versions:
hs11 (b03)

Description
When -XX:+ParallelRefProcEnabled is set, the JVM always crashes after
3-4 hours with a customer application. Setting -XX:-ParallelRefProcEnabled
always prevents this crash.
Synopsis changed from:

CMS crash when -XX:+ParallelRefProcEnabled is set

to:

CMS crash following parallel work queue overflow

to better reflect the fact that this bug is not
limited to +ParallelRefProcEnabled, but rather
is of a much more severe nature.

                                    

Comments
EVALUATION

Background:-
  6558100 happens when there is task queue overflow in CMS parallel remark.
  In this case the overflowed objects are chained via their
  mark words. The bug is that later we forget to process
  these overflowed objects. In effect, the reachability
  closure is not computed past these objects. So any
  objects that are reached only through these overflown objects
  will not be marked and will be collected. These collected
  objects will end up on the CMS free list. When the next
  collection starts, the marking phase might reach these
  now-freed objects, and the marker barfs trying to scan
  these free blocks as though they were objects.
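
  The failure mode described above can be sketched with a toy model.
  This is not HotSpot code: the Obj class, the single bounded queue and
  the drain_overflow switch are assumptions made purely for illustration.
  Objects that do not fit on the work queue are chained through their
  mark words, and the sketch shows what happens when that chain is
  never drained.

```python
from collections import deque


class Obj:
    """Toy heap object (hypothetical): mark word, mark bit, outgoing refs."""

    def __init__(self, name):
        self.name = name
        self.mark_word = None   # reused as the overflow "next" link
        self.marked = False
        self.refs = []


def mark_from(roots, queue_capacity, drain_overflow):
    """Mark reachable objects using a bounded work queue.

    Objects that do not fit are chained through their mark words.
    With drain_overflow=False the chain is never processed, which
    models the bug described in the text.
    """
    queue = deque()
    overflow_head = None

    def push(obj):
        nonlocal overflow_head
        if obj.marked:
            return
        obj.marked = True                  # marked before queueing or overflowing
        if len(queue) < queue_capacity:
            queue.append(obj)
        else:
            obj.mark_word = overflow_head  # chain via the mark word
            overflow_head = obj

    for root in roots:
        push(root)
    while True:
        while queue:
            for child in queue.popleft().refs:
                push(child)
        if not drain_overflow or overflow_head is None:
            return
        # Take one object off the overflow list and scan it as usual.
        obj = overflow_head
        overflow_head = obj.mark_word
        obj.mark_word = None
        queue.append(obj)


def marked_names(drain_overflow):
    """Build a small graph (root -> a, b, c; c -> d) and mark it."""
    objs = {n: Obj(n) for n in ("root", "a", "b", "c", "d")}
    objs["root"].refs = [objs["a"], objs["b"], objs["c"]]
    objs["c"].refs = [objs["d"]]
    mark_from([objs["root"]], queue_capacity=2, drain_overflow=drain_overflow)
    return {n for n, o in objs.items() if o.marked}
```

  With a queue capacity of 2, object c overflows. When the overflow list
  is ignored, c itself stays marked but d, reachable only through c, is
  left unmarked and would be treated as garbage.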

Note:-
  The symptom of the collector thread barfing while marking
  could potentially come from a host of possible root causes
  in the VM, including missing card-marks, a bug in the CMS
  precleaning or, in this case, a bug in the CMS remark (or
  parallel reference processing).

Identifying an instance of 6263371 as a duplicate of 6558100:-
  The key to the identification is that although the overflown
  objects are not scanned, so the objects that they point to
  were collected prematurely, the overflow objects themselves
  are not collected because they had been marked before placing
  on the overflow list. But the reason the marker barfs is that
  it is trying to scan one of these prematurely collected objects
  that is referenced by some field of an overflown object.

  Connect to the core file using the SA, give it the address of
  the oop which we were trying to scan when we segv'd, and ask it
  to find all locations in the heap that contain that address.
  There are likely to be only a few (for reasons that
  will become clear below). Look at each of these locations
  in turn. Each will be a normal object, except that it will
  have a strange looking mark word. The mark word will have the
  address of another object, which will likewise be a normal
  looking object with a strange looking mark word and so on.
  You have just found part of the overflow list from the
  previous remark, and have identified an instance of 6558100.
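
  The search procedure above can be sketched over a toy heap. This is
  not the SA's API: the heap dictionary, the addresses and the two
  helper functions below are hypothetical and only illustrate the two
  steps (find the referrers of the bad address, then follow mark words
  that look like object addresses).

```python
def find_referrers(heap, target):
    """Return addresses of objects that have a field equal to target."""
    return sorted(addr for addr, obj in heap.items() if target in obj["fields"])


def follow_mark_chain(heap, start):
    """Follow mark words for as long as they hold the address of a heap object."""
    chain, seen, addr = [], set(), start
    while addr in heap and addr not in seen:
        chain.append(addr)
        seen.add(addr)
        addr = heap[addr]["mark"]
    return chain


# Hypothetical heap: 0x9000 was freed prematurely and is no longer an object.
# 0x1 stands in for an ordinary (non-pointer) mark word.
heap = {
    0x1000: {"mark": 0x2000, "fields": [0x9000]},
    0x2000: {"mark": 0x3000, "fields": []},
    0x3000: {"mark": 0x1, "fields": []},
    0x4000: {"mark": 0x1, "fields": [0x2000]},
}

referrers = find_referrers(heap, 0x9000)
chain = follow_mark_chain(heap, referrers[0])
```

  Here the only referrer of the freed address is the object at 0x1000,
  and following its mark word yields the chain 0x1000, 0x2000, 0x3000,
  which is the kind of fragment the text identifies as part of an old
  overflow list.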

Epilogue:-
  The above is just one symptom of 6558100. There are likely to
  be others. In particular, since the mark word has been, in effect,
  clobbered and might have contained locking information or an identity
  hash code, any computation on an object that involves synchronization
  or hashcode use could return the wrong answer. Possibilities
  include IllegalMonitorStateException, possibly a biased locking
  malfunction (although I have not worked this through in detail),
  or other such weird behaviour.
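
  As a rough illustration of the hash code case, the sketch below uses
  an invented mark word layout (hash in the upper bits, one low tag
  bit). This layout is an assumption for the example and is not
  HotSpot's real header format.

```python
HASH_SHIFT = 8  # hypothetical layout: hash stored above the low 8 bits


def make_mark_word(identity_hash):
    """Encode a hash into a toy mark word with a low tag bit set."""
    return (identity_hash << HASH_SHIFT) | 0x1


def identity_hash(mark_word):
    """Decode the hash, trusting that the mark word really holds one."""
    return mark_word >> HASH_SHIFT


original = make_mark_word(42)
# Overflow chaining overwrites the mark word with the address of the
# next object on the list (0x2000 is a made-up address).
clobbered = 0x2000
```

  Decoding the original word gives back 42, while decoding the
  clobbered word silently yields an unrelated value, which is the
  "wrong answer" scenario the epilogue warns about.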
                                     
2007-08-14
WORK AROUND

This CR includes two bugs, one in parallel reference processing
(in 7, 6 and 5) for which the workaround is -XX:-ParallelRefProcEnabled;
and another in parallel remark (in 7, 6, 5 and 1.4.2_14+) for
which the workaround is -XX:-CMSParallelRemarkEnabled.

Note that ParallelRefProcEnabled is in fact off by default
(-ParallelRefProcEnabled), while CMSParallelRemarkEnabled is on
by default (+CMSParallelRemarkEnabled).

Turning off parallelism in either case can adversely affect
CMS parallel remark pauses.
                                     
2007-08-08
EVALUATION

This bug is fixed (see suggested fix section). In the course of
stress testing related to this bug, a new as-yet-undiagnosed
bug came to light. That's being tracked under 6578335.

This bug fix should be backported to 6, 5 and 1.4.2; see subCR's.
                                     
2007-07-12
SUGGESTED FIX

From:    "Y. S. Ramakrishna" <###@###.###>
Sent:    Thursday, July 12, 2007 11:36 am
To:      ###@###.###
Subject: Code Manager notification (putback-to)

Event:            putback-to
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/main/gc_baseline
                  (jano2.sfbay:/export2/hotspot/ws/main/gc_baseline)
Child workspace:  /net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace
                  (prt-web:/net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace)
User:             ysr

Comment:

---------------------------------------------------------

Job ID:                 20070712093851.ysr.mustang
Original workspace:     neeraja:/net/jano2.sfbay/export2/hotspot/users/ysr/mustang
Submitter:              ysr
Archived data:          /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/
Webrev:                 http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html

Fixed   6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses

   http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev


(6558100)

When CMS marking (either during parallel rescan or parallel reference processing)
runs out of space on the per-worker work queues, the overflown grey objects
are tracked by chaining through their mark word. In this case, we had two
bugs: firstly, the method that took a prefix of the overflow list was not
re-attaching the intended suffix correctly (this affects all JVMs going
back to 1.4.2_14); secondly, the parallel reference processing code was
entirely neglecting to process the overflow list (this affects JVMs going
back to 5.0). The crucial debugging breakthrough came when Poonam used
the SA to track down the objects that CMS remark was declaring as
unreachable but unmarked, and found that they occurred in long chains
linked via their mark word (but with the promoted bit not set, which
helped distinguish them from the promoted chains that ParNew uses, and
identified them as broken fragments of an erstwhile overflow list).
Many thanks to Poonam Bajaj and Thomas Viessmann for crucial
debugging help. The customer has since run with a version of 6u2
with the fix (thanks Poonam) and verified that the previous crash
does not reproduce in > 2 days (previously the crash would happen in
about 4 hours).
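
The first of the two bugs concerns splitting the overflow list. A
minimal single-threaded sketch of taking a prefix while keeping the
suffix intact is shown here. The Node class and the helper names are
hypothetical, and the real code also has to cope with concurrent
workers, which this sketch ignores.

```python
class Node:
    """Toy object whose mark word is reused as the list's next link."""

    def __init__(self, name):
        self.name = name
        self.mark_word = None


def build_chain(names):
    """Build a list chained through mark words; return its head."""
    head = None
    for name in reversed(names):
        node = Node(name)
        node.mark_word = head
        head = node
    return head


def names_of(head):
    """Collect node names by walking the mark word links."""
    out = []
    while head is not None:
        out.append(head.name)
        head = head.mark_word
    return out


def take_prefix(head, n):
    """Detach up to n nodes from the front of the list.

    Returns (prefix_head, remaining_head). The caller must install
    remaining_head as the new overflow list; dropping it loses every
    object in the suffix.
    """
    if head is None or n <= 0:
        return None, head
    last = head
    for _ in range(n - 1):
        if last.mark_word is None:
            break
        last = last.mark_word
    suffix = last.mark_word
    last.mark_word = None   # terminate the prefix
    return head, suffix


prefix, rest = take_prefix(build_chain(["a", "b", "c", "d", "e"]), 2)
```

The point of returning the suffix explicitly is that no object is lost:
the names in the prefix and in the remainder together account for every
object that was on the original list.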

Some debugging code was added as well as some asserts relaxed
to allow for the possibility of examining an object lying at the end
of the overflow list. This latter issue will be more thoroughly revisited
and cleaned up under a separate bug id.

(6572569)

When CMSScavengeBeforeRemark is set, we were assuming that a scavenge
would have necessarily preceded a remark and that therefore the heap
would already be in a parsable state. However, it is possible that
the scavenge may not have been done because, for instance, a JNI
critical section was held. The main CR here will need other work to
deal with the issue found at the customer, but this is a fix for
the problem with CMSScavengeBeforeRemark which is a temporary workaround
to this customer's performance issue as described in the bug report.
Thanks to Chris Phillips for testing and backport help with 5uXX where
the problem manifested most readily.
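
The idea of not assuming that the scavenge ran can be sketched as
follows. The function and its callback parameters are hypothetical
names chosen for illustration and do not correspond to the actual
HotSpot change.

```python
def prepare_for_remark(scavenge_requested, try_scavenge, make_parsable):
    """Ensure the heap is parsable before remark scans it.

    try_scavenge() returns True only if a scavenge actually ran. It may
    return False, for instance when a JNI critical section is held.
    """
    scavenged = scavenge_requested and try_scavenge()
    if not scavenged:
        # Do not assume parsability; establish it explicitly.
        make_parsable()
    return scavenged


calls = []
skipped = prepare_for_remark(True, lambda: False, lambda: calls.append("parsable"))
```

In this run the scavenge is requested but does not happen, so the
explicit make_parsable step is taken instead of being skipped.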

Reviewed by: Jon Masamitsu & Andrey Petrusenko

Fix Verified: y

Verification Testing:
 6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
 6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces

Other testing:
 PRT (also with CMS stress options)
 refworkload, runThese -quick and -testbase

Note added in proof: Some late breaking big apps testing using the
stress flags yesterday revealed an as-yet-undiagnosed issue when
running Tomcat and ATG. Thanks to Ashwin for finding this issue,
which is being tracked under CR 6578335.

Files:
update: src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp

Examined files: 3991

Contents Summary:
       3   update
    3988   no action (unchanged)
                                     
2007-07-12
EVALUATION

A "day one" bug in the handling of the overflow list used in
parallel rescan and in parallel reference processing
has been found and fixed. This fix applies to CMS with
parallel remark, even in the absence of parallel reference
processing, and needs to be backported to 6.0, 5.0 and 
1.4.2_XX (XX >= 14) as well.
                                     
2007-07-03
WORK AROUND

Do not use -XX:+ParallelRefProcEnabled, or explicitly switch it off
with -XX:-ParallelRefProcEnabled.
                                     
2007-06-22
EVALUATION

Incomplete marking during parallel work queue overflow (overflow
list was being ignored) during parallel reference processing
(marking) phase. Simple fix, will need to be verified by customer.
                                     
2007-06-22
SUGGESTED FIX

As in evaluation/comments section. Watch this space for diffs
and regression test case.

In light of this, one will probably also need to run some of the older
performance tests anew to assess efficacy of the parallel reference
processing code.
                                     
2007-06-22


