Bug ID: JDK-6572569 CMS: consistently skewed work distribution indicated in (long) re-mark pauses

Details
Type:
Bug
Submit Date:
2007-06-21
Status:
Resolved
Updated Date:
2010-12-02
Project Name:
JDK
Resolved Date:
2010-10-07
Component:
hotspot
OS:
generic,solaris_10,linux_2.6
Sub-Component:
gc
CPU:
x86,sparc,generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
5.0u8,5.0u11,5.0u12,6u2
Fixed Versions:

Related Reports

Sub Tasks

Description
See comments section.

                                    

Comments
SUGGESTED FIX

(6572569)

When CMSScavengeBeforeRemark is set, we were assuming that a scavenge would have necessarily preceded a remark and that therefore the heap would already be in a parsable state. However, it is possible that the scavenge may not have been done because, for instance, a JNI critical section was held. The main CR here will need other work to deal with the issue found at the customer, but this is a fix for the problem with CMSScavengeBeforeRemark, which is a temporary workaround for this customer's performance issue as described in the bug report.

See:  http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev

(which also includes other fixes which you should elide in your reading for this CR).
                                     
2007-07-05
EVALUATION

The heap shape and workload are such that a CMS cycle starts and
finishes between two scavenges. Under these circumstances it
is possible for the Eden space parallelization to not work very
well. This can be partially worked around by means of
-XX:+CMSScavengeBeforeRemark.

Other heuristics to deal with this are also possible and will
be investigated while we await customer feedback on the efficacy
of +CMSScavengeBeforeRemark in their case.

SubCRs have been filed against releases earlier than 6.0 (in which
CMSScavengeBeforeRemark became a product flag) to make
CMSScavengeBeforeRemark a product flag in those releases as well.
See the subCRs for the relevant diffs (also with ###@###.###).

For the case of 7.0 a bug in CMSScavengeBeforeRemark
needed to be fixed. See the Suggested Fix section of the
details. That latter fix needs to be made in 7.0 and 6u3,
so an appropriate subCR for 6u3 has also been created.
                                     
2007-07-05
EVALUATION

See suggested fix section for the fix putback.

The CR is being kept open for remaining performance work
including heuristically determining the situation in
which CMSScavengeBeforeRemark is likely to help and/or
of dynamically toggling it as necessary.

That work will however happen at lower urgency, so the
priority of this bug will be lowered, based on some
preliminary performance numbers made available by the
customer that indicate the efficacy of this
flag as a workaround for the long remark problem.
                                     
2007-07-12
SUGGESTED FIX

Of the above, the diffs relevant to this bug are merely the following:

*** src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp-       Sun Jun 10 16:38:11 2007
--- src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp        Thu Jul 12 09:38:58 2007

*** 4688,4702 ****
    }
    assert(haveFreelistLocks(), "must have free list locks");
    assert_lock_strong(bitMapLock());
  
    if (!init_mark_was_synchronous) {
!     if (CMSScavengeBeforeRemark) {
!       // Heap already made parsable as a result of scavenge
!     } else {
        gch->ensure_parsability(false);  // fill TLAB's, but no need to retire them
-     }
      // Update the saved marks which may affect the root scans.
      gch->save_marks();
    
      {
        COMPILER2_PRESENT(DerivedPointerTableDeactivate dpt_deact;)
--- 4724,4746 ----
    }
    assert(haveFreelistLocks(), "must have free list locks");
    assert_lock_strong(bitMapLock());
  
    if (!init_mark_was_synchronous) {
!     // We might assume that we need not fill TLAB's when
!     // CMSScavengeBeforeRemark is set, because we may have just done
!     // a scavenge which would have filled all TLAB's -- and besides
!     // Eden would be empty. This however may not always be the case --
!     // for instance although we asked for a scavenge, it may not have
!     // happened because of a JNI critical section. We probably need
!     // a policy for deciding whether we can in that case wait until
!     // the critical section releases and then do the remark following
!     // the scavenge, and skip it here. In the absence of that policy,
!     // or of an indication of whether the scavenge did indeed occur,
!     // we cannot rely on TLAB's having been filled and must do
!     // so here just in case a scavenge did not happen.
      gch->ensure_parsability(false);  // fill TLAB's, but no need to retire them
      // Update the saved marks which may affect the root scans.
      gch->save_marks();
    
      {
        COMPILER2_PRESENT(DerivedPointerTableDeactivate dpt_deact;)
                                     
2007-07-12
SUGGESTED FIX

From  	View message header detail "Y. S. Ramakrishna" <###@###.###> 
Sent  	Thursday, July 12, 2007 11:36 am
To  	###@###.### 
Subject  	Code Manager notification (putback-to)

Event:            putback-to
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/main/gc_baseline
                  (jano2.sfbay:/export2/hotspot/ws/main/gc_baseline)
Child workspace:  /net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace
                  (prt-web:/net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace)
User:             ysr

Comment:

---------------------------------------------------------

Job ID:                 20070712093851.ysr.mustang
Original workspace:     neeraja:/net/jano2.sfbay/export2/hotspot/users/ysr/mustang
Submitter:              ysr
Archived data:          /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/
Webrev:                 http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html

Fixed   6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses

   http://analemma.sfbay/net/jano/export/disk05/hotspot/users/ysr/mustang/webrev


(6558100)

When CMS marking (either during parallel rescan or parallel reference processing)
runs out of space on the per-worker work queues, the overflown grey objects
are tracked by chaining through their mark word. In this case, we had two
bugs: firstly, the method that took a prefix of the overflow list was not
re-attaching the intended suffix correctly (this affects all JVM's going
back to 1.4.2_14); secondly, the parallel reference processing code was
entirely neglecting to process the overflow list (this affects JVM's going
back to 5.0). The crucial debugging breakthrough came when Poonam used
the SA to track down the objects that CMS remark was declaring as
unreachable but unmarked, and found that they occurred in long chains
linked via their mark word (but with the promoted bit not set, which
helped distinguish them from the promoted chains that ParNew uses, and
identified them as broken fragments of an erstwhile overflow list).
Many thanks to Poonam Bajaj and Thomas Viessmann for crucial
debugging help. The customer has since run with a version of 6u2
with the fix (thanks Poonam) and verified that the previous crash
does not reproduce in > 2 days (previously the crash would happen in
about 4 hours).

Some debugging code was added as well as some asserts relaxed
to allow for the possibility of examining an object lying at the end
of the overflow list. This latter issue will be more thoroughly revisited
and cleaned up under a separate bug id.

(6572569)

When CMSScavengeBeforeRemark is set, we were assuming that a scavenge
would have necessarily preceded a remark and that therefore the heap
would already be in a parsable state. However, it is possible that
the scavenge may not have been done because, for instance, a JNI
critical section was held. The main CR here will need other work to
deal with the issue found at the customer, but this is a fix for
the problem with CMSScavengeBeforeRemark which is a temporary workaround
to this customer's performance issue as described in the bug report.
Thanks to Chris Phillips for testing and backport help with 5uXX where
the problem manifested most readily.

Reviewed by: Jon Masamitsu & Andrey Petrusenko

Fix Verified: y

Verification Testing:
 6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
 6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces

Other testing:
 PRT (also with CMS stress options)
 refworkload, runThese -quick and -testbase

Note added in proof: Some late breaking big apps testing using the
stress flags yesterday revealed an as-yet-undiagnosed issue when
running Tomcat and ATG. Thanks to Ashwin for finding this issue,
which is being tracked under CR 6578335.

Files:
update: src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.cpp
update: src/share/vm/gc_implementation/concurrentMarkSweep/concurrentMarkSweepGeneration.hpp

Examined files: 3991

Contents Summary:
       3   update
    3988   no action (unchanged)
                                     
2007-07-12
SUGGESTED FIX

One simple approach towards fixing this problem is to not operate the
phase timeout until at least one scavenge has occurred during the phase,
i.e. something along the lines of:

    if (time_spent_in_phase > MAX(max_default, 2 * recent_inter_scavenge_duration)
        && at_least_one_scavenge_during_phase)
    then abort_phase.

We should see if one of the customers (or a suitable in-house configuration)
can test/verify the efficacy of such a heuristic across a range of
conditions.
                                     
2007-10-17
WORK AROUND

-XX:+CMSScavengeBeforeRemark is a partial workaround.
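A hypothetical invocation applying the workaround is shown below. Only the two CMS flags come from this CR; the application jar and the logging flag are illustrative placeholders:

```shell
# Illustrative command line; app.jar is a placeholder.
java -XX:+UseConcMarkSweepGC \
     -XX:+CMSScavengeBeforeRemark \
     -XX:+PrintGCDetails \
     -jar app.jar
```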
                                     
2009-11-30
EVALUATION

The balance of the work remaining to be done here has been transferred to
shadow CR 6990419  CMS: Remaining work for 6572569: consistently skewed work distribution in (long) re-mark pauses.

I am closing this as fixed in 7. I would have to do a bit of archeology to
determine the exact build of JDK 7 in which the fix for CMSScavengeBeforeRemark
integrated, but until then here is the JPRT archive link (and I am using
7b01 as the build we fixed in, which is almost certainly a lie):

http://prt-web.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/workspace/webrevs/webrev-2007.07.12/index.html
                                     
2010-10-07


