JDK-6316605 : atg server crashed with CMS collector
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 6
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris
  • CPU: sparc
  • Submitted: 2005-08-26
  • Updated: 2010-05-11
  • Resolved: 2005-09-17
Fixed in: JDK 6, build 6 b52
Description
The atg server crashed on j2se-b.west when run with the CMS collector on a fastdebug build from the main baseline.
j2se-b# uname -a
SunOS j2se-b 5.10 Generic_118822-15 sun4u sparc SUNW,Sun-Fire
j2se-b# /usr/j2se/bin/java -version
java version "1.6.0-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-ea-b48)
Java HotSpot(TM) Server VM (build 20050825070300.mingyao.rt_merge-debug, mixed mode)

flags used: -server -XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
-XX:+DeoptimizeALot -XX:+SafepointALot

Error message:
# after -XX: or in .hotspotrc:  SuppressErrorAt=/mutexLocker.cpp:99]
==============================================================================
Unexpected Error
------------------------------------------------------------------------------
Internal Error at mutexLocker.cpp:99, pid=3978, tid=20

Stack trace:
(dbx) where
current thread: t@20
=>[1] ___nanosleep(0xf9d7f420, 0xf9d7f418, 0x10, 0x0, 0xff0a3c9b,
0x6dc), at 0xff33f6c8
  [2] sleep(0x64, 0xf9d7f4a8, 0x0, 0x0, 0x0, 0x0), at 0xff3328f0
  [3] os::message_box(0xfef6e866, 0xff0a3ba8, 0xf9d7f4a8, 0x4e,
0xf9d7f4b8, 0x64), at 0xfe034d00
  [4] VMError::show_message_box(0xf9d7f698, 0xff0a3ba8, 0x7d0, 0x2a800,
0x36, 0xfeff0494), at 0xfe326ca0
  [5] VMError::report_and_die(0xf9d7f698, 0x30c00, 0x31c00, 0xff0a4380,
0xff05ffb4, 0xff0a4378), at 0xfe3256c0
  [6] report_fatal_vararg(0xfec4f63a, 0x63, 0xff05ff8d, 0xfeff0494,
0x30e80, 0x30c00), at 0xfd85c7b8
  [7] assert_locked_or_safepoint(0x2153c, 0x0, 0xff0602dc, 0xfeff0494,
0x2ead4, 0x21400), at 0xfdfd2f24
  [8] ConcurrentMarkSweepGeneration::grow_by(0x17c318, 0x980000,
0xff07af34, 0x17c33c, 0xfeff0494, 0x33000), at 0xfd808efc
  [9] ConcurrentMarkSweepGeneration::compute_new_size(0x97fcf2,
0x17c318, 0xa000, 0x2e7c4, 0xa000, 0xfd7febe0), at 0xfd7ff0e8
  [10] CMSCollector::collect_in_background(0xff07e9b4, 0x1fca08,
0x57798, 0x578e0, 0xfbf98, 0xfbe10), at 0xfd804074
  [11] ConcurrentMarkSweepThread::run(0x1, 0x39e80, 0x31800, 0x39e80,
0x28db00, 0xfeff0494), at 0xfd828c78
  [12] java_start(0x28db00, 0x2, 0xff07f2fc, 0x22400, 0xfeff0494,
0x28e8f0), at 0xfe02d958

I also ran atg with ParallelGC on a similar solaris sparc machine,
j2se-a.west; the test went well with ParallelGC.

I am rerunning the test on a similar machine to find out how easy or difficult the bug is to reproduce.


How to reproduce the bug:
1. log on to j2se-b.west
2. export JAVA_HOME=<your java home>
3. export STARTLOOP=20 ( default value is 10 )
4. /bs/runatg.ksh -server -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+DeoptimizeALot -XX:+SafepointALot
5. log files will be located under /bt/atg*, server log is "atgserver.log"

Comments
EVALUATION This bug has been fixed in Tiger U7b01 as well, as a result of the fixes needed for 6319688.
03-01-2006

SUGGESTED FIX
Event: putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
  (jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace: /net/prt-web.sfbay/prt-workspaces/20050906233404.ysr.MT/workspace
  (prt-web:/net/prt-web.sfbay/prt-workspaces/20050906233404.ysr.MT/workspace)
User: ysr

Comment:
---------------------------------------------------------
Original workspace: karachi:/net/spot/scratch/ysr/MT
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050906233404.ysr.MT/
Webrev: http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050906233404.ysr.MT/workspace/webrevs/webrev-2005.09.07/index.html

Fixed 6316605: atg server crashed with CMS collector

http://analemma.sfbay/net/spot/scratch/ysr/MT/webrev

The immediate assertion failure was caused by code restructuring in a previous CMS/Ergo putback: the call to resize the CMS generation was moved to a new site where the CMS thread did not hold the CMS token when calling the resize method. Because this code could now run concurrently with a safepoint, we opened up a small window that allowed the CMS thread to attempt to expand the heap without holding the ExpandHeap_lock, triggering the assertion when the safepoint was finished.

It so happens, however, that the ExpandHeap_lock is in some sense obsolete, and heap expansion should in fact be protected by the Heap_lock (which is held on behalf of the VM thread by the thread initiating a collection). It is also the case that, when using CMS, the CMS token is a proxy for the Heap_lock and can be used to protect heap resizing. This change re-establishes that last invariant, which had been compromised by the code movement, and changes the asserts so that this condition is explicitly called out. The use of ExpandHeap_lock in CMS is completely eliminated.

As a result, the CMS thread can no longer resize the heap while a scavenge is in progress, which fixes the original problem.

In two subsequent putbacks, under two separate bugs that we have filed for the work, we shall:
(a) eliminate ExpandHeap_lock completely from the JVM and correct all the assertions that currently use it, so that they instead use an appropriate condition on the Heap_lock;
(b) change CMS so that the use of the CMS token proxy is eliminated for the purposes of heap resizing, if it is possible to do so without any increase in code complexity or degradation in performance (this would have the beneficial effect of using the same mechanism across all heap configurations).

During the course of this investigation, we also found a coding flaw in CMSPermGen::mem_allocate(), which we fixed.

This bug fix needs to be backported to Tiger and possibly Mantis (appropriate sub-CRs have been filed).

Reviewed by: Jon Masamitsu
Fix Verified: (verification runs in progress)
Verification Testing: ATG script with stress options from June; see bug report
Other testing: refworkload (with stress options), prt

Files:
update: src/share/vm/memory/concurrentMarkSweepGeneration.cpp
update: src/share/vm/memory/concurrentMarkSweepGeneration.hpp
update: src/share/vm/memory/permGen.cpp
update: src/share/vm/memory/permGen.hpp

Examined files: 3680
Contents Summary:
  4 update
  3676 no action (unchanged)
14-09-2005
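
The invariant described in the suggested fix above (a resize may only happen while the resizing thread holds a token that stands in for the heap lock) can be illustrated with a small standalone sketch. This is not HotSpot code: `Token`, `TokenSync`, `Generation`, `resize_concurrently` and `grow_without_token` are hypothetical names invented for illustration, and `grow_by` here returns `false` where the real code would trip an assertion, so that the violation can be observed without aborting.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the CMS token: a mutex that remembers its owner,
// so code can ask "does the current thread hold the token?".
class Token {
 public:
  void acquire() {
    mu_.lock();
    owner_.store(std::this_thread::get_id());
  }
  void release() {
    owner_.store(std::thread::id());  // default-constructed id means "no owner"
    mu_.unlock();
  }
  bool held_by_self() const {
    return owner_.load() == std::this_thread::get_id();
  }

 private:
  std::mutex mu_;
  std::atomic<std::thread::id> owner_{std::thread::id()};
};

// RAII guard (illustrative only): holds the token for the enclosing scope.
class TokenSync {
 public:
  explicit TokenSync(Token& t) : t_(t) { t_.acquire(); }
  ~TokenSync() { t_.release(); }
  TokenSync(const TokenSync&) = delete;
  TokenSync& operator=(const TokenSync&) = delete;

 private:
  Token& t_;
};

// Toy "generation" whose capacity may only change under the token.
class Generation {
 public:
  Generation(Token& token, std::size_t capacity)
      : token_(token), capacity_(capacity) {}

  // Refuses to resize unless the caller holds the token. The report's fix
  // expresses this precondition as an assert; this sketch returns false.
  bool grow_by(std::size_t bytes) {
    if (!token_.held_by_self()) return false;
    capacity_ += bytes;
    return true;
  }
  std::size_t capacity() const { return capacity_; }

 private:
  Token& token_;
  std::size_t capacity_;
};

// Several threads grow the same generation, each taking the token first.
std::size_t resize_concurrently(int threads, int grows_per_thread,
                                std::size_t step) {
  Token token;
  Generation gen(token, 0);
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < grows_per_thread; ++i) {
        TokenSync sync(token);  // acquire before resizing, release at scope exit
        bool ok = gen.grow_by(step);
        assert(ok);
        (void)ok;
      }
    });
  }
  for (auto& w : workers) w.join();
  return gen.capacity();
}

// The broken pattern: resizing without first taking the token.
bool grow_without_token() {
  Token token;
  Generation gen(token, 0);
  return gen.grow_by(8);
}
```

Under these assumptions, every call that goes through `TokenSync` succeeds and the updates are serialized, while `grow_without_token()` is rejected. That rejection is the sketch's analogue of the assertion failure in the stack trace.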

EVALUATION The main bug has been fixed and will be putback to Mustang. The balance is being retargeted to Dolphin under a separate CR. See suggested fix & comments sections for some details.
02-09-2005

SUGGESTED FIX
Mustang diffs: (under testing and review)
===============

The following fixes are "partial" in the sense that they re-establish
the invariant that the CMS token protects heap expansion in the case of
CMS. The redundant and useless ExpandHeap_lock has been eliminated
from CMS and will be eliminated from the other collectors in a second putback.
Uniformly using the Heap_lock to protect heap expansion everywhere
will be done in a third putback.

Diffs set #1 follows:
=====================

------- permGen.hpp -------
67a68,69
>
> HeapWord* mem_allocate_work(size_t size);

------- permGen.cpp -------
117d116
< HeapWord* obj = NULL;
118a118,128
> if (lock_owned) {
>   MutexUnlocker mul(lock);
>   return mem_allocate_work(size);
> } else {
>   return mem_allocate_work(size);
> }
> }
>
> HeapWord* CMSPermGen::mem_allocate_work(size_t size) {
> MutexLocker ml(Heap_lock);
> HeapWord* obj = NULL;
120c130
< obj = check_lock_and_allocate(lock_owned, size);
---
> obj = _gen->allocate(size, false);
135,137c145,146
< check_lock_and_collect(lock_owned,
< GCCause::_permanent_generation_full);
< obj = check_lock_and_allocate(lock_owned, size);
---
> SharedHeap::heap()->collect(GCCause::_permanent_generation_full);
> obj = _gen->allocate(size, false);
145,146c154,155
< check_lock_and_collect(lock_owned, GCCause::_last_ditch_collection);
< obj = check_lock_and_allocate(lock_owned, size);
---
> SharedHeap::heap()->collect(GCCause::_last_ditch_collection);
> obj = _gen->allocate(size, false);

------- concurrentMarkSweepGeneration.hpp -------
686,689d685
< // locking checks
< bool vm_thread_has_cms_token();
< bool cms_thread_has_cms_token();
<
716a713,715
> // locking checks
> NOT_PRODUCT(static bool have_cms_token();)
>

------- concurrentMarkSweepGeneration.cpp -------
837a838
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
847d847
<
1721c1721,1722
< // under the freelist lock.
---
> // after obtaining the free list locks for the
> // two generations.
1722a1724
> assert(have_cms_token(), "Proxy for Heap_lock");
2149a2152,2171
> // At this point the background collection has completed.
> // Don't move the call to compute_new_size() down
> // into code that might be executed if the background
> // collection was preempted.
> // Note: When using CMS, and manipulating/resizing the CMS
> // collected heap, it turns out that the CMS token is a
> // strong proxy for the Heap_lock that we would otherwise be
> // required to hold (or be held on our behalf). As we saw
> // above, though, because the Heap_lock is held on behalf of
> // the VM thread trying to initiate a foreground collection,
> // the protocol for obtaining the Heap_lock here by the
> // CMS thread would become quite complicated, a complication
> // we'd rather avoid. An alternative would be to use a
> // new CMS state machine state "Resizing", and explicitly
> // take the Heap_lock before acquiring the CMS token.
> // Perhaps that's the way to go?
> {
>   CMSTokenSync x(true);
>   compute_new_size();
> }
2151,2156d2172
< // At this point the background collection has completed.
< // Don't move the call to compute_new_size() down
< // into code that might be executed if the background
< // collection was preempted.
< compute_new_size();
<
2539a2556,2570
> #ifndef PRODUCT
> bool CMSCollector::have_cms_token() {
>   Thread* thr = Thread::current();
>   if (thr->is_VM_thread()) {
>     return ConcurrentMarkSweepThread::vm_thread_has_cms_token();
>   } else if (thr->is_ConcurrentGC_thread()) {
>     return ConcurrentMarkSweepThread::cms_thread_has_cms_token();
>   } else if (thr->is_GC_task_thread()) {
>     return ConcurrentMarkSweepThread::vm_thread_has_cms_token() &&
>            ParGCRareEvent_lock->owned_by_self();
>   }
>   return false;
> }
> #endif
>
2552,2556c2583
< assert( ( Thread::current()->is_VM_thread()
< && ConcurrentMarkSweepThread::vm_thread_has_cms_token())
< || ( Thread::current()->is_ConcurrentGC_thread()
< && ConcurrentMarkSweepThread::cms_thread_has_cms_token()),
< "Else there may be mutual interference in use of CMS data structures ");
---
> assert(have_cms_token(), "Should hold cms token");
2610,2613c2637
< assert( ( Thread::current()->is_VM_thread()
< && ConcurrentMarkSweepThread::vm_thread_has_cms_token())
< || ( Thread::current()->is_ConcurrentGC_thread()
< && ConcurrentMarkSweepThread::cms_thread_has_cms_token()),
---
> assert(have_cms_token(),
2929c2953,2954
< GCMutexLocker x(ExpandHeap_lock);
---
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
>
3016c3041
< GCMutexLocker x(ExpandHeap_lock);
---
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
3024c3049
< assert_locked_or_safepoint(ExpandHeap_lock);
---
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
3055c3080
< assert_locked_or_safepoint(ExpandHeap_lock);
---
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
3064c3089
< assert_locked_or_safepoint(ExpandHeap_lock);
---
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
3755,3756d3779
< assert(ConcurrentMarkSweepThread::cms_thread_has_cms_token(),
< "CMS thread should hold CMS token");
5357c5380
< _permGen->freelistLock(), ExpandHeap_lock);
---
> _permGen->freelistLock());
5501a5525
> assert(have_cms_token(), "Should hold cms token");
5507d5530
<
5509d5531
<
8039a8062,8063
> assert(CMSCollector::have_cms_token(), "Proxy for Heap_lock");
>
8078c8102
< assert_locked_or_safepoint(ExpandHeap_lock);
---
> assert_locked_or_safepoint(Heap_lock);
31-08-2005
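
The permGen.cpp hunk in the diffs above restructures allocation so that a caller-held lock is released for the duration of the real work, which then runs under the heap lock. A minimal standalone sketch of that shape follows. It is an illustration only: `ScopedUnlock`, `heap_lock`, `caller_lock`, `allocate` and `allocate_work` are hypothetical names, `std::mutex` stands in for the VM's own mutex types, and the toy bump allocator is invented. The report does not say why the caller's lock is dropped first; avoiding holding both locks at once is one plausible reading, not a confirmed one.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// RAII helper (hypothetical): releases a mutex the caller already holds for
// the duration of a scope, then reacquires it on scope exit.
class ScopedUnlock {
 public:
  explicit ScopedUnlock(std::mutex& m) : m_(m) { m_.unlock(); }
  ~ScopedUnlock() { m_.lock(); }
  ScopedUnlock(const ScopedUnlock&) = delete;
  ScopedUnlock& operator=(const ScopedUnlock&) = delete;

 private:
  std::mutex& m_;
};

std::mutex heap_lock;    // stand-in for the heap lock
std::mutex caller_lock;  // stand-in for the lock the caller may already hold

static std::size_t used = 0;             // toy bump-allocator state
static const std::size_t kCapacity = 64; // toy capacity, in arbitrary units

// The allocation proper: always runs with heap_lock held.
// Returns the offset of the new block, or -1 if the request does not fit.
long allocate_work(std::size_t size) {
  assert(size > 0);
  std::lock_guard<std::mutex> guard(heap_lock);
  if (used + size > kCapacity) return -1;
  long offset = static_cast<long>(used);
  used += size;
  return offset;
}

// Entry point: if the caller owns caller_lock, drop it while allocating and
// take it back afterwards; otherwise go straight to the work function.
long allocate(bool lock_owned, std::size_t size) {
  if (lock_owned) {
    ScopedUnlock unlock(caller_lock);
    return allocate_work(size);
  }
  return allocate_work(size);
}
```

A caller that holds `caller_lock` passes `lock_owned = true` and still owns the lock when `allocate` returns; a caller that does not hold it passes `false`.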

EVALUATION Fix in progress; see comments section.
26-08-2005