JDK-6578335 : CMS: BigApps failure with -XX:CMSInitiatingOccupancyFraction=1 -XX:+CMSMarkStackOverflowALot ...
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 1.4.2,6u3,6u4
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • OS: solaris_9,solaris_10
  • CPU: x86,sparc
  • Submitted: 2007-07-09
  • Updated: 2011-12-15
  • Resolved: 2008-10-29
Ran into a CMS crash while testing the fix for CR#6558100.

HotSpot : 20070705083138.ysr.mustang-fastdebug
JDK : 6u3 b01

Relevant Flags : -d64 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=1 -XX:+CMSMarkStackOverflowALot -XX:CMSMarkStackOverflowInterval=20 -XX:SuppressErrorAt=/referenceProcessor.cpp:488 -XX:+VerifyBeforeGC -XX:+VerifyDuringGC -XX:+VerifyAfterGC -XX:+ShowMessageBoxOnError -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+DisableExplicitGC -Xmx192m 

Error message :
Internal Error at concurrentMarkSweepGeneration.cpp:2861, pid=14414, tid=8
Error:  ... aborting

Do you want to debug the problem?

To debug, run 'dbx - 14414'; then switch to thread 8
Enter 'yes' to launch dbx automatically (PATH must include dbx)
Otherwise, press RETURN to abort... 

Stack trace : 
(dbx) where
current thread: t@8
=>[1] ___nanosleep(0xfffffd7fe71fcf70, 0xfffffd7fe71fcf60, 0xfffffd7fff2ae191, 0xfffffd7fff2bdaaa, 0x0, 0x1), at 0xfffffd7fff2bcd6a
  [2] sleep(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2ae1a5
  [3] os::message_box(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffdfb0e9a
  [4] VMError::show_message_box(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe38904b
  [5] VMError::report_and_die(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe38875f
  [6] report_fatal(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd739b81
  [7] CMSCollector::verify_after_remark_work_1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd6cef06
  [8] CMSCollector::verify_after_remark(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd6ce5a6
  [9] CMSCollector::checkpointRootsFinalWork(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd6df314
  [10] CMSCollector::checkpointRootsFinal(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd6de7bc
  [11] CMSCollector::do_CMS_operation(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffd6e8451
  [12] VM_CMS_Final_Remark::doit(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe385a14
  [13] VM_Operation::evaluate(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe3b5492
  [14] VMThread::evaluate_operation(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe3b3923
  [15] VMThread::loop(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe3b4097
  [16] VMThread::run(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe3b354c
  [17] java_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffdfa789d
  [18] _thr_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2baa4b
  [19] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2bac80 

Stack trace when GC verification is not turned on:
(dbx) where
current thread: t@5
dbx: read of 8 bytes at address bad1b000 failed -- No such file or directory
dbx: attempt to read frame failed -- cannot derive frame pointer
  [1] ___nanosleep(0xbad1a958, 0xbad1a960), at 0xbff30327
  [2] _sleep(0x64), at 0xbff24a23
  [3] os::message_box(0xbfb43c9a, 0xbfc7d678), at 0xbeccb816
  [4] VMError::show_message_box(0xbad1aaec, 0xbfc7d678, 0x7d0), at 0xbef58c09
  [5] VMError::report_and_die(0xbad1aaec), at 0xbef5825d
  [6] report_assertion_failure(0xbf37fbf0, 0x1c73, 0xbf37fbb0), at 0xbe6ede57
  [7] PushOrMarkClosure::do_oop(0xbad1abc4, 0xb08d5010), at 0xbe6c11a6
  [8] objArrayKlass::oop_oop_iterate_nv(0xb6c8e4f8, 0xbad1ac1c, 0xbad1abc4), at 0xbecac6ed
  [9] MarkFromRootsClosure::scanOopsInOop(0xbad1ad1c, 0xb08d4fe0), at 0xbe6bf240
  [10] MarkFromRootsClosure::do_bit(0xbad1ad1c, 0x3353f8), at 0xbe6be88f
  [11] BitMap::iterate(0x80ecd80, 0xbad1ad1c, 0x411e, 0x2c00000), at 0xbe47312a
  [12] CMSCollector::do_marking_st(0x80ecc68, 0x1), at 0xbe6b1dfe
  [13] CMSCollector::markFromRootsWork(0x80ecc68, 0x1), at 0xbe6afbab
  [14] CMSCollector::markFromRoots(0x80ecc68, 0x1), at 0xbe6af90a
  [15] CMSCollector::collect_in_background(0x80ecc68, 0x0), at 0xbe6aab03
  [16] ConcurrentMarkSweepThread::run(0x810d800), at 0xbe6c92f2
  [17] java_start(0x810d800), at 0xbecc5722
  [18] _thr_setup(0xbde70800), at 0xbff2fd36
=>[19] _lwp_start(), at 0xbff30020
  [20] 0x0(), at 0xffffffffffffffff 

EVALUATION This bug should be closed as a duplicate of 6722112, 6722113, and 6722116. Of these, 6722112 and 6722116 are fixed, and 6722113 will be fixed soon (a workaround has been checked in). As a result, I am closing this as a duplicate of 6722113.

EVALUATION For logistical and process reasons, the three bugs mentioned above are being fixed under the following CRs:
6722112 CMS: Incorrect encoding of overflown object arrays during concurrent precleaning
6722113 CMS: Incorrect overflow handling during precleaning of Reference lists
6722116 CMS: Incorrect overflow handling when using parallel concurrent marking

EVALUATION There was a third bug, which relates to the handling of "second ring overflow" when using parallel concurrent marking -- the overflow of the global overflow stack (which itself handles the overflow from the local work queues). The intention was that this second-ring overflow should use the "restart mechanism" to restart marking from the lowest overflown address. That mechanism was not completely extended to the parallel concurrent marking case: the restart_addr was not pushed all the way through to the task that controls parallel concurrent marking. Because of this partial change to the task's state, we can, and often will, end up missing the scan of some addresses at the higher extremes of the CMS-collected generation. Because second-ring overflow is a very rare event in practice, this appears not to have been detected before (or at least not until the first two bugs mentioned above were moved out of our way). The obvious workaround is to switch off parallel concurrent marking via -XX:-CMSConcurrentMTEnabled.
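The second-ring protocol described above can be modeled, very roughly, as a bounded stack that records the lowest address it was forced to drop; the bug was that this recorded address never reached the parallel marking task, so the rescan from it never happened. This is an illustrative sketch only -- the names (GlobalMarkStack, needs_restart) and the fixed capacity are assumptions, not HotSpot's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of second-ring overflow: a bounded global stack whose
// overflow is handled by remembering the lowest dropped address
// (restart_addr) so marking can later restart from it.
struct GlobalMarkStack {
    std::vector<size_t> stack;
    size_t capacity;
    size_t restart_addr;  // ~0 means "no restart needed"

    explicit GlobalMarkStack(size_t cap)
        : capacity(cap), restart_addr(~size_t(0)) {}

    void push(size_t addr) {
        if (stack.size() < capacity) {
            stack.push_back(addr);
        } else {
            // Second-ring overflow: the object is dropped, but we record the
            // lowest dropped address so marking can restart from there.
            restart_addr = std::min(restart_addr, addr);
        }
    }
};

// After draining the stack, ask whether a restart scan is required and from
// where. The bug was, in effect, that this answer was never propagated to
// the parallel concurrent marking task.
bool needs_restart(const GlobalMarkStack& s, size_t* out) {
    if (s.restart_addr == ~size_t(0)) return false;
    *out = s.restart_addr;
    return true;
}
```

In this model, dropping the restart address on the floor means every object at or above it that only lived on the overflowed stack stays white and is later treated as garbage -- which matches the missed scans at the higher extremes of the generation described above.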

EVALUATION There was a second bug, in the overflow handling encountered during the precleaning of reference lists. Because the same closure was used for this work during both the remark and the preclean stage, we ended up with incorrect overflow handling (correct for the remark phase, incorrect for the preclean phase). This needs to be fixed. The temporary workaround is to disable CMSPrecleanRefLists{1,2}.

EVALUATION What we were doing: if we overflowed the marking stack when trying to push a newly marked (now grey) object encountered during precleaning, we would just dirty the card the (now marked) object lay on, with the expectation that a later precleaning pass or the final remark phase (which picks up all remaining dirty cards) would deal with the object. But of course, in the case of an object _array_, preclean/remark would scrub only the dirty cards, not the entire array, so the part of the object array that protruded off the dirtied card onto a possibly clean card would not be scanned; if that part contained references to white objects, those would be lost. The fix, in the case of overflown object arrays, is to dirty all the cards that the newly marked overflown object array lies on when encoding its greyness for the purposes of rescan (by a later preclean pass or the subsequent final remark).
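The encoding fix can be sketched against a toy card table, again assuming 512-byte cards (names are illustrative, not HotSpot's): an ordinary object needs only its starting card dirtied, but an overflown object array must have every card it spans dirtied, or the tail beyond the first card is never rescanned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy card table: one dirty bit per 512-byte window of the heap.
const size_t kCardSize = 512;

struct CardTable {
    std::vector<bool> dirty;
    explicit CardTable(size_t heap_bytes)
        : dirty(heap_bytes / kCardSize, false) {}

    // Old (buggy) encoding: dirty only the card holding the object's start.
    // For a large object array, the tail cards stay clean and are skipped.
    void mark_start_card(size_t addr) {
        dirty[addr / kCardSize] = true;
    }

    // Fixed encoding for overflown object arrays: dirty every card the
    // array spans, so a later preclean pass or the final remark rescans
    // the whole array.
    void mark_span(size_t addr, size_t size_bytes) {
        for (size_t c = addr / kCardSize;
             c <= (addr + size_bytes - 1) / kCardSize; ++c) {
            dirty[c] = true;
        }
    }
};
```

For example, a 1200-byte array starting at offset 100 spans cards 0 through 2; the old encoding dirtied only card 0, leaving any white objects referenced from the array's tail unreachable after the collection.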