JDK-4985197 : 1.4.2_03-b02 Crashes during Concurrent collections

Details
Type: Bug
Submit Date: 2004-01-29
Status: Resolved
Updated Date: 2004-06-15
Project Name: JDK
Resolved Date: 2004-04-07
Component: hotspot
OS: solaris_9,solaris_8,generic
Sub-Component: gc
CPU: sparc,generic
Priority: P3
Resolution: Fixed
Affected Versions: 1.4.2_01,1.4.2_03,1.4.2_04
Fixed Versions: 1.4.2_05 (05)

Description
Server crashes after long performance tests using 1.4.2_03. See stack traces
below. The full traces are attached along with the hotspot error messages.

First Crash
-----------
core 'core.java.11552.1074817152' of 11552:	java -DORB.OrbName=ProdFE01v20Frontendsbtf0ai -DORB.PortNum=22023 -Dpr
-----------------  lwp# 10 / thread# 10  --------------------
 ff31efd0 _lwp_kill (6, 0, 0, ffffffff, ff3403bc, 0) + 8
 ff2b595c abort    (ff33c000, a977ed28, 0, 4, 0, a977ed49) + 100
 ff098260 __1cCosFabort6Fi_v_ (1, ff15323a, a977edd8, ff17e000, ff1c58bc, 3e93f4) + 80
 ff096574 __1cCosbBhandle_unexpected_exception6FpnGThread_ipCpv_v_ (0, a, fef754bc, a977fb40, fedd86c8, 0) + 2d4
 fedd8f9c JVM_handle_solaris_signal (fef754bc, a977fb40, a977f888, 3400, 35ec, 0) + 91c
 ff384cc8 __sighndlr (a, a977fb40, a977f888, fedd864c, 0, 0) + c
 ff37fb00 call_user_handler (fea31000, a, ff3978c0, a977f888, a977fb40, a) + 254
 ff37fccc sigacthandler (fea31000, a977fb40, a977f888, ff396000, a977fb40, a) + 64
 --- called from signal handler with signal -22867968 (SIG Unknown) ---
 fef754bc __1cUMarkFromRootsClosureNscanOopsInOop6MpnIHeapWord__v_ (a977fd18, bbd6de20, 0, a7000000, 1, a977fd18) + 188
 fef30d9c __1cGBitMapHiterate6MpnNBitMapClosure_II_v_ (5eb6f0, a977fd18, 0, 12700000, 1, 0) + 8c
 fef703e0 __1cMCMSCollectorRmarkFromRootsWork6Mi_v_ (f9420, 1, ff135bee, d429994e, 4b42c8, 0) + 154
 fef70158 __1cMCMSCollectorNmarkFromRoots6Mi_v_ (f9420, 1, 9999999a, 2f648, 4b42c8, 0) + 120
 fef6daa0 __1cMCMSCollectorVcollect_in_background6Mi_v_ (ff135bf3, fef6d828, 7d0, 4000, 417c, 0) + 238
 fef77e34 __1cZConcurrentMarkSweepThreadDrun6M_v_ (3c00, 4c00, 5400, 55f0, 3c00, 3ffc) + 438
 fee65600 _start   (101a00, fea31000, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

Second Crash
-------------
core 'core.java.1584.1074817435' of 1584:	java -DORB.OrbName=ProdFE01v20Frontendsbtf0bi -DORB.PortNum=22023 -Dpr
-----------------  lwp# 3 / thread# 3  --------------------
 ff096564 __1cCosbBhandle_unexpected_exception6FpnGThread_ipCpv_v_ (0, a, fef76bd0, fc67fcd8, fedd86c8, 0) + 2c4
 fedd8f9c JVM_handle_solaris_signal (fef76bd0, fc67fcd8, fc67fa20, 3400, 35ec, 0) + 91c
 ff384cc8 __sighndlr (a, fc67fcd8, fc67fa20, fedd864c, 0, 0) + c
 ff37fb00 call_user_handler (fea30200, 3, ff3978c0, fc67fa20, fc67fcd8, a) + 254
 ff37fccc sigacthandler (fea30200, fc67fcd8, fc67fa20, ff396000, fc67fcd8, a) + 64
 --- called from signal handler with signal -22871552 (SIG Unknown) ---
 fef76bd0 __1cbEPar_MarkRefsIntoAndScanClosureKtrim_queue6MI_v_ (fc67fe34, 0, fc67fe34, fc67fe28, 1, 0) + d8
 fef718f4 __1cQCMSParRemarkTaskEwork6Mi_v_ (a977fb84, 4, 0, 4000, 417c, 1) + 278
 ff0fb4c8 __1cKGangWorkerDrun6M_v_ (9b518, 3, 40, 0, 40, 0) + ac
 fee65600 _start   (9b518, fea30200, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)



###@###.### 2004-01-29: the following are the c++filt'd versions of the
above for greater clarity:

-----------------  lwp# 10 / thread# 10  --------------------
 ff31efd0 _lwp_kill (6, 0, 0, ffffffff, ff3403bc, 0) + 8
 ff2b595c abort    (ff33c000, a977ed28, 0, 4, 0, a977ed49) + 100
 ff098260 void os::abort(int) (1, ff15323a, a977edd8, ff17e000, ff1c58bc, 3e93f4) + 80
 ff096574 void os::handle_unexpected_exception(Thread*,int,unsigned char*,void*) (0, a, fef754bc, a977fb40, fedd86c8, 0) + 2d4
 fedd8f9c JVM_handle_solaris_signal (fef754bc, a977fb40, a977f888, 3400, 35ec, 0) + 91c
 ff384cc8 __sighndlr (a, a977fb40, a977f888, fedd864c, 0, 0) + c
 ff37fb00 call_user_handler (fea31000, a, ff3978c0, a977f888, a977fb40, a) + 254
 ff37fccc sigacthandler (fea31000, a977fb40, a977f888, ff396000, a977fb40, a) + 64
 --- called from signal handler with signal -22867968 (SIG Unknown) ---
 fef754bc void MarkFromRootsClosure::scanOopsInOop(HeapWord*) (a977fd18, bbd6de20, 0, a7000000, 1, a977fd18) + 188
 fef30d9c void BitMap::iterate(BitMapClosure*,unsigned,unsigned) (5eb6f0, a977fd18, 0, 12700000, 1, 0) + 8c
 fef703e0 void CMSCollector::markFromRootsWork(int) (f9420, 1, ff135bee, d429994e, 4b42c8, 0) + 154
 fef70158 void CMSCollector::markFromRoots(int) (f9420, 1, 9999999a, 2f648, 4b42c8, 0) + 120
 fef6daa0 void CMSCollector::collect_in_background(int) (ff135bf3, fef6d828, 7d0, 4000, 417c, 0) + 238
 fef77e34 void ConcurrentMarkSweepThread::run() (3c00, 4c00, 5400, 55f0, 3c00, 3ffc) + 438
 fee65600 _start   (101a00, fea31000, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

The second one is more interesting since multiple worker threads appear to have run
into problems:

-----------------  lwp# 3 / thread# 3  --------------------
 ff096564 void os::handle_unexpected_exception(Thread*,int,unsigned char*,void*) (0, a, fef76bd0, fc67fcd8, fedd86c8, 0) + 2c4
 fedd8f9c JVM_handle_solaris_signal (fef76bd0, fc67fcd8, fc67fa20, 3400, 35ec, 0) + 91c
 ff384cc8 __sighndlr (a, fc67fcd8, fc67fa20, fedd864c, 0, 0) + c
 ff37fb00 call_user_handler (fea30200, 3, ff3978c0, fc67fa20, fc67fcd8, a) + 254
 ff37fccc sigacthandler (fea30200, fc67fcd8, fc67fa20, ff396000, fc67fcd8, a) + 64
 --- called from signal handler with signal -22871552 (SIG Unknown) ---
 fef76bd0 void Par_MarkRefsIntoAndScanClosure::trim_queue(unsigned) (fc67fe34, 0, fc67fe34, fc67fe28, 1, 0) + d8
 fef718f4 void CMSParRemarkTask::work(int) (a977fb84, 4, 0, 4000, 417c, 1) + 278
 ff0fb4c8 void GangWorker::run() (9b518, 3, 40, 0, 40, 0) + ac
 fee65600 _start   (9b518, fea30200, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 6 / thread# 6  --------------------
 ff384c0c __sigprocmask (2, fea30800, 0, fa27f740, fa27f690, fa27f710) + 8
 ff380250 sigprocmask (2, fa27f740, 0, 6, ff396000, ff340430) + 20
 ff2ce8e8 sigrelse (6, 0, 0, ffffffff, ff3403bc, 0) + 5c
 ff2b5954 abort    (ff33c000, fa27f840, 0, 4, 0, fa27f861) + f8
 ff098260 void os::abort(int) (1, 0, 0, 0, 0, 0) + 80
 ff09a260 exception_handler_during_fatal_error (a, 0, fa27faa0, 0, 0, 0) + 14
 ff384cc8 __sighndlr (a, 0, fa27faa0, ff09a24c, 0, 0) + c
 ff37fb00 call_user_handler (fea30800, 6, ff3978c0, fa27faa0, 0, a) + 254
 ff37fccc sigacthandler (fea30800, 0, fa27faa0, ff396000, 0, a) + 64
 --- called from signal handler with signal -22870016 (SIG Unknown) ---
 fef76bd0 void Par_MarkRefsIntoAndScanClosure::trim_queue(unsigned) (fa27fe34, 0, fa27fe34, fa27fe28, 1, 0) + d8
 fef718f4 void CMSParRemarkTask::work(int) (a977fb84, 6, 0, 4000, 417c, 1) + 278
 ff0fb4c8 void GangWorker::run() (9d770, 6, 40, 0, 40, 0) + ac
 fee65600 _start   (9d770, fea30800, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 8 / thread# 8  --------------------
 ff31f798 _write   (14, ff34029c, ff33c000, 0, ff33fca0, ff152d1a) + c
 ff095a00 void report_fatal_error_simple() (0, ff343a4c, ff33fca0, 0, ff33fca0, ff1530be) + 178
 ff096264 void os::handle_recursive_fatal_error(int) (a, 0, 0, 0, 0, 0) + 70
 ff09a260 exception_handler_during_fatal_error (a, 0, fa07fb10, 0, 0, 0) + 14
 ff384cc8 __sighndlr (a, 0, fa07fb10, ff09a24c, 0, 0) + c
 ff37fb00 call_user_handler (fea30c00, 8, ff3978c0, fa07fb10, 0, a) + 254
 ff37fccc sigacthandler (fea30c00, 0, fa07fb10, ff396000, 0, a) + 64
 --- called from signal handler with signal -22868992 (SIG Unknown) ---
 fef71928 void CMSParRemarkTask::work(int) (a977fb84, 3, 0, 4000, 417c, 1) + 2ac
 ff0fb4c8 void GangWorker::run() (9ea50, 8, 40, 0, 40, 0) + ac
 fee65600 _start   (9ea50, fea30c00, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)


The CMS thread and the remaining GC worker threads appear to be doing OK:

-----------------  lwp# 10 / thread# 10  --------------------
 ff31f170 ___lwp_cond_wait (9ab88, 9ab70, 0, 2f580, 4b42c8, ff00) + 8
 fed952cc int Monitor::wait(int,long) (9ab40, 0, 0, 4000, 417c, 21c070) + 104
 ff0fb260 void WorkGang::run_task(AbstractGangTask*) (9ab10, a977fb84, ff164208, 1, a977fba4, ff17e000) + 6c
 fef71dc0 void CMSCollector::do_remark_parallel() (f93b0, ff135c52, 1, 0, 0, ff1c9680) + d8
 fef71594 void CMSCollector::checkpointRootsFinalWork(int,int,int) (f93b0, 1, 0, 0, 2a878, fede1c34) + 1a4
 fef73d40 void CMSCollector::doCMSOperation(CMSCollector::CMS_op_type) (ff1bd218, 1, 5000, 50dc, 5000, 0) + 2d4
 fef740a8 int CMSCollector::stopWorldAndDo(CMSCollector::CMS_op_type) (f93b0, 1, ff135bf3, 3fa709ab, adf6d161, 0) + 180
 fef6dd1c void CMSCollector::collect_in_background(int) (ff135bf3, fef6d828, ff1bd16c, ff1bd00c, 417c, 0) + 4b4
 fef77e34 void ConcurrentMarkSweepThread::run() (3c00, 4c00, 5400, 55f0, 3c00, 3ffc) + 438
 fee65600 _start   (101990, fea31000, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 4 / thread# 4  --------------------
 ff384b6c lwp_yield (9c490, 0, 0, 0, 20, 8) + 8
 ff0d6ce0 int ParallelTaskTerminator::offer_termination() (a977fba4, 7, a0c84, fc57fe28, 1, 0) + 40
 fef71940 void CMSParRemarkTask::work(int) (a977fb84, 7, 0, 4000, 417c, 0) + 2c4
 ff0fb4c8 void GangWorker::run() (9c490, 4, 40, 0, 40, 0) + ac
 fee65600 _start   (9c490, fea30400, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 5 / thread# 5  --------------------
 fefa86dc int instanceKlass::oop_oop_iterate_v(oopDesc*,OopClosure*) (e9fc9b30, a9fe8db8, fa37fe34, fa37f9a8, 0, 4) + 130
 ff0ce554 void ContiguousSpace::oop_iterate(OopClosure*) (a1010, fa37fe34, 5b66e30, 0, a9c00000, aa4991d0) + 48
 ff0ce618 void ContiguousSpace::oop_iterate(MemRegion,OopClosure*) (a1010, fa37fc88, fa37fe34, fa37fd50, 45b900, fed5a50c) + a0
 fef9c1a0 void GenerationOopIterateClosure::do_space(Space*) (fa37fd50, a1010, fa37fab8, 0, 1, 0) + 2c
 fee611e0 void DefNewGeneration::space_iterate(SpaceClosure*,int) (9fdc8, fa37fd50, 0, 9fdc8, 8, 0) + 14
 fef9b578 void Generation::oop_iterate(OopClosure*) (9fdc8, fa37fe34, 0, 0, 0, 0) + 4c
 fee25db4 void GenCollectedHeap::process_strong_roots(int,int,int,GenCollectedHeap::ClassScanningOption,OopsInGenClosure*,OopsInGenClosure*) (ff17e000, fa37fe34, ff1d1c84, 1, 1, 0) + 280
 fef7186c void CMSParRemarkTask::work(int) (a977fb84, 1, 0, 4000, 417c, 1) + 1f0
 ff0fb4c8 void GangWorker::run() (9ce00, 5, 40, 0, 40, 0) + ac
 fee65600 _start   (9ce00, fea30600, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 7 / thread# 7  --------------------
 ff384b6c lwp_yield (9e0e0, 0, 0, 0, 20, 8) + 8
 ff0d6ce0 int ParallelTaskTerminator::offer_termination() (a977fba4, 2, a0c70, fa17fe28, 1, 0) + 40
 fef71940 void CMSParRemarkTask::work(int) (a977fb84, 2, 0, 4000, 417c, 1) + 2c4
 ff0fb4c8 void GangWorker::run() (9e0e0, 7, 40, 0, 40, 0) + ac
 fee65600 _start   (9e0e0, fea30a00, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 9 / thread# 9  --------------------
 ff384b6c lwp_yield (9f3c0, 0, 0, 0, 20, 8) + 8
 ff0d6ce0 int ParallelTaskTerminator::offer_termination() (a977fba4, 0, a0c68, f9f7fe28, 1, 0) + 40
 fef71940 void CMSParRemarkTask::work(int) (a977fb84, 0, 0, 4000, 417c, 1) + 2c4
 ff0fb4c8 void GangWorker::run() (9f3c0, 9, 40, 0, 40, 0) + ac
 fee65600 _start   (9f3c0, fea30e00, 0, 0, 0, 0) + 134
 ff384970 _lwp_start (0, 0, 0, 0, 0, 0)

                                    

Comments
SUGGESTED FIX

The following converts the previously-silent work queue overflow into
an abrupt error:

------- concurrentMarkSweepGeneration.cpp -------
3981,3982c3981,3986
<       _work_queue->push(thisOop);
<       trim_queue(CMSWorkQueueDrainThreshold * ParallelGCThreads);
---
>       if (_work_queue->push(thisOop)) {
>         trim_queue(CMSWorkQueueDrainThreshold * ParallelGCThreads);
>       } else { // need fix for 4615723
>         vm_exit_out_of_memory(sizeof(oop),
>           "CMS: Work queue overflow; try -XX:-CMSParallelRemarkEnabled");
>       }
4243c4247,4252
<     _work_queue->push(thisOop);
---
>     if (!_work_queue->push(thisOop)) { // need fix for 4615723
>       vm_exit_out_of_memory(sizeof(oop),
>         "CMS: Work queue overflow; try -XX:-CMSParallelRemarkEnabled");
>     }
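
To illustrate the idea behind the diff above, here is a minimal sketch of a
fixed-capacity queue whose push reports failure, plus a caller that stops
loudly when it does. BoundedQueue and push_or_die are hypothetical names made
up for this sketch; they are not the HotSpot task queue or its API, and
std::abort merely stands in for the vm_exit_out_of_memory call in the patch.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical fixed-capacity LIFO queue (illustration only, not HotSpot code).
// Capacity is 2^LogN slots.
template <typename T, unsigned LogN>
class BoundedQueue {
 public:
  static constexpr std::size_t kCapacity = std::size_t{1} << LogN;

  // Returns false when the queue is full instead of dropping or
  // overwriting an entry.
  bool push(const T& value) {
    if (size_ == kCapacity) return false;
    slots_[size_++] = value;
    return true;
  }

  // Returns false when the queue is empty.
  bool pop(T& out) {
    if (size_ == 0) return false;
    out = slots_[--size_];
    return true;
  }

  std::size_t size() const { return size_; }

 private:
  T slots_[kCapacity];
  std::size_t size_ = 0;
};

// Same shape as the patched call sites: check the result of push and
// terminate with a message rather than continue with a lost entry.
template <typename Queue, typename T>
void push_or_die(Queue& queue, const T& value) {
  if (!queue.push(value)) {
    std::fprintf(stderr, "work queue overflow\n");
    std::abort();
  }
}
```

A caller that ignores the return value of push would silently lose the
entry once the queue is full, which is the failure mode the suggested fix
turns into an explicit error.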

The following makes the overflow less likely to occur (or at
least requires more adversarial structures and mutation for
the error to occur):

------- taskqueue.hpp -------
17c17
<     Log_n = 10
---
>     Log_n = 15     // until bug 4615723 is fixed (Tiger)

The real bug 4615723 is fixed correctly in Tiger. The above
is just a temporary palliative.

An unfortunate consequence of the latter is that the
"footprint" will increase (by 31K per processor). This,
I think, is a reasonable tradeoff for reducing the
probability of abrupt stoppage.
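
As a rough check of the footprint figure, the sketch below assumes that each
queue holds 2^Log_n slots; the report shows only the constant, so that
relationship is an assumption. Under it, raising Log_n from 10 to 15 adds
31 * 1024 slots per queue, which matches the "31K" quoted above if that
figure counts slots. The cost in bytes depends on the slot size, which is
left as a parameter here. The function names are hypothetical.

```cpp
#include <cstddef>

// Slots in a queue, assuming capacity is 2^log_n (assumption, see lead-in).
constexpr std::size_t slots_for(unsigned log_n) {
  return std::size_t{1} << log_n;
}

// Additional slots per queue when Log_n is raised.
constexpr std::size_t extra_slots(unsigned old_log_n, unsigned new_log_n) {
  return slots_for(new_log_n) - slots_for(old_log_n);
}

// Additional bytes per queue for a given slot size (hypothetical parameter).
constexpr std::size_t extra_bytes(unsigned old_log_n, unsigned new_log_n,
                                  std::size_t slot_bytes) {
  return extra_slots(old_log_n, new_log_n) * slot_bytes;
}

// 2^15 - 2^10 = 32768 - 1024 = 31744 = 31 * 1024.
static_assert(extra_slots(10, 15) == 31 * 1024, "31K additional slots");
```

For example, extra_bytes(10, 15, 4) evaluates to 126976, so with 4-byte
slots the increase would be 124 KB per queue rather than 31 KB.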
                                     
2004-07-28
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
1.4.2_05
generic

FIXED IN:
1.4.2_05

INTEGRATED IN:
1.4.2_05


                                     
2004-07-28
WORK AROUND

If our hunch is correct (see the comments and suggested fix sections), then
run with the following flag:
  -XX:-CMSParallelRemarkEnabled
                                     
2004-07-28
EVALUATION

See the comments section for the current status of the investigation.

###@###.### 2004-02-10: Adding keyword tiger-na
based on our analysis, and pending results from customer
testing (with a tiger libjvm.so).
                                     
2004-02-10


