Java 1.4.2_05 with ParNewGC and CMS is "locking up" on us, in 64-bit mode with a 10gb heap... We are not really locked, just busy, with no feedback or progress (although we do respond to kill -QUIT for threaed dumps). It's possible we have a "serial" part of the safepoint process which doesn't scale particularly well. Installing _07 was planned for testing since before _08 was out, although the VM code mentioned in the stacks below looks the same in _08 and 1.5. The heap is filling up, and CMS is blocked waiting for the VMThread, which is shown in the pstack below performing a safepoint. CMS never gets to run in the time (half an hour?!) that the customer was able to wait. ----------------- lwp# 20 / thread# 20 -------------------- ffffffff7efa81bc lwp_cond_wait (10011afd8, 10011afc0, 0) ffffffff7efa0090 _lwp_cond_wait (10011afd8, 10011afc0, 5594c0, ffffffff7e9510a4, 0, 0) + c ffffffff7e654054 int Monitor::wait(int,long) (ffffffff7eb70590, 1, 0, 0, 9000, 91a8) + f4 ffffffff7e7e3038 void ConcurrentMarkSweepThread::synchronize(int) (1, ffffffff7efa54d4, 0, 10011afc0, ffffffff7e7e0448, 0) + 110 ffffffff7e7e0520 void MarkFromRootsClosure::do_yield_work() (fffffffcb26ffb88, 37, 1b7, 1, 100710ba0, 100710c48) + f8 ffffffff7e7e06e8 void MarkFromRootsClosure::scanOopsInOop(HeapWord*) (fffffffcb26ffb88, fffffffe88856190, fffffffd4ec00000, 0, 0, fffffffcb26ff650) + 1a8 ffffffff7e7a3af0 void BitMap::iterate(BitMapClosure*,unsigned long,unsigned long) (4e71580, fffffffcb26ffb88, 0, 3cf80000, 100710ba0, 100710c48) + 80 ffffffff7e7dc718 void CMSCollector::markFromRootsWork(int) (100710970, 1, ffffffff7e9ac085, 10011afc0, ffffffff7e7dc52c, 0) + 140 ffffffff7e7dc564 void CMSCollector::markFromRoots(int) (100710970, 1, 0, 10011afc0, ffffffff7e7da134, 0) + 64 ffffffff7e7da23c void CMSCollector::collect_in_background(int) (100710d68, 0, 0, 10011afc0, ffffffff7e7e2c78, 0) + 20c ffffffff7e7e2c78 void ConcurrentMarkSweepThread::run() (100710dc0, 0, b800, bac0, 8c00, 8ea0) + 258 ffffffff7e6cc4fc _start (100710dc0, 0, 0, 0, 0, 0) + ec ffffffff7f3181a8 _lwp_start (0, 0, 0, 0, 0, 0) ----------------- lwp# 21 / thread# 21 -------------------- ffffffff7e5c4d94 void RelocIterator::set_limits(unsigned char*,unsigned char*) (fffffffcafeff858, ffffffff3a73d060, ffffffff3a73d061, 0, 0, 0) + 70 ffffffff7e5c4ce0 void RelocIterator::initialize(long,CodeBlob*,unsigned char*,unsigned char*) (fffffffcafeff858, 0, ffffffff3a73c750, ffffffff3a73d060, ffffffff3a73d061, 0) + 120 ffffffff7e5e1100 RelocIterator virtual_call_Relocation::parse_ic(CodeBlob*&,unsigned char*&,unsigned char*&,oopDesc**&,int*) (fffffffcafeff960, fffffffcafeff9b8, fffffffcafeff9c8, fffffffcafeff9c0, 100a31f38, 100a31f98) + b8 ffffffff7e5e536c CompiledIC::CompiledIC #Nvariant 1(Relocation*) (100a31f30, fffffffcafeffa88, ffffffff3a73c750, ffffffff3a73c8b4, 0, 0) + 34 ffffffff7e625f84 void nmethod::cleanup_inline_caches() (ffffffff3a73c750, ffffffff3a73eec0, 0, 0, 9000, 91a8) + 1bc ffffffff7e691e54 void NMethodSweeper::sweep() (10011ecd0, 0, ffffffffffffffff, ffffffff7eb9b108, ffffffff7eb8b270, 0) + 1b4 ffffffff7e692104 void SafepointSynchronize::begin() (10011ecd0, ffffffff7eb76224, b000, b088, b000, b388) + 224 ffffffff7e74a428 void VMThread::loop() (7400, 9400, 9510, 9400, 9420, 8400) + 2c8 ffffffff7e749c64 void VMThread::run() (10076f4f0, 40, 0, 0, 0, 0) + 94 ffffffff7e6cc4fc _start (10076f4f0, 0, 0, 0, 0, 0) + ec ffffffff7f3181a8 _lwp_start (0, 0, 0, 0, 0, 0) Analysed a gcore from a previous lockup, and cleanup_inline_caches was going through a for... loop, so it should complete, but takes a long, long time for some reason. We can get Thread Dumps, so maybe the VMThread finishes the "sweep", does the thread dump and then does another sweep, not giving CMS a chance to finish it's mark? Although this sounds unlikely as all calls look like the same NMethodSweeper::sweep call (identical pstack line). The last thing SafepointSynchronize::begin() does is to call do_cleanup_tasks, which includes the call to NMethodSweeper::sweep() . So the sweep happens at every safepoint. So it must have comleted fine before, but now takes "forever"... sweep contains the for..loop which calls process_nmethod, which calls void nmethod::cleanup_inline_caches() . )$ grep cleanup_inline_caches psta* pstack1_1.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3a73c750, ffffffff3a73eec0, 0, 0, 9000, 91a8) + 1bc pstack1_2.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b3d9210, ffffffff3b407880, 0, 0, 9000, 91a8) + 1bc pstack1_3.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b298b90, ffffffff3b29cd80, 0, 0, 9000, 91a8) + 1bc pstack1_5.log: ffffffff7e625f64 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3a3151d0, ffffffff3a3156c0, 0, 0, 9000, 91a8) + 19c pstack1_6.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b4c5450, ffffffff3b4cce00, 0, 0, 9000, 91a8) + 1bc .. so that shows 5 different NMethods. Sounds like we are iterating as designed, but it's taking so long that 6 sequential pstacks catch the same loop, and the application never appears to progress. Certainly, the CMS would log GC information if it ever got to run again... ###@###.### 2005-04-27 15:33:06 GMT
|