Bug ID: JDK-6262235 NMethodSweeper::sweep takes a long time (mostly in nmethod::cleanup_inline

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 1.4.2_05

Priority: P1
Status: Closed
Resolution: Fixed
OS: solaris_9
CPU: sparc

Submitted: 2005-04-27
Updated: 2010-05-11
Resolved: 2005-08-10

Other	JDK 6
1.4.2_09Fixed	6 b47Fixed

Java 1.4.2_05 with ParNewGC and CMS is "locking up" on us, in 64-bit mode with a 10gb heap...  We are not really locked, just busy, with no feedback or progress (although we do respond to kill -QUIT for threaed dumps).

It's possible we have a "serial" part of the safepoint process which doesn't scale particularly well.  Installing _07 was planned for testing since before _08 was out, although the VM code mentioned in the stacks below looks the same in _08 and 1.5.  

The heap is filling up, and CMS is blocked waiting for the VMThread, which is shown in the pstack below performing a safepoint.

CMS never gets to run in the time (half an hour?!) that the customer was able to wait.

-----------------  lwp# 20 / thread# 20  --------------------
 ffffffff7efa81bc lwp_cond_wait (10011afd8, 10011afc0, 0)
 ffffffff7efa0090 _lwp_cond_wait (10011afd8, 10011afc0, 5594c0, ffffffff7e9510a4, 0, 0) + c
 ffffffff7e654054 int Monitor::wait(int,long) (ffffffff7eb70590, 1, 0, 0, 9000, 91a8) + f4
 ffffffff7e7e3038 void ConcurrentMarkSweepThread::synchronize(int) (1, ffffffff7efa54d4, 0, 10011afc0, ffffffff7e7e0448, 0) + 110
 ffffffff7e7e0520 void MarkFromRootsClosure::do_yield_work() (fffffffcb26ffb88, 37, 1b7, 1, 100710ba0, 100710c48) + f8
 ffffffff7e7e06e8 void MarkFromRootsClosure::scanOopsInOop(HeapWord*) (fffffffcb26ffb88, fffffffe88856190, fffffffd4ec00000, 0, 0, fffffffcb26ff650) + 1a8
 ffffffff7e7a3af0 void BitMap::iterate(BitMapClosure*,unsigned long,unsigned long) (4e71580, fffffffcb26ffb88, 0, 3cf80000, 100710ba0, 100710c48) + 80
 ffffffff7e7dc718 void CMSCollector::markFromRootsWork(int) (100710970, 1, ffffffff7e9ac085, 10011afc0, ffffffff7e7dc52c, 0) + 140
 ffffffff7e7dc564 void CMSCollector::markFromRoots(int) (100710970, 1, 0, 10011afc0, ffffffff7e7da134, 0) + 64
 ffffffff7e7da23c void CMSCollector::collect_in_background(int) (100710d68, 0, 0, 10011afc0, ffffffff7e7e2c78, 0) + 20c
 ffffffff7e7e2c78 void ConcurrentMarkSweepThread::run() (100710dc0, 0, b800, bac0, 8c00, 8ea0) + 258
 ffffffff7e6cc4fc _start (100710dc0, 0, 0, 0, 0, 0) + ec
 ffffffff7f3181a8 _lwp_start (0, 0, 0, 0, 0, 0)

-----------------  lwp# 21 / thread# 21  --------------------
 ffffffff7e5c4d94 void RelocIterator::set_limits(unsigned char*,unsigned char*) (fffffffcafeff858, ffffffff3a73d060, ffffffff3a73d061, 0, 0, 0) + 70
 ffffffff7e5c4ce0 void RelocIterator::initialize(long,CodeBlob*,unsigned char*,unsigned char*) (fffffffcafeff858, 0, ffffffff3a73c750, ffffffff3a73d060, ffffffff3a73d061, 0) + 120
 ffffffff7e5e1100 RelocIterator virtual_call_Relocation::parse_ic(CodeBlob*&,unsigned char*&,unsigned char*&,oopDesc**&,int*) (fffffffcafeff960, fffffffcafeff9b8, fffffffcafeff9c8, fffffffcafeff9c0, 100a31f38, 100a31f98) + b8
 ffffffff7e5e536c CompiledIC::CompiledIC #Nvariant 1(Relocation*) (100a31f30, fffffffcafeffa88, ffffffff3a73c750, ffffffff3a73c8b4, 0, 0) + 34
 ffffffff7e625f84 void nmethod::cleanup_inline_caches() (ffffffff3a73c750, ffffffff3a73eec0, 0, 0, 9000, 91a8) + 1bc
 ffffffff7e691e54 void NMethodSweeper::sweep() (10011ecd0, 0, ffffffffffffffff, ffffffff7eb9b108, ffffffff7eb8b270, 0) + 1b4
 ffffffff7e692104 void SafepointSynchronize::begin() (10011ecd0, ffffffff7eb76224, b000, b088, b000, b388) + 224
 ffffffff7e74a428 void VMThread::loop() (7400, 9400, 9510, 9400, 9420, 8400) + 2c8
 ffffffff7e749c64 void VMThread::run() (10076f4f0, 40, 0, 0, 0, 0) + 94
 ffffffff7e6cc4fc _start (10076f4f0, 0, 0, 0, 0, 0) + ec
 ffffffff7f3181a8 _lwp_start (0, 0, 0, 0, 0, 0)



Analysed a gcore from a previous lockup, and cleanup_inline_caches was going through a for... loop, so it should complete, but takes a long, long time for some reason.

We can get Thread Dumps, so maybe the VMThread finishes the "sweep", does the thread dump and then does another sweep, not giving CMS a chance to finish it's mark?  Although this sounds unlikely as all calls look like the same NMethodSweeper::sweep call  (identical pstack line).

The last thing SafepointSynchronize::begin()  does is to call do_cleanup_tasks, which includes the call to  NMethodSweeper::sweep() .  So the sweep happens at every safepoint.  So it must have comleted fine before, but now takes "forever"...

sweep contains the for..loop which calls process_nmethod, which calls void nmethod::cleanup_inline_caches()  .

)$ grep cleanup_inline_caches psta*
pstack1_1.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3a73c750, ffffffff3a73eec0, 0, 0, 9000, 91a8) + 1bc
pstack1_2.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b3d9210, ffffffff3b407880, 0, 0, 9000, 91a8) + 1bc
pstack1_3.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b298b90, ffffffff3b29cd80, 0, 0, 9000, 91a8) + 1bc
pstack1_5.log: ffffffff7e625f64 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3a3151d0, ffffffff3a3156c0, 0, 0, 9000, 91a8) + 19c
pstack1_6.log: ffffffff7e625f84 __1cHnmethodVcleanup_inline_caches6M_v_ (ffffffff3b4c5450, ffffffff3b4cce00, 0, 0, 9000, 91a8) + 1bc


.. so that shows 5 different NMethods.  Sounds like we are iterating as designed, but it's taking so long that 6 sequential pstacks catch the same loop, and the application never appears to progress.  Certainly, the CMS would log GC information if it ever got to run again...


###@###.### 2005-04-27 15:33:06 GMT

SUGGESTED FIX http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/c2_baseline/2005/20050801180907.never.c2/workspace/webrevs/webrev-2005.08.01/index.html

02-08-2005

SUGGESTED FIX http://javaweb.sfbay/~never/webrev/6262235 ###@###.### 2005-04-29 22:35:00 GMT

27-04-2005

EVALUATION This appears to be a long standing bug with the maintenance of counters in the sweeper. See comments. ###@###.### 2005-04-29 22:35:00 GMT

27-04-2005