JDK-8040804 : G1: Concurrent mark stuck in loop calling os::elapsedVTime()
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u20, 9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2014-04-17
  • Updated: 2017-07-26
  • Resolved: 2014-05-14
JDK 8: 8u20 (Fixed)
JDK 9: 9 b15 (Fixed)
Description
Concurrent mark threads use os::elapsedVTime() to regulate their CPU usage over time. However, it seems that these threads can get stuck in a loop calling os::elapsedVTime() over and over without making any progress, which causes the whole concurrent marking phase to hang.
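For context, the pacing works roughly like this: a marking step records the worker thread's consumed CPU time ("vtime") when it starts and then periodically re-reads it, giving up once a small time budget has been spent. The following is only a sketch of that idea for Linux, using getrusage(RUSAGE_THREAD) directly (which is what the os::elapsedVTime() frames in the stack traces below end up calling); the function names and millisecond units here are mine, not HotSpot's.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for RUSAGE_THREAD on Linux/glibc
#endif
#include <sys/resource.h>
#include <cstdio>

// Per-thread CPU time (user + system) in milliseconds, via getrusage().
static double elapsed_vtime_ms() {
  struct rusage usage;
  getrusage(RUSAGE_THREAD, &usage);
  return (usage.ru_utime.tv_sec + usage.ru_stime.tv_sec) * 1000.0 +
         (usage.ru_utime.tv_usec + usage.ru_stime.tv_usec) / 1000.0;
}

int main() {
  const double time_target_ms = 10.0;  // CPU-time budget for one "marking step"
  double start_ms = elapsed_vtime_ms();

  // A marking step would periodically perform a "regular clock" check like
  // this and abort the step once its CPU-time budget is used up.
  volatile unsigned long work = 0;
  while (elapsed_vtime_ms() - start_ms < time_target_ms) {
    ++work;                            // stand-in for actual marking work
  }
  printf("used %.2f ms of CPU time, did %lu units of work\n",
         elapsed_vtime_ms() - start_ms, work);
  return 0;
}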

Comments
Regression test exists.
26-07-2017

Some information from the core file for the b4700099 run on Linux i586. All worker threads are "stuck" here:

#0  0xf772c430 in __kernel_vsyscall ()
#1  0x41ff1d76 in getrusage () from /lib/libc.so.6
#2  0xf6d9fd70 in os::elapsedVTime () at /opt/jprt/T/P1/185537.stefank/s/src/os/linux/vm/os_linux.cpp:1329
#3  0xf67881dd in regular_clock_call (this=0xbed72bd8, time_target_ms=10, do_termination=true, is_serial=false) at /opt/jprt/T/P1/185537.stefank/s/src/share/vm/gc_implementation/g1/concurrentMark.cpp:3577
#4  CMTask::do_marking_step (this=0xbed72bd8, time_target_ms=10, do_termination=true, is_serial=false) at /opt/jprt/T/P1/185537.stefank/s/src/share/vm/gc_implementation/g1/concurrentMark.cpp:4247
#5  0xf67904b7 in CMConcurrentMarkingTask::work(unsigned int) () from /tmp/pliden/linux_i586_2.6-fastdebug/jre/lib/i386/server/libjvm.so
#6  0xf701f88b in GangWorker::loop (this=0xe3fdf400) at /opt/jprt/T/P1/185537.stefank/s/src/share/vm/utilities/workgroup.cpp:327
#7  0xf701e0d8 in GangWorker::run (this=0xe3fdf400) at /opt/jprt/T/P1/185537.stefank/s/src/share/vm/utilities/workgroup.cpp:240
#8  0xf6d9b569 in java_start (thread=0xe3fdf400) at /opt/jprt/T/P1/185537.stefank/s/src/os/linux/vm/os_linux.cpp:825
#9  0x420e9a49 in start_thread () from /lib/libpthread.so.0
#10 0x41ffae1e in clone () from /lib/libc.so.6

I suspect that they aren't really stuck inside getrusage(), but rather looping in the outer loop; getrusage() just happens to be the most time-consuming operation there, which makes it highly likely that we see the threads in it at any given time. The outer loop looks like this:

concurrentMark.cpp:
4218   while (!has_aborted() && _curr_region == NULL && !_cm->out_of_regions()) {
4219     // We are going to try to claim a new region. We should have
4220     // given up on the previous one.
4221     // Separated the asserts so that we know which one fires.
4222     assert(_curr_region == NULL, "invariant");
4223     assert(_finger == NULL, "invariant");
4224     assert(_region_limit == NULL, "invariant");
4225     if (_cm->verbose_low()) {
4226       gclog_or_tty->print_cr("[%u] trying to claim a new region", _worker_id);
4227     }
4228     HeapRegion* claimed_region = _cm->claim_region(_worker_id);
4229     if (claimed_region != NULL) {
4230       // Yes, we managed to claim one
4231       statsOnly( ++_regions_claimed );
4232
4233       if (_cm->verbose_low()) {
4234         gclog_or_tty->print_cr("[%u] we successfully claimed "
4235                                "region "PTR_FORMAT,
4236                                _worker_id, claimed_region);
4237       }
4238
4239       setup_for_region(claimed_region);
4240       assert(_curr_region == claimed_region, "invariant");
4241     }
4242     // It is important to call the regular clock here. It might take
4243     // a while to claim a region if, for example, we hit a large
4244     // block of empty regions. So we need to call the regular clock
4245     // method once round the loop to make sure it's called
4246     // frequently enough.
4247     regular_clock_call();
4248   }

(gdb) p _has_aborted
$29 = false
(gdb) p _curr_region
$30 = (HeapRegion *) 0x0

_cm->out_of_regions() looks like this:

bool out_of_regions() { return _finger == _heap_end; }

(gdb) p _cm->_finger
$31 = (HeapWord * volatile) 0xd0600000
(gdb) p _cm->_heap_end
$32 = (HeapWord *) 0xd0500000

which would mean that out_of_regions() returns false, even though _finger is past _heap_end. This looks like a potential bug!
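Given those values, one way to make the check robust against a finger that has overshot the heap end would be to compare with >= instead of ==. The following standalone sketch only illustrates that idea with the values from the core file; the struct is a simplified stand-in, not the real ConcurrentMark class, and this is not necessarily the fix that ends up being integrated.

#include <cstdio>

typedef char HeapWord;  // stand-in type so the pointer comparison reads like the original

struct CMSketch {
  HeapWord* volatile _finger;   // global finger, from the core: 0xd0600000
  HeapWord*          _heap_end; // from the core:                0xd0500000

  // Check as it is today: returns false once the finger has gone *past* the end.
  bool out_of_regions_eq() { return _finger == _heap_end; }

  // Hardened variant: a finger at or beyond the heap end means "out of regions".
  bool out_of_regions_ge() { return _finger >= _heap_end; }
};

int main() {
  CMSketch cm;
  cm._finger   = reinterpret_cast<HeapWord*>(0xd0600000u);
  cm._heap_end = reinterpret_cast<HeapWord*>(0xd0500000u);

  printf("== : %s\n", cm.out_of_regions_eq() ? "out of regions" : "keep looping");
  printf(">= : %s\n", cm.out_of_regions_ge() ? "out of regions" : "keep looping");
  return 0;
}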
Some of the other task-local state in this loop (note that this _finger is the local finger and not the _cm->_finger above):

(gdb) p _finger
$34 = (HeapWord *) 0x0
(gdb) p _region_limit
$35 = (HeapWord *) 0x0
(gdb) p claimed_region
$33 = (HeapRegion *) 0x0

When _cm->_finger is past _cm->_heap_end, _cm->claim_region() will return NULL, which means that this loop basically ends up doing nothing except calling regular_clock_call() over and over. So all conditions for the loop are true and nothing inside the loop makes any progress. At some point regular_clock_call() will detect that we have spent our time quota (_time_target_ms = 9.03 ms) and abort the task. However, since _cm->_has_aborted is still false and _cm->_sleep_factor is 0, we will eventually end up right back in the loop starting at concurrentMark.cpp:4218 and spend another 9.03 ms calling os::elapsedVTime() -> getrusage() over and over.

So it seems that the main bug here is that _cm->_finger has gone past _cm->_heap_end, and that _cm->out_of_regions() only checks whether the finger is equal to the heap end (and not greater than or equal to it). It's a bit unclear to me whether it is OK to advance the finger past the heap end, and even if that is OK I can't see how it could happen here. The heap can only shrink at a Full GC, and since _cm->_has_aborted = false we know that has not happened. The code advancing _cm->_finger looks sane, but I might be missing something there.

For completeness, here is all of the task's state:

(gdb) p *this
$36 = {<TerminatorTerminator> = {<CHeapObj<1792u>> = {<AllocatedObj> = {_vptr.AllocatedObj = 0xf70a6488}, <No data fields>}, <No data fields>}, _worker_id = 5, _g1h = 0xf60140c0, _cm = 0xe3fd5ff8, _nextMarkBitMap = 0xe3fd6054, _task_queue = 0xbed62310, _task_queues = 0xe3fd6500, _claimed = true, _calls = 180539, _time_target_ms = 9.0334619913384859, _start_time_ms = 1793564.3359999999, _cm_oop_closure = 0xbf0feee4, _curr_region = 0x0, _finger = 0x0, _region_limit = 0x0, _words_scanned = 0, _words_scanned_limit = 12288, _real_words_scanned_limit = 12288, _refs_reached = 0, _refs_reached_limit = 384, _real_refs_reached_limit = 384, _hash_seed = 1362830082, _has_aborted = false, _has_timed_out = false, _draining_satb_buffers = false, _step_times_ms = {<AbsSeq> = {<CHeapObj<1792u>> = {<AllocatedObj> = {_vptr.AllocatedObj = 0xf721a108}, <No data fields>}, _num = 180540, _sum = 1793554.3369999996, _sum_of_squares = 17828980.480897442, _davg = 9.9466015867034407, _dvariance = 0.034413660330439334, _alpha = 0.69999999999999996}, _last = 9.9979999999050051, _maximum = 10.99800000002142}, _elapsed_time_ms = 289.00764000000004, _termination_time_ms = 0, _termination_start_time_ms = 0, _concurrent = true, _marking_step_diffs_ms = {<AbsSeq> = {<CHeapObj<1792u>> = {<AllocatedObj> = {_vptr.AllocatedObj = 0xf721a0c8}, <No data fields>}, _num = 10, _sum = 8.4461990126265061, _sum_of_squares = 7.9392560142806934, _davg = 0.8796130682819312, _dvariance = 0.030223781039975618, _alpha = 0.69999999999999996}, _sequence = 0xbed72d08, _length = 10, _next = 8}, _marked_bytes_array = 0xbed723a8, _card_bm = 0xe3fe0bd0}

Locals in frame #4:

(gdb) info locals
claimed_region = 0x0
g1_policy = <value optimized out>
do_stealing = true
diff_prediction_ms = 0.8796130682819312
bitmap_closure = {<BitMapClosure> = {_vptr.BitMapClosure = 0xf70a6500}, _nextMarkBitMap = 0xe3fd6054, _cm = 0xe3fd5ff8, _task = 0xbed72bd8}
cm_oop_closure = {<ExtendedOopClosure> = {<OopClosure> = {<Closure> = {<StackObj> = {<AllocatedObj> = {_vptr.AllocatedObj = 0xf70a6548}, <No data fields>}, <No data fields>}, <No data fields>}, _ref_processor = 0xbe76f930}, _g1h = 0xf60140c0, _cm = 0xe3fd5ff8, _task = 0xbed72bd8}
elapsed_time_ms = 0
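To make the failure mode concrete, here is a small toy model of the cycle described above (not HotSpot code; the names, the values and the "clock calls per step" quota are simplified stand-ins): claim_region() keeps returning NULL, out_of_regions() never becomes true because of the == comparison, the step aborts once its quota is spent, and the task is immediately restarted into the same loop.

#include <cstdio>
#include <cstdint>

// Simplified stand-in for the relevant ConcurrentMark state.
struct ToyCM {
  uintptr_t finger   = 0xd0600000;  // global finger, past the heap end (values from the core)
  uintptr_t heap_end = 0xd0500000;

  bool out_of_regions() const { return finger == heap_end; }  // the suspect "==" check
  const void* claim_region() const { return nullptr; }        // nothing left to claim
};

int main() {
  ToyCM cm;
  const int clock_calls_per_step = 1000;  // stand-in for the ~9 ms CPU-time quota

  // Each iteration models one do_marking_step() invocation; the inner while
  // mirrors the claim loop at concurrentMark.cpp:4218.
  for (int step = 0; step < 3; step++) {
    const void* curr_region = nullptr;
    int clock_calls = 0;
    while (curr_region == nullptr && !cm.out_of_regions()) {
      curr_region = cm.claim_region();                   // always NULL here
      if (++clock_calls >= clock_calls_per_step) break;  // quota hit -> abort the step
    }
    // _cm->_has_aborted is false and _sleep_factor is 0, so the task is
    // re-run straight away and lands in the same loop again.
    printf("step %d: %d clock calls, no region claimed, no progress\n",
           step, clock_calls);
  }
  return 0;
}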
21-04-2014

The support for os::elapsedVTime() was added in JDK-8013895. It's unclear if os::elapsedVTime() itself is the real problem here; it seems very strange that we would get stuck there. There might be a problem with the outer loop, or something else that results in us calling os::elapsedVTime() a lot, which is why it shows up in the stack traces.
17-04-2014

I = Hanging concurrent mark and no mixed GCs -> H
L = Unclear, but since this has gone unnoticed for a long time -> L
W = None known -> H
ILW = HLH = P2
17-04-2014