JDK-7013718 : G1: small fixes for two assert/guarantee failures
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs20
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2011-01-20
  • Updated: 2013-09-18
  • Resolved: 2011-04-24
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6 JDK 7 Other
6u25Fixed 7Fixed hs20Fixed
Related Reports
Relates :  
Relates :  
Description
The changes for the zero filling thread removal (6977804) introduced two minor assert failures:

Internal Error at g1CollectorPolicy.cpp:1771, pid=28818, tid=12
guarantee(hr->is_young() && hr->age_in_surv_rate_group() != -1) failed: invariant

and

Internal Error (/tmp/jprt/P3/B/003718.ap31282/source/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp:201), pid=6070, tid=139783754409744
assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length()) failed: _alloc_search_start: 0 should be valid

Comments
EVALUATION http://hg.openjdk.java.net/hsx/hsx20/baseline/rev/bd2e08334e84
09-02-2011

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/a672e43650cc
22-01-2011

SUGGESTED FIX For the assert failure: Only check the assert if find_contiguous() returns a value that's not -1 (i.e., it found an entry that satisfies the allocation request): --- a/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp +++ b/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp @@ -195,10 +195,10 @@ assert(0 <= res && res < _regions.length(), err_msg("res: %d should be valid", res)); _alloc_search_start = res + (int) num; + assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(), + err_msg("_alloc_search_start: %d should be valid", + _alloc_search_start)); } - assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(), - err_msg("_alloc_search_start: %d should be valid", - _alloc_search_start)); return res; }
20-01-2011

EVALUATION For the guarantee failure: First, I should point out that this happens _very_ infrequently. Also note that this failure is not directly caused by the new free list changes but it's uncovered by them (before it would never happen as a pause would always wait for the concurrent cleanup operation to complete; now it can carry on through the pause). The stack trace is this: current thread: t@12 [1] __read(0x0, 0xfffffd7ff35fe310, 0x10, 0x0, 0x1, 0x7fe0), at 0xfffffd7fff293eda [2] read(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff279f21 =>[3] os::message_box(title = ???, message = ???) (optimized), at 0xfffffd7fe8442a59 (line ~4161) in "os_solaris.cpp" [4] VMError::show_message_box(this = ???, buf = ???, buflen = ???) (optimized), at 0xfffffd7fe8a07d1e (line ~50) in "vmError_solaris.cpp" [5] VMError::report_and_die(this = ???) (optimized), at 0xfffffd7fe8a0748d (line ~786) in "vmError.cpp" [6] report_vm_error(file = ???, line = ???, error_msg = ???, detail_msg = ???) (optimized), at 0xfffffd7fe73226d7 (line ~216) in "debug.cpp" [7] G1CollectorPolicy::predict_survivor_regions_evac_time(this = ???) (optimized), at 0xfffffd7fe750fddc (line ~1771) in "g1CollectorPolicy.cpp" [8] G1CollectorPolicy::calculate_young_list_target_length(this = ???, rs_lengths = ???) (optimized), at 0xfffffd7fe750dd32 (line ~499) in "g1CollectorPolicy.cpp" [9] G1CollectorPolicy::record_concurrent_mark_cleanup_completed(this = ???) (optimized), at 0xfffffd7fe7512664 (line ~470) in "g1CollectorPolicy.cpp" [10] ConcurrentMarkThread::run(this = ???) (optimized), at 0xfffffd7fe729f6d1 (line ~254) in "concurrentMarkThread.cpp" [11] java_start(thread_addr = ???) (optimized), at 0xfffffd7fe843349e (line ~1067) in "os_solaris.cpp" [12] _thrp_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff28acf5 [13] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff28afb0 So, at the end of concurrent cleanup we call record_concurrent_mark_cleanup_completed() which updates some of the prediction fields. However, given that the concurrent mark thread does not join the Suspendible Thread Set before it does that, that operation could happen during a pause. This is the log fragment that hints that this is what might be happening: [GC concurrent-cleanup-start] [GC pause (young) 257M->257M(2032M), 0.0652511 secs] [GC pause (partial)# To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/g1CollectorPolicy.cpp:1771 ============================================================================== Unexpected Error ------------------------------------------------------------------------------ Internal Error at g1CollectorPolicy.cpp:1771, pid=28818, tid=12 guarantee(hr->is_young() && hr->age_in_surv_rate_group() != -1) failed: invariant Do you want to debug the problem? To debug, run 'dbx - 28818'; then switch to thread 12 Enter 'yes' to launch dbx automatically (PATH must include dbx) Otherwise, press RETURN to abort... ============================================================================== 259M->254M(2032M), 0.1267242 secs]
20-01-2011

SUGGESTED FIX For the guarantee failure: Make sure the marking thread joins the SuspendibleThreadSet before it calls record_concurrent_mark_cleanup_completed(): --- a/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp +++ b/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp @@ -251,7 +251,9 @@ // Now do the remainder of the cleanup operation. _cm->completeCleanup(); + _sts.join(); g1_policy->record_concurrent_mark_cleanup_completed(); + _sts.leave(); double cleanup_end_sec = os::elapsedTime(); if (PrintGC) {
20-01-2011

EVALUATION For the assert failure: The assert in question: assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(), err_msg("_alloc_search_start: %d should be valid", _alloc_search_start)); is too strong. If _alloc_search_start is 0 when we first call the find_contiguous() method and we fail to find a series of regions that satisfies the allocation request, then it will still be 0 at the end of the method where the assert is (in fact, this is what's happening; the error message prints out its value: "_alloc_search_start: 0"). The assert only holds if find_contiguous() returns a valid reesult and _alloc_search_start += num (so that it cannot be 0).
20-01-2011