Bug ID: JDK-6843694 G1: assert(index < _vs.committed_size(),"bad index"), g1BlockOffsetTable.inline.hpp:55

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 7

Priority: P3
Status: Resolved
Resolution: Fixed
OS: solaris_10
CPU: sparc

Submitted: 2009-05-21
Updated: 2013-09-18
Resolved: 2009-06-30

JDK 6	JDK 7	Other
6u18: Fixed	7: Fixed	hs16: Fixed

Spoke to TonyP about this overflow bug and he suggested making _region_ind an unsigned short in the 32 bit JVM and an unsigned int in the 64 bit JVM.

I tried this and ran into the following assert:

assert(e->num_valid_cards() > 0, "Precondition.")

in RSHashTable::add_entry. The root cause of this problem was making the type unsigned. With an unsigned type, the routine SparsePRTEntry::valid_entry() performed an unsigned comparison of _region_ind with 0. For an empty entry (i.e. one with _region_ind == NullEntry, which is -1) this routine returned true, causing us to (incorrectly) attempt to add the empty entry to an expanded hash table.
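
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the HotSpot code: EntrySketch, kNullEntry, make_empty and both check functions are hypothetical names, and it assumes the validity check has the form "_region_ind >= 0".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sentinel mirroring NullEntry (-1) from the report.
constexpr int kNullEntry = -1;

// Hypothetical stand-in for SparsePRTEntry; only the index field is modeled.
template <typename IndexT>
struct EntrySketch {
  IndexT region_ind;

  // Assumed form of the original check: "is the index non-negative?"
  // For an unsigned IndexT this is true for every possible value.
  bool valid_entry_ge_zero() const { return region_ind >= 0; }

  // Sign-agnostic alternative: compare against the sentinel itself.
  bool valid_entry_ne_null() const {
    return region_ind != static_cast<IndexT>(kNullEntry);
  }
};

// Builds an "empty" entry, i.e. one whose index holds the sentinel.
template <typename IndexT>
EntrySketch<IndexT> make_empty() {
  return EntrySketch<IndexT>{static_cast<IndexT>(kNullEntry)};
}
```

With a signed short the empty entry is rejected by the ">= 0" check; with an unsigned short the sentinel becomes 65535 and the same check accepts it, which matches the behaviour that tripped the precondition in add_entry.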

I am currently trying the existing signed short for the 32-bit JVM and a signed int for the 64-bit JVM. This allows us to index 32767 heap regions (i.e. enough for an unrealizable 32GB heap) in the 32-bit JVM and 2147483647 heap regions (is there enough memory in the world?) in the 64-bit JVM.
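
The limits quoted above follow directly from the widths of the two signed types. A small sketch of the arithmetic, assuming 1 MB heap regions (inferred from a 36g heap spanning 36864 regions); kRegionBytes, max_regions and max_heap_bytes are illustrative names only:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: 1 MB regions, i.e. 36 GB / 36864 regions as in this report.
constexpr std::uint64_t kRegionBytes = 1024ULL * 1024ULL;

// Largest region index representable in a signed index type.
template <typename IndexT>
constexpr std::uint64_t max_regions() {
  return static_cast<std::uint64_t>(std::numeric_limits<IndexT>::max());
}

// Heap size covered by that many regions of kRegionBytes each.
template <typename IndexT>
constexpr std::uint64_t max_heap_bytes() {
  return max_regions<IndexT>() * kRegionBytes;
}
```

Under that assumption a 16-bit signed index covers just under 32 GB of heap, and a 32-bit signed index covers roughly 2048 TB.
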
The assertion that trips is in the following routine:

inline HeapWord*
G1BlockOffsetSharedArray::address_for_index(size_t index) const {
  assert(index < _vs.committed_size(), "bad index");
  HeapWord* result = _reserved.start() + (index << LogN_words);
  assert(result >= _reserved.start() && result < _reserved.end(),
         "bad address from index");
  return result;
}

Attaching to the process with the debugger shows that the value of index coming in is huge.

From the debugger back trace, the caller of this routine is: ScanRSClosure::doHeapRegion:

  bool doHeapRegion(HeapRegion* r) {
    assert(r->in_collection_set(), "should only be called on elements of CS.");
    HeapRegionRemSet* hrrs = r->rem_set();
    if (hrrs->iter_is_complete()) return false; // All done.
    if (!_try_claimed && !hrrs->claim_iter()) return false;
    // If we didn't return above, then
    //   _try_claimed || r->claim_iter()
    // is true: either we're supposed to work on claimed-but-not-complete
    // regions, or we successfully claimed the region.
    HeapRegionRemSetIterator* iter = _g1h->rem_set_iterator(_worker_i);
    hrrs->init_iterator(iter);
    size_t card_index;
    size_t skip_distance = 0, current_card = 0, jump_to_card = 0;
    while (iter->has_next(card_index)) {
      if (current_card < jump_to_card) {
        ++current_card;
        continue;
      }
      HeapWord* card_start = _g1h->bot_shared()->address_for_index(card_index);
                                                 ^^^^^^^^^^^^^^^^^
Again from the debugger we can see that skip_distance, current_card, jump_to_card are all 0. From the above code the value for card_index is obtained from the has_next routine in the HeapRegionRemSetIterator class.

Printing the value of "iter" indicates that the _is field is "Sparse" so the value for card_index is obtained from the SparsePRTIter::has_next method.

Printing _sparse_iter showed that _card_index == 0, _bl_ind == 7, and _card_ind == 1.

Examining the contents of the sparse PRT hash table showed that:
  _capacity == 16
  _capacity_mask == 15
  _occupied_entries == 1
  _occupied_cards == 3

In the hash table, _buckets[7] pointed to _entries[0]. When we examine the contents of _entries[0] we see that the value of _region_ind == -28697. This value is obviously incorrect: it should be either -1 or in the range [0, 36864), where 36864 is the number of regions spanned by the heap. If we reinterpret this value as an unsigned 16-bit value we get 36839, which is in the correct range.

The type of _region_ind in the SparsePRTEntry class is "short" (i.e. a signed 16 bit value) so the maximum positive value it can hold before overflowing is 2^15 - 1 (32767) which is less than the number of heap regions spanned by the heap (36864).
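
The overflow can be illustrated outside the VM. This is a sketch only: store_region_ind, as_unsigned and widen_to_index are hypothetical helpers, and how HotSpot actually derives the card index from _region_ind is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stores a region number in a 16-bit signed field, as a "short" member would.
// Out-of-range values wrap on mainstream compilers (guaranteed from C++20,
// implementation-defined before that).
inline std::int16_t store_region_ind(std::uint32_t region) {
  return static_cast<std::int16_t>(static_cast<std::uint16_t>(region));
}

// Reinterprets the stored 16-bit value as unsigned, recovering the
// region number that was originally intended.
inline std::uint16_t as_unsigned(std::int16_t v) {
  return static_cast<std::uint16_t>(v);
}

// Widens the stored value to size_t the way a signed index is converted
// when it flows into size_t arithmetic: negative values become huge.
inline std::size_t widen_to_index(std::int16_t v) {
  return static_cast<std::size_t>(v);
}
```

Region 36839 does not fit in 16 signed bits and is stored as -28697. Once that negative value is widened to size_t it becomes enormous, which is consistent with the huge index seen in address_for_index.
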
While running SPECjbb2005 on a batoka system with the following flags:

-d64 -stats -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -Xmx36g -Xms36g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

I experienced the following crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/jc234399/ws/6819077c/hotspot/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.inline.hpp:55), pid=1768, tid=85
#  Error: assert(index < _vs.committed_size(),"bad index")
#
# JRE version: 7.0-b59
# Java VM: OpenJDK 64-Bit Server VM (16.0-b02-internal-jvmg mixed mode solaris-sparc )
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

during the first or second G1 evacuation pause.

With a product-build JVM the same test case fails at approximately the same point with a SIGSEGV.

SUGGESTED FIX The suggested fix is to widen the type of the region index field in the SparsePRT entry data structure.
16-06-2009
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/d44bdab1c03d
12-06-2009
PUBLIC COMMENTS Something to be careful about is that, as part of CR 6819085, we are going to have variable region sizes. This might mean that we can potentially have more than 32K regions in a 32-bit JVM. But, I don't think we want to (too many regions!). So, having signed short and int for 32-bit and 64-bit JVMs respectively is fine and I'll make sure that, when the region size is set in the 32-bit JVM, it should be set to a number that will not cause more than 32K regions to be created.
21-05-2009
EVALUATION See Description.
21-05-2009