During testing of the PermGen removal changes, the development engineers started to see assertion failures and crashes in G1's implementation of the BlockOffsetTable (BOT).
Investigation indicated that these crashes:
* were the result of incorrect and inconsistent values in the BOT's offset array,
* consistently occurred on one particular platform type (part of the testing and integration infrastructure),
* seemed to have increased in frequency since the inclusion of the changes for 6818524 into the PermGen removal code.
Specifically, the assert being seen was:
# Internal Error (/tmp/jprt/P1/173804.cphillim/s/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp:552), pid=5210, tid=27
# assert(_array->offset_array(j) > 0 && _array->offset_array(j) <= (u_char) (N_words+BlockOffsetArray::N_powers-1)) failed: offset array should have been set
The assert was extended to print out the values being compared:
> --- a/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp
> +++ b/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp
> @@ -546,7 +546,10 @@
>        assert(_array->offset_array(j) > 0 &&
>               _array->offset_array(j) <=
>               (u_char) (N_words+BlockOffsetArray::N_powers-1),
> -             "offset array should have been set");
> +             err_msg("offset array should have been set "
> +                     SIZE_FORMAT " not > 0 OR " SIZE_FORMAT " not <= "
> +                     SIZE_FORMAT, _array->offset_array(j), _array->offset_array(j),
> +                     (N_words+BlockOffsetArray::N_powers-1)));
>      }
>  #endif
>    }
which yielded the following:
# Internal Error (/tmp/jprt/P1/173804.cphillim/s/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp:552), pid=5210, tid=27
# assert(_array->offset_array(j) > 0 && _array->offset_array(j) <= (u_char) (N_words+BlockOffsetArray::N_powers-1)) failed: offset array should have been set 65 not > 0 OR 65 not <= 77
So in the above, the value in the offset array was printed as 65, yet it failed a check that only requires the value to be strictly greater than zero and at most 77 (the printed value of N_words+BlockOffsetArray::N_powers-1). How could the assertion fail with a value of 65?
An investigation of the G1 BOT code and a comparison against the BOT used by the other collectors indicated that G1 (as a result of the increased size of old-gen PLABs) might be running into the same issue as 6948537: concurrent readers of the G1 BOT (the concurrent refinement threads) were seeing spurious zeros in BOT entries. That would explain the confusing output: the assert condition read a transient zero and failed, but both the condition and the err_msg arguments re-read _array->offset_array(j), and by the time the message was formatted the entry held its final value of 65, which would have passed the check.
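That double read is easy to reproduce outside the VM. Below is a minimal standalone sketch (hypothetical names, not HotSpot code) of how an assert whose predicate and message each perform their own load of a concurrently written entry can fail and yet print a value that satisfies the predicate:

```cpp
// Minimal sketch (hypothetical, not HotSpot code): the assert's predicate and
// its message each load the shared entry separately, so a writer that stores
// the final value between the two loads makes the failure message show a
// value that would have passed the check.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<unsigned char> entry{0};  // stands in for one BOT offset entry

int main() {
  std::thread writer([] {
    entry.store(65, std::memory_order_relaxed);  // concurrent BOT update
  });

  // Hand-expanded "assert(entry > 0, ...)": the first load feeds the check...
  unsigned char checked = entry.load(std::memory_order_relaxed);
  if (!(checked > 0)) {
    // ...and a second load feeds the message, just as err_msg re-evaluates
    // _array->offset_array(j). If the writer ran in between, this prints 65.
    unsigned char printed = entry.load(std::memory_order_relaxed);
    std::printf("assert failed: %u not > 0\n", (unsigned) printed);
  }

  writer.join();
  return 0;
}
```

Whether the failing branch is taken depends on the interleaving; the point is only that the printed value comes from a later load than the one the predicate actually tested.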
I believe that the issue described in 6948537 matches the behavior being seen in the failing assert. I also believe that resizable PLABs increase the likelihood of hitting the problem.
Prior to resizable PLABs, the size of PLABs for old regions was fixed at 1KB (note that in G1 we only refine cards in old regions concurrently), which is a span of 2 cards given 512-byte cards. With resizable PLABs, this can increase. When we allocate a PLAB, we record its start in the BOT. Now suppose we allocate an old-gen PLAB that spans 10 cards, and updates to card 10 and card 2 end up in two different update buffers. One concurrent refinement (CR) thread gets the first buffer and starts to process card 10, which causes the BOT to be updated from the start of the PLAB up to the object spanned by card 10. Meanwhile another CR thread gets the other buffer and starts to process card 2 while those BOT entries are still being updated (or vice versa). At that point the issue reported by Ramki in 6948537 may come into play.
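A toy model of that interleaving, with hypothetical names (the real BOT fill logic and back-skip encoding are more involved):

```cpp
// Toy model (hypothetical, not HotSpot code): one refinement thread fills the
// BOT entries covering a multi-card PLAB while another thread reads an entry
// in the middle of that span. With no synchronization between them, the
// reader can observe a still-unset (zero) entry.
#include <atomic>
#include <cstdio>
#include <thread>

constexpr int kCards = 10;               // the PLAB spans 10 cards
std::atomic<unsigned char> bot[kCards];  // 0 == "not yet set"

// Processing a card near the end of the PLAB updates the BOT entries from the
// PLAB's start up to that card (the value 1 stands in for a real back-skip).
void fill_up_to(int card) {
  for (int j = 0; j <= card; ++j) {
    bot[j].store(1, std::memory_order_relaxed);
  }
}

// Processing an earlier card reads its BOT entry, which may still be zero.
void refine_card(int card) {
  unsigned char v = bot[card].load(std::memory_order_relaxed);
  if (v == 0) {
    std::printf("spurious zero observed at card %d\n", card);
  }
}

int main() {
  for (auto& e : bot) e.store(0, std::memory_order_relaxed);
  std::thread a(fill_up_to, 9);   // CR thread processing the last card
  std::thread b(refine_card, 1);  // CR thread processing card 2 concurrently
  a.join();
  b.join();
  return 0;
}
```

The real race also involves store visibility and ordering between processors, but even this simplified interleaving shows how a refinement thread can read a zero from a BOT span that is concurrently being filled.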
By resizing the PLABs we potentially have more BOT refinement going on: a larger PLAB spans more cards, so more entries have to be filled in and there are more chances for two CR threads to operate on the same span concurrently.