JDK-7192128 : G1: Extend fix for 6948537 to G1's BOT
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 7u6
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2012-08-16
  • Updated: 2013-09-18
  • Resolved: 2012-08-27
Version table:
  • JDK 7: 7u40 (Resolved)
  • JDK 8: 8 (Fixed)
  • Other: hs24 (Fixed)
Related Reports
Relates :  
Description
During the testing of the PermGen removal changes, the development engineers started to see assertion failures and crashes in G1's implementation of the BlockOffsetTable (BOT).

Investigation indicated that these crashes:

* were the result of incorrect and inconsistent values in the BOT's offset array, 
* always seemed to be happening on one particular platform type (that is part of the testing and integration infrastructure),
* seemed to have increased in frequency since the inclusion of the changes for 6818524 into the PermGen removal code.

Specifically the assert being seen was:

#  Internal Error (/tmp/jprt/P1/173804.cphillim/s/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp:552), pid=5210, tid=27
#  assert(_array->offset_array(j) > 0 && _array->offset_array(j) <= (u_char) (N_words+BlockOffsetArray::N_powers-1)) failed: offset array should have been set

The assert was extended to print out the values being compared:

> --- a/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp
> +++ b/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp
> @@ -546,7 +546,10 @@
>      assert(_array->offset_array(j) > 0 &&
>             _array->offset_array(j) <=
>               (u_char) (N_words+BlockOffsetArray::N_powers-1),
> -           "offset array should have been set");
> +           err_msg("offset array should have been set "
> +           SIZE_FORMAT " not > 0 OR " SIZE_FORMAT " not <= "
> +           SIZE_FORMAT, _array->offset_array(j), _array->offset_array(j),
> +           (N_words+BlockOffsetArray::N_powers-1)));
>    }
>  #endif
>  }

which yielded the following:

#  Internal Error (/tmp/jprt/P1/173804.cphillim/s/src/share/vm/gc_implementation/g1/g1BlockOffsetTable.cpp:552), pid=5210, tid=27
#  assert(_array->offset_array(j) > 0 && _array->offset_array(j) <= (u_char) (N_words+BlockOffsetArray::N_powers-1)) failed: offset array should have been set 65 not > 0 OR 65 not <= 77

In the output above, the value in the offset array was printed as 65, yet it failed a comparison that checks that it is strictly greater than zero and no more than 77. So how could this assertion fail with a value of 65?
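For reference, the bound of 77 is consistent with the usual BOT constants. The sketch below is illustrative only: the 512-byte card size, the 8-byte heap word, and N_powers = 14 are assumptions based on my reading of HotSpot sources of that era, not values stated in this report, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed constants (not stated in this report): 512-byte cards and
// 8-byte heap words give 64 words per card; N_powers is assumed to be 14.
constexpr std::size_t kCardBytes = 512;
constexpr std::size_t kWordBytes = 8;
constexpr std::size_t kNWords    = kCardBytes / kWordBytes;  // 64
constexpr std::size_t kNPowers   = 14;
constexpr std::size_t kMaxEntry  = kNWords + kNPowers - 1;   // 77

static_assert(kMaxEntry == 77, "matches the bound printed by the assert");

// Mirrors the range test in the failing assert: a set entry must be
// strictly greater than zero and no greater than kMaxEntry.
constexpr bool entry_looks_set(unsigned char entry) {
  return entry > 0 && static_cast<std::size_t>(entry) <= kMaxEntry;
}

// The reported value 65 passes the test; a zero entry does not.
inline void check_reported_values() {
  assert(entry_looks_set(65));
  assert(!entry_looks_set(0));
}
```

Under these assumptions a stable value of 65 can never trip the check, which is what makes the reported message suspicious.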

An investigation of the G1 BOT code, and a comparison against the BOT for the other collectors, indicated that G1 (as a result of the increased size of old-gen PLABs) might be running into the same issue as 6948537. Namely, concurrent readers of the G1 BOT (the concurrent refinement threads) were seeing spurious zeros in BOT entries. The error message printed 65, a value that should have passed the check; this is consistent with the entry reading as zero when the assert tested it and as 65 when the extended message read it again.
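The effect can be imitated without any threads. The sketch below is a hypothetical model, not HotSpot code: ScriptedEntry stands in for a BOT entry whose successive reads return 0 and then 65, as a concurrent reader might observe. A check that reads the entry again to build its message reports 65, while a check that reads once reports the zero that actually failed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one BOT entry seen by a concurrent reader.
// Each read() returns the next scripted value; the last value repeats.
class ScriptedEntry {
 public:
  explicit ScriptedEntry(std::vector<unsigned> values)
      : values_(std::move(values)) {}

  unsigned read() {
    assert(!values_.empty());
    const unsigned v = values_[next_];
    if (next_ + 1 < values_.size()) ++next_;
    return v;
  }

 private:
  std::vector<unsigned> values_;
  std::size_t next_ = 0;
};

// Reads the entry again when building the message, like the extended assert.
std::string check_rereading(ScriptedEntry& e, unsigned max) {
  if (e.read() > 0 && e.read() <= max) return "ok";
  return "failed: saw " + std::to_string(e.read());
}

// Reads once and reports the value that was actually tested.
std::string check_single_read(ScriptedEntry& e, unsigned max) {
  const unsigned v = e.read();
  if (v > 0 && v <= max) return "ok";
  return "failed: saw " + std::to_string(v);
}
```

With the scripted values {0, 65} and a bound of 77, check_rereading reports a failure on 65 and check_single_read reports a failure on 0. Capturing the value in a local before testing it would have made the original diagnostic show the zero directly.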

I believe that the issue described in 6948537 matches the behavior being seen in the failing assert. I also believe that resizable PLABs increase the likelihood of hitting the problem.

Prior to the resizable PLABs, the size of PLABs for old regions was set to 1 KB (note, in G1, we only refine cards in old regions concurrently), which is a span of 2 cards. With resizable PLABs, this can increase. When we allocate a PLAB, we record its start in the BOT.

Now suppose we allocate an old-gen PLAB that spans 10 cards and we see updates to card 10 and card 2, which end up in two different update buffers. Suppose that a CR thread gets one buffer and starts to process card 10; this will cause the BOT to be updated from the start of the PLAB to the object spanned by card 10. Now suppose another CR thread gets the other buffer and starts to process card 2 while the BOT is being updated (or vice versa). The issue reported by Ramki may come into play.
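The card counts above follow from simple arithmetic. The sketch below assumes 512-byte cards, which is what "1 KB is a span of 2 cards" implies; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed card size, implied by "1 KB is a span of 2 cards".
constexpr std::size_t kCardBytes = 512;

// Number of cards touched by a buffer of size_bytes that starts at byte
// offset start (measured from a card-aligned base address).
constexpr std::size_t cards_spanned(std::size_t start, std::size_t size_bytes) {
  return size_bytes == 0
             ? 0
             : (start + size_bytes - 1) / kCardBytes - start / kCardBytes + 1;
}

// A card-aligned 1 KB PLAB covers 2 cards; a 5 KB one covers 10.
inline void check_examples() {
  assert(cards_spanned(0, 1024) == 2);
  assert(cards_spanned(0, 5 * 1024) == 10);
}
```

A PLAB that does not start on a card boundary touches one more card, for example cards_spanned(256, 1024) is 3, so larger and unaligned PLABs widen the range of BOT entries that a single refinement step may rewrite.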

By resizing the PLABs, we potentially have more BOT refinement going on.
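As I understand the fix for 6948537 (an assumption on my part; this report only refers the reader to that bug), ranges of BOT entries are written one entry at a time instead of through a bulk memset, so that a concurrent reader sees either the old or the new value of each entry. The sketch below is a hypothetical helper, not HotSpot code; the use_memset parameter stands in for whatever switch the VM uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: set table[left..right] (inclusive) to offset.
void fill_entries(std::vector<unsigned char>& table, std::size_t left,
                  std::size_t right, unsigned char offset, bool use_memset) {
  assert(left <= right && right < table.size());
  const std::size_t num_cards = right - left + 1;
  if (use_memset) {
    // Bulk path: the platform's memset may use block stores.
    std::memset(table.data() + left, offset, num_cards);
  } else {
    // Per-entry path: the volatile pointer keeps the compiler from
    // turning this loop back into a memset call.
    volatile unsigned char* p = table.data();
    for (std::size_t i = left; i <= right; ++i) {
      p[i] = offset;
    }
  }
}
```

Both paths leave the table in the same final state; the difference is only in which intermediate values a concurrent reader could observe. This sketch does not model the memory ordering a real concurrent reader would need.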

Comments
The GCBasher test ran for 2 hours on a SPARC machine and passed. Marking bug as verified.

bash-3.00$ ./jdk/fastdebug/bin/java -version
java version "1.7.0_40-ea-fastdebug"
Java(TM) SE Runtime Environment (build 1.7.0_40-ea-fastdebug-b26)
Java HotSpot(TM) Server VM (build 24.0-b45-fastdebug, mixed mode)
bash-3.00$ /usr/bin/uname -a
SunOS jtg-t2000-7 5.10 Generic_142909-17 sun4v sparc SUNW,Sun-Fire-T200
bash-3.00$ time ./jdk/fastdebug/bin/java -XX:+UseG1GC -XX:OldPLABSize=3000 -jar ./GCBasher.jar -time:7200000
PASS: 24439317 iterations, runTime: 7200000ms

real 120m3.492s
user 92m40.733s
sys 29m22.599s
21-06-2013

EVALUATION http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/f99a36499b8c
21-08-2012

SUGGESTED FIX Propagate fix for 6948537 to G1's BOT.
16-08-2012

EVALUATION Please see 6948537.
16-08-2012