JDK-7014261 : G1: RSet-related failures
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs20
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2011-01-24
  • Updated: 2013-09-18
  • Resolved: 2011-03-07
Fix Versions
  JDK 6: 6u25 (Fixed)
  JDK 7: 7 (Fixed)
  Other: hs20 (Fixed)
Description
Since the push of 6977804 ("G1: remove the zero-filling thread") we have been seeing intermittent JPRT failures with GCBasher / G1, mainly in product builds. The failure usually complains about a double free, like this one:

*** glibc detected *** /tmp/jprt/P2/T/111410.et151817/testproduct/linux_x64_2.4-product/bin/java:
double free or corruption (!prev): 0x00007f59c8524ce0 ***

I also saw (once) a failure in a fastdebug build complaining about an apparent inconsistency in the RSets:

#  Internal Error (/tmp/jprt/P3/B/164311.ap31282/source/src/share/vm/gc_implementation/g1/sparsePRT.cpp:172), pid=20790, tid=1083717968
#  guarantee(_entries != NULL) failed: INV

Comments
EVALUATION http://hg.openjdk.java.net/hsx/hsx20/baseline/rev/4e66274b6bb3
09-02-2011

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/97ba643ea3ed
26-01-2011

SUGGESTED FIX The fix is to purge from the expanded list the entries that correspond to regions that are being cleaned up (those will be dealt with by the concurrent cleanup process anyway). The way I chose to implement it is to null the expanded list at the beginning of cleanup and recreate it during cleanup, ignoring regions that were freed. Each cleanup thread creates a local expanded "sublist" (so that no locks / atomics are needed while creating those), and all the sublists are merged right at the end.
25-01-2011
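
As an illustration of the per-worker sublist scheme described in the suggested fix above, here is a minimal C++ sketch. All names (SparsePRT, CleanupWorker, g_expanded_list, record_expanded, merge_into_global) are hypothetical and the bodies are illustrative only; the actual HotSpot code differs.

#include <atomic>

// Hypothetical sketch; the real HotSpot types and names differ.
struct SparsePRT {
  SparsePRT* _next_expanded = nullptr;  // intrusive link for the expanded list
};

// Global expanded list, nulled at the beginning of cleanup.
static std::atomic<SparsePRT*> g_expanded_list{nullptr};

class CleanupWorker {
  SparsePRT* _local_head = nullptr;  // per-worker sublist: no locks or atomics

public:
  // Called while the worker walks its share of the heap regions. Tables
  // of regions that are being freed are simply not re-added to the list.
  void record_expanded(SparsePRT* prt, bool region_is_being_freed) {
    if (region_is_being_freed) return;  // cleanup reclaims its RSet anyway
    prt->_next_expanded = _local_head;
    _local_head = prt;
  }

  // Called once per worker at the end of cleanup: splice the whole local
  // sublist onto the global list with a single CAS loop.
  void merge_into_global() {
    if (_local_head == nullptr) return;
    SparsePRT* tail = _local_head;
    while (tail->_next_expanded != nullptr) tail = tail->_next_expanded;
    SparsePRT* old_head = g_expanded_list.load();
    do {
      tail->_next_expanded = old_head;
    } while (!g_expanded_list.compare_exchange_weak(old_head, _local_head));
    _local_head = nullptr;
  }
};

Building per-worker sublists keeps the hot path synchronization-free; the only cross-thread coordination is the single splice per worker at the end of cleanup.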

EVALUATION We know what the race is. A heap region's RSet comprises several tables, including a "sparse" table. Sparse tables have two RSHashTables: cur and next. The two usually point to the same physical table. When we want to expand a sparse table, we create a new next RSHashTable, which is larger than the old cur, and copy the contents of cur into next. For a while the sparse table has two RSHashTables: next, where new entries are added, and cur, which is used for iterations. (Note: when we add new entries to an RSet during a pause we generally have to make sure we scan those specially; so we only need to iterate over cur while scanning the RSet and can safely ignore next.) Expanded sparse tables are added to a list (the "expanded list") so that we process them before we iterate over the RSets at the beginning of a pause. "Processing" them involves freeing the old cur and replacing it with next.

The race is as follows. During cleanup we reclaim several regions that have expanded sparse tables, and those tables are on the expanded list. The reclaimed regions are added to the cleanup list.

- Thread 1: the concurrent cleanup thread starts processing the cleanup list and clears the RSet of every region on it, including its sparse table.
- Thread 2: the VM thread processes the expanded list; it frees the old cur RSHashTable of each sparse table and replaces it with next.

Given that the concurrent cleanup operation can now work through a pause, Threads 1 and 2 can race and reach the same sparse table. This can result in the two failures we're seeing:

- one thread deletes the cur table first, while the other tries to delete it and finds that it has already been deleted (that's the guarantee failure: the destructor is the only place where _entries is set to NULL);
- both threads try to delete the same table, which explains the double free.

The race is due to the increased concurrency introduced by 6977804. Before that change, the concurrent cleanup operation and a pause were mutually exclusive, which is why we never hit this issue.
24-01-2011
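
To make the cur/next arrangement concrete, here is a simplified, hypothetical C++ sketch of the structures described in the evaluation above (the real code lives in sparsePRT.cpp; these names and bodies are illustrative only). The comments mark where the two racing threads can end up deleting the same RSHashTable.

#include <cstddef>

// Hypothetical sketch of the cur/next scheme; not the actual HotSpot source.
struct RSHashTable {
  int*   _entries;
  size_t _capacity;
  explicit RSHashTable(size_t cap)
    : _entries(new int[cap]()), _capacity(cap) {}
  // The destructor is the only place where _entries is set to NULL, which
  // is what the failed guarantee(_entries != NULL) above is checking.
  ~RSHashTable() { delete[] _entries; _entries = nullptr; }
};

struct SparsePRT {
  RSHashTable* _cur;   // iterated while scanning the RSet during a pause
  RSHashTable* _next;  // receives new entries; usually _cur == _next

  // Expansion: allocate a larger next, copy cur's contents into it, and
  // put this table on the expanded list for later processing.
  void expand() {
    RSHashTable* larger = new RSHashTable(_cur->_capacity * 2);
    // ... copy the entries of _cur into larger ...
    _next = larger;
    // add_to_expanded_list(this);
  }

  // Thread 2 (VM thread) "processes" the expanded list: free the old cur
  // and replace it with next.
  void process_expanded() {
    if (_cur != _next) {
      delete _cur;                    // races with clear() below
      _cur = _next;
    }
  }

  // Thread 1 (concurrent cleanup) clears the RSet of a reclaimed region.
  // If it reaches the same table as process_expanded(), the old cur is
  // deleted twice: the double free and failed guarantee in the reports.
  void clear() {
    if (_cur != _next) delete _cur;   // races with process_expanded()
    delete _next;
    _cur = _next = nullptr;
  }
};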