JDK-8017070 : G1: assert(_card_counts[card_num] <= G1ConcRSHotCardLimit) failed
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs24,hs25
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2013-06-19
  • Updated: 2013-10-23
  • Resolved: 2013-07-01
JDK 7         JDK 8      Other
7u40 (Fixed)  8 (Fixed)  hs24 (Fixed)
#  Internal Error (C:\jprt\T\P1\132921.brutisso\s\src\share\vm\gc_implementation\g1\g1CardCounts.cpp:160), pid=8372, tid=16560
#  assert(_card_counts[card_num] <= G1ConcRSHotCardLimit) failed: Refinement count overflow? new count: 5

Stack: [0x0000000009a90000,0x0000000009b90000]
[error occurred during error reporting (printing stack bounds), id 0xe0000000]

[error occurred during error reporting (printing native stack), id 0xe0000000]

 garbage-first heap   total 197632K, used 90724K [0x00000000eea00000, 0x00000000fab00000, 0x00000000fae00000)
  region size 1024K, 71 young (72704K), 4 survivors (4096K)
 compacting perm gen  total 20480K, used 9813K [0x00000000fae00000, 0x00000000fc200000, 0x0000000100000000)
   the space 20480K,  47% used [0x00000000fae00000, 0x00000000fb795510, 0x00000000fb795600, 0x00000000fc200000)
No shared spaces configured.

ILW = HLL => P2
Nothing to verify: fix is to remove assertion.


Need SQE-OK prior to approval

The weblogic+medrec bigapps test exercises the modified code.

I've had a hard time reproducing the issue. On the SQE Windows machine it didn't happen during a 10-hour run. Even then, because of the slow network IO, most of that time was spent starting the weblogic server, and only 25 of the client tests ran. A 5-hour run on an Intel Haswell based machine also didn't show the problem. I am pretty certain about my diagnosis though, and the assert has got to go.

I think the assert is too strong; in fact, it shouldn't be there at all. I _think_ a card has been enqueued twice, into two different update buffers. Reads from the card table are not protected by barriers and are not atomic, so a thread can see a stale value for a card. I then _think_ that the two buffers containing the card are being processed by two refinement threads at the same time. Each thread reads the refinement count for the card and both see 3. They both then increment the count: one increments it from 3 to 4 and passes the assert; the other increments it from 4 to 5 and fails the assert.

I don't want to use atomic loads and stores for this code, as they would slow it down and still wouldn't fix the problem. A fix is to increment the local "count" and assign that value into the table, but that would miss the fact that the card has been enqueued twice and may delay the hotness of cards. The proper fix is to remove the assert.

Added the following instrumentation:

    if ((count + 1) >= G1ConcRSHotCardLimit) {
      gclog_or_tty->print_cr("# Possible Refinement count overflow:- ");
      gclog_or_tty->print_cr("#   card_ptr: "PTR_FORMAT, card_ptr);
      gclog_or_tty->print_cr("#   card_num: "SIZE_FORMAT, card_num);
      gclog_or_tty->print_cr("#   max card_num: "SIZE_FORMAT, _committed_max_card_num);
      gclog_or_tty->print_cr("#   old count: "UINT32_FORMAT, count);
      gclog_or_tty->print_cr("#   new count: "UINT32_FORMAT, _card_counts[card_num]);
      HeapWord* addr = _ct_bs->addr_for(card_ptr);
      HeapRegion* hr = _g1h->heap_region_containing(addr);
      gclog_or_tty->print_cr("#   Region for card: "HR_FORMAT, HR_FORMAT_PARAMS(hr));
    }

and changed the assert to:

    assert(_card_counts[card_num] <= G1ConcRSHotCardLimit,
           err_msg("Refinement count overflow? "
                   "old count: "UINT32_FORMAT, "new count: "UINT32_FORMAT,
                   count, (uint) _card_counts[card_num]));

This additional output should confirm the theory.
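The suspected interleaving can be replayed deterministically in a single thread, which makes the arithmetic easier to follow. The sketch below is not the HotSpot code: `kHotCardLimit`, `g_card_count`, `read_count`, `bump_table`, `store_local_plus_one` and `replay_race` are hypothetical names, and the limit of 4 is an assumption inferred from the report, where going from 4 to 5 trips the assert. It models one card's count and splits the update into a read step and a write step so that two "threads" can both read before either writes.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: stand-in for G1ConcRSHotCardLimit. The value 4 is inferred
// from the report, where an increment from 4 to 5 fails the assert.
static const uint32_t kHotCardLimit = 4;

// Hypothetical one-entry stand-in for the per-card refinement count table.
static uint8_t g_card_count = 0;

// Step 1: a plain, unsynchronized read of the table entry into a local.
static uint32_t read_count() {
  return g_card_count;
}

// Step 2 as described in the comment above: increment the table entry
// whenever the (possibly stale) local copy is below the limit.
static void bump_table(uint32_t local_count) {
  if (local_count < kHotCardLimit) {
    g_card_count += 1;
  }
}

// Step 2, the alternative that was considered and rejected: increment the
// local copy and store that value, so the entry can never exceed the limit.
static void store_local_plus_one(uint32_t local_count) {
  if (local_count < kHotCardLimit) {
    g_card_count = static_cast<uint8_t>(local_count + 1);
  }
}

// Replays the schedule "A reads, B reads, A writes, B writes" starting
// from a count of 3, and returns the final table entry.
static uint32_t replay_race(bool use_local_store) {
  g_card_count = 3;
  uint32_t seen_by_a = read_count();  // refinement thread A sees 3
  uint32_t seen_by_b = read_count();  // refinement thread B also sees 3
  if (use_local_store) {
    store_local_plus_one(seen_by_a);
    store_local_plus_one(seen_by_b);
  } else {
    bump_table(seen_by_a);
    bump_table(seen_by_b);
  }
  return g_card_count;
}
```

Under these assumptions, `replay_race(false)` ends with a count of 5, which exceeds the limit even though each thread individually passed its `count < limit` check. `replay_race(true)` ends at 4, but the second enqueue leaves no trace in the count, which is the drawback noted above for that variant.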

I think I know what's going on here. Just having a difficult time starting the weblogic server on the SQE machine with a jprt bundle.

Triaged. hs24.