JDK-8017070 : G1: assert(_card_counts[card_num] <= G1ConcRSHotCardLimit) failed

Details
Type: Bug
Submit Date: 2013-06-19
Status: Closed
Updated Date: 2013-10-23
Project Name: JDK
Resolved Date: 2013-07-01
Component: hotspot
OS:
Sub-Component: gc
CPU:
Priority: P2
Resolution: Fixed
Affected Versions: hs24,hs25
Fixed Versions: hs24,hs25 (team)

Related Reports
Backport:

Sub Tasks

Description
#  Internal Error (C:\jprt\T\P1\132921.brutisso\s\src\share\vm\gc_implementation\g1\g1CardCounts.cpp:160), pid=8372, tid=16560
#  assert(_card_counts[card_num] <= G1ConcRSHotCardLimit) failed: Refinement count overflow? new count: 5

Stack: [0x0000000009a90000,0x0000000009b90000]
[error occurred during error reporting (printing stack bounds), id 0xe0000000]

[error occurred during error reporting (printing native stack), id 0xe0000000]

Heap
 garbage-first heap   total 197632K, used 90724K [0x00000000eea00000, 0x00000000fab00000, 0x00000000fae00000)
  region size 1024K, 71 young (72704K), 4 survivors (4096K)
 compacting perm gen  total 20480K, used 9813K [0x00000000fae00000, 0x00000000fc200000, 0x0000000100000000)
   the space 20480K,  47% used [0x00000000fae00000, 0x00000000fb795510, 0x00000000fb795600, 0x00000000fc200000)
No shared spaces configured.

ILW = HLL => P2
                                    

Comments
Nothing to verify: fix is to remove assertion.
                                     
2013-07-09
SQE OK
                                     
2013-07-03
Need SQE-OK prior to approval
                                     
2013-07-02
URL:   http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/5ea20b3bd249
User:  johnc
Date:  2013-07-01 18:43:44 +0000

                                     
2013-07-01
I've had a hard time reproducing the issue. On the SQE Windows machine it didn't happen during a 10 hour run. Even then, because of the slow network IO, most of that time was spent starting the weblogic server and only 25 of the client tests ran. A 5 hour run on an Intel Haswell based machine also didn't show the problem.

I am pretty certain about my diagnosis though and the assert has got to go.
                                     
2013-06-28
The weblogic+medrec bigapps test exercises the modified code.
                                     
2013-06-28
I think the assert is too strong - actually shouldn't be there at all.

I _think_ a card has been enqueued twice into two different update buffers. Reads from the card table are not protected by barriers and are not atomic so a thread can see a stale value for a card. I then _think_ that the two buffers containing the card are being processed by two refinement threads at the same time. Each thread reads the refinement count for the card and both see 3. They both then increment the count: one increments it from 3 to 4 and passes the assert; the other increments it from 4 to 5 and fails the assert.
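
A minimal sketch of the suspected interleaving, replayed deterministically on one thread so the outcome does not depend on timing. This is not the HotSpot code: `CardCounts`, `Replay`, `replay_double_enqueue` and `kHotCardLimit` are hypothetical names, and the limit of 4 is an assumption chosen to match the counts quoted above (3 to 4 passes, 4 to 5 fails).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for G1ConcRSHotCardLimit (assumed to be 4 here).
constexpr unsigned kHotCardLimit = 4;

// Simplified model of a per-card refinement count table.
struct CardCounts {
  std::vector<std::uint8_t> counts;
  explicit CardCounts(std::size_t n) : counts(n, 0) {}

  // Step 1 of the unsynchronized update: read the current count.
  unsigned read(std::size_t card) const {
    assert(card < counts.size());
    return counts[card];
  }

  // Step 2: increment the table entry if the value read earlier was below
  // the limit. Returns whether "count <= limit" still holds afterwards,
  // which is the condition the failing assert checked.
  bool bump(std::size_t card, unsigned seen) {
    assert(card < counts.size());
    if (seen < kHotCardLimit) {
      counts[card] += 1;
    }
    return counts[card] <= kHotCardLimit;
  }
};

struct Replay {
  bool first_ok;         // did the first refinement thread pass the check?
  bool second_ok;        // did the second one?
  unsigned final_count;  // table value after both updates
};

// Replays the case where the same card sits in two update buffers and two
// refinement threads both read the count before either one writes it back.
Replay replay_double_enqueue() {
  CardCounts table(1);
  table.counts[0] = 3;

  unsigned seen_a = table.read(0);    // thread A sees 3
  unsigned seen_b = table.read(0);    // thread B also sees 3

  bool a_ok = table.bump(0, seen_a);  // 3 -> 4, check holds
  bool b_ok = table.bump(0, seen_b);  // 4 -> 5, check fails

  return {a_ok, b_ok, table.read(0)};
}
```

Under these assumptions the replay ends with a count of 5 and only the second check failing, which matches the "new count: 5" in the assert message from the description.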

I don't want to use atomic loads and stores for this code as they would slow it down and still wouldn't fix the problem. One fix is to increment the local "count" and assign that value into the table, but that would miss the fact that the card has been enqueued twice and may delay cards becoming hot. The proper fix is to remove the assert.
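
To illustrate the trade-off of the rejected alternative, here is a small sketch of writing back the local count plus one instead of incrementing the table entry in place. The names `add_card_count_local` and `replay_local_write_back` are hypothetical, the limit is passed in as a parameter rather than read from the real flag, and the interleaving is again replayed on a single thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Alternative update: store (seen + 1) rather than doing counts[card] += 1.
// "seen" is the count this thread read earlier; "limit" stands in for
// G1ConcRSHotCardLimit. Returns the table value after the update.
unsigned add_card_count_local(std::vector<std::uint8_t>& counts,
                              std::size_t card,
                              unsigned seen,
                              unsigned limit) {
  assert(card < counts.size());
  if (seen < limit) {
    counts[card] = static_cast<std::uint8_t>(seen + 1);
  }
  return counts[card];
}

// Same double-enqueue scenario: both threads read 3 before either writes.
unsigned replay_local_write_back(unsigned limit) {
  std::vector<std::uint8_t> counts(1, 3);

  unsigned seen_a = counts[0];  // thread A sees 3
  unsigned seen_b = counts[0];  // thread B also sees 3

  add_card_count_local(counts, 0, seen_a, limit);  // writes 4
  add_card_count_local(counts, 0, seen_b, limit);  // writes 4 again

  return counts[0];
}
```

With a limit of 4 the table never exceeds the limit, so the assert would hold, but the two enqueues are recorded as a single increment. That is the lost information the comment refers to.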

Added the following instrumentation:



      if ((count + 1) >= G1ConcRSHotCardLimit) {
        gclog_or_tty->print_cr("# Possible Refinement count overflow:- ");
        gclog_or_tty->print_cr("#  card_ptr: "PTR_FORMAT, card_ptr);
        gclog_or_tty->print_cr("#  card_num: "SIZE_FORMAT, card_num);
        gclog_or_tty->print_cr("#  max card_num: "SIZE_FORMAT, _committed_max_card_num);
        gclog_or_tty->print_cr("#  old count: "UINT32_FORMAT, count);
        gclog_or_tty->print_cr("#  new count: "UINT32_FORMAT, _card_counts[card_num]);

        HeapWord* addr = _ct_bs->addr_for(card_ptr);
        HeapRegion* hr = _g1h->heap_region_containing(addr);
        gclog_or_tty->print_cr("#  Region for card: "HR_FORMAT, HR_FORMAT_PARAMS(hr));
      }

and changed the assert to:

      assert(_card_counts[card_num] <= G1ConcRSHotCardLimit,
             err_msg("Refinement count overflow? "
                     "old count: "UINT32_FORMAT", "
                     "new count: "UINT32_FORMAT,
                     count,
                     (uint) _card_counts[card_num]));

This additional output should confirm the theory. 
                                     
2013-06-26
I think I know what's going on here. Just having a difficult time starting the weblogic server on the SQE machine with a jprt bundle.
                                     
2013-06-21
Triaged. hs24.
                                     
2013-06-20


