United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: JDK-6793828 G1: invariant: queues are empty when activated
JDK-6793828 : G1: invariant: queues are empty when activated

Details
Type:
Bug
Submit Date:
2009-01-14
Status:
Resolved
Updated Date:
2013-06-19
Project Name:
JDK
Resolved Date:
2009-02-27
Component:
hotspot
OS:
generic
Sub-Component:
compiler
CPU:
generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
hs15
Fixed Versions:
hs15 (b02)

Related Reports
Backport:
Backport:
Backport:
Backport:
Relates:
Relates:

Sub Tasks

Description
During testing with Tony's GC Test Suite, jbb2005 fails quite often with the following:

# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/ptrQueue.hpp:107
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/java/east/u2/ap31282/hotspot-gc-g1-queues/src/share/vm/gc_implementation/g1/ptrQueue.hpp:107), pid=12147, tid=16
#  Error: assert(_index == _sz,"invariant: queues are empty when activated.")
#
# JRE version: 6.0_04-b12
# Java VM: OpenJDK Server VM (14.0-b09-internal-fastdebug mixed mode solaris-sparc )
# An error report file with more information is saved as:
# /java/east/u2/ap31282/gc_test_suite_sparc/specjbb2005/hs_err_pid12147.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
Current thread is 16

I've seen this most often on sparc/Solaris, but also on other architectures too (e.g., amd64).

                                    

Comments
EVALUATION

I found an example that's slightly different to what I reported above (I'm using the same format as in the previous note). The status of a particular thread before the Full GC:

  thread 0x01f40c00 | 0x02c53c78 2324 4096

after the Full GC:

  thread 0x01f40c00 | 0x02c53c78 4096 4096

and at the beginning of the following young GC:

  thread 0x01f40c00 | 0x02c53c78 4092 4096

So, the index is updated incorrectly but it's not one slot below what it was at the beginning of the Full GC. I think this can still be caused by the queue being manipilated unsafely across a safepoint. A thread checks whether SATB barriers are active and then it's interrupted for the Full GC safepoint. Then, it adds an entry to the queue, even though the queue is supposed to be inactive.
                                     
2009-01-21
EVALUATION

As expected, the bug is indeed due to a safepoint happening half-way through an SATB write barrier. Below is an e-mail I sent to Tom Rodriguez that explains the issue.

------

I added some extra verboseness and I can now get the PC of the suspect thread at the Full GC safepoint. I used workshop to print the assembly which I have included below. The PC at the safepoint is at address 0xfa95a654. I've added some comments at the assembly marked with >>> ... <<< and here's some commentary:

In the thread structure, the SATB queue is at offset 680, so the _active flag is at offset 688 and the _index field (i.e, which points to where the last entry written in the buffer is) is at offset 696.

_active and _index are read at 0xfa95a5e0 and 0xfa95a600 respectively.

The safepoint happens at 0xfa95a654.

Then, _active is checked at 0xfa95a688, after the safepoint. Also, _index is checked whether it's 0 (i.e., whether the buffer is full) at 0xfa95a6a4 and manipulated a few instructions below.

So, I think this proves that a safepoint does indeed happen half-way through the barrier and, when we actually use the values of _active and _index, they are in fact stale (as they might have got reset during the safepoint).

Would you mind looking at this further? I have the failed JVM in a debugger if you want to take a peek. Anything else I can do to help out at this point?

Actually, I just thought of one more thing just before I pressed "Send". In the failing example I described, _index is found to be 944 some time after the Full GC (i.e., what was before the Full GC, 948, minus 4). And I've seen this scenario happen quite often. However, with the second patch you sent me (the one that constrained the reading of _buf and  _index), even though the bug still happened, I only saw indexes of 4092. This tells me that _index was correctly constrained correctly (as it always picked up the value it was reset to during the Full GC), but we missed constraining _active. Maybe doing that would solve the issue? I hope this makes sense!

Regards,

Tony

...
0xfa95a5d4:     bne,pn   %icc,0xfa95ba48        ! 0xfa95ba48
0xfa95a5d8:     nop
0xfa95a5dc:     ldsh     [%l2 + 12], %g4
0xfa95a5e0:     ldsb     [%g2 + 688], %g1       ! >>> read _active into %g1 <<<
0xfa95a5e4:     ldsh     [%i4 + 8], %o0
0xfa95a5e8:     ld       [%g2 + 692], %g3
0xfa95a5ec:     st       %i3, [%sp + 104]
0xfa95a5f0:     ld       [%l4], %g0
0xfa95a5f4:     mov      %i0, %i3
0xfa95a5f8:     cmp      %o0, %g4
0xfa95a5fc:     be,pn    %icc,0xfa95a674        ! 0xfa95a674
0xfa95a600:     ld       [%g2 + 696], %l7       ! >>> read _index into %l7 <<<
0xfa95a604:     cmp      %i1, 0
0xfa95a608:     be,pn    %icc,0xfa95a75c        ! 0xfa95a75c
0xfa95a60c:     nop
0xfa95a610:     ld       [%sp + 100], %i4
0xfa95a614:     st       %i1, [%sp + 100]
0xfa95a618:     ld       [%sp + 104], %i3
0xfa95a61c:     ba,pt    %icc,0xfa95a400        ! 0xfa95a400
0xfa95a620:     nop
0xfa95a624:     ld       [%sp + 100], %g4
0xfa95a628:     ld       [%g4 + 28], %i1
0xfa95a62c:     cmp      %i1, 0
0xfa95a630:     be,pn    %icc,0xfa95a754        ! 0xfa95a754
0xfa95a634:     nop
0xfa95a638:     nop
0xfa95a63c:     nop
0xfa95a640:     ld       [%i1 + 24], %o0
0xfa95a644:     cmp      %g4, %o0
0xfa95a648:     bne,pn   %icc,0xfa95a464        ! 0xfa95a464
0xfa95a64c:     nop
0xfa95a650:     ld       [%i1 + 28], %o0
0xfa95a654:     ld       [%l4], %g0             ! >>> safepoint! <<<
0xfa95a658:     cmp      %o0, 0
0xfa95a65c:     be,pn    %icc,0xfa95a754        ! 0xfa95a754
0xfa95a660:     nop
0xfa95a664:     mov      %i1, %g4
0xfa95a668:     mov      %o0, %i1
0xfa95a66c:     ba,pt    %icc,0xfa95a640        ! 0xfa95a640
0xfa95a670:     nop
0xfa95a674:     add      %i5, 1, %i0
0xfa95a678:     ld       [%sp + 96], %l5
0xfa95a67c:     cmp      %i0, %l5
0xfa95a680:     bcc,pn   %icc,0xfa95b54c        ! 0xfa95b54c
0xfa95a684:     add      %l3, 16, %i1
0xfa95a688:     cmp      %g1, 0                 ! >>> if (_active) <<<
0xfa95a68c:     be,pn    %icc,0xfa95a6c0        ! 0xfa95a6c0
0xfa95a690:     nop
0xfa95a694:     ld       [%i1], %o0
0xfa95a698:     cmp      %o0, 0
0xfa95a69c:     be,pn    %icc,0xfa95a6c0        ! 0xfa95a6c0
0xfa95a6a0:     nop
0xfa95a6a4:     cmp      %l7, 0                 ! >>> if (_index == 0) <<<
0xfa95a6a8:     be,pn    %icc,0xfa95b2bc        ! 0xfa95b2bc
0xfa95a6ac:     nop
0xfa95a6b0:     add      %g3, %l7, %g1
0xfa95a6b4:     add      %l7, -4, %l3
0xfa95a6b8:     st       %o0, [%g1 - 4]
0xfa95a6bc:     st       %l3, [%g2 + 696]       ! >>> _index = _index - 4 <<<
0xfa95a6c0:     mov      %i1, %l3
0xfa95a6c4:     mov      %i4, %g1
0xfa95a6c8:     xor      %g1, %l3, %l7
0xfa95a6cc:     srl      %l7, 20, %g1
0xfa95a6d0:     cmp      %g1, 0

...
                                     
2009-02-03
EVALUATION

Tom recommended to try running with -XX:-SplitIfBlocks. This makes the crash happen less often but it does still happen.
                                     
2009-02-04
EVALUATION

The problem looks to be an SATB queue being manipulated unsafely across a safepoint.

Here's a concrete example (and I can reproduce similar ones reasonably easily). There's a concurrent marking going on and it is aborted due to a Full GC. Here's the status of one particular thread at the start of the Full GC:

  thread 0x02a61800 | 0x072fe6c8 2808 4096

This means: thread 0x02a61800 has an SATB buffer with address 0x072fe6c8 of size 4096 and its current index is 2808 (remember: indexes start from the top, i.e., 4096, and decrease towards 0 in word size steps, i.e. 4 in the case of the 32-bit JVM; so the indexes are really offsets in this case).

During the Full GC the SATB queues on all threads are reset (as the concurrent marking has just been aborted), and this is the status of the above thread at the end of the Full GC:

  thread 0x02a61800 | 0x072fe6c8 4096 4096

So, clearly, its associated queue has been reset and its index set to end (4096). Now, here's the status of the same thread at the beginning of the following young GC:

  thread 0x02a61800 | 0x072fe6c8 2804 4096

So, clearly, someone just overwrote the index with a value that is its pre-Full GC value minus one slot. And I can reproduce the same scenario. The most common overwriting value is 4092 (i.e., 4096 minus one slot), but some smaller values appear too; and it's always one slot less tha the pre-Full GC value. To me this is a smoking gun that a thread is adding an entry to that SATB queue, caches the old index, then (incorrectly) is stopped at a safepoint, then completes the operation after the safepoint using the old index.

A obvious culprit for this would be JITed code. But I'll see if I can find any places in the runtime where this can happe too.
                                     
2009-01-20
EVALUATION

The root of the problem is that SafePointNode don't produce a new raw memory state so using memory dependence to ensure proper code emission for the g1 write barrier doesn't work correctly all the time.  The load of raw memory can be moved above a prior safepoint.  For now the fix is to assign control to the all the loads involved in the write barrier.  Additionally there's some suspicious code in LoadNode::Ideal that strips useless control that I've added a check for raw memory too.  Currently it won't trigger for these loads because the ThreadLocal that forms the base is TypeRawPtr::BOTTOM but I think the proper type for ThreadLocal is TypeRawPtr::NOTNULL.  There are other nodes which use TypeRawPtr::NOTNULL that might be at risk.
                                     
2009-02-04
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/82a980778b92
                                     
2009-02-05
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot/hotspot/rev/82a980778b92
                                     
2009-02-12



Hardware and Software, Engineered to Work Together