United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-8017065 : C2 allows safepoint checks to leak into G1 pre-barriers

Submit Date:
Updated Date:
Project Name:
Resolved Date:
Affected Versions:
Fixed Versions:
hs25 (b58)

Related Reports

Sub Tasks

#  Internal Error (/tmp/jprt/P1/165343.amurillo/s/src/share/vm/gc_implementation/g1/ptrQueue.hpp:118), pid=6473, tid=10
#  assert(_index == _sz) failed: invariant: queues are empty when activated.

Current thread (0x00287400):  VMThread [stack: 0xb6f70000,0xb6ff0000] [id=10]

Stack: [0xb6f70000,0xb6ff0000],  sp=0xb6fef098,  free space=508k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x102b754]  void VMError::report_and_die()+0x85c;;  __1cHVMErrorOreport_and_die6M_v_+0x85c
V  [libjvm.so+0x660aac]  void report_vm_error(const char*,int,const char*,const char*)+0x74;;  __1cPreport_vm_error6Fpkci11_v_+0x74
V  [libjvm.so+0xe6b454]  void SATBMarkQueueSet::set_active_all_threads(bool,bool)+0x11c;;  __1cQSATBMarkQdDueueSetWset_active_all_threads6Mbb_v_+0x11c
V  [libjvm.so+0x5ff80c]  void ConcurrentMark::checkpointRootsInitialPost()+0x68;;  __1cOConcurrentMarkbAcheckpointRootsInitialPost6M_v_+0x68
V  [libjvm.so+0x73ca2c]  bool G1CollectedHeap::do_collection_pause_at_safepoint(double)+0x81c;;  __1cPG1CollectedHeapbGdo_collection_pause_at_safepoint6Md_b_+0x81c
V  [libjvm.so+0x105d2dc]  void VM_G1IncCollectionPause::doit()+0x290;;  __1cXVM_G1IncCollectionPauseEdoit6M_v_+0x290
V  [libjvm.so+0x1059c98]  void VM_Operation::evaluate()+0xe0;;  __1cMVM_OperationIevaluate6M_v_+0xe0
V  [libjvm.so+0x1057448]  void VMThread::evaluate_operation(VM_Operation*)+0x1ec;;  __1cIVMThreadSevaluate_operation6MpnMVM_Operation__v_+0x1ec
V  [libjvm.so+0x1057bc4]  void VMThread::loop()+0x4c4;;  __1cIVMThreadEloop6M_v_+0x4c4
V  [libjvm.so+0x1056eb4]  void VMThread::run()+0x10c;;  __1cIVMThreadDrun6M_v_+0x10c
V  [libjvm.so+0xd438b4]  java_start+0x174;;  java_start+0x174

VM_Operation (0xacd0db10): G1IncCollectionPause, mode: safepoint, requested by thread 0x05645000

 garbage-first heap   total 186368K, used 107135K [0xb8600000, 0xc3c00000, 0xf4600000)
  region size 1024K, 3 young (3072K), 3 survivors (3072K)
 Metaspace total 95845K, used 92198K, reserved 106800K
  data space     84595K, used 82108K, reserved 94512K
  class space    11250K, used 10089K, reserved 12288K

ILW = HLH => P2

Hard to design a regression test, however the fix includes self-verification of the G1 pre-barrier consistency at compilation of each method.
Tentative fix to respect control dependencies for raw loads. Also added G1 pre-barrier consistency verification. Passes CTW.
Per says:

hsdb> findpc 0xf54bcfd8                                                       
Address 0xf54bcfd8: In code in NMethod for weblogic/servlet/internal/RequestDispatcherImpl.forward(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V content: [0xf54bb820, 0xf54c0310),  code: [0xf54bb820, 0xf54c0310),  data: [0xf54c0310, 0xf54c9ee0),  oops: [0xf54c0310, 0xf54c037c),  frame size: 160

8-defer-request: The SQE team was unable to reproduce the original failure with an adhoc UTE bigapps run.  While it's possible the assert can still occur, the lack of a reproducible test case makes it harder to test and fix this issue.  The issue is still there but the code path has changed sufficiently enough that whatever was causing the trigger previously no longer happens.  We'd like to defer this bug to track the issue.
We are not able to reproduce it on the newest fastdebug build as well as on product build.
Need SQE to confirm that this still happens.
I don't think telling a customer not to use G1 can be considered a low impact workaround. I.e. shouldn't be lowered to a P4.
I never managed to reproduce the actual assert, however that doesn't mean the incorrect code was still generated, i.e. we could have just been lucky enough to never hit a safepoint at that exact place. All information above was dug out from the core file from the nightly test.

To run the test use UTE with bigapps/Weblogic+medrec
What/how do I run to reproduce the problem?
Moving bug from GC to compiler.
Found the problem!

The C2 compiler can still allow safepoint checks to leak into the G1 pre-barrier. It seems the fix for JDK-6793828 didn't catch all cases.
Dumped the code cache from the core file and found the following code:

0xf54bcfd8:     ldsb     [%g2 + 728], %l7                            <<<<<<<<<<<<<<<<<<<<<< Read JavaThread->_satb_mark_queue._active
0xf54bcfdc:     brz,a,pn  %g1, 0xf54c44e8       ! 0xf54c44e8
0xf54bcfe0:     mov      1, %g3
0xf54bcfe4:     sethi    %hi(0xfd478000), %g1
0xf54bcfe8:     ld       [%g1 + 0x00000000], %g0        ! 0xfd478000  <<<<<<<<<<<<<<< Safepoint check!
0xf54bcfec:     ld       [%o1 + 32], %i5
0xf54bcff0:     brz,a,pn  %g3, 0xf54c6270       ! 0xf54c6270
0xf54bcff4:     cmp      %i0, 0
0xf54bcff8:     be,pn    %icc,0xf54bdb6c        ! 0xf54bdb6c
0xf54bcffc:     nop      
0xf54bd000:     cmp      %i3, %g0
0xf54bd004:     be,pn    %icc,0xf54bdb5c        ! 0xf54bdb5c
0xf54bd008:     nop      
0xf54bd00c:     brz,pn   %l7, 0xf54c550c        ! 0xf54c550c <<<<<<<<<<<<<<<<<<<< if (JavaThread->_satb_mark_queue._active == false) goto 0xf54c550c
0xf54bd010:     ld       [%l2], %o0                                       <<<<<<<<<<<<<<<<<<<< Read pre_value
0xf54bd014:     brz,pn   %o0, 0xf54bd414        ! 0xf54bd414 <<<<<<<<<<<<<<<<<<<< if (pre_value == null) goto 0xf54bd414
0xf54bd018:     ld       [%g2 + 736], %g1                             <<<<<<<<<<<<<<<<<<<< Read JavaThread->_satb_mark_queue._index
0xf54bd01c:     brz,pt   %g1, 0xf54c829c        ! 0xf54c829c <<<<<<<<<<<<<<<<<<<< if (JavaThread->_satb_mark_queue._index == 0) goto 0xf54c829c
0xf54bd020:     ld       [%g2 + 732], %l7                              <<<<<<<<<<<<<<<<<<<< Read JavaThread->_satb_mark_queue._buf
0xf54bd024:     add      %l7, %g1, %g3
0xf54bd028:     st       %o0, [%g3 - 4]                                 <<<<<<<<<<<<<<<<<<<< JavaThread->_satb_mark_queue._buf[_index - 4] = pre_value
0xf54bd02c:     add      %g1, -4, %l7
0xf54bd030:     st       %l7, [%g2 + 736]                             <<<<<<<<<<<<<<<<<<<< JavaThread->_satb_mark_queue._index = _index - 4;
0xf54bd034:     mov      %i3, %l7
0xf54bd038:     mov      %i2, %g3

Clearly that safepoint check is not safe. There are also other instructions between the load and the branch that probably shouldn't be there. Also found other cases where (potentially unsafe?) instructions leaked into the G1 pre barrier. I can't tell if that's ok or not, but it looks a bit dangerous. For example, if the following code can generate a NullPointerException we likely have a problem.

0xf4a46810:     ldsb     [%g2 + 728], %l7                             <<<<<<<<<<<<<<<<<<<< Read (current JavaThread)->_satb_mark_queue._active
0xf4a46814:     ld       [%l5 + 56], %i3                                 <<<<<<<<<<<<<<<<<<<< What if this read that could cause NullPointerException?
0xf4a46818:     brz,pn   %l7, 0xf4a4ee18        ! 0xf4a4ee18  <<<<<<<<<<<<<<<<<<<< if (JavaThread->_satb_mark_queue._active == false) goto 0xf4a4ee18
0xf4a4681c:     ld       [%i2 + 20], %o0
0xf4a46820:     brz,pn   %o0, 0xf4a46d20        ! 0xf4a46d20
0xf4a46824:     ld       [%g2 + 736], %l7
0xf4a46828:     cmp      %l7, 0
0xf4a4682c:     be,pn    %icc,0xf4a499f4        ! 0xf4a499f4
0xf4a46830:     nop      
0xf4a46834:     ld       [%g2 + 732], %l4
0xf4a46838:     add      %l4, %l7, %l4
0xf4a4683c:     st       %o0, [%l4 - 4]
0xf4a46840:     inc      -4, %l7
0xf4a46844:     st       %l7, [%g2 + 736]
0xf4a46848:     mov      %i2, %l4
0xf4a4684c:     mov      %i3, %l7

Should also add to the previous comment that thread 72 is in this case safepointed by a failed allocation, not because of reading the safepoint page. In any case is doesn't looks like this is where the problem is introduced. Instead it's seems that the corruption of the SATB queue was introduced in a previous safepoint (either a conc-mark-abort or a conc-mark-remark) and it's only now when we actually initiate a concurrent mark and active the queues that we discover the problem.

Went through the JIT:ed assembly of the first few frames of thread 72, but was not able to see the same pattern as described in JDK-6793828, where a read of the safepoint page is incorrectly inserted in the middle of the G1 pre-barrier. It's a lot of assembly to scan so I could have missed it. Also looked at the patch that was applied to C2 to fix JDK-6793828 and as far as I can tell that patch is till there and that code hasn't changed much since then.
This bug seems similar to JDK-6793828 in many ways. Some observations made so far:

The problematic JavaThread is thread 72 at 0x5645000 (i.e. first tread in Threads::_thread_list).

Thread 72 tries to allocate an array, which causes it to safepoint and start a GC. The VM thread (thread 10) starts a GC cycle and eventually comes to ConcurrentMark::checkpointRootsInitialPost() which in turn tries to activate the SATB buffers for all threads by calling SATBMarkQueueSet::set_active_all_threads(). At this time the assert(_index == _sz) in PtrQueue::set_active() checks that the buffer is empty, which it should be, when it's activated. This check fails. Looking at the PtrQueue in question we can see that the _index field has been updated and indicates that there's one entry in the buffer. The _satb_mark_queue for thread 72 looks like this:

(dbx) print -r ((JavaThread*)0x5645000)->_satb_mark_queue
((class JavaThread *) 0x5645000)->_satb_mark_queue = {
    ObjPtrQueue::PtrQueue::_qset   = 0xfef90b04
    ObjPtrQueue::PtrQueue::_active = 1U
    ObjPtrQueue::PtrQueue::_buf    = 0x5b43e68
    ObjPtrQueue::PtrQueue::_index  = 4092U
    ObjPtrQueue::PtrQueue::_sz     = 4096U
    ObjPtrQueue::PtrQueue::_perm   = 0
    ObjPtrQueue::PtrQueue::_lock   = (nil)

_index above should be 4096U when the queue is empty. 4092U indicates that a single entry (4 bytes in size) has been added. The _index is a byte offset, not an actual index. The real index is _index / oopSize, i.e. 4092 / 4 = 1023. The entry at _buf[1023] contains 0xbf501470 which looks like garbage, it points straight into an char[].

(dbx) print ((JavaThread*)0x5645000)->_satb_mark_queue->_buf[1023] 
((class JavaThread *) 0x5645000)->_satb_mark_queue._buf[1023] = 0xbf501470

(dbx) examine 0xbf501470 /100c
0xbf501470:	 '\0' 'm' '\0' '.' '\0' 's' '\0' 'u' '\0' 'n' '\0' '.' '\0' 'f' '\0' 'a'
0xbf501480:	 '\0' 'c' '\0' 'e' '\0' 's' '\0' '.' '\0' 'r' '\0' 'e' '\0' 'n' '\0' 'd'
0xbf501490:	 '\0' 'e' '\0' 'r' '\0' 'k' '\0' 'i' '\0' 't' '\0' '.' '\0' 'h' '\0' 't'
0xbf5014a0:	 '\0' 'm' '\0' 'l' '\0' '_' '\0' 'b' '\0' 'a' '\0' 's' '\0' 'i' '\0' 'c'
0xbf5014b0:	 '\0' '.' '\0' 'B' '\0' 'a' '\0' 's' '\0' 'e' '\0' 'T' '\0' 'a' '\0' 'b'
0xbf5014c0:	 '\0' 'l' '\0' 'e' '\0' 'R' '\0' 'e' '\0' 'n' '\0' 'd' '\0' 'e' '\0' 'r'
0xbf5014d0:	 '\0' 'e' '\0' 'r'

"m.sun.faces.renderkit....". This is most likely ok. This pointer is old (from when the queue was last active) and the _index is just incorrectly updated before the new pointer was written into the buffer, i.e. a safepoint happened in between, just like in JDK-6793828.
Triaged. hs25

Hardware and Software, Engineered to Work Together