JDK-8008301 : G1: guarantee(satb_mq_set.completed_buffers_num() == 0) failure
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs25,8
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: windows
  • CPU: x86
  • Submitted: 2013-02-15
  • Updated: 2014-01-14
  • Resolved: 2013-03-19
Fixed in releases:
JDK 7: 7u40 (Fixed)
JDK 8: 8 (Fixed)
Other: hs24 (Fixed)
Description
VM test: vm/gc/containers/TreeMap_Arrays	execute_positive 
VM option: -server -Xcomp -XX:+UseG1GC

The crash occurs only with the G1 GC.

Crash description
;; Using jvm: "C:/local/aurora/sandbox/jdk/baseline/windows-i586/jre/bin/server/jvm.dll"
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (concurrentMark.cpp:2569), pid=12048, tid=135308
#  guarantee(satb_mq_set.completed_buffers_num() == 0) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b76) (build 1.8.0-ea-b76)
# Java VM: Java HotSpot(TM) Server VM (25.0-b17 compiled mode windows-x86 )
# Core dump written. Default location: C:\local\aurora\sandbox\results\ResultDir\TreeMap_Arrays\hs_err_pid12048.mdmp
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x4bc30400):  VMThread [stack: 0x092b0000,0x09300000] [id=135308]

Stack: [0x092b0000,0x09300000],  sp=0x092ff6c4,  free space=317k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x194a91]  VMError::report_and_die+0x561;;  ?report_and_die@VMError@@QAEXXZ+0x561
V  [jvm.dll+0x18e6b5]  report_vm_error+0x35;;  ?report_vm_error@@YAXPBDH00@Z+0x35
V  [jvm.dll+0x25a80b]  ConcurrentMark::checkpointRootsFinalWork+0x16b;;  ?checkpointRootsFinalWork@ConcurrentMark@@QAEXXZ+0x16b
V  [jvm.dll+0x25a902]  ConcurrentMark::checkpointRootsFinal+0xb2;;  ?checkpointRootsFinal@ConcurrentMark@@QAEX_N@Z+0xb2
V  [jvm.dll+0x25ab3a]  CMCheckpointRootsFinalClosure::do_void+0xa;;  ?do_void@CMCheckpointRootsFinalClosure@@UAEXXZ+0xa
V  [jvm.dll+0x16e61b]  VM_Operation::evaluate+0x3b;;  ?evaluate@VM_Operation@@QAEXXZ+0x3b
V  [jvm.dll+0x16d7d4]  VMThread::evaluate_operation+0x64;;  ?evaluate_operation@VMThread@@AAEXPAVVM_Operation@@@Z+0x64
V  [jvm.dll+0x16db67]  VMThread::loop+0x2f7;;  ?loop@VMThread@@QAEXXZ+0x2f7
V  [jvm.dll+0x16e032]  VMThread::run+0x62;;  ?run@VMThread@@UAEXXZ+0x62
V  [jvm.dll+0x1a4496]  java_start+0x86;;  ?java_start@@YGIPAVThread@@@Z+0x86
C  [msvcr100.dll+0x10fac]
C  [msvcr100.dll+0x110b1]
C  [kernel32.dll+0x8f299]
C  [ntdll.dll+0x7d819]
C  [ntdll.dll+0x7da2b]

VM_Operation (0x0910f914): CGC_Operation, mode: safepoint, requested by thread 0x00483800


---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
  0x4c12c400 JavaThread "vm.gc.containers.ContainersTest$Mutator@131245a" [_thread_blocked, id=2304, stack(0x4d380000,0x4d3d0000)]
  0x4c12b800 JavaThread "vm.gc.containers.ContainersTest$Mutator@14ae5a5" [_thread_blocked, id=137900, stack(0x4c4d0000,0x4c520000)]
  0x4c12b000 JavaThread "vm.gc.containers.ContainersTest$Mutator@17327b6" [_thread_blocked, id=360, stack(0x4d680000,0x4d6d0000)]
  0x4c12a800 JavaThread "vm.gc.containers.ContainersTest$Mutator@140e19d" [_thread_blocked, id=132736, stack(0x4d2d0000,0x4d320000)]
  0x4c12a400 JavaThread "vm.gc.containers.ContainersTest$Mutator@a14482" [_thread_blocked, id=6456, stack(0x4d4f0000,0x4d540000)]
  0x4c129c00 JavaThread "vm.gc.containers.ContainersTest$Mutator@154617c" [_thread_blocked, id=136552, stack(0x4d400000,0x4d450000)]
  0x4c147800 JavaThread "vm.gc.containers.ContainersTest$Mutator@16d3586" [_thread_blocked, id=6368, stack(0x4d320000,0x4d370000)]
  0x4c146400 JavaThread "vm.gc.containers.ContainersTest$Mutator@d93b30" [_thread_blocked, id=39720, stack(0x4bd80000,0x4bdd0000)]
  0x4bf5bc00 JavaThread "Service Thread" daemon [_thread_blocked, id=137628, stack(0x4c880000,0x4c8d0000)]
  0x4be85c00 JavaThread "C1 CompilerThread2" daemon [_thread_blocked, id=137408, stack(0x4c6d0000,0x4c720000)]
  0x4be85400 JavaThread "C2 CompilerThread1" daemon [_thread_blocked, id=136056, stack(0x09110000,0x09160000)]
  0x4be81800 JavaThread "C2 CompilerThread0" daemon [_thread_blocked, id=137512, stack(0x4c5a0000,0x4c5f0000)]
  0x4be79000 JavaThread "Attach Listener" daemon [_thread_blocked, id=76056, stack(0x4c550000,0x4c5a0000)]
  0x4be78400 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=134368, stack(0x00bb0000,0x00c00000)]
  0x4be6d800 JavaThread "Surrogate Locker Thread (Concurrent GC)" daemon [_thread_blocked, id=10372, stack(0x4c370000,0x4c3c0000)]
  0x4bc36800 JavaThread "Finalizer" daemon [_thread_blocked, id=48660, stack(0x4c240000,0x4c290000)]
  0x4bc32400 JavaThread "Reference Handler" daemon [_thread_blocked, id=5476, stack(0x4bc80000,0x4bcd0000)]
  0x000bb800 JavaThread "main" [_thread_blocked, id=9936, stack(0x009c0000,0x00a10000)]

Other Threads:
=>0x4bc30400 VMThread [stack: 0x092b0000,0x09300000] [id=135308]
  0x4bf5f400 WatcherThread [stack: 0x4c970000,0x4c9c0000] [id=8316]

VM state:at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread:  ([mutex/lock_event])
[0x000ba1e0] Threads_lock - owner thread: 0x4bc30400
[0x000ba618] Heap_lock - owner thread: 0x00483800

Heap
 garbage-first heap   total 699392K, used 647680K [0x09600000, 0x34100000, 0x49600000)
  region size 1024K, 24 young (24576K), 4 survivors (4096K)
 Metaspace total 2627K, used 2302K, reserved 6528K
  data space     2408K, used 2095K, reserved 4480K
  class space    219K, used 206K, reserved 2048K
Comments
Workaround: -XX:MarkStackSize=128*16K (2m)
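For reference, a sketch of how the workaround might be passed on the failing configuration (the exact harness command line for vm/gc/containers/TreeMap_Arrays is not shown in this report, so the target class below is a placeholder):

  java -server -Xcomp -XX:+UseG1GC -XX:MarkStackSize=2m <test-main-class>

HotSpot accepts k/m/g suffixes on numeric -XX options, so 2m is the 128*16K value quoted above.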
06-03-2013

Attaching log file with fairly complete instrumentation, but here's the relevant part:

[3] abort draining satb buffers (2), completed_buffers: 93, num_processed: 56, cm->has_overflown: true
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[1] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[1] abort draining satb buffers (2), completed_buffers: 93, num_processed: 66, cm->has_overflown: true
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[2] task queue overflow, moving entries to the global stack
[2] abort draining satb buffers (2), completed_buffers: 93, num_processed: 62, cm->has_overflown: true
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] task queue overflow, moving entries to the global stack
[0] abort draining satb buffers (2), completed_buffers: 93, num_processed: 53, cm->has_overflown: true
2: Overflow after 1 iterations
2: starting queue size: 0, max: 16382
0: Overflow after 1 iterations
0: starting queue size: 0, max: 16382
3: Overflow after 1 iterations
3: starting queue size: 0, max: 16382
1: Overflow after 1 iterations
1: starting queue size: 0, max: 16382
## Before:- completed_buffers: 330, markStack.overflow: false, markStack.size: 0, has_overflown: false
## After:- completed buffers: 93, markStack,overflow: false, has_overflown: true
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (concurrentMark.cpp:2631), pid=4452, tid=940
#  Error: ShouldNotReachHere()
#

Prior to this point in the log, all of the workers were moving entries from their local queues to the global stack. The instrumentation in the previous comment was incorrect.
We do actually process some completed SATB buffers during the remark phase:

[0] abort draining satb buffers (2), completed_buffers: 93, num_processed: 53, cm->has_overflown: true
[1] abort draining satb buffers (2), completed_buffers: 93, num_processed: 66, cm->has_overflown: true
[2] abort draining satb buffers (2), completed_buffers: 93, num_processed: 62, cm->has_overflown: true
[3] abort draining satb buffers (2), completed_buffers: 93, num_processed: 56, cm->has_overflown: true

At the start of remark, the local task queues are empty and can hold 16K-2 entries. There were 330 completed SATB buffers. During the remark, the marking tasks drain the completed SATB buffers. As a result of filling up their local queues, they start moving entries from their local queues to the global mark stack. All is well and good. In the process of moving some entries, the global marking stack overflows and the marking tasks abort while there are still unprocessed SATB buffers. Since the marking tasks aborted and the marking stack overflowed, the remark task itself terminates and we hit the guarantee.

The guarantee is too strong. It should be:

  guarantee(has_overflown() || satb_queue_set.completed_buffers_num() == 0, "....");

Since we have overflown the marking stack, we should also skip reference processing. Reference processing will not be able to push any discovered reference objects onto the marking stack, so we would be processing them needlessly. Marking will be restarted and any discovered references will remain discovered until processed after a successful remark.
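For concreteness, a minimal sketch of the two changes suggested above, written against the structure of concurrentMark.cpp at the time; the names used in the second half (_restart_for_overflow, weakRefsWork) are assumptions about that structure, not a quote of the eventual fix:

  // Sketch only. In checkpointRootsFinalWork(), after the remark tasks have run:
  SATBMarkQueueSet& satb_mq_set = JavaThread::satb_mark_queue_set();
  guarantee(has_overflown() || satb_mq_set.completed_buffers_num() == 0,
            "completed SATB buffers may only remain if the marking stack overflowed");

  // In the caller, checkpointRootsFinal(), roughly:
  if (has_overflown()) {
    // Marking will be restarted; discovered references stay discovered
    // until a later, successful remark processes them.
    _restart_for_overflow = true;        // assumed field name
  } else {
    // Only process references when remark actually completed.
    weakRefsWork(clear_all_soft_refs);   // assumed helper name
  }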
06-03-2013

Another failure with some more instrumentation:

Starting Thread[vm.gc.containers.ContainersTest$Mutator@131245a,5,main]
[3] abort draining satb buffers (2), completed_buffers: 105, has_timed_out: false
[2] abort draining satb buffers (2), completed_buffers: 105, has_timed_out: false
[0] abort draining satb buffers (2), completed_buffers: 105, has_timed_out: false
[1] abort draining satb buffers (2), completed_buffers: 105, has_timed_out: false
0: abort after 1 iterations
1: abort after 1 iterations
3: abort after 1 iterations
2: abort after 1 iterations
completed_buffers_before: 105, completed buffers_after: 105, has_overflow_before: false, has_overflown: true, has_aborted: false
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (concurrentMark.cpp:2619), pid=4284, tid=5500
#  Error: ShouldNotReachHere()

The instrumentation indicates that at the start of remark:
* there were 105 completed SATB buffers,
* the marking stack had NOT overflown.

The instrumentation also indicates:
* that each worker in remark performed only one marking step before the marking stack overflowed,
* that each worker aborted its own marking step while trying to drain the SATB buffers,
* that no actual SATB buffers were processed.
05-03-2013

I was able to reproduce the problem on the SQE Windows machine. Many thanks to Mikhail for all the help. The issue did not reproduce with a 64-bit JVM. With a 32-bit JVM it reproduces within 30 iterations. Instrumentation shows:

[2] abort draining satb buffers, completed_buffers: 198
[0] abort draining satb buffers, completed_buffers: 198
[3] abort draining satb buffers, completed_buffers: 198
[1] abort draining satb buffers, completed_buffers: 198
completed_buffers_before: 198, completed buffers_after: 198, has_overflown: true, has_aborted: false
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (concurrentMark.cpp:2611), pid=2768, tid=4028
#  Error: ShouldNotReachHere()
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b79) (build 1.8.0-ea-b79)
# Java VM: Java HotSpot(TM) Server VM (25.0-b18-internal-201303041824.jcuthber.8008301 compiled mode windows-x86 )
# Core dump written. Default location: C:\local\tmp\test\hs_err_pid2768.mdmp
#
# An error report file with more information is saved as:
# C:\local\tmp\test\hs_err_pid2768.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
Bug has been reproduced (iter 24, exit code 1)

The number of completed buffers at the start of remark was 198. It looks like the concurrent mark tasks aborted during (or before) draining the completed SATB buffers, as the number of completed buffers afterwards was still 198. The individual tasks aborted because the marking stack overflowed. It looks like the guarantee is too strong. Where did the marking stack overflow? Was it during remark or before?
04-03-2013

OK. After building the latest version of rdesktop on a lab Linux machine and running from there, I'm able to get onto the Windows machine. Currently running the test in a loop. Iteration #6 and no failures (so far).
26-02-2013

OK. I need more information. The instructions in the link to rerun the test don't work on my laptop. The resulting rerun script still has NFS paths in it. When I attempt to map the drives given with username gtee, the drives will not map. I can't log in to sc14160070 as either jcuthber or gtee using remote desktop. I have no way of running this test, either on my laptop or on the machine on which the test failed.
15-02-2013

The test case just ran ~50 times on my Solaris workstation without failure. Looking at the code, we are here:

  SATBMarkQueueSet& satb_mq_set = JavaThread::satb_mark_queue_set();
  guarantee(satb_mq_set.completed_buffers_num() == 0, "invariant");

The code immediately before this guarantee is the guts of the remark pause, and we should have finished the remark task. If we look at the CMRemarkTask::work() method:

  void work(uint worker_id) {
    // Since all available tasks are actually started, we should
    // only proceed if we're supposed to be active.
    if (worker_id < _cm->active_tasks()) {
      CMTask* task = _cm->task(worker_id);
      task->record_start_time();
      do {
        task->do_marking_step(1000000000.0 /* something very large */,
                              true /* do_stealing */,
                              true /* do_termination */);
      } while (task->has_aborted() && !_cm->has_overflown());
      // If we overflow, then we do not want to restart. We instead
      // want to abort remark and do concurrent marking again.
      task->record_end_time();
    }
  }

In do_marking_step() we drain the SATB buffers before scanning the bitmap, and again after scanning the bitmap if the task hasn't aborted. Draining the SATB buffers applies the marking closure to the pointers in the completed buffers and the currently active buffers (if the task has not aborted). Since this is a remark pause we shouldn't be aborting the task due to time, but could marking itself have been aborted due to an overflow? Then the task is aborted but one thread is in JNI code with an active buffer that's close to being retired? The thread in JNI writes a value which then enqueues the previous value, causing an SATB buffer to be completed and enqueued? Is the guarantee in the wrong place? Shouldn't it be checked closer to the actual end of the remark pause?
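To make the placement question concrete, here is a rough sketch of where the failing check sits, reconstructed from the stack trace and the snippets quoted above; the setup steps are paraphrased as comments and are not the exact source:

  void ConcurrentMark::checkpointRootsFinalWork() {
    // ... set the number of active marking tasks and prepare the SATB
    //     queue set for the remark pause (details omitted) ...

    // Run CMRemarkTask (the work() method quoted above) on the worker
    // threads. Each worker loops in do_marking_step() until it terminates,
    // or gives up because the global marking stack has overflown.

    // The failing check: once the workers return, every completed SATB
    // buffer is expected to have been drained.
    SATBMarkQueueSet& satb_mq_set = JavaThread::satb_mark_queue_set();
    guarantee(satb_mq_set.completed_buffers_num() == 0, "invariant");
  }

If the workers are allowed to bail out of do_marking_step() on overflow while completed buffers remain, this check fires even though the SATB machinery itself is behaving correctly.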
15-02-2013

Please provide instructions on how to reproduce this failure.
15-02-2013