JDK-8062036 : ConcurrentMarkThread::slt may be invoked before ConcurrentMarkThread::makeSurrogateLockerThread causing intermittent crashes
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u40,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-10-24
  • Updated: 2015-06-03
  • Resolved: 2014-11-12
Fixed in: 8u40, 9 b42
GC threads are initialized much earlier than the ConcurrentMarkThread::makeSurrogateLockerThread call, so with -XX:+ScavengeALot a GC may happen several times during VM initialization. In some rare cases G1's remark phase may start before the SLT is initialized, and a crash occurs.

I'm able to reproduce this scenario with the following VM options on linux-x64:
-XX:+UseG1GC  -XX:+UnlockExperimentalVMOptions  -XX:+ScavengeALot -XX:SurvivorAlignmentInBytes=2k  -Xmx10m   -version

I tried to produce a regression test that didn't involve -XX:SurvivorAlignmentInBytes=2k, since that's not a sensible value for that option and probably won't work after JDK-8060463 is fixed. However, I was not able to come up with an alternative set of options that triggered this problem. To trigger it for G1 we need to somehow cause the concurrent mark thread to perform a GC remark fairly early in VM initialization. My attempts to create that situation either came too late to hit the problematic window, or blew up for other reasons (such as the heap size just being too small). So I'm not adding a regression test as part of the fix for this bug.

The 2014-10-24 comment about CMS hitting a different bug refers to the wrong bug number. The correct bug is JDK-8060463.

To deal with -XX:+FullGCALot and -XX:+ScavengeALot leading to problems because the VM is not yet fully initialized, change gc_alot() to use Threads::is_vm_complete() instead of is_init_completed() when deciding whether to perform the collection. To deal with the segfault when attempting to use the SLT in other collections that might occur during initialization, change SLT access to check first and report a fatal error with a more specific message if the SLT has not yet been created. This situation might arise when overly restrictive options result in during-VM-initialization GCs that require the SLT; note that even for collectors that use the SLT, some collections might not need it.

The problem is that we're attempting to perform a GC before the surrogate locker thread has been created. The CMS concurrent mark thread contains an early wait for its SLT, but the G1 concurrent mark thread does not. However, adding a similar wait to G1 doesn't really fix anything; instead of crashing on the attempt to use the SLT, the thread just waits forever. The problem is that when using the -XX:+ScavengeALot or -XX:+FullGCALot options there are attempts to run the GC before the SLT has been created, resulting in a deadlock: the GC can't proceed until SLT creation, but we won't get to SLT creation until the requested "alot" GC has completed. Note that a similar situation could happen if a GC is needed for other reasons before SLT creation. I think CMS will likely run into the same problem, but as noted in an earlier comment, I can't test that because of another issue. I think the solution is going to involve somehow suppressing at least the "alot" GC requests until SLT creation (but only when the SLT is needed, of course). And really that probably ought to be all GCs, not just "alot" GCs.

The same problem exists with -XX:+FullGCALot as with -XX:+ScavengeALot.

Here is an example of the crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f16ea7f2300, pid=1389, tid=139736031123200
#
# JRE version: Java(TM) SE Runtime Environment (9.0-b35) (build 1.9.0-ea-fastdebug-b35)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (1.9.0-ea-fastdebug-b35 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0xd72300]  Monitor::lock_without_safepoint_check()+0x20
#
# Core dump written. Default location: /tmp/core or core.1389
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

(gdb) where
#0  0x00007f16eb287425 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f16eb28ab8b in __GI_abort () at abort.c:91
#2  0x00007f16ea85cf81 in os::abort(bool) () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#3  0x00007f16eab07d14 in VMError::report_and_die() () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#4  0x00007f16ea86acc9 in JVM_handle_linux_signal () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#5  0x00007f16ea859ac2 in signalHandler(int, siginfo*, void*) () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007f16ea7f2300 in Monitor::lock_without_safepoint_check() () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#8  0x00007f16ea156ee7 in SurrogateLockerThread::manipulatePLL(SurrogateLockerThread::SLT_msg_type) () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#9  0x00007f16eab3316e in VM_CGC_Operation::doit_prologue() () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#10 0x00007f16eab2ea17 in VMThread::execute(VM_Operation*) () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#11 0x00007f16ea19f26b in ConcurrentMarkThread::run() () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#12 0x00007f16ea85bb72 in java_start(Thread*) () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so
#13 0x00007f16eba32e9a in start_thread (arg=0x7f16d47f5700) at pthread_create.c:308
#14 0x00007f16eb3453fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

(gdb) up 7
#7  0x00007f16ea7f2300 in Monitor::lock_without_safepoint_check() () from /tmp/jdk1.9.0/fastdebug/jre/lib/amd64/server/libjvm.so

(gdb) info r
rax            0x7f16e403f000   139736291471360
rbx            0x608            1544
...

Dump of assembler code for function _ZN7Monitor28lock_without_safepoint_checkEv:
   0x00007f16ea7f22e0 <+0>:   push   %rbp
   0x00007f16ea7f22e1 <+1>:   lea    0x99c468(%rip),%rax   # 0x7f16eb18e750 <_ZN18ThreadLocalStorage13_thread_indexE>
   0x00007f16ea7f22e8 <+8>:   mov    %rsp,%rbp
   0x00007f16ea7f22eb <+11>:  push   %r12
   0x00007f16ea7f22ed <+13>:  push   %rbx
   0x00007f16ea7f22ee <+14>:  mov    %rdi,%rbx
   0x00007f16ea7f22f1 <+17>:  mov    (%rax),%edi
   0x00007f16ea7f22f3 <+19>:  callq  0x7f16e9cd7dc0 <pthread_getspecific@plt>
   0x00007f16ea7f22f8 <+24>:  test   %rax,%rax
   0x00007f16ea7f22fb <+27>:  mov    %rax,%r12
   0x00007f16ea7f22fe <+30>:  je     0x7f16ea7f2350 <_ZN7Monitor28lock_without_safepoint_checkEv+112>
=> 0x00007f16ea7f2300 <+32>:  mov    0x10(%rbx),%rax

Dump of assembler code for function _ZN21SurrogateLockerThread13manipulatePLLENS_12SLT_msg_typeE:
   0x00007f16ea156ec0 <+0>:   push   %rbp
   0x00007f16ea156ec1 <+1>:   mov    %rsp,%rbp
   0x00007f16ea156ec4 <+4>:   push   %r13
   0x00007f16ea156ec6 <+6>:   mov    %esi,%r13d
   0x00007f16ea156ec9 <+9>:   push   %r12
   0x00007f16ea156ecb <+11>:  mov    %rdi,%r12
   0x00007f16ea156ece <+14>:  push   %rbx
   0x00007f16ea156ecf <+15>:  mov    %rdi,%rbx
   0x00007f16ea156ed2 <+18>:  sub    $0x8,%rsp
   0x00007f16ea156ed6 <+22>:  add    $0x608,%rbx
   0x00007f16ea156edd <+29>:  je     0x7f16ea156ee7 <_ZN21SurrogateLockerThread13manipulatePLLENS_12SLT_msg_typeE+39>
   0x00007f16ea156edf <+31>:  mov    %rbx,%rdi
   0x00007f16ea156ee2 <+34>:  callq  0x7f16ea7f22e0 <_ZN7Monitor28lock_without_safepoint_checkEv>
=> 0x00007f16ea156ee7 <+39>:  mov    0x604(%r12),%edx

So it seems like %rbx is a NULL-valued `this' pointer to SurrogateLockerThread, and 0x608 is the offset of _monitor.

I guess that this issue may be reproducible with CMS too, but I was only able to reproduce the crash with a huge SurvivorAlignmentInBytes, and with CMS I've hit JDK-8060467.