JDK-8037112 : gc/g1/TestHumongousAllocInitialMark.java caused SIGSEGV
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8,9
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-03-11
  • Updated: 2014-07-29
  • Resolved: 2014-04-11
JDK 8: 8u20 (Fixed)
JDK 9: 9 b12 (Fixed)
Description
When run on jdk8 b132 on Linux amd64 with G1 and JFR, gc/g1/TestHumongousAllocInitialMark.java produced the following output:


Allocating humongous object 99/101 of size 786432 bytes
Allocating humongous object 100/101 of size 786432 bytes
Allocating humongous object 101/101 of size 786432 bytes
[GC pause (G1 Humongous Allocation) (young) (initial-mark) 75M->75M(200M), 0.0020453 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0008162 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0002036 secs]
[GC remark, 0.0023905 secs]
[GC cleanup 76M->2102K(200M), 0.0006299 secs]
[GC concurrent-cleanup-start]
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f7274e5baae, pid=17102, tid=140129199908624
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b132) (build 1.8.0-b132)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b70 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# ];
 stderr: [pure virtual method called
terminate called without an active exception
]
 exitValue = 134

java.lang.RuntimeException: Expected to get exit value of [0]

...
The full .jtr is attached, as well as the gdb file.

Needless to say, I haven't managed to reproduce this failure manually...
Comments
Release team: Approved for deferral to 8u20
21-03-2014

I've caught this bug in a debugger now. It looks like the VM thread has safepointed the VM and shut itself down:

(gdb) p VMThread::_should_terminate
$4 = true
(gdb) p VMThread::_vm_thread
$5 = (VMThread *) 0x0

However G1's ConcurrentMarkThread did not participate in the shutdown safepoint and is now trying to log that it's done with concurrent-root-region-scan:

#0  VMError::report_and_die (this=0x7fff81597d20) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/share/vm/utilities/vmError.cpp:872
#1  0x00007ffff6d3194f in JVM_handle_linux_signal (sig=11, info=0x7fff81597fb0, ucVoid=0x7fff81597e80, abort_if_unrecognized=<optimized out>) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp:535
#2  0x00007ffff6d26cc8 in signalHandler (sig=11, info=0x7fff81597fb0, uc=0x7fff81597e80) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/os/linux/vm/os_linux.cpp:4147
#3  <signal handler called>
#4  outputStream::print_cr (this=0x0, format=<optimized out>) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/share/vm/utilities/ostream.cpp:139
#5  0x00007ffff6967a94 in ConcurrentMarkThread::run (this=0x7ffff0082000) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp:120
#6  0x00007ffff6d28532 in java_start (thread=0x7ffff0082000) at /home/mgerdin/work/repos/hg/hotspot-gc/hotspot/src/os/linux/vm/os_linux.cpp:865
#7  0x00007ffff7bc4f6e in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007ffff74d49cd in clone () from /lib/x86_64-linux-gnu/libc.so.6

gclog_or_tty has been cleared and destroyed by the VM shutdown routines, and the ConcurrentMarkThread gets a SEGV when it tries to make a vtable call to perform the write.

What worries me about this bug is that the hs_err handler crashes when trying to report the error. The comments before VMError::log and VMError::out suggest that the error handler should be resilient to tty being uninitialized or destroyed, but that statement does not seem to hold.
14-03-2014

8-defer-request justification: This is an extremely rare race condition which can only occur while the JVM is shutting down and all Java threads have been stopped or terminated. This is not a regression; the crash can be reproduced at least as far back as 7u4.
14-03-2014

ILW => HLM => P3.
Impact: High. A crash during shutdown looks bad, even though no major data loss should occur.
Likelihood: Low. Even though it is easy to reproduce with this particular test, it has never been reported, and the problem has been around for a very long time (at least 7u4).
Workaround: Medium. Turning off GC logging will avoid the crash, but that might not be good enough for everyone.
14-03-2014

Try these two scripts: bash -x doit.sh

==doit.sh==
>> modify the 3rd line to point to your JDK and classpath
=========
while true; do
  gdb -x gdb.cmd --args \
    ${TESTBED}/bin/java -cp /home/iklam/tmp/jtreg/work/classes/gc/g1/ \
    -XX:+UseG1GC -Xms200m -Xmx200m -XX:G1HeapRegionSize=1m -XX:InitiatingHeapOccupancyPercent=50 -XX:+PrintGC \
    TestHumongousAllocInitialMark\$HumongousObjectAllocator
done
===end

==gdb.cmd==
>> Modify the line number on the 4th line to point to the end of ostream_init_log()
>> i.e., THIS LINE-> defaultStream::instance->has_log_file();
===========
set breakpoint pending on
b JNI_CreateJavaVM
r
b ostream.cpp:1094
c
watch gclog_or_tty
b ostream_exit
c
if gclog_or_tty != 0
quit
endif
===end
13-03-2014

This reproduces on JDK7u4 as well. Not a regression for JDK8.
13-03-2014

I can reproduce this on JDK8 b132:

$ while /localhome/java/jdk8/bin/java -XX:+UseG1GC -Xms200m -Xmx200m -XX:G1HeapRegionSize=1m -XX:InitiatingHeapOccupancyPercent=50 -XX:+PrintGC -cp build/tc/ TestHumongousAllocInitialMark\$HumongousObjectAllocator >out; do : ; done
pure virtual method called
pure virtual method called
pure virtual method called
pure virtual method called
13-03-2014

It is not crystal clear which subcomponent this issue belongs to, but handing it over to runtime for triage.
13-03-2014

Here is a link to one more similar failure: aurora-ds3.us.oracle.com:9502/runs/437439.VMSQE.testabilization.regression-507/results/workDir/gc/g1/TestHumongousAllocInitialMark.jtr
13-03-2014

The [GC concurrent-cleanup] phase is concurrent in the sense that it is not run at a safepoint. The current thread has called SuspendibleThreadSet::leave() just before the [GC concurrent-cleanup-start] log record and will not participate in the safepoint protocol; any or all other threads may be executing concurrently.

In the core file %rbx contains the address of gclog_or_tty, and nothing seems odd about this:

0x7f7274a53e5c <ConcurrentMarkThread::run()+2444>: mov (%rbx),%rdi
0x7f7274a53e5f <ConcurrentMarkThread::run()+2447>: lea 0x5d98ba(%rip),%rsi # 0x7f727502d720: "[GC concurrent-cleanup-end, %1.7lf secs]"
0x7f7274a53e66 <ConcurrentMarkThread::run()+2454>: mov $0x1,%eax
0x7f7274a53e6b <ConcurrentMarkThread::run()+2459>: subsd -0x110(%rbp),%xmm0
0x7f7274a53e73 <ConcurrentMarkThread::run()+2467>: callq 0x7f7274e5b9f0 <outputStream::print_cr(char const*, ...)>

The memory around gclog_or_tty does seem to contain some nulls:

0x7f72754a95d0 <_ZZL24perf_context_switch_ratePdE8lastTime>: 0x0000000000000000 0x0000000000000000
0x7f72754a95e0 <_ZZL24perf_context_switch_ratePdE17contextSwitchLock>: 0x0000000000000000 0x0000000000000000
0x7f72754a95f0 <_ZZL24perf_context_switch_ratePdE17contextSwitchLock+16>: 0x0000000000000000 0x0000000000000000
0x7f72754a9600 <_ZZL24perf_context_switch_ratePdE17contextSwitchLock+32>: 0x0000000000000000 0x0000000000000000
0x7f72754a9610 <_ZN11X86FastTime14_tsc_frequencyE>: 0x0000000000000000 0x0000000000000000
0x7f72754a9620 <_ZGVZN11X86FastTime10initializeEvE7os_freq>: 0x0000000000000000 0x0000000000000000
0x7f72754a9630: 0x0000000000000000 0x0000000000000000
0x7f72754a9640 <xtty>: 0x0000000000000000 0x0000000000000000
0x7f72754a9650 <gclog_or_tty>: 0x0000000000000000 0x0000000000000000
0x7f72754a9660 <_ZN13defaultStream14_output_streamE>: 0x00007f727582d780 0x00007f727582d860
0x7f72754a9670: 0x0000000000000000 0x0000000000000000

But I'm pretty sure that several of these are expected to be null if not set up (such as xtty).

Another weird thing is the __cxa_pure_virtual in the VMError handler. When I performed the vtable lookup for the call to staticBufferStream::write in the core file, it looked correct. I have no explanation for how we ended up in __cxa_pure_virtual.
13-03-2014

The initial crash happens here, at src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp:235:

===============================
if (G1Log::fine()) {
  gclog_or_tty->date_stamp(PrintGCDateStamps);
  gclog_or_tty->stamp(PrintGCTimeStamps);
  gclog_or_tty->print_cr("[GC concurrent-cleanup-start]"); /// <<< output was observed in log file (POINT A)
}

// Now do the concurrent cleanup operation.
_cm->completeCleanup();

// Notify anyone who's waiting that there are no more free
// regions coming. We have to do this before we join the STS
// (in fact, we should not attempt to join the STS in the
// interval between finishing the cleanup pause and clearing
// the free_regions_coming flag) otherwise we might deadlock:
// a GC worker could be blocked waiting for the notification
// whereas this thread will be blocked for the pause to finish
// while it's trying to join the STS, which is conditional on
// the GC workers finishing.
g1h->reset_free_regions_coming();

double cleanup_end_sec = os::elapsedTime();
if (G1Log::fine()) {
  gclog_or_tty->date_stamp(PrintGCDateStamps);   /// is gclog_or_tty NULL here? unsure (line 233)
  gclog_or_tty->stamp(PrintGCTimeStamps);        /// is gclog_or_tty NULL here? unsure (line 234)
  gclog_or_tty->print_cr("[GC concurrent-cleanup-end, %1.7lf secs]", //// <<<<<< HERE, line 235, crashed (POINT B)
                         cleanup_end_sec - cleanup_start_sec);
}
===============================

gclog_or_tty clearly was not NULL when "[GC concurrent-cleanup-start]" was observed in the log file. However, PrintGCTimeStamps is false, so gclog_or_tty->date_stamp() and gclog_or_tty->stamp() did not do anything, and it is unclear whether gclog_or_tty was NULL at lines 233 and 234. At line 235, NULL was passed as the <this> argument to outputStream::print_cr. Two possibilities:
[a] gclog_or_tty is NULL
[b] gclog_or_tty is not NULL, but the loading of gclog_or_tty onto the x86 parameter stack was somehow corrupted (cosmic rays??)
This can be checked by loading the core file in gdb.

Two questions for the GC team, about the interval between POINT A and POINT B (marked in the source code above):
(1) Could any other threads be executing? Or are they all waiting for GC?
(2) Is it possible that a stray memset(0, ...) could have cleared gclog_or_tty? This can be checked by loading the core file and checking the memory around &gclog_or_tty to see if it is all zeros.
12-03-2014

The crash looks very strange. It looks like we are crashing because the gclog_or_tty log stream is null. Then we crash in the crash handler because the staticBufferStream which the hs_err handler uses is somehow incorrectly set up by the C++ runtime, so when the hs_err handler tries to call write() on the stream it ends up in __cxa_pure_virtual and the C++ runtime terminates the process. I highly doubt that this will ever reproduce.
11-03-2014