JDK-8274072 : Crash in ZStatPhaseCycle::register_start while executing com/sun/jdi/JITDebug
  • Type: Bug
  • Component: core-svc
  • Sub-Component: debugger
  • Affected Version: 18
  • Priority: P2
  • Status: New
  • Resolution: Unresolved
  • Submitted: 2021-09-21
  • Updated: 2021-09-22
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 18
18Unresolved
Related Reports
Relates :  
Relates :  
Description
#8  <signal handler called>
#9  0x00007f95c078da48 in GrowableArrayBase::clear (this=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/growableArray.hpp:97
#10 TimePartitions::clear (this=0x7f95c1aeddb8 <ZStatPhase::_timer+56>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/gcTimer.cpp:126
#11 GCTimer::register_gc_start (this=this@entry=0x7f95c1aedd80 <ZStatPhase::_timer>, time=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/gcTimer.cpp:33
#12 0x00007f95c151ed3c in ZStatPhaseCycle::register_start (this=<optimized out>, start=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.cpp:634
#13 0x00007f95c14b9dc8 in ZStatTimer::ZStatTimer (phase=..., this=0x7f95bc1bfcd0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.hpp:309
#14 ZStatTimer::ZStatTimer (phase=..., this=0x7f95bc1bfcd0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.hpp:304
#15 ZDriverGCScope::ZDriverGCScope (request=..., this=0x7f95bc1bfca0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:407
#16 ZDriver::gc (this=this@entry=0x7f95b805dd70, request=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:446
#17 0x00007f95c14bbc8f in ZDriver::run_service (this=0x7f95b805dd70) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:494
#18 0x00007f95c04cb84b in ConcurrentGCThread::run (this=0x7f95b805dd70) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/concurrentGCThread.cpp:51 

Null here:
(gdb) p _phases 
$1 = (GrowableArray<GCPhase> *) 0x0

   125         void TimePartitions::clear() {                                                                                                                                                                                               
   126           _phases->clear();  

Test was executed with:
make test-vmTestbase_nsk_jdi         CONF=fast JTREG="RETAIN=all -javaoptions:\"-XX:+SuppressFatalErrorMessage -XX:+UseZGC -XX:ZFragmentationLimit=0 -XX:ZCollectionInterval=0.01\""
Comments
Aside: for a very long time the VM didn't use non-trivial static objects, so we didn't encounter these kinds of termination problems.
22-09-2021

Both those bugs relate to Java-level shutdown hooks not native at_exit hooks. But if you don't use _exit you are neither completely abruptly terminating the VM, nor cleanly terminating the VM and this middle road can causes crashes on termination. So solution is to either completely abruptly terminate (_exit()), or cleanly terminate (trigger System.exit). The latter is not really a compatible option now so that leaves _exit.
22-09-2021

It seems that both those bugs imply that _exit() should be called (and is called), yet the code calls exit().
22-09-2021

JDK-4401395 also set the precedent of assuming that text implied an abrupt process level shutdown, unfortunately.
22-09-2021

JDK-7157998 raised the issue of exit not doing an orderly shutdown but the evaluation of that issue is flawed in my opinion. It used the words from the specification: "Threads running in the VM are abruptly terminated. A thread death exception is not thrown and finally blocks are not run." to infer that this is an abrupt process (not Thread) termination and that therefore shutdown hooks should not be run. However those words IMO were intended to convey the fact that terminating the VM does not imply stopping each thread as if by use of Thread.stop() - hence no ThreadDeath exception to unwind the stack and cause finally blocks to execute etc etc.
22-09-2021

I would expect those functions to emulate the calling of System.exit() in the target VM not do a low-level process based termination! Making such a change now may have compatibility issues however.
22-09-2021

https://docs.oracle.com/en/java/javase/16/docs/specs/jdwp/jdwp-protocol.html#JDWP_VirtualMachine_Exit https://docs.oracle.com/en/java/javase/16/docs/api/jdk.jdi/com/sun/jdi/VirtualMachine.html#exit(int) There doesn't seem to be any requirement to call or not to call registered functions.
22-09-2021

The key difference for this problem is (man _exit): "The function _exit() is like exit(3), but does not call any functions registered with atexit(3) or on_exit(3)." so we don't run atexit handlers and don't process static destructors. Normally we would use abort() for this behaviour, but if we don't want a core file then we use _exit(). I don't know the JDWP reason for forcing termination so don't know whether abort() or _exit() is more appropriate.
21-09-2021

Can you explain exit() vs _exit()?
21-09-2021

I had no idea that JDWP would directly try to terminate a loaded VM by calling exit() directly! That is a bad idea - no process that loads the JVM should ever do that. I'm going to redirect this to serviceability folk. It may be that switching to _exit() is the right fix here as we did elsewhere in the VM itself.
21-09-2021

This problem stops happening if I change forceExit to use _Exit instead of exit.
21-09-2021

The code has two error handling functions: 1) jniFatalError 2) forceExit jniFatalError calls jni_FatalError, which calls the "safe" os::abort, while forceExit calls the unsafe exit function. I'm not sure what to do about this.
21-09-2021

So, this seems to be the path to exit: forceExit (exit_code=exit_code@entry=0x0) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c:628 void forceExit(int exit_code) { /* make sure the transport is closed down before we exit() */ transport_close(); exit(exit_code); }
21-09-2021

I've managed to reliably reproduce this by introducing a 1 second sleep in ~TimePartition. This is the path that exits the JVM: ``` #0 0x00007ffaea34f5df in __GI___clock_nanosleep (clock_id=clock_id@entry=0x0, flags=flags@entry=0x0, req=req@entry=0x7ffac86f89b0, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:73 #1 0x00007ffaea354ac7 in __GI___nanosleep (req=req@entry=0x7ffac86f89b0, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25 #2 0x00007ffae97638cf in os::naked_short_nanosleep (ns=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/os/posix/os_posix.cpp:780 #3 os::naked_short_sleep (ms=ms@entry=0x3e7) at /home/stefank/git/alt/open/src/hotspot/os/posix/os_posix.cpp:786 #4 0x00007ffae8ee77ef in TimePartitions::~TimePartitions (this=<optimized out>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.cpp:127 #5 TimePartitions::~TimePartitions (this=<optimized out>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.cpp:122 #6 0x00007ffae9c764a5 in GCTimer::~GCTimer (this=0x7ffaea273d80 <ZStatPhase::_timer>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.hpp:139 #7 ConcurrentGCTimer::~ConcurrentGCTimer (this=0x7ffaea273d80 <ZStatPhase::_timer>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.hpp:168 #8 0x00007ffaea2bb147 in __run_exit_handlers (status=status@entry=0x0, listp=0x7ffaea457738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=0x1, run_dtors=run_dtors@entry=0x1) at exit.c:108 #9 0x00007ffaea2bb2f0 in __GI_exit (status=status@entry=0x0) at exit.c:139 #10 0x00007ffae7f264ca in forceExit (exit_code=exit_code@entry=0x0) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c:628 #11 0x00007ffae7f2272d in doExit (in=0x7ffac86f8b10, out=0x7ffac86f8b40) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/VirtualMachineImpl.c:667 #12 0x00007ffae7f294a9 in debugLoop_run () at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c:161 #13 0x00007ffae7f3cbc9 in connectionInitiated (t=t@entry=0x7ffae408f018 <single_env>) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/transport.c:312 #14 0x00007ffae7f3cfb2 in acceptThread (jvmti_env=<optimized out>, jni_env=0x7ffae02ec6d0, arg=0x7ffae02d1980) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/transport.c:376 #15 0x00007ffae93c2a0b in JvmtiAgentThread::call_start_function (this=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/prims/jvmtiImpl.cpp:88 #16 0x00007ffae9a72dcc in JavaThread::thread_main_inner (this=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/runtime/thread.hpp:1095 #17 0x00007ffae9a79400 in Thread::call_run (this=this@entry=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/runtime/thread.cpp:360 #18 0x00007ffae9757274 in thread_native_entry (thread=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/os/linux/os_linux.cpp:707 #19 0x00007ffaea473450 in start_thread (arg=0x7ffac86f9640) at pthread_create.c:473 #20 0x00007ffaea38ed53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ```
21-09-2021

There are some paths to vm_direct_exit and hence to os::exit and ::exit that should never expect to be taken, but it may be possible we hit an error that caused one to be taken e.g. in vm_exit if we were in fact trying to do an orderly shutdown but hit: Thread* thread = ThreadLocalStorage::is_initialized() ? Thread::current_or_null() : NULL; if (thread == NULL) { // very early initialization failure -- just exit vm_direct_exit(code); } if the thread state was corrupted and we got NULL then we would take this path. The fact the thread executing the global dtors has only a couple of frames suggests that it is not in a good state.
21-09-2021

Right. I saw the the difference between _exit and exit in os::abort, so I agree that that path shouldn't trigger the global dtors.
21-09-2021

The first part of the stack shows abort has been called: Thread 1 (Thread 0x7f95bc1c0640 (LWP 2095177)): #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49 #1 0x00007f95c1b178a4 in __GI_abort () at abort.c:79 #2 0x00007f95bfd4232f in os::abort (dump_core=<optimized out>, siginfo=siginfo@entry=0x0, context=context@entry=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/os/posix/os_posix.cpp:1970 #3 0x00007f95c0feff3d in os::abort (dump_core=<optimized out>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/os.cpp:873 #4 0x00007f95c14116d0 in VMError::report_and_die (id=11, message=message@entry=0x0, detail_fmt=0x7f95c15f9360 "%s", detail_args=detail_args@entry=0x7f95bc1becf8, thread=thread@entry=0x7f95b805dd70, pc=0x7f95c078da48 <GCTimer::register_gc_start(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource> const&)+8> "\307@\030", siginfo=0x7f95bc1beff0, context=0x7f95bc1beec0, filename=0x0, lineno=0, size=0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1384 #5 0x00007f95c141182b in VMError::report_and_die (thread=thread@entry=0x7f95b805dd70, sig=sig@entry=11, pc=<optimized out>, siginfo=siginfo@entry=0x7f95bc1beff0, context=context@entry=0x7f95bc1beec0, detail_fmt=detail_fmt@entry=0x7f95c15f9360 "%s") at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1320 #6 0x00007f95c141185e in VMError::report_and_die (thread=thread@entry=0x7f95b805dd70, sig=sig@entry=11, pc=<optimized out>, siginfo=siginfo@entry=0x7f95bc1beff0, context=context@entry=0x7f95bc1beec0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1326 #7 0x00007f95c11b0cd0 in JVM_handle_linux_signal (sig=11, info=0x7f95bc1beff0, ucVoid=0x7f95bc1beec0, abort_if_unrecognized=1) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/os/posix/signals_posix.cpp:661 #8 <signal handler called> #9 0x00007f95c078da48 in GrowableArrayBase::clear (this=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/growableArray.hpp:97 Ah but this is in response to the segv that should not have happened. I can't explain the mystery thread that appears to be running global dtors.
21-09-2021

Note that we call _exit() not exit() to avoid this kind of problem on abnormal shutdowns - that was a recent fix.
21-09-2021

In the dumped threads we have one thread calling __do_global_dtors_aux. This indicates that the process is exiting and running destructors on static objects. The GCTimer is a static object, and hence the object is swept under the feet of the GC. Normally, the GC is notified that it is time to "stop and stand still" while the JVM is carefully exiting. See before_exit and the call to Universe::heap()->stop(); My guess is that there's a path taken that triggers the destruction of the static objects. However, I haven't been able to figure out what path that is. Robbin was running with -XX:+SuppressFatalErrorMessage, so that might be a special case. However, it looks like that flag only trigger this path in the error handling code: if (SuppressFatalErrorMessage) { os::abort(CreateCoredumpOnCrash); } and CreateCoredumpOnCrash is true, and os::aborted is implemented as: void os::abort(bool dump_core, void* siginfo, const void* context) { os::shutdown(); if (dump_core) { LINUX_ONLY(if (DumpPrivateMappingsInCore) ClassLoader::close_jrt_image();) ::abort(); // dump core } ::_exit(1); } So, that path should be calling ::abort, which I don't think should not be running the destructors of static objects. I do see a few places where we call vm_exit, vm_direct_exit, and os::exit. Those call ::exit, which I think should be running the destructors. Not all those paths notify the GC (and the rest of the JVM) that we are about to exit normally (and run the destructors of static objects), which could potentially lead to similar crashes.
21-09-2021