Bug ID: JDK-8274072 Crash in ZStatPhaseCycle::register_start while executing com/sun/jdi/JITDebug

JDK-8274072 : Crash in ZStatPhaseCycle::register_start while executing com/sun/jdi/JITDebug

Type: Bug
Component: core-svc
Sub-Component: debugger
Affected Version: 18

Priority: P5
Status: Open
Resolution: Unresolved

Submitted: 2021-09-21
Updated: 2024-07-22

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

Other
tbdUnresolved

Related Reports

Relates :	JDK-7157998 - com.sun.jdi.VirtualMachine.exit() does not process shutdown sequence.
Relates :	JDK-4401395 - JPDA: Allow more graceful shutdown of debuggee version for VirtualMachine.exit()
Relates :	JDK-8274136 - -XX:+ExitOnOutOfMemoryError calls exit while threads are running

Description

#8  <signal handler called>
#9  0x00007f95c078da48 in GrowableArrayBase::clear (this=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/growableArray.hpp:97
#10 TimePartitions::clear (this=0x7f95c1aeddb8 <ZStatPhase::_timer+56>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/gcTimer.cpp:126
#11 GCTimer::register_gc_start (this=this@entry=0x7f95c1aedd80 <ZStatPhase::_timer>, time=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/gcTimer.cpp:33
#12 0x00007f95c151ed3c in ZStatPhaseCycle::register_start (this=<optimized out>, start=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.cpp:634
#13 0x00007f95c14b9dc8 in ZStatTimer::ZStatTimer (phase=..., this=0x7f95bc1bfcd0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.hpp:309
#14 ZStatTimer::ZStatTimer (phase=..., this=0x7f95bc1bfcd0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zStat.hpp:304
#15 ZDriverGCScope::ZDriverGCScope (request=..., this=0x7f95bc1bfca0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:407
#16 ZDriver::gc (this=this@entry=0x7f95b805dd70, request=...) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:446
#17 0x00007f95c14bbc8f in ZDriver::run_service (this=0x7f95b805dd70) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/z/zDriver.cpp:494
#18 0x00007f95c04cb84b in ConcurrentGCThread::run (this=0x7f95b805dd70) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/gc/shared/concurrentGCThread.cpp:51 

Null here:
(gdb) p _phases 
$1 = (GrowableArray<GCPhase> *) 0x0

   125         void TimePartitions::clear() {                                                                                                                                                                                               
   126           _phases->clear();  

Test was executed with:
make test-vmTestbase_nsk_jdi         CONF=fast JTREG="RETAIN=all -javaoptions:\"-XX:+SuppressFatalErrorMessage -XX:+UseZGC -XX:ZFragmentationLimit=0 -XX:ZCollectionInterval=0.01\""

Comments

ILW = LLL = P5 Impact is low because it crashes an already exiting JVM. Likelihood is low because it requires JDWP VirtualMachine.Exit combined with ZGC parameters that are not normally used. Workaround is don't use the ZGC parameters.
22-07-2024
Actually I'm going to keep it open because I don't like to close CRs that represent actual bugs. But there is no intent/plan to ever fix this issue.
22-07-2024
[~cjplummer] I agree with WNF.
18-07-2024
[~dholmes][~stefank] This bug has gone idle for a while now. I'm inclined at this point to close as WNF.
17-07-2024
So where does this leave us? It seems either we should decide to close as WNF or do the _exit() fix and hope we don't get complaints about compatibility issues.
28-09-2021
ILW=HLM=P3
28-09-2021
-XX:ZCollectionInterval=0.01 is added to run GCs back-to-back. Setting it to a very low value is done to stress the JVM and run GCs back-to-back. You typically don't want to run GCs back-to-back, but it could happen even without that flag.
23-09-2021
There is a compatibility risk here, though perhaps more so with JDK-8274136: an application may have installed its own at_exit handlers etc and expect them to continue to be executed when these API's/flags are used.
23-09-2021
I guess we need to weigh the compatibility risk of switching to _exit() with the likelihood of someone running into this issue. I assume it was reproduced because of using -XX:ZFragmentationLimit=0 -XX:ZCollectionInterval=0.01. Are these what you would consider experimental settings, or is this something a user would really want to do (and is somewhat likely to do).
22-09-2021
Similar problem in JDK-8274136
22-09-2021
Aside: for a very long time the VM didn't use non-trivial static objects, so we didn't encounter these kinds of termination problems.
22-09-2021
Both those bugs relate to Java-level shutdown hooks not native at_exit hooks. But if you don't use _exit you are neither completely abruptly terminating the VM, nor cleanly terminating the VM and this middle road can causes crashes on termination. So solution is to either completely abruptly terminate (_exit()), or cleanly terminate (trigger System.exit). The latter is not really a compatible option now so that leaves _exit.
22-09-2021
It seems that both those bugs imply that _exit() should be called (and is called), yet the code calls exit().
22-09-2021
JDK-4401395 also set the precedent of assuming that text implied an abrupt process level shutdown, unfortunately.
22-09-2021
JDK-7157998 raised the issue of exit not doing an orderly shutdown but the evaluation of that issue is flawed in my opinion. It used the words from the specification: "Threads running in the VM are abruptly terminated. A thread death exception is not thrown and finally blocks are not run." to infer that this is an abrupt process (not Thread) termination and that therefore shutdown hooks should not be run. However those words IMO were intended to convey the fact that terminating the VM does not imply stopping each thread as if by use of Thread.stop() - hence no ThreadDeath exception to unwind the stack and cause finally blocks to execute etc etc.
22-09-2021
I would expect those functions to emulate the calling of System.exit() in the target VM not do a low-level process based termination! Making such a change now may have compatibility issues however.
22-09-2021
https://docs.oracle.com/en/java/javase/16/docs/specs/jdwp/jdwp-protocol.html#JDWP_VirtualMachine_Exit https://docs.oracle.com/en/java/javase/16/docs/api/jdk.jdi/com/sun/jdi/VirtualMachine.html#exit(int) There doesn't seem to be any requirement to call or not to call registered functions.
22-09-2021
The key difference for this problem is (man _exit): "The function _exit() is like exit(3), but does not call any functions registered with atexit(3) or on_exit(3)." so we don't run atexit handlers and don't process static destructors. Normally we would use abort() for this behaviour, but if we don't want a core file then we use _exit(). I don't know the JDWP reason for forcing termination so don't know whether abort() or _exit() is more appropriate.
21-09-2021
Can you explain exit() vs _exit()?
21-09-2021
I had no idea that JDWP would directly try to terminate a loaded VM by calling exit() directly! That is a bad idea - no process that loads the JVM should ever do that. I'm going to redirect this to serviceability folk. It may be that switching to _exit() is the right fix here as we did elsewhere in the VM itself.
21-09-2021
This problem stops happening if I change forceExit to use _Exit instead of exit.
21-09-2021
The code has two error handling functions: 1) jniFatalError 2) forceExit jniFatalError calls jni_FatalError, which calls the "safe" os::abort, while forceExit calls the unsafe exit function. I'm not sure what to do about this.
21-09-2021
So, this seems to be the path to exit: forceExit (exit_code=exit_code@entry=0x0) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c:628 void forceExit(int exit_code) { /* make sure the transport is closed down before we exit() */ transport_close(); exit(exit_code); }
21-09-2021
I've managed to reliably reproduce this by introducing a 1 second sleep in ~TimePartition. This is the path that exits the JVM: ``` #0 0x00007ffaea34f5df in __GI___clock_nanosleep (clock_id=clock_id@entry=0x0, flags=flags@entry=0x0, req=req@entry=0x7ffac86f89b0, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:73 #1 0x00007ffaea354ac7 in __GI___nanosleep (req=req@entry=0x7ffac86f89b0, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25 #2 0x00007ffae97638cf in os::naked_short_nanosleep (ns=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/os/posix/os_posix.cpp:780 #3 os::naked_short_sleep (ms=ms@entry=0x3e7) at /home/stefank/git/alt/open/src/hotspot/os/posix/os_posix.cpp:786 #4 0x00007ffae8ee77ef in TimePartitions::~TimePartitions (this=<optimized out>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.cpp:127 #5 TimePartitions::~TimePartitions (this=<optimized out>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.cpp:122 #6 0x00007ffae9c764a5 in GCTimer::~GCTimer (this=0x7ffaea273d80 <ZStatPhase::_timer>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.hpp:139 #7 ConcurrentGCTimer::~ConcurrentGCTimer (this=0x7ffaea273d80 <ZStatPhase::_timer>, __in_chrg=<optimized out>) at /home/stefank/git/alt/open/src/hotspot/share/gc/shared/gcTimer.hpp:168 #8 0x00007ffaea2bb147 in __run_exit_handlers (status=status@entry=0x0, listp=0x7ffaea457738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=0x1, run_dtors=run_dtors@entry=0x1) at exit.c:108 #9 0x00007ffaea2bb2f0 in __GI_exit (status=status@entry=0x0) at exit.c:139 #10 0x00007ffae7f264ca in forceExit (exit_code=exit_code@entry=0x0) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugInit.c:628 #11 0x00007ffae7f2272d in doExit (in=0x7ffac86f8b10, out=0x7ffac86f8b40) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/VirtualMachineImpl.c:667 #12 0x00007ffae7f294a9 in debugLoop_run () at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/debugLoop.c:161 #13 0x00007ffae7f3cbc9 in connectionInitiated (t=t@entry=0x7ffae408f018 <single_env>) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/transport.c:312 #14 0x00007ffae7f3cfb2 in acceptThread (jvmti_env=<optimized out>, jni_env=0x7ffae02ec6d0, arg=0x7ffae02d1980) at /home/stefank/git/alt/open/src/jdk.jdwp.agent/share/native/libjdwp/transport.c:376 #15 0x00007ffae93c2a0b in JvmtiAgentThread::call_start_function (this=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/prims/jvmtiImpl.cpp:88 #16 0x00007ffae9a72dcc in JavaThread::thread_main_inner (this=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/runtime/thread.hpp:1095 #17 0x00007ffae9a79400 in Thread::call_run (this=this@entry=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/share/runtime/thread.cpp:360 #18 0x00007ffae9757274 in thread_native_entry (thread=0x7ffae02ec3e0) at /home/stefank/git/alt/open/src/hotspot/os/linux/os_linux.cpp:707 #19 0x00007ffaea473450 in start_thread (arg=0x7ffac86f9640) at pthread_create.c:473 #20 0x00007ffaea38ed53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ```
21-09-2021
There are some paths to vm_direct_exit and hence to os::exit and ::exit that should never expect to be taken, but it may be possible we hit an error that caused one to be taken e.g. in vm_exit if we were in fact trying to do an orderly shutdown but hit: Thread* thread = ThreadLocalStorage::is_initialized() ? Thread::current_or_null() : NULL; if (thread == NULL) { // very early initialization failure -- just exit vm_direct_exit(code); } if the thread state was corrupted and we got NULL then we would take this path. The fact the thread executing the global dtors has only a couple of frames suggests that it is not in a good state.
21-09-2021
Right. I saw the the difference between _exit and exit in os::abort, so I agree that that path shouldn't trigger the global dtors.
21-09-2021
The first part of the stack shows abort has been called: Thread 1 (Thread 0x7f95bc1c0640 (LWP 2095177)): #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49 #1 0x00007f95c1b178a4 in __GI_abort () at abort.c:79 #2 0x00007f95bfd4232f in os::abort (dump_core=<optimized out>, siginfo=siginfo@entry=0x0, context=context@entry=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/os/posix/os_posix.cpp:1970 #3 0x00007f95c0feff3d in os::abort (dump_core=<optimized out>) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/os.cpp:873 #4 0x00007f95c14116d0 in VMError::report_and_die (id=11, message=message@entry=0x0, detail_fmt=0x7f95c15f9360 "%s", detail_args=detail_args@entry=0x7f95bc1becf8, thread=thread@entry=0x7f95b805dd70, pc=0x7f95c078da48 <GCTimer::register_gc_start(TimeInstant<CompositeCounterRepresentation, CompositeElapsedCounterSource> const&)+8> "\307@\030", siginfo=0x7f95bc1beff0, context=0x7f95bc1beec0, filename=0x0, lineno=0, size=0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1384 #5 0x00007f95c141182b in VMError::report_and_die (thread=thread@entry=0x7f95b805dd70, sig=sig@entry=11, pc=<optimized out>, siginfo=siginfo@entry=0x7f95bc1beff0, context=context@entry=0x7f95bc1beec0, detail_fmt=detail_fmt@entry=0x7f95c15f9360 "%s") at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1320 #6 0x00007f95c141185e in VMError::report_and_die (thread=thread@entry=0x7f95b805dd70, sig=sig@entry=11, pc=<optimized out>, siginfo=siginfo@entry=0x7f95bc1beff0, context=context@entry=0x7f95bc1beec0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/vmError.cpp:1326 #7 0x00007f95c11b0cd0 in JVM_handle_linux_signal (sig=11, info=0x7f95bc1beff0, ucVoid=0x7f95bc1beec0, abort_if_unrecognized=1) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/os/posix/signals_posix.cpp:661 #8 <signal handler called> #9 0x00007f95c078da48 in GrowableArrayBase::clear (this=0x0) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/utilities/growableArray.hpp:97 Ah but this is in response to the segv that should not have happened. I can't explain the mystery thread that appears to be running global dtors.
21-09-2021
Note that we call _exit() not exit() to avoid this kind of problem on abnormal shutdowns - that was a recent fix.
21-09-2021
In the dumped threads we have one thread calling __do_global_dtors_aux. This indicates that the process is exiting and running destructors on static objects. The GCTimer is a static object, and hence the object is swept under the feet of the GC. Normally, the GC is notified that it is time to "stop and stand still" while the JVM is carefully exiting. See before_exit and the call to Universe::heap()->stop(); My guess is that there's a path taken that triggers the destruction of the static objects. However, I haven't been able to figure out what path that is. Robbin was running with -XX:+SuppressFatalErrorMessage, so that might be a special case. However, it looks like that flag only trigger this path in the error handling code: if (SuppressFatalErrorMessage) { os::abort(CreateCoredumpOnCrash); } and CreateCoredumpOnCrash is true, and os::aborted is implemented as: void os::abort(bool dump_core, void* siginfo, const void* context) { os::shutdown(); if (dump_core) { LINUX_ONLY(if (DumpPrivateMappingsInCore) ClassLoader::close_jrt_image();) ::abort(); // dump core } ::_exit(1); } So, that path should be calling ::abort, which I don't think should not be running the destructors of static objects. I do see a few places where we call vm_exit, vm_direct_exit, and os::exit. Those call ::exit, which I think should be running the destructors. Not all those paths notify the GC (and the rest of the JVM) that we are about to exit normally (and run the destructors of static objects), which could potentially lead to similar crashes.
21-09-2021