JDK-8297588 : Timeout during error reporting
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 18, 19, 20
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2022-11-24
  • Updated: 2024-02-09
Description
Deliberately introducing a SIGSEGV during JIT compilation via the following change and then running 'java' causes a timeout during error handling for me (release build on Linux x64; it requires some repetitions until it triggers):

diff --git a/src/hotspot/share/opto/loopopts.cpp b/src/hotspot/share/opto/loopopts.cpp
index 22e1a741981..ddbb595d69f 100644
--- a/src/hotspot/share/opto/loopopts.cpp
+++ b/src/hotspot/share/opto/loopopts.cpp
@@ -1500,7 +1500,7 @@ bool PhaseIdealLoop::try_merge_identical_ifs(Node* n) {
     //
 
     // clone pinned nodes thru the resulting regions
-    push_pinned_nodes_thru_region(dom_if, new_true_region);
+    push_pinned_nodes_thru_region(dom_if, NULL);
     push_pinned_nodes_thru_region(dom_if, new_false_region);

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f97f33439f5, pid=946306, tid=946335
#
# JRE version: Java(TM) SE Runtime Environment (20.0) (build 20-internal-2022-11-24-1516579.tobias...)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20-internal-2022-11-24-1516579.tobias..., mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xb629f5]  PhaseIdealLoop::push_pinned_nodes_thru_region(IfNode*, Node*)+0x15
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /oracle/jdk3/hs_err_pid946306.log
[thread 946319 also had an error]
[thread 946336 also had an error]
[thread 946307 also had an error]

------ Timeout during error reporting after 120 s. ------
# [ timer expired, abort... ]
Aborted



The same happens when adding a 'guarantee(false, ...)':

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (loopopts.cpp:1502), pid=1014961, tid=1014979
#  guarantee(false) failed: FAIL
#
# JRE version: Java(TM) SE Runtime Environment (20.0) (build 20-internal-2022-11-30-0744594.tohartma...)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20-internal-2022-11-30-0744594.tohartma..., mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xb6336c]  PhaseIdealLoop::try_merge_identical_ifs(Node*)+0x22c
#
# Core dump will be written. Default location: /scratch/tohartma/jdk/core.1014961
#
# An error report file with more information is saved as:
# /scratch/tohartma/jdk/hs_err_pid1014961.log
[thread 1014974 also had an error]
[thread 1014980 also had an error]
[thread 1014962 also had an error]

------ Timeout during error reporting after 120 s. ------
# [ timer expired, abort... ]
Aborted (core dumped)


Comments
Maybe I'm missing something, but isn't the timeout caused by the deadlock during safepointing? If so, moving the compiler replay to VMError::report() won't help.
13-11-2023

Yes - what Dean said. Working around the timeout/hang is one thing, but if you still end up with no replay file for real crashes then that is a concern.
12-11-2023

I agree, we should move the compiler replay dump to VMError::report(). That should at least solve the timeout problem. It doesn't solve the problem of the replay dump failing because the VMThread was allowed to safepoint and got stuck.
10-11-2023

An alternative way to solve this would be moving the writing of the compiler replay file into VMError::report(). There, we have both a global timeout and a per-step timeout: anything that hangs gets interrupted after a specific timeout, and the error reporting progresses. That is why I think we don't need any special solution for safepoints or locks just to prevent hanging error reporting. We already have a mechanism in place to prevent that, and it works for the lion's share of error reporting (everything in VMError::report()). Only the stuff in the fringes - inside VMError::report_and_die() but outside VMError::report() - is susceptible to hangs. Also, I think we should not further complicate error reporting. It's already quite complex.
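For illustration, here is a tiny self-contained toy program (not HotSpot source - the real logic lives in the STEP machinery in vmError.cpp, and all names here are invented) showing the resumable-step idea this comment relies on: a step counter survives re-entry into the reporting function, so a step that crashed or timed out is skipped when reporting resumes with the next step.

#include <csetjmp>
#include <cstdio>

static int current_step = 0;        // persists across re-entry into report()
static std::jmp_buf resume_point;   // stands in for the secondary crash handler

// Run the step body only if we have not reached this step before.
#define STEP(desc) \
  if (current_step < __LINE__ && (current_step = __LINE__, std::printf("-- %s\n", desc), true))

static void report() {
  STEP("printing summary") {
    // prints the hs_err header in the real implementation
  }
  STEP("dumping replay data") {
    // simulate this step hanging/crashing; in HotSpot the timeout would interrupt it
    std::longjmp(resume_point, 1);
  }
  STEP("printing end marker") {
    // still reached, even though the previous step never finished
  }
}

int main() {
  setjmp(resume_point);  // the simulated "crash handler" resumes here...
  report();              // ...and re-enters report(); completed steps are skipped
  return 0;
}

In HotSpot the re-entry is driven by the secondary signal handler and a watcher mechanism rather than setjmp/longjmp, but the skip-completed-steps idea is the same.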
10-11-2023

I was going to suggest a compromise, where we explicitly allow blocking during error reporting, but only as the final step, so that we get all the information from earlier steps and leave things like compiler replay to the end. But it's not clear we urgently need that, as error reporting from the in_native state mostly works. So I'm OK with the workaround of calling thread->set_thread_state(_thread_in_vm); to solve the replay problem, but doing it late in error reporting still allows the VMThread to sneak in a safepoint and mess things up. I suggest we force the in_native --> in_vm thread state transition early in error reporting. This prevents new safepoints, just like other crashes from the in_vm state already do.
10-11-2023

Yes, we have the same issue with locks, but I disagree that it is "okay" to block for a short time, as any use of a lock could prevent the error report from completing. We are lucky to get away with what we do in this regard. There is a trade-off between the risk of secondary crashes if we accessed data without the lock versus the risk of deadlock if we try to take the lock. The same goes for safepoints.
10-11-2023

We have the same issues with locks. Error reporting may need to access state that is normally protected by a lock/mutex. Should error reporting ignore the mutex and access the state anyway? In my opinion, it is OK for error reporting to block for a short period of time to grab a lock. That also goes for safepoints, which can be considered a specialized type of lock. Keeping the VMThread from reaching a new safepoint should solve the scenarios described in this bug. If a safepoint was already in progress, the compiler thread will wait a short time for it to finish. The compiler replay code accesses a lot of oops; I wouldn't recommend trying to run it when a GC is in progress.
10-11-2023

> If the problem is the VMThread calling SafepointSynchronize::begin() after we have already crashed

The crash can happen whilst a safepoint is already in progress. The issue is the state transitions, as you note. Error reporting code must not block - which also means no thread-state transitions that could block for safepoints. The more we keep trying to do in crash reporting, the more likely we will hit these kinds of errors. Accessing oops during crash reporting is inherently fragile. I don't think a "global suspend" is viable, as there is no mechanism for that which would not in itself introduce complex interaction issues. All threads run concurrently whilst we are in error reporting - hence, again, the reason we should be completing error reporting as soon as practical. Though we have the "thread XXX also had an error" mechanism to handle any cascading failures.
10-11-2023

If the problem is the VMThread calling SafepointSynchronize::begin() after we have already crashed, it seems like the easiest solution is to prevent the VMThread from doing that: make it crash and/or sleep forever. There is still the question of what to do if we crash when already in a safepoint. It seems like threads in the _thread_in_native state should have a way to politely ask to enter the _thread_in_vm state, with a short timeout, to give the safepoint a chance to finish. If they time out waiting for the safepoint, then maybe we want them to force the state to _thread_in_vm, but this should be a last resort. In general, I don't think we should allow arbitrary threads to perform thread-state transitions when we are crashing. Instead, we should try to simulate a "global suspend" on all threads except the one doing error reporting. A sketch of that "polite" transition is below.
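A hypothetical sketch of that polite transition (this is not an existing HotSpot API; the helper name, the ~1-second budget, and the include paths are assumptions for illustration):

#include "runtime/javaThread.hpp"
#include "runtime/os.hpp"
#include "runtime/safepoint.hpp"

// Hypothetical helper: give an in-progress safepoint a short window to
// finish before forcing the reporting thread's state as a last resort.
static void enter_vm_for_error_reporting(JavaThread* thread) {
  for (int waited_ms = 0; waited_ms < 1000; waited_ms += 10) {  // ~1 s budget (assumption)
    if (!SafepointSynchronize::is_at_safepoint()) {
      break;                    // safepoint finished; the transition is now harmless
    }
    os::naked_short_sleep(10);  // poll instead of blocking in the safepoint barrier
  }
  // Last resort: force the transition without a safepoint check so the replay
  // dump can proceed; tolerable only because the VM is already going down.
  thread->set_thread_state(_thread_in_vm);
}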
10-11-2023

This change would probably be OK if no GC threads were running, but if there are GC threads running then it's not really safe for replay to access oops. But if we are crashing, maybe that's the best we can do.
08-11-2023

[~dholmes] Right, this code is only ever called from VMError::report_and_die and I've seen that we do the same in other "fatal" error reporting situations like MacroAssembler::debug64.
08-11-2023

[~thartmann] that might be okay if this code is only ever called as part of the crash reporter, but it could break the safepoint protocol if executed normally.
08-11-2023

[~dlong], [~dholmes] Here's a workaround that helped in the cases I've been observing. Would something like that be an option?

diff --git a/src/hotspot/share/ci/ciEnv.cpp b/src/hotspot/share/ci/ciEnv.cpp
index 44a8e37af35..1affda529fc 100644
--- a/src/hotspot/share/ci/ciEnv.cpp
+++ b/src/hotspot/share/ci/ciEnv.cpp
@@ -1716,9 +1716,16 @@ void ciEnv::dump_replay_data_helper(outputStream* out) {
 // Don't safepoint or acquire any locks.
 //
 void ciEnv::dump_replay_data_unsafe(outputStream* out) {
-  GUARDED_VM_ENTRY(
+  {
+    NoSafepointVerifier nsv;
+    CompilerThread* thread = CompilerThread::current();
+    thread->set_thread_state(_thread_in_vm);
+    HandleMarkCleaner __hm(thread);
+
     dump_replay_data_helper(out);
-  )
+
+    thread->set_thread_state(_thread_in_native);
+  }
 }
07-11-2023

Here's another example:

Thread 23 (Thread 0x7f110e828700 (LWP 3142999)):
warning: Section `.reg-xstate/3142999' in core file too small.
#0  0x00007f1179fa6918 in nanosleep () from /lib64/libc.so.6
#1  0x00007f1179fa681e in sleep () from /lib64/libc.so.6
#2  0x00007f11791f1c42 in os::infinite_sleep() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#3  0x00007f11795d9fb8 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#4  0x00007f11795daabb in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#5  0x00007f11795daade in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#6  0x00007f11795dac43 in crash_handler(int, siginfo*, void*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#7  <signal handler called>
#8  0x00007f1174704938 in ?? ()
#9  0x000000000000000d in ?? ()
#10 0x000000000000000d in ?? ()
#11 0x00007f110e8273c0 in ?? ()
#12 0x00007f117031ef60 in ?? ()
#13 0x00007f110e827330 in ?? ()
#14 0x00007f1178ba4ddc in InterpreterRuntime::frequency_counter_overflow(JavaThread*, unsigned char*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 10 (Thread 0x7f110ebfb700 (LWP 3142996)):
warning: Section `.reg-xstate/3142996' in core file too small.
#0  0x00007f1179fa6918 in nanosleep () from /lib64/libc.so.6
#1  0x00007f1179fa681e in sleep () from /lib64/libc.so.6
#2  0x00007f11791f1c42 in os::infinite_sleep() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#3  0x00007f11795d9fb8 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#4  0x00007f11795daabb in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*, char const*, ...) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#5  0x00007f11795daade in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void*, void*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#6  0x00007f11795dac43 in crash_handler(int, siginfo*, void*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#7  <signal handler called>
#8  0x00007f1179ee69bb in syscall () from /lib64/libc.so.6
#9  0x00007f1179611d36 in LinuxWaitBarrier::wait(int) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#10 0x00007f1179332281 in SafepointSynchronize::block(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#11 0x00007f117933a25d in SafepointMechanism::process(JavaThread*, bool, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#12 0x00007f11785f65bf in ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#13 0x00007f11785f4ac1 in ciEnv::dump_replay_data_unsafe(outputStream*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#14 0x00007f11795da914 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#15 0x00007f11795dab11 in VMError::report_and_die(Thread*, void*, char const*, int, char const*, char const*, __va_list_tag*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#16 0x00007f11787c7f30 in report_vm_error(char const*, int, char const*, char const*, ...) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#17 0x00007f1179528b80 in TypeAryPtr::cast_to_not_null_free(bool) const () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#18 0x00007f1178ab680f in GraphKit::gen_checkcast(Node*, Node*, Node**, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#19 0x00007f1179257a5e in Parse::do_checkcast() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#20 0x00007f117924fcc4 in Parse::do_one_bytecode() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#21 0x00007f11792379a2 in Parse::do_one_block() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#22 0x00007f1179238865 in Parse::do_all_blocks() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#23 0x00007f117923d5f6 in Parse::Parse(JVMState*, ciMethod*, float) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#24 0x00007f1178570118 in ParseGenerator::generate(JVMState*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#25 0x00007f117857316e in PredictedCallGenerator::generate(JVMState*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#26 0x00007f11788b48ea in Parse::do_call() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#27 0x00007f117924db50 in Parse::do_one_bytecode() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#28 0x00007f11792379a2 in Parse::do_one_block() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#29 0x00007f1179238865 in Parse::do_all_blocks() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#30 0x00007f117923d5f6 in Parse::Parse(JVMState*, ciMethod*, float) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#31 0x00007f1178570118 in ParseGenerator::generate(JVMState*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#32 0x00007f11788b48ea in Parse::do_call() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#33 0x00007f117924db50 in Parse::do_one_bytecode() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#34 0x00007f11792379a2 in Parse::do_one_block() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#35 0x00007f1179238865 in Parse::do_all_blocks() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#36 0x00007f117923d5f6 in Parse::Parse(JVMState*, ciMethod*, float) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#37 0x00007f1178570118 in ParseGenerator::generate(JVMState*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#38 0x00007f11788b48ea in Parse::do_call() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#39 0x00007f117924db50 in Parse::do_one_bytecode() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#40 0x00007f11792379a2 in Parse::do_one_block() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#41 0x00007f1179238865 in Parse::do_all_blocks() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#42 0x00007f117923d5f6 in Parse::Parse(JVMState*, ciMethod*, float) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#43 0x00007f1178570118 in ParseGenerator::generate(JVMState*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#44 0x00007f1178721b9e in Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#45 0x00007f117856db0b in C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#46 0x00007f117872d967 in CompileBroker::invoke_compiler_on_method(CompileTask*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#47 0x00007f117872e670 in CompileBroker::compiler_thread_loop() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#48 0x00007f1178bf12ac in JavaThread::thread_main_inner() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#49 0x00007f11794fc9ba in Thread::call_run() () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#50 0x00007f11791e47ea in thread_native_entry(Thread*) () from /opt/mach5/mesos/work_dir/jib-master/install/2023-11-06-1544380.tobias.hartmann.valhalla/linux-x64-debug.jdk/jdk-22/fastdebug/lib/server/libjvm.so
#51 0x00007f117a47e1da in start_thread () from /lib64/libpthread.so.0
#52 0x00007f1179ee6e73 in clone () from /lib64/libc.so.6
07-11-2023

That would be a good idea in general.
11-01-2023

Just a note: we have not only a global timeout but also a per-error-step timeout. That was added to deal with scenarios like these, where a single error reporting step gets stuck, holding up all other steps. Regardless of the other issues mentioned (implicit null checks not handled, etc.), this should prevent error reporting from hitting the global timeout and allow subsequent steps to run. But error reporting step timeouts are implemented as part of the STEP logic inside VMError::report(), and a lot of logic, like replay dumping, happens outside of it in VMError::report_and_die(). A mitigation would be to move replay dumping into hs-err reporting as just another STEP, along the lines of the sketch below.
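A rough, untested sketch of that mitigation (an assumption, not a patch: accessor and field names such as _verbose, _thread, as_CompilerThread() and env() approximate the VMError/CompilerThread internals and may differ between releases):

// Hypothetical new step inside VMError::report(), so the replay dump runs
// under the same per-step timeout as every other reporting step.
STEP("dumping compiler replay data")
  if (_verbose && _thread != nullptr && _thread->is_Compiler_thread()) {
    CompilerThread* ct = _thread->as_CompilerThread();
    if (ct->env() != nullptr) {
      // If this hangs, the step timeout interrupts it and reporting continues.
      ct->env()->dump_replay_data_unsafe(st);
    }
  }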
10-01-2023

I think the original reason I added the GUARDED_VM_ENTRY was that the replay dump was accessing raw oops. But looking into it more, it seems like complicated data collection during error reporting, like what the replay dump tries to do, is not so simple. Unlike the old days, when getting a signal meant all the other user threads were suspended, we can now have other threads running at the same time as the error reporting thread. Those threads can grab locks and do other things that the error reporting thread can't safely do. It seems like getting a consistent replay dump can only be done from a different thread, or when other threads are suspended.
21-12-2022

ILW = Timeout during error reporting, rare case but might be related to timeouts we are seeing in the CI, no workaround = MMH = P3
07-12-2022

[~dlong], could you please have a look?
07-12-2022

So this is a regression from JDK-8271911 in JDK 18.
07-12-2022

Moving to compiler team and unassigning myself.
07-12-2022

// Called from VM error reporter, so be careful.
// Don't safepoint or acquire any locks.
//
void ciEnv::dump_replay_data_unsafe(outputStream* out) {
  GUARDED_VM_ENTRY(
    dump_replay_data_helper(out);
  )
}

But GUARDED_VM_ENTRY will transition to VM from native, and that will block if a safepoint is in progress. The code was added by https://github.com/openjdk/jdk/pull/5270
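For context, roughly what the macro expands to (paraphrased from ciUtilities.hpp; details differ between releases, so treat this as a sketch rather than the literal source):

#define GUARDED_VM_ENTRY(action) \
  { if (IS_IN_VM) { action } else { VM_ENTRY_MARK; { action } } }

#define VM_ENTRY_MARK                                   \
  CompilerThread* thread = CompilerThread::current();   \
  ThreadInVMfromNative __tiv(thread);  /* native->VM transition: runs a */ \
                                       /* safepoint check and blocks if */ \
                                       /* a safepoint is in progress    */ \
  /* ... plus ResetNoHandleMark / HandleMarkCleaner bookkeeping ... */

The ThreadInVMfromNative constructor is exactly where the thread in the backtraces above ends up blocked in SafepointSynchronize::block().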
07-12-2022

As [~dholmes] has pointed out, the root cause seems to be that a safepoint check is wrongly performed during error reporting in dump_replay_data_unsafe(), which can lead to the observed timeout. Here is a summary of my analysis.

Setup: C2 code contains a guarantee(false) which is eventually executed by all C2 compiler threads, and they crash:

diff --git a/src/hotspot/share/opto/loopopts.cpp b/src/hotspot/share/opto/loopopts.cpp
--- a/src/hotspot/share/opto/loopopts.cpp	(revision 79e13192321545441217dece4f22276a8e0223c3)
+++ b/src/hotspot/share/opto/loopopts.cpp	(date 1669883711227)
@@ -1498,7 +1498,7 @@
 //   }
 // }
 //
-
+    guarantee(false, "asdf");
     // clone pinned nodes thru the resulting regions
     push_pinned_nodes_thru_region(dom_if, new_true_region);
     push_pinned_nodes_thru_region(dom_if, new_false_region);

Events that must happen in this order to reproduce this timeout/deadlock:

1. The first compiler thread crashes and error reporting is done. The other compiler threads also crash and will just spin in VMError::report_and_die() with os::infinite_sleep() because we are already reporting an error.
2. The error reporting compiler thread (the first crashed thread) calls install_secondary_signal_handler() to properly handle recursive crashes. All other crashes will now call crash_handler() in vmError_posix.cpp.
3. While error reporting is still being performed and not yet finished:
   a. The "Java Thread" executes C2-compiled code and segfaults because of an implicit null check.
   b. The "Java Thread" executes crash_handler(), which does not know anything about implicit null checks and simply wants to report this segfault as a normal error. Since we are already error reporting, this thread will also just spin in error reporting like the other compiler threads. But this thread's state is _thread_in_Java.
   c. The "VM Thread" wants to safepoint to perform a periodic cleanup_op in VMThread::inner_execute() and spins in SafepointSynchronize::synchronize_threads() to wait for all threads to stop running. All threads with state _thread_in_native or _thread_blocked are considered safe in safepoint_safe_with(). However, the "Java Thread" with _thread_in_Java is not! The VM thread keeps waiting for that thread to change its state, but that is never going to happen because it is already spinning indefinitely in error reporting.
4. The compiler error reporting thread also gets stuck when trying to dump the compiler replay data in dump_replay_data_unsafe(), because it cannot do the transition to VM while there is a safepoint request.
5. Error reporting times out after 120 s.

At first sight, there seem to be two problems:

1. The secondary signal handler is not aware of implicit null checks and wrongly treats the implicit null check segfault as a normal error and tries to report it (there might be more cases which we fail to handle/special-case here).
2. We spin indefinitely in VMError::report_and_die() for a second crashing thread even when that thread's state is not _thread_in_native or _thread_blocked, so it blocks any pending safepoint. We not only end up in this state with implicit null checks but also with other crashes where the state is different. For example, the "Java Thread" could crash while executing code in the interpreter runtime when its state is _thread_in_vm. This can be simulated by adding a guarantee(false) in InterpreterRuntime::resolve_from_cache() together with the C2 guarantee(false).

These two issues, however, only become problems because the error reporting thread is not able to finish reporting - otherwise, the error reporting thread would terminate the VM. We are checking for a safepoint in dump_replay_data_unsafe() to dump the compiler replay data and get stuck. [~dholmes] suggested fixing dump_replay_data_unsafe(), as it is wrong to perform safepoint checks during error reporting.
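To make step 3c concrete, the state check the VM thread applies is essentially the following (a paraphrased sketch of safepoint_safe_with() in safepoint.cpp; the real code has more cases):

// Paraphrased sketch: only these states count as safepoint-safe, so a crashed
// thread spinning in error reporting with state _thread_in_Java (or
// _thread_in_vm) stalls SafepointSynchronize::synchronize_threads() forever.
static bool is_safepoint_safe(JavaThreadState state) {
  // _thread_in_native: cannot touch VM state without a checked transition.
  // _thread_blocked:   parked; will run a safepoint check when it wakes up.
  return state == _thread_in_native || state == _thread_blocked;
}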
06-12-2022

[~chagedorn] has done a detailed analysis (yet to be posted), but a key issue in that analysis is that the crashing compiler thread calls dump_replay_data_unsafe(), which tries to transition to _thread_in_vm, which can't happen because a safepoint is in progress during the crash. We must avoid any code that needs to perform safepoint checks during error reporting!
05-12-2022

Once we are in error reporting mode, the fact that the VM may be trying to come to a safepoint is not directly relevant. The crashing thread will not (should not) reach the safepoint, and that is expected. As long as error reporting does not rely on any resource that has been "locked" by the safepoint, error reporting should proceed. The concern is that the source code parsing introduced by JDK-8242181 can take too long and so trigger these timeouts. Given we are in the process of crashing, are we sure that we can safely do this additional work without running into deadlocks with other VM or OS actions?
01-12-2022

I've had a closer look, and it is not per se related to the implementation of JDK-8242181 but rather to the fact that we are spending more time during error reporting. I've replaced the source information parsing with just a long-running loop (depending on the power of the machine, the number of iterations could be tweaked) and we can observe the same problem:

diff --git a/src/hotspot/share/opto/loopopts.cpp b/src/hotspot/share/opto/loopopts.cpp
--- a/src/hotspot/share/opto/loopopts.cpp	(revision 79e13192321545441217dece4f22276a8e0223c3)
+++ b/src/hotspot/share/opto/loopopts.cpp	(date 1669883711227)
@@ -1498,7 +1498,7 @@
 //   }
 // }
 //
-
+    guarantee(false, "asdf");
     // clone pinned nodes thru the resulting regions
     push_pinned_nodes_thru_region(dom_if, new_true_region);
     push_pinned_nodes_thru_region(dom_if, new_false_region);
diff --git a/src/hotspot/share/utilities/vmError.cpp b/src/hotspot/share/utilities/vmError.cpp
--- a/src/hotspot/share/utilities/vmError.cpp	(revision 79e13192321545441217dece4f22276a8e0223c3)
+++ b/src/hotspot/share/utilities/vmError.cpp	(date 1669883711243)
@@ -347,6 +347,8 @@
   }
 }

+static uint32_t crc = 324;
+
 void VMError::print_native_stack(outputStream* st, frame fr, Thread* t, bool print_source_info, char* buf, int buf_size) {

   // see if it's a valid frame
@@ -362,8 +364,14 @@
       if (count == 1 && _lineno != 0) {
         // We have source information of the first frame for internal errors. There is no need to parse it from the symbols.
         st->print("  (%s:%d)", get_filename_only(), _lineno);
-      } else if (print_source_info &&
-                 Decoder::get_source_info(fr.pc(), filename, sizeof(filename), &line_no, count != 1)) {
+      } else if (print_source_info) {
+        uint8_t buffer[20192];
+        for (int i = 0; i < 100000000; i++) {
+          for (int j = 0; j < 4; j++) {
+            crc += buffer[i % 16000] ^ j;
+          }
+        }
+        tty->print_cr("Done");
         st->print("  (%s:%d)", filename, line_no);
       }

Run with a product build without flags: java HelloWorld.java

The problem seems to be that the VM Thread wants to safepoint to perform a periodic cleanup_op. It waits until all threads are suspended, but one of these threads is still running and won't be interrupted: the java thread (id 2):

(gdb) info threads
  Id   Target Id                                   Frame
  ...
  2    Thread 0x7ffff68f0640 (LWP 530278) "java"   0x00007ffff7e2e868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffff68ed420, rem=rem@entry=0x7ffff68ed420) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  ...

Looking at its stack trace, we can see that this thread also crashed, somewhere in InterpreterRuntime::resolve_from_cache(), and it now sleeps indefinitely while the VM thread keeps waiting for it:

#0  0x00007ffff7e2e868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffff68ed420, rem=rem@entry=0x7ffff68ed420) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1  0x00007ffff7e336e7 in __GI___nanosleep (req=req@entry=0x7ffff68ed420, rem=rem@entry=0x7ffff68ed420) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007ffff7e3361e in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007ffff7665f22 in os::infinite_sleep () at open/src/hotspot/os/posix/os_posix.cpp:870
#4  0x00007ffff78aac7d in VMError::report_and_die (id=11, message=message@entry=0x0, detail_fmt=0x7ffff793ac22 "%s", detail_args=detail_args@entry=0x7ffff68ed6c8, thread=0x0, pc=0x7fffe8bbfbc3 "D\213R\bA\201\372\320\034\004", siginfo=0x7ffff68ed8f0, context=0x7ffff68ed7c0, filename=0x0, lineno=0, size=0) at open/src/hotspot/share/utilities/vmError.cpp:1602
#5  0x00007ffff78ab88b in VMError::report_and_die (thread=<optimized out>, sig=<optimized out>, pc=<optimized out>, siginfo=<optimized out>, context=<optimized out>, detail_fmt=detail_fmt@entry=0x7ffff793ac22 "%s") at open/src/hotspot/share/utilities/vmError.cpp:1476
#6  0x00007ffff78ab8be in VMError::report_and_die (thread=<optimized out>, sig=<optimized out>, pc=<optimized out>, siginfo=<optimized out>, context=<optimized out>) at open/src/hotspot/share/utilities/vmError.cpp:1482
#7  <signal handler called>
#8  0x00007fffe8bbfbc3 in ?? ()
#9  0x00007ffff68ee4e0 in ?? ()
#10 0x00007ffff727aaf7 in InterpreterRuntime::resolve_from_cache (current=0x7fffbc00a9f0, bytecode=<optimized out>) at open/src/hotspot/share/interpreter/interpreterRuntime.cpp:963
#11 0x00007fffe8479e03 in ?? ()
#12 0x0000000000000000 in ?? ()

The comment on resolve_from_cache() says that this method does not safepoint:

// This function is the interface to the assembly code. It returns the resolved
// cpCache entry. This doesn't safepoint, but the helper routines safepoint.
// This function will check for redefinition!
JRT_ENTRY(void, InterpreterRuntime::resolve_from_cache(JavaThread* current, Bytecodes::Code bytecode)) {

So that might be the problem, but I'm not familiar with that code and might have missed some things. It would be good if someone from the runtime team could have a look. I'm also not sure how critical it is: it only seems to happen if multiple threads crash and try to report the error. For example, when running my patch with -Xbatch, we do not run into this problem.
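One clarification on the "doesn't safepoint" comment: JRT_ENTRY itself implies thread-state transitions. Roughly (paraphrased from interfaceSupport.inline.hpp; details vary by release, so this is a sketch rather than the literal macro):

#define JRT_ENTRY(result_type, header)                                     \
  result_type header {                                                     \
    ThreadInVMfromJava __tiv(current);   /* Java->VM transition; runs a */ \
                                         /* safepoint check on the way  */ \
                                         /* back out to Java            */ \
    HandleMarkCleaner __hm(current);                                       \
    /* ... debug-only entry wrappers ... */

So the thread executes resolve_from_cache() with state _thread_in_vm, which is not safepoint-safe. If it crashes there and then spins in error reporting, the VM thread's pending safepoint can never complete, matching the analysis above.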
01-12-2022

I suggest setting this to at least P3 and trying to get a fix into JDK 20. I'll have a look.
30-11-2022

As [~dnsimon] pointed out, this might be related to JI-9071551. Update: It's not related to JI-9071551.
30-11-2022

I verified that this is a regression from JDK-8242181, [~chagedorn].
30-11-2022

I'm not sure if P4/tbd is justified here because these timeouts happen frequently in the CI, currently in the context of JDK-8297345.
30-11-2022