Bug ID: JDK-8359820 Improve handshake/safepoint timeout diagnostic messages

JDK-8359820 : Improve handshake/safepoint timeout diagnostic messages

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 17,26

Priority: P4
Status: Resolved
Resolution: Fixed

Submitted: 2025-06-17
Updated: 2025-08-07
Resolved: 2025-08-06

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 26
26 b10Fixed

Related Reports

Causes :	JDK-8219584 - Try to dump error file by thread which causes safepoint timeout
Relates :	JDK-8189941 - Implementation JEP 312: Thread-local handshake

Description

I did some testing with SDE [1] and noticed the following crash:

sde64 -emit-illegal-insts 0 -future -- java -XX:+UnlockDiagnosticVMOptions -XX:HandshakeTimeout=1 Test.java

(also happens wit -rpl instead of -future)

[12.526s][error][handshake] Handshake timeout: Deoptimize(0x00007d9dc1495b90), pending threads: 9
[12.540s][error][handshake] JavaThread 0x00007d9db402d740 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db41801e0 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db4181520 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db4182d90 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db4184980 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db4186480 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db4188a30 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9db418a720 has not cleared handshake op: 0x00007d9dc1495b90
[12.541s][error][handshake] JavaThread 0x00007d9d40017fa0 has not cleared handshake op: 0x00007d9dc1495b90
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by kill), pid=329828, tid=329852
#
# JRE version: Java(TM) SE Runtime Environment (26.0+3) (fastdebug build 26-ea+3-153)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-ea+3-153, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libc.so.6+0x98d71]
#

Stack: [0x00007d9d346f1000,0x00007d9d347f1000],  sp=0x00007d9d347ef9b0,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x98d71]
C  [libc.so.6+0x9bc8e]  pthread_cond_timedwait+0x23e
V  [libjvm.so+0x1799c6d]  PlatformMonitor::wait(unsigned long)+0xbd
V  [libjvm.so+0x16db26a]  Monitor::wait(unsigned long)+0xca
V  [libjvm.so+0xb509cb]  CompileQueue::get(CompilerThread*)+0x8b
V  [libjvm.so+0xb5653b]  CompileBroker::compiler_thread_loop()+0x3ab
V  [libjvm.so+0x10a8e0b]  JavaThread::thread_main_inner()+0x13b
V  [libjvm.so+0x1b0ca06]  Thread::call_run()+0xb6
V  [libjvm.so+0x1787f38]  thread_native_entry(Thread*)+0x128
C  [libc.so.6+0x9caa4]




It reproduces since JDK 17u for me.

[1] https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html

Comments

Changeset: 6656e767 Branch: master Author: Anton Artemov <anton.artemov@oracle.com> Committer: David Holmes <dholmes@openjdk.org> Date: 2025-08-06 04:45:35 +0000 URL: https://git.openjdk.org/jdk/commit/6656e767db05e8e11ea17385674d8aa4385c60ed
06-08-2025
I had forgotten about these two uses of signals for timeouts.So we need a more generic mechanism to indicate that the SIGILL was triggered by a timeout.
16-07-2025
Good, I renamed the issue to better reflect what are we doing here. I think the original synopsis was a symptom of the problem we are fixing: spooky SIGILLs, which may easily be interpreted as actual compiler bug emitting wrong instructions (especially when running under simulator!). Diagnostics should not be that spooky.
16-07-2025
OK, from this log, can you see _why_ threads have not cleared the handshake op? That's a common problem in the field: something does not reach handshake/safepoint, and you want to understand why. So the purpose of SIGILL is to trigger crash from the _remote_ thread, the one that is actually blocking the handshake/safepoint. The error diagnostics from that remote thread often directly shows what that remote thread is so busy with instead of reacting promptly at handshake/safepoint request. The final "fatal", AFAICS, handles the case when remote thread had not crashed for the 3 seconds. For whatever reason, that thread might be non-responsive at all. So we patiently wait for 3 seconds, and then proceed to crash from this caller thread. Or, we cannot send signal at all, then we also crash from here. if (target != nullptr) { if (os::signal_thread(target, SIGILL, "cannot be handshaked")) { // Give target a chance to report the error and terminate the VM. os::naked_sleep(3000); } } else { log_error(handshake)("No thread with an unfinished handshake op(" INTPTR_FORMAT ") found.", p2i(op)); } fatal("Handshake timeout");
16-07-2025
Speaking of, if you are handling the handshake part, maybe you want to handle the safepoint part as well? https://github.com/openjdk/jdk/blob/5e4a2ead714814cb4eb90ca88debc226f9c75864/src/hotspot/share/runtime/safepoint.cpp#L653-L655
16-07-2025
To be clear, I fully agree that SIGILL without any additional hints is confusing if you don't know you are supposed to get it. So an improvement that makes it more clear would be very welcome. Just try not to break this rather useful diagnostic feature :)
16-07-2025
A few findings: In the current state of the code, there is a fatal macro in the end of handshake::handle_timeout(). That thing will end up calling VMError::report(), but by the time the call is done another fatal error (the one caused by the SIGILL signal) will be already reported, and, if I get it right, there can be only one fatal error reported (see out_done and log_done variables in the VMError::report()). Without the SIGILL firing, one gets the following crash output, which is way more informative: ../intel_sde/sde -emit-illegal-insts 0 -future -- java -XX:+UnlockDiagnosticVMOptions -XX:HandshakeTimeout=1 Test.java [59,083s][error][handshake] Handshake timeout: Deoptimize(0x00007d189e9188f0), pending threads: 9 [59,219s][error][handshake] JavaThread 0x00007d188802cb80 has not cleared handshake op: 0x00007d189e9188f0 [59,235s][error][handshake] JavaThread 0x00007d18881b0ef0 has not cleared handshake op: 0x00007d189e9188f0 [59,257s][error][handshake] JavaThread 0x00007d18881b2a50 has not cleared handshake op: 0x00007d189e9188f0 [59,261s][error][handshake] JavaThread 0x00007d18881b47c0 has not cleared handshake op: 0x00007d189e9188f0 [59,261s][error][handshake] JavaThread 0x00007d18881b6410 has not cleared handshake op: 0x00007d189e9188f0 [59,261s][error][handshake] JavaThread 0x00007d18881b7f10 has not cleared handshake op: 0x00007d189e9188f0 [59,261s][error][handshake] JavaThread 0x00007d18881bc150 has not cleared handshake op: 0x00007d189e9188f0 [59,261s][error][handshake] JavaThread 0x00007d1888253ad0 has not cleared handshake op: 0x00007d189e9188f0 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/antoni/src/jdk/open/src/hotspot/share/runtime/handshake.cpp:211), pid=13461, tid=13471 # fatal error: Handshake timeout # # JRE version: Java(TM) SE Runtime Environment (26.0) (slowdebug build 26-internal-2025-07-16-0722133.antoni...) # Java VM: Java HotSpot(TM) 64-Bit Server VM (slowdebug 26-internal-2025-07-16-0722133.antoni..., mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0xfa16f2] handle_timeout(HandshakeOperation, JavaThread)+0x1fb I am not sure about the purpose of the fatal macro after SIGILL has been fired.
16-07-2025
Thanks [~shade], with a small hint from [~dholmes] I've made it work. Sure, I can do safepoint as well.
16-07-2025
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/26309 Date: 2025-07-15 08:37:38 +0000
15-07-2025
[~dholmes] I re-opened the issue and submitted a PR with a small enhancement as suggested. Please take a look.
15-07-2025
[~aartemov] I think that is a bit too obscure. I think we should augment the SIGILL handling so that it is very clear what is happening from the initial crash information that also goes to stdout/err.
15-07-2025
> I'm unclear on the intent/purpose of this handshake timeout mechanism. If I am reading things correctly if we appear to take too long to respond to a handshake then the handshake code fires a SIGILL at the target which it then expects to abort the VM. I'm unclear why we do not handle that SIGILL so that the crash report makes it clear this is what happened and that it was not some other kind of SIGILL. The intent for SIGILL is to trigger the crash at the thread that blocks handshake/safepoint sync. E.g. a Java thread that is stuck on miscompiled loop without safepoint checks. Or some VM code that spins without VM transitions. See JDK-8219584. This feature is remarkably useful in the field, used this dozens of times. So whatever we do, we need to keep printing the instructions block and hopefully a backtrace.
15-07-2025
The existing error reporting mechanism provides a way how to identify what happened.
14-07-2025
I think that what one can see in the crash report with -emit-illegal-insts 1 (default), i.e. the following event: Event: 45,705 Thread 0x00007c4c9c1515f0 sent signal 4 to Thread 0x00007c4c9c253ad0 because cannot be handshaked identifies what happened. I close this issue as "won't fix".
14-07-2025
Right, I'm also not familiar enough with this code to tell but the SIGILL failure mode seems unexpected to me. So I'd say we should either improve the error reporting for the expected SIGILL case or just close this as "Won't fix" because it's such an edge case.
11-07-2025
I'm unclear on the intent/purpose of this handshake timeout mechanism. If I am reading things correctly if we appear to take too long to respond to a handshake then the handshake code fires a SIGILL at the target which it then expects to abort the VM. I'm unclear why we do not handle that SIGILL so that the crash report makes it clear this is what happened and that it was not some other kind of SIGILL. With a low timeout value combined with running under the emulator, triggering this crash is not unexpected.
11-07-2025
Well, I think it is a timing issue. Here is what I found with the help of gdb. 1) The SIGILL signal is sent in handle_timeout(HandshakeOperation* op, JavaThread* target) on line 204 of handshake.cpp. This happens in two cases, either target it not null when the method is called or target is null, and then we look for a thread with a pending operation. It becomes a new target. 2) handle_timeout(HandshakeOperation* op, JavaThread* target) is called from check_handshake_timeout(jlong start_time, HandshakeOperation* op, JavaThread* target = nullptr) IF os::javaTimeNanos() >= (start_time + timeout_ns) 3) Whenever the code is executed with SDE, that condition is true, but the target is null, because the call is made from VM_HandshakeAllThreads::doit(). So we arrive in the situation where SIGILL will be fired. 4) When executing without SDE, the condition os::javaTimeNanos() >= (start_time + timeout_ns) is not true, so we do not call handle_timeout(). So, my hypothesis is that with SDE the condition is triggered because of the introduced overhead. I am not sure if and how SDE affects time routines. [~thartmann], [~dholmes] Do you think it is a "feature" of SDE or a bug? I do not think we should do any adjustments on our side.
10-07-2025
Another finding: "Event: 45,705 Thread 0x00007c4c9c1515f0 sent signal 4 to Thread 0x00007c4c9c253ad0 because cannot be handshaked." means VMThread sends SIGILL to the JavaThread "Common-Cleaner", see line 204 of handshake.cpp
10-07-2025
A bit more experiments: I was able to reproduce the problem on Linux Mint 22.1 running inside of Oracle Virtual Box on my Windows machine. This is what -long-info reports about the flag of interest: -emit_illegal_insts [default 1] Emit information about illegal instructions. Default is enabled. So removing that flag is the same as having it set to 1. In this regard I cannot reproduce the message that [~thartmann] got without the flag, for me both cases look identical, but there is nothing about compiler-related classes: Current thread (0x00007f5d70263dc0): JavaThread "Common-Cleaner" daemon [_thread_blocked, id=400645, stack(0x00007f5cf71dd000,0x00007f5cf72dd000) (1024K)] Stack: [0x00007f5cf71dd000,0x00007f5cf72dd000], sp=0x00007f5cf72db1d0, free space=1016k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [libc.so.6+0x98d71] C [libc.so.6+0x9bc8e] pthread_cond_timedwait+0x23e V [libjvm.so+0x17adf26] PlatformEvent::park_nanos(long)+0x126 (os_posix.cpp:1750) V [libjvm.so+0x1752805] ObjectMonitor::wait(long, bool, JavaThread)+0xbd5 (objectMonitor.cpp:1850) V [libjvm.so+0x1aba77f] ObjectSynchronizer::wait(Handle, long, JavaThread)+0x7f (synchronizer.cpp:754) V [libjvm.so+0x123d04b] JVM_MonitorWait+0xcb (jvm.cpp:631) j java.lang.Object.wait0(J)V+0 java.base j java.lang.Object.wait(J)V+55 java.base j java.lang.ref.ReferenceQueue.remove0(J)Ljava/lang/ref/Reference;+21 java.base j java.lang.ref.ReferenceQueue.remove(J)Ljava/lang/ref/Reference;+36 java.base j jdk.internal.ref.CleanerImpl.run()V+45 java.base j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base j java.lang.Thread.run()V+19 java.base j jdk.internal.misc.InnocuousThread.run()V+20 java.base v ~StubRoutines::call_stub 0x00007f5d685386fb V [libjvm.so+0x107e1b7] JavaCalls::call_helper(JavaValue, methodHandle const&, JavaCallArguments, JavaThread)+0x4f7 (javaCalls.cpp:415) V [libjvm.so+0x107e913] JavaCalls::call_virtual(JavaValue, Klass, Symbol, Symbol, JavaCallArguments, JavaThread)+0x313 (javaCalls.cpp:323) V [libjvm.so+0x107ef53] JavaCalls::call_virtual(JavaValue, Handle, Klass, Symbol, Symbol, JavaThread)+0xb3 (javaCalls.cpp:185) V [libjvm.so+0x1236743] thread_entry(JavaThread, JavaThread)+0xe3 (jvm.cpp:2748) V [libjvm.so+0x10bbecb] JavaThread::thread_main_inner()+0x13b (javaThread.cpp:773) V [libjvm.so+0x1b22de6] Thread::call_run()+0xb6 (thread.cpp:243) V [libjvm.so+0x179d298] thread_native_entry(Thread*)+0x128 (os_linux.cpp:868) C [libc.so.6+0x9caa4] The only difference I was able to find so far is this one, with -emit-illegal-insts 1 (default) if one looks at events, there is a signal 4 sent: Events (14 events): Event: 2,622 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c02cb50 Event: 7,425 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1b0f00 Event: 7,483 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1b2a60 Event: 7,600 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1b47d0 Event: 7,620 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1b6420 Event: 7,633 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1b7f20 Event: 7,657 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1ba470 Event: 7,712 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c1c4170 Event: 16,874 Thread 0x00007c4c9c02cb50 Thread added: 0x00007c4c9c253ad0 Event: 31,502 Thread 0x00007c4c9c1c4170 Thread added: 0x00007c4c302a0fc0 Event: 34,728 Thread 0x00007c4c9c1c4170 Thread added: 0x00007c4c302e1280 Event: 43,443 Thread 0x00007c4c302a0fc0 Thread exited: 0x00007c4c302a0fc0 Event: 43,446 Thread 0x00007c4c302e1280 Thread exited: 0x00007c4c302e1280 Event: 45,705 Thread 0x00007c4c9c1515f0 sent signal 4 to Thread 0x00007c4c9c253ad0 because cannot be handshaked. This last line is absent if -emit-illegal-insts is 0. Is that correct to say that the thread cannot be handshaked in both cases? But it is reported only with the flag on? Investigating further.
10-07-2025
I've tested it on Windows x64, cannot reproduce illegal instruction, it does not matter whether -emit-illegal-insts is 0 or 1, or the flag is absent, it is always like this: sde -emit-illegal-insts 0 -future -- java -XX:+UnlockDiagnosticVMOptions -XX:HandshakeTimeout=1 Test.java [23.284s][error][handshake] Handshake timeout: Deoptimize(0x000000c5721fdcd8), pending threads: 12 [23.319s][error][handshake] JavaThread 0x0000022a095aaff0 has not cleared handshake op: 0x000000c5721fdcd8 [23.322s][error][handshake] JavaThread 0x0000022a8cf31950 has not cleared handshake op: 0x000000c5721fdcd8 [23.323s][error][handshake] JavaThread 0x0000022a8cf37590 has not cleared handshake op: 0x000000c5721fdcd8 [23.324s][error][handshake] JavaThread 0x0000022a8cf3ba00 has not cleared handshake op: 0x000000c5721fdcd8 [23.324s][error][handshake] JavaThread 0x0000022a8cf3c6a0 has not cleared handshake op: 0x000000c5721fdcd8 [23.325s][error][handshake] JavaThread 0x0000022a8cf40e80 has not cleared handshake op: 0x000000c5721fdcd8 [23.328s][error][handshake] JavaThread 0x0000022a8cf42f30 has not cleared handshake op: 0x000000c5721fdcd8 [23.329s][error][handshake] JavaThread 0x0000022a8cf54920 has not cleared handshake op: 0x000000c5721fdcd8 [23.329s][error][handshake] JavaThread 0x0000022a8cfb3790 has not cleared handshake op: 0x000000c5721fdcd8 [23.329s][error][handshake] JavaThread 0x0000022a8d0a45a0 has not cleared handshake op: 0x000000c5721fdcd8 [23.329s][error][handshake] JavaThread 0x0000022a8d0a4e90 has not cleared handshake op: 0x000000c5721fdcd8 # # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (d:\src\jdk\open\src\hotspot\share\runtime\handshake.cpp:211), pid=32072, tid=46796 # fatal error: Handshake timeout # # JRE version: Java(TM) SE Runtime Environment (26.0) (fastdebug build 26-internal-2025-07-07-1009335.aartemov...) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-internal-2025-07-07-1009335.aartemov..., mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64) # Problematic frame: # V [jvm.dll+0x868763] handle_timeout+0x193
07-07-2025
No, I could never reproduce this without SDE, so it might well be an issue with SDE itself. Maybe not worth investigating this in more depth for now then but it's good to have it documented for when we potentially run more testing with SDE enabled (JDK-8354684).
18-06-2025
[~thartmann] can you reproduce this without SDE? I couldn't. I wonder whether SDE is messing with the SIGILL delivery and reception? This may be a real SIGILL we potentially trigger during VM_Version checks, but I could only see one use of pthread_kill(SIGILL) and that relates to VM error handing.
18-06-2025
[~dholmes] The emulator can't let the CPU fire a "real" SIGILL, right? Without the '-emit-illegal-insts 0' option of SDE, we get: [54.231s][error][handshake] Handshake timeout: Deoptimize(0x000072bb1c2b7100), pending threads: 11 [54.249s][error][handshake] JavaThread 0x000072bb1002d740 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb101801d0 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb10181510 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb10182d80 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb10184970 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb10186470 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb10188a20 has not cleared handshake op: 0x000072bb1c2b7100 [54.250s][error][handshake] JavaThread 0x000072bb1018a710 has not cleared handshake op: 0x000072bb1c2b7100 [54.251s][error][handshake] JavaThread 0x000072bb102408b0 has not cleared handshake op: 0x000072bb1c2b7100 [54.251s][error][handshake] JavaThread 0x000072bb10252d20 has not cleared handshake op: 0x000072bb1c2b7100 [54.251s][error][handshake] JavaThread 0x000072ba9c103fc0 has not cleared handshake op: 0x000072bb1c2b7100 Illegal instruction at address = 72bb20898d71: 44 89 ef 48 89 c3 e8 54 f8 ff ff e9 65 ff ff Image name: /lib/x86_64-linux-gnu/libc.so.6 Offset in image: 0x98d71 If you believe your application should attempt to execute this illegal instruction (and others that may be present), Then use this knob: -emit-illegal-insts 0 and this error message will be avoided. SDE ERROR: Illegal instruction at address = 72bb20898d71: 44 89 ef 48 89 c3 e8 54 f8 ff ff e9 65 ff ff Image name: /lib/x86_64-linux-gnu/libc.so.6 Offset in image: 0x98d71 If you believe your application should attempt to execute this illegal instruction (and others that may be present), Then use this knob: -emit-illegal-insts 0 and this error message will be avoided.
18-06-2025
> # SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by kill), pid=329828, tid=329852 The emulator appears to be firing SIGILL at the process. ??
18-06-2025