Bug ID: JDK-8369255 Assess and remedy any unsafe usage of the Semaphores used by JFR

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 26

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2025-10-07
Updated: 2025-10-21
Resolved: 2025-10-13

Versions (Unresolved/Resolved/Fixed)

JDK 26
26 b20 Fixed

Related Reports

Relates :	JDK-8361462 - JVM crashed with assert(ret == 0) failed: Failed to wait on semaphore
Relates :	JDK-8369991 - Thread blocking during JFR emergency dump must be in safepoint safe state

Description

Extracted from JDK-8361462:

The original semaphore-related crash was:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S577077/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/1725c443-f2fc-4cb3-8d36-4912f92abfb1/runs/bf13cee2-c1c1-49dc-af9a-95488455fd59/workspace/open/src/hotspot/os/bsd/semaphore_bsd.cpp:65), pid=69909, tid=58739
# assert(ret == 0) failed: Failed to wait on semaphore 

--------------- T H R E A D ---------------

Current thread (0x0000000298d41a10): JavaThread "Thread-7972" daemon [_thread_in_vm, id=58739, stack(0x000000028ab04000,0x000000028ad07000) (2060K)]

Stack: [0x000000028ab04000,0x000000028ad07000], sp=0x000000028ad05c30, free space=2055k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.dylib+0x12125c8] VMError::report(outputStream*, bool)+0x1b00 (semaphore_bsd.cpp:65)
V [libjvm.dylib+0x1215e68] VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x55c
V [libjvm.dylib+0x5a8448] print_error_for_unit_test(char const*, char const*, char*)+0x0
V [libjvm.dylib+0xfd8330] OSXSemaphore::trywait()+0x0
V [libjvm.dylib+0x98cd00] JfrThreadGroup::thread_group_id_internal(JfrThreadGroupsHelper&)+0x30
V [libjvm.dylib+0x98cc10] JfrThreadGroup::thread_group_id(JavaThread const*, Thread*)+0xe8 

The JFR code has been reorganized somewhat since the initial report but it still contains three statically allocated Semaphores:

./share/jfr/leakprofiler/checkpoint/objectSampleCheckpoint.cpp:  static Semaphore _mutex_semaphore;
./share/jfr/recorder/checkpoint/types/jfrThreadGroupManager.cpp:  static Semaphore _mutex_semaphore;
./share/jfr/recorder/checkpoint/types/jfrTypeManager.cpp:  static Semaphore _mutex_semaphore;

These Semaphores need to be examined to see if they can be used after the static destructor has executed, during VM termination. The Semaphore is unsafe if it can be accessed by a `NonJavaThread`, or by a `JavaThread` in a safepoint-safe state.

Comments

Changeset: 62fa971f Branch: master Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2025-10-13 11:34:30 +0000 URL: https://git.openjdk.org/jdk/commit/62fa971f3116828398294c9d57650c4e0aca7379
13-10-2025
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/27722 Date: 2025-10-09 10:07:35 +0000
09-10-2025
Yes, that is the meaning I intended. I want the design to exclude any issues, if possible.
09-10-2025
> (This also explains the stacktrace where a thread is in thread_in_vm (because the transition to that thread state in Jfr::on_vm_shutdown() is non-official))

By non-official I assume you mean it is done directly and so bypasses the checks that would have held it at the termination safepoint?

> We can coordinate better what threads do when they call into JFR_ONLY(Jfr::on_vm_shutdown(true, false, halt)).

Or just use `DeferredStatic` for the semaphores so they don't get destroyed - but you do have to insert an init call somewhere appropriate.
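
The idea behind that suggestion can be pictured with a minimal sketch. This is an illustration only, not HotSpot's `DeferredStatic`: the names `NeverDestroyed`, `ToySemaphore`, and `g_mutex_semaphore` are hypothetical, and the toy semaphore is single-threaded. The point it shows is that the wrapper itself is trivially destructible, so nothing runs for it at process exit, at the cost of needing one explicit initialization call.

```cpp
#include <cassert>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical toy type standing in for a semaphore with a non-trivial destructor.
struct ToySemaphore {
  static int live_count;  // constructed-but-not-destroyed instances
  int permits;
  explicit ToySemaphore(int initial) : permits(initial) { ++live_count; }
  ~ToySemaphore() { --live_count; }
  bool try_wait() {
    if (permits > 0) { --permits; return true; }
    return false;
  }
  void signal() { ++permits; }
};
int ToySemaphore::live_count = 0;

// Illustrative wrapper (an assumption, not the HotSpot class): reserves raw
// storage, constructs the object on an explicit initialize() call, and never
// runs the object's destructor.
template <typename T>
class NeverDestroyed {
  alignas(T) unsigned char _storage[sizeof(T)];
  T* _ptr = nullptr;

 public:
  template <typename... Args>
  void initialize(Args&&... args) {
    assert(_ptr == nullptr && "initialize() must be called exactly once");
    _ptr = ::new (static_cast<void*>(_storage)) T(std::forward<Args>(args)...);
  }
  T* get() {
    assert(_ptr != nullptr && "used before initialize()");
    return _ptr;
  }
  T* operator->() { return get(); }
};

// The wrapper has no destructor work to do, unlike the wrapped type.
static_assert(std::is_trivially_destructible<NeverDestroyed<ToySemaphore>>::value,
              "wrapper must not register an exit-time destructor");
static_assert(!std::is_trivially_destructible<ToySemaphore>::value,
              "the wrapped toy type has a non-trivial destructor");

// Static-storage instance; it must be initialized explicitly before first use.
static NeverDestroyed<ToySemaphore> g_mutex_semaphore;
```

A caller would run `g_mutex_semaphore.initialize(1);` once during startup and then use `g_mutex_semaphore->try_wait()` and `g_mutex_semaphore->signal()` as usual.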
08-10-2025
I have a hypothesis about what is happening:

1. The DestroyJavaVM thread first invokes shutdown hooks as part of Threads::destroy_vm(). If JFR is running, a shutdown hook will be registered. Its task is to stop any running recordings and terminate JFR gracefully.

 # RetAddr : Args to Child : Call Site
00 00007ffb`de825376 : 000002ac`07d14cd0 00007ffb`df37d100 ffffffff`00000000 00007ffb`dd7869d1 : jvm!JavaThread::invoke_shutdown_hooks+0xbd [d:\dev\github\jdk_copy4\open\src\hotspot\share\runtime\javaThread.cpp @ 2101]
01 00007ffb`de0cef09 : 000002ac`07d14cd0 000000f6`00000006 000000f6`fc3ff801 00007ffb`de811896 : jvm!Threads::destroy_vm+0x106 [d:\dev\github\jdk_copy4\open\src\hotspot\share\runtime\threads.cpp @ 973]
02 00007ffb`de0cef3b : 00007ffb`df5d3170 000002ac`00000000 000002ac`00000001 000002ac`24296ca0 : jvm!jni_DestroyJavaVM_inner+0xa9 [d:\dev\github\jdk_copy4\open\src\hotspot\share\prims\jni.cpp @ 3741]
03 00007ffc`1718aa79 : 00007ffb`df5d3170 000002ac`24298628 000002ac`24298640 00007ffc`1719ebd4 : jvm!jni_DestroyJavaVM+0x1b [d:\dev\github\jdk_copy4\open\src\hotspot\share\prims\jni.cpp @ 3751]
04 00007ffc`1718fd53 : 000000f6`fbd8f2f8 00007ffc`311f3100 00007ffc`33a3d3a0 00000000`00000000 : jli!JavaMain+0xfb9 [d:\dev\github\jdk_copy4\open\src\java.base\share\native\libjli\java.c @ 668]
05 00007ffc`339a1bb2 : 000000f6`fbd8f2f8 00000000`00000000 00000000`00000000 00000000`00000000 : jli!ThreadJavaMain+0x13 [d:\dev\github\jdk_copy4\open\src\java.base\windows\native\libjli\java_md.c @ 633]
06 00007ffc`34a47344 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ucrtbase!thread_start<unsigned int (__cdecl)(void ),1>+0x42
07 00007ffc`35e826b1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : KERNEL32!BaseThreadInitThunk+0x14
08 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

2. But the JFR Shutdown hook can fail for many reasons (for example, OOM, or other unhandled exceptions, or the JFR Recorder Thread (handling requests from the JFR Shutdown hook) might run into a problem when terminating (crashing)).

3. If the Shutdown hook fails for some reason, JFR will still be in a recording state.

4. The DestroyJavaVM thread then enters before_exit(); it is non-reentrant with other threads, so only one thread will execute the function.

5. As part of before_exit(), there is a call to JFR_ONLY(Jfr::on_vm_shutdown(true, false, halt);). This function is also invoked as part of VMError::report_and_die().

6. JFR_ONLY(Jfr::on_vm_shutdown(true, false, halt);) is also non-reentrant. It means that if the DestroyJavaVM thread competes against a crashing thread and loses the race, it returns immediately and unwinds to call the CRT exit, while another thread is still running the JFR code as part of an emergency dump. Therefore, the DestroyJavaVM thread is cutting off CRT support from underneath.

(This also explains the stacktrace where a thread is in thread_in_vm (because the transition to that thread state in Jfr::on_vm_shutdown() is non-official))

We can coordinate better what threads do when they call into JFR_ONLY(Jfr::on_vm_shutdown(true, false, halt)). It boils down to waiting until the other thread finishes, resulting in either of:

1. Graceful exit,
2. Crashing.
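
The "wait until the other thread finishes" idea can be sketched in portable C++ as follows. This is a minimal illustration under stated assumptions, not the change that was committed: `ShutdownGate`, `run_once_or_wait`, and `demo_competing_callers` are hypothetical names, the standard-library mutex and condition variable stand in for whatever primitives the VM would use, and the sketch ignores thread states entirely (JDK-8369991 concerns blocking in a safepoint-safe state).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-shot gate: the first caller runs the work, and every
// later caller blocks until that work has completed instead of returning
// immediately.
class ShutdownGate {
  std::mutex _lock;
  std::condition_variable _done_cv;
  bool _claimed = false;  // some thread has taken responsibility for the work
  bool _done = false;     // the work has finished

 public:
  // Returns true if the calling thread ran the work, false if it waited.
  template <typename Work>
  bool run_once_or_wait(Work work) {
    {
      std::unique_lock<std::mutex> guard(_lock);
      if (_claimed) {
        _done_cv.wait(guard, [this] { return _done; });
        return false;
      }
      _claimed = true;
    }
    work();  // executed outside the lock by exactly one thread
    {
      std::lock_guard<std::mutex> guard(_lock);
      _done = true;
    }
    _done_cv.notify_all();
    return true;
  }
};

struct DemoResult {
  int work_runs;                  // how many times the work executed
  int callers_seeing_completion;  // callers that returned after the work finished
};

// Starts thread_count competing callers against a single gate.
DemoResult demo_competing_callers(int thread_count) {
  assert(thread_count > 0);
  ShutdownGate gate;
  std::atomic<int> work_runs{0};
  std::atomic<bool> work_finished{false};
  std::atomic<int> callers_seeing_completion{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < thread_count; ++i) {
    threads.emplace_back([&] {
      gate.run_once_or_wait([&] {
        ++work_runs;
        // Simulated slow dump, so that the other callers really have to wait.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        work_finished = true;
      });
      if (work_finished.load()) ++callers_seeing_completion;
    });
  }
  for (std::thread& t : threads) t.join();
  return DemoResult{work_runs.load(), callers_seeing_completion.load()};
}
```

With this shape, a thread that loses the race cannot proceed to process exit while the winner is still inside the work; it returns only after the work has finished.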
08-10-2025
[~mgronlun] Yes as per the style guide: "Variables with static storage duration and non-trivial destructors should be avoided. HotSpot doesn't generally try to cleanup on exit, and running destructors at exit can lead to problems."
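
One way to make that guideline mechanically checkable is a compile-time trait test. The sketch below is only an illustration: `PlainCounter`, `ToyLock`, and `ok_as_plain_static` are hypothetical names and not part of HotSpot, and the check covers the destructor only.

```cpp
#include <type_traits>

// Hypothetical toy types, used only to exercise the check.
struct PlainCounter {
  int value;  // no user-provided destructor, so destruction is trivial
};

struct ToyLock {
  int permits;
  ~ToyLock() { permits = 0; }  // user-provided, hence non-trivial
};

// True when a variable of type T with static storage duration would not
// need any destructor to run at process exit.
template <typename T>
constexpr bool ok_as_plain_static() {
  return std::is_trivially_destructible<T>::value;
}

static_assert(ok_as_plain_static<PlainCounter>(),
              "trivially destructible types pass the check");
static_assert(!ok_as_plain_static<ToyLock>(),
              "types with a user-provided destructor are flagged");
```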
07-10-2025
The JavaThread list doesn't drain, nor do all the NJTs terminate as part of a shutdown. The termination process takes the VM to a safepoint and tears down a bunch of stuff. Once the VMThread has been shut down, we reach a point where we return to the launcher and process exit occurs - as you note. There's no way to know when it is "safe" to call exit() - that's why we are not supposed to have non-trivial static objects.
07-10-2025
[~dholmes] where exactly is the call that initiates the CRT teardown located? From what I inferred from JDK-8361462, it appears the DestroyJavaVM thread is allowed to escape back to the launcher and invoke exit() while other threads are still running. Isn't that the real issue to be resolved? I.e. can't the DestroyJavaVM thread wait for the lists of NonJavaThreads and JavaThreads to drain before returning?
07-10-2025
[~dholmes] The implication is that no static objects are safe.
07-10-2025