JDK-8369515 : Deadlock between JVMTI and JNI ReleasePrimitiveArrayCritical
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 26
  • Priority: P3
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2025-10-09
  • Updated: 2025-12-15
Description
The following test timed out in the JDK26 CI:

applications/kitchensink/Kitchensink.java

Here's a snippet from the log file:

Stress process is started and all modules initialized.
Dumping stress process VM options.
VM non-default flags reported by jcmd VM.flags:
62376:
-XX:-AOTAdapterCaching -XX:-AOTInvokeDynamicLinking -XX:-AOTRecordTraining -XX:-AOTReplayTraining -XX:-AOTStubCaching -XX:+BytecodeVerificationLocal -XX:+BytecodeVerificationRemote -XX:CICompilerCount=4 -XX:CompileCommand=memlimit,*.*,0 -XX:CompressedClassSpaceSize=436207616 -XX:+CrashOnOutOfMemoryError -XX:+DisableExplicitGC -XX:+DisplayVMOutputToStderr -XX:-DisplayVMOutputToStdout -XX:+FlightRecorder -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=35651584 -XX:MaxHeapSize=8589934592 -XX:MaxMetaspaceSize=536870912 -XX:MaxRAM=17179869184 -XX:MaxRAMPercentage=50.000000 -XX:MinHeapDeltaBytes=2097152 -XX:MinHeapSize=8388608 -XX:NativeMemoryTracking=detail -XX:NonNMethodCodeHeapSize=5832704 -XX:NonProfiledCodeHeapSize=122929152 -XX:ProfiledCodeHeapSize=122896384 -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:SoftMaxHeapSize=8589934592 -XX:+StartAttachListener -XX:+UnlockDiagnosticVMOptions -XX:-UseCompressedOops -XX:-UseNUMA -XX:-UseNUMAInterleaving -XX:+UseZGC -XX:+WhiteBoxAPI -XX:ZOldGCThreads=2 -XX:ZYoungGCThreads=2 

Starting picker module.

Picker module started: Jfr
Picker module started: Jcmd
Picker module started: Monitor
Picker module started: NMT
Picker module started: Perfmon
Picker module started: Jstat
[stress.process.err] [ 2025-10-09T14:23:34.422836Z ]  Iteration done: Instrumentation at Thu Oct 09 14:23:34 GMT 2025
[stress.process.err] [ 2025-10-09T14:23:34.435327Z ]  Iteration done: LockDeflation at Thu Oct 09 14:23:34 GMT 2025
 stdout: [];
 stderr: [jfr summary: file is empty '/System/Volumes/Data/mesos/work_dir/slaves/0103b69c-746c-4fb5-bf13-94918f380124-S1706/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d75cd46a-29a3-442a-9af9-f40219daf28d/runs/2c2d5b27-4653-46a3-8eff-d98e27dd80ea/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/jfr-files/external/ks_external13858030528049849745.jfr'
]
 exitValue = 1

 stdout: [];
 stderr: [jfr summary: file is empty '/System/Volumes/Data/mesos/work_dir/slaves/0103b69c-746c-4fb5-bf13-94918f380124-S1706/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d75cd46a-29a3-442a-9af9-f40219daf28d/runs/2c2d5b27-4653-46a3-8eff-d98e27dd80ea/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/jfr-files/external/ks_external11231795738798007271.jfr'
]
 exitValue = 1

 stdout: [];
 stderr: [jfr summary: file is empty '/System/Volumes/Data/mesos/work_dir/slaves/0103b69c-746c-4fb5-bf13-94918f380124-S1706/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d75cd46a-29a3-442a-9af9-f40219daf28d/runs/2c2d5b27-4653-46a3-8eff-d98e27dd80ea/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/jfr-files/external/ks_external16232410193570851567.jfr'
]
 exitValue = 1

 stdout: [];
 stderr: [jfr summary: file is empty '/System/Volumes/Data/mesos/work_dir/slaves/0103b69c-746c-4fb5-bf13-94918f380124-S1706/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/d75cd46a-29a3-442a-9af9-f40219daf28d/runs/2c2d5b27-4653-46a3-8eff-d98e27dd80ea/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/jfr-files/external/ks_external8609090402161680561.jfr'
]
 exitValue = 1

Unexpected exception Connection reset during communication. Check process module status.
[Thu Oct 09 15:23:23 GMT 2025] (1760023403052) Picker module is about to shutdown
Picker module expected time before shutdown for: Jcmd: 30s
Picker module expected time before shutdown for: Jfr: 3m 20s
Picker module expected time before shutdown for: Jstat: 5m
Picker module expected time before shutdown for: Monitor: 1m
Picker module expected time before shutdown for: NMT: 1m
Picker module expected time before shutdown for: Perfmon: 5m
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: Jfr
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: Jcmd
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: Jstat
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: Perfmon
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: NMT
Picker module finished at [Thu Oct 09 15:23:23 GMT 2025]: Monitor
[Thu Oct 09 15:23:23 GMT 2025] (1760023403056) Picker module has been shutdown
[Thu Oct 09 15:23:23 GMT 2025] (1760023403056) Stress process is about to shutdown
Going to request to stop or kill stress process: 62376
Stress process: 62376 is still alive.
Unexpected exception sending stop message to ProcessStopper.
java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.connect0(Native Method)
	at java.base/sun.nio.ch.Net.connect(Net.java:546)
	at java.base/sun.nio.ch.Net.connect(Net.java:535)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:585)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:284)
	at java.base/java.net.Socket.connect(Socket.java:666)
	at java.base/java.net.Socket.connect(Socket.java:597)
	at java.base/java.net.Socket.<init>(Socket.java:464)
	at java.base/java.net.Socket.<init>(Socket.java:276)
	at applications.kitchensink.utils.ProcessStopper.stopProcess(ProcessStopper.java:146)
	at applications.kitchensink.process.glue.Main.execute(Main.java:371)
	at applications.kitchensink.process.glue.Main.main(Main.java:219)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1474)
ERROR: Failure in shutdown connection to the process.62376Will kill it.
----------rerun:(40/10402)*----------

<snip>

result: Error. Program `/System/Volumes/Data/mesos/work_dir/jib-master/install/jdk-26+19-1960/macosx-aarch64-debug.jdk/jdk-26/fastdebug/bin/java' timed out (timeout set to 3600000ms, elapsed time including timeout handling was 3857844ms).

Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/28779 Date: 2025-12-12 04:20:11 +0000
15-12-2025

[~eosterlund] there are two problems being discussed here. The more general problem of debuggers running concurrently with JNI critical sections is something for which we can file an RFE to look into. It is a "day one" problem that, as far as I know, has never been flagged before. The actual deadlock at hand could be solved in numerous ways of varying complexity, including a complete re-design of how JNI critical interacts with the GC and "locks things out". I don't think "making copies" is a viable solution as it negates the whole point of the critical API. I also don't know if using copies only when debugging is viable either, given the dynamic nature of debugging. Modifying the transition for just these JNI critical release operations to not check for object_deopt_suspend (by avoiding the special runtime exit condition check) seems like a pragmatic and simple solution with very limited disruption.

EDIT: my history comments were incorrect as I missed the rename between GC_locker and GCLocker, and that we still use per-thread critical access counters.
12-12-2025

[~eosterlund] yes, "exit condition" is a misnomer (and somewhat conflated with the related check_special_condition_for_native_trans). As already stated above, the concern that a critical-section thread can race with the debugger is valid, but none of the activities involved are in any sense new, so this has always been the case.

> So it might be possible to remove the check when going from _thread_in_native to _thread_in_vm.

[~rrich] Thanks for chiming in. I was hoping for something a little more definitive than "might". I have removed the check and run a lot of testing with no issues spotted, but I can't honestly say the testing, in general, would be exercising these particular code paths. So I was looking for some solid theory on why this would be okay to elide, rather than just me hand waving and saying "can't see a reason not to". :)
12-12-2025

(This is similar to how the GC locker itself is implemented).
10-12-2025

Another example of a solution would be to use a per-thread counter to indicate the number of active critical sections and have suspension code use handshakes to wait/roll forward until there are no critical sections, before suspending. Since it is already documented that critical sections should be held for a short period of time, this strategy should work fine. It ensures we avoid suspending inside of a critical section and ensures the escape barrier code isn’t issued when unlocking a critical section (the deadlock).
10-12-2025

Writing documentation saying this works as expected certainly moves the problem away from us, and assigns blame to users instead. But I thought we usually document limitations to describe how an API can be used correctly and how users should think about that. This is more of a situation where we are saying there is no way of using this API correctly without introducing inconsistencies in debuggers. There is no clear action point for users other than "I guess if I use this API I will mess up debuggers", and the user can't really reason about whether that is okay or not. It doesn't help that a lot of code has already been written using this API without this bug documented. I'd much prefer that we fix the bug instead of documenting it; at least have fixing as plan A and documenting as plan B.
10-12-2025

> Would be good if we could keep it and just add an exception for the transitions to leave a jni critical section.

I will look into customizing the transition for that particular case. It may get a little messy though, or else duplicate code.
10-12-2025

[~rrich] I think it is an issue because the debugging code can't do anything to make it correct. Even if you suspend all threads so you can inspect/modify the array, the native thread can still mutate the array concurrently, even though logically it is suspended and should not (as Erik pointed out) be able to interact with any Java objects. But perhaps the solution for that one is simply to document it as another limitation of using the JNI critical access APIs.
10-12-2025

> > A debugger can suspend the thread while it is in native, in the critical
> > section. But the native code mutates the primitive array racingly while we
> > look at it through a debugger.
>
> That is definitely an issue, but where does the fault lie.

Is it really? Is there a section in the jvmti spec that says a thread suspended while in native code cannot proceed executing native code? There's no guarantee that it won't print "still running..." on the console, right? Likewise it's ok when native code changes an array if it was given a direct pointer into that array.
09-12-2025

> So I was looking for some solid theory on why this would be okay to elide,
> rather than just me hand waving and saying "can't see a reason not to". :)

You're right [~dholmes]. The established theory is: handshake the target and rely on the guarantee that it does not leave the safe state until notified (EscapeBarrier::resume_one(), resume_all()). Would be good if we could keep it and just add an exception for the transitions to leave a jni critical section. The deadlock is still possible though if the native code does not follow the spec, which says "Inside a critical region, native code must not call other JNI functions". IMHO that's ok, since the native code isn't compliant then and the situation arises only with an active agent.
09-12-2025

This sounds like a return to the "bad old days" when we had dual code paths all over the place: one for agent attached and one for not. Do we really want to go back to that? And can we even do that with dynamic attach and remote debugging?
09-12-2025

While it’s true that the critical native code ignoring suspension is not a new bug, I can’t help but think these things are related and we can fix both problems with a more robust solution. For example, say we return a native copy of the object instead when these agents are running around. Then there is no race with suspension (mutations are atomic in VM), and there is also no deadlock because we don’t have to lock out the GC when running escape barriers. When not running these agents we only have handshakes to worry about in the transition from native to vm when releasing the critical section, and they are not allowed to allocate. So no deadlock. Just a thought.
09-12-2025

It was not a suggestion of a solution as much as it was a suggestion that there might be solutions that fix both issues as they are to me seemingly related. Returning copies is an example of a solution that would fix both issues. Perhaps there are more such solutions. Just a thought really.
09-12-2025

> Perhaps I can attract [~rrich]'s attention and get his thoughts on this.

Thanks.

> I have been looking at the Escape Analysis work that introduced this (JDK-8227745) and it seems to me that checking for obj-deopt when returning to the VM from native code (during which a safepoint could have occurred) is necessary to ensure that the VM code is not going to interact with an oop that needs to be re-materialized.

I don't think there is an issue. Besides the execution paths with EscapeBarriers, the VM doesn't care so much if there are scalar-replaced objects in a compiled frame. This can be seen by looking at the callers of Deoptimization::realloc_objects(). The important thing here is that the stack of the thread entering the VM needs to remain walkable for the JVMTI agent. E.g. calling a Java method from JNI should block the thread in JavaThread::wait_for_object_deoptimization(). ThreadStateTransition::transition_from_vm() checks for it. So it might be possible to remove the check when going from _thread_in_native to _thread_in_vm.
08-12-2025

[~dholmes] I thought it's strange that something called "special runtime exit condition" is checked at the runtime _entry_ rather than exit. I suppose there is a good reason why: this is really supposed to guard execution of "bytecode equivalent" operations. Most of the time, that's actual bytecodes (hence runtime exit). But when, for example, executing native code, the "bytecode equivalent operation" may be modifying the JVM state through a downcall into the JVM runtime, which *often* but notably not always occurs when there is a transition from native to VM. For this to work well, a hook on the VM entry is inserted, because it's in VM that we perform the bytecode equivalent operation that mutates the state.

What concerns me more than the deadlock is that in the case of these critical APIs, resolving the deadlock condition that we are staring at here is really not sufficient. Because the entire critical operation is a "bytecode equivalent" operation. The racing native code has access to the content of an array and is able to mutate its state, concurrently with JVMTI suspension. This means that a debugger that reads the state of the exposed "pinned" object with JVMTI will be able to observe inconsistent state of the pinned object, because the thread is not suspended between the critical enter/exit methods. It's in native, and what bytecode equivalent operation could it possibly perform while in native, as opposed to calling into the JVM? Well, with pinned primitive arrays, they actually can do such things.

In summary, checking for special runtime *exit* conditions at the runtime *entry* seems like a way of catching 99% of the bytecode equivalent JNI functions, which plus minus this deadlock seems fine. But even if we fix the deadlock problem itself, the approximation of where we have bytecode equivalent operations is still not a good match for JNI critical heap accessing APIs, as the entire critical section is really a "bytecode equivalent" operation.
08-12-2025

> IMO it's because we check for special runtime exit when transitioning from
> native to VM. That on its own is dubious, but is why we end up replying to
> the escape barrier.

[~eosterlund] why do you think this check is dubious? I have been looking at the Escape Analysis work that introduced this (JDK-8227745) and it seems to me that checking for obj-deopt when returning to the VM from native code (during which a safepoint could have occurred) is necessary to ensure that the VM code is not going to interact with an oop that needs to be re-materialized. It may not be necessary in relation to this particular ReleasePrimitiveArrayCritical, but in the general case? Perhaps I can attract [~rrich]'s attention and get his thoughts on this. FWIW even with the old deopt_suspend we always checked on return to the VM.
08-12-2025

ILW = HLM = P3, bordering P2.
08-12-2025

Seems such a complex solution is not needed. [~pchilanomate] pointed out the EscapeBarrier handshake itself does not cause the thread to suspend, it only helps to "install" the need to suspend so that the next time the threads checks the deopt flag it will suspend itself. But that check is only done through `handle_runtime_special_exit_condition`. So we can either try eliding that for the native->VM case in general, or else specialize things so it can be elided only for these ReleaseXXXCritical JNI functions.
04-12-2025

I am experimenting with making EscapeBarrierSuspendHandshakeClosure a "suspend" operation and defining a JNI_ENTRY_NO_SUSPEND for use with ReleaseXXXCritical. In addition we only check special runtime exit condition for native->Java, not native->VM. One question to answer though is whether disallowing this handshake in places where "suspension" is already disallowed, will affect the correct use of this feature.
03-12-2025

> IMO it's because we check for special runtime exit when transitioning from native to VM

EDIT: if you saw my original comment, I was mistakenly looking at modified code.

Even if we do not check the "special runtime exit condition" when transitioning from native to VM, we still have the same problem with the base safepoint mechanism: we poll to see if there is a safepoint or handshake requested, and in this case the EscapeBarrierSuspendHandshakeClosure has been requested. This handshake operation is not defined as `is_suspend` (which it seems may be reserved for JVM TI suspension) and so we cannot avoid it just by adjusting `allow_suspend`. That only leaves skipping the poll completely, which we cannot do as we must block for a safepoint if re-entering the VM. So somehow we need finer-grained control over processing this particular handshake operation.

Alternatively, flipping the problem around: can we adapt Universe::heap()->unpin_object so that it can be executed whilst still in native? That would need to be possible for all GCs.
03-12-2025

Looking at the original failure again. The log suggests process 62376 is the problem:

```
Going to request to stop or kill stress process: 62376
Stress process: 62376 is still alive.
Unexpected exception sending stop message to ProcessStopper.
java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.connect0(Native Method)
	at java.base/sun.nio.ch.Net.connect(Net.java:546)
	at java.base/sun.nio.ch.Net.connect(Net.java:535)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:585)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:284)
	at java.base/java.net.Socket.connect(Socket.java:666)
	at java.base/java.net.Socket.connect(Socket.java:597)
	at java.base/java.net.Socket.<init>(Socket.java:464)
	at java.base/java.net.Socket.<init>(Socket.java:276)
	at applications.kitchensink.utils.ProcessStopper.stopProcess(ProcessStopper.java:146)
	at applications.kitchensink.process.glue.Main.execute(Main.java:371)
	at applications.kitchensink.process.glue.Main.main(Main.java:219)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.sun.javatest.regtest.agent.MainWrapper$MainTask.run(MainWrapper.java:138)
	at java.base/java.lang.Thread.run(Thread.java:1474)
ERROR: Failure in shutdown connection to the process.62376Will kill it.
```

but there is nothing in the test artifacts specifically referring to that process. After downloading the workdir I found Thread.print.1760019812636.out, which is the jcmd thread dump of process 62376, but it doesn't show anything untoward - just all of the various stress tests running. The JniStressModule.out file is empty, which may suggest an issue with that module; as too is the JvmtiStressModuleNative.out.

The stack dump shows:

```
"JniStressModule" #37 [36099] prio=5 os_prio=31 cpu=8153.97ms elapsed=33.03s tid=0x000000013f83e810 nid=36099 runnable [0x0000000171baa000]
   java.lang.Thread.State: RUNNABLE
   Thread: 0x000000013f83e810 [0x8d03] State: _at_safepoint _at_poll_safepoint 0
   JavaThread state: _thread_blocked
	at applications.kitchensink.process.stress.modules.JniStressModule.newString(Native Method)
	at applications.kitchensink.process.stress.modules.JniStressModule.runOneIteration(JniStressModule.java:124)
	at applications.kitchensink.process.stress.modules.JniStressModule.execute(Unknown Source)
	at applications.kitchensink.process.stress.modules.StressModule.run(Unknown Source)
	at java.lang.Thread.runWith(java.base@26-ea/Unknown Source)
	at java.lang.Thread.run(java.base@26-ea/Unknown Source)
```

where newString does:

```c
JNIEXPORT jboolean JNICALL
Java_applications_kitchensink_process_stress_modules_JniStressModule_newString(JNIEnv *env, jobject this, jstring string) {
  static jsize s_len;
  if (s_chars == NULL) {
    s_len = (*env)->GetStringLength(env, string);
    s_chars = (jchar*)malloc(sizeof(jchar) * s_len);
    (*env)->GetStringRegion(env, string, 0, s_len, s_chars);
  }
  (*env)->NewString(env, s_chars, s_len);
  return JNI_FALSE;
}
```

So no direct interaction with JNI critical operations or the GCLocker.

The stack dump also shows:

```
"JvmtiStressModule" #39 [38659] prio=5 os_prio=31 cpu=16.24ms elapsed=33.03s tid=0x000000013f857210 nid=38659 waiting on condition [0x0000000171fc2000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
   Thread: 0x000000013f857210 [0x9703] State: _at_safepoint _at_poll_safepoint 0
   JavaThread state: _thread_blocked
	at java.lang.Thread.sleepNanos0(java.base@26-ea/Native Method)
	at java.lang.Thread.sleepNanos(java.base@26-ea/Thread.java:509)
	at java.lang.Thread.sleep(java.base@26-ea/Thread.java:540)
	at applications.kitchensink.process.stress.modules.JvmtiStressModule.execute(Unknown Source)
	at applications.kitchensink.process.stress.modules.StressModule.run(Unknown Source)
	at java.lang.Thread.runWith(java.base@26-ea/Unknown Source)
	at java.lang.Thread.run(java.base@26-ea/Unknown Source)
```

which indicates that thread is waiting for the 2 JvmtiWorkerThreads of the test to complete, both of which have a stack like this:

```
"applications.kitchensink.process.stress.modules.JvmtiWorkerThread-1" #56 [62467] prio=5 os_prio=31 cpu=765.34ms elapsed=32.95s tid=0x000000012f048810 nid=62467 runnable [0x00000001746a5000]
   java.lang.Thread.State: RUNNABLE
   Thread: 0x000000012f048810 [0xf403] State: _at_safepoint _at_poll_safepoint 0
   JavaThread state: _thread_blocked
	at java.lang.Thread.sleepNanos0(java.base@26-ea/Native Method)
	at java.lang.Thread.sleepNanos(java.base@26-ea/Thread.java:509)
	at java.lang.Thread.sleep(java.base@26-ea/Thread.java:540)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:105)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.makeStack(JvmtiStressModule.java:109)
	at applications.kitchensink.process.stress.modules.JvmtiWorkerThread.run(JvmtiStressModule.java:48)
```

which is a little odd: they are both stuck doing a sleep(1) before making the final iterative call that would have terminated them.

The only other curiosity is in the dmesg output:

```
[1461027.435013]: java[62357] Corpse allowed 1 of 5
[1461027.853161]: Corpse released, count at 0
```

where 62357 is the main test process. But the cores.html lldb output for that shows nothing out of the ordinary. So we really have zero idea about what may have happened for the initial failure reported on macOS.
24-10-2025

In any case this particular deadlock has nothing to do with JVMTI suspension. To go back to Erik's first comment:

> 1) The immediate deadlock we see. IMO it's because we check for special runtime exit when transitioning from native to VM. That on its own is dubious, but is why we end up replying to the escape barrier.

The only thing we have as a "special runtime exit" condition now is:

```c++
void JavaThread::handle_special_runtime_exit_condition() {
  if (is_obj_deopt_suspend()) {
    frame_anchor()->make_walkable();
    wait_for_object_deoptimization();
  }
}
```

but despite the name "runtime_exit_condition", which implies we do these checks only when exiting the VM-runtime to return to Java, we actually do them in a number of places, including the native-to-VM transition. And if we go back to Java 8, for example, the code is packaged differently but the native-to-VM transition still processes "deopt_suspend". So this deopt check when entering the VM has "always" been in place and is not directly the cause of the deadlock.

To reiterate the deadlock as Stefan described above:
- Thread A has obtained the GCLocker lock as part of a JNI critical access and is attempting to release it. But it blocks on the transition back into the VM before it can do so, because deopt_suspend is active.
- deopt_suspend is active because thread B is performing a JVM TI IterateThroughHeap operation, but it reaches a point where it needs to allocate (deep in the EscapeBarrier code) and that blocks trying to get the GCLocker lock, held by thread A.
24-10-2025

> 2) What's perhaps even more dubious is that in an array critical section, we hand out a direct pointer to a type array.

But that is kind of the whole point of these "critical" functions: to get direct access to the underlying arrays without incurring copying overhead.

> A debugger can suspend the thread while it is in native, in the critical section. But the native code mutates the primitive array racingly while we look at it through a debugger.

That is definitely an issue, but where does the fault lie? My thought is we should not consider a thread suspended if it is in native but has a "critical access" active. But that means we have to track when "critical access" is active ... which we used to do when we had the GC_locker framework. But AFAICS JVMTI suspension has never tried to deal with this problem. Maybe in the GC_locker days the code the debugger would execute to access the array was checking for the GC_locker being active?
13-10-2025

Given that there are two failures with different information we might have to split this into two. Or we guess that the macos failure is caused by the same issue.
10-10-2025

It's unclear if this should be under the runtime or jvmti subcomponent. I've placed it in runtime for now.
10-10-2025

Comment from [~eosterlund]: I think there are two separate but related issues here:

1) The immediate deadlock we see. IMO it's because we check for special runtime exit when transitioning from native to VM. That on its own is dubious, but is why we end up replying to the escape barrier.

2) What's perhaps even more dubious is that in an array critical section, we hand out a direct pointer to a type array. A debugger can suspend the thread while it is in native, in the critical section. But the native code mutates the primitive array racingly while we look at it through a debugger.

It seems to me that the critical section is a "bytecode equivalent", but is not treated as such. I think that is the cause for what seems to me like two orthogonal but related bugs. If we could hand out a copy instead when racing with JVMTI code, then the GCs would have nothing to do with this code and we wouldn't have the deadlock. We also would have atomic updates, and a suspended thread in a critical section wouldn't racingly update the primitive array. This would be quite nice, I think. But I don't have a proposition for how to detect that we are racing with this JVMTI code.
10-10-2025

I'm not sure that the two listed failures are the same.

1) The first one happens on macOS; the stress process seems to have been killed and we don't have any stack traces available.

2) The second one is the clear deadlock between the JVMTI escape barrier code and JNI critical.
10-10-2025

Comment moved from JDK-8367650. The following is a comment for the JDK 26 failure:

There seems to be a deadlock here. Thread 23 is inside a JNI critical region and therefore holds the GCLocker hostage; it is trying to leave the region via ReleasePrimitiveArrayCritical, but the thread-state transition blocks in wait_for_object_deoptimization(). Thread 22 is performing that object deoptimization (an EscapeBarrier triggered from JvmtiTagMap::iterate_through_heap) and needs a heap allocation, but the allocation's GC operation blocks on the GCLocker held by Thread 23. Neither thread can make progress.

Thread 23 (Thread 0x7fa9fb8f8700 (LWP 3443653)):
#0  0x00007faa5e6a23d1 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007faa5d93371c in PlatformMonitor::wait(unsigned long) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#2  0x00007faa5d8717a9 in Monitor::wait_without_safepoint_check(unsigned long) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#3  0x00007faa5d243dfc in JavaThread::wait_for_object_deoptimization() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#4  0x00007faa5cb8e63b in ThreadInVMfromNative::ThreadInVMfromNative(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#5  0x00007faa5d36bb5d in jni_ReleasePrimitiveArrayCritical () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#6  0x00007faa5f1496ee in Java_java_util_zip_Deflater_deflateBytesBytes () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/libzip.so
#7  0x00007faa44105f63 in ?? ()
#8  0x00007fa9fb8f7590 in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 22 (Thread 0x7fa9fb9f9700 (LWP 3443652)):
#0  0x00007faa5edd8238 in nanosleep () from /lib64/libc.so.6
#1  0x00007faa5d930907 in os::naked_short_nanosleep(long) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#2  0x00007faa5dafe737 in SpinYield::yield_or_sleep() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#3  0x00007faa5d0c8538 in GCLocker::block() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#4  0x00007faa5d0df538 in VM_GC_Operation::doit_prologue() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#5  0x00007faa5ddd6ff8 in VMThread::execute(VM_Operation*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#6  0x00007faa5dab8e81 in SerialHeap::mem_allocate_work(unsigned long, bool) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#7  0x00007faa5d7b91db in MemAllocator::mem_allocate(MemAllocator::Allocation&) const () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#8  0x00007faa5d7b930f in MemAllocator::allocate() const () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#9  0x00007faa5d19cfa2 in InstanceKlass::allocate_instance(JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#10 0x00007faa5cddf5d4 in Deoptimization::realloc_objects(JavaThread*, frame*, RegisterMap*, GrowableArray<ScopeValue*>*, JavaThread*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#11 0x00007faa5cde818f in rematerialize_objects(JavaThread*, int, nmethod*, frame&, RegisterMap&, GrowableArray<compiledVFrame*>*, bool&) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#12 0x00007faa5cde948e in Deoptimization::deoptimize_objects_internal(JavaThread*, GrowableArray<compiledVFrame*>*, bool&) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#13 0x00007faa5cef9dfd in EscapeBarrier::deoptimize_objects_internal(JavaThread*, long*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#14 0x00007faa5cefb36c in EscapeBarrier::deoptimize_objects_all_threads() () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#15 0x00007faa5d5cd891 in JvmtiTagMap::iterate_through_heap(int, Klass*, jvmtiHeapCallbacks const*, void const*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#16 0x00007faa5d5650ab in JvmtiEnv::IterateThroughHeap(int, _jclass*, jvmtiHeapCallbacks const*, void const*) () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#17 0x00007faa5d4fa5cf in jvmti_IterateThroughHeap () from /opt/mach5/mesos/work_dir/jib-master/install/jdk-26-cpu+1-33/linux-x64-debug.jdk/jdk-26/fastdebug/lib/server/libjvm.so
#18 0x00007faa5f18b597 in get_heap_info (klass=<optimized out>, env=0x7faa54aaf888) at /opt/mach5/mesos/work_dir/slaves/9a8d6c29-9c09-4644-b4fc-a39135134048-S1889/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/4a4e333e-960b-4ebf-b78d-1295f953066d/runs/6a927348-79c1-41f6-9910-5b97a68e9467/workspace/closed/test/hotspot/jtreg/applications/kitchensink/process/stress/modules/util.c:73
#19 0x00007faa5f18e02d in Java_applications_kitchensink_process_stress_modules_JvmtiStressModule_finishIteration (env=0x7faa54aaf888, this=<optimized out>) at /opt/mach5/mesos/work_dir/slaves/9a8d6c29-9c09-4644-b4fc-a39135134048-S1889/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/4a4e333e-960b-4ebf-b78d-1295f953066d/runs/6a927348-79c1-41f6-9910-5b97a68e9467/workspace/closed/test/hotspot/jtreg/applications/kitchensink/process/stress/modules/libJvmtiStressModule.c:938
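For context on the Thread 23 side of the cycle: `Deflater.deflate(byte[])` on byte arrays dispatches to the native `Java_java_util_zip_Deflater_deflateBytesBytes`, which (as the trace shows) brackets the arrays with `GetPrimitiveArrayCritical`/`ReleasePrimitiveArrayCritical`, keeping the GCLocker active while the critical region is held. A minimal sketch of the Java-level code path that reaches this critical region (not a reproducer for the race itself; class name is ours):

```java
import java.util.zip.Deflater;

public class DeflateCritical {
    public static void main(String[] args) {
        // deflate(byte[]) on heap byte arrays goes through the native
        // deflateBytesBytes, which pins input/output via
        // GetPrimitiveArrayCritical and unpins via
        // ReleasePrimitiveArrayCritical; the GCLocker is held in between.
        byte[] input = "hello hello hello".getBytes();
        Deflater def = new Deflater();
        def.setInput(input);
        def.finish();
        byte[] out = new byte[128];
        int n = def.deflate(out); // enters and exits the JNI critical region
        def.end();
        System.out.println(n > 0);
    }
}
```

The deadlock window is in the exit path: `ReleasePrimitiveArrayCritical` performs a `ThreadInVMfromNative` transition, and that transition can block in `wait_for_object_deoptimization()` before the critical region (and hence the GCLocker) is released.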
10-10-2025