JDK-8327743 : JVM crash in hotspot/share/runtime/javaThread.cpp - failed: held monitor count should be equal to jni: 0 != 1
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-03-11
  • Updated: 2024-04-26
  • Resolved: 2024-04-16
Fixed in: JDK 23 b19
Description
I've been working on a fix for a jtreg issue, https://bugs.openjdk.org/browse/CODETOOLS-7903621, which incorrectly marks failing tests as PASSED in certain cases (like a VM crash). Running the JDK tests with a proposed fix for that issue has uncovered that the test/hotspot/jtreg/serviceability/jvmti/GetOwnedMonitorInfo/GetOwnedMonitorInfoTest.java test crashes the JVM in hotspot/share/runtime/javaThread.cpp. The test, however, is reported as PASSED, which hides the issue.

With the fixed jtreg, this is now properly reported as a test error. The JVM crashes as follows:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/src/hotspot/share/runtime/javaThread.cpp:880), pid=3066994, tid=3067123
#  assert(held_monitor_count() == jni_monitor_count()) failed: held monitor count should be equal to jni: 0 != 1
#
# JRE version: Java(TM) SE Runtime Environment (23.0) (fastdebug build 23-internal-2024-03-10-1335402.foo.jdk)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-03-10-1335402.foo.jdk, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xec54d0]  JavaThread::exit(bool, JavaThread::ExitType)+0x880
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_serviceability/scratch/0/core.3066994)
#
# An error report file with more information is saved as:
# /testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_serviceability/scratch/0/hs_err_pid3066994.log
[thread 3066997 also had an error]
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/523 Date: 2024-04-23 07:34:36 +0000
23-04-2024

Changeset: 274c805c Author: David Holmes <dholmes@openjdk.org> Date: 2024-04-16 00:48:15 +0000 URL: https://git.openjdk.org/jdk/commit/274c805c5137d9080a7670d864ecca8a0befc3f6
16-04-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/18445 Date: 2024-03-22 06:26:03 +0000
08-04-2024

[~dholmes] It should be similar to the SharedRuntime::rc_trace_method_entry() call from SharedRuntime::generate_native_wrapper(). Except it will be simpler since there are no registers to save. All registers we care about will be saved across the VM call. Even rax (or its equivalent on other platforms) doesn't need to be saved since all this would be done when returning from Continuation.enterSpecial, and that method doesn't return anything. Although I see that we could also be unwinding Continuation.enterSpecial due to an exception ("--- Exception handling path" in gen_continuation_enter()), in which case we would have to save rax. Maybe we could just always save it, to avoid having to distinguish these two cases.
20-03-2024

[~pchilanomate] yes that is doable but somewhat painful. Do we already have any support for calling out to a logging function from generated code?

[~alanb] No, the only place in the spec where this is mentioned relates to JNI_DetachCurrentThread. So it is only on detach that we call ObjectSynchronizer::release_monitors_owned_by_thread, not regular termination.
20-03-2024

[~dholmes] Aside from JNI attached threads, is the JNI spec silent on the behavior for broken code that allows a thread to terminate that has done a JNI MonitorEnter without a MonitorExit?
20-03-2024

[~dholmes] But in this case the warning should be done when the virtual thread exits, since that's the actual owner. I guess if we want to keep the same behavior as platform threads we can add code in continuation_enter_cleanup() so that if CheckJNICalls && jni_monitor_count() > 0 we call a VM_leaf method to do the logging.
20-03-2024

My fix has the side-effect of not triggering this warning:

    if (CheckJNICalls && jni_monitor_count() > 0) {
      // We would like a fatal here, but due to we never checked this before there
      // is a lot of tests which breaks, even with an error log.
      log_debug(jni)("JavaThread %s (tid: " UINTX_FORMAT ") with Objects still locked by JNI MonitorEnter.",
                     exit_type == JavaThread::normal_exit ? "exiting" : "detaching",
                     os::current_thread_id());
    }

as the counts are both zero for the carrier thread.
20-03-2024

Back to the crash. The scenario where a virtual thread exits while holding a monitor obtained with JNI MonitorEnter (with no JNI MonitorExit) is a real corner case. It seems the only sane way to handle this is to do the equivalent of ObjectSynchronizer::release_monitors_owned_by_thread when a continuation for a virtual thread completes with JNI monitors held.
20-03-2024

Thanks [~alanb] I'll report the issue to Doug in that bug report - I don't think this is working as intended.
20-03-2024

Okay so the mystery is solved. I reverted my changes to allow the original assertion to fire at thread termination. In doing so I discovered that the assert can fail during VM shutdown _after_ jtreg has already decided that the test passed! So all of those "passing" tests I have seen may not in fact have been passing at all!

EDIT: Of course this should have been readily apparent given the bug description, as that is how [~jpai] discovered the problem.

Confirming what I wrote above, the additional assertion I put in place to detect the mismatch earlier was in the slow-path inc/dec code in javaThread.cpp, and it seems that without Xcomp I never take those slow paths when the mismatch exists and so never trigger the new assert.

The remaining mystery was why the pool thread was terminating, but then I discovered:

    /**
     * Returns the common pool instance. This pool is statically
     * constructed; its run state is unaffected by attempts to {@link
     * #shutdown} or {@link #shutdownNow}. However this pool and any
     * ongoing processing are automatically terminated upon program
     * {@link System#exit}.

so, if our exit is slow enough (such as when running in Xcomp mode) and the pool worker thread actually gets a chance to terminate before the VM process exits, then it will crash the VM. Otherwise the VM appears to terminate normally.

That said, [~pchilanomate] wrote:

> The issue reproduces easily by adding Thread.sleep() after t1.join(). With 1s always crashes for me.

and it does. But why? Putting a sleep in the main thread delays the VM shutdown, which delays the pool termination and so delays the pool thread termination. It should not have any effect on the crash! ???
20-03-2024

The trimming you see now is new since JDK 22 and JDK-8319662.
19-03-2024

Okay mystery solved. It seems "keep-alive" does not mean what we might (naively) think it means:

    * @param keepAliveTime the elapsed time since last use before
    * a thread is terminated (and then later replaced if needed).

Sounds like a worker will wait up to 30 seconds with a 30 second keep-alive, but no. We then find this:

    * Trimming workers. To release resources after periods of lack of
    * use, a worker starting to wait when the pool is quiescent will
    * time out and terminate if the pool has remained quiescent for
    * period given by field keepAlive (default 60sec), which applies
    * to the first timeout of a fully populated pool. Subsequent (or
    * other) cases use delays such that, if still quiescent, all will
    * be released before one additional keepAlive unit elapses.

which is a bit vague about the "Subsequent" part. But then in the code we have something even more strange:

    if (quiet) {                           // use timeout if trimmable
        int nt = (short)(qc >>> TC_SHIFT);
        long delay = keepAlive;            // scale if not at target
        if (nt != (nt = Math.max(nt, parallelism)) && nt > 0)
            delay = Math.max(TIMEOUT_SLOP, delay / nt);
        if ((deadline = delay + System.currentTimeMillis()) == 0L)
            deadline = 1L;                 // avoid zero
    }

This has the effect of dividing the keep-alive by the parallelism value - which on my system is 30s/16 -> 1875ms. So if the pool is quiescent for 1875ms the worker thread terminates and we hit the assert.
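For reference, the scaling arithmetic described above can be mirrored in a small stand-alone sketch. This is not the ForkJoinPool source; the class and method names and the 20 ms slop value are assumptions made purely for illustration, and only the divide-by-worker-count logic is taken from the snippet quoted above.

    // Stand-alone sketch of the quiescent-trim delay computation quoted above.
    // Not ForkJoinPool code: trimDelay and the slop value are illustrative only.
    public class TrimDelaySketch {
        static long trimDelay(long keepAliveMillis, int totalWorkers, int parallelism,
                              long timeoutSlopMillis) {
            long delay = keepAliveMillis;
            int nt = totalWorkers;
            // "scale if not at target": below the parallelism target the keep-alive
            // is divided by the clamped worker count.
            if (nt != (nt = Math.max(nt, parallelism)) && nt > 0)
                delay = Math.max(timeoutSlopMillis, delay / nt);
            return delay;
        }

        public static void main(String[] args) {
            // 30 s keep-alive with parallelism 16 -> prints 1875 (ms), matching the
            // 30s/16 figure in the comment above.
            System.out.println(trimDelay(30_000, 1, 16, 20));
        }
    }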
19-03-2024

I thought virtual threads were using the common FJP. If they have their own then I need to check how it operates. EDIT: okay, so a 30 second keep-alive as Patricio stated; nothing else seems special about it, and I also cannot find anything that indicates that this pool gets shut down when the VM exits.
19-03-2024

Virtual threads have their own FJP instance. It's configured FIFO (or async mode in FJP speak) so different to the common pool which is LIFO. As Patricio says, the keep alive timeout is 30s but that only comes into effect when the pool is at quiescence. There can be cases where a FJP worker terminates although that is hard to demonstrate.
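As an illustration of the configuration described above, a FIFO ForkJoinPool with a 30 s keep-alive can be constructed roughly as follows. This is only a sketch, not the actual VirtualThread.createDefaultScheduler code; the class name and every constructor argument other than asyncMode and keepAliveTime are illustrative choices.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.TimeUnit;

    public class FifoPoolSketch {
        public static void main(String[] args) {
            int parallelism = Runtime.getRuntime().availableProcessors();
            ForkJoinPool pool = new ForkJoinPool(
                    parallelism,
                    ForkJoinPool.defaultForkJoinWorkerThreadFactory,
                    null,                   // no uncaught-exception handler
                    true,                   // asyncMode = true -> FIFO, as described above
                    parallelism,            // corePoolSize (illustrative)
                    parallelism,            // maximumPoolSize (illustrative)
                    1,                      // minimumRunnable (illustrative)
                    null,                   // saturate predicate (illustrative)
                    30, TimeUnit.SECONDS);  // 30 s keep-alive, as described above
            System.out.println(pool);
            pool.shutdown();
        }
    }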
19-03-2024

I was also expecting that we would have to sleep for more than the keepAliveTime to see the crash (keepAliveTime is actually 30s for the FJP used by virtual threads, we set it in VirtualThread.createDefaultScheduler). But then I added the Thread.sleep() and saw the crash was happening way before that. So it seems the FJP workers don't necessarily have to wait keepAliveTime before terminating.
19-03-2024

I assume the Thread.sleep gives time for the keep-alive to kick in and the carrier to terminate. I think this will have to be handled when the continuation finishes, like the equivalent of a platform thread exiting while holding a monitor.
19-03-2024

[~pchilanomate] Thanks Patricio, that confirms my understanding. I have a fix in place that also updates the JNI count the same way as the held count is done. I'm still unclear about the reproducibility though. The original failure above is an assert failure during termination of the FJP worker thread - but I'm not clear why it would be terminating, as I don't believe the default pool gets shut down when the VM exits. I added a new assertion to catch the mismatch earlier, at the point where we update the held and JNI counts. Yet somehow that only triggers in Xcomp mode. But that code is only executed on slow-path monitor enter/exit (or all JNI monitor enter/exit). So perhaps, somewhat ironically, I have to force compilation to actually execute a slow path that will hit my new assert. But I also do not hit the thread termination assert either.
19-03-2024

I took a look at this. The problem is that we are not considering the _jni_monitor_count counter when a vthread is mounted/unmounted, only _held_monitor_count. Normally this is not an issue, since we won't unmount a vthread with JNI monitors held, nor will the carrier be holding JNI monitors when mounting a vthread. But if the vthread actually exits while holding one, _held_monitor_count will be cleared but _jni_monitor_count will not (continuation_enter_cleanup()), so when the carrier exits we will hit the assert. The issue reproduces easily by adding Thread.sleep() after t1.join(). With 1s it always crashes for me.
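A minimal sketch of that reproduction recipe, assuming the shape of GetOwnedMonitorInfoTest: the JNI MonitorEnter in the real test happens in native code and is only indicated by a comment here, so this skeleton will not crash on its own; it just shows where the extra sleep goes.

    // Sketch of the reproduction recipe described above (not the actual test).
    public class ReproSketch {
        public static void main(String[] args) throws InterruptedException {
            Thread t1 = Thread.ofVirtual().start(() -> {
                // In the real test the virtual thread calls into JNI and does a
                // MonitorEnter with no matching MonitorExit before it exits.
            });
            t1.join();
            // The extra sleep keeps the VM alive long enough for the now-quiescent
            // FJP carrier worker to terminate, which is when the assert fires.
            Thread.sleep(1000);
        }
    }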
19-03-2024

[~alanb] the keep-alive is 60 seconds.
19-03-2024

[~kvn] AFAICS there is no concurrency issue here at all. My understanding of the issue is that we mount virtual thread 1, which performs a JNI monitor lock, then a regular monitor lock/unlock, leaving both counts at 1, as they should be. Then either during unmount of the terminated virtual thread 1, or during the initial mount of virtual thread 2, we explicitly set the held count to zero (leaving the JNI count at 1). So the next time we perform any locking we trip over the assertion that the lock counts are out of sync. Though I wonder whether it depends on the next monitor being inflated or not? I'll look further into that aspect. The monitor that triggers the assert may not be one that is obvious from the test code.
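To make the sequence above concrete, here is a toy Java model of the two counters. It is not HotSpot code; the class and method names are invented, and it only illustrates how clearing the held count while leaving the JNI count untouched ends in the "0 != 1" assert seen in the crash.

    // Toy model of the carrier-thread bookkeeping described above (not HotSpot code).
    public class CounterModel {
        long heldMonitorCount;   // models JavaThread::_held_monitor_count
        long jniMonitorCount;    // models JavaThread::_jni_monitor_count

        void jniMonitorEnter()      { heldMonitorCount++; jniMonitorCount++; }
        void monitorEnter()         { heldMonitorCount++; }
        void monitorExit()          { heldMonitorCount--; }
        // Models the unmount/cleanup path that resets only the held count.
        void continuationCleanup()  { heldMonitorCount = 0; }

        void carrierExit() {
            if (heldMonitorCount != jniMonitorCount)
                throw new AssertionError("held monitor count should be equal to jni: "
                        + heldMonitorCount + " != " + jniMonitorCount);
        }

        public static void main(String[] args) {
            CounterModel carrier = new CounterModel();
            carrier.jniMonitorEnter();       // virtual thread does JNI MonitorEnter
            carrier.monitorEnter();          // regular monitor lock ...
            carrier.monitorExit();           // ... and unlock: both counts now 1
            carrier.continuationCleanup();   // vthread exits: held cleared, JNI not
            carrier.carrierExit();           // -> 0 != 1, matching the crash above
        }
    }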
18-03-2024

> Though in that case why does it only occur when using Xcomp and C2? Some kind of concurrency?

-Xbatch will stop executing the Java code of a method while C2 is compiling it, and C2 may take time to compile it. It could also be that some critical method is compiled and one of the threads that uses it executes faster than the other threads.
18-03-2024

This appears to be a virtual threads bug. If I am reading the code correctly when we mount a new VT we will set the carrier JavaThread's held_monitor_count to zero, without touching the JNI count. Hence they get out of sync. Though in that case why does it only occur when using Xcomp and C2?
18-03-2024

[~syan] yes we have already established that - see above. Thanks.
13-03-2024

A fastdebug build with JVM options "-Xcomp -XX:TieredStopAtLevel=4" can reproduce this failure.
13-03-2024

ILW=HLM=P3
12-03-2024

[~dholmes] please always run with `-XX:-DoEscapeAnalysis -XX:-EliminateLocks -XX:-EliminateNestedLocks` flags to make sure C2 doesn't remove any lock/unlock.
12-03-2024

I modified the C2 fast_lock and fast_unlock code to go straight to the slow path, but the crash still happens. From the logging I can tell that there are still some unlogged fast-path locks/unlocks happening.
12-03-2024

The test passes with lockingMode=2 - lightweight locking.
12-03-2024

[~kvn] my understanding of that code is that if we jump over the count update then we are taking the slow path which updates the count itself. But I haven't examined that slow-path.
12-03-2024

[~dholmes] Is it possible that we don't decrement `held_monitor_count` when C2 calls `fast_unlock()` code? There is jump over the decrement instruction for some cases: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L921
12-03-2024

I confirmed David's observation that the test still fails even if all EA and lock optimizations are switched off: -XX:-EliminateLocks -XX:-DoEscapeAnalysis -XX:-EliminateNestedLocks
12-03-2024

I reproduced it on a lab machine running with jtreg. It failed even when I excluded compilation of all GetOwnedMonitorInfoTest methods: -XX:CompileCommand=exclude,GetOwnedMonitorInfoTest*::*
12-03-2024

I ran with LockingMode=0 (LM_MONITOR) but the test still failed. It was interesting to observe (via added logging) that in this mode we usually had slow locking but fast-path unlocking.
12-03-2024

I tried reverting JDK-8322743 but this still failed, so it seems this is an independent issue we had been unaware of.
12-03-2024

The count discrepancy indicates either an extra unlock, or a missing lock; however, the actual monitor operations seem fine and we have no IllegalMonitorStateException, as would be expected if either of those things happened. That suggests it may be the count management that is incorrect, but AFAICS the count is correctly incremented on lock and decremented on unlock.

The test only fails in the virtual thread configuration. This is because the same carrier thread will then get used to execute the two runs of the test logic: one including the JNI lock, and one excluding it. Hence on the second run the forkjoin-pool worker thread starts with a held count and a JNI count of 1. Sometime during that second execution of runTest we are missing an update to the held monitor count.

The test passes with -XX:TieredStopAtLevel=1,2,3 but fails with 4. It still fails with -XX:-DoEscapeAnalysis or -XX:-EliminateLocks.
12-03-2024

Looking ...
12-03-2024

This looks very similar to JDK-8322743. [~kvn] can you take a look at this please?
11-03-2024

Somehow the held_monitor_count and jni_monitor_count have gotten out of sync - they should both be 1 in this test.
11-03-2024

hs_err shows:

    ---------------  T H R E A D  ---------------
    Current thread (0x00007f8d4802a460):  JavaThread "ForkJoinPool-1-worker-1" daemon [_thread_in_vm, id=3067123, stack(0x00007f8dc5bfd000,0x00007f8dc5cfd000) (1024K)]
    Stack: [0x00007f8dc5bfd000,0x00007f8dc5cfd000],  sp=0x00007f8dc5cfbcd0,  free space=1019k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xec54d0]  JavaThread::exit(bool, JavaThread::ExitType)+0x880  (javaThread.cpp:880)
    V  [libjvm.so+0xec58d5]  JavaThread::post_run()+0x15  (javaThread.cpp:731)
    V  [libjvm.so+0x17b66b5]  Thread::call_run()+0xd5  (thread.cpp:231)
    V  [libjvm.so+0x14bc717]  thread_native_entry(Thread*)+0x127  (os_linux.cpp:864)

It is also failing under -Xcomp: -Dtest.java.opts='-Xcomp -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation'

Also, the fact that the current thread is an FJP worker indicates the test is running the virtual thread test case.
11-03-2024