JDK-8305670 : Performance regression in LockSupport.unpark with lots of idle threads
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17,21
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-04-05
  • Updated: 2023-07-25
  • Resolved: 2023-05-13
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17 JDK 21
17.0.9-oracleFixed 21 b23Fixed
Related Reports
Blocks :  
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8307067 :  
JDK-8307068 :  
Description
We noticed latency degradation when migrating an app from JDK 8 to JDK 17 or 20. The app has more than 8K threads most of which are idle. Analysis showed a large amount of CPU time spent in ThreadsListHandle::cv_internal_thread_to_JavaThread called from Unsafe_Unpark.

I could reproduce the issue locally with a simple test case - see attached UnparkRegression.java

It creates 10K idle threads that sleep indefinitely and two active threads communicating to each other via CyclicBarrier. The particular synchronization primitive does not matter: any java.util.concurrent class that eventually calls LockSupport.unpark is affected by the issue.

Running the test with JDK 17 or JDK 20 on 4-core ARM64 machine, it does 70K roundtrips per second, while with JDK 8 it does 162K (2.3x more).

The issue appeared in JDK 10 with the introduction of Thread-SMR. The problem is the linear search in ThreadsList::includes called by cv_internal_thread_to_JavaThread:
https://github.com/openjdk/jdk/blob/44f33ad1a9617fc23864c9ba5f063b3fc2f1e18c/src/hotspot/share/runtime/threadSMR.cpp#L829

The call to ThreadList::includes is guarded by a diagnostic flag `EnableThreadSMRExtraValidityChecks` which is enabled by default.
After adding -XX:-EnableThreadSMRExtraValidityChecks, the performance returns back to JDK 8 levels.
Comments
Fix request [17u] I backport this for parity with 17.0.9-oracle. Medium risk Trivial resolve needed. SAP nightly testing passed.
21-07-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1604 Date: 2023-07-20 12:22:34 +0000
20-07-2023

Changeset: f030937a Author: Daniel D. Daugherty <dcubed@openjdk.org> Date: 2023-05-13 14:44:57 +0000 URL: https://git.openjdk.org/jdk/commit/f030937a51b95dde33ce33537ee830153b2c3b56
13-05-2023

Mach5 test results for the v07 version of the PR which is baselined on jdk-21+20 and includes the dependent patches for JDK-8307067 and JDK-8307068: Mach5 Tier1: - no failures Mach5 Tier2: - no failures Mach5 Tier3: - no failures I'm about to rebase the group of patches to the latest jdk/jdk repo in preparation for possible integration.
10-05-2023

This bug fix (JDK-8305670) depends on the following bug fix: JDK-8307068 store a JavaThread* in the java.lang.Thread object after the JavaThread* is added to the main ThreadsList in order to function properly. If this bug fix (JDK-8305670) is backported, then JDK-8307068 must also be backported.
03-05-2023

Mach5 test results for the v05 version of the PR which is baselined on jdk-21+20 and includes the dependent patches for JDK-8307067 and JDK-8307068: Mach5 Tier1 - no failures Mach5 Tier2 - no failures Mach5 Tier3 - no failures Mach5 Tier4 - no failures Mach5 Tier5 - 1 known, unrelated test failure: - java/lang/Thread/virtual/HoldsLock.java#id0 failed due to JDK-8305919 Mach5 Tier6 - no failures Mach5 Tier7 - no failures Mach5 Tier8 - 2 known, unrelated test failures: - vmTestbase/runtime/jni/LoadTests/LoadTests.java failed due to JDK-8306754 - java/lang/Thread/virtual/HoldsLock.java#id0 failed due to JDK-8305919
03-05-2023

[~apangin] - Thanks for the info. [~ecaspole]'s reply isn't visible because it contains Oracle internal URLs. The summary is that he's adapted the sample program into a JMH and is doing testing on our performance machines.
20-04-2023

The issue is observed on ARM64 hardware: for instance, Ampere A1 machines in Oracle Cloud. $ uname -a Linux arm1 5.15.0-1033-oracle #39-Ubuntu SMP Mon Apr 3 14:10:58 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux $ /usr/lib/jvm/java-8-openjdk-arm64/bin/java -version openjdk version "1.8.0_362" OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-0ubuntu1~22.04-b09) OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode) $ jdk-21/bin/java -version openjdk version "21-ea" 2023-09-19 OpenJDK Runtime Environment (build 21-ea+18-1480) OpenJDK 64-Bit Server VM (build 21-ea+18-1480, mixed mode, sharing)
19-04-2023

[~apangin] - We're having trouble reproducing improvements with the current patch in our environments relative to 8u371. Please provide more exact version info for JDK8u and JDK21 and for the machines on which you've seen this issue and the improvement.
19-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13519 Date: 2023-04-18 21:09:54 +0000
18-04-2023

Ah I see my mistake now. The issue Robbin found only becomes a problem if we don't check includes() inside cv_internal_thread_to_JavaThread. There are a number of other places where the JavaThread is extracted from eetop but they also seem to be correctly protected one way or another.
16-04-2023

[~dcubed] Robbin has identified a bug with the existing code due to allowing the JavaThread* to escape via eetop before it has been added to the threads-list. That needs fixing independent of the performance issue with checking includes(). As such it would be better to fix it under its own issue. BTW eetop -> "execution environment top" from classic VM days - JDK 1.0 (and likely earlier). I suspect over the years there have been "tools" that would peek at this field to gain access to the underlying VM object and so renaming it would have been problematic
16-04-2023

[~dholmes] - Perhaps a split would be useful, but the only reason for making the changes to when we save the JavaThread* in the java.lang.Thread object is because we want to speed up cv_internal_thread_to_JavaThread(). If we aren't trying to add a quick_mode to cv_internal_thread_to_JavaThread(), then we don't have to do anything to when we save the JavaThread* in the java.lang.Thread object. I'm wrapping up the merge of my (mostly comments) work with [~rehn]'s quick_mode work and then I'm going to do more testing before spinning up a new PR. I'm going to withdraw the old PR to avoid confusion. P.S. Yes, I very much dislike the "eetop" name for the field where we store the JavaThread* in the java.lang.Thread object. For some reason, that name has no meaning to me at all.
14-04-2023

I think we perhaps should split out the fix to prevent eetop from exposing a JavaThread that is not on a Threads-list and so cannot be guarded by a TLH. After that we can re-reason about the need for the includes() call in this issue and address that.
13-04-2023

[~rehn] - I've verified that your patch also prevents the following two tests from crashing: runtime/handshake/HandshakeWalkOneExitTest.java runtime/handshake/HandshakeDirectTest.java I'm going to take a closer look at your patch and merge some of my comment changes with your changes (if appropriate).
13-04-2023

I'm having API indigestion with the idea that we need to add a check for the JavaThread's thread state not being "NEW" before we create a ThreadsListHandle. That's just down right ugly... :-) Here's our description of the Thread-SMR API from: src/hotspot/share/runtime/threadSMR.hpp: // Thread Safe Memory Reclamation (Thread-SMR) support. // // ThreadsListHandles are used to safely perform operations on one or more // threads without the risk of the thread or threads exiting during the // operation. It is no longer necessary to hold the Threads_lock to safely // perform an operation on a target thread. // // There are several different ways to refer to java.lang.Thread objects // so we have a few ways to get a protected JavaThread *: // // JNI jobject example: // jobject jthread = ...; // : // ThreadsListHandle tlh; // JavaThread* jt = nullptr; // bool is_alive = tlh.cv_internal_thread_to_JavaThread(jthread, &jt, nullptr); // if (is_alive) { // : // do stuff with 'jt'... // } // // JVM/TI jthread example: // jthread thread = ...; // : // JavaThread* jt = nullptr; // ThreadsListHandle tlh; // jvmtiError err = JvmtiExport::cv_external_thread_to_JavaThread(tlh.list(), thread, &jt, nullptr); // if (err != JVMTI_ERROR_NONE) { // return err; // } // : // do stuff with 'jt'... // // JVM/TI oop example (this one should be very rare): // oop thread_obj = ...; // : // JavaThread *jt = nullptr; // ThreadsListHandle tlh; // jvmtiError err = JvmtiExport::cv_oop_to_JavaThread(tlh.list(), thread_obj, &jt); // if (err != JVMTI_ERROR_NONE) { // return err; // } // : // do stuff with 'jt'... // // A JavaThread * that is included in the ThreadsList that is held by // a ThreadsListHandle is protected as long as the ThreadsListHandle // remains in scope. The target JavaThread * may have logically exited, // but that target JavaThread * will not be deleted until it is no // longer protected by a ThreadsListHandle. [~rehn] - Thanks for changeset! I will check it out in a few minutes.
13-04-2023

[~dcubed] https://github.com/openjdk/jdk/compare/master...robehn:jdk:double-check?expand=1 Maybe I missed something again but above seems to work for me. Since we know eetop is not null before we create ThreadsList, which means it must still be there if it's still non null.
13-04-2023

In JNI case we seem to already do it this way, we first create JavaThread and Threads:add(), then we create Thread obj and set eetop. So we already have the invariant with JavaThread on ThreadsList, but eetop still NULL.
13-04-2023

1: j.l.Thread is created, eetop = NULL. This is globally visible. 2: We create a JavaThread to start this Thread. 3: We store eetop = JavaThread*. JavaThread* is now globally visible via eetop, but still not on ThreadsList. Hence we cannot dereference it. 4: We add JavaThread to ThreadsList. If we swap 3<->4 3: We add JavaThread to ThreadsList. JavaThread* is now globally visible via ThreadsList, but not via eetop. We can deference it since it is on ThreadsList. 4: We store eetop = JavaThread*. Publishing a pointer which cannot be validated is not so nice. Adding to ThreadsList before publishing seem safer and nicer. And there is no change in visibility, both cases gets publish in step 3, but via Thread obj vs ThreadsList. That is what I mean.
13-04-2023

[~dholmes] No my bad, I need to explain better and read more carefully, sorry.
13-04-2023

[~rehn] Apologies. I was reading: java_lang_Thread::set_thread(thread_oop(), this); as the setting of threadObj (and wrote that 3 times above with no one calling me out for it :( ), but it is the setting of eetop as you note. I would still be wary of exposing the JavaThread in the Threads-list when the j.l.Thread <->JavaThread connection is only half in place (though at worst I would anticipate an asertion failure). But you are right that we already similarly expose it via the eetop field in a way that is in fact dangerous as any TLH we create to guard it won't in fact do so! That is a significant bug in the existing code!
13-04-2023

> Be move is from just above Threads::add(this); to under it, so exposure is the same. I don't know what you mean by this. There is no exposure with the current code. As soon as you publish the thread via Threads:add it can be found by other logic which can then see a null threadObj. This is a risky change and not necessary.
12-04-2023

When I did the experiment to move "java_lang_Thread::set_thread()" after "Threads::add()", it had to be done in two places: --- a/src/hotspot/share/runtime/javaThread.cpp +++ b/src/hotspot/share/runtime/javaThread.cpp @@ -1656,7 +1656,6 @@ void JavaThread::prepare(jobject jni_thread, ThreadPriority prio) { assert(InstanceKlass::cast(thread_oop->klass())->is_linked(), "must be initialized"); set_threadOopHandles(thread_oop()); - java_lang_Thread::set_thread(thread_oop(), this); if (prio == NoPriority) { prio = java_lang_Thread::priority(thread_oop()); @@ -1672,6 +1671,9 @@ void JavaThread::prepare(jobject jni_thread, ThreadPriority prio) { // added to the Threads list for if a GC happens, then the java_thread oop // will not be visited by GC. Threads::add(this); + // Publish the JavaThread* in java.lang.Thread after the JavaThread* is + // on a ThreadsList. + java_lang_Thread::set_thread(thread_oop(), this); } oop JavaThread::current_park_blocker() { @@ -2094,10 +2096,6 @@ void JavaThread::start_internal_daemon(JavaThread* current, JavaThread* target, MutexLocker mu(current, Threads_lock); - // Initialize the fields of the thread_oop first. - - java_lang_Thread::set_thread(thread_oop(), target); // isAlive == true now - if (prio != NoPriority) { java_lang_Thread::set_priority(thread_oop(), prio); // Note: we don't call os::set_priority here. Possibly we should, @@ -2110,6 +2108,9 @@ void JavaThread::start_internal_daemon(JavaThread* current, JavaThread* target, target->set_threadOopHandles(thread_oop()); Threads::add(target); // target is now visible for safepoint/handshake + // Publish the JavaThread* in java.lang.Thread after the JavaThread* is + // on a ThreadsList. + java_lang_Thread::set_thread(thread_oop(), target); // isAlive == true now Thread::start(target); } And I didn't play with this patch a whole lot because it didn't fix the racing issue. I'll take a look the 11 comments posted since I last touched this bug last night and see how I want to proceed. My current gut feel is that we (me included) are mulling on adding even more brittle logic to potentially solve a performance issue. Hoops and checks and complicated comments to explain it all. I'm starting to wonder if I should just bite the bullet and write the code to speed up ThreadsListHandle.contains()...
12-04-2023

If we remove the j.l.Thread.State, the internal VM works just fine. There no code at all inside the VM which uses that state to take any action. Be move is from just above Threads::add(this); to under it, so exposure is the same.
12-04-2023

j.l.Thread.State represent the state a of thread in the Java virtual machine. This have no direct relationship with the lifetime of JavaThread* and ThreadsList. The point of "once a Thread moves out of the NEW" can be at anytime from a ThreadsList/JavaThread* POV, it just happen to be correct at the moment. If we use it, we involve yet another variable in the thread lifecycle, we already have to many. Therefore I prefer using those we already have. EDIT: If we need we could do SENTINEL => JavaThread* => NULL, but yes that involves the reorder. I don't see directly why that is a bad idea? Anyhow that is a double read of eetop, one before TLH and one after. (boolean alive() could check TERMINATED instead.)
12-04-2023

> I don't think it's a good idea to use a Java (j.l.Thread.State) level state to determine if a JavaThread* is on a ThreadsList without walking it. What are you concerned about? We know that once a Thread moves out of the NEW state then it's JavaThread has been added to the main threads-list, so any TLH created thereafter will capture that JavaThread unless it has terminated - which is also determined by examining the state of the j.l.Thread (just a different bit of state). I don't think it is a good idea to reorder Threads::add and set_thread() as noted above.
12-04-2023

I don't think it's a good idea to use a Java (j.l.Thread.State) level state to determine if a JavaThread* is on a ThreadsList without walking it. I rather do the double read of eetop.
12-04-2023

[~dholmes] Seems you are right. Checking thread status before TLH should work then.
12-04-2023

But thinking more. Thread::start is called after prepare() which means after Threads:add(). So if we see the target thread is RUNNABLE it must be on the main threads-list, so if we create the TLH after that then we must still be guaranteed to capture the newly started JavaThread.
12-04-2023

I could have sworn the thread set its own state to RUNNABLE - drat! I can't immediately see why it can't set its own state in JavaThread::run ?? I'm very wary about changing the order of Threads::add and set_thread() as I think it would expose a partially initialized JavaThread that would then require all the native thread processing code to always have to watch for a null threadObj. I think it would take a long time for bugs to show up there.
12-04-2023

I had a similar idea, but realized it will not work with a naive implementation, since the thread status is set to RUNNABLE before the thread actually starts: void Thread::start(Thread* thread) { // Start is different from resume in that its safety is guaranteed by context or // being called from a Java method synchronized on the Thread object. if (thread->is_Java_thread()) { // Initialize the thread state to RUNNABLE before starting this thread. // Can not set it after the thread started because we do not know the // exact thread state at that time. It could be in MONITOR_WAIT or // in SLEEPING or some other state. java_lang_Thread::set_thread_status(JavaThread::cast(thread)->threadObj(), JavaThreadStatus::RUNNABLE); } os::start_thread(thread); } Alternatively, we could check 'eetop' field twice: the first time before TLH creation, and the second time afterwards. This requires changing the order of Threads::add() and java_lang_Thread::set_thread() calls as [~rehn] suggested above. However, I'm not sure this change is correct - need to check carefully if this breaks some other invariant.
12-04-2023

I totally disagree sorry - the state of a j.l.Thread and its associated JavaThread are tightly interwined. The thread startup process is not something that fluctuates or is likely to change. The reorder is potentially bad because it can expose a JavaThread that has a null threadObj and that is something that is only expected in very specific circumstances (primarily JNI attaching threads).
12-04-2023

I think we may be able to address the startup issue by checking the state of the j.l.Thread before we create the TLH. If the Thread is in threadStatus state NEW then it is unstarted; otherwise we create the TLH and then extract the JavaThread from eetop, now guaranteed that the JavaThread had to be captured by the TLH. Unfortunately this means all callers of cv_internal_thread_to_JavaThread have to perform this check (and cv_internal_thread_to_JavaThread should assert threadStatus != NEW). This really messes up the intent to have all such logic encapsulated by the TLH, but I don't see any other way (other than continue to check tlh.includes()).
12-04-2023

Right got it. If the thread was already started then it can't have terminated but it is possible to find a newly starting thread that is not in the TLH. So suppose this newly starting thread is short-lived and runs to completion, it could in fact terminate and be deleted while the code with the TLH tries to operate on it!
11-04-2023

An observer can create a ThreadsListHandle that does not include a newly started thread that hasn't been added to the system ThreadsList yet. However, by the time that the observer code gets to cv_internal_thread_to_JavaThread(), the newly created JavaThread has passed all the conditions that we can check inside that function to see that the java_thread we're trying to convert is/was new. So if the observer thread's call into cv_internal_thread_to_JavaThread() tries to assert that the newly started thread is on "this" ThreadsList, that will fail because it does not contain the newly created JavaThread because the ThreadsList was created before that JavaThread got going...
11-04-2023

I'm trying to read between the lines here but am struggling a bit. What exactly is the problematic scenario for thread startup? The JavaThread <-> j.l.Thread two-way connection should be established before the new JavaThread is published on the threads-list. That ensures it is fully initialized before becoming visible to other code that interacts with threads. Top me this is a desired invariant and not something to mess with.
11-04-2023

[~rehn] - Sorry the race is still there. The creation of the ThreadsListHandle in the observer can be followed by a pause in the observer such that the target JavaThread has been added to the system ThreadsList _and_ has published its eetop field in the java.lang.Thread so when the observer thread resumes, it will see a non-nullptr value from eetop in java.lang.Thread so the observer will think that the JavaThread is on the ThreadsList created in the caller, but it is not.
11-04-2023

[~rehn] - Very interesting idea. I'll take a look...
11-04-2023

Nice one. eetop may only be non-null while on ThreadsList. When we start we set it before we are on it. I think this maybe fixes it: diff --git a/src/hotspot/share/runtime/javaThread.cpp b/src/hotspot/share/runtime/javaThread.cpp index 9eee1028c86..7f835065891 100644 --- a/src/hotspot/share/runtime/javaThread.cpp +++ b/src/hotspot/share/runtime/javaThread.cpp @@ -1658,3 +1658,2 @@ void JavaThread::prepare(jobject jni_thread, ThreadPriority prio) { set_threadOopHandles(thread_oop()); - java_lang_Thread::set_thread(thread_oop(), this); @@ -1674,2 +1673,3 @@ void JavaThread::prepare(jobject jni_thread, ThreadPriority prio) { Threads::add(this); + java_lang_Thread::set_thread(thread_oop(), this); } Writer do: Add JavaThread to ThreadsList Publish by setting eetop to JavaThread ... De-publish by setting eetop to NULL (ensure_join()) Remove JavaThread from ThreadsList Reader do: Take a ThreadsList Read eetop, if non-null it should be on that list
11-04-2023

So I've taken a close look at the starting JavaThread case and I agree that is the case that explains my race observations with the new assert() in the comments above. [~apangin] - thanks for sending me down the proper path. So there is no way to definitively distinguish that we're in the newly starting JavaThread case and thereby keep the new assertion (with a bail out for the newly starting JavaThread case). So while I think we've adequately proven that the exiting JavaThread case doesn't require the ThreadsList search, we have to do the search for the newly starting JavaThread case. I'm going to remove the EnableThreadSMRExtraValidityChecks option, remove the new assertion and refine and update the comments. I'm also going to look at speeding up the 'includes()' function. It'll be a little while before I take the PR back out of draft mode.
11-04-2023

[~apangin] - Interesting idea. I've been focused on the JavaThread exit code paths that I hadn't much thought about the JavaThread start code paths. Here's [~rehn]'s description for the two tests: * @test HandshakeWalkOneExitTest * @summary This test tries to stress the handshakes with new and exiting threads and * @test HandshakeDirectTest * @bug 8240918 * @summary This test tries to stress direct handshakes between threads while suspending them. HandshakeWalkOneExitTest is the one that is most reproducible for me... [~dholmes] - "... then we have a bug somewhere." That's exactly what I'm trying to figure out. Things are not quite working the way I think they should and that's why the assertion is firing. Of course, based on what [~apangin] wrote, that assertion might be invalid for the early startup of new threads. I have to investigate from that angle...
11-04-2023

I really struggled to try and follow that explanation. A JavaThread can't be recycled until it no longer appears on any threads-list. A thread oop can't be storing a non-null JavaThread address after the JavaThread has exited/terminated died. If either of those statements is appearing to not be true then we have a bug somewhere.
11-04-2023

I suppose the issue happens not with an exited thread, but with a not-yet-started one. 'hst' thread in the test accesses 'threads' array which is recreated from scratch multiple times. At some point it may contain a java.lang.Thread object which is already constructed but not yet started. This thread is started after ThreadListHandle in WB_HandshakeWalkStack is created, but before java_lang_Thread::thread(thread_oop) is called. Does this sound like a plausible explanation? (I haven't verified it yet)
10-04-2023

Hmmm... I had forgotten that we converted the java.lang.Thread oop that we're holding in the JavaThread* into an OopHandle. Those OopHandles are queued up on the ServiceThread as part of the JavaThread destructor to be cleaned up by the ServiceThread when it gets the chance... So the java.lang.Thread oop does live longer that the JavaThread* itself and I wonder what happens if our observer thread queries one of these via its 'jthread' -> 'thread_oop' connection...
10-04-2023

Normally we don't include detailed test failure info for an in progress fix, but I'm going to do so in order to explain why this code doesn't quite work the way that we think it does. First here's the failing assertion: # Internal Error (/work/shared/bug_hunt/8305670_for_jdk21.git/open/src/hotspot/share/runtime/threadSMR.cpp:837), pid=1011815, tid=1011969 # assert(is_on_list) failed: java_thread=0x00007fcfd80267f0 is not on ThreadsList(0x00007fcfdc003cc0) # # JRE version: Java(TM) SE Runtime Environment (21.0) (slowdebug build 21-internal-LTS-2023-04-10-1532144.dcubed...) # Java VM: Java HotSpot(TM) 64-Bit Server VM (slowdebug 21-internal-LTS-2023-04-10-1532144.dcubed..., mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x149b0e3] ThreadsListHandle::cv_internal_thread_to_JavaThread(_jobject*, JavaThread**, oopDesc**)+0x18d And here's the crashing thread's stack trace: --------------- T H R E A D --------------- Current thread (0x00007fcfd802ac70): JavaThread "Thread-1" [_thread_in_vm, id=1011969, stack(0x00007fd02a794000,0x00007fd02a895000)] _threads_hazard_ptr=0x00007fcfdc003cc0, _nested_threads_hazard_ptr_cnt=0 Stack: [0x00007fd02a794000,0x00007fd02a895000], sp=0x00007fd02a893670, free space=1021k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x149b0e3] ThreadsListHandle::cv_internal_thread_to_JavaThread(_jobject*, JavaThread**, oopDesc**)+0x18d (threadSMR.cpp:837) V [libjvm.so+0x15d1237] WB_HandshakeWalkStack+0x1a8 (whitebox.cpp:2196) J 203 jdk.test.whitebox.WhiteBox.handshakeWalkStack(Ljava/lang/Thread;Z)I (0 bytes) @ 0x00007fd043f93bb5 [0x00007fd043f93ac0+0x00000000000000f5] j HandshakeWalkOneExitTest$1.run()V+27 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21-internal j java.lang.Thread.run()V+19 java.base@21-internal v ~StubRoutines::call_stub 0x00007fd0437b6d21 V [libjvm.so+0xd4ebce] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x630 (javaCalls.cpp:415) V [libjvm.so+0x1234260] os::os_exception_wrapper(void (*)(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*), JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x36 (os_linux.cpp:4833) V [libjvm.so+0xd4e59b] JavaCalls::call(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x3d (javaCalls.cpp:329) V [libjvm.so+0xd4d5e0] JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x1ba (javaCalls.cpp:185) V [libjvm.so+0xd4d6e9] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x95 (javaCalls.cpp:191) V [libjvm.so+0xe77bab] thread_entry(JavaThread*, JavaThread*)+0x8e (jvm.cpp:2918) V [libjvm.so+0xd68f6c] JavaThread::thread_main_inner()+0x168 (javaThread.cpp:717) V [libjvm.so+0xd68e01] JavaThread::run()+0x1d7 (javaThread.cpp:702) V [libjvm.so+0x1491339] Thread::call_run()+0x1b9 (thread.cpp:224) V [libjvm.so+0x122a012] thread_native_entry(Thread*)+0x1ad (os_linux.cpp:740) Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) J 203 jdk.test.whitebox.WhiteBox.handshakeWalkStack(Ljava/lang/Thread;Z)I (0 bytes) @ 0x00007fd043f93b42 [0x00007fd043f93ac0+0x0000000000000082] j HandshakeWalkOneExitTest$1.run()V+27 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@21-internal j java.lang.Thread.run()V+19 java.base@21-internal v ~StubRoutines::call_stub 0x00007fd0437b6d21 Our crashing thread ("Thread-1") is busy abusing the worker threads in the runtime/handshake/HandshakeWalkOneExitTest.java test program using the WB_HandshakeWalkStack() testing API. WB_HandshakeWalkStack() is a whitebox API that takes a jthread, converts the jthread into a JavaThread*, and tries to handshake with the JavaThread* and walks the JavaThread's stack. Pretty simple stuff. WB_HandshakeWalkStack() uses ThreadsListHandle::cv_internal_thread_to_JavaThread() to convert the jthread parameter into a JavaThread*. And that's where we run into the new assert(). Let's repeat that assertion failure: assert(is_on_list) failed: java_thread=0x00007fcfd80267f0 is not on ThreadsList(0x00007fcfdc003cc0) so 0x00007fcfd80267f0 isn't found on the specified ThreadsList. Let's check the hs_err_pid: Threads class SMR info: _java_thread_list=0x00007fcfc4001000, length=17, elements={ 0x00007fd05402df50, 0x00007fd05459a8d0, 0x00007fd05459c0b0, 0x00007fd05459df00, 0x00007fd05459f630, 0x00007fd0545a0d60, 0x00007fd0545a2ad0, 0x00007fd0545ac330, 0x00007fcff803e270, 0x00007fcffc06d590, 0x00007fcffc06e940, 0x00007fcffc06fd70, 0x00007fd054628e90, 0x00007fd054643490, 0x00007fd054675fc0, 0x00007fcfd802ac70, 0x00007fcfd80267f0 } The *last* entry in _java_thread_list matches '0x00007fcfd80267f0'! What the heck? Oh wait, the assertion also reported the ThreadsList address that we searched with a value of: 0x00007fcfdc003cc0 And that ThreadsList doesn't match the current _java_thread_list which is 0x00007fcfc4001000. So our JavaThread (0x00007fcfd80267f0) is on the current ThreadsList, but is NOT on an older ThreadsList. Since 0x00007fcfd80267f0 is in the last slot of the current ThreadsList, that indicates that 0x00007fcfd80267f0 is a new entry added to a new copy of the previous ThreadsList which is likely the one that we searched (0x00007fcfdc003cc0). We could check the _to_delete_list, but that's a bit of work that we don't have to do (yet, if at all). Let's check the Events in the hs_err_pid file: Event: 1.089 Thread 0x00007fcfd80267f0 Thread exited: 0x00007fcfd80267f0 Event: 1.089 Thread 0x00007fcfd80267f0 Thread added: 0x00007fcfd80267f0 Event: 1.090 Thread 0x00007fcfd80267f0 Thread exited: 0x00007fcfd80267f0 Event: 1.090 Thread 0x00007fcfd80267f0 Thread added: 0x00007fcfd80267f0 Event: 1.091 Thread 0x00007fcfd80267f0 Thread exited: 0x00007fcfd80267f0 Event: 1.091 Thread 0x00007fcfd80267f0 Thread added: 0x00007fcfd80267f0 It looks like our JavaThread (0x00007fcfd80267f0) has been reused quite a few times in the last set of events tracked before the VM crashed. It always surprises me when I see a JavaThread* being used over and over, but that's pretty normal for malloc'ed and free'ed data structures... So let's take a look at what WB_HandshakeWalkStack() is doing: src/hotspot/share/prims/whitebox.cpp: } else if (thread_handle != nullptr) { ThreadsListHandle tlh; JavaThread* target = nullptr; bool is_alive = tlh.cv_internal_thread_to_JavaThread(thread_handle, &target, nullptr); if (is_alive) { Handshake::execute(&tsc, &tlh, target); } } So WB_HandshakeWalkStack() is passed a 'thread_handle' that refers to a JavaThread and we pass it to cv_internal_thread_to_JavaThread() to convert that 'thread_handle' into a JavaThread*. We already created the ThreadsListHandle (tlh) that doesn't include the JavaThread* that we think we're converting, but we don't know that yet. The JavaThread* is not in the list in 'tlh' because that JavaThread* has exited. We're just the observer so we don't know that yet. So let's dive into the cv_* function and see the racy details... bool ThreadsListHandle::cv_internal_thread_to_JavaThread(jobject jthread, JavaThread ** jt_pp, oop * thread_oop_p) { assert(this->list() != nullptr, "must have a ThreadsList"); assert(jt_pp != nullptr, "must have a return JavaThread pointer"); // thread_oop_p is optional so no assert() // The JVM_* interfaces don't allow a null thread parameter; JVM/TI // allows a null thread parameter to signify "current thread" which // allows us to avoid calling cv_external_thread_to_JavaThread(). // The JVM_* interfaces have no such leeway. dcubed: We convert our 'jthread' into a 'thread_oop'... Our 'jthread' dcubed: refers to a dead JavaThread, but there's no verification of dcubed: the resolution from jthread -> oop. oop thread_oop = JNIHandles::resolve_non_null(jthread); // Looks like an oop at this point. if (thread_oop_p != nullptr) { // Return the oop to the caller; the caller may still want // the oop even if this function returns false. *thread_oop_p = thread_oop; } dcubed: We use the thread_oop to query the java.lang.Thread dcubed: object in Java land and fetch the JavaThread* that's dcubed: stored there. Above I mentioned that our 'jthread' and dcubed: our 'thread_oop' refer to a dead JavaThread so when dcubed: ask for the JavaThread* we should get nullptr which was dcubed: set by ensure_join() in the JavaThread::exit() code path. dcubed: However, we have a newly running JavaThread and it dcubed: has set a new JavaThread* value in the java.lang.Thread dcubed: object that the thread_oop refers to. In a sense, it appears dcubed: that the same JavaThread has been resurrected, but it has dcubed: a new thread name so it's not really the same JavaThread. JavaThread *java_thread = java_lang_Thread::thread(thread_oop); dcubed: It is this non-nullptr 'java_thread' value that allows us to get dcubed: to the new assertion which fails because the JavaThread* dcubed: is NOT in the ThreadsList that we're searching. if (java_thread == nullptr) { // The java.lang.Thread does not contain a JavaThread * so it has // not yet run or it has died. return false; } // Looks like a live JavaThread at this point. if (java_thread != JavaThread::current()) { dcubed: So our 'java_thread' doesn't match the current JavaThread so we dcubed: head right into the new assertion from here. // jthread is not for the current JavaThread so we could verify // the JavaThread * against the ThreadsList. if (EnableThreadSMRExtraValidityChecks) { // The java.lang.Thread's JavaThread* value is cleared by ensure_join() // in the middle of the JavaThread's exit() call. The JavaThread removes // itself from the ThreadsList at the end of the JavaThread's exit() // call. Since we have a non-nullptr java_thread value here, we know // that this ThreadsListHandle is protecting the JavaThread so this // is optional verification against future changes. dcubed: The reason that the code without the assertion "works" is dcubed: because this 'includes()' check will return 'false' because the dcubed: ThreadsList we're searching doesn't contain the JavaThread* dcubed: so that indicates that the 'java_thread' is dead (at least at the dcubed: time that the ThreadsList was created). bool is_on_list = includes(java_thread); assert(is_on_list, "java_thread=" INTPTR_FORMAT " is not on ThreadsList(" INTPTR_FORMAT ")", p2i(java_thread), p2i(this->list())); // Robustness check for non-ASSERT bits: if (!is_on_list) { dcubed: So this return 'false' tells the caller that we have a bad 'jthread' dcubed: because the associated JavaThread has died. return false; } } } // Return a live JavaThread that is "protected" by the // ThreadsListHandle in the caller. *jt_pp = java_thread; return true; } Okay that was gnarly. There is a part of this I don't quite grok. In this code path, we have a 'jthread' that got us to a 'thread_oop' and that 'thread_oop' refers to a java.lang.Thread object. On the thread creation code path, we have created a new java.lang.Thread object and we populated that java.lang.Thread object with a 'JavaThread*' that happens to be recycled. I can accept that. What I don't grok is how our observer thread's 'thread_oop' gets us to the new 'java.lang.Thread' object where we can pickup the same 'JavaThread*'. Did the 'java.lang.Thread' object also get recycled? I don't know GC algorithms so I don't know if that is possible or if the question even makes sense.
10-04-2023

The v00 proposed fix in the PR includes a refactoring that adds an assertion: --- a/src/hotspot/share/runtime/threadSMR.cpp +++ b/src/hotspot/share/runtime/threadSMR.cpp @@ -824,11 +824,22 @@ bool ThreadsListHandle::cv_internal_thread_to_JavaThread(jobject jthread, // Looks like a live JavaThread at this point. if (java_thread != JavaThread::current()) { - // jthread is not for the current JavaThread so have to verify + // jthread is not for the current JavaThread so we could verify // the JavaThread * against the ThreadsList. - if (EnableThreadSMRExtraValidityChecks && !includes(java_thread)) { - // Not on the JavaThreads list so it is not alive. - return false; + if (EnableThreadSMRExtraValidityChecks) { + // The java.lang.Thread's JavaThread* value is cleared by ensure_join() + // in the middle of the JavaThread's exit() call. The JavaThread removes + // itself from the ThreadsList at the end of the JavaThread's exit() + // call. Since we have a non-nullptr java_thread value here, we know + // that this ThreadsListHandle is protecting the JavaThread so this + // is optional verification against future changes. + bool is_on_list = includes(java_thread); + assert(is_on_list, "java_thread=" INTPTR_FORMAT " is not on ThreadsList(" + INTPTR_FORMAT ")", p2i(java_thread), p2i(this->list())); + // Robustness check for non-ASSERT bits: + if (!is_on_list) { + return false; + } } } In my Mach5 Tier1 testing, the following tests failed: - runtime/handshake/HandshakeWalkOneExitTest.java - failed on linux-aarch64-debug, linux-x64-debug, macosx-aarch64-debug, and macosx-x64-debug - runtime/handshake/HandshakeDirectTest.java - failed only on linux-aarch64-debug I was able to reproduce the runtime/handshake/HandshakeWalkOneExitTest.java on my macosx-x64 MBP13 and on my linux-x64 server. On linux-x64, I was able to reproduce it with both 'fastdebug' and 'slowdebug' bits. I'm attaching an example set of logs from a recent slowdebug failure: $ unzip -l 8305670-new-assert-failure.zip Archive: 8305670-new-assert-failure.zip Length Date Time Name --------- ---------- ----- ---- 41833 2023-04-10 15:46 test_failures.2023-04-10-154316/HandshakeWalkOneExitTest.jtr.slowdebug 212918 2023-04-10 15:59 test_failures.2023-04-10-154316/hs_err_pid1011815.log --------- ------- 254751 2 files
10-04-2023

[~apangin] - Please test the patch in your environment: https://github.com/openjdk/jdk/pull/13393 I used the attached test on my X64 MBP13, but I had to limit the thread count to 4000 or the test would blow up. I saw very little difference between baseline runs and runs with '-XX:-EnableThreadSMRExtraValidityChecks', but that's probably because of the machine I'm using today. Update: I couple of tests are failing unexpectedly with the current patch: - runtime/handshake/HandshakeWalkOneExitTest.java - runtime/handshake/HandshakeDirectTest.java Please hold off on testing this patch until I figure out this problem next week.
08-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13393 Date: 2023-04-07 20:08:11 +0000
07-04-2023

The EnableThreadSMRExtraValidityChecks first made an appearance in the "Thread-SMR (8167108)(phase2) big code review" cycle. The change is credited to gthornbr's CR, but I couldn't find an email documenting those code review comments. I'm going to guess that these comments were made during an in person code review at my house in Colorado. The only other mentions of EnableThreadSMRExtraValidityChecks that I have in 900+ emails are comments about the indenting being wrong in the globals.hpp file. Those comments were from David Holmes so he at least saw the option in its declaration/definition context way, way back in 2017. :-) Of course, that's just one line of code in an XL code review so... Okay, we added EnableThreadSMRExtraValidityChecks as a diagnostic option with a default value of 'true'. Typically that means that we intended to change the default value from 'true' to 'false' in a future release. I didn't find any other bugs/RFEs that mention EnableThreadSMRExtraValidityChecks so I didn't file a follow-up bug targeting a later release. That's an error on my part. JDK-8167108 was integrated in jdk-10+36 (to use the new version style) on 2017.11.23 so this code has had a very, very long time to bake. I think we can change the default of EnableThreadSMRExtraValidityChecks to 'false' and we can do a little bit more...
07-04-2023

[~dholmes] - Thanks for your last post. That was what I was writing from my research last night. Your post is much more concise than what I was writing (as usual). I'm still researching the history behind EnableThreadSMRExtraValidityChecks and will comment again soon...
07-04-2023

> Looks to me that thread termination sequence is: detect no SMR conflicts (thread is not on any TLH-protected list) -> do some termination steps -> set set_thread(oop, nullptr) -> proceed with termination. set_thread(oop, nullptr) happens in ensure_join() which happens before Threads::remove which is where ThreadSMRSupport::remove_thread is called, after which the thread will no longer appear in any created TLH. So if we create the TLH first (which we do) and then ask the oop for the JavaThread and get a non-null result then that JavaThread is guarded by that TLH.
07-04-2023

AFAIU the SMR machinery, TLH captures the current alive threads list, and keeps all threads on that list from terminating while TLH is in scope. But that list might not include the thread we ask to unpark? That's seems to be the whole problem here: we don't quite know if the asked-to-be-unparked thread is on TLH-protected list or not without explicitly checking for it, except for a few corner cases. Looks to me that thread termination sequence is: detect no SMR conflicts (thread is not on any TLH-protected list) -> do some termination steps -> set set_thread(oop, nullptr) -> proceed with termination. Which means two cases: 1. Seen thread(oop) == nullptr: we can argue the thread is dead, since the termination sequence is already past detecting the thread is dead, and there is no point in looking into the list. This is what the first block here does: https://github.com/openjdk/jdk/blob/44f33ad1a9617fc23864c9ba5f063b3fc2f1e18c/src/hotspot/share/runtime/threadSMR.cpp#L818-L823 2. Seen thread(oop) != nullptr: the thread *looks* alive, but we might be just unlucky that the thread is currently terminating, and set_thread(oop, nullptr) had not been done yet. If we continue down this path, the termination sequence might delete the JavaThread* under our feet. So, we now need to figure if TLH is guarding this JavaThread* for us, i.e. if it is on its list. The only shortcut is asking for the current thread, which guarantees it is not currently terminating. Which is why we go into the checks here: https://github.com/openjdk/jdk/blob/44f33ad1a9617fc23864c9ba5f063b3fc2f1e18c/src/hotspot/share/runtime/threadSMR.cpp#L826-L833 Am I completely misreading what is going on here?
06-04-2023

Hmmm. The TLH either captures the thread and makes it safe to touch, or it doesn't. If the TLH did not catch the thread then it has already terminated in which case querying the j.l.Thread will return null and the whole thing is as no-op. If the j.l.Thread returns a JavaThread* then that JavaThread must be protected by the TLH. What am I missing?
06-04-2023

Oh yes, I should have asked Dan, sorry! It looks to me that TLH guarantees no other thread would destroy the unparking thread *if* that thread was on the list. Just having the JavaThread* in hands, when it is not on the current thread list, does not enjoy any protection from TLH. So there is a window where JT* in question is both being unparked *and* already taken off the list and being destroyed, since from the perspective of SMR it is safe to do so. It looks as if EnableThreadSMRExtraValidityChecks = false just restores the old, racy behavior in code that converts thread oop to JT*, like Unsafe_Unpark. If so, it is not just an extra check, it is actually a load-bearing piece for correctness :) Is it? To solve the regression, we might want to try speeding up the `includes` scan.
06-04-2023

I'm searched everything I have on the Thread-SMR design (note I wasn't the designer or implementor) and can't find anything that specifically discusses the implication of the linear search or why the EnableThreadSMRExtraValidityChecks flag was added and only used in some places. Perhaps [~dcubed] can say more? A linear search is always going to be a problem with excessive numbers of threads, but 99.9% of the time this is not an issue. The usual alternatives to linear search would apply here - with the usual trade-offs. Thinking more about this and the use of cv_internal_thread_to_JavaThread though, I'm not sure the search is actually necessary - and perhaps why EnableThreadSMRExtraValidityChecks is just an "extra check". At the time of the call to cv_internal_thread_to_JavaThread we already have a ThreadsListHandle active and so if the target was alive when the TLH was created it remains safe to access. Inside cv_internal_thread_to_JavaThread we ask the j.l.Thread for its JavaThread. If that is non-null the thread is at least transiently alive but could logically terminate at any time. However we don't care about that - as long as the JavaThread doesn't go through it's destruction process due to it being protected by the TLH then we can continue the unpark operation on it.
06-04-2023

The code path in question is added by JDK-8167108. AFAICS, this is the intended use of SMR: the whole thing protects from dealing with a dead (concurrently deleted) `JavaThread*` in `Unsafe_Unpark`, which would -- best case -- crash. So turning off `EnableThreadSMRExtraValidityChecks` is okay for experiments, but looks rather unsafe for production use. (Just in case a casual reader would decide to add this to their JVM flags). Maybe [~dholmes] would see the ways out of this performance penalty.
05-04-2023