Bug ID: JDK-8238761 Asynchronous handshakes

JDK 16
16 b18Fixed

Changeset: 6bddeb70 Author: Robbin Ehn <rehn@openjdk.org> Date: 2020-09-29 08:50:54 +0000 URL: https://git.openjdk.java.net/jdk/commit/6bddeb70
29-09-2020
An update and clarification that I posted in the PR (preserved here in JBS by request): I've gone back to refresh myself on the previous discussions and (internal) design walk-throughs to get a better sense of these changes. Really the "asynchronous handshake" aspect is only a small part of this. The fundamental change here is that handshakes are now maintained via a per-thread queue, and those handshake operations can, in the general case, be executed by any of the target thread, the requestor (active_handshaker) thread or the VMThread. Hence the removal of the various "JavaThread::current()" assumptions. Unless constrained otherwise, any handshake operation may be executed by the VMThread so we have to take extra care to ensure the code is written to allow this. I'm a little concerned that our detour into direct-handshakes actually lulled us into a false sense of security knowing that an operation would always execute in a JavaThread, and we have now reverted that and allowed the VMThread back in. I understand why, but the change in direction here caught me by surprise (as I had forgotten the bigger picture). It may not always be obvious that the transitive closure of the code from an operation can be safely executed by a non-JavaThread. Then on top of this generalized queuing mechanism there is a filter which allows some control over which thread may perform a given operation - at the moment the only filter isolates "async" operations which only the target thread can execute. In addition another nuance is that when processing a given thread's handshake operation queue, different threads have different criteria for when to stop processing the queue: the target thread will drain the queue completely the VMThread will drain the queue of all "non-async" operations the initiator for a "direct" operation will drain the queue up to, and including, the synchronous operation they submitted the initiator for an "async" operation will not process any operation themselves and will simply add to the queue and then continue on their way (hence the "asynchronous") I do have some concerns about latency impact on the VMThread if it is used to execute operations that didn't need to be executed by the VMThread! I remain concerned about the terminology conflation that happens around "async handshakes". There are two aspects that need to be separated: the behaviour of the thread initiating the handshake operation; and which thread can execute the handshake operation When a thread initiates a handshake operation and waits until that operation is complete (regardless of which thread performed it, or whether the initiator processed any other operations) that is a synchronous handshake operation. When a thread initiates a handshake operation and does not wait for the operation to complete (it does the target->queue()->add(op); and continues on its way) that is an asynchronous handshake operation. The question of whether the operation must be executed by the target thread is orthogonal to whether the operation was submitted as a synchronous or asynchronous operation. So I have problem when you say that an asynchronous handshake operation is one that must be executed by the target thread, as this is not the right characterisation at all. It is okay to constrain things such that an async operation is always executed by the target, but that is not what makes it an async operation. In the general case there is no reason why an async operation might not be executed by the VMThread, or some other JavaThread performing a synchronous operation on the same target.
23-09-2020
This is getting side-tracked somewhat. Our implementation chooses to split suspension into two parts for the suspend-requester - one part done by the suspend request and the other done when an operation that requires the target to be suspended is called. We don't have to do it that way but that is not the point. My concern is with the terminology related to "asynchronous handshakes" as the asynchrony seems to be in the wrong place to me. When we talk about synchronous or asynchronous operations the common, standard terminology applies these with respect to the entity making the request. If a thread makes a request to have operation X performed and the request doesn't return until X has been performed, then that is a synchronous request. If a thread makes a request to have operation X performed and the request can return prior to X being performed then that is an asynchronous request. The suspend case confuses things because it can be either synchronous or asynchronous depending on where you want to draw some lines. As Dan notes SuspendThread() must not return until it is guaranteed the target won't execute any further bytecodes (etc) - so the handshake must be synchronous with respect to that requirement. However, that doesn't mean the target thread is actually blocked (as it may never return to the VM it may never block) and so it might be viewed as asynchronous with respect to the target blocking as that will happen later. But the point is that in relation to "asynchronous handshakes" we don't need a new mechanism to implement thread suspension - we already do it with the existing "synchronous" handshake mechanism. So using suspension as an argument for why we need these new "asynchronous" handshakes makes no sense to me. The need for the target to execute (part of) the handshake operation is obviously the only possibility for an asynchronous handshake (as there is no other thread involved to do it) but it is not that which makes the operation an asynchronous one. You can have (we don't currently support it) a synchronous handshake where the operation must be executed by the target thread - but that doesn't make it an asynchronous operation!
21-09-2020
To load a JVM/TI agent you must have access to the command line. If I have access to your command line, then it is not your VM. It is mine. :-) Update: > A rogue resume by a dubious JVMTI agent should not crash the JVM. > This should be handled more gracefully. Agreed now, but that was not the thinking at time this code was written. See the next comment for when it was fixed.
21-09-2020
I found the right point in time for GetStackTrace(). Before this fix: JDK-8034249 need more workarounds for suspend equivalent condition issue GetStackTrace() used to look like this: jvmtiError JvmtiEnv::GetStackTrace(JavaThread* java_thread, jint start_depth, jint max_frame_count, jvmtiFrameInfo* frame_buffer, jint* count_ptr) { jvmtiError err = JVMTI_ERROR_NONE; uint32_t debug_bits = 0; if (is_thread_fully_suspended(java_thread, true, &debug_bits)) { err = get_stack_trace(java_thread, start_depth, max_frame_count, frame_buffer, count_ptr); } else { // JVMTI get stack trace at safepoint. Do not require target thread to // be suspended. VM_GetStackTrace op(this, java_thread, start_depth, max_frame_count, frame_buffer, count_ptr); VMThread::execute(&op); err = op.result(); } return err; } /* end GetStackTrace / After JDK-8034249, GetStackTrace() looked like this: jvmtiError JvmtiEnv::GetStackTrace(JavaThread java_thread, jint start_depth, jint max_frame_count, jvmtiFrameInfo* frame_buffer, jint* count_ptr) { jvmtiError err = JVMTI_ERROR_NONE; // It is only safe to perform the direct operation on the current // thread. All other usage needs to use a vm-safepoint-op for safety. if (java_thread == JavaThread::current()) { err = get_stack_trace(java_thread, start_depth, max_frame_count, frame_buffer, count_ptr); } else { // JVMTI get stack trace at safepoint. Do not require target thread to // be suspended. VM_GetStackTrace op(this, java_thread, start_depth, max_frame_count, frame_buffer, count_ptr); VMThread::execute(&op); err = op.result(); } return err; } /* end GetStackTrace */ So GetStackTrace() called "is_thread_fully_suspended(java_thread, true, &debug_bits)" waited for external suspend completion for target threads until JDK9-B04.
21-09-2020
A rogue resume by a dubious JVMTI agent should not crash the JVM. This should be handled more gracefully. I know by experience that it can be hard to find out and prove that the bug is in that dubious JVMTI agent and not in the JVM. Even if we succeeded to convince the customer that the crash was cause by a bug in the agent, she might still be not amused that her (production?) system crashed. Also a crash is a potential exploit (but the person who can load a JVMTI agent into a JVM likely has control over it anyway).
21-09-2020
My point on JVM/TI SuspendThread() remains: > It is not the suspension request that must wait. This was a conscious design/implementation decision with the suspend/resume subsystem. It's not the suspend request that needs to wait for the suspend to be completed. It's the consumer of the suspend state that needs to wait. build/macosx-x86_64-normal-server-release/hotspot/variant-server/gensrc/jvmt ifiles/jvmtiEnter.cpp: jvmti_SuspendThread() - creates a ThreadsListHandle to protect the target thread - calls SuspendThread() src/hotspot/share/prims/jvmtiEnv.cpp: JvmtiEnv::SuspendThread() - grabs the target's SR_lock. - calls set_external_suspend() w/ some error checking - drops the target's SR_lock - calls JvmtiSuspendControl::suspend() src/hotspot/share/prims/jvmtiImpl.cpp: JvmtiSuspendControl::suspend() - calls java_suspend() - does one last has-the-thread-exited check Note: even with the target JavaThread protected by a ThreadsListHandle, it is possible for the target JavaThread to exit, but the JavaThread won't be freed until it is safe. src/hotspot/share/runtime/thread.cpp: JavaThread::java_suspend() - validate the thread isn't exiting (necessary since there are a couple of different ways into JavaThread::java_suspend()) - grab the target's SR_lock - call is_ext_suspend_completed() and return if true Note: It is possible (and highly likely) that the target JavaThread will pass through a transition point where it will notice the pending external suspend request and it will self-suspend before the requesting thread reaches this check. Self-suspension makes is_ext_suspend_completed() true. - drop the target's SR_lock - if we're suspending our self, then call java_suspend_self() else request a VM_ThreadSuspend Even if we take the longest possible code path on the JVM/TI SuspendThread() call and have to request a VM_ThreadSuspend, that does not mean that we are waiting for the suspend to be completed (i.e., the target thread to self-suspend). What it does mean is that the suspend contract will be honored and no more byte-code or byte-code equivalents will be executed by the target JavaThread. That JavaThread might be executing native code so it will happily continue on until it tries to return to Java or call into the VM and at that point it will self-suspend. Checkout JavaThread::is_thread_fully_suspended() and JavaThread::wait_for_ext_suspend_completion() to see how the waiting for external suspend completion works. It looks like both of those functions are now called from fewer places since Thread Local Handshakes has been implemented and that's a good thing.
21-09-2020
Yes, JVM/TI GetStackTrace() does not require the thread to be suspended, but it used to behave differently if you were using thread suspension. I'm talking about the older implementation of JVM/TI GetStackTrace(). It looks like it has changed a bit and now uses handshakes when the target thread != current thread. There used to be logic in place that would detect when the target thread had an external suspend request and was not yet suspended. In that case it would wait for the suspend to complete before doing the stack walk. Yes, that code was at risk of a "rogue resume", but we didn't worry about that too much because you had a broken JVM/TI agent in that case.
21-09-2020
> When GetStackTrace() is called, it will wait for the external suspend request to > be completed before it walks the target thread's stack. Dan was probably referring to is_thread_fully_suspended() (covered by JDK-8223312) which is called by JVMTI operations that require the target thread to be suspended (e.g. PopFrame). I must admit that right now I don't understand why is_thread_fully_suspended() is needed. Maybe because the SuspendThread was done by another thread than the one requesting the PopFrame? But this looks like another unsafe stackwalk https://github.com/openjdk/jdk/blob/2e30ff61b0394793655d6c6a593a5de8eb9c07b9/src/hotspot/share/prims/jvmtiEnv.cpp#L1789 which can crash the vm if the suspended target thread is resumed. The resume would certainly be illegal but it should not crash the vm.
21-09-2020
> When GetStackTrace() is called, it will wait for the external suspend request to > be completed before it walks the target thread's stack. JVMTI spec says "The thread need not be suspended to call this function." [1] So I wouldn't expect GetStackTrace() to wait for suspend completion. It would be my understanding that JavaThread::java_suspend() (used by the JVMTI impl. of GetStackTrace()) won't return before the target has reached a state that it cannot leave without noticing the suspend request and in which it cannot execute bytecode or equivalent. If necessary the requesting thread executes VM_ThreadSuspend at a safepoint for that. > And why does suspension need an async handshake when the requester > needs to wait until it knows the target is effectively suspended? I'd like to see it that way: executing an async. handshake is synchronous to some extend too. It blocks until the target reaches a state it cannot leave without noticing the async. handshake operation. This is very much like JavaThread::java_suspend() is synchronous the difference is that the target does not poll _suspend_flags but it polls for safepoints/handshakes. The async. part is doing the self suspend. [1] https://docs.oracle.com/en/java/javase/14/docs/specs/jvmti.html#GetStackTrace
21-09-2020
> And why does suspension need an async handshake when the requester > needs to wait until it knows the target is effectively suspended? Minor clarification on this point. It is not the suspension request that must wait. It is the following operation that depends on the suspend to be completed that must wait. For example, a SuspendThread() followed by a GetStackTrace() call will result in an external suspend request being set (after grabbing the SR_lock) and a return to the caller. After SuspendThread() has returned, we guarantee that no byte-code or byte-code equivalent will be executed. When GetStackTrace() is called, it will wait for the external suspend request to be completed before it walks the target thread's stack.
21-09-2020
"I thought we had reduced the places where we poll for safepoints (and now handshakes) significantly." No. We never just check _suspend_flag, since we deliver _suspend_flag events at transition we also have a poll. "may not need to happen at all of the same points of execution" Yes, I wrote that in above: "delivery terms". It's not a big issue, since current patch of asynchronous gives us the filter mechanism needed. The missing information that needs to be supplied is why we poll, e.g. entering java. The suspended thread must block if suspended when transitioning from a 'suspend equivalent' state, e.g. native/blocked to vm/java. That is what we use the _suspend_flag for and the requester cannot wait for that. So the _suspend_flag is just the 'barrier' part of the S/R protocol. Doing the initial request to suspend is a synchronous request as you say, but the suspend-barrier is asynchronous and target only executed.
21-09-2020
I'm not sure I agree that all such interactions should reduce to a single mechanism with a single set of points where you query for an "operation". For that to work for all operations you have to query for any operation at all the possible places that any operation might be desirable. You state: > Since we already have a safepoint(/handshake) poll at every possible delivery location but do we? I thought we had reduced the places where we poll for safepoints (and now handshakes) significantly. Checking for suspension requests, or async exceptions, may not need to happen at all of the same points of execution. And why does suspension need an async handshake when the requester needs to wait until it knows the target is effectively suspended? As I have commented before this is conflating the sync/async nature of the operation request, with what thread has to perform the operation.
21-09-2020
The main point is to remove _suspend_flag and as_special_runtime_exit_condition/handle_special_runtime_exit_condition and friends which will greatly simplify transition and will make it possible to contain the functionality. Right now it's spread out all over the runtime and we have many special transition for not getting this wrong. Delivery conditions are duplicate and very hard to understand. The _suspend_flag is asynchronous and target executed: Set bit X and thread execute method Y when doing a transition where the delivery terms are met. Since we already have a safepoint(/handshake) poll at every possible delivery location, these can be implementerad as handshake (if we had the correct functionality). The handshake operation could contain both the delivery terms and the delivery, thus having everything at one place and only once. This only adds the asynchronous and target only executed handshakes, not all other pieces.
21-09-2020
> Some JavaThread handshake operations must be performed by only the targeted thread(s). In what sense? VM operations were always performed by the VMThread, not the target. Synchronous handshakes are performed by the handshaker or handshakee - and as I've already flagged there are potential problems if the handshakee now executes code previously only ever executed by the VMThread at a safepoint! Granted an operation like "suspend" does require the target to do something, but that is not hampered by the target being blocked because we ensure it can't transition out of being blocked if it is suspended. So what operations are we envisaging that must be performed by the target thread itself? I can imagine perhaps asking for a stacktrace - the target will walk its own stack safely. But again if the target is blocked then the handshaker can also walk its stack, provided we have a guard that stops it unblocking while that is in progress. Asynchronous operations are much rarer than synchronous ones - normally the requester needs to know what they requested has happened, not just that it will happen at some point in the future (again see thread suspension where we have different APIs that need to treat the request for suspension versus knowing the target it actually suspended, differently). But yes if we do envisage asynchronous handshake operations then a queue would be necessary.
29-05-2020

Duplicate :	JDK-8252479 - Introduce new assertion for checking current thread or handshaker
Relates :	JDK-8254780 - EnterInterpOnlyModeClosure::completed() always returns true
Relates :	JDK-8223312 - Utilize handshakes instead of is_thread_fully_suspended
Relates :	JDK-8267079 - Support async handshakes that can be executed by a remote thread
Relates :	JDK-8252902 - Handshakes and locking improvements
Relates :	JDK-8254263 - Remove special_runtime_exit_condition() check from ~ThreadInVMForHandshake()
Relates :	JDK-8238860 - Safepoints with handshakes
Relates :	JDK-8238861 - Remove suspend_flag