JDK-8245772 : Handshake instead of using suspend flag for native sampling
Type:Enhancement
Component:hotspot
Sub-Component:jfr
Affected Version:15
Priority:P4
Status:Open
Resolution:Unresolved
Submitted:2020-05-26
Updated:2021-11-18
The Version table provides details related to the release that this issue/RFE will be addressed.
Unresolved : Release in which this issue/RFE will be addressed. Resolved: Release in which this issue/RFE has been resolved. Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.
This is one piece of puzzle to remove suspend flags.
Comments
DEFUNCT
30-09-2020
Coleen made me aware that there is no check of the _suspend_flags when transitioning from _thread_in_Java to _thread_in_native in generate_native_wrapper(). That means we could actually also send a signal to a thread in native if we see that it's in Java but it then transitions to native by the time we send the signal.
So the original patch that Robbin posted which removes the use of the _suspend_flags and just uses a handshake for the native case should be okay, since the flag is not really doing anything useful for the sampling "in Java" case.
28-09-2020
Thanks for clarifying.
I'm not too concerned that the target has to "process" the handshake after the sample, but obviously this is something that would have to be benchmarked (I don't know if we have anything that tries to measure samples-per-second?).
That said, simply disarming the poll and letting the target cleanup the "dead" op next time, doesn't sound difficult either.
23-09-2020
Yes, we don't know if the target is safe after returning from the async handshake. But if after that we read it's state and it's still in native then it is okay to do the sampling, since we know the target will stop if transitioning back to the VM. If it is in java we still use signals, I didn't change that. Here is the relevant part of the code:
JfrSamplingClosure *jfrscl = new JfrSamplingClosure(thread);
Handshake::execute((AsynchHandshakeClosure*)jfrscl, thread);
bool ret = false;
if (JAVA_SAMPLE == type) {
if (thread_state_in_java(thread)) {
ret = sample_thread_in_java(thread, frames, max_frames);
}
} else {
assert(NATIVE_SAMPLE == type, "invariant");
if (thread_state_in_native(thread)) {
ret = sample_thread_in_native(thread, frames, max_frames);
}
}
jfrscl->done();
The main difference I see as I mentioned earlier is that with handshakes the target will always have to execute and delete the async operation at some point. This is an extra overhead that we don't have today. For the native case it doesn't change much since the target will have to transition to the vm anyways, so we can do it then. But for the java case, after we signal the target back and it returns to java, next time it polls it will see there is a pending operation and it will have to go to the VM to execute and delete it. Maybe I could add a method to remove the async operation and disarm the poll of the target if we are in this case, before signaling back the target. Or even only disarming the poll should be okay. Then the target will remove the async operation on the next safepoint/handshake.
23-09-2020
When Handshake::execute returns you don't know that the target is blocked or safe as it is an async handshake.
I'm not sure what trying to use the handshake this way achieves in relation to the problem of granularity of sampling. The target can only be sampled when safe and it will only poll for handshakes at discrete points, so you don't get the same kind of sampling that you would with signals when in-java.
23-09-2020
Maybe using async handshakes it isn't that complex after all:
https://github.com/pchilano/jdk/commit/a92fa514a12cf743778ff6d5adb35cd17a03ce82
So JfrThreadSampleClosure::do_sample_thread() remains pretty much the same. We replace setting the flag by adding the async handshake. Then we just check if the thread is still either in java or in native and proceed if true as before.
One difference I see with the current code is that the thread always has to poll for the handshake and delete the operation. In the current code the thread might not realize it was being sampled, for example if it was still in native when the sampling ended, or after returning from the signal handler back to java it won't have to return to the VM.
22-09-2020
I hadn't realized the signalling was only used for the in-Java case. In that case it is used purely for finer resolution in the sampling and polling using handshakes would not be a suitable replacement as you note.
You need a mechanism to stop the thread in-native from reentering the VM during sampling - which is logically equivalent to suspending the thread but is not actually done via the exposed suspend/resume mechanism IIUC. So if you split this out from the suspend flags you will just have to add a new mechanism to support this. That can be encapsulated by a handshake operation that must be executed by the target thread but you will still need to introduce a queryable thread-state to know when it is safe to sample and when a signal is needed. This may well be even more complicated than the current scheme.
22-09-2020
One way to sample a thread in native could be to use async handshakes. The jfrsampler would add the operation to the handshake queue which would only be executed by the target when it transitions back. The jfrsampler after adding the operation would have to verify that the thread is still in native though (otherwise before adding the operation to the queue the thread could have already transitioned. In that case there would have to be a way to cancel the operation). The operation would probably be just wait on a monitor until notified, which the jfrsampler would do after sampling.
Notice that in the current code the signaling mechanism is only used to sample a thread in java, it is not used for the "in native" case. For native is not needed because the thread is already safe. The jfrsampler code in question is in JfrThreadSampleClosure::do_sample_thread():
thread->set_trace_flag(); // Provides StoreLoad, needed to keep read of thread state from floating up.
if (JAVA_SAMPLE == type) {
if (thread_state_in_java(thread)) {
ret = sample_thread_in_java(thread, frames, max_frames); // signal + sampling
}
} else {
assert(NATIVE_SAMPLE == type, "invariant");
if (thread_state_in_native(thread)) {
ret = sample_thread_in_native(thread, frames, max_frames); // only sampling
}
}
clear_transition_block(thread); // clear the trace flag and notify the target if blocked
So after setting the _trace_flag bits in the _suspend_flags, the jfrsampler reads the state of the thread. If it's still in native it can proceed directly to do the sampling, since the thread will block if transitioning back to the VM. If the thread is still in java, the sampler sends a signal to it and then starts the sampling. In that case the thread could have already transitioned from java but at least we know it will not be in native because we already set the _suspend_flags, so it would have transitioned and blocked first. (I think we want to avoid at all costs to send a signal when the thread is in native).
The original purpose of this change was to remove the dependency of the jfrsampler on the _suspend_flags (so we can remove it and simplify thread transitions). But even if we do that for the "in native" case, that flag is also needed for sampling in the "in java" case since it avoids a transition of the target all the way up to native right before sending the signal to it. Now, we could also think of a way of using handshakes for the "in java" case and remove the need to use the _suspend_flags, but here is where the other issue kicks in: the solution would still require to use a signal to stop the target, unless we accept that the sampling will only be done at polling sites. If we would still have to use signals, then I'm inclined to keep the current mechanism since it seems the handshake alternative might be even more complex, which would defeat the purpose of the change. It seems the suspende/resume protocol and delivering asynchronous exceptions (ThreadDeath) are still good candidates to use handshakes instead of using _suspend_flags, so the transition code can still be simplified.
21-09-2020
How can you sample a thread "in native" without using signals to suspend it? Anything using handshakes has to effectively be cooperative and can only be done when the target tries to reenter the VM.