Bug ID: JDK-8321098 Cooperative JFR Sampling

Versions (Unresolved/Resolved/Fixed)

The Version table provides details about the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 25: 25 (Resolved)

Sampling stacks from safepoints is a very safe way of sampling. All components of the JVM have been designed to be able to walk stacks from safepoints, and by walking only frames that are walkable, we are immune to trouble caused by guessed methods that quack like methods from a genuine stack trace and walk like methods from a genuine stack trace, but potentially explode later on due to use-after-free.

The classic problem with safepoint-based sampling is safepoint bias. Essentially, the trouble is that samples come from points where we poll for safepoints, which might be many bytecodes away from where we were really spending statistically significant time.

I propose a hybrid safepoint and signal based solution. The idea is that we still shoot a signal at a samplee thread, but in the signal handler we only record the SP and PC of the thread. The request is then enqueued to be sampled on that same thread at a subsequent safepoint poll site. When we get to that poll site, we check whether the recorded PC is from an nmethod. If it is, we can recreate the exact stack trace that we would normally have reported from the signal handler, from the safepoint poll site instead. So when we hit compiled methods, we get the benefits of signal based accuracy, combined with the fundamental safety of having the entire stack trace walked from a safe, walkable point in the JVM.

When the sampled PC isn't coming from an nmethod, I propose we perform the stack trace completely from the safepoint. As for any safepoint bias from the interpreter, it's rather straightforward to simply poll for safepoints in the dispatch loop of the interpreter, which eliminates safepoint bias as a problem for interpreted code. The original proposed patch for thread-local handshakes did exactly that and it worked absolutely fine.
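The flow described above can be sketched as a toy, single-threaded model in Java. This is only an illustration of the control flow, not the prototype and not HotSpot code: the class `CooperativeSamplerSketch`, the `NMethod` record (a code range standing in for a compiled method), the `Origin` enum, and both method names are hypothetical, and only the top-frame PC is modelled rather than a full stack walk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy, single-threaded model of the hybrid scheme. Every name here is
// hypothetical; none of this is HotSpot code or a JFR API.
final class CooperativeSamplerSketch {

    // All the signal handler captures: stack pointer and program counter.
    record SampleRequest(long sp, long pc) {}

    // Stand-in for a compiled method occupying the code range [start, end).
    record NMethod(String name, long start, long end) {
        boolean contains(long pc) { return pc >= start && pc < end; }
    }

    // Where the reported top frame came from.
    enum Origin { SIGNAL_PC, SAFEPOINT_POLL }

    // Only the top-frame PC is modelled; a real sampler would walk the stack.
    record Sample(Origin origin, long topFramePc) {}

    private final List<NMethod> codeCache;
    private final Queue<SampleRequest> pending = new ArrayDeque<>();

    CooperativeSamplerSketch(List<NMethod> codeCache) {
        this.codeCache = List.copyOf(codeCache);
    }

    // Signal step: record SP and PC, enqueue, and return without walking.
    // (A real handler would need an async-signal-safe queue; ArrayDeque is not.)
    void onSignal(long sp, long pc) {
        pending.add(new SampleRequest(sp, pc));
    }

    // Poll step: runs on the sampled thread at its next safepoint poll site.
    List<Sample> onSafepointPoll(long pollSitePc) {
        List<Sample> samples = new ArrayList<>();
        SampleRequest request;
        while ((request = pending.poll()) != null) {
            long recordedPc = request.pc();
            boolean inNMethod = false;
            for (NMethod m : codeCache) {
                if (m.contains(recordedPc)) { inNMethod = true; break; }
            }
            // Compiled code: report the trace as of the recorded PC.
            // Anything else: fall back to a trace taken at the poll site.
            samples.add(inNMethod
                    ? new Sample(Origin.SIGNAL_PC, recordedPc)
                    : new Sample(Origin.SAFEPOINT_POLL, pollSitePc));
        }
        return samples;
    }

    public static void main(String[] args) {
        CooperativeSamplerSketch sampler = new CooperativeSamplerSketch(
                List.of(new NMethod("compiledLoop", 0x1000L, 0x2000L)));
        sampler.onSignal(0x7000L, 0x1234L); // PC inside the nmethod
        sampler.onSignal(0x7000L, 0x9000L); // PC outside any nmethod
        sampler.onSafepointPoll(0x1ff0L).forEach(System.out::println);
    }
}
```

The point of the split is that the signal step does nothing unsafe: it only copies two words. All stack walking happens in the poll step, where the thread is at a walkable point.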

I have a prototype for the suggested changes available here: https://github.com/fisk/jdk/tree/jfr_safe_trace_v1

Implementation of this idea is coming together. We will be able to use this mechanism for the interpreter as well, without having to poll for safepoints in the dispatch loop. Additionally, besides a more robust sampling mechanism, we are opening up some very interesting feature capabilities. One of them is the ability to sample and measure safepoint latency: we will be able to measure how long it takes for a thread to reach its next safepoint poll site. This is interesting information that might change how we consider placements for safepoint poll instructions. Another feature is that we can sample and measure how long a thread is executing non-Java code, in both wall-clock and CPU time. We are also looking into how we can use this capability to sample and measure how long virtual threads stay pinned. I am attaching a document draft describing the suggested solution.
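As a rough illustration of the safepoint latency idea, one could timestamp each sample request when it is recorded and subtract that from the time at which the thread reaches its poll site. The sketch below assumes exactly that and nothing more; `SafepointLatencySketch` and its methods are hypothetical names, timestamps are passed in explicitly, and it is not taken from the attached draft or from any JFR event definition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model: latency = time at poll site - time the request was recorded.
final class SafepointLatencySketch {

    // Timestamps (in nanoseconds) of sample requests not yet serviced.
    private final Queue<Long> requestTimesNanos = new ArrayDeque<>();

    // Called when the sample request is recorded (the signal step).
    void onSignal(long nowNanos) {
        requestTimesNanos.add(nowNanos);
    }

    // Called when the thread reaches its next safepoint poll site.
    // Returns one latency per pending request, oldest request first.
    List<Long> onSafepointPoll(long nowNanos) {
        List<Long> latenciesNanos = new ArrayList<>();
        Long requestedAt;
        while ((requestedAt = requestTimesNanos.poll()) != null) {
            latenciesNanos.add(nowNanos - requestedAt);
        }
        return latenciesNanos;
    }

    public static void main(String[] args) {
        SafepointLatencySketch sketch = new SafepointLatencySketch();
        sketch.onSignal(System.nanoTime());
        // Work done by the thread before it reaches a poll site would go here.
        System.out.println(sketch.onSafepointPoll(System.nanoTime()));
    }
}
```

Passing the clock value in as a parameter keeps the model deterministic; a real implementation would read whatever clock the sampler already uses.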

25-10-2024

I also worked on this and have a more visual explanation at: https://mostlynerdless.de/blog/2023/08/10/taming-the-bias-unbiased-safepoint-based-stack-walking/

04-12-2023

Blocks :	JDK-8302350 - JfrThreadSampler failed with "assert((is_native() && bci == 0) || (!is_native() && 0 <= bci && bci < code_size())) failed: illegal bci: 0 for non-native method"
Duplicate :	JDK-8352251 - Implement JEP 518: JFR Cooperative Sampling
Relates :	JDK-8168445 - make pd_get_top_frame_for_profiling more robust
Relates :	JDK-8326236 - assert(ce != nullptr) failed in Continuation::continuation_bottom_sender
Relates :	JDK-8350338 - JEP 518: JFR Cooperative Sampling
Relates :	JDK-8170152 - WhiteBox testing of pd_get_top_frame_for_profiling