Bug ID: JDK-8350338 JEP 518: JFR Cooperative Sampling


Type: JEP
Component: hotspot
Sub-Component: jfr

Priority: P2
Status: Closed
Resolution: Delivered
Fix Versions: 25

Submitted: 2025-02-19
Updated: 2025-06-10
Resolved: 2025-06-10

Related Reports

Relates :

JDK-8321098 - Cooperative JFR Sampling

Sub Tasks

JDK-8352251 - Implement JEP 518: JFR Cooperative Sampling - Resolved

Description

Summary
-------

Improve the stability of the JDK Flight Recorder (JFR) when it asynchronously samples Java thread stacks. Achieve this by walking call stacks only at safepoints, while minimizing safepoint bias.

Motivation
----------

A running program consumes computational resources such as memory, CPU cycles, and elapsed time. To _profile_ a program is to measure the consumption of such resources by specific elements of the program. A profile might indicate that, e.g., one method consumes 20% of a resource, while another consumes only 0.1%.

Profiling can help make a program more efficient, and developers more productive, by identifying which program elements to optimize. Without profiling, we might optimize a method that was consuming few resources to begin with, having little impact on the program's overall performance while wasting effort. For example, optimizing a method that takes 0.1% of the program's total execution time to run ten times faster will only reduce the program's execution time by 0.09%.
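The arithmetic behind that figure can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this example, and the calculation assumes the rest of the program's run time is unaffected by the optimization.

```java
public class SpeedupEstimate {

    /**
     * Fraction of total run time saved when a part of the program that
     * accounts for `fraction` of the run time becomes `speedup` times faster.
     * The optimized part still costs fraction / speedup, so the saving is
     * fraction - fraction / speedup.
     */
    static double timeSaved(double fraction, double speedup) {
        return fraction * (1.0 - 1.0 / speedup);
    }

    public static void main(String[] args) {
        // The example from the text: a method at 0.1% of run time, ten times faster.
        System.out.printf("cold method: %.2f%% saved%n", timeSaved(0.001, 10.0) * 100);
        // For contrast, a method at 20% of run time with the same speedup.
        System.out.printf("hot method:  %.2f%% saved%n", timeSaved(0.20, 10.0) * 100);
    }
}
```

With the same tenfold speedup, the method that accounts for 20% of the run time yields a saving two hundred times larger, which is why a profile is worth consulting before choosing what to optimize.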

JFR, the JDK Flight Recorder, is the JDK's profiling and monitoring facility. The core of JFR is a low-overhead mechanism for recording events emitted by the HotSpot JVM or by program code. Some events, such as loading a class, are recorded whenever an action occurs. Others, such as those used for profiling, are recorded by statistically sampling the program's activity as it consumes a resource. The various JFR events can be turned on or off, allowing a more detailed, higher-overhead collection of information during development and a less detailed, lower-overhead collection of information in production.

JFR can create an execution-time profile that shows which program elements consume significant elapsed real time, i.e., wall-clock time. It does this by sampling the execution stacks of program threads at fixed intervals of, say, 20 milliseconds. Each sample produces a JFR event containing a stack trace. Tools such as [jfr](https://docs.oracle.com/en/java/javase/24/docs/specs/man/jfr.html) and [JDK Mission Control](https://www.oracle.com/java/technologies/jdk-mission-control.html) can summarize a stream of such events into a textual or graphical profile.
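As a rough illustration of how such events can be collected and read programmatically, the sketch below uses the public `jdk.jfr` API to enable the `jdk.ExecutionSample` event with a 20 millisecond period, run a workload, and print the top frame of each sample. The class name, the `busyWork` workload, and the helper methods are hypothetical and exist only for this example. Because sampling is statistical, the number of samples varies from run to run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import jdk.jfr.consumer.RecordingFile;

public class ExecutionSampleDemo {

    // Hypothetical workload: spins for roughly `millis` ms so the sampler can observe it.
    static long busyWork(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long acc = 0;
        while (System.nanoTime() < deadline) {
            acc = acc * 31 + 7;
        }
        return acc;
    }

    // Records events while the workload runs, dumps them to a temp file, and reads them back.
    static List<RecordedEvent> recordSamples(long workMillis) {
        try {
            Path file = Files.createTempFile("execution-samples", ".jfr");
            try (Recording recording = new Recording()) {
                recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(20));
                recording.start();
                busyWork(workMillis);
                recording.stop();
                recording.dump(file);
            }
            List<RecordedEvent> events = RecordingFile.readAllEvents(file);
            Files.deleteIfExists(file);
            return events;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns "Class.method" for the top frame, or a placeholder if no stack was recorded.
    static String topFrame(RecordedEvent event) {
        RecordedStackTrace trace = event.getStackTrace();
        if (trace == null || trace.getFrames().isEmpty()) {
            return "<no stack trace>";
        }
        RecordedFrame top = trace.getFrames().get(0);
        return top.getMethod().getType().getName() + "." + top.getMethod().getName();
    }

    public static void main(String[] args) {
        for (RecordedEvent event : recordSamples(500)) {
            // Other event types may appear in the file, so filter by name.
            if (event.getEventType().getName().equals("jdk.ExecutionSample")) {
                System.out.println(event.getStartTime() + " " + topFrame(event));
            }
        }
    }
}
```

Counting how often each top frame appears gives a crude version of the profile that the tools mentioned above present in textual or graphical form.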

In order to produce a stack trace for a program thread, JFR's sampler thread must suspend the target thread and parse the call frames on the stack. The HotSpot JVM maintains metadata to guide the parsing of stack frames, but that metadata is valid only when a thread is suspended at well-defined code locations known as _safepoints_. If we sample stacks only at safepoints, however, then we will likely suffer from the _safepoint bias problem:_ We risk losing accuracy, since a frequently-executed span of code might not be anywhere near a safepoint. The safepoint bias problem is [well known](https://plv.colorado.edu/papers/mytkowicz-pldi10.pdf) and [thoroughly researched](https://stefan-marr.de/downloads/mplr23-burchell-et-al-dont-trust-your-profiler.pdf).

So as to avoid the safepoint bias problem, JFR samples the stacks of program threads asynchronously, suspending threads and parsing their stacks at code locations that are not necessarily safepoints. Since the metadata for parsing stack frames is not guaranteed to be valid at non-safepoints, JFR's sampler thread uses heuristics in order to generate a stack trace.

Unfortunately, these stack-parsing heuristics are inefficient and, worse, can crash the JVM when their results are incorrect. JFR attempts to prevent such crashes via platform-specific crash-protection mechanisms, but those mechanisms can fail in the presence of concurrent activity such as class unloading.

Description
-----------

We redesign JFR's sampling mechanism to avoid relying on risky stack-parsing heuristics. Instead, we parse thread stacks only at safepoints.

To avoid the safepoint bias problem, we take samples cooperatively. When it is time to take a sample, JFR's sampler thread still suspends the target thread. Rather than attempting to parse the stack, however, it just records the target's program counter and stack pointer in a _sample request_, which it appends to an internal thread-local queue. It then arranges for the target thread to stop at its next safepoint, and resumes the thread.

The target runs normally until its next safepoint. At that time, the safepoint handling code inspects the queue. If it finds any sample requests, then, for each one, it reconstructs a stack trace, adjusting for safepoint bias, and emits a JFR execution-time sampling event.
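The hand-off between the sampler thread and the target thread can be modeled in a few lines of Java. This is a toy model under stated assumptions, not HotSpot code: the real mechanism lives inside the JVM, and every name here (`CooperativeSamplingModel`, `TargetThread`, `SampleRequest`, `enqueueRequest`, `safepointPoll`) is hypothetical. The stack walk is replaced by a placeholder that merely formats the recorded program counter and stack pointer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class CooperativeSamplingModel {

    /** What the sampler captures while the target is briefly suspended. */
    record SampleRequest(long programCounter, long stackPointer) {}

    /** Hypothetical stand-in for a sampled Java thread; not a HotSpot type. */
    static final class TargetThread {
        private final ConcurrentLinkedQueue<SampleRequest> requests = new ConcurrentLinkedQueue<>();
        private final AtomicBoolean pollArmed = new AtomicBoolean(false);
        private final List<String> events = new ArrayList<>();

        /** Runs on the sampler thread: cheap, and no stack parsing happens here. */
        void enqueueRequest(long pc, long sp) {
            requests.add(new SampleRequest(pc, sp));
            pollArmed.set(true); // ask the target to stop at its next safepoint
        }

        /** Runs on the target thread each time it reaches a safepoint. */
        void safepointPoll() {
            if (!pollArmed.compareAndSet(true, false)) {
                return; // nothing was requested since the last poll
            }
            SampleRequest request;
            while ((request = requests.poll()) != null) {
                events.add(buildEvent(request));
            }
        }

        /** Placeholder for the stack walk and bias adjustment done at the safepoint. */
        private String buildEvent(SampleRequest request) {
            return String.format("ExecutionSample[pc=0x%x, sp=0x%x]",
                    request.programCounter(), request.stackPointer());
        }

        List<String> events() {
            return events;
        }
    }

    public static void main(String[] args) {
        TargetThread target = new TargetThread();
        // The sampler fires twice before the target reaches a safepoint.
        target.enqueueRequest(0x1000L, 0x7fff0000L);
        target.enqueueRequest(0x1040L, 0x7ffeffc0L);
        System.out.println("events before safepoint: " + target.events().size());
        // At the safepoint, the target drains its queue and emits one event per request.
        target.safepointPoll();
        target.events().forEach(System.out::println);
    }
}
```

The point of the model is the division of labor: the sampler side only records two numbers and sets a flag, while all event construction happens later on the target thread, at a location where stack metadata is valid.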

Aside from being safe, this approach has several other advantages:

- Creating a sample request requires hardly any work, and could be done in response to a hardware event or inside a signal handler.

- The code to create stack traces and emit events is simpler. For example, it can dynamically allocate memory when it runs on the target thread, which it could not do when running on the sampler thread.

- The sampler thread has less work to do, since it need not run heuristics, improving scalability.

This approach works well when the target thread is running Java code, whether interpreted or compiled, but not when the target thread is running native code. In that case, we continue to use the existing approach.

Future Work
-----------

Our new approach does not entirely avoid safepoint bias. In some situations, such as when sampling inside a method for which the HotSpot JVM has an intrinsic implementation, it may be impossible to parse the stack. In these cases, the recorded stack trace will reflect the last Java stack frame, thereby introducing some bias. We intend to address this in future work.

Alternatives
------------

The HotSpot JVM has an existing internal but unsupported mechanism, `AsyncGetCallTrace`, which some third-party tools use. Unfortunately, this mechanism relies on the same kind of risky stack-parsing heuristics that JFR uses today, but without any crash protection, so it is even riskier. Another drawback is that it is based on the POSIX `SIGPROF` signal, which has no equivalent on Windows.

Testing
-------

This is strictly an implementation change. Existing unit, integration, and stress tests will suffice.

Dependencies
------------

The implementation of [JEP 509 (JFR CPU-Time Profiling)](https://openjdk.org/jeps/509) leverages the mechanism introduced here.

Comments

Good. Thank you for answering my questions.

05-03-2025

Thanks Vladimir for your comments.

"In "Summary" mention that we sample Java code execution (interpreted and compiled)."

I made it more explicit that JFR asynchronous sampling only targets Java code execution (interpreted and compiled) - thank you.

"In "Summary" or "Goal" point that this new approach provides more accurate and detailed sampling."

In the first iteration of work in this area, described in this JEP, sampling accuracy does not improve much from what already exists, except perhaps that compiled frame prologues are better accounted for. The next stage of work in this area will specifically target "Sampling Accuracy."

"May be add to "Not Goal" that it does not provide profiling native (C++) code in HotSpot VM or other native libraries it calls. You mentioned it in "Problems and challenges"."

Done - thank you.

"I did not find description how JFR will handle a lot more new events publishing from sampled threads. Is such process synchronized between sampled threads? Where and how such events are collected and recorded?"

The actual number of events will increase only marginally (empirical evaluation using stress tests). This is because the interrupt sampling period is not changed. In general, the JFR infrastructure for writing events is very distributed; each thread writes its events lock-free to a thread local area, so there is no need to synchronize anything. The ExecutionSample event representing a sample will now be written by the sampled thread like any other event.

"Do you add new or reuse old JFR event types?"

We will continue to use the existing ExecutionSample event to represent samples.

"Do we need to update tools which read and show new JFR events?"

This is an implementation change only, with no compatibility impact on existing tools.

Thank you for your input. Cheers

05-03-2025

Very good description of implementation. Even I understand it ;) Thank you.

Small notes.

In "Summary" mention that we sample Java code execution (interpreted and compiled).

In "Summary" or "Goal" point that this new approach provides more accurate and detailed sampling.

May be add to "Not Goal" that it does not provide profiling native (C++) code in HotSpot VM or other native libraries it calls. You mentioned it in "Problems and challenges".

I did not find description how JFR will handle a lot more new events publishing from sampled threads. Is such process synchronized between sampled threads? Where and how such events are collected and recorded?

Do you add new or reuse old JFR event types?

Do we need to update tools which read and show new JFR events?

04-03-2025