Bug ID: JDK-8288663 JFR: Disabling the JfrThreadSampler commits only a partially disabled state

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 11,17,18,19,20

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2022-06-17
Updated: 2024-02-22
Resolved: 2022-06-20

JDK 17	JDK 19	JDK 20
17.0.11Fixed	19Fixed	20 b03Fixed

void JfrThreadSampling::set_sampling_interval(bool java_interval, size_t period) {
  size_t interval_java = 0;
  size_t interval_native = 0;
  if (_sampler != NULL) {
    interval_java = _sampler->get_java_interval();
    interval_native = _sampler->get_native_interval();
  }
  if (java_interval) {
    interval_java = period;
  } else {
    interval_native = period;
  }
  if (interval_java > 0 || interval_native > 0) {
    if (_sampler == NULL) {
      log_trace(jfr)("Creating thread sampler for java:%zu ms, native %zu ms", interval_java, interval_native);
      start_sampler(interval_java, interval_native);
    } else {
      _sampler->set_java_interval(interval_java);
      _sampler->set_native_interval(interval_native);
      _sampler->enroll();
    }
    assert(_sampler != NULL, "invariant");
    log(interval_java, interval_native);
  } else if (_sampler != NULL) {
    _sampler->disenroll();
  }
}

The _sampler run state is controlled by the java interval and the native interval parameters (milliseconds).

If either is > 0, the sampler is to run. If both are 0, the sampler is disabled and should not run.

Let us say the sampler instance is currently running, with a java interval of 20 ms and a native interval of 20 ms.

The last recording is stopping, so the sampler is to be disabled. This is accomplished by setting first the java interval to 0 ms and then subsequently the native interval to 0 ms.

First, the java_interval comes in as 0. The code reads the current values of the _sampler instance, both for the java interval and the native interval. Again, if either value is > 0, the sampler is running or should run. The updated java interval value is committed as 0 onto the _sampler instance via _sampler->set_java_interval(interval_java).

Current state: java_interval = 0ms, native_interval = 20 ms. The sampler is still running.

Then comes the update to the native interval, for 0 ms.

The code reads the current field values of the sampler, 0 ms for the java interval, and 20 ms for the native interval. After reading 20 ms as the current value of the native interval, the code overwrites that local variable with the incoming value, so it is now 0 ms.

In the test, neither value is > 0, so the sampler state is not updated with the incoming 0 ms for the native interval, as it was for the java interval. Instead, the code branches to the _sampler->disenroll(), which stops the sampler.

At this point, the sampler is stopped, just as intended, because the last recording stopped.

But the _sampler instance state looks like this:

_java_interval = 0 ms
_native_interval = 20 ms

Let us now say another recording starts, but one that does not enable java execution sampling nor native method sampling. After the recording is started, the setting system will pass in disabled values for the java and the native intervals, both set to 0 ms.

First, the java interval: The code reads the current value of the _sampler instance, 0 ms for the Java interval, and 20 ms for the native interval. In the evaluation, this looks like the sampler is running because of the 20 ms native interval. The java interval is set, and the sampler is enrolled (i.e. started) when it should not.

Then comes the disabled native interval value of 0 ms. The system reads the current values from the sampler instance, 0 ms for the Java interval and 20 ms for the native interval. Again, it overwrites the native interval value with 0 ms, so it looks like the sampler is not running (although it is). The code then calls disenroll, which stops the sampler.

Still the _sampler instance state looks like this:
_java_interval = 0 ms
_native_interval = 20 ms

The sampler is eventually stopped (as intended when disabling). However, it was running for a brief period, resulting in native method sampling events generated in a situation where they should not.

Fix Request (17u): On JDK 17 JFR sometimes continues to sample threads after recordings are closed, which is quite unxpected. It looks like "Enrolling thread sampler" immediately after "Disenrolling thread sampler". PR: https://git.openjdk.org/jdk17u-dev/pull/1986 Testing: tier1, tier2, jdk_jfr, gc/stress/jfr on (linux-aarch64 fastdebug).
28-11-2023
A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1986 Date: 2023-11-24 14:24:25 +0000
24-11-2023
The fix for this bug is integrated in jdk-20+3-87.
21-06-2022
Changeset: 0408f9c5 Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2022-06-20 14:24:35 +0000 URL: https://git.openjdk.org/jdk/commit/0408f9c5436c96128bfe1264faf9fe3f9668f492
20-06-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk19/pull/45 Date: 2022-06-20 14:15:54 +0000
20-06-2022
Changeset: a7df5a40 Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2022-06-20 13:30:19 +0000 URL: https://git.openjdk.org/jdk/commit/a7df5a40639a4d3138616c9fc1b144381240d2e5
20-06-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9199 Date: 2022-06-17 13:04:49 +0000
17-06-2022

Duplicate :	JDK-8287832 - jdk/jfr/event/runtime/TestActiveSettingEvent.java failed with "Expected two batches of Active Setting events"
Relates :	JDK-8288846 - misc tests fail "assert(ms < 1000) failed: Un-interruptable sleep, short time use only"
Relates :	JDK-8295274 - HelidonAppTest.java fails "assert(event->should_commit()) failed: invariant" from compiled frame"