JDK-8257602 : Introduce JFR Event Throttling and new jdk.ObjectAllocationSample event (enabled by default)
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: jfr
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2020-12-02
  • Updated: 2022-03-11
  • Resolved: 2020-12-10
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 16: 16 b28 (Fixed)
Related Reports
Duplicate :  
Relates :  
Relates :  
Sub Tasks
JDK-8262902 :  
Description
This enhancement tracks the work resulting from a collaboration between DataDog and Oracle, building on the POC originally suggested by [~jbachorik] @DataDog:
https://mail.openjdk.java.net/pipermail/hotspot-jfr-dev/2020-September/001754.html

Motivation
In JFR, today, there exist two memory allocation profiling events: jdk.ObjectAllocationInNewTLAB for allocations inside thread-local allocation buffers (TLABs) and jdk.ObjectAllocationOutsideTLAB for allocations outside them. These events are quite useful both for allocation profiling and for TLAB tuning.
They are, however, quite hard to reason about in terms of data production rates, and in certain edge cases both the overhead and the data volume can be quite high. Because of this, the events are disabled in the default JFR configuration (default.jfc) and only enabled in the profiling JFR configuration (profile.jfc).

Since object allocation is one of the most important aspects of understanding Java application performance, and especially so for always-on production-time profiling, arguably the most important domain for JFR, these are quite serious drawbacks.
It would be very convenient if information about object allocations could be turned on by default, out of the box, as it would provide insight into allocation patterns for any Java application.

Background
There are two reasons why it has not yet been possible to have the existing allocation events enabled by default:

Huge, non-deterministic number of events
The sheer number of events recorded is a function of the allocation pressure in an application, and in general, Java applications allocate a lot of objects. Although the JFR engine has no trouble keeping up, when enabled these two event types account for almost 75-80% of the entire set of events recorded. Compared to not having the events turned on, recording files on disk can easily grow 5-6x in size. Such large files quickly become unwieldy, especially in situations where they need to be moved somewhere.

Performance overhead (small)
Although the event sites hook into the slow paths of object allocation in the JVM, in a regular Java application even the slow paths are heavily trafficked. Understandably, this is a very critical path, and it is important that overhead is reduced to an absolute minimum. Since arguably the most important piece of profiling an allocation is a stack trace pinpointing where it originated, capturing stack traces is central to these events. Capturing stack trace information is very fast in JFR, but it is still one of the most performance-sensitive operations. It can quickly introduce unwanted overhead, both from the sheer number of frames to iterate and from contention on hash tables as concurrency increases. Normally this is not a problem for other JFR event types, but for events that sit in critical paths, it is something that needs to be considered.

Solution
JFR has historically had a problem with unregulated event data sets because, up to this point, there has not existed an adequate (that is, performant, reliable and representative) means to sub-sample, or throttle, the emission rate for instant events. This is one of the main reasons so much care goes into deciding the parameters for the default configuration (i.e. default.jfc): it has to be universally acceptable, even in anomalous environments and situations.
Granted, the concept of a threshold exists, reified as the threshold setting in configuration files, and it acts somewhat like a throttle in that it limits the events recorded to only those above the threshold. Unfortunately, a threshold setting can only be configured for duration events and, perhaps more importantly, it will only record events considered to be outliers.
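In a .jfc file the two settings look superficially similar but select very different subsets. A sketch of how the contrast might appear in a configuration file (the event bodies here are illustrative examples, not copied from the shipped default.jfc):

```xml
<!-- threshold: duration events only; records outliers that last longer than the cutoff -->
<event name="jdk.JavaMonitorWait">
  <setting name="enabled">true</setting>
  <setting name="threshold">20 ms</setting>
</event>

<!-- throttle: caps the emission rate while keeping the recorded subset representative -->
<event name="jdk.ObjectAllocationSample">
  <setting name="enabled">true</setting>
  <setting name="throttle">100/s</setting>
</event>
```

The threshold filter biases the data set toward slow outliers; the throttle aims for an unbiased, rate-limited sample.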

We would like a general mechanism that can record subsets of any event type, of configurable sizes, one that gives not only outliers but a statistically representative sample.

Introducing JFR Event Throttling

Throttle event setting:

<throttle>100/s</throttle>

A new configurable setting is made available in the .jfc files and in the settings system in general. It takes expressions to evaluate for instructions on how to select a subset from the full set of events of an event type. For now, only a single expression form is supported, one that expresses a rate (see the example above). The expression states the number of events per time unit, and JFR will, for this specific example, throttle the event emission rate to 100 per second, distributed evenly over time. Intuitively, the expression declares how many events per second we are aiming for.
Casually, we call this "throttling" and say the event type is "throttled" when it is configured with this setting. Note that it is very likely that the actual rate will be much lower than the target, simply because few or no events are being generated in the system. More importantly, from the perspective of improved determinism and control, JFR will not produce a rate higher than the expression, no matter the overall event pressure; hence the expression acts as a maximum rate.

All existing time units will be supported, e.g.: 2/ns, 5/us, 1/ms, 100/s, 600/m, 3600/h, 86400/d. Note that for the initial introduction, only unit times are supported. In the future, additional support can be added, if needed, to also support time coefficients, for example 600/5m.

As part of this enhancement, only a single event, the new jdk.ObjectAllocationSample event, will support the new throttle setting; applying it to other event types does nothing.
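The throttle setting can also be applied programmatically through the standard jdk.jfr API. A minimal sketch, assuming a JDK 16+ runtime (the file name and the allocation loop are illustrative):

```java
import java.nio.file.Path;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordingFile;

public class ThrottleDemo {
    public static void main(String[] args) throws Exception {
        Path dump = Path.of("alloc-samples.jfr");
        try (Recording recording = new Recording()) {
            // Enable the new event and cap its emission rate, mirroring
            // the <throttle>100/s</throttle> setting in a .jfc file.
            recording.enable("jdk.ObjectAllocationSample")
                     .with("throttle", "100/s");
            recording.start();

            // Generate allocation pressure so some samples are taken.
            byte[][] sink = new byte[1024][];
            for (int i = 0; i < 1_000_000; i++) {
                sink[i % sink.length] = new byte[1024];
            }
            recording.stop();
            recording.dump(dump);
        }

        // Count the throttled samples; regardless of allocation pressure,
        // the emission rate stays at or below the configured maximum.
        long samples = RecordingFile.readAllEvents(dump).stream()
                .filter(e -> e.getEventType().getName()
                              .equals("jdk.ObjectAllocationSample"))
                .count();
        System.out.println("ObjectAllocationSample events: " + samples);
    }
}
```

Note that `Recording.enable(String)` returns an `EventSettings` object, so arbitrary settings such as `throttle` can be chained with `with(name, value)`.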

JFR Adaptive Sampler:

The implementation of the sampler in the JVM is key to enabling this functionality. The adaptive sampler is highly performant and general enough that additional specializations can be built on it moving forward. We add one specialized concept, the JFR Event Throttler, the component that evaluates sample-set inclusion for event types configured with the <throttle> setting, with the JFR Adaptive Sampler providing the indicator function.
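To make the idea concrete, here is a heavily simplified Java sketch of a windowed adaptive sampler. This is a hypothetical illustration of the general technique, not the actual HotSpot implementation: time is divided into fixed windows, each window gets a sample budget derived from the target rate, and the inclusion probability is recomputed from the population observed in the previous window so the sample stays representative.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class AdaptiveSampler {
    private final long budgetPerWindow;          // max samples per window
    private final long windowNanos;              // window duration
    private volatile double probability = 1.0;   // current inclusion probability
    private final AtomicLong population = new AtomicLong();
    private final AtomicLong sampled = new AtomicLong();
    private volatile long windowEnd;

    public AdaptiveSampler(long samplesPerSecond) {
        this.windowNanos = 100_000_000L;         // 100 ms windows
        this.budgetPerWindow = Math.max(1, samplesPerSecond / 10);
        this.windowEnd = System.nanoTime() + windowNanos;
    }

    /** Indicator function: should this event be included in the sample set? */
    public boolean sample() {
        long now = System.nanoTime();
        if (now >= windowEnd) {
            rotateWindow(now);
        }
        population.incrementAndGet();
        // Probabilistic inclusion keeps the subset representative;
        // the budget check enforces the hard per-window cap.
        if (ThreadLocalRandom.current().nextDouble() < probability
                && sampled.get() < budgetPerWindow) {
            sampled.incrementAndGet();
            return true;
        }
        return false;
    }

    private synchronized void rotateWindow(long now) {
        if (now < windowEnd) return;             // another thread already rotated
        long seen = population.getAndSet(0);
        sampled.set(0);
        // Aim so that ~budgetPerWindow of the next window's expected
        // population (estimated from the last window) is selected.
        probability = seen == 0 ? 1.0
                : Math.min(1.0, (double) budgetPerWindow / seen);
        windowEnd = now + windowNanos;
    }
}
```

Under high event pressure the probability drops toward budget/population; under low pressure it rises back toward 1.0, so sparse periods are fully recorded.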

Recording only a throttled / sub-sampled set of events is very useful, especially for events located in critical paths (allocations, locks and more), because it becomes possible to regulate and control both size and overhead. The size of a subset can then be configured based on requirements, and also modified dynamically as the system progresses. With this mechanism, it is possible to introduce a new allocation event that can be enabled by default (precisely because it is throttled).

The jdk.ObjectAllocationSample event definition:

  <Event name="ObjectAllocationSample" category="Java Application" label="Allocation sample" description="Allocation sample" thread="true" stackTrace="true" startTime="false" throttle="true">
    <Field type="Class" name="objectClass" label="Object Class" description="Class of allocated object" />
    <Field type="ulong" contentType="bytes" name="weight" label="Sample Weight" description="An attribute to facilitate the relative comparison of samples, not necessarily the memory amount allocated by the sampled object" />
  </Event>
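The event can be consumed live with the standard jdk.jfr.consumer streaming API. A sketch, assuming a JDK 16+ runtime (the class name, sleep duration and allocation loop are illustrative); summing the weight field gives a relative per-class estimate of allocation volume that compensates for the throttling:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import jdk.jfr.consumer.RecordingStream;

public class AllocationWatcher {
    public static void main(String[] args) throws Exception {
        Map<String, Long> weightByClass = new ConcurrentHashMap<>();
        try (RecordingStream rs = new RecordingStream()) {
            rs.enable("jdk.ObjectAllocationSample").with("throttle", "100/s");
            rs.onEvent("jdk.ObjectAllocationSample", event -> {
                String type = event.getClass("objectClass").getName();
                // "weight" facilitates relative comparison of samples,
                // so summing it per class approximates where allocation
                // pressure comes from.
                weightByClass.merge(type, event.getLong("weight"), Long::sum);
            });
            rs.startAsync();

            // ... the application runs; here we just allocate briefly.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) sb.append('x');
            Thread.sleep(2000);
        }
        weightByClass.forEach((k, v) ->
                System.out.println(k + " ~ weight " + v));
    }
}
```

`RecordingStream.startAsync()` runs the stream on a background thread, so the sampling adds no blocking work to the application's allocation sites.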

Disabling jdk.ObjectAllocationInNewTLAB and jdk.ObjectAllocationOutsideTLAB in the profile.jfc:

With the new allocation event enabled in both default.jfc and profile.jfc, we also take the opportunity to disable jdk.ObjectAllocationInNewTLAB and jdk.ObjectAllocationOutsideTLAB in the profile.jfc, since their inclusion has led to reports of very large recording files, which makes working with the profile configuration cumbersome. The events are still available, but they will not be turned on by the configuration files shipped with the JDK.


Comments
I reject the backport request to 11u, at least for the time being. Find my reasoning here: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-December/004519.html Please, nevertheless, be invited to discuss this further with the experts and re-request a backport approval should there eventually be consensus for allowing this in to 11u (and the backport is properly reviewed and tested, of course).
23-12-2020

I understand the reason for being cautious and waiting a bit to have the change 'baked in'. But I would argue that backporting this to an LTS release will be necessary - currently, most of our customers need to disable the TLAB events because they cause too much overhead. Upgrading from an LTS to a non-LTS release might not be an option for many of them, and the next LTS where this feature is available is almost a year away, not even mentioning the time for uptake (which might easily be another year). Also, I am pretty sure this is the situation for other vendors building on top of JFR, where enabling the TLAB events pushes the overhead above the declared limit for any cloud-scale deployment.

EDIT 1: I have run a preliminary benchmark to spot any obvious performance degradation with the object allocation sampling event. I used the SPECjvm2008 'xml' benchmarks, which are pretty heavy on object allocation, and collected the benchmark results as well as the corresponding recordings. The results are available as a zip archive (https://drive.google.com/file/d/1pluGw1Jgbb3iPhehGHUS00y55fKwZDmR/view?usp=sharing) containing the following folders:
* SPECjvm2008.011 for the JDK11u trunk build and a JFR recording with 'profile' settings
* SPECjvm2008.012 for JDK11u with the backport of this feature and a JFR recording with 'profile' settings
* SPECjvm2008.013 for the JDK11u trunk and no JFR recording

There are 2 recordings attached:
* Original JDK11u (https://drive.google.com/file/d/1ShpAX7hNLJvJ5ttQLcg2H23MEd7p2CPo/view?usp=sharing)
* Patched JDK11u (https://drive.google.com/file/d/1WL6LsicPmRi_p4nN_M_R05TjAb5h1kme/view?usp=sharing)

The summary of the benchmark is that using the object allocation sample event does not hurt performance (the overall execution time diff is within 1% of the base execution time, with the object allocation sample run being slightly more performant - but that might be just noise) and the recording size is reduced by ~50%.
21-12-2020

This is a rather large enhancement, and it changes code in the allocation path, which is critical to all Java applications. We weren't even sure we wanted to bring it into JDK 16 this late in the release. The plan is to disable the event during ramp-down if we run into issues during stress testing. Furthermore, it changes metadata (adds a new event type and a new settings type) and modifies the event configurations (.jfc), which will likely surprise users and may cause problems for tools consuming the data. We also want user feedback on the sampling. For example, is the default rate reasonable, and do we get a statistically accurate representation? This is not what I would expect to get in a security update, and especially not in the current state.
18-12-2020

[11u] Fix Request Please consider approving the backport of this improvement to JDK 11u.

Justification: The improvement allows using allocation profiling capabilities even in pretty busy environments where the current direct TLAB-event-based approach does not work, because the sheer number of those events puts a lot of pressure on the profiled application and increases the recording size. For more details please see the original description.

Risk: The newly introduced code is rather isolated and does not affect other parts of the JVM. The original TLAB events will behave exactly the same, and there are no new hooks introduced in the TLAB processing path. The only risky part is the new throttling mechanism not behaving as expected, although we cover the edge cases in tests, so that is rather unlikely. The additional overhead of sampling is minimal because it happens on an already slow path, and the latency introduced by the sample check (basically a number comparison in the common case) is only a very small fraction of the TLAB retirement processing or an out-of-TLAB allocation.

RFR: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-December/004412.html
15-12-2020

Changeset: 502a5241 Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2020-12-10 12:33:48 +0000 URL: https://git.openjdk.java.net/jdk/commit/502a5241
10-12-2020