JDK-8261238 : NMT should not limit baselining by size threshold
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-02-05
  • Updated: 2021-07-09
  • Resolved: 2021-04-26
Fixed Versions
  • JDK 11: 11.0.13 (Fixed)
  • JDK 17: 17 b20 (Fixed)
Description
NMT is a very useful tool for detecting memory leaks. It's easy to use, relatively cheap, and requires almost zero setup.

But its use for detecting "slow riser" leaks is limited because it omits small allocations from the output, rather arbitrarily. That introduces subtle errors: allocations below a certain threshold do not appear in the output at all; they appear only once the leak rises above the threshold, and then it looks as if the leak happened suddenly. This is an issue both for the absolute stats and for stats relative to an earlier baseline.

There are two thresholds:

(1) When collecting baseline information for a report, it omits call sites smaller than MemBaseline::SIZE_THRESHOLD, which is hard-coded to 1K.
(2) When printing summary information, it omits categories whose weight would be less than the unit the NMT report is displayed in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than the NMT scale.

Note that setting the scale to 1 on the jcmd command line effectively disables threshold (2), but (1) is still in place.
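
To illustrate threshold (2), here is a minimal sketch of the scale-based omission. It is a simplified model, not the HotSpot reporter code: the `Category` struct and `visible_categories` function are hypothetical names, and the sketch assumes a category is dropped when its size, truncated to the display unit, is zero.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of one summary category (not a HotSpot type).
struct Category {
    std::string name;
    std::size_t bytes;
};

// Returns the categories that would still be printed at the given scale.
// Assumption: a category is omitted when it is smaller than one display
// unit, i.e. bytes / scale == 0. `scale` must be non-zero.
std::vector<Category> visible_categories(const std::vector<Category>& all,
                                         std::size_t scale) {
    std::vector<Category> shown;
    for (const Category& c : all) {
        if (c.bytes / scale > 0) {
            shown.push_back(c);
        }
    }
    return shown;
}
```

In this model, a scale of 1 keeps every category with a non-zero size, which matches the note that scale=1 effectively disables threshold (2).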

I propose removing threshold (1) completely. This is needed to get accurate baseline diffs: otherwise, a baseline seeing an allocation of 1023 bytes, diffed against a later baseline of 1025 bytes, would be listed as having a delta of +1025, not the correct +2.
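
The diff error can be reproduced with a small model. This is only a sketch: `Baseline`, `take_baseline`, and `delta` are hypothetical names, and a baseline is modeled as a plain map from call site to bytes rather than the real MemBaseline structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: call site name -> allocated bytes.
using Baseline = std::map<std::string, std::size_t>;

// Records the given sites, dropping every site smaller than `threshold`.
// A threshold of 0 keeps all sites.
Baseline take_baseline(const Baseline& sites, std::size_t threshold) {
    Baseline recorded;
    for (const auto& entry : sites) {
        if (entry.second >= threshold) {
            recorded[entry.first] = entry.second;
        }
    }
    return recorded;
}

// Difference for one site between two baselines; a site missing from a
// baseline counts as 0 bytes, which is what makes the delta misleading.
std::int64_t delta(const Baseline& early, const Baseline& late,
                   const std::string& site) {
    auto e = early.find(site);
    auto l = late.find(site);
    std::int64_t before = (e == early.end()) ? 0 : static_cast<std::int64_t>(e->second);
    std::int64_t after = (l == late.end()) ? 0 : static_cast<std::int64_t>(l->second);
    return after - before;
}
```

With a 1024-byte threshold, a site growing from 1023 to 1025 bytes is missing from the first baseline and yields a delta of 1025; with a threshold of 0 the same pair yields 2.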

The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.

Footprint costs of omitting the (1) threshold:

According to my measurements, omitting that threshold check increases the size of a MemBaseline from ~60K today to ~270K, an increase of ~210K.

A single MemBaseline object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseline object is used to hold the baseline. The first MemBaseline is temporary; the second one is permanent, since it is never destroyed.

Therefore, the standard footprint should not be affected at all. If NMT is active and someone runs jcmd VM.native_memory, it will cause about 270K (210K more than today) of temporary allocations. If someone runs jcmd VM.native_memory baseline, that increase is not temporary but sticks.
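
The arithmetic behind these figures, as a rough sketch using the approximate measurements quoted above. The constant and function names are illustrative only, and the sizes are estimates, not guarantees.

```cpp
#include <cstddef>

constexpr std::size_t K = 1024;

// Approximate MemBaseline sizes from the measurements in this report.
constexpr std::size_t kBaselineWithThreshold = 60 * K;
constexpr std::size_t kBaselineWithoutThreshold = 270 * K;

// Footprint while a report is generated: one temporary MemBaseline, plus
// one retained MemBaseline if a baseline was stored earlier.
constexpr std::size_t report_footprint(std::size_t baseline_size,
                                       bool baseline_stored) {
    return baseline_size * (baseline_stored ? 2 : 1);
}

// Increase caused by removing threshold (1), for either scenario.
constexpr std::size_t footprint_increase(bool baseline_stored) {
    return report_footprint(kBaselineWithoutThreshold, baseline_stored) -
           report_footprint(kBaselineWithThreshold, baseline_stored);
}
```

Under these numbers a plain report grows by about 210K, and a report diffed against a stored baseline by about 420K, of which half is permanent.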

Note that these numbers were taken from some long-running Java programs, which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the programs I ran may not have covered all call sites, the total number is bounded and, I believe, not too far off from what I measured. In fact, I was not able to change this number drastically across different runs.

Also note that omitting the (1) threshold only affects the baselining of malloc call sites. While the threshold is also used for baselining virtual memory call sites, in practice it has no effect there, since all virtual memory allocations happen at page granularity, which is >= 4K and hence always above the (1) threshold.
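
A short sketch of why page granularity makes the threshold irrelevant for virtual memory. The 4K minimum page size is taken from the text above; `align_up` and `passes_threshold` are hypothetical helpers, not HotSpot functions.

```cpp
#include <cstddef>

constexpr std::size_t kSizeThreshold = 1024;  // threshold (1), 1K per the description
constexpr std::size_t kMinPageSize = 4096;    // assumed smallest page size (4K)

// Rounds `bytes` up to a multiple of `page_size` (must be a power of two).
constexpr std::size_t align_up(std::size_t bytes, std::size_t page_size) {
    return (bytes + page_size - 1) & ~(page_size - 1);
}

// True if a site of this size would survive threshold (1).
constexpr bool passes_threshold(std::size_t bytes) {
    return bytes >= kSizeThreshold;
}
```

Even a one-byte request rounds up to a full 4K page, so any non-empty virtual memory site passes the check, whereas a 1023-byte malloc site does not.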

Bottom line: I think the footprint increase is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.

Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.
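
A minimal sketch of the sharing idea, to show where the saving would come from. All types here are hypothetical stand-ins for the NMT site objects, and the stack depth of 4 frames is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int kStackDepth = 4;  // assumed number of frames kept per call stack

// Hypothetical call stack: a fixed number of return addresses.
struct CallStack {
    std::uintptr_t frames[kStackDepth];
};

// Site object that embeds its own copy of the call stack (copy by value).
struct SiteByValue {
    CallStack stack;
    std::size_t bytes;
    std::size_t count;
};

// Site object that only points at a call stack owned elsewhere.
struct SiteShared {
    const CallStack* stack;
    std::size_t bytes;
    std::size_t count;
};

// Two shared sites refer to the same call site if they share the stack object.
inline bool same_call_site(const SiteShared& a, const SiteShared& b) {
    return a.stack == b.stack;
}
```

On a typical 64-bit platform these sketch types come out at 48 bytes by value versus 24 bytes shared, which is in line with the estimate of roughly halving the size; the real objects may differ.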
Comments
Fix Request (11u): I'd like to backport this to 11u. It is an important change since it allows us to use NMT for leak analysis, where before NMT would cut off reporting for any call site allocating less than the 1K threshold. The risk is small: the fix increases the footprint of NMT *reporting* (so not NMT in general, just when you create a report via jcmd) by about 200K, or 400K if you do baseline diffs. See the discussion in the original PR: it was deemed acceptable in order to get precise NMT reports. The fix applies cleanly. Nightlies ran at SAP across all our platforms for about two weeks without problems. I also ran the NMT jtreg tests manually without problems.
09-07-2021

Changeset: 578a0b3c
Author: Thomas Stuefe <stuefe@openjdk.org>
Date: 2021-04-26 04:56:31 +0000
URL: https://git.openjdk.java.net/jdk/commit/578a0b3c
26-04-2021