JDK-8249666 : Improve Native Memory Tracking to report the actual RSS usage
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 11,17,21,22
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2020-07-17
  • Updated: 2025-05-23
JDK 26 : Unresolved
Description
Update 2025-04-08:

The goal of the ticket is to provide the live memory size of mmap sections registered with NMT when reporting NMT mmap information. That should be a third number, alongside reserved and committed.

Obviously, this number has to be queried in real time, at the time the report is done. Since that may take a bit, we can make showing live memory optional by guarding it behind a jcmd flag. The information would be stale after the report finishes, so there is no need to store it long-term.

I don't think we need this information in *detail* reports, just in summary reports. Since detail reports are kept on a per-commit basis and live memory does not correlate well with committing memory, it would not be very useful there.

Note that this feature already exists in `System.map` since JDK-8322475, and not only for NMT-tracked VMAs, but for all VMAs of the process. The command is also reasonably fast. So, the base problem is already solved; this same path could be taken (maybe reusing/revamping the original code).

----

Original issue text:

Currently, NMT shows allocated memory as either "Reserved" or "Committed". Reserved memory is really just reserved virtual address space which was mmap'ed with MAP_NORESERVE, while Committed memory is mapped without MAP_NORESERVE. In the output of top or pmap, both Reserved and Committed show up as "virtual" memory until they are used (i.e. touched) for the first time. Only after a memory page (usually 4k) has been written to for the first time will it consume physical memory and appear in the "resident set" (i.e. RSS) of top's/pmap's output.
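
For illustration, a minimal standalone Linux sketch (not HotSpot code) of this reserve-then-commit sequence at the mmap level might look like this:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t size = 64 * 1024 * 1024;

  // "Reserve": grab address space only. PROT_NONE + MAP_NORESERVE promises no
  // physical/swap backing; the region only counts as virtual memory.
  void* base = mmap(nullptr, size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (base == MAP_FAILED) { perror("reserve"); return 1; }

  // "Commit": re-map the same range read/write and without MAP_NORESERVE.
  // Still no RSS is consumed until the pages are actually touched.
  void* committed = mmap(base, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
  if (committed == MAP_FAILED) { perror("commit"); return 1; }

  // Only now do the first 4 MB become resident (show up in RSS in top/pmap).
  for (size_t off = 0; off < 4 * 1024 * 1024; off += 4096) {
    static_cast<char*>(committed)[off] = 1;
  }

  munmap(base, size);
  return 0;
}
```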

The effect of allocating memory with or without MAP_NORESERVE depends on the Linux memory overcommit configuration [1]. By default, overcommit is allowed, and memory allocated with MAP_NORESERVE isn't checked against the available physical memory (see man proc(5) [2]). If the HotSpot VM tries to commit reserved memory (i.e. re-map a memory region without MAP_NORESERVE which was previously mapped with MAP_NORESERVE) and there's not enough free memory available, an OutOfMemoryError will be thrown.

But even committing a memory region doesn't mean that physical memory pages will be allocated for that region (and accounted in the process's RSS) until that memory is written to for the first time. So, depending on the overcommit settings, an application might still crash with a SIGBUS because it runs out of physical memory when it first touches memory which was committed a long time ago.

The main problem with the current NMT output is that it cannot distinguish between touched and untouched Committed memory. If a VM is started with -Xms1g -Xmx1g, the VM will commit the whole 1g heap and NMT will report Reserved=Committed=1g. In contrast, system tools like ps/top will only show the part of the heap which has really been used (i.e. touched) as RSS, usually just about 100m. This is at least confusing.

But we can do better. We can use mincore() [3] to determine how much of the memory which NMT accounts as Committed is actually resident (RSS), and report that instead (or in addition). Notice that this feature has already been implemented for thread stacks with "JDK-8191369: NMT: Enhance thread stack tracking" [4] and just needs to be extended to all other kinds of memory, starting with the Java heap.
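
A minimal standalone sketch (not the HotSpot implementation) of such mincore()-based probing for a page-aligned committed range could look like this:

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <vector>
#include <cstddef>

// Returns the number of bytes of [addr, addr + size) that are currently
// resident in physical memory, or 0 on error. addr must be page-aligned.
static size_t resident_bytes(void* addr, size_t size) {
  const size_t page  = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  const size_t pages = (size + page - 1) / page;
  std::vector<unsigned char> vec(pages);
  if (mincore(addr, size, vec.data()) != 0) {
    return 0;                          // e.g. range not mapped
  }
  size_t resident = 0;
  for (unsigned char v : vec) {
    if (v & 1) {                       // bit 0: this page is resident
      resident += page;
    }
  }
  return resident;
}
```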

Alternatively, instead of using mincore() we could use the information from /proc/<pid>/smaps (also accessible through the pmap [5] command line utility) directly and merge it with the NMT data to get a complete, annotated overview of the whole address space of a Java process.
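
For the smaps alternative, a minimal standalone sketch (assuming an LP64 Linux and the kB-based smaps format; the helper name is made up) that sums the Rss: values of all mappings starting within a given address range might look like this:

```cpp
#include <cstdio>

// Sums the "Rss:" values of all /proc/self/smaps mappings whose start address
// lies in [lo, hi). Returns the total in bytes, or 0 if smaps cannot be read.
// Assumes an LP64 Linux, where addresses fit into unsigned long.
static unsigned long rss_of_range(unsigned long lo, unsigned long hi) {
  FILE* f = std::fopen("/proc/self/smaps", "r");
  if (f == nullptr) return 0;
  char line[512];
  bool in_range = false;
  unsigned long total_kb = 0;
  while (std::fgets(line, sizeof(line), f) != nullptr) {
    unsigned long start, end, kb;
    if (std::sscanf(line, "%lx-%lx ", &start, &end) == 2) {
      // Header line of a new mapping: remember whether it starts in our range.
      in_range = (start >= lo && start < hi);
    } else if (in_range && std::sscanf(line, "Rss: %lu kB", &kb) == 1) {
      total_kb += kb;                  // smaps reports Rss in kB
    }
  }
  std::fclose(f);
  return total_kb * 1024;
}
```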

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
[2] https://man7.org/linux/man-pages/man5/proc.5.html
[3] https://man7.org/linux/man-pages/man2/mincore.2.html
[4] https://bugs.openjdk.java.net/browse/JDK-8191369
[5] https://man7.org/linux/man-pages/man1/pmap.1.html
Comments
Okay, guys. I'll try to slot in some days in the JDK 26 cycle for this.
23-05-2025

Looking forward to this and hopefully the impact of building the cache is not going to be significant.
08-04-2025

Sounds like we're on the same train of thought. I think this seems like a good way forward; I'll support it by reviewing if a PR comes out.
08-04-2025

[~jsjolen] Some parts can be reused. System.map works like this:
1) build up an NMT lookup cache (can be removed once we switch the backend to the binary tree; this is just a performance optimization)
2) iterate over VMAs
3) print VMA details; look up each VMA in the NMT lookup cache to print NMT information if available.
(1) is needed to avoid an O(n^2) lookup.

The NMT solution would need a somewhat different flow, probably a reversed one:
1) build up a cache containing all VMAs (e.g. by using an own instance of the VMATree; that would be very practical and fast)
2) print all mmap segments; for every segment, look up the VMA in the cache to get the live memory information and print that, too.
(1) would, again, be needed to avoid an O(n^2) lookup (see the sketch below).

The underlying code uses a procfile scanner atop /proc/<pid>/smaps. That is (now, after several iterations) quite fast, as you can see in the execution time of System.map. I think all of that code could be reused. We may not need the discussed "mincore" solution at all.
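
A rough sketch of that reversed lookup flow, with std::map standing in for the VMATree-like cache and all names hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Hypothetical record for one VMA, filled from a /proc/<pid>/smaps scan.
struct Vma {
  uintptr_t start;
  uintptr_t end;
  uint64_t  rss;   // resident bytes of this VMA
};

// Step 1: one smaps pass fills a sorted cache keyed by VMA start address
// (std::map stands in here for a VMATree-like interval structure).
using VmaCache = std::map<uintptr_t, Vma>;

// Step 2: for an NMT-registered mmap segment [start, end), sum up the
// resident portions of all cached VMAs overlapping that segment.
static uint64_t live_bytes(const VmaCache& cache, uintptr_t start, uintptr_t end) {
  uint64_t live = 0;
  auto it = cache.upper_bound(start);   // first VMA starting after 'start'
  if (it != cache.begin()) --it;        // step back to the VMA covering 'start'
  for (; it != cache.end() && it->second.start < end; ++it) {
    const Vma& v = it->second;
    const uintptr_t lo = std::max(v.start, start);
    const uintptr_t hi = std::min(v.end, end);
    if (lo >= hi) continue;
    // Approximation: attribute the VMA's RSS proportionally to the overlap.
    live += v.rss * (hi - lo) / (v.end - v.start);
  }
  return live;
}
```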
08-04-2025

Thank you [~stuefe], yes, that does sound very useful. It sounds like this is strictly a reporter feature: we don't need to integrate this with the NMT structures and can instead use the System.map feature to extract the data when reporting. Am I understanding that correctly, or am I missing something crucial?
08-04-2025

[~jsjolen] The goal of the ticket is to provide the live memory size of mmap sections registered with NMT, on a per-mapping granularity. That should be a third number, alongside reserved and committed. Obviously, this number has to be queried in real time, at the time the report is done. Since that may take a bit, we can make the option to show live memory optional and guard it behind a jcmd flag.

Note that we already do this for System.map. That command has been extended with https://bugs.openjdk.org/browse/JDK-8322475 to provide real live RSS information on all memory segments the process has allocated, and it shows the NMT category/flag/memtag/... for NMT-allocated segments. So, it's not that complex a problem, maybe just a reshuffling of code. Note that our customers increasingly use System.map instead of NMT because of this very feature, which shows that this is needed.
08-04-2025

I still don't get the goal of this ticket :-). OK, I understand the problem: we can have a "committed" page which just points to the zero page, so it doesn't page in any actual physical memory. Is the goal to track the exact RSS of the JVM process? Or is the goal to be able to say, for any NMT-registered virtual memory region, how much of *that* particular region has a corresponding physical page? I don't see how it is useful to have NMT track the exact RSS of the JVM process. That would be very general, coarse information which is already available elsewhere.
08-04-2025

[~gziemski] I forgot what we agreed upon. What was your plan, did you want to grab this issue and fix it? Feel free to take it over if you want. I think this is one of the more important issues that need solving.
08-04-2025

[~gziemski] mincore on a 24TB heap is unrealistic. Better read the procfs then.
05-12-2024

[~stuefe] We don't want to take hours, of course, but what is the time duration that you are thinking about and did it come from your experience or is it some sort of hard requirement based on a specific use case?
04-12-2024

[~rtoyonaga] We don't need to provide a solution for all platforms; Linux is the most important. For Linux, mincore may actually be too slow for large heaps. Another solution would be to read RSS from /proc/<pid>/smaps (see e.g. what I did in the jcmd System.map for Linux). That would have to be done in a smart way for speed, though. E.g., similar to how we deal with the VMATree: extract the information from smaps, add it to a binary interval tree, and then use that as a lookup when printing memory sizes for mmap ranges.
04-12-2024

I think the QueryWorkingSet Windows API is almost analogous to mincore and can report memory actually contributing to RSS. However, currently, when accounting thread stacks on Windows, Hotspot uses the VirtualQuery API to get the committed size, which does not return the actual physical memory used (inconsistent with the mincore approach on Linux). That should probably be changed too. As another side note, on AIX, mincore only works for memory allocated via mmap, which is probably fine for explicit commits, but inconsistent when it comes to thread stacks.
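
A rough standalone sketch (hypothetical helper, not current HotSpot code) of probing a range with QueryWorkingSetEx could look like this:

```cpp
#include <windows.h>
#include <psapi.h>   // link with psapi.lib (newer SDKs route this to K32QueryWorkingSetEx)
#include <vector>

// Returns the number of bytes of [addr, addr + size) whose pages are
// currently in the process working set (roughly the RSS-resident part).
static size_t working_set_bytes(void* addr, size_t size) {
  SYSTEM_INFO si;
  GetSystemInfo(&si);
  const size_t page  = si.dwPageSize;
  const size_t pages = (size + page - 1) / page;

  std::vector<PSAPI_WORKING_SET_EX_INFORMATION> info(pages);
  for (size_t i = 0; i < pages; i++) {
    info[i].VirtualAddress = static_cast<char*>(addr) + i * page;
  }
  if (!QueryWorkingSetEx(GetCurrentProcess(), info.data(),
                         static_cast<DWORD>(pages * sizeof(info[0])))) {
    return 0;
  }
  size_t resident = 0;
  for (const auto& e : info) {
    if (e.VirtualAttributes.Valid) {   // set: the page is in the working set
      resident += page;
    }
  }
  return resident;
}
```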
03-12-2024

The more I think about this, and the more I look critically at NMT reports we get from our customers, the more I think this would be valuable and needed. Note that there are alternatives to mincore; for example, one could scan /proc/<pid>/smaps on Linux for the dirty page count. However, whether this is faster is another question - probably not without some clever implementation and caching.
21-02-2024

This is a tricky problem, but not impossible to solve. I am not sure about the priority, though. I would guess that NMT "Committed" values correspond closely to what the hotspot actually touches:
- for malloc blocks, since they are usually a lot smaller than a page, we can assume that the page is touched (if nothing else, by the glibc below us, since it writes malloc headers)
- for mmap areas, the hotspot typically commits what it will use in the foreseeable future. The obvious exception to this rule is the heap, where committed can be a lot larger than used. For the heap, "touched" would be somewhere between "used" (which we know) and "committed" (which we know).

I suspect that even if we implement mincore probing for all our memory, we will still have a large unexplainable gap between RSS and what NMT reports as committed. This is because, among other things:
- many allocations are not tracked by NMT
- we can have significant overhead in the libc on a per-malloc-block basis, etc.

About mincore probing: it sounds like an attractive idea, and I believe it could be done by just probing the mmap blocks NMT knows about (so, only the blocks tracked by VirtualMemoryTracker, not the MallocTracker blocks). We don't have to track malloc for the same reason stated above: it is highly likely that malloc'ed memory is touched, so there is no need to test that. Still, mincore probing can be expensive if we have lots of mappings or large mappings, but that would be something to measure. It could be done with an additional option when reporting.
15-02-2024

I don't see how this is affected by JDK-8317453? As I wrote in the Description, I think we should use mincore() and/or /proc/<pid>/smaps to implement this feature. It is very confusing if the NMT output doesn't correspond at all (i.e. is much larger, although it only covers Hotspot allocations) to that from system tools.
15-02-2024

Pausing until I get JDK-8317453 figured out.
02-02-2024

I moved this issue out of 22 into 23. It's blocked by JDK-8317453. That work in JDK-8317453 will help figure out how we want to internally store all the mallocs that we will have to track for calculating RSS.
09-11-2023

Notice that "JDK-8191369 NMT: Enhance thread stack tracking" was backed out by "JDK-8199133 [BACKOUT] NMT: Enhance thread stack tracking" and it was then fixed by "JDK-8199067 [REDO] NMT: Enhance thread stack tracking"
15-03-2023

Sure, I've assigned it to you.
06-03-2023

[~simonis] hi Volker, we discussed this feature recently internally and have a use case where this would help. Are you OK if I take this one?
06-03-2023

I don't see a reason why this must be closed as "Won't fix" just because nobody has found the time to work on it yet. I still think this is a useful feature which we should implement, and it is still on my (unfortunately quite long :) ToDo list.
05-01-2023

Runtime Triage: closing due to long period of inactivity on this request
04-01-2023