JDK-8286991 : Hotspot container subsystem unaware of VM moving cgroups
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 11,17,19
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • Submitted: 2022-05-19
  • Updated: 2025-03-10
Description
The hotspot container wrapper on Linux is unaware of the VM process moving to a different cgroup - it establishes the paths to the associated controllers at VM start, and those paths never get updated.

This does affect os::available_memory() and os::physical_memory() - used e.g. from Java via JMM - and JFR. It also affects the printout in VM.info and the hs-err file. It does *not* affect using the cgroup limit for Java heap sizing, AFAICS, since that happens at startup - unless the cgroup move happens right at startup too.

Note that the container subsystem already takes care of periodically re-reading cgroup limits in case those limits changed. So it already has the notion of knowing the up-to-date limits. Moving the process to a different cgroup should have the same effect.
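To make the one-time lookup concrete, here is a minimal shell sketch of the kind of path derivation the container subsystem performs at startup (a cgroup v1 layout is assumed and the parsing is simplified; this is illustrative, not HotSpot's actual code):

```shell
#!/bin/sh
# One-time lookup sketch (hypothetical; cgroup v1 layout assumed).
# /proc/self/cgroup changes when the process is moved to another cgroup,
# but a path cached from this single read is never refreshed.
line=$(grep ':memory:' /proc/self/cgroup || true)  # e.g. "4:memory:/user.slice"
rel=${line##*:}                                    # relative path, e.g. /user.slice
echo "cached memory controller path: /sys/fs/cgroup/memory${rel}"
```

On a cgroup v2 (unified hierarchy) host the line looks different (`0::/path`), which is one reason the real detection code is more involved.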

Reproduction is easy (Ubuntu 20.04):

```
(start VM, has PID 107543)
sudo mkdir /sys/fs/cgroup/memory/foo                                       
echo 107543 | sudo tee /sys/fs/cgroup/memory/foo/cgroup.procs              
echo 500000000 | sudo tee /sys/fs/cgroup/memory/foo/memory.limit_in_bytes  
jcmd 107543 VM.info
```

VM.info will show the original limit, in this case unlimited:

```
container (cgroup) information:            
...
memory_limit_in_bytes: unlimited           
```

This is because I started the VM in the default systemd cgroup, and VM initialization ran there. However, the real limit applies: the VM will be killed if it allocates more than 500MB.
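As a cross-check during reproduction: the kernel's own view is always current even when the JVM's cached one is not. A quick way to see where a process really sits (shown here for the shell's own PID; substitute 107543 for the JVM):

```shell
#!/bin/sh
# The kernel interface file always reflects current cgroup membership,
# unlike the path the JVM cached at startup. After the move above, the
# memory line would show "/foo" rather than the original group.
cat /proc/$$/cgroup
```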



Comments
JDK-8343191 is related. When investigating it, we've discovered that the paths are not always exposed in the interface files when processes move. In that case there isn't a lot that could be done at the JVM level.
10-03-2025

If the concern is for "jcmd 107543 VM.info" only, I think it's OK to completely refresh the information. I don't see an advantage of caching with CRC checks. In fact, jcmd 107543 VM.info should probably report:
- container info when the VM was started
- (if changed) up-to-date container info

Then the user of jcmd will not be surprised to find out that the heap ergonomics are completely out of whack w.r.t. the current container memory limits.
20-05-2022

@sgehwolf: thanks a lot for looking into this. I don't expect a full solution, but it would be nice to know when this information is outdated, at least in VM.info and the hs-err file. One simple pragmatic solution could be to store a CRC or the file size (or the content itself, it's not that big) of /proc/self/cgroup at VM startup, when the container wrapper initializes. Then, when printing the container information, re-read /proc/self/cgroup and check whether it changed. And if it changed, at least not print the output, or print a note saying "this information may be outdated because the process changed cgroups".
19-05-2022
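The staleness check suggested above could look roughly like this (a minimal shell sketch of the idea; the real fix would live in HotSpot's container code, not in shell):

```shell
#!/bin/sh
# Sketch of the suggested staleness check (hypothetical): checksum
# /proc/self/cgroup "at startup", compare again "at print time".
startup_sum=$(cksum < /proc/self/cgroup)

# ... later, before printing container info in VM.info / the hs-err file ...
current_sum=$(cksum < /proc/self/cgroup)

if [ "$startup_sum" != "$current_sum" ]; then
  echo "Note: this information may be outdated because the process changed cgroups"
else
  echo "container information is current"
fi
```

Since no move happens between the two reads here, this prints the "current" branch; a cgroup move in between would flip it to the warning.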

I will also note that for many years we did not try to account for changing zone configuration in Solaris either.
19-05-2022

I agree that dynamic detection implies far too much overhead. I continue to be frustrated that container technology does not provide the right set of interfaces for applications to be fully container aware. If moving a live process to a different cgroup is allowed then there should be a notification mechanism so the process can dynamically adapt. Otherwise it just seems to me that the container developers are actively working against application developers.
19-05-2022

Also, as you say, adjusting the resource limit without rebooting the VM will have bad effects as well, since most resource limits are consumed at startup to size the VM's internal structures - e.g. memory for setting up the heap, or the number of CPUs, which affects compiler threads etc.
19-05-2022

Most of the container detection code has been written with Docker/Kubernetes in mind. The above use-case is harder to do using containers AFAIK. In the Kubernetes world a change in resource limits results in a reboot of the container, so it's largely a non-issue there. So there really is a balance to strike. Full dynamic detection will mean more overhead on parsing files, which is already fairly heavy-weight. Also, even if we moved to a more dynamic model for the path look-up, this would still report wrong limits in cases like the following, since Linux cgroups are hierarchical - a limit on the parent affects all children too, in terms of an upper bound:

```
(start VM, has PID 107543)
sudo mkdir /sys/fs/cgroup/memory/foo
echo 107543 | sudo tee /sys/fs/cgroup/memory/foo/cgroup.procs
echo 500000000 | sudo tee /sys/fs/cgroup/memory/memory.limit_in_bytes
jcmd 107543 VM.info
```

So I'm not convinced we should support each and every use-case.
19-05-2022
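The hierarchical point can be illustrated without root by mocking the layout in a temp directory: the effective limit is the smallest limit_in_bytes along the ancestor chain, so a parent limit caps every child. All paths and values below are made up for the sketch:

```shell
#!/bin/sh
# Mocked cgroup v1 hierarchy: a parent limit bounds the child even though
# the child itself reports an effectively unlimited value.
root=$(mktemp -d)
mkdir -p "$root/parent/child"
echo 500000000           > "$root/parent/memory.limit_in_bytes"        # parent cap
echo 9223372036854771712 > "$root/parent/child/memory.limit_in_bytes"  # ~unlimited

min=""
dir="$root/parent/child"
while [ "$dir" != "$root" ]; do
  v=$(cat "$dir/memory.limit_in_bytes")
  if [ -z "$min" ] || [ "$v" -lt "$min" ]; then
    min=$v
  fi
  dir=$(dirname "$dir")
done
echo "effective limit: $min bytes"   # the parent's 500000000 wins
rm -rf "$root"
```

This is why even a per-process re-read of the process's own cgroup path is not the whole story: a limit written one level up changes the effective bound without touching the child's files.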

[~sgehwolf] I would suspect that in this case, the test would be way more complex to set up than the solution it tests. Up to you I guess; I think it would be justifiable to make do with manual tests in this case, especially since what the patch would do is not very complex.
19-05-2022

[~stuefe] OK, I see. That seems reasonable to me (detecting a change and printing a warning). If we decide to do that, we should have a way to automatically test this functionality.
19-05-2022