Note: This backport CSR is copied verbatim from the original CSR [JDK-8281571](https://bugs.openjdk.java.net/browse/JDK-8281571), with the exception that the affected VM flags are not deprecated.
Summary
-------
Modify HotSpot's Linux-only container detection code to not use *CPU Shares* (the `cpu.shares` file with cgroupv1 or the `cpu.weight` file with cgroupv2, exposed through the `CgroupSubsystem::cpu_shares()` API) to limit the number of active processors that can be used by the JVM. Add a new flag, `UseContainerCpuShares`, to restore the old behaviour.
Problem
-------
Since [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115), if the JVM is executed inside a container, it tries to respect the container's CPU resource limits. For example, even if `/proc/cpuinfo` states the machine has 32 cpus, `os::active_processor_count()` may return a smaller value because `CgroupSubsystem::active_processor_count()` returns a CPU limit as configured by the cgroup pseudo file-system. As a result, thread pools in the JVM such as the garbage collector worker threads or the `ForkJoin` common pool are given a smaller size.
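To illustrate the effect described above, the container-aware CPU count is observable from Java code; the following is a minimal probe sketch (the container configuration it would run under, and the values it would print, are assumptions for the example):

```java
import java.util.concurrent.ForkJoinPool;

public class CpuCountProbe {
    public static void main(String[] args) {
        // Backed by os::active_processor_count(), i.e. the container-aware count.
        int cpus = Runtime.getRuntime().availableProcessors();

        // The ForkJoin common pool is sized from the same count (typically cpus - 1).
        int parallelism = ForkJoinPool.commonPool().getParallelism();

        System.out.println("availableProcessors = " + cpus);
        System.out.println("ForkJoin common pool parallelism = " + parallelism);
    }
}
```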
However, the current implementation of `CgroupSubsystem::active_processor_count()` also uses `CgroupSubsystem::cpu_shares()` to compute an upper limit for `os::active_processor_count()`. This is incorrect because:
- In general, the amount of CPU resources given to a process running within a container is based on the ratio between (a) the CPU Shares of the current container, and (b) the total CPU Shares of all active containers running via the container engine on a host. Thus, the JVM process cannot know how much CPU it will be given by only looking at its own process's CPU Shares value.
- The ratio between (a) and (b) varies over time, depending on how many other processes within containers are active during each scheduling period. A one-shot static value computed at JVM start-up cannot capture this dynamic behavior.
- [JDK-8216366](https://bugs.openjdk.java.net/browse/JDK-8216366) documents why the `1024` hard-coded constant is being used within the JVM. The referenced review thread uses Kubernetes as (one) justification for using CPU Shares as an **upper** bound for CPU resources. Yet, Kubernetes uses CPU Shares to implement its "CPU request" mechanism. It refuses to schedule a container on a node if doing so would exceed the node's total CPU Shares capacity (`number_of_cores * 1024`). Hence, Kubernetes' notion of "CPU request" is a **lower** bound -- the container running the JVM process would be given at least the amount of CPU requested, potentially more. The JVM using CPU Shares as an **upper** bound is in conflict with how Kubernetes actually behaves.
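As a concrete illustration of the first two points above, the CPU a container actually receives depends on which other containers are busy during the same scheduling period, not on its own CPU Shares value alone. A hypothetical calculation (all share values and CPU counts are made up for the example):

```java
public class ShareRatioExample {
    public static void main(String[] args) {
        int hostCpus = 32;       // total CPUs on the host (example value)
        int ownShares = 512;     // cpu.shares of this container (example value)

        // Scheduling period 1: one other busy container also has 512 shares.
        double cpusNow = hostCpus * (double) ownShares / (512 + 512);   // 16 CPUs

        // Scheduling period 2: the other container is idle; only our shares compete.
        double cpusLater = hostCpus * (double) ownShares / 512;         // 32 CPUs

        // Same static cpu.shares value, very different effective CPU capacity,
        // so no meaningful static processor count can be derived from it alone.
        System.out.printf("busy neighbour: %.0f CPUs, idle neighbour: %.0f CPUs%n",
                cpusNow, cpusLater);
    }
}
```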
The JVM's use of CPU Shares has led to CPU underutilization (e.g., [JDK-8279484](https://bugs.openjdk.java.net/browse/JDK-8279484)).
Also, special-casing of the `1024` constant results in unintuitive behavior. For example, when running on a cgroupv1 system:
- `docker run ... --cpu-shares=512 java ...` ==> `os::active_processor_count()` = 1
- `docker run ... --cpu-shares=1024 java ...` ==> `os::active_processor_count()` = 32 (total CPUs on this system)
- `docker run ... --cpu-shares=2048 java ...` ==> `os::active_processor_count()` = 2
When the `--cpu-shares` option is set to `1024`, the JVM cannot decide whether `1024` means "at least one CPU" (Kubernetes' interpretation) or "`--cpu-shares` is unset" (Docker's interpretation -- Docker sets CPU Shares to `1024` if the `--cpu-shares` flag is **not** specified on the command-line).
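The following is a simplified sketch of that shares-based heuristic, reconstructed only from the behavior described above; it is not the actual HotSpot code (which also combines the result with quota and cpuset limits), and the class and method names are made up:

```java
public class SharesHeuristicSketch {
    private static final int PER_CPU_SHARES = 1024;  // hard-coded constant discussed in JDK-8216366

    // Rough mapping from cpu.shares to a processor count, as described above:
    // 512 -> 1, 1024 -> hostCpus (treated as "unset"), 2048 -> 2.
    static int sharesToCpuCount(int cpuShares, int hostCpus) {
        if (cpuShares == PER_CPU_SHARES) {
            // Ambiguous value: also Docker's default when --cpu-shares is not given.
            return hostCpus;
        }
        return Math.max(1, Math.round((float) cpuShares / PER_CPU_SHARES));
    }

    public static void main(String[] args) {
        int hostCpus = 32;
        System.out.println(sharesToCpuCount(512, hostCpus));   // 1
        System.out.println(sharesToCpuCount(1024, hostCpus));  // 32
        System.out.println(sharesToCpuCount(2048, hostCpus));  // 2
    }
}
```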
Solution
--------
Out of the box, the JVM will not use CPU Shares in the computation of `os::active_processor_count()`.
As described above, the JVM cannot make any reasonable decision by looking at the value of CPU Shares alone. We should leave CPU scheduling decisions to the OS.
Add a new flag, `UseContainerCpuShares`, to restore the old behaviour.
Specification
-------------
- Add a new flag `UseContainerCpuShares`
- Update the meaning of the existing flag `PreferContainerQuotaForCPUCount`
Changes in `os/linux/globals_linux.hpp`:

    + product(bool, UseContainerCpuShares, false,                          \
    +         "Include CPU shares in the CPU availability calculation.")   \
Compatibility Risks
-------------
### Kubernetes
- Kubernetes requires that if either "CPU requests" or "CPU limits" is set, then both must be set. As a result, because of [JDK-8197589](https://bugs.openjdk.java.net/browse/JDK-8197589), the JVM already ignores CPU Shares by default in this configuration, so it already behaves as this CSR specifies.
- If neither "CPU requests" nor "CPU limits" is set, Kubernetes runs the container with no (upper) CPU limit and minimal CPU Shares. Before this CSR, the JVM would limit itself to a single CPU. After this CSR, the JVM may use as much CPU as given by the OS (subject to competition with other active processes within containers). If this new behavior is not what the user wants, they should explicitly set "CPU requests"/"CPU limits" in their Kubernetes deployments instead of relying on the previous JVM behavior.
### Other Linux-based container orchestration environments
- In general, after this CSR, out of the box, a JVM process will be able to use as much CPU as the OS scheduler gives it, which may be more than before this CSR. If the user wants to limit the active processor count of a JVM process within a container, they should use the appropriate mechanisms of the container orchestration environment to set the desired limits, for example a limit based on CPU quotas or CPU sets. Another option is to override the default container detection mechanism by explicitly specifying `-XX:ActiveProcessorCount=<n>` on the command-line.
As a stop-gap measure, if the user cannot immediately modify their configuration per the above suggestions, they can use the flag `-XX:+UseContainerCpuShares` to bring back the behavior before this CSR. Note that this flag is intended only for short-term transition purposes and will be obsoleted in JDK 20.
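For users who want to verify which processor count a given configuration produces, a small probe such as the one below can be run inside the container (the `docker run` invocations in the comments are illustrative assumptions, not prescriptive):

```java
public class ActiveCpuCheck {
    public static void main(String[] args) {
        // Illustrative invocations (image names and other options omitted):
        //   docker run ... --cpu-shares=512 java ActiveCpuCheck
        //       new default: count as given by the OS scheduler
        //   docker run ... --cpu-shares=512 java -XX:+UseContainerCpuShares ActiveCpuCheck
        //       previous behavior: 1
        //   docker run ... java -XX:ActiveProcessorCount=4 ActiveCpuCheck
        //       explicit override: 4
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
    }
}
```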