JDK-8281571 : Do not use CPU Shares to compute active processor count
  • Type: CSR
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P3
  • Status: Closed
  • Resolution: Approved
  • Fix Versions: 19
  • Submitted: 2022-02-10
  • Updated: 2022-03-29
  • Resolved: 2022-02-26
Related Reports
CSR :  
Relates :  
Description
Summary
-------

Modify HotSpot's Linux-only container detection code to not use *CPU Shares* (the "cpu.shares" file with cgroupv1 or "cpu.weight" file with cgroupv2, exposed through the `CgroupSubsystem::cpu_shares()` API) to limit the number of active processors that can be used by the JVM. Add a new flag (immediately deprecated), `UseContainerCpuShares`, to restore the old behaviour; and deprecate the existing `PreferContainerQuotaForCPUCount` flag.

Problem
-------

Since [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115), if the JVM is executed inside a container, it tries to respect the container's CPU resource limits. For example, even if `/proc/cpuinfo` states the machine has 32 cpus, `os::active_processor_count()` may return a smaller value because `CgroupSubsystem::active_processor_count()` returns a CPU limit as configured by the cgroup pseudo file-system. As a result, thread pools in the JVM such as the garbage collector worker threads or the `ForkJoin` common pool are given a smaller size.

However, the current implementation of `CgroupSubsystem::active_processor_count()` also uses `CgroupSubsystem::cpu_shares()` to compute an upper limit for `os::active_processor_count()`. This is incorrect because:

- In general, the amount of CPU resources given to a process running within a container is based on the ratio between (a) the CPU Shares of the current container, and (b) the total CPU Shares of all active containers running via the container engine on a host. Thus, the JVM process cannot know how much CPU it will be given by only looking at its own process's CPU Shares value.

- The ratio between (a) and (b) varies over time, depending on how many other processes within containers are active during each scheduling period. A one-shot static value computed at JVM start-up cannot capture this dynamic behavior.

- [JDK-8216366](https://bugs.openjdk.java.net/browse/JDK-8216366) documents why the `1024` hard-coded constant is being used within the JVM. The referenced review thread uses Kubernetes as (one) justification for using CPU Shares as an **upper** bound for CPU resources. Yet, Kubernetes uses CPU Shares to implement its "CPU request" mechanism. It refuses to schedule a container on a node which would exceed the node's total CPU Shares capacity (`number_of_cores * 1024`). Hence, Kubernetes' notion of "CPU request" is a **lower** bound -- the container running the JVM process would be given at least the amount of CPU requested, potentially more. The JVM using CPU Shares as an **upper** bound is in conflict with how Kubernetes actually behaves.
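To make the ratio argument concrete, here is a small illustrative sketch (the class and method names are hypothetical, not HotSpot APIs): the same container's effective CPU fraction changes with its neighbors' shares, so no static value read at JVM start-up can capture it.

```java
// Illustrative arithmetic only (hypothetical helper, not a HotSpot API):
// under cgroup fair-share scheduling, a busy container's CPU fraction
// depends on the shares of ALL busy containers, which a single JVM
// process cannot observe from inside its own container.
public class SharesRatio {
    /** Fraction of total CPU a container receives when all listed containers are busy. */
    static double effectiveFraction(int ownShares, int[] allBusyShares) {
        int total = 0;
        for (int s : allBusyShares) total += s;
        return (double) ownShares / total;
    }

    public static void main(String[] args) {
        // The same container (1024 shares) under two different neighbor loads:
        double alone = effectiveFraction(1024, new int[] {1024});             // 1.0
        double busy  = effectiveFraction(1024, new int[] {1024, 1024, 2048}); // 0.25
        System.out.println(alone + " vs " + busy);
    }
}
```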

The JVM's use of CPU Shares has led to CPU underutilization (e.g., [JDK-8279484](https://bugs.openjdk.java.net/browse/JDK-8279484)).

Also, special-casing of the `1024` constant results in unintuitive behavior. For example, when running on a cgroupv1 system:

    docker run ... --cpu-shares=512  java ...   ==> os::active_processor_count() = 1
    docker run ... --cpu-shares=1024 java ...   ==> os::active_processor_count() = 32 (total CPUs on this system)
    docker run ... --cpu-shares=2048 java ...   ==> os::active_processor_count() = 2

When the `--cpu-shares` option is set to `1024`, the JVM cannot decide whether `1024` means "at least one CPU" (Kubernetes' interpretation) or "`--cpu-shares` is unset" (Docker's interpretation -- Docker sets CPU Shares to `1024` if the `--cpu-shares` flag is **not** specified on the command-line).
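The unintuitive results above can be reproduced with a simplified model of the pre-change heuristic (a hypothetical sketch, not the actual HotSpot code; it assumes the 1024-shares-per-CPU scaling and round-up discussed in JDK-8216366):

```java
// Simplified sketch (hypothetical names, not the actual HotSpot sources) of
// the pre-JDK-8281571 shares heuristic: shares are scaled against the
// 1024-per-CPU convention, and the 1024 value itself is special-cased
// as "unset", producing the discontinuity at --cpu-shares=1024.
public class OldSharesHeuristic {
    static final int PER_CPU_SHARES = 1024;

    static int cpuCountFromShares(int cpuShares, int hostCpus) {
        if (cpuShares == PER_CPU_SHARES) {
            return hostCpus;                      // 1024 treated as "unset"
        }
        int count = (cpuShares + PER_CPU_SHARES - 1) / PER_CPU_SHARES; // round up
        return Math.min(Math.max(count, 1), hostCpus);
    }

    public static void main(String[] args) {
        System.out.println(cpuCountFromShares(512, 32));  // 1
        System.out.println(cpuCountFromShares(1024, 32)); // 32
        System.out.println(cpuCountFromShares(2048, 32)); // 2
    }
}
```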

Solution
--------

Out of the box, the JVM will not use CPU Shares in the computation of `os::active_processor_count()`.

As described above, the JVM cannot make any reasonable decision by looking at the value of CPU Shares alone. We should leave the CPU scheduling decisions to the OS.

Add a new flag (immediately deprecated), `UseContainerCpuShares`, to restore the old behaviour; and deprecate the existing `PreferContainerQuotaForCPUCount` flag.
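A hedged sketch of how the two deprecated flags are intended to interact after this change (illustrative Java with hypothetical names; the real logic lives in HotSpot's `CgroupSubsystem::active_processor_count()`, and `quotaCount`/`shareCount` here stand for pre-computed CPU counts with `-1` meaning "unset"):

```java
// Illustrative sketch only, not the HotSpot implementation.
public class CpuLimitSketch {
    static int activeProcessorLimit(int hostCpus, int quotaCount, int shareCount,
                                    boolean useContainerCpuShares,
                                    boolean preferContainerQuotaForCPUCount) {
        if (!useContainerCpuShares) {
            shareCount = -1;                   // new default: CPU Shares are ignored
        }
        int limit = hostCpus;
        if (quotaCount > 0 && shareCount > 0) {
            // Legacy interplay of the two (now deprecated) flags:
            limit = preferContainerQuotaForCPUCount
                    ? quotaCount
                    : Math.min(quotaCount, shareCount);
        } else if (quotaCount > 0) {
            limit = quotaCount;
        } else if (shareCount > 0) {
            limit = shareCount;                // only with -XX:+UseContainerCpuShares
        }
        return Math.min(limit, hostCpus);
    }
}
```

With the defaults (`UseContainerCpuShares` false), only quota and cpuset limits remain in effect, matching the "out of the box" behavior described above.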

Specification
-------------
- Add a new flag `UseContainerCpuShares`
- Update the meaning of the existing flag `PreferContainerQuotaForCPUCount`
- Both flags are marked as deprecated in JDK 19, to be obsoleted in JDK 20 and expired in JDK 21. Uses of these flags are discouraged.

Changes in os/linux/globals_linux.hpp:

    +  product(bool, UseContainerCpuShares, false,                           \
    +          "(Deprecated) Include CPU shares in the CPU availability"     \
    +          " calculation.")                                              \
    +                                                                        \
    +  product(bool, PreferContainerQuotaForCPUCount, true,                  \
    +          "(Deprecated) Calculate the container CPU availability based" \
    +          " on the value of quotas (if set), when true. Otherwise, if " \
    +          "UseContainerCpuShares is true, use the CPU shares value,"    \
    +          " provided it is less than quota.")                           \
    +
Compatibility Risks
-------------

### Kubernetes

- Kubernetes requires that if either "CPU requests" or "CPU limits" are set, then both must be set. As a result, because of [JDK-8197589](https://bugs.openjdk.java.net/browse/JDK-8197589), by default the JVM will ignore CPU Shares. The JVM already does what this CSR specifies.
- If neither "CPU requests" nor "CPU limits" are set, Kubernetes runs the container with no (upper) CPU limit and minimal CPU Shares. Before this CSR, the JVM would limit itself to a single CPU. After this CSR, the JVM may use as much CPU as the OS gives it (subject to competition with other active processes within containers). If this new behavior is not what the user wants, they should explicitly set "CPU requests"/"CPU limits" in their Kubernetes deployments instead of relying on the previous JVM behavior.

### Other Linux-based container orchestration environments

- In general, after this CSR, out of the box, a JVM process will be able to use as much CPU as the OS scheduler gives it, which may be more than before. If the user wants to limit the active processor count of the JVM process within a container, they should use the appropriate mechanisms of the container orchestration environment to set the desired limits -- for example, a limit based on CPU quotas or CPU sets. Another option is to override the default container detection mechanism by explicitly specifying `-XX:ActiveProcessorCount=<n>` on the command-line.
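One quick way to verify which limit the JVM actually picked is the standard `Runtime.availableProcessors()` API, which reflects both the container detection result and an explicit `-XX:ActiveProcessorCount=<n>` override (the class name below is illustrative):

```java
// Prints the processor count the JVM decided on after container detection
// and any -XX:ActiveProcessorCount=<n> override.
public class ShowCpus {
    public static void main(String[] args) {
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
```

For example, `java -XX:ActiveProcessorCount=2 ShowCpus` prints `2` regardless of the host or container CPU configuration.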

As a stop-gap measure, if the user cannot immediately modify their configuration settings per the above suggestions, they can use the flag `-XX:+UseContainerCpuShares` to bring back the behavior before this CSR. Note that this flag is intended only for short-term transition purposes and will be obsoleted in JDK 20.



Comments
jdk18u version of this CSR: JDK-8283356
29-03-2022

Moving to Approved subject to a release note being written. Thanks for the analysis and discussion in the CSR comments.
26-02-2022

Likewise. Added myself as a reviewer too.
17-02-2022

I did another editorial pass to expand the summary and solution sections, with a tweak to the specification section. Note the flag descriptions in globals.hpp are updated to say "(Deprecated)". I've added myself as a reviewer (required before a CSR request can be proposed or finalized).
17-02-2022

> David Holmes sounds fair. If Severin Gehwolf is OK with this, then I'll update the description to add a new flag. This flag and PreferContainerQuotaForCPUCount will both be deprecated to be removed in 20.

[~iklam] Sounds good to me.
16-02-2022

> > My main concern about this would be potential process thrashing of some deployments after this patch.
>
> I don't follow how this relates to the suggestion of adding a new flag. The deployed patch would behave the same out-of-the-box in both cases.

Sorry for the confusion. It wasn't meant as a reply to your compatibility-flag proposal, but to the CSR in question here (ignore cpu shares). Previously, on a kubernetes cluster that is unlimited (via quota) with no "cpu requests", the default is to set cpu shares to 2, causing the JVM to use only a single cpu core. Post patch, it would not limit the JVM at all and -- depending on the host system -- it may end up using 64 cores on a 64-core machine. Yet, it's unlikely that with a share of only 2 the container would get enough CPU cycles to actually make much progress. It's a corner case, but a concern regardless. The work-around would be to specify `-XX:ActiveProcessorCount=1` in those cases. Hope that makes sense.
16-02-2022

`UseContainerCpuShares` seems a good name. Thanks.
16-02-2022

[~dholmes] sounds fair. If [~sgehwolf] is OK with this, then I'll update the description to add a new flag. This flag and PreferContainerQuotaForCPUCount will both be deprecated to be removed in 20. For the name of the new flag, since no current flag ends with `Enabled`, instead of `ContainerCpuSharesEnabled `, how about `UseContainerCpuShares`?
16-02-2022

[~iklam] The compatibility issue is not about "changing the command-line" but about the ability to restore the existing behaviour. In JDK 20 there would be no way to restore the behaviour, so they would have to figure out how to deploy their app and perhaps configure the container (if under their control) differently. We give them 6 months to get around to figuring that out by providing the workaround in JDK 19.
16-02-2022

[~dholmes] I corrected my previous comment. Actually, I am not convinced that changing the meaning of `PreferContainerQuotaForCPUCount`, or adding a new flag, is necessary or helpful. This CSR will change the out-of-the-box behavior to ignore cpu shares. If an app relies on the old behavior to throttle its CPU usage, it must change its command-line by adding an extra JVM flag, which is deprecated by this CSR. After we remove the deprecated flags, the app must change its command-line again by doing it the proper way (with `-XX:ActiveProcessorCount=<n>`, etc.). If the app needs to change its command-line, it may just as well do it the proper way the first time around.
15-02-2022

I'd rather not add a new flag. Perhaps change the meaning of `PreferContainerQuotaForCPUCount`? `int share = !PreferContainerQuotaForCPUCount ? cpu_shares() : -1;` So users can bring back the current (in my opinion misconceived) behavior. My vote is to do the complete removal since we have 2 releases before the next LTS. However, if we decide to keep the code there for safety, we should deprecate this flag for eventual removal. To limit the thread pool sizes, the user should use CPU quota, or explicitly use `-XX:ActiveProcessorCount=<n>`, or `-Djava.util.concurrent.ForkJoinPool.common.parallelism=<n>`.
15-02-2022

> My main concern about this would be potential process thrashing of some deployments after this patch.

I don't follow how this relates to the suggestion of adding a new flag. The deployed patch would behave the same out-of-the-box in both cases.

[~iklam] I think you meant `int share = !PreferContainerQuotaForCPUCount ? cpu_shares() : -1;` That may be a possible interpretation -- though the later code that currently uses the flag would also have to change. However, that flag is only supposed to relate to the case where both quota and shares are set, not affect when only shares is set. I'd need to see the new code and the new flag explanation. A google search (especially these days) is not a strong indicator of anything, unfortunately.
15-02-2022

BTW, the addition of the `PreferContainerQuotaForCPUCount` flag had a much larger impact than the current CSR, since it effectively makes the JVM ignore CPU shares on most Kubernetes deployments (where both CPU shares and quotas are set). I did a google search and found no mentions like "oh, I had a serious problem that could only be fixed by `-XX:-PreferContainerQuotaForCPUCount`", which means no one was actually relying on CPU shares to limit threadpool sizes.
15-02-2022

> I remain concerned that the proposed code change may impact users in unexpected ways. Do we actually need to introduce a new flag, immediately deprecated, that disables use of shares by default, but which can also be used to restore the old behaviour?

It's a possibility. It would certainly be the safer option. My main concern about this would be potential process thrashing of some deployments after this patch.
15-02-2022

[~iklam] I've done another more thorough review pass of this CSR. Happy to give my review approval once we decide what to do with `PreferContainerQuotaForCPUCount` and document it.
15-02-2022

I remain concerned that the proposed code change may impact users in unexpected ways. Do we actually need to introduce a new flag, immediately deprecated, that disables use of shares by default, but which can also be used to restore the old behaviour? The new flag could be used very simply as: `int share = ContainerCpuSharesEnabled ? cpu_shares() : -1;` `PreferContainerQuotaForCPUCount` could be deprecated at the same time.
15-02-2022

> Why would we need that code change? PreferContainerQuotaForCPUCount is true by default and means we prefer to use Quota over Shares if both are set. As long as it remains true then the fact shares are ignored has no affect on behaviour. It is only if explicitly set to false that we have to decide whether to ignore it, or chose to use shares anyway.

Sorry, you are right of course. We'd need code in case of `PreferContainerQuotaForCPUCount == false`. It should mention that it's deprecated and a no-op if it is not removed as part of this CSR.
15-02-2022

[~iklam] The proposal is to remove cpu_shares handling, right? Therefore, we need to explain how `PreferContainerQuotaForCPUCount` is being dealt with too in this CSR. My suggestion would be to change interface kind to `add/change/modify command line option` and propose to remove `PreferContainerQuotaForCPUCount` option, as post-patch there would be no cpu_shares "quota" to fall back to. Failing that, we'd need to mark it deprecated and mention it'll be a no-op until we can remove it.
15-02-2022

[~sgehwolf]

- As you suggested, I removed v1/v2 specific files and use the `CgroupSubsystem::xxx()` APIs instead.
- I left Kubernetes/Docker/1024 in there because these are in the [code](http://hg.openjdk.java.net/jdk/jdk/rev/3cdf4d5148a8) committed via [JDK-8216366](https://bugs.openjdk.java.net/browse/JDK-8216366).
- I used docker/cgroupv1 as an example to illustrate the problem with 1024. The same kind of problem will probably affect other containers as well, but with different parameter settings.
- For your point (B), "other orchestration frameworks relying on taking cpu shares as an upper bound", I think it's not possible to use CPU shares to enforce an upper bound. It can only be used to give a guaranteed lower bound.
15-02-2022

Whatever the take on `PreferContainerQuotaForCPUCount` is (within this CSR), it should be mentioned what it is.
15-02-2022

Why would we need that code change? `PreferContainerQuotaForCPUCount` is true by default and means we prefer to use Quota over Shares if both are set. As long as it remains true, the fact that shares are ignored has no effect on behaviour. It is only if it is explicitly set to false that we have to decide whether to ignore it, or choose to use shares anyway.
15-02-2022

If we keep `PreferContainerQuotaForCPUCount`, then we'd have to add code as part of this change to set it to false (and print a warning) if the user has `-XX:+PreferContainerQuotaForCPUCount` set. My preference would be to remove it as part of this CSR, though. Based on [~iklam]'s reasoning, I tend to lean towards removing cpu shares code. We should bite the bullet and admit that using cpu shares for a cpu **limit** (a.k.a upper bound) was a mistake and ought to be corrected.
15-02-2022

I think we can deprecate PreferContainerQuotaForCPUCount separately. The effect of this CSR is to render setting PreferContainerQuotaForCPUCount to false a no-op and that would need to be documented via a Release Note. Alternatively ... does it make any sense to keep using the cpu shares values, but only if PreferContainerQuotaForCPUCount is set to false?
15-02-2022

[~iklam] A couple of comments:

- Since the JDK supports cgroup v1 and cgroup v2, we should try to avoid mentioning cgroup-version-specific interface files (e.g. /sys/fs/cgroup/cpu,cpuacct/ which is cgroup v1 specific).
- Similar to the previous point, handling of the 1024 constant is slightly different between cgroup v1 and cgroup v2. Ideally we'd use version-agnostic verbiage. It would be good to avoid mentioning `Docker` too. There is podman and other engines too. Perhaps we also should try to avoid mentioning Kubernetes?
- By actually removing CPU shares from the processor count calculation we essentially would make JDK-8197589 obsolete. I.e. flag `PreferContainerQuotaForCPUCount` would need to be removed too in that case, as `-XX:-PreferContainerQuotaForCPUCount` would have no effect after the patch.
- From a compatibility standpoint the primary use-cases that are affected are: A) Kubernetes, without setting any cpu limits, neither requests nor limits, and relying on the fact that the JVM will map this to a single CPU core. B) Any other orchestration frameworks relying on taking cpu shares as an upper bound. As far as Kubernetes (A) is concerned, I could think of setting `requests.cpu` explicitly in the deployment config so that a `limits.cpu` gets set too, where the cpu quota calculation then kicks in. B) is a lot harder. Though, using ActiveProcessorCount in those cases seems reasonable. Most frameworks I've seen are Kubernetes-based.
14-02-2022

Made slight editorial pass. My main concern from a compatibility perspective is knowing what a user has to change to restore any observed change in behaviour/performance that this change introduces, keeping in mind the user affected by this may not be in a position to control the configuration of the container.
14-02-2022

Discussion thread: https://mail.openjdk.java.net/pipermail/hotspot-dev/2022-February/057510.html
14-02-2022