JDK-8281181 : Do not use CPU Shares to compute active processor count
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-02-03
  • Updated: 2022-12-21
  • Resolved: 2022-03-04
Fixed in:
  • JDK 11: 11.0.17-oracle
  • JDK 17: 17.0.5
  • JDK 18: 18.0.2
  • JDK 19: 19 b13
Sub Tasks
JDK-8288367 :  
Description
Container runtimes support the concept of "CPU shares" to divide available CPU resources among competing containers. This bug description uses Docker as an example, but the bug affects other runtimes as well.

Docker has a "--cpu-shares" option [3] which controls the pseudo file cpu.shares [1] with cgroupv1 and cpu.weight [2] with cgroupv2.

Excerpt from [1] "cpu.shares: The weight of each group living in the same hierarchy, that translates into the amount of CPU it is expected to get. Upon cgroup creation, each group gets assigned a default of 1024. The percentage of CPU assigned to the cgroup is the value of shares divided by the sum of all shares in all cgroups in the same level."

From the above excerpt, it's clear that cpu.shares should be interpreted as relative values. For example, if we have processes A and B that are both actively executing and are assigned these cpu.shares:

    A = 100, B = 100, or
    A = 1000, B = 1000

Then A and B will both get half of the available CPU resources, because they have the same cpu.shares value. The exact numerical value of cpu.shares doesn't matter.

Also, if process B is idle, then process A will get all available CPUs, regardless of the cpu.shares value.
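
For illustration only (this is not JDK code, and the class and method names are hypothetical), a minimal Java sketch of the relative semantics quoted from [1]: a group's expected share of the CPUs is its cpu.shares value divided by the sum of the shares of all groups actively competing at the same level.

public class RelativeShares {
    // Expected CPU capacity for a group, per the cgroup documentation quoted above:
    // myShares / (sum of shares of all active groups at the same level) * host CPUs.
    // Idle siblings do not count, which is why an active group can use all CPUs
    // when everything else is idle.
    static double expectedCpus(long myShares, long[] activeSiblingShares, int hostCpus) {
        long total = myShares;
        for (long s : activeSiblingShares) {
            total += s;
        }
        return (double) myShares / total * hostCpus;
    }

    public static void main(String[] args) {
        System.out.println(expectedCpus(100,  new long[] {100},  32));  // A=100,  B=100  -> 16.0
        System.out.println(expectedCpus(1000, new long[] {1000}, 32));  // A=1000, B=1000 -> 16.0
        System.out.println(expectedCpus(100,  new long[] {},     32));  // B idle         -> 32.0
    }
}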

However, since JDK-8146115, the JDK interprets cpu.shares as an absolute number that limits how many CPUs the current process can use [4, 5, 6]:

0 ... 1023 = 1 CPU
1024       = (no limit)
2048       = 2 CPUs
4096       = 4 CPUs
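
A minimal Java sketch of the heuristic above (the actual implementation is HotSpot C++, see [5] and [6]; the names and structure here are for illustration only, and the cap at the host CPU count is an assumption about how the derived value is ultimately used):

public class SharesHeuristic {
    static final int PER_CPU_SHARES = 1024;

    // Pre-fix interpretation: shares are divided by 1024 and rounded up,
    // a value of exactly 1024 is treated as "no limit", and the result is
    // capped at the number of CPUs on the host.
    static int cpusFromShares(int shares, int hostCpus) {
        if (shares == PER_CPU_SHARES) {
            return hostCpus;
        }
        int derived = (int) Math.ceil((double) shares / PER_CPU_SHARES);
        return Math.min(derived, hostCpus);
    }

    public static void main(String[] args) {
        int hostCpus = 16;
        System.out.println(cpusFromShares(100,  hostCpus));  // 1
        System.out.println(cpusFromShares(1024, hostCpus));  // 16 (no limit)
        System.out.println(cpusFromShares(2048, hostCpus));  // 2
        System.out.println(cpusFromShares(4096, hostCpus));  // 4
    }
}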

This incorrect interpretation can cause CPU underutilization:

(a)  on machines with lots of physical CPUs -- see attachments cpu-shares-bug.sh, cpu-shares-bug.log.txt, and the first comment below.

(b)  if small values are chosen for the CPU shares (see JDK-8279484, where cpu.weight is set to 1 by Kubernetes).

(c) if all other containers are idle but the actively executing container is artificially constrained.

Also, the somewhat arbitrary interpretation of "1024 means no limit" can lead to unexpected behaviors. E.g., if A is set to 1024 and B is set to 2048, and the programs are running on a 16 core machine, A's JVM sees no limit (all 16 CPUs) while B's JVM sizes itself for only 2 CPUs, so A will end up using most of the CPUs, contrary to the user's expectation that B, having twice the shares, should get more CPU than A.

P.S. A good write-up of the same problem as it affects OpenJ9 can be found at [7].


==================================
References:

[1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/glommer/memcg/+/cpu_stat/Documentation/cgroups/cpu.txt

[2] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

[3] https://docs.docker.com/config/containers/resource_constraints/

[4] https://github.com/iklam/jdk/blame/ec63957f9d103e86d3b8e235e79cabb8992cb3ca/test/hotspot/jtreg/containers/docker/TestCPUAwareness.java#L62

[5] https://github.com/iklam/jdk/blame/d4546b6b36f9dc9ff3d626f8cfe62b62daa0de01/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L236

[6] https://github.com/iklam/jdk/blame/f54ce84474c2ced340c92564814fa5c221415944/src/hotspot/os/linux/cgroupSubsystem_linux.cpp#L505

[7] https://github.com/eclipse-openj9/openj9/issues/2251
Comments
I don't think auto-scaling is at all feasible. First, we have no way to track how many CPUs a process is given in any time period. Second, shrinking pools etc. is something that needs to be done gradually over time to avoid ping-pong effects (and often pools are not configured to shrink easily, if at all). It is also not generally true that things will run better if you scaled down the pools - it will depend highly on the actual workload and the nature of the tasks.
> It's up to the people deploying the JVMs to set the JVM parameters correctly
Yes. And it is up to these systems to provide the right tuning options to allow this.
21-12-2022

I think the bigger overall question is -- how can the JVM auto-scale based on the various cgroup control values? If all we know is that the system has 256 CPUs and the JVM process has a cpu.weight of 100, I don't think we can make any decision, as we don't know how many other processes are running and what their cpu weights are. The value 100 by itself is meaningless, as discussed in JDK-8281571. The JVM could try to read the cpu.weight of all other processes, but it may not have permission. Also, such calculations should be performed by the OS scheduler, not by each individual JVM process. At this point, the JVM simply doesn't have enough information to make the decision by itself. It's up to the people deploying the JVMs to set the JVM parameters correctly. An ideal solution would be for the JVM to monitor how much CPU it is actually allocated over a given period. If it's consistently getting no more than 4 CPUs, then it should scale down its thread pools, etc., accordingly. Such auto-scaling would need to be adjusted periodically. Such auto-scaling isn't just useful for containers; it's useful for JVMs running outside of containers as well.
20-12-2022

> BTW, I think if you want to limit CPU usage, CPU limits should be used, not CPU share.
This advice might be even worse in practice, as cgroup limits will actually lead to CPU throttling if the process consumes its allocated _time slices_. Before this ticket, in Kubernetes deployments, the _CPU requests_ value was based on _shares_ (or _weight_), and the number of shares is used to schedule a deployment on a node with enough capacity. I believe a good practice is to use the shares/weight or another mechanism to somewhat guide the JVM's utilization but not to limit the CPU, so the JVM process can _burst_ when needed, e.g. when garbage collecting. The advantage of cpu shares / weight is that there is a single place to configure this (in Kubernetes: cpu requests) without also having to modify the JVM options. Of course the outstanding issues with shares remain, and the practice needs to be updated now that this ticket has been delivered.
20-12-2022

I created JDK-8299037 "Make UseContainerCpuShares true by default for JDK 8u 11u 17u" Let's discuss on hotspot-runtime-dev@openjdk.org
19-12-2022

[~mbaesken] Yes, I get what this could cause. active_processor_count is influenced by a variety of factors in containers. There are 4 different CLI options for container engines that influence it: --cpus, --cpu-quota/--cpu-period, --cpu-shares, --cpuset-cpus. The first and the second map to the same mechanism. The question is why cpu-shares should have an influence (other than for historical reasons, i.e. how it used to behave). The CSR (JDK-8281571) states why it shouldn't. The question is also why none of the other 3 methods is used to get the desired results in those container frameworks. [~iklam] That would be best, IMHO, yes. It would avoid the issue of unnecessary divergence among JDKs. It should be fine to keep for JDK 21 (it would be worth mentioning in the upgrade notes for such releases).
19-12-2022

Should we "fix" the update releases by making -XX:+UseContainerCpuShares enabled by default as suggested by [~bdutheil] above? BTW, I think if you want to limit CPU usage, CPU limits should be used, not CPU share.
19-12-2022

I got some feedback from one of our support colleagues. In some instances this change led to an active_processor_count of 64 instead of 1, and this leads to more GC threads (caused by ergonomics). It even changed the GC algorithm from SerialGC to G1GC in JDK11+. This might mean 200MB more memory usage (with a 3GB container memory limit), and more usage of native memory too. The CF MemoryCalculator tool does not seem to take this into account much, so we were running into some container OOMs. It is especially bad for rather small containers that still have > 2GB (below 2GB we stay with SerialGC even with more CPUs). Maybe the calculation of active_processor_count in relation to CPU shares should be looked into.
19-12-2022
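
For anyone who wants to check what a given container configuration yields inside the JVM, here is a small stand-alone snippet (hypothetical, not part of the attachments) that prints the values ergonomics works from; running it with -Xlog:os+container=trace or -XX:+PrintFlagsFinal shows further detail such as the selected GC and heap sizes.

public class ShowDerivedSizes {
    public static void main(String[] args) {
        // CPU count the JVM derived from the container limits:
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        // Common pool parallelism follows the derived CPU count:
        System.out.println("commonPool parallelism = " + java.util.concurrent.ForkJoinPool.commonPool().getParallelism());
        // Heap limit chosen by ergonomics (also affected by container memory limits):
        System.out.println("maxMemory (bytes) = " + Runtime.getRuntime().maxMemory());
    }
}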

Yes, that was my thinking too if we were to backport it to OpenJDK 8u. That keeps the majority of users undisturbed, while giving the ones who need it an option to have cpu shares ignored. If anyone has some data on why the JDK should keep considering CPU shares as a way to limit the observable CPU core count when running in containers, I'd very much like to hear those use-cases. Thanks!
19-12-2022

I think the back-port should have left the default behavior unchanged too, i.e. the flag -XX:+UseContainerCpuShares is turned on by default (on the older JDKs)
19-12-2022

Backporting would change the default behavior and break customers' deployments. Both of these concerns are in the guidelines [1]:
"The "first, do no harm" principle applies: we must not break things."
"All fixes that significantly improve stability, security or performance and do not change behavioural compatibility will be considered for jdk8u."
[1] https://wiki.openjdk.org/display/jdk8u/Guidelines+for+working+on+jdk8u
15-12-2022

I think we should - at least consider - backporting this to OpenJDK 8u. The reasons why cpu shares should be ignored haven't changed, IMO. If users actually want to limit CPU, use other means like cpu quota.
15-12-2022

I believe it was customers who manage their own containers in our case. We have recommended both -XX:ActiveProcessorCount=n and -XX:+UseContainerCpuShares depending on the situation. I commented here mostly as a persistent reminder for all of us, as there was a comment on the PR sent to the mailing list [1] about maybe backporting this to OpenJDK 8. [1] https://mail.openjdk.org/pipermail/jdk8u-dev/2022-November/015875.html
14-12-2022

The backport to OpenJDK11 was done because we wanted to have OpenJDK close to OracleJDK (where the backports happened before). But I agree, the change caused a couple of issues on our side as well, and we had to recommend the UseContainerCpuShares flag to some customers.
14-12-2022

[~dlutker] I take it this is because on EC2[1], cpu limits map to cpu-shares only? You are aware of the -XX:+UseContainerCpuShares work-around, correct? [1] https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
14-12-2022

We have had a few customer contacts as a result of this change getting backported to JDK 11. It changes the default behavior when running in containers, and customers have started getting OOMs on the container or the JVM due to an increased number of threads getting created. There have also been reports of increased CPU utilization from these additional threads. I don't think this should have been backported, and bringing it into JDK 8 will likely cause significantly more problems for our customers.
14-12-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1196 Date: 2022-07-01 13:25:49 +0000
01-07-2022

jdk11 backport request I would like to have the patch in OpenJDK 11u-dev as well (to be similar to 11.0.17-oracle which has the patch already). The 17u/18u backport needs slight adjustment for 11 (copyright headers and some diff in strides). Backport CSR exists already for 11-pool (JDK-8286220) .
01-07-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/517 Date: 2022-06-29 08:12:12 +0000
29-06-2022

jdk17 backport request I would like to have the patch in OpenJDK 17u-dev as well (to be similar to 17.0.3.0.1-oracle which has the patch already). The 18u backport (https://github.com/openjdk/jdk18u/commit/a5411119c383225e9be27311c6cb7fe5d1700b68) applies cleanly. PR : https://github.com/openjdk/jdk17u-dev/pull/517 Backport CSR exists already for 17-pool (JDK-8282931).
29-06-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk18u/pull/79 Date: 2022-03-30 05:24:13 +0000
30-03-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk18u/pull/78 Date: 2022-03-29 22:42:45 +0000
29-03-2022

Changeset: e07fd395 Author: Ioi Lam <iklam@openjdk.org> Date: 2022-03-04 20:14:11 +0000 URL: https://git.openjdk.java.net/jdk/commit/e07fd395bdc314867886a621ec76cf74a5f76b89
04-03-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7666 Date: 2022-03-02 20:01:46 +0000
02-03-2022

I think we might be dealing with a little technical debt here. JDK-8216366, and specifically this reply from Bob, is of most interest: http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036093.html

Then also consider this from the Kubernetes docs: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#cgroup-v2
"""
There should not be any noticeable difference in the user experience when switching to cgroup v2, unless users are accessing the cgroup file system directly, either on the node or from within the containers.
"""

First, there is no difference between cgroups v1 and cgroups v2 when the default cpu-shares value is being used. However, for cgroups v1 the cpu-shares value maps 1:1 to the cgroup interface file. For cgroup v2 that isn't the case, so you need to apply the inverse of [1] to the cpu-shares CLI value (as you've noticed) in order to achieve that.

--------------------------
cgroup v1:
--------------------------
$ sudo podman run --rm -ti --cpu-shares=1024 -v $(pwd)/openjdk-17.0.2+8/:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version
[0.009s][trace][os,container] OSContainer::init: Initializing Container Support
[0.010s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.010s][trace][os,container] Path to /memory.use_hierarchy is /sys/fs/cgroup/memory/memory.use_hierarchy
[0.011s][trace][os,container] Use Hierarchy is: 1
[0.011s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.011s][trace][os,container] Memory Limit is: 9223372036854771712
[0.011s][trace][os,container] Non-Hierarchical Memory Limit is: Unlimited
[0.011s][trace][os,container] Path to /memory.stat is /sys/fs/cgroup/memory/memory.stat
[0.012s][trace][os,container] Hierarchical Memory Limit is: 9223372036854771712
[0.012s][trace][os,container] Hierarchical Memory Limit is: Unlimited
[0.014s][trace][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.014s][trace][os,container] CPU Quota is: -1
[0.014s][trace][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.014s][trace][os,container] CPU Period is: 100000
[0.014s][trace][os,container] Path to /cpu.shares is /sys/fs/cgroup/cpu,cpuacct/cpu.shares
[0.014s][trace][os,container] CPU Shares is: 1024
[0.014s][trace][os,container] OSContainer::active_processor_count: 8
[0.014s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 8
[0.014s][debug][os,container] container memory limit unlimited: -1, using host value
[0.015s][debug][os,container] container memory limit unlimited: -1, using host value
[0.034s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 8
[0.052s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/memory.limit_in_bytes
[0.052s][trace][os,container] Memory Limit is: 9223372036854771712
[0.052s][trace][os,container] Non-Hierarchical Memory Limit is: Unlimited
[0.052s][trace][os,container] Path to /memory.stat is /sys/fs/cgroup/memory/memory.stat
[0.052s][trace][os,container] Hierarchical Memory Limit is: 9223372036854771712
[0.052s][trace][os,container] Hierarchical Memory Limit is: Unlimited
[0.053s][debug][os,container] container memory limit unlimited: -1, using host value
[0.133s][trace][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
[0.133s][trace][os,container] CPU Quota is: -1
[0.133s][trace][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
[0.133s][trace][os,container] CPU Period is: 100000
[0.133s][trace][os,container] Path to /cpu.shares is /sys/fs/cgroup/cpu,cpuacct/cpu.shares
[0.133s][trace][os,container] CPU Shares is: 1024
[0.133s][trace][os,container] OSContainer::active_processor_count: 8
openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)
[sgehwolf@t580-laptop builds]$ nproc
8

--------------------------
cgroup v2:
--------------------------
$ sudo podman run --rm -ti --cpu-shares=2600 -v $(pwd)/jdk-17.0.1+12/:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
[0.001s][trace][os,container] Raw value for memory limit is: max
[0.001s][trace][os,container] Memory Limit is: Unlimited
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
[0.001s][trace][os,container] Raw value for CPU quota is: max
[0.001s][trace][os,container] CPU Quota is: -1
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
[0.001s][trace][os,container] CPU Period is: 100000
[0.001s][trace][os,container] Path to /cpu.weight is /sys/fs/cgroup//cpu.weight
[0.001s][trace][os,container] Raw value for CPU shares is: 100
[0.001s][debug][os,container] CPU Shares is: -1
[0.001s][trace][os,container] OSContainer::active_processor_count: 4
[0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
[0.001s][debug][os,container] container memory limit unlimited: -1, using host value
[0.001s][debug][os,container] container memory limit unlimited: -1, using host value
[0.002s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
[0.008s][debug][os,container] container memory limit unlimited: -1, using host value
[0.014s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
[0.023s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
[0.023s][trace][os,container] Raw value for memory limit is: max
[0.023s][trace][os,container] Memory Limit is: Unlimited
[0.023s][debug][os,container] container memory limit unlimited: -1, using host value
openjdk 17.0.1 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
[0.024s][debug][os,container] container memory limit unlimited: -1, using host value
$ nproc
4

The actual intention is to get the same results for CPU limits (be it shares/limit or both) with the same CLI options passed to docker/podman.
Example 2 CPUs:

$ sudo podman run --rm -ti --cpu-shares=2048 -v $(pwd)/jdk-17.0.1+12/:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version | grep -E '(active_processor_count|Detected)'
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] OSContainer::active_processor_count: 2
[0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2
[0.002s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2
[0.014s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2

$ sudo podman run --rm -ti --cpu-shares=2048 -v $(pwd)/openjdk-17.0.2+8/:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version | grep -E '(active_processor_count|Detected)'
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][trace][os,container] OSContainer::active_processor_count: 2
[0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2
[0.002s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2
[0.010s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 2

We assert that in our container tests. Yes, for 1024 (cg1) and for 2600 (cg2) in the shares-only case it seems weird, but that's where overrides come in (like ActiveProcessorCount). In the Kubernetes case it's quite common to have a cpu limit (quota) and cpu requests (shares) both specified. Shares alone is usually quite rare.

[1] https://github.com/containers/crun/blob/main/crun.1.md#cpu-controller
03-02-2022

Relying on an arbitrary constant like PER_CPU_SHARES=1024 adds an extra dependency into the JVM code. When docker runs on top of cgroup v1, --cpu-shares=1024 might have translated to cpu.shares=1024, which is the default value for cpu.shares in cgroups v1. However, cpu.shares is replaced by cpu.weight in cgroup v2. When docker runs on top of cgroup v2, you need to specify --cpu-shares=2600 in order to get the default value of 100 for cpu.weight.

Therefore, if a Java-based container had relied on the fact that "--cpu-shares=1024 means no CPU quota", after switching to cgroups v2 this container will be limited to 1 CPU.

===============================================================================
$ CMD="echo 'System.out.println(\"getParallelism() = \" + ForkJoinPool.commonPool().getParallelism())' | jshell -J-Xlog:os+container=trace::none -"

$ docker run --cpu-shares=1024 --rm -it docker.io/library/openjdk:17-jdk-slim bash -c "$CMD" | egrep -i '(getPar)|(Raw.*CPU.sha)' | sort -r | uniq
Raw value for CPU shares is: 39
getParallelism() = 1

$ docker run --cpu-shares=2600 --rm -it docker.io/library/openjdk:17-jdk-slim bash -c "$CMD" | egrep -i '(getPar)|(Raw.*CPU.sha)' | sort -r | uniq
Raw value for CPU shares is: 100
getParallelism() = 31
03-02-2022
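
For context, here is a sketch of the shares-to-weight conversion documented for OCI runtimes such as crun (see the crun.1.md link referenced in the comment above); it reproduces the raw cpu.weight values the JVM reads back in the logs above, e.g. --cpu-shares=1024 -> 39 and --cpu-shares=2600 -> 100. This is an illustration only, not JDK or crun code.

public class SharesToWeight {
    // cgroup v1 cpu.shares (range 2..262144) mapped onto cgroup v2
    // cpu.weight (range 1..10000), using integer arithmetic.
    static long weightFromShares(long shares) {
        return 1 + ((shares - 2) * 9999) / 262142;
    }

    public static void main(String[] args) {
        System.out.println(weightFromShares(1024));  // 39
        System.out.println(weightFromShares(2600));  // 100
    }
}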

The issue of how to interpret cpu shares has been raised and discussed a number of times. As stated in the review thread for 8146115:

> Since the dynamic selection of CPUs based on cpusets, quotas and shares
> may not satisfy every user's needs, I've added an additional flag to allow the
> number of CPUs to be overridden. This flag is named -XX:ActiveProcessorCount=xx.

To quote myself from an internal email chain: "But I think the fundamental problem here - as I've pointed out since day one of container support - is that there is no one-size-fits-all general-purpose conversion from shares/quotas to "number of available processors". The number of threads you want to create relates more to the number of distinct processors you can be concurrently scheduled on than to the proportion of time for which you may be scheduled."

See also: JDK-8197589

The cpu shares value is expected to be relative to the 1024 available shares per CPU, but the calculation doesn't do that:

share_count = ceilf((float)share / (float)PER_CPU_SHARES);

It is missing multiplication by the number of CPUs available! It should be:

share_count = ceilf(((float)share / (float)PER_CPU_SHARES) * cpu_count);
03-02-2022

How is this a different issue to JDK-8279484?
03-02-2022

The attachments cpu-shares-bug.sh and cpu-shares-bug.log.txt show an example of CPU underutilization: two docker containers are executed in parallel, each with --cpu-shares=100, on a host with 32 CPUs.
- When running a native program (/bin/stress), each container utilizes about 16 CPUs
- When running a Java program using ForkJoinPool, each container can utilize only a single CPU.

#----------------------------------------------------------------------
# (2) See how much CPU we can get from with the JavaStress program with two
# containers, each with --cpu-shares=100

# baseline: run a single container without --cpu-shares
Runtime.getRuntime().availableProcessors() = 32
ForkJoinPool.commonPool().getParallelism() = 31
Elapsed = 10001 ms, counter = 280710

real 0m10.257s
user 5m7.419s
sys 0m3.891s
----------------------------------------
+ cat JavaStress.1.log
Runtime.getRuntime().availableProcessors() = 1
ForkJoinPool.commonPool().getParallelism() = 1
Elapsed = 10000 ms, counter = 29208

real 0m10.154s
user 0m22.193s
sys 0m0.501s
----------------------------------------
+ cat JavaStress.2.log
Runtime.getRuntime().availableProcessors() = 1
ForkJoinPool.commonPool().getParallelism() = 1
Elapsed = 10000 ms, counter = 32792

real 0m10.150s
user 0m22.438s
sys 0m0.340s
03-02-2022