JDK-8227006 : [linux] Runtime.availableProcessors execution time increased by factor of 100
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 8u191,10,11,12,13,14
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2019-06-28
  • Updated: 2021-01-04
  • Resolved: 2019-10-31
Backports:
  JDK 11: 11.0.10-oracle (Fixed)
  JDK 13: 13.0.6 (Fixed)
  JDK 14: 14 b22 (Fixed)
  JDK 8: 8u281 (Fixed)
  Other: openjdk8u282 (Fixed)
This does seem to be somewhat hardware dependent - some VMs don't exhibit the problem, and we haven't seen this on Windows yet. We've seen this on Linux 64-bit Openstack VMs, in at least 3 different setups.

We have observed a significant performance regression in the Runtime.availableProcessors native API call.

On JDK versions prior to 8u191, a simple program can achieve upwards of 4 million calls to Runtime.availableProcessors per second on our test hardware.
On 8u191 and all later major and minor versions, including 11, the maximum it can achieve is around 40,000 calls per second: a performance drop of 100x.

This causes performance issues for CompletableFuture.waitingGet, which calls Runtime.availableProcessors in a tight loop as part of its spinwait - which is how we originally found the issue, as our application was exhibiting significant performance problems in our asynchronous code.
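For context, the JDK 8 spin-wait has roughly this shape (a simplified sketch with illustrative names, not the actual CompletableFuture source): the processor count is consulted inside the wait loop, so every blocked get() hammers Runtime.availableProcessors().

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of a spin-wait of the shape the old
// CompletableFuture.waitingGet used. Not the actual JDK source.
public class SpinWaitSketch {
    static <T> T spinGet(AtomicReference<T> result) {
        int spins = -1;
        T value;
        while ((value = result.get()) == null) {
            if (spins < 0) {
                // The call that became ~100x slower after 8u191:
                spins = (Runtime.getRuntime().availableProcessors() > 1)
                        ? 1 << 8 : 0;
            } else if (spins > 0) {
                --spins;
            } else {
                Thread.onSpinWait(); // the real code would park here
            }
        }
        return value;
    }
}
```

If availableProcessors() goes from sub-microsecond to ~25us per call, a loop of this shape slows down by the same factor, which is how the regression surfaced in asynchronous code.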

REGRESSION : Last worked in version 8u192

Run the attached program on JDK 8u182 and on 8u191, on a 64-bit Linux machine.

EXPECTED -
The performance should be the same.
ACTUAL -
The performance drops by a factor of 100.

---------- BEGIN SOURCE ----------
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class Test {
    public static void main(String[] args) throws Exception {
        AtomicBoolean stop = new AtomicBoolean();
        AtomicInteger count = new AtomicInteger();

        // Call availableProcessors() in a tight loop, counting calls.
        // (Loop body reconstructed; the attached source is truncated
        // in this report.)
        new Thread(() -> {
            while (!stop.get()) {
                Runtime.getRuntime().availableProcessors();
                count.incrementAndGet();
            }
        }).start();

        try {
            int lastCount = 0;
            while (true) {
                Thread.sleep(1000);
                int thisCount = count.get();
                System.out.printf("%s calls/sec%n", thisCount - lastCount);
                lastCount = thisCount;
            }
        } finally {
            stop.set(true);
        }
    }
}
---------- END SOURCE ----------

Workaround: None possible.



Fix Request (OpenJDK 8u): Posting this comment on behalf of Jon Dowland. Please approve backporting this to OpenJDK 8u. We see 8u281. The JDK 11 patch doesn't apply cleanly, since JDK-8232207 isn't applicable to JDK 8u and the debug logging code is different. The OpenJDK 8u patch has been posted for review and has been reviewed by Severin Gehwolf. Testing: reproducer from this bug. Risk should be low.
RFR: https://mail.openjdk.java.net/pipermail/jdk8u-dev/2020-November/012969.html
webrev: https://cr.openjdk.java.net/~shade/8227006/webrev.8u.01/

Fix Request (13u) Requesting backport to 13u for parity with 11u. The patch applies cleanly. Tested with tier1, containers tests and reproducer from this bug.

Fix request (11u) -- will label after testing is completed. I would like to downport this for parity with 11.0.10-oracle. Applies cleanly.

URL: https://hg.openjdk.java.net/jdk/jdk/rev/c440a6b4e096 User: bobv Date: 2019-10-31 20:36:20 +0000

The reason for this is that os::Linux::active_processor_count() is expensive too. After this patch it's only called if the cached value has expired. The best case for Runtime.getRuntime().availableProcessors() is setting the processor count statically with -XX:ActiveProcessorCount=X. A run with that gives these results for the unpatched build (the UseContainerSupport flag doesn't matter in this case):

$ java -XX:ActiveProcessorCount=8 Test
28421365 calls/sec
28726169 calls/sec
28415790 calls/sec
28385206 calls/sec
28362712 calls/sec
28528136 calls/sec
28730125 calls/sec
28747870 calls/sec
28751748 calls/sec
28433800 calls/sec
28737019 calls/sec
28723038 calls/sec

Adding a single cache of active_processor_count with a 20ms timeout results in the following improvements.

ORIGINAL SCORES
39346 calls/sec
40007 calls/sec
39708 calls/sec
39732 calls/sec
39736 calls/sec
39629 calls/sec

DISABLE CONTAINER SUPPORT (-XX:-UseContainerSupport)
1106795 calls/sec
1135589 calls/sec
1123782 calls/sec
1123605 calls/sec
1123343 calls/sec
1123135 calls/sec

20MS TIMEOUT CACHE
9867172 calls/sec
10077977 calls/sec
9989674 calls/sec
9983963 calls/sec
9984400 calls/sec
9985898 calls/sec
9985123 calls/sec
9985608 calls/sec

So the cache actually improves performance beyond the pre-container-support level by 8x.

-XX:-UseContainerSupport:
2046950 calls/sec
2077786 calls/sec
2061594 calls/sec
1965376 calls/sec
2046164 calls/sec
2046057 calls/sec

Patch from the previous comment and -XX:+UseContainerSupport:
1799371 calls/sec
1813010 calls/sec
1784729 calls/sec
1784715 calls/sec
1792153 calls/sec
1812422 calls/sec
1811513 calls/sec

Based on the idea of JDK-8232207. If we cache the container limits for some time (e.g. 100ms in this case) and Runtime.getRuntime().availableProcessors() is being called in a tight loop, we see a large improvement: two orders of magnitude slower becomes ~10% slower. Yet updates would still be visible reasonably fast, and it works inside and outside containers. Based on the webrev in: http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-October/039776.html with this PoC: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8227006/02/webrev/
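The time-based caching idea itself is simple; a minimal standalone sketch (hypothetical class, not the hotspot code, which does this in C++ without a lock):

```java
import java.util.function.Supplier;

// Minimal time-based cache: re-runs the expensive supplier only when
// the cached value is older than ttlNanos. Illustrative sketch only.
public class TimedCache<T> {
    private final Supplier<T> source;
    private final long ttlNanos;
    private long lastRefresh;
    private T value;

    public TimedCache(Supplier<T> source, long ttlNanos) {
        this.source = source;
        this.ttlNanos = ttlNanos;
    }

    public synchronized T get() {
        long now = System.nanoTime();
        if (value == null || now - lastRefresh > ttlNanos) {
            value = source.get();   // the expensive read, e.g. cgroup files
            lastRefresh = now;
        }
        return value;
    }
}
```

With a 100ms TTL, a tight availableProcessors() loop hits the supplier at most ten times per second while updates still become visible within one TTL.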

JDK-8232207 seems related.

If we could determine that docker update is not used in practice, I would just implement caching and skip the .dockerenv checking altogether. There's just no practical way to determine if this feature is used. The docker update docs state that this feature is not available for Windows containers, but I just checked and this feature works on the newer kata containers.

Do we really want to add another VM flag (-XX:-CacheResourceLimits)? We could make a case that getting updated values for resource limits requires a VM restart, since we don't resize the Java heap or adjust the thread pools we've created. The only problem with this is the fact that Runtime.availableProcessors is supposed to report updated values: "This value may change during a particular invocation of the virtual machine. Applications that are sensitive to the number of available processors should therefore occasionally poll this property and adjust their resource usage appropriately."

A compromise might be to cache memory and CPU, but if the number of available physical processors changes after we've cached the computed value, we'd re-read cpu-shares and cpu-quota as well to come up with a new value to cache. This would provide the pre-container functionality, plus the container support, without the performance impact.

Checking for .dockerenv and /run/.containerenv seems reasonable. That way, host systems don't pay the extra penalty with the default -XX:+UseContainerSupport. It'd be a good first step. Solving the perf problem for containers themselves could be a next step, perhaps. Speaking of that, I'm questioning how much the "docker update" command is actually used in practice. Currently, that's the only argument against caching limit values at init time. The usual container/cloud mantra seems to be to kill the running container if it's misbehaving and fire up another instance of it. FWIW, podman doesn't have an "update" command AFAICS. Perhaps it would be nice if it was at least configurable. That way, some (most?) container use cases wouldn't pay the extra bill for the cases where they don't use "docker update" and use restarts/redeployments instead.

I don't think checking for unlimited memory and full processor access is the best solution. These limits can change at any time, which means we'd have to do even more checking on each call to get the available processor count. I tried caching the full path to the /proc files we need in order to satisfy the call to get the availableProcessors, and this only saved about 8%. I think it might be best to check for the /.dockerenv file in order to determine if we are containerized. This file seems to be present even in the most recent version of docker and shows up in containers of all the container images we checked (ubuntu, oraclelinux, fedora, debian). If we also want to support podman, we need to look for /run/.containerenv in addition to /.dockerenv. The downside of this approach is that it doesn't solve the performance problem when running in containers. We'll have to check whether the reporter of this bug was running on a Linux host and not in containers.
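The marker-file check described above amounts to something like the following (hypothetical helper, not the hotspot implementation; parameterized on a root directory so it can be exercised outside a container):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the discussed heuristic: /.dockerenv is created by
// docker, /run/.containerenv by podman. Neither existing suggests
// we are on a plain host.
public class ContainerEnvCheck {
    static boolean looksContainerized(Path root) {
        return Files.exists(root.resolve(".dockerenv"))
            || Files.exists(root.resolve("run/.containerenv"));
    }
}
```

In real use the root would be "/"; as the comment notes, this only helps host systems and leaves the in-container cost unchanged.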

Another issue of OSContainer::is_containerized() returning true on the host system is that hs_err log files will get cgroup info printed. I believe the intention was for that to only show up when actually run in a container. Perhaps is_containerized() should be false iff:

  OSContainer::memory_limit_in_bytes() == unlimited &&
  OSContainer::active_processor_count() == os::Linux::active_processor_count()

No, I haven't seen shared mount points, but I don't think it's wise to make assumptions about possible use cases we haven't seen. All of the propagation types I've seen so far have "master:X"; per the mountinfo documentation, "master:X" means this mount is a slave to shared peer group X.

Right, thanks. Have you seen evidence of 2.? The main idea of that patch would be to get OSContainer::is_containerized() to return false on the host system and true inside a container.

Although it's a clever idea, it has a few drawbacks.
1. It doesn't solve the performance problem when Java is run in containers.
2. I'd be concerned that the shared mount points might be used when there are multiple containers in a single Kubernetes pod sharing resources. This might happen when side-cars are used.
I'd be more inclined to try to reduce the extra overhead by caching the full path to the /proc file we are after. We currently cache the subsystem path but construct the path to the individual files each time.

Here is a wild idea. It looks like host systems have cgroups mounted as shared ("shared:XX" in /proc/self/mountinfo), while containerized systems don't seem to; I haven't seen this being the case when actually running in a container. So here is a PoC patch: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8227006/01/webrev/ It fixes the attached test program and passes container tests (podman and docker) here, except for TestDockerCpuMetrics.java, which seems to assume Metrics.systemMetrics() returns non-null on host systems (no longer the case).

Host system:

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -Xlog:os+container=trace -version
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.000s][debug][os,container] Found optional filesystem tag 'shared:5' which indicates non-containerized system.
openjdk version "14-internal" 2020-03-17
OpenJDK Runtime Environment (build 14-internal+0-adhoc.sgehwolf.openjdk-head)
OpenJDK 64-Bit Server VM (build 14-internal+0-adhoc.sgehwolf.openjdk-head, mixed mode, sharing)

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XshowSettings:system -version
Operating System Metrics:
    No metrics available for this platform
openjdk version "14-internal" 2020-03-17
OpenJDK Runtime Environment (build 14-internal+0-adhoc.sgehwolf.openjdk-head)
OpenJDK 64-Bit Server VM (build 14-internal+0-adhoc.sgehwolf.openjdk-head, mixed mode, sharing)

Container:

$ sudo podman run -ti --rm --memory 500M fedora-30-jdk:latest /bin/bash
[root@8db76d216e8d /]# ./jdk/bin/java -Xlog:os+container=debug -version
[0.001s][info][os,container] Memory Limit is: 524288000
openjdk version "14-internal" 2020-03-17
OpenJDK Runtime Environment (build 14-internal+0-adhoc.sgehwolf.openjdk-head)
OpenJDK 64-Bit Server VM (build 14-internal+0-adhoc.sgehwolf.openjdk-head, mixed mode, sharing)
[root@8db76d216e8d /]# ./jdk/bin/java -XshowSettings:system -version
Operating System Metrics:
    Provider: cgroupv1
    Effective CPU Count: 8
    CPU Period: 100000us
    CPU Quota: -1
    CPU Shares: -1
    List of Processors, 8 total:
    0 1 2 3 4 5 6 7
    List of Effective Processors, 8 total:
    0 1 2 3 4 5 6 7
    List of Memory Nodes, 1 total:
    0
    List of Available Memory Nodes, 1 total:
    0
    CPUSet Memory Pressure Enabled: false
    Memory Limit: 500.00M
    Memory Soft Limit: Unlimited
    Memory & Swap Limit: 1000.00M
    Kernel Memory Limit: Unlimited
    TCP Memory Limit: Unlimited
    Out Of Memory Killer Enabled: true
openjdk version "14-internal" 2020-03-17
OpenJDK Runtime Environment (build 14-internal+0-adhoc.sgehwolf.openjdk-head)
OpenJDK 64-Bit Server VM (build 14-internal+0-adhoc.sgehwolf.openjdk-head, mixed mode, sharing)

Thoughts?
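The tag the PoC keys off lives in the optional-fields section of a /proc/self/mountinfo line, between the mount options and the " - " separator. A standalone sketch of that parsing (hypothetical names; the PoC does this in hotspot C++):

```java
// Sketch: detect the "shared:N" propagation tag in one mountinfo
// line. Optional fields sit before the " - " separator; fields after
// it belong to the superblock and must not be scanned.
public class MountinfoCheck {
    static boolean hasSharedTag(String mountinfoLine) {
        int sep = mountinfoLine.indexOf(" - ");
        String head = sep >= 0 ? mountinfoLine.substring(0, sep)
                               : mountinfoLine;
        for (String field : head.split("\\s+")) {
            if (field.startsWith("shared:")) {
                return true;
            }
        }
        return false;
    }
}
```

A "shared:N" tag marks membership in a shared peer group, which the PoC treats as a hint that we are on the host rather than in a container.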

I don't think polling the file system and checking the modification time, or using inotify, would work for the /sys (or /proc) file systems.

See also this related discussion (showing up via jshell, which seems to call os::available_memory() frequently via the JIT): http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2019-September/035829.html

So anything calling any of these three hotspot runtime functions is affected on Linux:
  os::active_processor_count
  os::physical_memory
  os::available_memory

I wonder if the dynamic-update nature of these things in a container context is used much. Perhaps more fine-grained control would be good to have. For example, cache values on the premise that restarts would be needed for dynamic updates to take effect (a common use-case in cloud environments). It could be controllable by a flag, DynamicCgroupsUpdates for example, defaulting to false. Another alternative could be to try hooking this up via inotify. It's a Linux-only API, but so is the container detection code.

ILW = MML = P4

The only other idea I have is to check the modification time on the cgroup files to see if they have been recently updated. We can use this information to cache the values and avoid opening/reading/parsing the data. I'm not sure how much time this would save.
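Sketched below as a hypothetical helper; note the earlier comment in this thread doubting that /proc and /sys pseudo-files report useful modification times, so this is illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Illustrative only: re-read a limits file just when its mtime moves.
// Pseudo-filesystems may not update mtime, so this is a sketch of the
// idea, not a workable hotspot fix.
public class MtimeCachedFile {
    private final Path file;
    private FileTime lastSeen;
    private String cached;

    MtimeCachedFile(Path file) {
        this.file = file;
    }

    synchronized String read() throws IOException {
        FileTime mtime = Files.getLastModifiedTime(file);
        if (cached == null || !mtime.equals(lastSeen)) {
            cached = new String(Files.readAllBytes(file)); // expensive parse
            lastSeen = mtime;
        }
        return cached;
    }
}
```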

Unfortunately my solution won't work. Docker provides an update command that can alter the resource limits of a running container. https://docs.docker.com/engine/reference/commandline/update/

Can we infer that if all the cgroup values we read are "default" or "unlimited" then we are not running in a container? Can those things be changed via cgroup APIs when the process is running?

So far I haven't found a reliable way of detecting that our process is running in a container.
- Checking for the existence of /.dockerenv or "docker" in /proc/1/cgroup is very docker specific.
- Checking to see if cat /proc/1/sched == 1 works on Ubuntu but doesn't work on Oracle Linux.

I'm more concerned that is_containerized() returns true when you are not running in a container!

I don't think we can assume things are constant.

Here's a possible change that eliminates the additional overhead. This change does assume that the cpu resource limits are constant. I've measured it with the supplied test case and I no longer see any regression.

--- a/src/hotspot/os/linux/osContainer_linux.cpp  Wed Jun 26 13:18:38 2019 -0400
+++ b/src/hotspot/os/linux/osContainer_linux.cpp  Mon Jul 01 16:06:23 2019 +0000
@@ -581,22 +581,36 @@
  * return:
  *    number of CPUs
  */
+static bool cpu_limits_initialized = false;
+static int quota_cache;
+static int period_cache;
+static int share_cache;
+static int last_processor_count;
+static int last_active_cpu_count;
+
 int OSContainer::active_processor_count() {
   int quota_count = 0, share_count = 0;
-  int cpu_count, limit_count;
+  int limit_count;
   int result;
 
-  cpu_count = limit_count = os::Linux::active_processor_count();
-  int quota  = cpu_quota();
-  int period = cpu_period();
-  int share  = cpu_shares();
+  if (cpu_limits_initialized) {
+    int cpu_count = os::Linux::active_processor_count();
+    if (cpu_count == last_active_cpu_count) {
+      return last_processor_count;
+    }
+  }
 
-  if (quota > -1 && period > 0) {
-    quota_count = ceilf((float)quota / (float)period);
+  last_active_cpu_count = limit_count = os::Linux::active_processor_count();
+  int quota_cache = cpu_quota();
+  int period_cache = cpu_period();
+  int share_cache = cpu_shares();
+
+  if (quota_cache > -1 && period_cache > 0) {
+    quota_count = ceilf((float)quota_cache / (float)period_cache);
     log_trace(os, container)("CPU Quota count based on quota/period: %d", quota_count);
   }
-  if (share > -1) {
-    share_count = ceilf((float)share / (float)PER_CPU_SHARES);
+  if (share_cache > -1) {
+    share_count = ceilf((float)share_cache / (float)PER_CPU_SHARES);
     log_trace(os, container)("CPU Share count based on shares: %d", share_count);
   }
 
@@ -616,8 +630,11 @@
     limit_count = share_count;
   }
 
-  result = MIN2(cpu_count, limit_count);
+  result = MIN2(last_active_cpu_count, limit_count);
   log_trace(os, container)("OSContainer::active_processor_count: %d", result);
+
+  last_processor_count = result;
+  cpu_limits_initialized = true;
   return result;
 }

Nit: -XX:-UseContainerSupport is too aggressive for a workaround, as it disables a lot. -XX:ActiveProcessorCount=# is a much more lightweight workaround.

Hi Bob, this is P3 due to the regression in 8u191/10 GA. Feel free to change it, as it is specific to Linux and "-XX:-UseContainerSupport" is available as a workaround.

A few comments.
- Linux systems that have cgroups enabled typically run every process under control of cgroups. Thus they appear as if they are "containerized". I don't want to be docker specific and look for docker environment variables. Our support is cgroup specific.
- Since there is a work-around of using -XX:-UseContainerSupport, should this bug be a P3?
- I agree with David that we cannot cache the value of available processors since the spec states that the value can change.
- The calculation required to determine the available processor count requires reading several files, including cpu_shares, cpu_quota and cpu_period. I'd have to do some research, but we may be able to cache these values, which would speed up this operation. We could also cache the last value of os::Linux::active_processor_count and the last value returned from OSContainer::active_processor_count. If os::Linux::active_processor_count doesn't change, simply return the last returned value. This should result in a very small incremental performance difference from the non-container speed.

I just filed JDK-8227019 at the same time. :)

I submitted JDK-8227018 for dealing with CompletableFuture in 8u.

My system apparently runs with cgroups available, so the container detection code believes it is running in a container. That, I believe, is a bug. It is very unfortunate it goes into OSContainer::active_processor_count(), which costs a lot. I guess fixing that would not help the original issue, as the reporter seems to run with containers. I looked briefly at improving OSContainer::active_processor_count(), but it seems to be bottlenecked on cgroup reads from the (pseudo)filesystem, so no help there. Also looked at JDK-8134851, which removes Runtime.availableProcessors from the path we care about, but that is not easily backportable. It seems to me caching the active_processors_count is the lesser evil here.

Note that CompletableFuture performance issue mentioned in the report is fixed with JDK-8134851: there is no Runtime.availableProcessors() on that path since 9.

I've reset the bug so that it will be seen by our triage team.

The usage in CompletableFuture only checks for processors > 1 and so can be easily factored out into a static field. The only problematic case is a dynamic switch from >1 processors to exactly 1 and that is an unlikely scenario. If it did occur we will get a small slowdown. I think a separate bug can be filed for that.
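The factoring described amounts to sampling the count once at class initialization. A sketch of the idea (illustrative names; JDK-8227018 carries the actual 8u change):

```java
// Sample the processor count once at class load. A dynamic switch
// from >1 processors down to exactly 1 afterwards only costs some
// needless spinning, as argued above.
public class SpinPolicy {
    static final int SPINS =
        (Runtime.getRuntime().availableProcessors() > 1) ? 1 << 8 : 0;
}
```

Reading a static final field is free at each wait, so the per-iteration availableProcessors() call disappears entirely from the spin loop.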

Okay, I am out of options at the moment then. Maybe Bob has more ideas here.

Caching the value can lead to incorrect behaviour which is not acceptable.

activeProcessorCount() is not expected to be a "hot" method, but the additional cost from the container support code is unfortunate. However, I don't agree with the proposed time-based caching mechanism. If native code is modifying the available processors then the result needs to be reflected by availableProcessors() immediately.

> I also think the code mistakenly believes we are running in containerized mode (OSContainer::is_containerized() returns true),

Yes, I think this is a bug. is_containerized() returns true if all of the cgroup filesystem entry points exist, but that in itself does not mean you are executing in a container.

Another workaround: set -XX:ActiveProcessorCount=# manually. This would not enter the slow path.

This patch caches the active processors count regardless of container support: http://cr.openjdk.java.net/~shade/8227006/webrev.01/ It drops Runtime.availableProcessors() timing down to 0.05us (300x better than -XX:+UseContainerSupport, 10x better than -XX:-UseContainerSupport).

I also think the code mistakenly believes we are running in containerized mode (OSContainer::is_containerized() returns true) and calls OSContainer::active_processor_count(), which is expensive. Runtime.availableProcessors() costs ~15us on default jdk/jdk for me, and ~0.5us with -XX:-UseContainerSupport.

This is a regression starting from 8u191 b02 / 10 b34, due to JDK-8146115 (Improve docker container detection and resource configuration usage).

8u191 b01 - Pass
8u191 b02 - Fail  <== regression started here
11 GA - Fail
12/13/14 - Fail

I did some debugging to find the root cause of the performance degradation. The call to "OSContainer::is_containerized()" is a bit expensive, which in turn is taking the time. As I am running in a non-containerized environment on Linux, with the below patch I get the required performance back:

diff -r 5ce081828739 src/hotspot/os/linux/os_linux.cpp
--- a/src/hotspot/os/linux/os_linux.cpp Sat Jun 29 05:26:55 2019 +0200
+++ b/src/hotspot/os/linux/os_linux.cpp Sat Jun 29 20:55:02 2019 -0700
@@ -5339,11 +5339,11 @@
   }
 
   int active_cpus;
-  if (OSContainer::is_containerized()) {
+/*  if (OSContainer::is_containerized()) {
     active_cpus = OSContainer::active_processor_count();
     log_trace(os)("active_processor_count: determined by OSContainer: %d", active_cpus);
-  } else {
+  } else*/ {
     active_cpus = os::Linux::active_processor_count();
   }

Hi [~bobv] please review.