JDK-8286030 : Avoid JVM crash when containers share the same /tmp dir
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-05-02
  • Updated: 2025-04-08
  • Resolved: 2022-07-18
Fixed versions:
  • JDK 11: 11.0.19-oracle (Fixed)
  • JDK 17: 17.0.7-oracle (Fixed)
  • JDK 20: 20 b07 (Fixed)
  • JDK 8: 8u381 (Fixed)
Related Reports
Causes :  
Relates :  
Sub Tasks
JDK-8307467 :  
Description
There are some Kubernetes setups that share the same /tmp directory across multiple containers. Such a scenario is currently not supported by the JDK and crashes may happen.

(original report) ========================
We've been seeing intermittent SIGBUS failures on linux with jdk11.  They
all have this distinctive backtrace:

C  [libc.so.6+0x12944d]
V  [libjvm.so+0xcca542]  perfMemory_init()+0x72
V  [libjvm.so+0x8a3242]  vm_init_globals()+0x22
V  [libjvm.so+0xedc31d]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x1ed
V  [libjvm.so+0x9615b2]  JNI_CreateJavaVM+0x52
C  [libjli.so+0x49af]  JavaMain+0x8f
C  [libjli.so+0x9149]  ThreadJavaMain+0x9

Initially, we suspected that /tmp was full but that turned out to not be the case.  After a few more instances of the crash and investigation, we believe we know the root cause.

The crashing applications are all running in a K8s pod, with each JVM in a
separate container:

container_type: cgroupv1 (from the hs_err file)

/tmp is mounted such that it's shared by multiple containers.  Since these
JVMs are running in containers, we believe what happens is the namespaced (i.e. per container) PIDs overlap between different containers - 2 JVMs, in separate containers, can end up with the same namespaced PID.  Since /tmp is shared, they can now "contend" on the same perfMemory file since those file names are PID based.
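The collision described above can be sketched in a few lines. This is only an illustration: the helper name `hsperfdata_path` is hypothetical, the path layout follows the /tmp/hsperfdata_$USER/<pid> form mentioned in the comments, and the user name and PID are made-up values.

```python
import os


def hsperfdata_path(tmpdir: str, user: str, pid: int) -> str:
    """Build a PID-based perf-data path: <tmpdir>/hsperfdata_<user>/<pid>."""
    return os.path.join(tmpdir, f"hsperfdata_{user}", str(pid))


# Two JVMs in different containers. Each sees itself as PID 7 inside its own
# PID namespace, both run as the same user, and both see the same /tmp.
path_a = hsperfdata_path("/tmp", "root", 7)
path_b = hsperfdata_path("/tmp", "root", 7)

# Same namespaced PID plus shared /tmp means the same file name.
collides = path_a == path_b
```

With a private /tmp per container the two paths would live on different filesystems and never meet, which is why the problem only appears when /tmp is shared.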

Once multiple JVMs can contend on the same file, a SIGBUS can arise if one JVM has mmap'd the file and another ftruncate()'s it from under it (e.g.
https://github.com/openjdk/jdk11/blob/37115c8ea4aff13a8148ee2b8832b20888a5d880/src/hotspot/os/linux/perfMemory_linux.cpp#L909 ).
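A minimal sketch of the mmap/ftruncate interaction, assuming Linux and a local filesystem. It is not HotSpot code: the function name is hypothetical, and two file descriptors in one process stand in for the two JVMs. The mapped pages are deliberately never touched after the truncate, because that access is what would raise SIGBUS.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def truncate_under_mapping() -> tuple:
    """Map a one-page file, then truncate it through a second descriptor.

    Returns (mapping_length, file_size_after_truncate).
    """
    fd_a, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd_a, PAGE)         # "JVM A" sizes the file
        mapping = mmap.mmap(fd_a, PAGE)  # ... and maps it (shared by default)
        fd_b = os.open(path, os.O_RDWR)  # "JVM B" opens the same path
        try:
            os.ftruncate(fd_b, 0)        # ... and truncates it
            size_after = os.fstat(fd_a).st_size
            os.ftruncate(fd_b, PAGE)     # restore the size before unmapping
        finally:
            os.close(fd_b)
        length = len(mapping)            # the mapping still spans a full page
        mapping.close()
        return length, size_after
    finally:
        os.close(fd_a)
        os.unlink(path)
```

After the truncate, the mapping still covers a page while the file behind it has zero bytes; reading or writing the mapping in that window is the faulting access.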

As for possible solutions, would it be possible to use the global PID instead of the namespaced PID to "regain" the uniqueness invariant of the PID? Also, might it make sense to flock() the file to prevent another process from mucking with it?
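The flock() idea can be sketched as follows, assuming Linux and a local filesystem. This illustrates the locking pattern only and does not claim to match what HotSpot does; `try_lock` and `demo` are hypothetical names, and two descriptors opened separately in one process stand in for two JVMs.

```python
import fcntl
import os
import tempfile


def try_lock(fd: int) -> bool:
    """Try to take an exclusive, non-blocking flock() on fd."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # Another open file description already holds the lock.
        return False


def demo() -> tuple:
    """Open the same file twice and try to lock it through each descriptor."""
    fd_a, path = tempfile.mkstemp()
    fd_b = os.open(path, os.O_RDWR)
    try:
        first = try_lock(fd_a)   # first opener gets the lock
        second = try_lock(fd_b)  # second opener is refused
        return first, second
    finally:
        os.close(fd_b)
        os.close(fd_a)
        os.unlink(path)
```

A process that fails to get the lock can leave the file alone instead of truncating it, so the holder's mapping stays backed. flock() is advisory, so this only helps if every process touching the file follows the same protocol.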

(Reported by Vitaly Davidovich -- 
https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-April/054921.html )

Manual reproducer:
https://github.com/openjdk/jdk/compare/master...iklam:jdk:8286030-test-case-for-jvm-crash-when-containers-share-tmp-dir?expand=1
Comments
[~ibereziuk] Please don't add jdk8u-fix-yes labels. That label is being used for approval of OpenJDK 8u backports.
04-05-2023

Fix request [11u] I backport this for parity with 11.0.19-oracle. Medium risk. Should only affect perf mem use cases. A clear fix, we should definitely take it. I had to resolve a bit because of different file layout. Test passes on linux. SAP nightly testing passed.
14-02-2023

Fix request [17u] I backport this for parity with 17.0.7-oracle. Medium risk. Should only affect perf mem use cases. A clear fix, we should definitely take it. I had to do a simple resolve. Test passes and fails without the patch. SAP nightly testing passed.
10-02-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1716 Date: 2023-02-10 11:14:46 +0000
10-02-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1150 Date: 2023-02-09 15:00:27 +0000
09-02-2023

I created a draft PR for the fix proposed two comments above: https://github.com/openjdk/jdk/pull/9226 . However, it seems pretty complex and risky (could have backward compatibility issues). Please see that PR for details. I've decided to use this bug for only avoiding the crash (which only happens very rarely by my findings in the previous comment). A comprehensive fix would need to be done in JDK-8289883.
03-11-2022

Changeset: 84f23149 Author: Ioi Lam <iklam@openjdk.org> Date: 2022-07-18 04:10:08 +0000 URL: https://git.openjdk.org/jdk/commit/84f23149e22561173feb0e34bca31a7345b43c89
18-07-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9406 Date: 2022-07-07 06:01:58 +0000
07-07-2022

I couldn't easily reproduce the error with an unmodified JDK. It seems like the crash happens only with very precise timing, when two JVMs have created the hsperfdata file at almost exactly the same time with os::open(). I created a manual reproducer. Please see this link for details: https://github.com/openjdk/jdk/compare/master...iklam:jdk:8286030-test-case-for-jvm-crash-when-containers-share-tmp-dir?expand=1
29-06-2022

Updated proposed fix: https://mail.openjdk.org/pipermail/hotspot-runtime-dev/2022-May/055295.html
14-06-2022

I agree with [~dholmes] that changing the directory for writing the hsperf files is not the way to go. We already have well specified locations for storing the rendezvous files: /proc/*/root/tmp/hsperfdata_*/ The problem we have is that we cannot use getpid() in the filenames because with cgroup-based containers, getpid() is no longer globally unique. That's why I propose using a UUID instead of getpid() in the filenames. See http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2022-May/055050.html Let's continue the discussion on hotspot-runtime-dev. Having two parallel threads in both JBS and e-mail makes the discussion difficult.
05-05-2022
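The UUID proposal in the comment above comes down to a naming change, which can be sketched like this. The function name is hypothetical and this is not the format of any actual patch; it only shows that a random UUID keeps names distinct where namespaced PIDs do not.

```python
import uuid


def perf_file_name() -> str:
    """Hypothetical: name the perf-data file by a random UUID, not the PID."""
    return uuid.uuid4().hex


# Two JVMs that share a namespaced PID would still pick different names.
name_a = perf_file_name()
name_b = perf_file_name()
```

The cost is on the tool side: a tool can no longer derive the file name from a PID, so the JVM and the tools would need some other agreed way to find the file.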

To repeat a comment from JDK-8189674: The JVM doesn't use the -Djava.io.tmpdir setting. We have tried in the past (JDK-6938627) and it isn't workable (JDK-7009828). The tools and the VM that is the target of the tools have to agree on well known file locations and that can't happen if the "tmp" location can be customized.
05-05-2022

The namespace applies to pods. Pods don't generally share /tmp; containers do (at least sometimes as discussed here). Containers are isolated processes within a single pod. I guess conceivably the multiple processes could be rearchitected as pods and things changed for that, but that's a severe restriction on the use of Java within K8S. It does seem like the PID solution is a non-starter. java.io.tmpdir seems a quite simple solution instead.
04-05-2022

There is also the option of sharing the process namespace among multiple containers that need to communicate within a pod: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/#configure-a-pod I believe that would also avoid this issue? The real trouble is that when multiple JVMs run in separate containers yet share /tmp, the uniqueness of the PID goes out the window. AFAIK, there is no way to figure out the host's PID from a PID within a container. Mapping host PID => container PID works, but not the other way round (container PID => host PID).
04-05-2022
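The host-to-container direction mentioned in the comment above can be sketched with the NSpid line of /proc/<pid>/status, which Linux provides since kernel 4.1. The helper name and the sample text are made up for illustration; on a real host the text would be read from /proc/<host pid>/status.

```python
def parse_nspid(status_text: str) -> list:
    """Return the PIDs on the NSpid line, outermost PID namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(token) for token in line.split()[1:]]
    return []  # no NSpid line (for example, an older kernel)


# Made-up sample: host PID 48213 appears as PID 7 inside its container.
sample_status = "Name:\tjava\nPid:\t48213\nNSpid:\t48213\t7\n"
pids = parse_nspid(sample_status)
```

Read from inside the container, the same file lists only the PIDs visible from there, so the host PID does not appear. That matches the asymmetry described in the comment.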

We have the same issue. In a kubernetes environment, /tmp is often mounted using the K8S feature emptyDir, which mounts the same directory from the host into each container within a pod. (https://kubernetes.io/docs/concepts/storage/volumes/). This allows the root filesystem to be mounted read-only for security reasons. It would be great if the hotspot files (hsperf_xxx, .java, etc.) were all placed wherever java.io.tmpdir points to.
03-05-2022

[Per Vitaly Davidovich]: I can't comment on the JBS, but another workaround (which we're employing) is -XX:+PerfDisableSharedMem. Per my understanding, this will prevent certain tools from locating the JVM instance but still allows something like `jcmd` to connect (via an explicitly supplied pid) and read the perf counters.
03-05-2022

> /tmp is mounted such that it's shared by multiple containers.
This seems to be the root cause of the issue. If /tmp is being shared by containers, each of which usually runs only one process (and the user of the container process perhaps matches too), it becomes *very* likely to clash, i.e. /tmp/hsperfdata_root/1 being used by more than one container. The question is why that's being done. Usually in containers the root filesystem, including /tmp, is in a container-unique filesystem, which avoids this whole problem.
03-05-2022

Workaround: If the /tmp/hsperfdata_$USER/<pid> files are not needed (e.g., you don't need to use jcmd to access the containerized JVM processes), you can disable them with the -XX:-UsePerfData flag. This will avoid the crash.
03-05-2022
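One way to apply either workaround flag without editing each launch command is the JAVA_TOOL_OPTIONS environment variable, which the JVM reads at startup. The sketch below only builds the environment; the helper name is hypothetical, and whether an environment variable or a direct command-line flag fits better depends on the deployment.

```python
def with_jvm_flag(env: dict, flag: str) -> dict:
    """Return a copy of env with flag appended to JAVA_TOOL_OPTIONS."""
    new_env = dict(env)
    existing = new_env.get("JAVA_TOOL_OPTIONS", "")
    new_env["JAVA_TOOL_OPTIONS"] = f"{existing} {flag}".strip()
    return new_env


# Disable the hsperfdata files entirely ...
env = with_jvm_flag({}, "-XX:-UsePerfData")
# ... or, alternatively, keep the counters but not the shared file.
alt_env = with_jvm_flag({}, "-XX:+PerfDisableSharedMem")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting the JVM, or the same variable can be set in the container spec.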