JDK-8193521 : glibc wastes memory with default configuration
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 9,10
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • OS: linux
  • Submitted: 2017-12-14
  • Updated: 2024-02-06
  • Resolved: 2020-01-21
Description
By default, glibc reserves a new 128 MB malloc arena for every thread (up to a limit, by default 8 * processor count).
This works well for applications whose threads perform many concurrent mallocs, but it does not fit the JVM, which has its own memory management and tends to allocate fewer, larger chunks.
(See the glibc source: __libc_malloc calls arena_get2 in malloc.c, which calls _int_new_arena in arena.c.)
Using only one arena significantly reduces the virtual memory footprint. For the JVM itself, saving memory seems more valuable than optimizing concurrent mallocs.

Note: The first malloc in each thread triggers a 128 MB mmap; typically this first malloc comes from the initialization of thread-local storage.
#0  __mmap (addr=addr@entry=0x0, len=len@entry=134217728, prot=prot@entry=0, flags=flags@entry=16418, fd=fd@entry=-1, offset=offset@entry=0) at ../sysdeps/unix/sysv/linux/wordsize-64/mmap.c:33
#1  0x00007ffff72403d1 in new_heap (size=135168, size@entry=2264, top_pad=<optimized out>) at arena.c:438
#2  0x00007ffff7240c21 in _int_new_arena (size=24) at arena.c:646
#3  arena_get2 (size=size@entry=24, avoid_arena=avoid_arena@entry=0x0) at arena.c:879
#4  0x00007ffff724724a in arena_get2 (avoid_arena=0x0, size=24) at malloc.c:2911
#5  __GI___libc_malloc (bytes=24) at malloc.c:2911
#6  0x00007ffff7de9ff8 in allocate_and_init (map=<optimized out>) at dl-tls.c:603
#7  tls_get_addr_tail (ti=0x7ffff713e100, dtv=0x7ffff0038890, the_map=0x6031a0) at dl-tls.c:791
#8  0x00007ffff6b596ac in Thread::initialize_thread_current() () from openjdk10/lib/server/libjvm.so

There are basically two issues:
- Virtual memory: it becomes much larger than needed. This is a problem for users with a reduced ulimit, for cloud applications in containers, and for embedded systems.
- Physical memory: the waste here is not as large, but if the JVM handles all performance-critical allocations through its own management anyway, the savings should still be worthwhile.
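
A minimal reproduction sketch of the behaviour described above (thread count and allocation size are arbitrary, not taken from the report): each thread performs one small malloc, which may create a per-thread arena, and the program prints VmSize and VmRSS from /proc/self/status so the virtual/physical difference is visible.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 32

static void print_mem(const char *when) {
    /* Print VmSize (virtual) and VmRSS (physical) from /proc/self/status. */
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f) return;
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("%s %s", when, line);
    fclose(f);
}

static void *worker(void *arg) {
    /* The first malloc in this thread may create a new arena. */
    void *p = malloc(24);
    free(p);
    sleep(2);               /* keep the thread alive while main prints */
    return arg;
}

int main(void) {
    pthread_t t[NTHREADS];
    print_mem("before:");
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    sleep(1);               /* let all threads perform their first malloc */
    print_mem("after: ");
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Compiled with gcc -pthread, running it once as-is and once with MALLOC_ARENA_MAX=1 in the environment should make the VmSize difference visible, while VmRSS stays comparatively small in both cases.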

Comments
Runtime Triage: This is not on our current list of priorities. We will consider this feature if we receive additional customer requirements.
21-01-2020

"Normally active processor count (at least on Linux) was correct even before 8182070 since cgroups set affinity mask. " That's only true if CPU limits are imposed via --cpuset-cpus in docker, though. The common case in today's cloud world is --cpu-shares and --cpu-quota. $ sudo docker run -d fedora-28-jdks-hellowait:v1 java HelloWait b6904fb5ff3fc744077e08946da8b652b8a782519135964afc80ac223442932c $ sudo docker exec -ti b6904fb5ff3fc744077e08946da8b652b8a782519135964afc80ac223442932c /bin/bash [root@b6904fb5ff3f /]# jps 1 HelloWait 38 Jps [root@b6904fb5ff3f /]# taskset -p 1 pid 1's current affinity mask: ff $ sudo docker run --cpuset-cpus=1,0 -d fedora-28-jdks-hellowait:v1 java HelloWait ba96aa6fa1d2b016ad1c0306bbe6cc28cdde47b8d279762bcf722363edb54d0a $ sudo docker exec -ti ba96aa6fa1d2b016ad1c0306bbe6cc28cdde47b8d279762bcf722363edb54d0a /bin/bash [root@ba96aa6fa1d2 /]# jps 1 HelloWait 34 Jps [root@ba96aa6fa1d2 /]# taskset -p 1 pid 1's current affinity mask: 3 [root@ba96aa6fa1d2 /]# exit $ sudo docker run --cpu-quota=200000 -d fedora-28-jdks-hellowait:v1 java HelloWait 5f949f0f4e69426e4c6f23225b13be2cfcdbcb9abdf7af3ccdafd841478a6008 $ sudo docker exec -ti 5f949f0f4e69426e4c6f23225b13be2cfcdbcb9abdf7af3ccdafd841478a6008 /bin/bash [root@5f949f0f4e69 /]# jps 1 HelloWait 31 Jps [root@5f949f0f4e69 /]# taskset -p 1 pid 1's current affinity mask: ff [root@5f949f0f4e69 /]# java RuntimeProc >>> Available processors: 2 <<<< [root@5f949f0f4e69 /]# cat RuntimeProc.java public class RuntimeProc { public static void main(String[] args) { int availProc = Runtime.getRuntime().availableProcessors(); System.out.println(">>> Available processors: " + availProc + " <<<<"); } } [root@5f949f0f4e69 /]#
16-01-2019
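
A small, hypothetical C probe of the same point: it prints the affinity-mask CPU count (what taskset reflects) and the online CPU count. Under --cpuset-cpus the affinity count shrinks, while under --cpu-quota neither value changes and only cgroup-aware logic such as the JVM's availableProcessors() sees the limit.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) == 0)
        printf("affinity mask CPUs: %d\n", CPU_COUNT(&set));
    printf("online CPUs:        %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    /* With --cpu-quota neither value reflects the quota; only cgroup-aware
       code (like the JVM's container detection) does. */
    return 0;
}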

Normally the active processor count (at least on Linux) was correct even before 8182070, since cgroups set the affinity mask. The user might also have tuned MALLOC_ARENA_MAX via environment variables, so we should not blindly set it. So only if we are using glibc, using glibc malloc, not running under a custom launcher, and the user has not set the env vars do I think it's fine to set it. Which makes the launcher the best place, IMHO.
16-01-2019
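
A hedged sketch of the guard logic suggested here (helper name and placement are assumptions, not actual launcher code): only touch the arena limit when building against glibc and when the user has not already configured it via the environment.

#include <stdlib.h>
#ifdef __GLIBC__
#include <malloc.h>   /* mallopt and M_ARENA_MAX are glibc-specific */
#endif

/* Hypothetical helper a launcher could call before the VM creates threads. */
static void maybe_limit_malloc_arenas(void) {
#ifdef __GLIBC__
    /* Respect an explicit user configuration. */
    if (getenv("MALLOC_ARENA_MAX") != NULL)
        return;
    /* Note: this is a compile-time check only; a replacement allocator
       pulled in via LD_PRELOAD could still be in use at run time. */
    /* Cap arena creation; arenas that already exist are unaffected. */
    mallopt(M_ARENA_MAX, 1);
#endif
}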

It is virtual address space. The reason I created this issue is that a user complained about insane virtual memory occupation for a tiny application which starts quite a few threads but shouldn't consume much memory. In some environments (containers), processes with too high virtual memory usage may get killed. I think Severin's proposal would be helpful for this case. I guess the situation has improved a little bit since we now start fewer compiler and GC threads.
16-01-2019

The 128 MiB are used to obtain a 64 MiB-aligned address space window. In that window, only a few pages are accessed in the beginning. We have seen reports (which we could not reproduce) that this is sufficient to trigger hugepages in an x86-64 Linux kernel, in which case more than a few 4 KiB pages are used, of course. If you have workloads that show pathological behavior in glibc malloc, please contact me or DJ Delorie <dj@redhat.com>. We are collecting such workloads and trying to improve malloc behavior where we can.
16-01-2019
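
A simplified illustration of the alignment technique described in this comment (a sketch only, not glibc's actual new_heap code): reserve twice the window size with PROT_NONE, trim to a 64 MiB-aligned window, and commit only the first pages, so the reservation costs address space but normally almost no physical memory.

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define HEAP_SIZE (64UL * 1024 * 1024)   /* 64 MiB window, as described above */

int main(void) {
    /* Reserve twice the window size so an aligned window must exist inside. */
    char *raw = mmap(NULL, 2 * HEAP_SIZE, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED) { perror("mmap"); return 1; }

    uintptr_t start = ((uintptr_t)raw + HEAP_SIZE - 1) & ~(HEAP_SIZE - 1);
    char *aligned = (char *)start;
    size_t head = (size_t)(aligned - raw);
    size_t tail = HEAP_SIZE - head;

    /* Give back the unaligned leading and trailing parts of the reservation. */
    if (head) munmap(raw, head);
    if (tail) munmap(aligned + HEAP_SIZE, tail);

    /* Commit only the first few pages; the rest stays PROT_NONE, costing
       address space but (normally) no physical memory. */
    mprotect(aligned, 4 * 4096, PROT_READ | PROT_WRITE);

    printf("aligned heap window at %p\n", (void *)aligned);
    return 0;
}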

The 128 MB should only be virtual address space? Or is someone touching those pages? Also note that mallopt is glibc-specific; with another allocator and/or another libc it doesn't work or doesn't exist. Another issue is a custom launcher: we could be breaking the hosting application by changing this setting. If there is a reason to set it, I think it should be done in our launcher.
16-01-2019

I don't expect DirectByteBuffer to be much of an issue, because if you allocate many of those you will run into issues anyway, since their freeing is delayed. With few arenas, a lot of typical JNI usage scenarios could be impacted negatively because they malloc/free on the same thread. If the thread happens to share an arena with another thread which does the same thing, this will result in a lot of contention in malloc (and free).
18-04-2018
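
A rough micro-benchmark sketch of the malloc/free-heavy multi-threaded pattern that could regress when threads are forced to share arenas (thread count, allocation size and iteration count are arbitrary). Timing it with the default arena limit and again with MALLOC_ARENA_MAX=1 would show whether the contention matters on a given machine.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS   8
#define ITERATIONS 1000000

/* Each thread repeatedly mallocs and frees on its own, the pattern much JNI
   code follows. With one shared arena these calls contend on the same lock;
   with per-thread arenas they mostly do not. */
static void *churn(void *arg) {
    uintptr_t sum = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        void *p = malloc(64);
        sum += (uintptr_t)p;     /* keep the allocation from being optimized away */
        free(p);
    }
    (void)arg;
    return (void *)sum;
}

int main(void) {
    pthread_t t[NTHREADS];
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, churn, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%d threads, %d malloc/free pairs each: %.3f s\n", NTHREADS, ITERATIONS, secs);
    return 0;
}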

If the JVM manages the number of glibc arenas to use on Linux, I wonder what effect this has for apps which use DirectByteBuffer and, hence, native malloc a lot.
18-04-2018

Note that setting an environment variable will not have any effect on systems where the OpenJDK executables carry SELinux labels (so that they run in AT_SECURE=1 mode).
19-12-2017
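
A tiny probe for the situation described here (assuming Linux with glibc, where getauxval is available): when AT_SECURE is non-zero, glibc ignores MALLOC_* environment variables such as MALLOC_ARENA_MAX, so an in-process mallopt call would be the only remaining knob.

#include <stdio.h>
#include <sys/auxv.h>   /* getauxval, AT_SECURE (glibc) */

int main(void) {
    /* AT_SECURE is non-zero for setuid binaries and for executables that the
       loader runs in secure mode, e.g. due to certain SELinux labels. */
    unsigned long secure = getauxval(AT_SECURE);
    printf("AT_SECURE = %lu (%s)\n", secure,
           secure ? "MALLOC_* env vars ignored" : "MALLOC_* env vars honored");
    return 0;
}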

FWIW, there is a 4th option: set the appropriate max arena limit via mallopt based on the configured CPU slices a container gets: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2017-December/025728.html If left as-is and running JVM containers with, say, 50 threads on a physical host having 32 cores, but each container only getting 2 CPUs via container limits, there is a risk of getting bad max arena configs: e.g. 50 arenas in each container, while each should actually get only 2*2 on 32-bit and 2*8 on 64-bit.
18-12-2017
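
A hedged sketch of that option (assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu; the paths and the final policy are illustrative only): derive the CPU share from the cfs quota and period, then cap the arena count at 8x (64-bit) or 2x (32-bit) of that value, mirroring glibc's own default formula.

#include <stdio.h>
#include <stdlib.h>
#ifdef __GLIBC__
#include <malloc.h>
#endif
#include <unistd.h>

/* Read a single long from a file, returning fallback on any error. */
static long read_long(const char *path, long fallback) {
    FILE *f = fopen(path, "r");
    long v = fallback;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = fallback;
        fclose(f);
    }
    return v;
}

int main(void) {
    long quota  = read_long("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", -1);
    long period = read_long("/sys/fs/cgroup/cpu/cpu.cfs_period_us", 100000);
    long cpus   = sysconf(_SC_NPROCESSORS_ONLN);

    if (quota > 0 && period > 0) {
        long limited = (quota + period - 1) / period;   /* round up */
        if (limited < cpus)
            cpus = limited;
    }

    /* glibc's default limit is 2 * cpus on 32-bit and 8 * cpus on 64-bit,
       but it is computed from the host CPU count, not the container limit. */
    long arenas = cpus * (sizeof(void *) == 8 ? 8 : 2);
#ifdef __GLIBC__
    mallopt(M_ARENA_MAX, (int)arenas);
#endif
    printf("container CPUs: %ld -> arena cap: %ld\n", cpus, arenas);
    return 0;
}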

Proposals to address this issue:

1. Change the JVM to use mallopt(M_ARENA_MAX, 1) before starting additional threads.
   + Saves memory without the need to configure anything.
   + Good for use cases where not too many performance-critical concurrent mallocs occur. (I assume this to be typical because I haven't seen performance regressions.)
   - Possible performance regression for some native libraries which perform many concurrent mallocs.
   - Some people may want to configure glibc allocations differently.

2. Use the environment variable MALLOC_ARENA_MAX=1.
   + Configurable solution.
   + No JVM change needed.
   - Knowledge about glibc malloc configuration required.
   - Knowledge about JVM memory management required, plus extra work to set the variable somehow.

3. Only use mallopt(M_ARENA_MAX, 1) in certain configurations. Could be coupled with a small -Xms or another flag for lower memory usage.
   + Saves memory without extra configuration.
   - Less transparent.
18-12-2017
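
For proposal 2, a hypothetical wrapper sketch (the assumption that "java" is on PATH and the wrapper idea itself are illustrative only): it exports MALLOC_ARENA_MAX=1 without overwriting a value the user already supplied and then execs the JVM, so the variable is present in the environment before glibc malloc initializes in the child.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    (void)argc;
    /* Third argument 0 = do not overwrite an existing user setting. */
    setenv("MALLOC_ARENA_MAX", "1", 0);

    argv[0] = "java";            /* assumption: "java" is on PATH */
    execvp("java", argv);        /* forward all arguments to the real launcher */

    perror("execvp java");       /* only reached if exec failed */
    return 1;
}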