Summary
-------
Enhance G1 to improve allocation performance on NUMA memory systems.
Non-Goals
---------
Extend NUMA-awareness to operating systems other than Linux and Solaris,
which are the platforms that provide appropriate NUMA interfaces.
Motivation
----------
Modern multi-socket machines are increasingly NUMA, with not all memory
equidistant from each socket or core. The more traditional SMPs built on
conventional dance-hall architectures are increasingly rare, except
perhaps at the very high end, owing to the cost and difficulty
of scaling up such architectures and the resulting latency and bandwidth
limitations of their interconnects. Most modern OSes, starting with
Solaris about a decade ago, now offer interfaces through which the memory
topology of the platform can be queried and physical memory
preferentially mapped from a specific locality group. HotSpot's
ParallelScavengeHeap has been NUMA-aware for many years now, and this has
helped scale the performance of configurations that run a single JVM over
multiple sockets, presenting a NUMA platform to the JVM. Certain other
HotSpot collectors, most notably the concurrent ones, have not had
the benefit of this feature and have not been able to take advantage of
such vertical multi-socket NUMA scaling. As large enterprise
applications increasingly run with large heaps, need the power of
multiple sockets, and yet want the manageability advantage of running
within a single JVM, customers using our concurrent collectors will
increasingly run up against this scaling bottleneck.
This JEP aims to extend NUMA-awareness to the heap managed by the G1
garbage collector.
Description
-----------
G1's heap is organized as a collection of fixed-size regions within what
currently happens to be a contiguous interval of the virtual address
space. Generations, or individual logical spaces (such as Eden, Survivor,
and Old), are then formed as dynamic disjoint subsets of this collection
of regions. A region is typically a set of physical pages, although when
using very large pages (say 256M superpages on SPARC), several regions
may make up a single physical page.
To make G1's allocation NUMA-aware we shall initially focus on the
so-called Eden regions. Survivor regions may be considered in a second
enhancement phase, but are not within the scope of this JEP. At a very
high level, we want to arrange for the Eden regions to be backed by
physical pages allocated in specific locality groups
(henceforth, "lgrps"). The idea is analogous to the NUMA spaces used by
ParallelScavengeHeap. Let's call these "per-lgrp region pools", for lack
of a better phrase.
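For concreteness, the following is a minimal sketch of how such per-lgrp
region pools might be organized. All of the type and field names here
(`PerLgrpRegionPool`, `EdenRegionPools`, and so on) are illustrative
placeholders, not existing HotSpot classes.

    // Illustrative sketch only; not HotSpot code. HeapRegion stands in for
    // G1's fixed-size heap region type.
    #include <vector>

    typedef int lgrp_id_t;   // locality-group id as reported by the OS

    struct HeapRegion;       // placeholder for G1's region type

    // Pool of Eden regions whose backing pages are biased to one lgrp.
    struct PerLgrpRegionPool {
      lgrp_id_t                _lgrp;           // lgrp this pool is biased to
      HeapRegion*              _current_alloc;  // region currently serving TLAB requests
      std::vector<HeapRegion*> _free_regions;   // biased regions not yet in use
    };

    // Global view: untouched regions have no physical pages yet and are handed
    // to a per-lgrp pool (and biased) only on demand.
    struct EdenRegionPools {
      std::vector<HeapRegion*>       _untouched; // global pool of untouched Eden regions
      std::vector<PerLgrpRegionPool> _per_lgrp;  // one pool per locality group
    };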
We envisage the lifetime of an Eden region to be roughly as follows:
- Each region starts off untouched, with no physical pages allocated to
it and no association with any specific locality group.
- Eden regions have their backing pages allocated in specific locality
groups.
- Each thread, when it starts out, queries and records its home lgrp,
(henceforth the "thread's lgrp", for short).
- When a TLAB request is made by a thread whose lgrp is L, we look in
the per-lgrp region pool for L. If there is a current allocation
region in L, it is used to satisfy the TLAB allocation request. If
the current allocation region is NULL, or the free space in it is too
small to satisfy the TLAB request, then a new region is allocated out
of the region pool for L, and becomes the current allocation region
which will supply that and subsequent TLAB requests. This region has
been previously touched and already has pages allocated to it from
the lgrp L. If the region pool for L is empty, we check the global
pool to see if a free Eden region is available, and this region is
then assigned to pool L. At this point the region is untouched and
has no pages allocated to it (or was most recently madvised to
free). An appropriate lgrp API (either prescriptive or descriptive)
is used to ensure that physical pages for this region are allocated
in the local lgrp L. (This allocation path is illustrated in the first
sketch following this list.)
- If there are no available regions in the global (untouched) Eden
pool, and Eden cannot be grown (for policy or other reasons), a
scavenge will be done. An alternative is to steal already biased but
unallocated regions from another lgrp and migrate them to this lgrp,
but the policy suggested above follows the one implemented in PS,
where such migration-on-demand was found to be less efficient than
adaptive migration following a scavenge (see below).
- At each scavenge, the occupancy of the per-lgrp pools is assessed and
an appropriately weighted medium-term or moving-window average is
used to determine whether there are unused or partially-used regions that
should be madvised to free so as to adaptively resize the per-lgrp
pools (see the second sketch following this list).
- Humongous regions are naturally eliminated from this allocation
policy since such regions are not considered part of Eden anyway, so
nothing special will need to be done for such regions. (A reasonable
policy for such regions may be to interleave or randomly allocate
pages uniformly across all lgrps to optimize the worst-case
performance assuming uniform random access from each lgrp.)
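Building on the hypothetical structures sketched earlier, the following
illustrates the mutator-side slow path described in the list above. It
assumes Linux and libnuma; on Solaris, `lgrp_home(3LGRP)` and `madvise`
with `MADV_ACCESS_LWP` would play the analogous roles. The accessors
`region_bottom()` and `region_byte_size()` are assumed helpers, not
existing HotSpot functions.

    // Sketch of the TLAB slow path described above; names are illustrative.
    // Assumes Linux + libnuma (link with -lnuma).
    #include <numa.h>    // numa_node_of_cpu, numa_tonode_memory
    #include <sched.h>   // sched_getcpu
    #include <cstddef>

    // Hypothetical accessors for a region's address range.
    void*  region_bottom(HeapRegion* r);
    size_t region_byte_size(HeapRegion* r);

    // Each thread queries and records its home lgrp once, when it starts out.
    int thread_home_lgrp() {
      return numa_node_of_cpu(sched_getcpu());
    }

    // Called when the current allocation region for lgrp L cannot satisfy a
    // TLAB request: prefer a region already biased to L, otherwise take an
    // untouched region from the global pool and bias it to L.
    HeapRegion* new_allocation_region(EdenRegionPools& pools, int L) {
      PerLgrpRegionPool& pool = pools._per_lgrp[L];
      if (!pool._free_regions.empty()) {          // pages already live in lgrp L
        HeapRegion* r = pool._free_regions.back();
        pool._free_regions.pop_back();
        return r;
      }
      if (!pools._untouched.empty()) {            // untouched: no pages yet
        HeapRegion* r = pools._untouched.back();
        pools._untouched.pop_back();
        // Ask the OS to place this region's pages in lgrp L when first touched.
        numa_tonode_memory(region_bottom(r), region_byte_size(r), L);
        return r;
      }
      return nullptr;  // caller grows Eden or triggers a scavenge
    }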
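Similarly, a sketch of the scavenge-time resizing step: regions that the
moving-window average deems surplus in a pool are madvised to free and
returned to the global untouched pool. Whether `MADV_FREE` or
`MADV_DONTNEED` is the appropriate advice is a platform-dependent detail;
`MADV_DONTNEED` is used here purely for illustration.

    // Sketch of shrinking one per-lgrp pool after a scavenge; illustrative only.
    #include <sys/mman.h>   // madvise, MADV_DONTNEED
    #include <cstddef>

    // target_regions is the pool size suggested by the weighted moving-window
    // average of recent occupancy.
    void shrink_pool(EdenRegionPools& pools, PerLgrpRegionPool& pool,
                     size_t target_regions) {
      while (pool._free_regions.size() > target_regions) {
        HeapRegion* r = pool._free_regions.back();
        pool._free_regions.pop_back();
        // Give the physical pages back to the OS; the region reverts to
        // "untouched" and can later be biased to whichever lgrp needs it.
        madvise(region_bottom(r), region_byte_size(r), MADV_DONTNEED);
        pools._untouched.push_back(r);
      }
    }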
ParallelScavengeHeap allocates pages from a survivor space in round-robin
fashion. As mentioned above, NUMA-biasing of survivor regions is not a
goal of this JEP.
When using large pages, where multiple regions map to the same physical
page, things get a bit complicated. For now, we will finesse this by
disabling NUMA optimizations as soon as the page size exceeds some small
multiple of region size (say 4), and deal with the more general case in a
separate later phase. When the page size is below this threshold, we
shall allocate and bias contiguous sets of regions into the per-lgrp Eden
pools. This author is not sufficiently familiar with G1's current region
allocation code, but believes that this will likely require some small
changes to the existing allocation policy to allow allocating a
set of regions at a time.
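A sketch of this gating rule follows; the threshold of 4 and the function
names are placeholders rather than settled design.

    // Illustrative only. NUMA biasing is disabled when a single OS page would
    // span more than a small number of regions; otherwise regions are handed
    // out in page-aligned groups so each page is biased to exactly one lgrp.
    #include <cstddef>

    const size_t MaxRegionsPerPage = 4;   // the "small multiple" mentioned above

    bool numa_biasing_enabled(size_t page_size, size_t region_size) {
      if (page_size <= region_size) return true;   // region spans whole pages
      return (page_size / region_size) <= MaxRegionsPerPage;
    }

    // Number of contiguous regions that must be biased together so that no
    // physical page is split across lgrps.
    size_t regions_per_bias_unit(size_t page_size, size_t region_size) {
      return (page_size > region_size) ? (page_size / region_size) : 1;
    }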
The `-XX:+UseNUMA` command-line switch should enable the feature for G1
if `-XX:+UseG1GC` is also used. If the option is found to perform well
for a large class of programs, we may enable it by default on NUMA
platforms (as I think is the case for ParallelScavenge today). Other
options related to NUMA adaptation and features should be supported in
the same manner as for ParallelScavengeHeap. We should avoid adding any
collector-specific options for NUMA to the extent possible.
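For illustration, under this proposal the feature would be enabled
explicitly with existing command-line flags (only their interaction with
G1 is new):

    java -XX:+UseG1GC -XX:+UseNUMA MyApplication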
Testing
-------
Normal testing (with `-XX:+UseNUMA` as appropriate) should flush out any
correctness issues. This JEP assumes the use of NUMA hardware for
testing. Targeted performance testing will be done, using a variety of
benchmarks and applications on a variety of NUMA and non-NUMA platforms.
Risks and Assumptions
---------------------
As in the case of the ParallelScavenge collector, the implementation here
assumes that most short-lived objects are accessed most often by the
thread that allocated them. This is certainly true of the majority of
short-lived objects in most object-oriented programs, as experience with
ParallelScavenge has already shown us. There is, however, a small class
of programs where this assumption does not quite hold. The benefits also
depend on the
interplay of the extent of NUMA-ness of the underlying system and the
overheads associated with migrating pages on such systems, especially in
the face of frequent thread migrations when load is high. Finally, there
may be platforms for which the appropriate lgrp interfaces are
either not publicly accessible or available, or have not been implemented
for other reasons.
There is some risk that the assignment of regions to specific lgrp pools
will reduce some flexibility in terms of moving regions between various
logical spaces, but we do not consider this a serious impediment.
Somewhat more seriously, the assignment of regions to lgrp pools will
cause some internal fragmentation within these pools, which is not
dissimilar to the case of ParallelScavengeHeap. This is a known issue
and, given that the unit of lgrp allocation in
ParallelScavengeHeap is a page while that of G1 is a region which may span
several (smaller) pages, we would typically not expect the G1
implementation to perform any better than the ParallelScavengeHeap one.