Improve G1 performance on large machines by implementing NUMA-aware memory allocation.
- It is not a goal to implement NUMA support in collectors other than G1.
- It is not a goal to support operating systems other than Linux.
- It is not a goal to make other parts of G1 NUMA-aware, such as task queue stealing, remembered sets, or refinement.
Modern multi-socket machines increasingly have non-uniform memory access (NUMA): memory is not equidistant from every socket or core. Memory accesses between sockets have different performance characteristics, with access to more-distant sockets typically having higher latency.
The parallel collector, enabled by `-XX:+UseParallelGC`, has been NUMA-aware for many years. This has helped to improve the performance of configurations that run a single JVM across multiple sockets. Other HotSpot collectors have not had the benefit of this feature, which means they have not been able to take advantage of such vertical multi-socket NUMA scaling. Large enterprise applications in particular tend to run with large heap configurations on multiple sockets, yet they want the manageability advantage of running within a single JVM. Users of the G1 collector are increasingly running up against this scaling bottleneck.
G1's heap is organized as a collection of fixed-size regions. A region is typically a set of physical pages, although when using large pages (via `-XX:+UseLargePages`) several regions may make up a single physical page.
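The relationship between regions and pages can be illustrated with simple arithmetic. The following is a minimal sketch only: the class and method names are hypothetical, and the sizes used (1 MB regions, 4 KB small pages, 2 MB large pages) are example values assumed for illustration, not values taken from this text.

```java
public class RegionPageRatio {
    /** Pages needed to back one region; meaningful when pageBytes <= regionBytes. */
    public static long pagesPerRegion(long regionBytes, long pageBytes) {
        return regionBytes / pageBytes;
    }

    /** Regions that share one page; meaningful when pageBytes >= regionBytes. */
    public static long regionsPerPage(long regionBytes, long pageBytes) {
        return pageBytes / regionBytes;
    }

    public static void main(String[] args) {
        long kb = 1024L;
        long mb = 1024L * kb;

        // Assumed example: a 1 MB region backed by 4 KB pages spans many pages.
        System.out.println(pagesPerRegion(1 * mb, 4 * kb)); // prints 256

        // Assumed example: with 2 MB large pages, two 1 MB regions share one page.
        System.out.println(regionsPerPage(1 * mb, 2 * mb)); // prints 2
    }
}
```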
If the `-XX:+UseNUMA` option is specified then, when the JVM is initialized, the regions will be evenly spread across the total number of available NUMA nodes.
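One simple way to picture an even spread is a round-robin assignment of region indices to node indices. The sketch below is an illustration under that assumption; the text does not say which scheme G1 uses, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class RegionNodeLayout {
    /** Assigns each region index to a node index in round-robin order (assumed scheme). */
    public static int[] assignRoundRobin(int regionCount, int nodeCount) {
        if (regionCount < 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("regionCount must be >= 0 and nodeCount > 0");
        }
        int[] nodeOfRegion = new int[regionCount];
        for (int region = 0; region < regionCount; region++) {
            nodeOfRegion[region] = region % nodeCount;
        }
        return nodeOfRegion;
    }

    /** Counts how many regions ended up on each node. */
    public static int[] regionsPerNode(int[] nodeOfRegion, int nodeCount) {
        int[] counts = new int[nodeCount];
        for (int node : nodeOfRegion) {
            counts[node]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] layout = assignRoundRobin(10, 4);
        System.out.println(Arrays.toString(layout));                    // [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
        System.out.println(Arrays.toString(regionsPerNode(layout, 4))); // [3, 3, 2, 2]
    }
}
```

With this scheme the per-node region counts differ by at most one, which is one reasonable reading of "evenly spread".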
Fixing the NUMA node of each region at the beginning is a bit inflexible, but this can be mitigated by the following enhancements. In order to allocate a new object for a mutator thread, G1 may need to allocate a new region. It will do so by preferentially selecting a free region from the NUMA node to which the current thread is bound, so that the object will be kept on the same NUMA node in the young generation. If there is no free region on the same NUMA node during region allocation for a mutator then G1 will trigger a garbage collection. An alternative idea to be evaluated is to search other NUMA nodes for free regions in order of distance, starting with the closest NUMA node.
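The two region-selection policies described above can be sketched as follows. This is an illustrative model, not G1's implementation: the class `NumaFreeRegions`, its methods, and the distance matrix are all hypothetical. An empty result from `allocateLocal` stands for the case in which G1 would trigger a garbage collection, and `allocateNearest` models the alternative of searching other nodes in order of distance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.OptionalInt;

public class NumaFreeRegions {
    // One queue of free region indices per NUMA node (hypothetical structure).
    private final List<Deque<Integer>> freeByNode = new ArrayList<>();
    // distance[a][b] is an assumed relative cost of node a accessing memory on node b.
    private final int[][] distance;

    public NumaFreeRegions(int[][] distance) {
        this.distance = distance;
        for (int node = 0; node < distance.length; node++) {
            freeByNode.add(new ArrayDeque<>());
        }
    }

    public void addFreeRegion(int node, int regionIndex) {
        freeByNode.get(node).addLast(regionIndex);
    }

    /** Local-only policy: an empty result means the caller would trigger a GC. */
    public OptionalInt allocateLocal(int node) {
        Integer region = freeByNode.get(node).pollFirst();
        return region == null ? OptionalInt.empty() : OptionalInt.of(region);
    }

    /** Alternative policy: try nodes in increasing distance from the given node. */
    public OptionalInt allocateNearest(int node) {
        List<Integer> order = new ArrayList<>();
        for (int n = 0; n < distance.length; n++) {
            order.add(n);
        }
        // Stable sort: nodes at equal distance keep their index order.
        order.sort(Comparator.comparingInt(n -> distance[node][n]));
        for (int candidate : order) {
            OptionalInt region = allocateLocal(candidate);
            if (region.isPresent()) {
                return region;
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        int[][] dist = {{10, 20, 30}, {20, 10, 20}, {30, 20, 10}}; // assumed example values
        NumaFreeRegions regions = new NumaFreeRegions(dist);
        regions.addFreeRegion(1, 5);
        regions.addFreeRegion(2, 7);
        System.out.println(regions.allocateLocal(0).isPresent()); // false: node 0 has no free region
        System.out.println(regions.allocateNearest(0).getAsInt()); // 5: node 1 is the closest node with a free region
    }
}
```

The local-only policy keeps young-generation objects on the allocating thread's node at the cost of earlier collections; the distance-ordered search trades some locality for fewer collections.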
We will not attempt to keep objects on the same NUMA node in the old generation.
Humongous regions are excluded from this allocation policy. We will do nothing special for these regions.
Existing tests with the option `-XX:+UseNUMA` should flush out any correctness issues. We assume the use of NUMA hardware for testing.
There should be no performance difference compared to the original code when NUMA-aware allocation is turned off.
Risks and Assumptions
We assume that most short-lived objects are accessed mainly by the thread that allocated them. This is certainly true for the majority of short-lived objects in most object-oriented programs. However, there are some programs where this assumption does not quite hold, so there may be performance regressions in some cases. In addition, the benefits depend on the interplay between the extent of NUMA-ness of the underlying system and the frequency with which threads are migrated between NUMA nodes on such systems, especially when load is high.