JDK-8198423 : Improve metaspace chunk allocation
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 11
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2018-02-20
  • Updated: 2019-10-17
  • Resolved: 2018-03-09
Fix Version: JDK 11 (11 b07, Fixed)
Description
(This is an RFE to track work on the - still in draft - JEP JDK-8166690 [1].)

We (SAP) would like to contribute a patch which has been live in our port for some time. It improves metaspace chunk allocation by reducing fragmentation and raising the chance of reusing free metaspace chunks.

This patch helps with a number of pathological cases where metaspace chunks are free but cannot be reused because they are of the wrong size. For example, the metaspace freelist could be full of small chunks, which cannot be reused when larger chunks are needed. So we could get metaspace OOMs even in situations where the metaspace was far from exhausted. Our patch adds the ability to split and merge metaspace chunks dynamically and thus removes this "size-lock-in" problem.

--

How this patch works:

1) When a class loader dies, its metaspace chunks are freed and returned to the freelist for reuse by the next class loader. With the patch, upon returning a chunk to the freelist, an attempt is made to merge it with its neighboring chunks - should they happen to be free too - to form a larger chunk, which is then placed in the freelist.

As a result, the freelist should be populated by larger chunks at the expense of smaller chunks. In other words, all free chunks should always be as "coalesced as possible".

2) When a class loader needs a new chunk and a chunk of the requested size cannot be found in the free list, before carving out a new chunk from the virtual space we first check whether there is a larger chunk in the free list. If there is, that larger chunk is chopped up into n smaller chunks. One of them is returned to the caller; the others are re-added to the freelist.

(1) and (2) together have the effect of removing the size-lock-in for chunks. If fragmentation allows it, small chunks are dynamically combined to form larger chunks, and larger chunks are split on demand.
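
To illustrate the combined effect of (1) and (2), here is a small self-contained toy model. It is not HotSpot code: the chunk sizes, names and data structures are invented for the sketch, and merging is simplified to same-sized "buddy" pieces.

  // Toy model of the split/merge idea - not the actual ChunkManager.
  // Free chunks are tracked per size level; freeing a chunk coalesces it
  // with free neighbors into the next larger level, and a request that
  // cannot be served from its own level splits a larger free chunk.
  #include <cstddef>
  #include <cstdio>
  #include <set>

  // Three chunk levels with an invented 1:4:16 size ratio (bytes).
  static const size_t kSizes[3] = { 1024, 4096, 16384 };

  struct ToyChunkManager {
    std::set<size_t> free_chunks[3];   // per level: free chunk start offsets

    // Return a chunk to the freelist, merging upwards as long as all
    // same-sized pieces of the enclosing larger chunk are free.
    void free_chunk(int level, size_t start) {
      free_chunks[level].insert(start);
      while (level < 2) {
        const size_t parent = start - (start % kSizes[level + 1]); // aligned down
        for (size_t p = parent; p < parent + kSizes[level + 1]; p += kSizes[level]) {
          if (free_chunks[level].count(p) == 0) return;            // neighbor still in use
        }
        for (size_t p = parent; p < parent + kSizes[level + 1]; p += kSizes[level]) {
          free_chunks[level].erase(p);                             // pull the pieces
        }
        level++;
        start = parent;
        free_chunks[level].insert(start);                          // add the merged chunk
      }
    }

    // Hand out a chunk of the requested level, splitting a larger free
    // chunk if the level's own freelist is empty.
    bool allocate(int level, size_t* out) {
      if (!free_chunks[level].empty()) {
        *out = *free_chunks[level].begin();
        free_chunks[level].erase(free_chunks[level].begin());
        return true;
      }
      size_t big;
      if (level == 2 || !allocate(level + 1, &big)) return false;
      // Keep the first piece, return the remainder to the freelist.
      for (size_t p = big + kSizes[level]; p < big + kSizes[level + 1]; p += kSizes[level]) {
        free_chunks[level].insert(p);
      }
      *out = big;
      return true;
    }
  };

  int main() {
    ToyChunkManager cm;
    cm.free_chunks[2].insert(0);      // start with one large free chunk
    size_t a = 0;
    cm.allocate(0, &a);               // splits the large chunk down on demand
    printf("small chunk allocated at offset %zu\n", a);
    cm.free_chunk(0, a);              // freeing it merges everything back
    printf("large chunks free again: %zu\n", cm.free_chunks[2].size());
    return 0;
  }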

--

What this patch does not do:

This is not a rewrite of the chunk allocator - most of the mechanisms stay intact. Specifically, chunk sizes remain unchanged, and so does the chunk allocation logic (which class loaders get handed which chunk sizes, and when). Almost everything this patch does affects only the internal workings of the ChunkManager.

--

Details:

Before the patch, the following rules held:
- All chunk sizes are multiples of the smallest chunk size ("specialized chunks")
- All chunk sizes of larger chunks are also clean multiples of the next smaller chunk size (e.g. for class space, the ratio of specialized/small/medium chunks is 1:2:32)
- All chunk start addresses are aligned to the smallest chunk size (more or less accidentally, see metaspace_reserve_alignment).
The patch makes the last rule explicit and more strict:
- All (non-humongous) chunk start addresses are now aligned to their own chunk size. So, e.g. medium chunks are allocated at addresses which are a multiple of medium chunk size. This rule is not extended to humongous chunks, whose start addresses continue to be aligned to the smallest chunk size.

The reason for this new alignment rule is that it makes it cheap both to find the chunks preceding a given chunk and to check which chunks are free.

When a class loader dies and its chunk is returned to the freelist, all we have is its address. In order to merge it with its neighbors to form a larger chunk, we need to find those neighbors, including those preceding the returned chunk. Prior to this patch that was not easy - one would have to iterate chunks starting at the beginning of the VirtualSpaceNode. But due to the new alignment rule, we now know where the prospective larger chunk must start - at the next lower larger-chunk-size-aligned boundary. We also know that currently a smaller chunk must start there (*).
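
As a sketch of that lookup (align_down is a local helper here, and the medium chunk size is just an assumed value for the example):

  // Because non-humongous chunks are aligned to their own size, the region a
  // prospective larger chunk would occupy follows directly from the freed
  // chunk's address - no walk from the start of the VirtualSpaceNode needed.
  #include <cstdint>
  #include <cstdio>

  // Align an address down to a power-of-two boundary.
  static inline uintptr_t align_down(uintptr_t addr, uintptr_t alignment) {
    return addr & ~(alignment - 1);
  }

  int main() {
    const uintptr_t medium_chunk_bytes = 0x10000;  // assumed size, for illustration
    const uintptr_t freed_chunk_addr   = 0x12000;  // address of the returned small chunk
    // If all chunks in this region are free, they can be merged into one
    // medium chunk starting at the aligned-down boundary:
    const uintptr_t merge_start = align_down(freed_chunk_addr, medium_chunk_bytes);
    printf("candidate merge region: [0x%zx, 0x%zx)\n",
           (size_t)merge_start, (size_t)(merge_start + medium_chunk_bytes));
    return 0;
  }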

In order to check the free-ness of chunks quickly, each VirtualSpaceNode now keeps a bitmap which describes its occupancy. One bit in this bitmap corresponds to a range the size of the smallest chunk size and starting at an address aligned to the smallest chunk size. Because of the alignment rules above, such a range belongs to one single chunk. The bit is 1 if the associated chunk is in use by a class loader, 0 if it is free.
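
A minimal sketch of such an occupancy map (illustrative only; the real per-VirtualSpaceNode class is more involved):

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // One bit per smallest-chunk-sized range: true = in use by a class loader,
  // false = free. A range's address maps to a bit index by simple division.
  class OccupancyMapSketch {
    const uintptr_t _base;                  // start of the VirtualSpaceNode
    const size_t    _smallest_chunk_bytes;  // granule covered by one bit
    std::vector<bool> _bits;
  public:
    OccupancyMapSketch(uintptr_t base, size_t node_bytes, size_t smallest_chunk_bytes)
      : _base(base),
        _smallest_chunk_bytes(smallest_chunk_bytes),
        _bits(node_bytes / smallest_chunk_bytes, false) {}

    size_t bit_for(uintptr_t addr) const {
      assert(addr >= _base && "address below this node");
      return (addr - _base) / _smallest_chunk_bytes;
    }

    // Mark all ranges covered by the chunk [addr, addr + chunk_bytes).
    void set_in_use(uintptr_t addr, size_t chunk_bytes, bool in_use) {
      const size_t first = bit_for(addr);
      const size_t count = chunk_bytes / _smallest_chunk_bytes;
      for (size_t i = 0; i < count; i++) {
        _bits[first + i] = in_use;
      }
    }

    bool is_in_use(uintptr_t addr) const { return _bits[bit_for(addr)]; }
  };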

When we have calculated the address range a prospective larger chunk would span, we need to check whether all chunks in that range are free; only then can we merge them. We do that by querying the bitmap. Note that the most common use case here is forming medium chunks from smaller chunks. With the new alignment rules, the bitmap portion covering a medium chunk always happens to be 16 or 32 bits in size and 16- or 32-bit aligned, so reading the bitmap in many cases becomes a simple 16- or 32-bit load.
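
Assuming, for the sketch, that a medium chunk spans exactly 32 smallest-chunk-sized ranges (the actual ratio differs between class and non-class space), the "is this whole range free?" check degenerates into a single aligned word load:

  #include <cstddef>
  #include <cstdint>

  // The occupancy bitmap, viewed as 32-bit words. With the alignment rules,
  // the bits for one prospective medium chunk fall into exactly one aligned
  // word, so checking the whole range is one load and one compare.
  bool is_medium_range_free(const uint32_t* occupancy_words, size_t medium_chunk_index) {
    return occupancy_words[medium_chunk_index] == 0;  // 0 = all 32 ranges free
  }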

Only if the whole range is free do we iterate the chunks in that range: pull them from the freelist, combine them into one new larger chunk, and re-add that to the freelist.

(*) Humongous chunks make this a bit more complicated. Since the new alignment rule does not extend to them, a humongous chunk could still straddle the lower or upper boundary of the prospective larger chunk. So we gave the occupancy map a second layer, which is used to mark the start of chunks.
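
A rough sketch of that second layer (illustrative only; the real checks also have to consider a chunk straddling the upper boundary of the region, which is omitted here):

  #include <cstddef>
  #include <vector>

  // Two parallel bit layers per VirtualSpaceNode: "in use" as before, plus
  // "a chunk starts in this range". A humongous chunk straddling the lower
  // boundary of a candidate region shows up as a region whose first range
  // is not the start of a chunk.
  struct TwoLayerMapSketch {
    std::vector<bool> in_use;        // layer 1: range belongs to a live chunk
    std::vector<bool> chunk_starts;  // layer 2: a chunk starts in this range

    bool can_merge(size_t first_bit, size_t num_bits) const {
      if (!chunk_starts[first_bit]) {
        return false;                // straddled from below by a larger chunk
      }
      for (size_t b = first_bit; b < first_bit + num_bits; b++) {
        if (in_use[b]) {
          return false;              // some chunk in the region is still live
        }
      }
      return true;
    }
  };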

--

The patch shows its best results in scenarios where a lot of smallish class loaders are alive simultaneously. When they die, they leave contiguous expanses of metaspace covered in small chunks, which can be merged nicely. However, if class loader lifetimes vary more, dead and live small chunks interleave more, and hence chunk merging does not work as well as it could.

For an example of such a pathological case, see the example program at [5].

Executed like this: "java -XX:CompressedClassSpaceSize=10M -cp test3 test3.Example2", the test will load 3000 small classes in separate class loaders, then throw them away and start loading large classes. The small classes will have flooded the metaspace with small chunks, which are unusable for the large classes. With the rather limited CompressedClassSpaceSize=10M, we run into an OOM after loading about 800 large classes, having used only 40% of the class space; the rest is wasted on unused small chunks. With our patch, however, the example program manages to allocate ~2900 large classes before running into an OOM, and class space shows almost no waste.

To demonstrate this, add -Xlog:gc+metaspace+freelist. After running into an OOM, statistics and an ASCII representation of the class space will be shown. The unpatched version will show large expanses of unused small chunks; the patched variant will show almost no waste.

Note that the patch could be made more effective with a different size ratio between small and medium chunks: in class space, that ratio is 1:16, so 16 small chunks must happen to be free to form one larger chunk. With a smaller ratio, the chance of coalescing would be greater. So there may be room for future improvement here: since we can now merge and split chunks on demand, we could introduce more chunk sizes.

--

What this patch costs (memory):

- The occupancy bitmap adds 1 byte per 4K of metaspace.
- MetaChunk headers get larger, since we add an enum and two bools to them. Depending on what the C++ compiler does with that, chunk headers grow by one or two MetaWords, reducing the payload size by that amount.
- The new alignment rules mean we may need to create padding chunks to precede larger chunks (see the sketch below). But since these padding chunks are added to the freelist, they should be used up before the need for new padding chunks arises. So, the maximum possible number of unused padding chunks should be limited by design to about 64K.
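
Where such a padding chunk would come from, as a sketch (align_up is a local helper and the sizes are invented; the real logic lives in the VirtualSpaceNode allocation path):

  #include <cstdint>
  #include <cstdio>

  // Align an address up to a power-of-two boundary.
  static inline uintptr_t align_up(uintptr_t addr, uintptr_t alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
  }

  int main() {
    uintptr_t top = 0x13000;                       // current allocation top of the node
    const uintptr_t medium_chunk_bytes = 0x10000;  // requested chunk size (assumed)
    const uintptr_t aligned_top = align_up(top, medium_chunk_bytes);
    if (aligned_top != top) {
      // The gap [top, aligned_top) is carved into one or more smaller padding
      // chunks and put on the freelist instead of being wasted.
      printf("padding chunks cover %zu bytes\n", (size_t)(aligned_top - top));
    }
    printf("requested chunk carved at 0x%zx\n", (size_t)aligned_top);
    return 0;
  }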

The expectation is that the memory savings from this patch far outweigh its added memory costs.

What this patch costs (performance):

We did not see measurable regressions in standard benchmarks beyond the normal noise. I also measured times for a program which stresses metaspace chunk coalescing, with the same result.


[1] https://bugs.openjdk.java.net/browse/JDK-8166690
[2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
[3] https://bugs.openjdk.java.net/browse/JDK-8185034
[4] https://bugs.openjdk.java.net/browse/JDK-8176808
[5] https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip