JDK-8326035 : Ahead-of-Time GC Agnostic Object Archiving
  • Type: JEP
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Submitted
  • Resolution: Unresolved
  • Submitted: 2024-02-16
  • Updated: 2024-09-24
Description
Summary
-------

An Ahead-of-Time (AOT) object archiving mechanism that is agnostic to which Garbage Collector (GC) is selected at deployment time.

Goals
-----

The AOT cache delivered by [JEP 483: Ahead-of-Time Class Loading & Linking](https://openjdk.org/jeps/483) embeds state computed ahead of time, in order to start the JVM faster. This cache contains an object archive as well as other program state. Currently, the Z Garbage Collector (ZGC) does not support the object archiving mechanism of the AOT cache, so ZGC is not fully supported. This JEP aims to address that. The primary goals of this JEP are:

 - Support object archiving for ZGC (and indeed any other GC)
 - A unified object archiving format and loader

Secondary goals:

 - Keep GC implementation details and policies separate from the object archiving mechanism

Non-Goals
---------

It is not a goal at this time to:

 - Remove the existing GC-dependent object archiving mechanism
 - Unify AOT cache artifacts produced for `-XX:+UseCompressedOops` with `-XX:-UseCompressedOops`

While removing the existing GC-dependent object archiving mechanism of the AOT cache would allow disentangling the implementation details of other GCs from object archiving, we will not consider that at this time, as there is not yet enough data to make such a decision.

Success Metrics
---------------

It should not take significantly longer for the JVM to start with the new GC-agnostic archived object loader than with the existing GC-specific archived object loaders for Serial GC, Parallel GC, and G1 GC.

Motivation
----------

Traditional GCs are notorious for causing “tail latency” problems in Java workloads: by pausing application threads to collect garbage, they make some requests take significantly longer than usual. Applications may have a service level agreement (SLA) requiring tail latencies to be bounded at particular percentiles. For example, an SLA could say that P99 response times (the 99th percentile) must be below 10 ms, meaning that 99% of responses must complete within 10 ms. ZGC is a low-latency GC, available in production since JDK 15 (JEP 377). It greatly improves GC-induced tail latency by performing GC work concurrently.

However, GC is not the only JVM mechanism that causes tail latency. Java workloads are often "scaled out" by starting new instances to handle more incoming requests. Requests sent to a new instance take significantly longer than requests sent to a warmed-up JVM, which also causes tail latency. JEP 483: Ahead-of-Time Class Loading & Linking improves startup/warmup-induced tail latency by capturing much of the corresponding work in an AOT cache.

The AOT cache contains data about the state of the program from the training run. A non-trivial chunk of this data consists of the `java.lang.Class` objects for all the loaded classes in the program and the constant pool entries that were resolved during the training run. These objects are stored in an object archive in the AOT cache and are loaded into the Java heap at runtime. However, the object archiving mechanism used by the AOT cache is incompatible with ZGC. This is unfortunate, as it forces latency-conscious users to choose whether their application should suffer from GC-induced tail latency or from startup/warmup-induced tail latency.

In order to improve Java latencies, it is important to take a systems approach to engineering, where all components are designed to work together. This JEP addresses the incompatibility above by introducing a GC-agnostic object archiving mechanism for the AOT cache, allowing it to be used together with ZGC as well as any other GC. This way, users who wish to reduce startup/warmup-induced tail latency by using the AOT cache are no longer forced to select a GC other than ZGC, a constraint that would likely make the overall tail latency of the system worse over time.

Description
-----------

The AOT cache captures program state computed during a training run of an application so that a subsequent deployment run can start faster. A training run of the [Spring Petclinic 3.2.0](https://github.com/spring-projects/spring-petclinic) program creates a 130 MB AOT cache file, which helps the program start 42% faster. Among the 130 MB of captured program state, 12 MB consists of archived Java objects. These objects are, among other things, the `java.lang.Class` instances for the ~21 000 loaded and linked classes, as well as the resolved constants from their constant pools. It is important for certain startup/warmup optimizations that these objects can be loaded from the object archive of the AOT cache into the Java heap of a deployment run. ZGC does not currently support said optimizations, because the current object archiving system does not work with ZGC.

### Offline Layout Challenges ###

Loading the archived objects into the Java heap must be fairly efficient, or the benefit of the startup/warmup optimizations that object archiving enables becomes compromised. The current object archiving system of the AOT cache maps memory from an archive file straight into the Java heap, which is rather efficient. However, for this approach to work well, the layout in the file has to match exactly, bit by bit, what the GC (and the rest of the JVM) expects to see at runtime. There are three layers of layout policies that might cause bits not to match, and any such mismatch causes challenges for the current approach to object archiving. The layout concerns are:

 1. **Heap layout.** The heap layout is a high-level strategy for where in the heap a GC chooses to place objects of a particular size and class.
 2. **Field layout.** The field layout is concerned with where to store the contents of fields within an object. It is not GC-dependent.
 3. **Object reference layout.** This is the bit encoding strategy for reference fields. It varies based on the different optimization goals of different GCs.

These three layers of object layout policies can vary significantly between GC implementations and heap sizes. For each level of layout policy, there are various factors that can affect the bit pattern of how objects are represented in memory. For example:

 - There are currently six different pointer formats in HotSpot (one such reference encoding is sketched below)
 - There are various different heap layouts: contiguous, region-based, discontiguous
 - Object alignment differs depending on object size for different GCs
 - Object location and grouping differ depending on object size for different GCs
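
To make the reference-layout concern concrete, here is a simplified, illustrative C++ sketch (not HotSpot source) of how the bit pattern of the very same reference field differs between raw 64-bit pointers and compressed oops, whose heap base and shift are only known once the deployment JVM has sized and placed its heap:

```cpp
#include <cstdint>

// A reference stored as a raw 64-bit pointer (-XX:-UseCompressedOops).
using oop = uintptr_t;

// With -XX:+UseCompressedOops, a reference is stored as a 32-bit value
// decoded against a heap base and an alignment shift. Both parameters
// depend on the heap size and placement chosen at JVM startup, so the
// archived bit pattern cannot be fixed ahead of time.
struct CompressedOopCodec {
  uintptr_t heap_base; // 0 for zero-based heaps placed below 32 GB
  int       shift;     // e.g. 3, for 8-byte object alignment

  uint32_t encode(oop p) const {
    return static_cast<uint32_t>((p - heap_base) >> shift);
  }
  oop decode(uint32_t narrow) const {
    return heap_base + (static_cast<uintptr_t>(narrow) << shift);
  }
};
```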

These low-level bit variations make it challenging to share the same archived object format from run to run. The main challenge is that all layers of layout decisions are made ahead of time, even though they must fit into potentially different constraints at runtime. It is inherent that different GC implementation strategies yield rather different layout policies. Having different archived object formats for different GCs might be acceptable when creating an object archive for a particular deployment. However, an object archive is also created for the default java launcher, allowing the JVM to start faster by default. In that scenario, it is challenging to predict, when building the JDK, which GC a user is going to select. The object payload that is placed bit by bit in the archive easily gets tainted by the layout constraints of the JVM that created it, which may or may not match the layout constraints of the JVM using the archive.

This JEP proposes a GC-agnostic object archiving mechanism. It abstracts away the two layout concerns that are GC-dependent: heap layout and object reference layout. Instead of mapping an object payload straight into the heap, this approach archives descriptions of how objects should be materialized at runtime. This extra level of indirection allows GCs to materialize objects under the layout constraints relevant to the deployed JVM process. The mechanism allocates objects, initializes their payloads, and links objects together one by one, in a way that allows full GC transparency. Loading objects in this way is referred to as "object streaming" in this document. The new object archiving mechanism can be explicitly selected with the `-XX:+DumpStreamableObjects` JVM option, but its use should be unnecessary for most users, as it will be selected heuristically when relevant. Note, however, that the setting of `-XX:+UseCompressedOops` must be the same when creating the archive as when using it.

### Design Overview ###

The archived objects have a notion of "roots". The roots are objects that are referenced from other JVM entities that are part of the AOT cache. Each root object may capture, through its references, an arbitrary graph of objects. When the corresponding JVM entity is loaded, it asks the object archive for the corresponding root object. When the object archive hands out a reference to the in-heap object, it is expected that all transitively reachable objects have been materialized and may be safely accessed. Therefore, when a root object is requested, the archived object loader also loads and links all transitively reachable objects.
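
As a minimal sketch of that contract, in illustrative C++ with hypothetical names (not the actual HotSpot API), a root accessor could look like this; the reference escapes only after the whole subgraph is materialized:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical root table: root index -> materialized heap object,
// or nullptr if this root has not been requested or streamed in yet.
extern std::vector<void*> root_table;
extern void* materialize_transitively(uint32_t root_index);

void* get_archived_root(uint32_t root_index) {
  void* obj = root_table[root_index];
  if (obj == nullptr) {
    // Lazily materialize this root and every object transitively
    // reachable from it, before the reference is allowed to escape.
    // (Coordination with the bootstrapping thread is discussed below.)
    obj = materialize_transitively(root_index);
    root_table[root_index] = obj;
  }
  return obj;
}
```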

In a way, the problem of loading such object graphs efficiently while hiding delays from the running application is rather similar in spirit to the problem of performing tracing GC. A tracing GC traverses all objects that are transitively live from the roots before it can determine what is garbage. Doing this while hiding the latency of the traversal is something ZGC has done with great success: it performs the object graph traversal concurrently with the application. The design of this JEP is similar in spirit. It materializes the transitively reachable objects of each root concurrently with the application. Loading of roots can be done lazily on demand, but the bulk of the work can be done by an extra bootstrapping thread while the main thread is starting the JVM. Lazy object loading is triggered when an ahead-of-time loaded JVM entity is first used and asks for a particular root object from the archived heap.

Objects typically have references to other objects, so the archived objects must encode references to other objects. This mechanism encodes object references in a GC-agnostic way, using the "object index", which describes the order in which an object was laid out in the object archive. Object indices start at one for the first object, and the number 0 conveniently represents the null value. The object index is the core identifier of an object in this approach. These indices lend themselves perfectly to optimized table lookups, as the tables may be implemented as simple arrays. There is one such table mapping object indices to materialized Java heap objects, and another table mapping object indices to the buffer offsets of the corresponding archived objects. Encoding object references as object indices is therefore convenient, as an index can be efficiently mapped both to the corresponding Java heap object and to the corresponding descriptor in the object archive.
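
The two tables could look roughly like this (an illustrative C++ sketch with hypothetical names):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Both tables are plain arrays indexed by object index; index 0 is
// reserved for null, so slot 0 stays null/unused.
struct StreamedObjectTables {
  // Object index -> materialized Java heap object (nullptr until the
  // object has been allocated by the streaming loader).
  std::vector<void*> heap_object;
  // Object index -> offset of the object's descriptor in the mapped
  // archive buffer.
  std::vector<size_t> buffer_offset;

  // Resolving an archived reference is a single array load; no
  // hashing or searching is required.
  void* resolve(uint32_t object_index) const {
    return heap_object[object_index]; // heap_object[0] == nullptr
  }
};
```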

### Traversal Concurrency ###

In order to load a root object and all objects transitively reachable from it, the archived objects must be traversed. The extra bootstrapping thread iterates over all of the roots and performs such a traversal for every root it encounters. However, this traversal schedule can be computed ahead of time: the objects are laid out in the archive in the exact order in which the extra bootstrapping thread will traverse them. This way, the object index and the traversal order become the same thing.
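
The ahead-of-time layout pass could be sketched as follows (illustrative C++ under assumed names; the concrete traversal order used is an implementation detail):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ArchivedObject; // hypothetical dump-time representation
extern std::vector<ArchivedObject*> references_of(ArchivedObject* obj);
extern void append_to_archive(ArchivedObject* obj);

uint32_t next_index = 1; // index 0 is reserved for null
std::unordered_map<ArchivedObject*, uint32_t> index_of;

// Lay out every object reachable from a root in traversal order, so
// that the object index assigned here equals the position at which
// the runtime's linear scan will encounter the object.
void layout_from_root(ArchivedObject* root) {
  std::vector<ArchivedObject*> stack{root};
  while (!stack.empty()) {
    ArchivedObject* obj = stack.back();
    stack.pop_back();
    if (obj == nullptr || index_of.count(obj) != 0) continue;
    index_of[obj] = next_index++;
    append_to_archive(obj); // archive position == object index order
    for (ArchivedObject* ref : references_of(obj)) stack.push_back(ref);
  }
}
```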

The immediate effect of this is that the extra bootstrapping thread does not need to perform an elaborate graph traversal that maintains explicit traversal data structures. It simply walks the objects in the archive linearly, knowing that this trivial linear order is the same as the graph traversal order. This makes the traversal faster. More importantly, however, it allows the archived objects to be partitioned into three distinct partitions:

 1. Objects already transitively materialized by the extra bootstrapping thread
 2. Objects currently being materialized by the extra bootstrapping thread
 3. Objects not yet processed nor concurrently accessed by the extra bootstrapping thread

This partitioning of the archived objects allows the extra bootstrapping thread to perform the bulk of its work without interfering with the main thread. When the main thread lazily loads a root that falls in the not-yet-materialized region, an explicit graph traversal is performed for that particular root. During this traversal, most of the work can be done independently of the concurrent materialization by the extra bootstrapping thread. Only when encountering objects in partition two is there any need for synchronization, which happens quite rarely in practice. When the main thread encounters objects that are concurrently being materialized, it waits for the extra bootstrapping thread to finish materializing them. Since the extra bootstrapping thread uses an optimized traversal, it will typically finish faster than the lazy materialization could anyway. The partition boundaries are shifted forward like a wavefront, atomically under a lock; the bulk of the work, however, is done outside of the lock.
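
The coordination could be sketched as follows (illustrative C++ with hypothetical names, simplified to two boundary indices guarded by one lock):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// The two boundaries split the object index space into the three
// partitions described above:
//   [1, materialized)        partition 1: fully materialized
//   [materialized, claimed)  partition 2: being materialized right now
//   [claimed, ...)           partition 3: not yet touched
struct Wavefront {
  std::mutex lock;
  std::condition_variable materialized_cv;
  uint32_t materialized = 1;
  uint32_t claimed = 1;

  // Called by the main thread during lazy loading, before it touches
  // the archived object with the given index.
  void wait_until_safe(uint32_t index) {
    std::unique_lock<std::mutex> guard(lock);
    // Partition 2: the bootstrapping thread owns it; wait briefly.
    while (index >= materialized && index < claimed) {
      materialized_cv.wait(guard);
    }
    // Partition 1 (index < materialized): already safe to access.
    // Partition 3 (index >= claimed): the caller traverses and
    // materializes it itself, coordinating under the same lock.
  }
};
```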

In summary, this ahead-of-time ordering yields a fast iterative traversal for the extra bootstrapping thread, while allowing laziness and concurrency with a minimal amount of coordination. This way, the extra bootstrapping thread removes the bulk of the object materialization work from the critical main thread.

### Object Linking ###

The table mapping object indices to Java heap objects is filled in when an object is allocated. Materializing an object involves allocating it, initializing its payload, and linking it with other objects. Since linking an object requires that the objects reachable through its reference fields are at least allocated, the iterative traversal of the extra bootstrapping thread first allocates all of the objects in its currently materializing partition, that is, all not-yet-materialized objects that are transitively reachable from the currently processed root. When all objects of the current partition have been allocated, payload initialization and linking are performed in a second pass.
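
In illustrative C++ (hypothetical helpers, in the spirit of the tables sketched earlier), the two passes over a partition of object indices could look like this:

```cpp
#include <cstdint>

// Hypothetical helpers over the index-keyed tables sketched above.
extern void* allocate_from_descriptor(uint32_t object_index);
extern void  initialize_payload(uint32_t object_index);
extern void  store_reference_field(void* holder, int field_offset,
                                   void* referee);

struct RefField { int offset; uint32_t target_index; };
extern const RefField* reference_fields(uint32_t object_index, int* count);
extern void* heap_object_at(uint32_t object_index);

void materialize_partition(uint32_t begin, uint32_t end) {
  // Pass 1: allocate every object in the partition, so that every
  // object index already resolves to a heap object before linking.
  for (uint32_t i = begin; i < end; i++) {
    allocate_from_descriptor(i); // fills heap-object table slot i
  }
  // Pass 2: copy payloads, then patch reference fields by mapping
  // each archived object index to the object allocated in pass 1.
  for (uint32_t i = begin; i < end; i++) {
    initialize_payload(i);
    int n = 0;
    const RefField* fields = reference_fields(i, &n);
    for (int f = 0; f < n; f++) {
      store_reference_field(heap_object_at(i), fields[f].offset,
                            heap_object_at(fields[f].target_index));
    }
  }
}
```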

In the lazy, traversal-based object materialization, an object's links are filled in once its children have been traversed; at that point, their object indices can be mapped to Java heap objects.

One interesting benefit of object-level linking is that the mechanism can better deal with ahead-of-time objects being linked with objects allocated at runtime by the deployment run. For example, the current direct-mapping-based object loader dumps the entire string table. Dumping the string table preserves a boolean identity property of certain string objects: whether or not they were the canonical interned string. In the streaming approach, we do not need to dump the entire string table. Instead, strings in the archive that were interned have a bit set in a bitmap, representing this identity property. When linking interned strings, we dynamically intern the string, which may link either to an ahead-of-time archived object or to a string interned at runtime by the deployment-run JVM.
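
As a sketch (illustrative C++, hypothetical names), linking an archived string then becomes:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: one bit per object index, set if the archived string
// was the canonical interned string during the training run.
extern std::vector<bool> interned_bitmap;
extern void* heap_object_at(uint32_t object_index);
extern void  set_heap_object(uint32_t object_index, void* obj);
// Stand-in for the JVM's string table intern operation.
extern void* string_table_intern(void* string_obj);

void* link_archived_string(uint32_t index) {
  void* s = heap_object_at(index);
  if (interned_bitmap[index]) {
    // Interning either makes the archived string the canonical
    // instance, or returns an equal string that the deployment run
    // interned first; all links then resolve to the canonical one.
    s = string_table_intern(s);
    set_heap_object(index, s);
  }
  return s;
}
```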

### Scalability ###

The streaming approach processes objects one by one, rather than mapping memory from a file straight into the Java heap, so it is worth discussing the scalability implications. A warm start is a start close in time to a previous start. In that case, the archived objects are still in the file system cache of the OS, and there is no need for I/O to read the archived objects from disk; even if I/O is required, the disk itself might have a RAM-based cache over the slower medium, making access faster. Conversely, a cold start is a start that does not benefit from such caching. Typically, cloud deployments are cold starts.
 
In a **cold start**, there is no free lunch: every byte has a cost. When mapping a file straight into the heap, establishing the memory mapping might complete rapidly, but the main work of loading the data from disk still takes time and causes stalls in application threads. As the application accesses objects from the archive that have not yet been materialized in memory by the OS, it must wait for the OS to materialize the page of memory the object resides on. Since the streaming approach accepts that there will be work per object, and instead aims at offloading that cost from critical bootstrapping, it hides the per-byte cost of a cold start better. It also lends itself more naturally to compression, since decompression can similarly be offloaded. Cold starts should ultimately benefit from a smaller artifact size.

As for **warm starts**, there is still no free lunch: page faults induced by memory mapping still have a cost, and the cost is greater in virtualized environments that have yet another layer of page table indirection. Having said that, memory-mapping-based object archiving generally seems to require slightly less CPU time in warm starts. But the streaming solution can offload the vast majority of the work to a separate thread, so its wall clock startup time seems to stay competitive.

Should we eventually need to process archives so large that the extra bootstrapping thread cannot keep up, the approach has also been designed to allow parallelization in the future. That would allow at least deployments with available CPU resources to process the objects faster, if concurrency alone is insufficient. As for CPU-constrained environments running large applications, the default heuristic would currently pick the existing mapping solution. Determining whether the GC-agnostic solution works well enough in such situations is outside the scope of this JEP. However, running such huge applications on a heavily hardware-constrained machine does sound like a niche use case.

Alternatives
------------

When implementing support for ZGC, it is not strictly necessary to build a GC-agnostic solution. One possible solution would be to double down on GC-specific logic and build a ZGC-specific object loader that lays out objects with the heap layout and pointer layout that ZGC expects. This has some notable disadvantages:

 - The AOT cache is not the only user of object archiving. There is also a default object archive shipped with the JDK that gets used unless a user specifies `-Xshare:off`. With a ZGC-specific solution, this would require an extra object archive for ZGC, inflating the size of the JDK unnecessarily compared to a GC-agnostic solution.
 - Development of ZGC would be slowed down and complicated by entangling GC implementation details with how objects are archived.

The advantages of doubling down on ZGC-specific object archiving logic are less clear. Presumably, the main advantage would be starting the JVM faster. However, current experiments indicate that the streaming object loader is very efficient without needing to introduce ZGC-specific knowledge.

As for GC-agnostic object archiving, different approaches have been considered. Most of them involved materializing all objects at once, very eagerly. This led to trouble when running with very small heap sizes, as GCs would be tempted to perform a collection after a significant part of the heap had been allocated, yet the JVM is not in a state where it can perform GCs that early in bootstrapping. Allowing laziness therefore makes the mechanism more GC-agnostic.

Testing
-------

A large number of object archiving tests have already been written. They will be adapted to regularly test ZGC with the new object streaming approach.

Risks and Assumptions
---------------------

Since the bulk of the work of object-level linking is performed by an extra bootstrapping thread, there is an assumption that it is acceptable to have both the main thread and the extra bootstrapping thread run at the same time. Some severely constrained cloud environments might not be willing to give the JVM an extra core, even for a short period of time, which risks delaying startup. Having said that, using a concurrent GC such as ZGC in such a constrained environment is in general not going to work very well either.

There is another risk: memory footprint. The existing heap archiving solution maps the archived objects straight into the Java heap, whereas the streaming approach loads the heap archive into a temporary location in memory while it materializes objects into the Java heap. Therefore, during bootstrapping, the archived heap footprint is higher due to this duplication. However, plotting typical memory usage over time, the usage during bootstrapping is typically far below the eventual memory footprint of the application once it is running. Hence, there will only be a footprint regression if the application never needs more memory (Java heap, native memory, code cache, etc.) than the size of the archived objects, which seems rather unlikely.

Comments
Looks good.
03-04-2024

Thanks for the comments, [~kvn]. I edited the JEP as requested.
03-04-2024

Can the title be "CDS Object Streaming"? "CDS" already assumes that it is an archive. Add labels: gc, cds, leyden.

I like that this is a general (for all GCs) JEP. But I don't see a clear statement that currently CDS does not support ZGC for object archiving. Maybe mention it in "Goals". My wording is not great, but perhaps: "Add support for CDS object archiving for the Z Garbage Collector (which is not supported currently)". I assume both bullets under "Secondary goals not visible to users are" apply to any GC and not only ZGC.

Should "Success Metrics" also mention performance for ZGC with archived objects vs. the current state (no archived objects)?

You did not mention `-XX:+UseCompressedClassPointers`, which also affects object layout.

"Loading of roots can therefore be done lazily, and the bulk of the work can be done in an extra CDS thread." It is not clear to me when this "extra" thread starts and how it coordinates with the main thread. I assume current CDS code is executed in the main thread during startup. But from your description this "extra" thread will run concurrently with the main thread and any Java thread. What does "lazy" mean here? What requests trigger work in the "extra" thread?

You need Ioi and someone from the GC group to review it too (listed as reviewers in the JEP).
02-04-2024