Bug ID: JDK-8239600 JEP 376: ZGC: Concurrent Thread-Stack Processing

Type: JEP
Component: hotspot
Sub-Component: gc

Priority: P3
Status: Closed
Resolution: Delivered
Fix Versions: 16

Submitted: 2020-02-21
Updated: 2021-03-07
Resolved: 2021-03-07

Related Reports

Relates :

JDK-8253180 - ZGC: Implementation of JEP 376: ZGC: Concurrent Thread-Stack Processing

Sub Tasks

JDK-8261332 - Release Note: JEP 376: ZGC Concurrent Stack Processing - Resolved

Description

Summary
-------

Move ZGC thread-stack processing from safepoints to a concurrent phase.

Goals
-----

* Remove thread-stack processing from ZGC safepoints.
* Make stack processing lazy, cooperative, concurrent, and incremental.
* Remove all other per-thread root processing from ZGC safepoints.
* Provide a mechanism by which other HotSpot subsystems can lazily process stacks.

Non-Goals
---------

* It is not a goal to implement concurrent per-thread processing of non-GC safepoint operations, such as class redefinition.

Success Metrics
---------------

* The throughput cost of the improved latency should be insignificant.
* Less than one millisecond should be spent inside ZGC safepoints on typical machines.

Motivation
----------

The ZGC garbage collector (GC) aims to make GC pauses and scalability issues in HotSpot a thing of the past. We have, so far, moved all GC operations that scale with the size of the heap and the size of metaspace out of safepoint operations and into concurrent phases. Those include marking, relocation, reference processing, class unloading, and most root processing.

The only activities still done in GC safepoints are a subset of root processing and a time-bounded marking termination operation. The roots include Java thread stacks and various other thread roots. These roots are problematic because they scale with the number of threads. With many threads on a large machine, root processing becomes a problem.

In order to move beyond what we have today, and to meet the expectation that time spent inside of GC safepoints does not exceed one millisecond, even on large machines, we must move this per-thread processing, including stack scanning, out to a concurrent phase.

After this work, essentially nothing of significance will be done inside ZGC safepoint operations.

The infrastructure built as part of this project may eventually be used by other projects, such as Loom and JFR, to unify lazy stack processing.

Description
-----------

We propose to address the stack-scanning problem with a _stack watermark barrier_. A GC safepoint will logically invalidate Java thread stacks by flipping a global variable. Each invalidated stack will be processed concurrently, keeping track of what remains to be processed. As each thread wakes up from the safepoint it will notice that its stack is invalid by comparing some epoch counters, so it will install a _stack watermark_ to track the state of its stack scan. The stack watermark makes it possible to tell whether a given frame lies above the watermark (assuming that stacks grow downward) and hence must not be used by a Java thread, since it may contain stale object references.
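The epoch check and watermark test can be illustrated with a small executable model. This is only a sketch: `GcState`, `JavaThread`, and all field names are hypothetical, frame addresses are plain integers, and the model places the watermark at the youngest frame without processing any frames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GcState:
    """Hypothetical global state; the GC flips `epoch` inside a safepoint."""
    epoch: int = 0

    def invalidate_all_stacks(self) -> None:
        # Constant-time regardless of thread count: no stack is touched here.
        self.epoch += 1


@dataclass
class JavaThread:
    """Hypothetical per-thread state for this sketch."""
    youngest_frame: int               # address of the last (most recent) frame
    watermark_epoch: int = 0          # epoch for which the watermark was installed
    watermark: Optional[int] = None   # None: no watermark installed yet

    def has_current_watermark(self, gc: GcState) -> bool:
        return self.watermark_epoch == gc.epoch

    def on_wakeup(self, gc: GcState) -> None:
        """Called when the thread resumes after a safepoint."""
        if self.has_current_watermark(gc):
            return
        # Stale epoch: install a watermark so frames above it are off limits.
        self.watermark = self.youngest_frame
        self.watermark_epoch = gc.epoch

    def frame_is_usable(self, frame_addr: int) -> bool:
        # Stacks grow downward, so older frames have higher addresses.
        return self.watermark is None or frame_addr <= self.watermark


gc = GcState()
thread = JavaThread(youngest_frame=0x1000)
gc.invalidate_all_stacks()   # done inside the safepoint
thread.on_wakeup(gc)         # done by the thread itself, after the safepoint
```

In this model the safepoint does a constant amount of work, and the per-thread cost is paid only when each thread resumes.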

In all operations that either pop a frame or walk below the last frame of the stack (e.g., stack walkers, returns, and exceptions), hooks will compare some stack-local address to the watermark. (This stack-local address may be a frame pointer, where available, or a stack pointer for compiled frames where the frame pointer is optimized away but frames have a reasonably constant size.) When that address is above the watermark, a slow path will be taken to fix up one frame by updating the object references within it and moving the watermark upward. In order to make returns as fast as they are today, the stack watermark barrier will use a slightly modified safepoint poll. The new poll takes a slow path not only when safepoints (or indeed thread-local handshakes) are pending, but also when returning to a frame that has not yet been fixed up. This can be encoded for compiled methods with a single conditional branch.
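One way a single comparison could cover both conditions is to fold the pending-safepoint state and the watermark into one per-thread word. The sketch below models that idea. The names `poll_word` and `return_poll_takes_slow_path`, the two sentinel values, and the 64-bit address width are assumptions made for illustration, not taken from the proposal.

```python
ALWAYS_SLOW = 0               # armed: every real stack address compares above it
NEVER_SLOW = (1 << 64) - 1    # disarmed: no 64-bit address compares above it


def poll_word(safepoint_pending, watermark):
    """Fold both slow-path conditions into one value (hypothetical encoding)."""
    if safepoint_pending:
        return ALWAYS_SLOW
    if watermark is None:      # stack fully processed, nothing pending
        return NEVER_SLOW
    return watermark


def return_poll_takes_slow_path(stack_local_addr, word):
    # The single conditional branch: the address is above the word either
    # because a safepoint or handshake is pending, or because the frame
    # being returned to has not been fixed up yet.
    return stack_local_addr > word


# Returning to a frame at 0x2000 while the watermark sits at 0x1800:
needs_fixup = return_poll_takes_slow_path(0x2000, poll_word(False, 0x1800))
```

The fast path stays a compare and a branch, and the slow path is left to work out which of the two conditions applied.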

An invariant of the stack watermark is that, given a callee which is the last frame of the stack, both the callee and the caller are processed. To ensure this, when the stack watermark state is installed when waking up from safepoints, both the caller and the callee are processed. The callee is armed so that returns from that callee will trigger further processing of the caller, moving the armed frame to the caller, and so on. Hence processing triggered by frame unwinding or walking always occurs two frames above the frame being unwound or walked. This simplifies the passing of arguments that have to be owned by the caller yet are used by the callee; both the caller and the callee frames (and hence the extra stack arguments) can be accessed freely.
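The invariant can be checked with a toy model in which frames are just names in a list. `WatermarkedStack` and its methods are hypothetical, and "processing" a frame only records its name.

```python
class WatermarkedStack:
    """Toy model: frames[0] is the last (youngest) frame, higher indices are callers."""

    def __init__(self, frames):
        self.frames = list(frames)
        self.processed = set()
        self.log = []            # order in which frames were processed

    def _process(self, frame):
        if frame not in self.processed:
            self.processed.add(frame)
            self.log.append(frame)

    def install(self):
        # On wakeup from a safepoint: process both the callee and its caller.
        for frame in self.frames[:2]:
            self._process(frame)

    def invariant_holds(self):
        # The last frame and its caller (if any) must both be processed.
        return all(frame in self.processed for frame in self.frames[:2])

    def unwind(self):
        # Returning from frames[0]: first fix up the frame two above it, so
        # that the new last frame and its caller are both processed.
        if len(self.frames) > 2:
            self._process(self.frames[2])
        return self.frames.pop(0)


stack = WatermarkedStack(["leaf", "mid", "caller", "main"])
stack.install()     # processes "leaf" and "mid"
stack.unwind()      # returning from "leaf" processes "caller"
```

After installation each return processes exactly one additional frame, always two frames above the one being unwound.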

Java threads will process the minimum number of frames needed to continue execution. Concurrent GC threads will take care of the remaining frames, ensuring that all thread stacks and other thread roots are eventually processed. Synchronization, utilizing the stack watermark barrier, will ensure that Java threads do not return into a frame while the GC is processing it.
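The division of labor can be modeled with two ordinary threads sharing one stack. This sketch assumes a single coarse lock per stack, which is a simplification chosen for the example and not necessarily how the barrier synchronizes. `SharedStack`, `java_thread`, and `gc_thread` are hypothetical names.

```python
import threading


class SharedStack:
    """Toy stack whose frames are processed youngest-first, one at a time."""

    def __init__(self, depth):
        self.lock = threading.Lock()
        self.times_processed = [0] * depth
        self.processed_count = 0   # frames [0, processed_count) are done

    def process_one(self, needed):
        """Process one frame if fewer than `needed` are done.

        Returns True if a frame was processed, False if the target is met.
        """
        with self.lock:
            target = min(needed, len(self.times_processed))
            if self.processed_count >= target:
                return False
            self.times_processed[self.processed_count] += 1
            self.processed_count += 1
            return True


def java_thread(stack, frames_needed=2):
    # Does only the minimum needed to continue: the callee and its caller.
    while stack.process_one(frames_needed):
        pass


def gc_thread(stack):
    # Finishes whatever the Java thread left behind.
    while stack.process_one(len(stack.times_processed)):
        pass


stack = SharedStack(depth=50)
workers = [
    threading.Thread(target=java_thread, args=(stack,)),
    threading.Thread(target=gc_thread, args=(stack,)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

Whichever thread reaches a frame first processes it. The lock ensures that no frame is processed twice and that neither thread uses a frame while the other is still working on it.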

Alternatives
------------

When it comes to dealing with stack walkers, we considered the alternative solution of sprinkling load barriers across the VM where object references are loaded from the stack. We dismissed this because it fundamentally could not guarantee that internal pointers into objects are processed correctly. The base pointer of an internal pointer must always be processed after the internal pointer, and stack walkers would risk violating that invariant. Therefore we chose the approach of processing the whole frame, if not already processed, via stack walking.

Testing
-------

The main code paths affected by this work are paths that other tests already stress to a great degree, so stress testing with the existing testing infrastructure should be sufficient.

Comments

Thanks for the explanations. For the title, how about "ZGC: Concurrent Thread-Stack Processing"? A general audience will know what a "thread stack" is, but they might wonder whether an "execution stack" is something different. Since one millisecond is something that we can measure, I moved that goal into the Success Metrics section. To clarify the mention of the time-bounded marking termination operation I removed the word "arbitrarily" -- that was confusing to me since it wasn't clear that it was arbitrary in the sense it's chosen by you, rather than by an end user or a heuristic. To clarify the use of "snapshot" it turned out to be easy just to remove that word. The new description of the last-callee invariant makes sense. Let me know if this looks okay now, and I'll move it to Candidate.

10-03-2020

Thanks for the editing, Mark. It looks good. "ZGC: Concurrent Thread Processing" would indeed be a more accurate title. But the stack processing is 90% of the problem I am solving. That is why I chose to focus on that in the title. Of course we can change this if you think thread processing is better. I agree the sub-millisecond safepoint latency goal is worth highlighting, and added that to the goals as you proposed. As for the time-bounded marking termination, this is a limited amount of marking we can perform in safepoints to deal with spurious resurrections due to weak references. We today have an arbitrarily chosen limit of 1 ms work to be performed inside of the safepoint before reverting back to a concurrent phase. That number was chosen based on what time our other safepoint operations were taking before. Now that this number is getting lower and lower, the time limit for safepoint-based marking termination will also be adjusted accordingly, such that our safepoints remain sub-millisecond for marking termination as well. I am okay with removing the mention of this, as it could be viewed as a technical detail that users do not need to know about. What I mean by snapshot of each stack is more on a conceptual level. When we safepoint, we have a snapshot of the state of all stacks. That exact state of the stacks will be processed, with the exact information it had in the safepoint, but the actual work is deferred and performed concurrently and cooperatively outside of the safepoints. We are not copying any stacks as part of the GC operations. Regarding the processing technique that keeps both callee and caller processed at any given time, I have updated my description of this in a way that I hope makes sense. This is a bit tricky; we initially process 2 frames, and then keep on processing 1 frame at a time that violates the invariant, but always 2 frames ahead of the unwinding frame. This way, the caller and callee relationship is invariant of our concurrent GC activities. Completing this JEP is not a prerequisite of making ZGC a production feature. Our latencies are already great. This work will essentially mark a point where we consider GC latencies induced by GC processing inside of safepoints to essentially be finished. That is a nice thing to have, but not a requirement for ZGC to be widely useful for our users.

07-03-2020

I've copy-edited the text to streamline the wording and remove the use of the passive voice where possible. Please check it over to see if I misunderstood anything. A few questions: Would "ZGC: Concurrent Thread Processing" be a more accurate title, since a (Java) thread includes both an execution stack plus other roots? Is the one-millisecond safepoint latency goal worth highlighting, either in the Goals or Success Metrics sections? You mention "arbitrarily time-bounded marking termination operation." Is that operation sufficiently bounded that it won't interfere with the one-millisecond goal? If so, then please either clarify that or omit mention of it. What do you mean by "snapshot of each stack?" You're not copying the stacks, are you? The discussion of the last-callee invariant is confusing. You say that both the caller and the callee are processed when waking up from a safepoint, yet you also say that the callee is "armed so that returns into the caller will process the caller." That suggests that only the callee is processed when waking up, and the caller is processed when the callee returns. Which is correct, and what is the true invariant? Is completing this JEP a prerequisite to making ZGC a production feature?

06-03-2020