JDK-8203469 : Faster safepoints
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 12
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2018-05-21
  • Updated: 2021-03-19
  • Resolved: 2019-02-15
JDK 13
13 b09 Fixed
Related Reports
Relates :  
Sub Tasks
JDK-8214271 :  
Description
Background:
ZGC often does very short safepoint operations. For perspective, in a
SPECjbb2015 run, G1 can have young collection pauses lasting about 170 ms, while
in the same setup ZGC does 0.2 ms to 1.5 ms operations, depending on which
operation it is. The time it takes to stop and start the JavaThreads is very
large relative to a ZGC safepoint operation. With an operation that takes just
0.2 ms, the overhead of stopping and starting JavaThreads is several times the
operation itself.

High-level functionality change:
Serializing the starting of threads over Threads_lock takes time.
- Don't wait on Threads_lock; use the WaitBarrier instead.
Serializing the stopping of threads over Safepoint_lock takes time.
- Let threads stop in parallel and remove Safepoint_lock.

Details:
JavaThreads have two abstract logical states: unsafe or safe.
- Safe means the JavaThread will not touch the Java heap or VM internal
  structures without first doing a transition and blocking.
        - The safe states are:
                - When polls are armed: _thread_in_native and _thread_blocked.
                - When Threads_lock is held: the externally suspended flag is
                  set.
        - The VMThread has polls armed and holds the Threads_lock during a
          safepoint.
- Unsafe means that either the Java heap or VM internal structures can be
  accessed by the JavaThread, e.g., _thread_in_Java, _thread_in_vm.
        - All combinations that are not safe are unsafe.
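
The classification above can be written as a small predicate. This is only an
illustrative sketch in Java, not HotSpot code: the enum, the method name isSafe,
and the three boolean flags are hypothetical names chosen to mirror the list.

```java
class SafepointStates {
    // Only the four states named in the description; the real VM has more.
    enum ThreadState { IN_JAVA, IN_VM, IN_NATIVE, BLOCKED }

    /**
     * Returns true if a JavaThread in this situation counts as safe.
     * All parameters are illustrative stand-ins for VM-internal conditions.
     */
    static boolean isSafe(ThreadState state, boolean pollsArmed,
                          boolean threadsLockHeld, boolean externallySuspended) {
        // Safe by state: only while polls are armed.
        boolean safeByState = pollsArmed
                && (state == ThreadState.IN_NATIVE || state == ThreadState.BLOCKED);
        // Safe by suspension: only while Threads_lock is held.
        boolean safeBySuspend = threadsLockHeld && externallySuspended;
        // Every other combination is unsafe.
        return safeByState || safeBySuspend;
    }

    public static void main(String[] args) {
        for (ThreadState s : ThreadState.values()) {
            System.out.println(s + " with polls armed: safe="
                    + isSafe(s, true, false, false));
        }
    }
}
```

Note that _thread_in_native without armed polls is treated as unsafe here,
matching the rule that every combination not listed as safe is unsafe.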

We cannot start a safepoint until all unsafe threads have transitioned to a safe
state. To make them safe, we arm polls in compiled code and make sure any
transition to another unsafe state will be blocked. JavaThreads which are unsafe
with state _thread_in_Java may transition to _thread_in_native without being
blocked, since the thread has then become safe and we can proceed. Any safe
thread may try to transition to an unsafe state at any time, thus entering the
safepoint blocking code at any moment, e.g., after the safepoint is over, or
even at the beginning of the next safepoint.

The VMThread cannot tolerate false positives from the JavaThread thread state,
because that would mean starting the safepoint without all JavaThreads being
safe. The two locks (Threads_lock and Safepoint_lock) make sure we never observe
false positives from the safepoint blocking code. If we remove them, how do we
handle false positives?

The JavaThread first publishes the barrier tag (the safepoint counter) that it
will call WaitBarrier.wait() with as the thread's safepoint id, and only then
changes its state to _thread_blocked. This lets the VMThread decide whether it
can ignore a JavaThread by doing a stable load of the state. A stable load of
the thread state is successful if the thread's state is the same both before and
after the load of the safepoint id, and the safepoint id is the current one or
InactiveSafepointCounter. If the stable load fails, the thread is considered
safepoint unsafe. It is no longer enough that a thread has state
_thread_blocked; it must also have the correct safepoint id and still be
_thread_blocked afterwards.
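
The publish-then-check protocol can be sketched as follows. This is a minimal
Java model, not the HotSpot implementation: ThreadRecord, blockForSafepoint and
isSafeForSafepoint are hypothetical names, the value 0 for the inactive counter
is an assumption, and Java atomics stand in for the VM's memory ordering.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class StableLoadSketch {
    // Simplified thread states; names mirror the ones used in the description.
    static final int THREAD_IN_JAVA = 0;
    static final int THREAD_IN_NATIVE = 1;
    static final int THREAD_BLOCKED = 2;

    // Assumed sentinel meaning "not blocked in any safepoint".
    static final long INACTIVE_SAFEPOINT_COUNTER = 0L;

    static final class ThreadRecord {
        final AtomicInteger state = new AtomicInteger(THREAD_IN_JAVA);
        final AtomicLong safepointId = new AtomicLong(INACTIVE_SAFEPOINT_COUNTER);

        // JavaThread side: publish the safepoint id first, then the state.
        void blockForSafepoint(long safepointCounter) {
            safepointId.set(safepointCounter);
            state.set(THREAD_BLOCKED);
            // A real thread would now wait on the barrier with this tag.
        }
    }

    // VMThread side: state, then id, then state again.
    static boolean isSafeForSafepoint(ThreadRecord t, long currentCounter) {
        int before = t.state.get();
        long id = t.safepointId.get();
        int after = t.state.get();
        boolean stable = before == after
                && (id == currentCounter || id == INACTIVE_SAFEPOINT_COUNTER);
        if (!stable) {
            return false; // failed stable load: treat the thread as unsafe
        }
        return before == THREAD_BLOCKED || before == THREAD_IN_NATIVE;
    }

    public static void main(String[] args) {
        ThreadRecord t = new ThreadRecord();
        System.out.println(isSafeForSafepoint(t, 4)); // false: still in Java
        t.blockForSafepoint(4);
        System.out.println(isSafeForSafepoint(t, 4)); // true: blocked, id matches
        System.out.println(isSafeForSafepoint(t, 6)); // false: id is stale
    }
}
```

The last call shows the case the locks used to rule out: a thread that still
looks _thread_blocked from an earlier safepoint is not counted as stopped for
the current one.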

Performance:
The result of faster safepoints is that the average CPU time for JavaThreads
between safepoints is higher, thus increasing the allocation rate. The thread
that stops first waits a shorter time until it gets started. Even the thread
that stops last has a shorter stop, since we start threads faster. If your
application is using a concurrent GC it may need re-tuning, since each Java
worker thread has an increased CPU time/allocation rate. Often this means max
performance is achieved using slightly fewer Java worker threads than before.
The increased allocation rate also means a shorter time between GC safepoints.
- If you are using a non-concurrent GC, you should see improved latency and
  throughput.
- After re-tuning with a concurrent GC, throughput should be equal or better,
  with better latency. But bear in mind this is a latency patch, not a
  throughput one.
With the current code, a Java thread is not guaranteed to run between safepoints
(in theory a Java thread can be starved indefinitely), since the VM thread may
re-grab the Threads_lock before the Java thread has woken up from the previous
safepoint. This can happen if the GC/VM doesn't respect MMU (minimum mutator
utilization) or if your machine is very over-provisioned.
The current scheme thus re-safepoints quickly if the Java threads have not
started yet, at the cost of latency. Since the new code uses the WaitBarrier
with the safepoint counter, all threads must roll forward to the next safepoint
by getting at least some CPU time between two safepoints. This makes MMU
violations more obvious.
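
The tag semantics of such a barrier can be illustrated with a toy version. This
is a sketch under assumptions, not the HotSpot WaitBarrier: the class
TaggedWaitBarrier and its methods arm, disarm and await are hypothetical, the
value 0 is assumed to mean "disarmed", and a Java monitor is used, which still
serializes wake-ups and so shows none of the performance benefit.

```java
class TaggedWaitBarrier {
    private static final int DISARMED = 0; // assumed sentinel tag
    private int armedTag = DISARMED;

    // Arm the barrier for one safepoint, identified by a non-zero tag.
    synchronized void arm(int tag) {
        if (tag == DISARMED) {
            throw new IllegalArgumentException("tag must be non-zero");
        }
        armedTag = tag;
    }

    // Release every thread waiting on the currently armed tag.
    synchronized void disarm() {
        armedTag = DISARMED;
        notifyAll();
    }

    // Block only while the barrier is armed with exactly this tag.
    // A caller holding a tag that is not armed returns immediately.
    synchronized void await(int tag) {
        boolean interrupted = false;
        while (armedTag != DISARMED && armedTag == tag) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true; // keep waiting, restore the flag later
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }

    synchronized boolean isArmed() {
        return armedTag != DISARMED;
    }

    public static void main(String[] args) throws InterruptedException {
        TaggedWaitBarrier barrier = new TaggedWaitBarrier();
        barrier.arm(5);
        barrier.await(4); // old tag: returns at once, does not block
        Thread t = new Thread(() -> {
            barrier.await(5); // current tag: blocks until disarm()
            System.out.println("released from safepoint 5");
        });
        t.start();
        barrier.disarm();
        t.join();
    }
}
```

Because a waiter is tied to one tag, a thread that arrives late with an old
safepoint counter cannot be captured by a later safepoint it never agreed to
stop for.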

Some example numbers:
- On a 16-strand machine, synchronization and un-synchronization/starting are at
  least 3x faster (in a non-trivial test): synchronization ~600 us -> ~100 us
  and starting ~400 us -> ~100 us.
  (The semaphore path is a bit slower than futex in the WaitBarrier on Linux.)
- SPECjvm2008 serial (untuned G1) gives 10x faster synchronization time (1 ms
  vs 100 us) on 16 strands and a ~5% score increase. In this case the GC op is
  1 ms, so we reduce the overhead of synchronization from 100% to 10%.
- SPECjbb2015 ParGC: ~9% increase in critical-jOPS.
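
The overhead figures in the SPECjvm2008 bullet follow from dividing the
synchronization time by the operation time. A small sketch of that arithmetic,
using only the numbers quoted above (the class and method names are
hypothetical):

```java
class SafepointOverhead {
    // Synchronization time as a percentage of the safepoint operation time.
    static double overheadPercent(double syncMicros, double opMicros) {
        return 100.0 * syncMicros / opMicros;
    }

    // How many times faster the new time is than the old one.
    static double speedup(double beforeMicros, double afterMicros) {
        return beforeMicros / afterMicros;
    }

    public static void main(String[] args) {
        double gcOpMicros = 1000.0; // 1 ms GC operation
        System.out.println(overheadPercent(1000.0, gcOpMicros)); // before: 1 ms sync
        System.out.println(overheadPercent(100.0, gcOpMicros));  // after: 100 us sync
        System.out.println(speedup(1000.0, 100.0));              // sync speedup
    }
}
```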

Comments
For your average Java app, in seconds (G1):
  Average time application was stopped: 0.00508362 s
  Time to stop the JavaThreads: 0.00212166 s
Stopping without the Safepoint_lock:
  Average time application was stopped: 0.00380104 s
  Time to stop the JavaThreads: 0.000781617 s
So stopping is about 3x faster, which in this case gives 20% faster safepoints
on average. Starting the threads is 6 times as fast, but this varies depending
on workload and hardware.
13-09-2018

Not statistically verified (Linux x64):
G1 with biased locking:
  Futex: +3.22% critical-jOPS
  Sema:  +3.43% critical-jOPS
ParallelGC with biased locking:
  Futex: +11.66% critical-jOPS
  Sema:  +9.68% critical-jOPS
G1 WITHOUT biased locking:
  Futex: +3.13% critical-jOPS
  Sema:  -1.95% critical-jOPS
ParallelGC WITHOUT biased locking:
  Futex: +4.71% critical-jOPS
  Sema:  +7.46% critical-jOPS
Preview code: http://cr.openjdk.java.net/~rehn/8203469/preview_v2/webrev/
Rebase-merge: http://cr.openjdk.java.net/~rehn/8203469/preview_v3/webrev/
Passes tier1-5.
17-08-2018