JDK-8236485 : Epoch synchronization protocol for G1 concurrent refinement
  • Type: Sub-task
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 15
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-12-21
  • Updated: 2022-09-12
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Blocks :  
Blocks :  
Relates :  
Description
Implement an epoch synchronization protocol between G1's concurrent refinement threads and mutator threads. The protocol is necessary for removing StoreLoad memory fence in G1's post-write barrier. The protocol guarantees that all Java heap stores in mutator threads prior to the initiation of the protocol are visible to the protocol-initiating thread when the protocol finishes. The implementation ensures that each mutator thread has satisfied at least one of the following conditions:
  - the mutator thread executed an operation that implies a StoreLoad fence;
  - the mutator thread established a release-acquire ordering with the protocol-initiating thread.

The protocol maintains the following data structures:
 - global_epoch: a global atomic counter;
 - T.epoch: a thread-local counter for a mutator thread T;
 - global_frontier: a minimum value across all thread-local counters for all mutator threads;

Each mutator thread copies current global_epoch to its local_epoch after executing certain runtime operations. For example, when the thread transitions from _thread_in_vm to _thread_in_Java state, when the thread does a handshake operation. These runtime operations happen frequently enough to make the protocol returns quickly in most cases. Note that the update to T.epoch by T, and the load of T.epoch from a remote protocol-initiating thread also establish a release-acquire ordering. Thus, it is not necessary for these runtime operations to imply a StoreLoad fence (although they usually imply).

A thread doing concurrent refinement initiates the protocol after cleaning the cards to refine (see G1RefineBufferedCards::refine()). The initiation of the protocol also provides full fence, which supersedes the existing StoreLoad fence after cleaning the cards.

The following pseudocode demonstrates how a refinement thread uses the protocol:

clean_cards(buffer);
required_frontier = atomic_inc(&global_epoch);  // required_frontier is the value after increment; provides full fence.
…  // doing some work to pre-process the card(s), e.g. sorting.
if (synchronize(required_frontier)) {
  refine_cleaned_cards(buffer);
} else {
  enqueue_deferred_cards(buffer, required_frontier);
}

bool synchronize(required_frontier) {
  if (!check_synchronize(required_frontier)) {
    // Issues a no-op asynchronous handshake (JDK-8238761) to a set of mutator threads.
    async_handshake();
  }
  // spin-wait by loading each mutator thread’s local_epoch
  while (!timeout) {
    if (check_synchronize(required_frontier)) {
      return true;
    }
  }
  return false;
}

bool check_synchronize(required_frontier) {
  if (required_frontier <= atomic_load(&global_frontier)) {
    return true;
  }
  current_frontier = MIN(atomic_load(&T.local_epoch)
                                      foreach mutator thread T potentially in _thread_in_Java state);
  if (current_frontier >= required_frontier) {
    // update global_frontier if current_frontier is larger
    global_f = atomic_load(&global_frontier);
    CAS(&global_frontier, global_f, MAX(global_f, current_frontier);
    return true;
  }
  return false;
}

Since we use asynchronous handshake and a spin loop with timeout, we need to support asynchronous refinement.
We add a global _deferred queue to G1DirtyCardQueueSet, in addition to the existing _completed and _paused queues. The _deferred queue is only processed when the thread cannot get a buffer from _completed or _paused queues. This effectively defers processing these cards while waiting for their epoch synchronization to finish.

In practice, with the default setting (in particular G1UpdateBufferSize=256), card refinement is infrequent and almost no buffer is enqueued into _deferred.
In a stress testing config (with G1UpdateBufferSize=4 and other knobs), a few buffers are enqueued to _deferred, but they almost never get processed by a refinement thread. I.e., they are deferred to be processed in G1RemSet::merge_heap_roots() in the next evacuation pause.

In the future, we would like to use sys_membarrier() syscall for OSes that support it. This could possibly avoid doing the async handshake and having the deferred queue.

Comments
https://github.com/openjdk/jdk/pull/4005 has a great discussion on the interaction between the epoch sync protocol and handshakes. I'll update the description once we have finalized the implementation.
20-05-2021