JDK-8226197 : Reduce G1���s CPU cost with simplified write post-barrier and disabling concurrent refinement
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9,10,11,12,13
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-06-15
  • Updated: 2020-09-16
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Description
While migrating our production services from CMS to G1, we found that G1���s complicated write post-barrier incurs considerable CPU cost. Currently the post-barrier is the following for a write ���p.f = q���:

if ((p xor q) >> LOG_REGION_BITS != 0) {  // if the write crosses region boundary
  if (q != null) {
    card_address = &card_table[addr_to_index(p)]
    if (*card_address != YOUNG) {
      store_load_fence;
      if (*card_address != DIRTY) {
        *card_address = DIRTY;
        T.dirtyCardQueue.enqueue(card_address);
      }
    }
  }
}

And for CMS the write barrier is only:
card_address = &card_table[addr_to_index(p)]
*card_address = DIRTY;

The complexity of G1���s write post-barrier is due to the need to support concurrently refinement threads. However, even if user has set -XX:G1ConcRefinementThreads=0, the write post-barrier remains the same. Ideally the write post-barrier could be much simpler if there is no concurrent refinement.

This RFE proposes to add a mode to G1 that uses a simplified write post-barrier:
if ((p xor q) >> LOG_REGION_BITS != 0) {
  if (q != null) {
    card_address = &card_table[addr_to_index(p)]
    *card_address = DIRTY;
  }
}

In this mode, G1 would disable concurrent refinement and per-Java-thread dirty card queue. G1 would need to process all dirty cards during a collection pause. Thus pause time could become longer, but as long as MaxGCPauseMillis is reasonably large with regard to the heap size, G1���s adaptive heuristics should still be able to adjust the young-gen size to meet the pause time goal.

This new mode would reduce G1���s CPU usage considerably. It will be particularly helpful for certain types of workloads, e.g.:
 - Workloads heavily tuned for CMS to minimize old-gen collections, and sensitive to CPU usage;
 - Workloads that mainly care about throughput and CPU usage;

I have implemented a prototype for this mode, and attached some preliminary results.

Comments
I agree that the the enqueuing should be suppressed if refinement is disabled. However, I'm skeptical whether the write barrier proposed in JDK-8226731 could save much overhead. I actually did many experiments to narrow down and measure the cost of each operation in the write post-barrier. The methodology is to disable all of G1's incremental collections, so it only triggers full GC. Then I can make arbitrary change to the write barrier without affecting correctness. I measured the performance with DaCapo with its large workload, and with "-Xms4g -Xmx4g -XX:NewRatio=1". The conclusion from those experiments is that the write post-barrier incurs 8% overhead on average, and there is no single operation that is the culprit of most of the overhead (i.e., the overhead is somewhat evenly distributed across all the operations in the write post-barrier). Surprisingly, removing the store-load fence or the enqueue operation does not reduce overhead much. This is likely due to the if checks for Dirty and Young cards filtered out many calls to the fence and enqueue. One configuration of my experiments generates the following write barrier, which is quite similar to the proposed one in JDK-8226731: 1) if (p and q in same region or q == NULL) -> exit 2) card_val = *card(p) 3) if (card_val = Young) -> exit 4) if (card_val == Dirty) -> exit 5) *card(p) = Dirty; 6) enqueue(card(p)) This configuration reduces the average overhead of the write barrier only to 6.6%.
29-06-2019

Attached patch (single-check-no-young-cards.diff) implements the barrier suggested by JDK-8226731 without the enqueue on top of the webrev posted to the mailing list; it adds the "if card(p) == dirty -> exit" check, and removes the use of the "Young" cards, instead marking all Young regions as "Dirty" to the same effect. Ie. if (p and q in same regions) -> exit if (q is NULL) -> exit if (card(p) == Dirty) -> exit *card(p) = Dirty This patch breaks use of "regular" G1 and is otherwise really ugly though :) It seems to pass very few gc tests and our internal perf benchmark suite, so it should be "okay" for initial perf testing. Note that with the imminent push of JDK-8213108 the original prototype and this change won't apply any more. In my testing so far the throughput is statistically equivalent (alpha=0.05, 5 runs each) to the throughput of the barrier suggested in this CR for all our perf benchmarks. Similar to the original patch it disables refinement, so its CPU usage should also be the same.
29-06-2019

The suggestion in JDK-8226731 only decreases the barrier to the following: 1) if (p and q in same regions and q not NULL) -> exit 2) if (card(p) == Old) -> exit 3) card(p) = Dirty; 4) enqueue(card(p)) This adds line 2) to the suggested barrier, i.e. check the card first whether it already contains a dirty card or not. This is not absolutely required for the barrier, but since it is exactly the code added by the UseCondCardMark option for other collectors, for the same reasons as stated there it would be beneficial to have (and needs to be implemented anyway to support UseCondCardMark) particularly on large machines/container environments.
29-06-2019

Another comment is that some pseudocode of the barrier in the above comments and JDK-8226731 seem inverted for the first condition: if (p and q in different regions and q not NULL) -> exit OR if (p and q in different regions) -> exit should become: if (p and q in same region or q == NULL) -> exit OR if (p and q in same region) -> exit
28-06-2019

I rerun my experiments and confirm that the JDK-8226731-barrier with "if (card(p) == Dirty) -> exit" does not incur noticeable CPU or throughput overhead in most cases, compared to the barrier proposed in this CR. The only significant differences are (using the same negative/positive notation as Thomas): bigramtester (20g): +2% DaCapo Jython (4g, over-provisioning): -1.6% Both benchmarks are special cases in my opinion (probably more true for jython). We could investigate the differences later.
28-06-2019

Rechecked the benchmarks (on x86-64 Linux only) where there previously were significant differences to "vanilla G1" (specjbb2015, specjvm2008.xml-transform, specjvm2008.xml-validation), each with fixed young gen, fixed size heap and AlwaysPreTouch enabled. Heap sizes derived from a previous run without heap size setting (i.e. approximately what ergonomics would select to keep a reasonable number of GCs in the mix). Percentages shown below are relative scores of JDK-8226731-barrier to suggested barrier; numbers in brackets are -Xmn/-Xmx sizes; -Xms is always equal to -Xmx. Negative percentage values mean that the JDK-8226731 barrier is slower (smaller score). 20 runs each, values are considered significant with alpha=0.05 specjbb2015 (2.4g/3.5g): -0.19% (not significant) xml-transform (10g/15g): -0.83% (significant) xml-validation (5g/7g): -0.66% (not significant) So the implementation of the JDK-8226731 barrier seems a very small bit slower for these benchmarks. However the difference seems to be acceptable given other benefits.
27-06-2019

That sounds good. I will look into merging this CR with JDK-8226731, and rebasing the implementation after JDK-8213108, and dynamically patching the barrier similarly as JDK-8210498, as well as fixing JDK-8226738.
26-06-2019

An alternative/complement to introducing a completely new and separate mode in G1 could be the changes suggested by JDK-8226731. Then only generating the enqueuing of new cards into the refinement buffer may need to be suppressed or made optional in some way (e.g. if refinement is active) in one or another way. It may be beneficial to make it runtime configurable in some way to allow G1 to switch dynamically as the situation allows. This need not necessarily be a flag checked at runtime (but may be the best option for e.g. Interpreter/C1), but could also be activated as necessary using available nmethod entry barriers (JDK-8210498).
25-06-2019