Bug ID: JDK-8130918 G1 barriers are laid out in full on critical path

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

Other
tbdResolved

Apparently, JDK-8026293 does not help this, and the full barrier is laid out on the critical path. 
This is very bad for instruction caches and static branch prediction, not to mention it makes the performance research harder.

This is clearly visible on a simple benchmark that stores the reference fields:
 http://cr.openjdk.java.net/~shade/8130918/G1Barriers.java

(Runnable JAR: http://cr.openjdk.java.net/~shade/8130918/benchmarks.jar, run with "-f 1 -prof perfasm:mergeMargin=200" to get the disassembly)

This is the hot loop in -XX:+UseParallelGC case:
  http://cr.openjdk.java.net/~shade/8130918/parallel.perfasm

And this is the hot loop in -XX:+UseG1GC case (now default in JDK 9):
  http://cr.openjdk.java.net/~shade/8130918/g1.perfasm

Please note the significant part of G1 barrier is cold, and we jump out on second branch.

Suggestion: move away the uncommon parts of the G1 barrier out of the critical path. This probably requires branch/value profiling to figure out the "store shape" profiles for each place store barrier is happening. In the pathological example in this issue, it would be nice to figure we always store null, and don't produce the rest of the barrier.

Fine. I have no reason to continue this any further. Closing.
21-01-2020
I quite agree with Thomas that the microbenchmark in this CR is not representative for the purpose of optimizing branch frequency. It only stress-tests storing a null reference via the result of a getfield, which is likely an uncommon case. JDK-8225776 profiled the branches through more realistic benchmarks, and we indeed saw 1% CPU improvement for an important production workload. That said, I think it is a valid request to make C2 able to figure out the putfield always stores null, so G1BarrierSetC2::post_barrier() can avoid emitting the post-write barrier. Perhaps we could ask the compiler team if such optimization is feasible?
21-01-2020
I was hoping JDK-8225776 would resolve this, but it still reproduces with JDK 14 EA: https://cr.openjdk.java.net/~shade/8130918/jdk14-g1.perfasm https://cr.openjdk.java.net/~shade/8130918/jdk14-parallel.perfasm
21-01-2020
That was probably a blanket reassignment after I left? Going to check if this is still applicable for current G1, maybe we now leverage some data, or make better guesses about the layout.
30-01-2017
[~dmcox], why is this assigned to you?
30-01-2017
Added a suggestion, reopening.
15-07-2015
Or maybe the suggestion is to just put the generated code blocks somehow out of line of the main loop (and do not do a call). Or something completely different, like this is supposed to be a CR to investigate the actual impact on more representative applications further.
15-07-2015
This CR does not give any suggestion what is supposed to be done here, just stating some facts given some random benchmark that may exhibit a particular bad situation. I believe the suggestion is about somehow trying to determine the best tradeoff between completely inlining and adding calls at some point given execution feedback. While this is very interesting for the gc team, changing compiler policies seems to be something that the compiler team would be best suited to do, so I am changing the sub-component.
15-07-2015

Relates :	JDK-8131299 - G1 write post-barrier can operate on raw pointers
Relates :	JDK-8026293 - Schedule part of G1 pre-barrier late
Relates :	JDK-8225776 - Optimize branch frequency of G1's write post-barrier in C2