Apparently, JDK-8026293 does not help here, and the full barrier is laid out on the critical path.
This is very bad for instruction caches and static branch prediction, not to mention that it makes performance research harder.
This is clearly visible on a simple benchmark that stores to reference fields:
http://cr.openjdk.java.net/~shade/8130918/G1Barriers.java
(Runnable JAR: http://cr.openjdk.java.net/~shade/8130918/benchmarks.jar, run with "-f 1 -prof perfasm:mergeMargin=200" to get the disassembly)
This is the hot loop in the -XX:+UseParallelGC case:
http://cr.openjdk.java.net/~shade/8130918/parallel.perfasm
And this is the hot loop in the -XX:+UseG1GC case (now default in JDK 9):
http://cr.openjdk.java.net/~shade/8130918/g1.perfasm
Please note that a significant part of the G1 barrier is cold, and we jump out on the second branch.
Suggestion: move the uncommon parts of the G1 barrier off the critical path. This probably requires branch/value profiling to figure out the "store shape" profile for each site where a store barrier is emitted. In the pathological example in this issue, it would be nice to figure out that we always store null, and not emit the rest of the barrier.
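
To illustrate the suggested shape, here is a rough conceptual model in plain Java. It is not HotSpot code and does not reproduce the generated assembly: addresses are modeled as longs, 0 stands for null, and the class name, constants, and card states (BarrierSketch, REGION_SHIFT, CARD_SHIFT, CLEAN/DIRTY/YOUNG) are all assumptions made for the sketch. The real barrier also contains a memory fence, which is omitted here. The only point is the split: the cheap filters stay inline on the hot path, and the rarely taken card-marking part lives in a separate method.

```java
import java.util.ArrayDeque;

/** Conceptual model only: addresses are plain longs, not real heap pointers. */
final class BarrierSketch {
    static final int REGION_SHIFT = 20; // assumed region size: 1 MB
    static final int CARD_SHIFT = 9;    // assumed card size: 512 bytes
    static final byte CLEAN = 0, DIRTY = 1, YOUNG = 2;

    final byte[] cardTable;
    final ArrayDeque<Integer> dirtyCardQueue = new ArrayDeque<>();
    int slowPathCalls = 0; // counts how often the cold part is entered

    BarrierSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> CARD_SHIFT)];
    }

    /** Hot path: only the cheap filters. newValAddr == 0 models a null store. */
    void postWriteBarrier(long fieldAddr, long newValAddr) {
        // First branch: field and new value are in the same region.
        if (((fieldAddr ^ newValAddr) >>> REGION_SHIFT) == 0) return;
        // Second branch: storing null never needs card marking.
        if (newValAddr == 0) return;
        // Everything else is the uncommon part, kept out of line.
        slowPath(fieldAddr);
    }

    /** Cold path: card check, card marking, and enqueueing. */
    private void slowPath(long fieldAddr) {
        slowPathCalls++;
        int card = (int) (fieldAddr >>> CARD_SHIFT);
        byte state = cardTable[card];
        if (state == YOUNG || state == DIRTY) return; // nothing to do
        cardTable[card] = DIRTY;
        dirtyCardQueue.add(card);
    }
}
```

In this model, a site that only ever stores null never reaches slowPath, which is the profile the benchmark in this issue would produce. A profile-driven compiler could go further than the sketch and omit the cold part entirely for such sites, falling back to deoptimization if a non-null store shows up.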