JDK-8226738 : Fold out-of-region and NULL checks in G1 post barrier
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 14
  • Priority: P4
  • Status: Closed
  • Resolution: Withdrawn
  • Submitted: 2019-06-25
  • Updated: 2019-06-27
  • Resolved: 2019-06-27
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdResolved
Related Reports
Relates :  
Description
The NULL check and region check in the post barrier can be combined to a single check using ANDN.

So provided a new value q stored to field q, instead of: 

if (q != NULL) { 
  if (((q ^ p) >> RegionLog) != 0) { 
    ... 
  } 
} 

the following is equivalent:

if (((q &~ p) >> RegionLog) != 0) { 
  ...
} 

Where the &~ may use a single ANDN instruction. On x86_64, with use of BMI instructions which introduces a three operand ANDN, the two checks become 3 instructions including the branch:

andn tmp, q, p;
sar tmp, RegionLog;
jne slower_path

(BMI Instructions are available since Haswell/Bulldozer, i.e. 2013+, so there should be a fair amount of processors supporting it already)

Otherwise something like

mov tmp, p
not tmp
and tmp, q
sar tmp, RegionLog
jne slower_path

(or whatever the compiler would generate) would do the trick.

This not only reduces code, but replaces two typically taken branches to one even more often taken branch.

(Observation from [~eosterlund])
Comments
The suggested formula is not correct given some counter-examples, so withdrawing at this time.
27-06-2019

Some other notes: 1. According to branch profiling in JDK-8225776, the two branches are actually frequently taken. 2. I attached some results I did before for removing the null check in C2. The overhead for the null check seems negligible (FullBarriers vs PrePostBarrierNoNull). So if it is difficult to get it correct, maybe the performance gain is not worth the effort?
27-06-2019

I doubt if the two barriers are exactly equivalent. There seems to be a corner case of all bits being 1 for p. Suppose 8-bit address and RegionLog = 4, and p = (1111 0000) and q = (1101 0000), then: (q ^ p) >> RegionLog = 0000 0010 (non-zero) but (q &~ p) >> RegionLog = ((1101 0000) & (0000 1111)) >> 4= 0 (zero). The following cases are not equivalent either: p = (0001 0000), q = (0000 0100) p = (1101 0000), q = (0001 0000)
27-06-2019