JDK-8162928 : Micro-optimizations in scanning the remembered sets
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2016-08-02
  • Updated: 2018-06-21
  • Resolved: 2017-06-02
JDK 10: fixed in 10 b21
Description
During recent work the following worthwhile micro-optimizations for scanning remembered sets (or in general, cards) have been found:

- HeapRegion::oops_on_card_seq_iterate_careful is faster than using HeapRegionDCTOC during scan rs.

- HeapRegion::oops_on_card_seq_iterate_careful can be sped up by specializing it for use during gc vs. use during mutator time.

E.g. a lot of extra checks can go away for such a specialization, like the filter_young one, the g1h->is_gc_active(), the card_ptr != NULL, the various checks whether we are scanning into an unparseable point etc.

- HeapRegion::oops_on_card_seq_iterate_careful() always does at least one unnecessary call to HeapRegion::block_size().
I.e. the result of the call made while positioning the cursor at the object starting at or spanning into the card in question is not reused on entry to the iteration loop.

HeapRegion::block_size() is very expensive in G1.

  - one can aggressively specialize HeapRegion::block_size() for the use case during gc: 
    - addr can not be >= top(), dropping the check
    - the repeated calculation of g1h->concurrent_mark()->prevMarkBitMap() is very expensive. Its load should be hoisted out of the oops_on_card_seq_iterate_careful() main loop and passed in from a local variable.
    - further, the information that the object is dead should be returned from block_size() (or a specialized variant). After determining block_size(), oops_on_card_iterate() does another expensive lookup of the prev mark bitmap just to check whether the object is dead.

- need to check whether the called methods can be made more amenable to inlining (some short ones are defined in cpp files)

- HeapRegion::block_is_obj() could be aggressively specialized for RS scan too: the first check, for whether the given address is in a continues humongous region, can be hoisted out of the entire oop iteration loop into oops_on_card_seq_iterate_careful().

- HeapRegion::is_obj_dead() could be specialized too: e.g. the is_archive check can be hoisted out to top-level (and actually, since archive regions do not contain any references to non-archive regions, it is superfluous).
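
The ideas above can be sketched on a toy model. This is not HotSpot code: ToyRegion, ToyBitmap, BlockInfo, block_info_generic, block_info_gc, scan_card and make_toy_region are hypothetical names invented for the illustration, and addresses are plain word indices. It only shows the shape of the changes: a template parameter selects the gc variant, the bitmap is loaded once outside the loop, the size lookup also reports liveness, and the lookup done while positioning the cursor is reused on loop entry.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a mark bitmap: one bit per word address.
struct ToyBitmap {
  std::vector<bool> marked;
  bool is_marked(size_t addr) const { return marked[addr]; }
};

// Hypothetical stand-in for a heap region; sizes[a] is the size (in words)
// of the object starting at word a.
struct ToyRegion {
  std::vector<size_t> sizes;
  size_t top = 0;               // first unallocated word
  size_t end = 0;               // one past the last word of the region
  ToyBitmap bitmap;
  mutable int bitmap_loads = 0; // counts the "expensive" bitmap lookups
  const ToyBitmap* load_bitmap() const { ++bitmap_loads; return &bitmap; }
};

// Size and liveness are returned together, so the caller needs no second lookup.
struct BlockInfo { size_t size; bool live; };

// Generic variant: checks top and reloads the bitmap on every call.
inline BlockInfo block_info_generic(const ToyRegion& r, size_t addr) {
  if (addr >= r.top) return BlockInfo{ r.end - addr, false };
  const ToyBitmap* bm = r.load_bitmap();
  return BlockInfo{ r.sizes[addr], bm->is_marked(addr) };
}

// gc variant: assumes addr < top and takes the bitmap from the caller.
inline BlockInfo block_info_gc(const ToyRegion& r, const ToyBitmap* bm, size_t addr) {
  return BlockInfo{ r.sizes[addr], bm->is_marked(addr) };
}

// Counts live objects that start in or span into [card_start, card_end).
// The gc variant assumes card_end <= top.
template <bool in_gc>
size_t scan_card(const ToyRegion& r, size_t card_start, size_t card_end) {
  const ToyBitmap* bm = in_gc ? r.load_bitmap() : nullptr; // hoisted load
  auto info_at = [&](size_t addr) {
    return in_gc ? block_info_gc(r, bm, addr) : block_info_generic(r, addr);
  };
  // Position the cursor at the object starting at or spanning into the card.
  size_t cur = 0;
  BlockInfo info = info_at(cur);
  while (cur + info.size <= card_start) {
    cur += info.size;
    info = info_at(cur);
  }
  // Iterate; the info computed during positioning is reused for the first object.
  size_t live = 0;
  while (true) {
    if (info.live) ++live;
    cur += info.size;
    if (cur >= card_end) break;
    info = info_at(cur);
  }
  return live;
}

// Builds a 16-word region with objects at 0 (4 words), 4 (2), 6 (6), 12 (4);
// the object at word 4 is dead.
inline ToyRegion make_toy_region() {
  ToyRegion r;
  r.sizes.assign(16, 0);
  r.sizes[0] = 4; r.sizes[4] = 2; r.sizes[6] = 6; r.sizes[12] = 4;
  r.top = 16; r.end = 16;
  r.bitmap.marked.assign(16, false);
  r.bitmap.marked[0] = true;
  r.bitmap.marked[6] = true;
  r.bitmap.marked[12] = true;
  return r;
}
```

In this toy, scanning the card [8, 16) with scan_card<true> performs a single bitmap load, while scan_card<false> performs one per visited object. Both return the same count.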

Comments
The fix for JDK-8166607 changes G1 BOT uses of klass_or_null to instead use klass_or_null_acquire. This isn't necessary for the call in the revised version of oops_on_card_seq_iterate_careful from that change, but there can be (and are) other callers. The performance impact of unnecessary barriers in oops_on_card_xxx will hopefully not be too bad on non-TSO systems; the number reduces to one as the BOT gets filled in. But splitting or otherwise conditionalizing would likely provide improvement.
30-09-2016

The fix for JDK-8166607 addresses some of these issues.
30-09-2016

I think the storeload is also not required during gc. We are okay with scanning a few cards multiple times, but not with doing the potentially more problematic machine-wide memory synchronization for every card that is scanned.
23-09-2016

Another problem is HeapRegion::block_start[_const](), which also use the slow and overly generic block_size().
02-08-2016