JDK-8273309 : G1: Handling objects that fail evacuation very slow
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 18
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2021-09-03
  • Updated: 2021-11-09
Other: tbd (Unresolved)
Description
Measurements with SPECjbb2015 at a fixed injection rate (ir=6000) and a 16 GB heap, with evacuation failures induced using G1EvacuationFailureALot (everything else related left at normal settings), show that evacuation failures lengthen the "Object Copy" phase a lot.

See the attached graph, which shows "Object Copy" time with and without evacuation failures within that run.

With object pinning, handling objects that fail evacuation needs to be almost as fast as regular copying, so investigate what could be done here.

Some ideas:
  * move absolutely unnecessary code out of object copy, i.e. statistics gathering for JFR; this could be done when removing the self-forwards, or in an extra stage later (i.e. record all object sizes and compute the statistics separately in parallel)
  * potentially always record preserved marks for failed objects; checking whether this is actually required may be more costly than just doing it all the time (would somewhat work together with JDK-8254739)
  * maybe other suggested changes (splitting the processing of "large" objects, JDK-8271870) help, because this is a work distribution problem in the end
  * avoid the constant lookup of various data via the HeapRegion* pointer; we have the G1HeapRegionAttr available
  * the usual micro-optimizations, such as looking for code duplication along the call chain
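
As a rough illustration of the first idea above (keeping statistics work out of the copy path), a minimal sketch could look like the following. All names here (FailedObjectLog, FailureStats, summarize) are hypothetical and are not HotSpot types; the sketch only assumes that each GC worker appends object sizes to a private log during copying and that the totals are computed in a later stage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-worker log; not a HotSpot type.
struct FailedObjectLog {
  std::vector<std::size_t> sizes_in_words;

  // Hot path: only remember the size, no statistics work here.
  void record(std::size_t size_in_words) {
    assert(size_in_words > 0);  // every object occupies at least one word
    sizes_in_words.push_back(size_in_words);
  }
};

// Hypothetical summary that a statistics consumer might want.
struct FailureStats {
  std::size_t object_count = 0;
  std::size_t total_words = 0;
};

// Later stage: fold all per-worker logs into one summary.
// Each log could also be summarized by its own worker and merged afterwards.
inline FailureStats summarize(const std::vector<FailedObjectLog>& logs) {
  FailureStats stats;
  for (const FailedObjectLog& log : logs) {
    for (std::size_t words : log.sizes_in_words) {
      stats.object_count += 1;
      stats.total_words += words;
    }
  }
  return stats;
}
```

The trade-off in this sketch is extra memory for the logs in exchange for a shorter hot path; whether that pays off would have to be measured.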

Of course, first analyze what is responsible for the slowdown.
Comments
The reported problem is mostly caused by an artifact of the G1EvacuationFailureALot code: calling G1YoungGCEvacFailureInjector::evacuation_should_fail() while evacuation failure handling is active is very slow (on the machine under test), potentially because of the many non-atomic accesses to the global variable that tracks whether an object should fail. Enabling this code in product builds would need significant changes, so this is not a product issue. Making this check more local mostly fixes the problem. However, measurements indicate that handling failed objects is currently still slower than handling regular objects (something like 0.14-0.18% of the wall time of do_copy_to_survivor_space() for the 0.10% of objects that fail).
23-09-2021
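
The comment above does not say how the check was made more local. One possible reading, sketched here purely as an assumption, is that each GC worker keeps its own private countdown instead of consulting a shared global on every object. LocalFailureInjector and its members are hypothetical names and do not mirror G1YoungGCEvacFailureInjector.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-worker injector; illustrative only.
class LocalFailureInjector {
  std::size_t _interval;   // fail every _interval-th object; 0 disables injection
  std::size_t _countdown;  // worker-private, so no shared state is touched per object

public:
  explicit LocalFailureInjector(std::size_t interval)
    : _interval(interval), _countdown(interval) {}

  // Returns true when the current object should be treated as failing evacuation.
  bool should_fail() {
    if (_interval == 0) {
      return false;
    }
    assert(_countdown > 0);
    if (--_countdown == 0) {
      _countdown = _interval;  // restart the countdown for the next injected failure
      return true;
    }
    return false;
  }
};
```

With an interval of 3, a worker using this sketch would report a failure on every third call to should_fail(), without reading any state shared with other workers.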

Note that the bad times may be an effect of testing via G1EvacuationFailureALot: it typically causes lots of regions to fail, which can create huge heap pressure and in turn much smaller young gens, with the associated effects (i.e. fewer objects die, and then we just get another evacuation failure :) ). Maybe that is the case here.
10-09-2021