JDK-8256265 : G1: Improve parallelism in regions that failed evacuation
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 16
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2020-11-12
  • Updated: 2022-09-22
  • Resolved: 2022-09-15
JDK 20: 20 b16 (Fixed)
Description
Currently G1 assigns one thread per region that failed evacuation. This can effectively serialize the whole process, because often (particularly with region pinning) there is only one region to fix up.

Try to improve parallelism when walking over the regions, e.g. by recording, during evacuation failure handling, potential entry points (live object starts) for sub-areas of the region.

Note that the BOT (block offset table) should NOT be used, as it is not created for young gen regions; using it would require recreating it at that point, which is basically a full region walk.


The latest implementation scans regions in chunks to introduce parallelism. It is based on JDK-8278917, which switches to using the prev bitmap to mark objects that failed evacuation.
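
The chunked scan described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the HotSpot code (which is C++ and differs in many details): the region is modeled as a range of word offsets, the mark bitmap as a set of offsets where failed-evacuation objects start, and `REGION_WORDS`, `CHUNK_WORDS`, `ChunkClaimer` and `process_region` are hypothetical names chosen for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

REGION_WORDS = 1024  # hypothetical region size, in words
CHUNK_WORDS = 128    # hypothetical chunk size, in words


class ChunkClaimer:
    """Hands out chunk indices to workers, each index exactly once."""

    def __init__(self, num_chunks):
        self._next = 0
        self._num_chunks = num_chunks
        self._lock = threading.Lock()

    def claim(self):
        with self._lock:
            if self._next >= self._num_chunks:
                return None
            idx = self._next
            self._next += 1
            return idx


def process_region(marked_starts, num_workers):
    """Visit every marked object start in one region using several workers.

    marked_starts stands in for the mark bitmap: a set of word offsets at
    which an object that failed evacuation begins.
    """
    num_chunks = (REGION_WORDS + CHUNK_WORDS - 1) // CHUNK_WORDS
    claimer = ChunkClaimer(num_chunks)
    visited = []
    visited_lock = threading.Lock()

    def worker():
        local = []
        while (chunk := claimer.claim()) is not None:
            start = chunk * CHUNK_WORDS
            end = min(start + CHUNK_WORDS, REGION_WORDS)
            # An object is handled by the chunk in which it starts.
            local.extend(off for off in range(start, end) if off in marked_starts)
        with visited_lock:
            visited.extend(local)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker) for _ in range(num_workers)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return sorted(visited)
```

Because workers claim chunks rather than whole regions, several threads can make progress even when only a single region failed evacuation, which is the case the description calls out.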



(I cannot upload the attachments.)
Here is a summary of performance data for the latest implementation. Overall, it delivers better and more stable performance than baseline in the "Post Evacuate Cleanup 1/remove self forwardee" phase. (Some regression shows up when the results are calculated as a geomean, because one baseline pause time is far smaller than the others.)
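
To illustrate why a single very small baseline sample can turn an improvement by average into a regression by geomean, here is a minimal sketch. The sample values are made up for illustration and are not the measured pauses from this issue.

```python
from statistics import geometric_mean, mean

# Hypothetical pause times in ms (NOT measured data): the baseline has one
# outlier that is far smaller than the rest of its samples.
baseline = [0.5, 6, 7, 6, 7, 6, 7]
patched = [4, 5, 5, 5, 5, 5, 5]

# By arithmetic mean the patched run is faster.
mean_improves = mean(patched) < mean(baseline)

# The geometric mean is pulled down strongly by the tiny baseline sample,
# so by geomean the patched run looks slower.
geomean_regresses = geometric_mean(patched) > geometric_mean(baseline)

print(mean_improves, geomean_regresses)
```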

The performance benefit trend is:
 - with -XX:ParallelGCThreads=8, the reduction in pause time (Post Evacuate Cleanup 1) shrinks from 76.79% to 2.28% by average, and from 71.61% to 3.04% by geomean, as G1EvacuationFailureALotCSetPercent increases from 2 to 90
 - with -XX:ParallelGCThreads=<default=123>, the reduction in pause time (Post Evacuate Cleanup 1) shrinks from 63.84% to 15.16% by average, and from 55.41% to 12.45% by geomean, as G1EvacuationFailureALotCSetPercent increases from 2 to 90
(Other common evacuation failure settings are:
-XX:+G1EvacuationFailureALot -XX:G1EvacuationFailureALotInterval=0 -XX:G1EvacuationFailureALotCount=0 )
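
The percentages in the data below appear to be the relative reduction of the second value against the first, i.e. (baseline - patched) / baseline. A minimal sketch of that arithmetic, assuming the first column is the baseline and the second is the run with this change (`reduction_percent` is a hypothetical helper name):

```python
def reduction_percent(baseline_ms, patched_ms):
    """Relative pause-time reduction in percent; negative means a regression."""
    return (baseline_ms - patched_ms) / baseline_ms * 100.0


# ParallelGCThreads=8, G1EvacuationFailureALotCSetPercent=2, AVG row
avg_2 = round(reduction_percent(5.785714286, 1.342857143), 2)

# ParallelGCThreads=8, G1EvacuationFailureALotCSetPercent=5, GEOMEAN row
geomean_5 = round(reduction_percent(3.751329198, 4.842453032), 2)

print(avg_2, geomean_5)
```

Under that assumption the two sample rows reproduce the reported 76.79% and -29.09%.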


(Unit is ms. Columns: baseline, with this change, reduction. Please ignore the excess decimal places; they are probably an artifact of copying from Excel.)
============= -XX:ParallelGCThreads=8 ==============
Rows are keyed by -XX:G1EvacuationFailureALotCSetPercent.

CSetPercent  Stat     Baseline (ms)  Patched (ms)  Reduction
2            AVG      5.785714286    1.342857143   76.79%
2            GEOMEAN  3.491874057    0.991334526   71.61%
5            AVG      6.457142857    5.257142857   18.58%
5            GEOMEAN  3.751329198    4.842453032   -29.09%
10           AVG      7.7            5.857142857   23.93%
10           GEOMEAN  6.438810941    4.01726936    37.61%
20           AVG      24.58571429    19.61428571   20.22%
20           GEOMEAN  23.91174748    19.30619931   19.26%
30           AVG      38.94285714    36.62857143   5.94%
30           GEOMEAN  37.68215788    34.95941805   7.23%
50           AVG      61.64285714    58.01428571   5.89%
50           GEOMEAN  58.48164395    54.11292376   7.47%
70           AVG      82.91428571    79.92857143   3.60%
70           GEOMEAN  77.74442286    74.19262861   4.57%
90           AVG      96.02857143    93.84285714   2.28%
90           GEOMEAN  89.42483883    86.70240129   3.04%



============= -XX:ParallelGCThreads=<default=123> ==============
Rows are keyed by -XX:G1EvacuationFailureALotCSetPercent.

CSetPercent  Stat     Baseline (ms)  Patched (ms)  Reduction
2            AVG      5.728571429    2.071428571   63.84%
2            GEOMEAN  4.55615823     2.031440622   55.41%
5            AVG      9.371428571    3.4           63.72%
5            GEOMEAN  8.921695255    3.189298958   64.25%
10           AVG      10.37142857    4.842857143   53.31%
10           GEOMEAN  9.36881833     4.583811393   51.07%
20           AVG      15.18571429    10.72857143   29.35%
20           GEOMEAN  14.81067789    10.2772486    30.61%
30           AVG      21.01428571    18.74285714   10.81%
30           GEOMEAN  20.70456459    18.26632857   11.78%
50           AVG      34.58571429    30.07142857   13.05%
50           GEOMEAN  33.57743078    28.85928625   14.05%
70           AVG      47.17142857    40.34285714   14.48%
70           GEOMEAN  45.28903319    39.17734325   13.49%
90           AVG      58.71428571    49.81428571   15.16%
90           GEOMEAN  54.65737668    47.85414364   12.45%

Comments
Changeset: 15cb1fb7 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2022-09-15 09:57:16 +0000 URL: https://git.openjdk.org/jdk/commit/15cb1fb7885a2fb5d7e51796552bae5ce0708cf5
15-09-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9980 Date: 2022-08-23 13:29:31 +0000
25-08-2022

I took over this change given it seems abandoned. There is some fairly significant rework to be done due to the single bitmap change :) I'll file a new PR for this, obviously crediting you for the very good work done so far.
10-08-2022

Hi Thomas, Thanks a lot for gathering the perf data. Sure, I will look into ergonomics part.
26-02-2022

Fwiw, I attached a graph showing the impressive progress from JDK 17 to 18 for all these evacuation failure handling changes (20220217-evac-failure-improvements.png): this is specjbb2015 fixed IR with induced evacuation failures (defaults, only EvacuationFailureALot enabled). The blue line shows expected flat pause times if there are no evacuation failures; the purple one is JDK 17, the yellow one JDK 18. As you can see, JDK 17 is pretty bad, taking ~160% more gc time than without failures. JDK 18 decreases that to ~25% more. The reason is that there are more garbage collections due to continuously dumping regions into old gen - so there are more (short) garbage collections. After this change we should probably look more into the ergonomics part of this patch series, maybe even start experimenting with enabling region pinning on a fork/branch to get realistic pinning usage data.
23-02-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7047 Date: 2022-01-12 09:03:45 +0000
22-01-2022

Attachment <parallel.evac-failure(v2)-parallel-gc-threads(default=123).simple-one.png> anti-virus scan found a possible virus. The system has removed attachment. Please check the file before attempting to upload it again
22-01-2022

Attachment <parallel.evac-failure(v2).png> anti-virus scan found a possible virus. The system has removed attachment. Please check the file before attempting to upload it again
21-01-2022

The test based on the latest implementation + JDK-8277736 shows the following (for details, please check the attachments):
 - with ParallelGCThreads=32, parallelism brings more benefit than regression when G1EvacuationFailureALotCSetPercent <= 50;
 - with ParallelGCThreads=128, parallelism brings more benefit than regression regardless of G1EvacuationFailureALotCSetPercent.
Other related evacuation failure VM options:
 - G1EvacuationFailureALotInterval=1
 - G1EvacuationFailureALotCount=1
For cases such as G1EvacuationFailureALotCSetPercent > 50 with ParallelGCThreads=32, we could fall back to the current implementation, or further optimize the thread sizing in this phase if necessary.
02-12-2021