JDK-8278824 : Uneven work distribution when scanning heap roots in G1
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 15,16,17,18,19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-12-14
  • Updated: 2021-12-30
  • Resolved: 2021-12-28
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 18
18 masterFixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
During an upgrade from JDK11 to JDK15 (and JDK17) we experience an increase in garbage collection latency from approximately 10-15ms at p99.9 to approximately 45-50ms. Logs showed the object copy phase consuming most of that time.

Enabling task queue stats on the build and increasing the log level showed that one evacuation thread is doing far more work than the other threads.

For example, in this cycle we see one thread spend 67.2ms on Object Copy, while the others spend far less time:

{code}
148905 [2021-12-06T22:12:38.338+0000][debug][gc,phases        ] GC(581)     Object Copy (ms):              Min: 13.3, Avg: 16.6, Max: 67.2, Diff: 53.8, Sum: 546.7, Workers: 33
148906 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task   ] GC(581)                                    14.8 15.4 15.3 16.0 14.7 14.6 16.6 15.3 16.3 15.2 67.2 13.8 14.1 16.5 14.5 14.3 14.1 14.9 15.3 15.8 13.3 15.6 15.0 15.6 14.7 15.6 14.6 14.3 15.1 15.2 14.4 13.5 14.9
{/code}

Looking back, we see a pattern in these cycles where the thread doing most of the work scans only a 2 or 3 "blocks" (here, just 2):
{code}
148899 [2021-12-06T22:12:38.338+0000][debug][gc,phases        ] GC(581)       Scanned Blocks:                Min: 2, Avg: 1203.9, Max: 1875, Diff: 1873, Sum: 39730, Workers: 33
148900 [2021-12-06T22:12:38.338+0000][trace][gc,phases,task   ] GC(581)                                       949  1838  1397  1548  1450  1875  821  1067  1312  1463  2  1282  1319  38  1177  1199  819  1170  897  1343  1860  1070  1059  1552  1217  1296  1068  1092  1645  1166  1002  1140  1597
{/code}

And in the task queue stats this thread performs an order of magnitude more operations on the task queue:
{code}
149005 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats    ] GC(581)   8      51757      51143       9173     121241       9379          0          0
149006 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats    ] GC(581)   9      37328      36045       6291     113112       8332          0          0
149007 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats    ] GC(581)  10     350079      77644          0        132          0     267666     226166
149008 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats    ] GC(581)  11      38889      37932       7100     110650       7940          0          0
149009 [2021-12-06T22:12:38.339+0000][trace][gc,task,stats    ] GC(581)  12      53762      53174       9375     110255       8637          0          0
{/code}

We traced the origin of this behavior to: https://bugs.openjdk.java.net/browse/JDK-8213108

We tried changing `ParGCArrayScanChunk`, but to no avail. Introducing a flag to override the ergonomics that select chunks per region was effective.
Comments
Fix request for JDK17u. The backport had a conflict in a comment line, but the actual code change was a clean and simple merge. This change addresses a performance regression introduced in JDK14 caused by an uneven distribution of root scanning work. This change increases the number of chunks per region, allowing a more fine grained distribution of work across the scanning threads.
30-12-2021

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/39 Date: 2021-12-28 19:30:51 +0000
28-12-2021

Changeset: b4b0328d Author: William Kemper <wkemper@openjdk.org> Committer: Paul Hohensee <phh@openjdk.org> Date: 2021-12-28 17:32:00 +0000 URL: https://git.openjdk.java.net/jdk18/commit/b4b0328d62d9a9646f2822c361e41001bf0d4aa0
28-12-2021

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk18/pull/66 Date: 2021-12-22 19:48:10 +0000
22-12-2021

Thanks for reporting this issue - nice find! This is as you correctly noted an issue with work distribution during the Object Copy phase. There are known issues with work stealing that we've been working on specifically in the last few weeks; the graph mentioned below shows the current results (fwiw, it also gives good results without this change). So this change improves upon by making the initial distribution of work better, which so far seems a good solution for this particular case. After reproducing this particular issue and internal discussion we think that adding a new flag is something to avoid at all here. The main reason is that improving the defaults for the number of scan chunks seems side-effect free so far - tests so far do not show a regression either, and the additional memory usage (and the effort to manage this scan chunk table memory) seems negligible. The graph attached to the CR shows pause times for your reproducer on jdk11, jdk17, and jdk17 with values of G1RemSetScanChunksPerRegion from 64 to 1024; the selected 256 seems to work very well for this case :) and that improved-steal-multi that shows a recent prototype for the mentioned task queue changes (which covers more cases than this one). So our suggestion is to, for the 16m regions, set the default number of chunks per region to 128 or 256, depending on further testing results, for JDK 18 (and then use this to backport to 17). When we are ready to post the task queue changes (probably JDK19), we might want to reconsider these defaults again.
15-12-2021

If you let the repro run for 5 minutes or so, you will begin to see the difference in pause times between JDK11 and JDK15+: JDK17 Max Object Copy Times (ms, sorted): 35.6 35.2 34.6 34.2 33.7 33.3 31.2 26.2 26.1 24.1 22.4 22.0 21.6 21.2 20.6 19.6 19.5 18.7 18.0 17.7 17.6 17.1 16.7 15.9 15.7 15.7 15.6 15.3 14.8 14.5 14.2 14.2 14.1 12.0 10.6 JDK11 32.4 32.3 29.5 28.7 28.4 27.7 27.7 27.5 26.1 22.3 21.9 21.6 20.5 20.5 20.1 19.8 19.5 17.3 17.0 16.2 15.7 15.2 15.0 14.5 14.4 14.3 14.3 13.5 13.3 13.1 13.1 12.9 11.9 11.8 11.7 11.2 11.1 JDK17 w/ -XX:G1RemSetScanChunksPerRegion=256 29.2 21.6 20.7 20.2 19.4 19.4 19.3 19.1 18.8 18.7 17.5 17.3 17.0 16.1 15.6 15.5 14.6 14.5 14.3 14.1 13.9 13.6 13.5 12.3 12.1 11.1 10.9 10.6 10.1 10.1 10.1
14-12-2021