JDK-8288966 : Better handle very spiky promotion in G1
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17,19
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-06-22
  • Updated: 2024-05-15
  • Resolved: 2022-08-22
JDK 20: 20 b12 (Fixed)
Description
There are sometimes huge spikes in promotion, up to a few GBs; the current PLAB sizing algorithm is not designed to handle such spiky behavior well.

This leads to lots of direct allocations (some tests show ~10% of the promoted bytes), which in absolute terms is a lot: 10% of a few GBs of promotion is still a large amount of direct allocation, whereas the same fraction would hardly matter for, say, 200 MB.

Decreasing the amount of direct allocation may improve object copy performance in such cases; options include resizing the PLABs when such a spike occurs, or adjusting the allowed amount of direct allocation/waste.
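A toy model of the effect (illustration only; the decay weight, bounds, and sizing formula below are made up and are not the actual HotSpot PLAB sizing code): a long run of zero-promotion collections trains the computed PLAB size down, so the next spike is served with tiny buffers and a correspondingly large number of refills and direct allocations.

// Toy model: the PLAB size follows a decaying average of promoted bytes per GC.
public class PlabSizingModel {
    static final long MIN_PLAB = 4 * 1024;        // hypothetical lower bound (bytes)
    static final long MAX_PLAB = 4 * 1024 * 1024; // hypothetical upper bound (bytes)
    static final double ALPHA = 0.5;              // hypothetical decay weight

    static long plabSizeFor(double avgPromoted, int workers) {
        // Spread the expected promotion over the workers; clamp to sane bounds.
        long size = (long) (avgPromoted / workers);
        return Math.max(MIN_PLAB, Math.min(MAX_PLAB, size));
    }

    public static void main(String[] args) {
        int workers = 16;
        double avgPromoted = 64L * 1024 * 1024;   // start "trained" at 64 MB per GC
        // Eight young GCs with no promotion at all, then a 2 GB spike.
        long[] promotedPerGc = {0, 0, 0, 0, 0, 0, 0, 0, 2L * 1024 * 1024 * 1024};
        for (long promoted : promotedPerGc) {
            long plab = plabSizeFor(avgPromoted, workers);
            long perWorker = promoted / workers;
            long refills = (perWorker + plab - 1) / plab;  // PLAB refills per worker
            System.out.printf("promoted=%,d B  plab=%,d B  refills/worker=%,d%n",
                              promoted, plab, refills);
            avgPromoted = ALPHA * promoted + (1 - ALPHA) * avgPromoted;
        }
    }
}

With these (made-up) numbers the trained PLAB collapses to 16 KB after the quiet phase, and the 2 GB spike then needs thousands of refills per worker.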
Comments
We saw this issue in one of our production applications: a series of young collections with 0 promotion to old, each one reducing the old PLAB size, followed by a mixed collection with 2 GB of promotion to old that then has to make do with a tiny old PLAB size. It showed huge contention on the freelist lock, I assume because a tiny PLAB size means all workers are thundering to allocate their next PLAB (or direct allocation), so they all contend when the region fills up. I have tested a simple reproducer which shows the same issue against tip, and it is mitigated there, so I believe your change fixes it, but I haven't yet tested it on the real application, which is running 17. If you're interested in the logs from the production issue, or the reproducer, let me know. Thanks!
15-05-2024
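For reference, a workload with this shape can be sketched as follows (an illustrative sketch, not the commenter's actual reproducer; the class name, allocation sizes, timings, and suggested flags are assumptions):

import java.util.ArrayList;
import java.util.List;

// Illustrative workload: long stretches of purely short-lived allocation (young
// GCs with ~0 promotion, training the old-gen PLAB size down), interrupted by
// occasional bursts of ~2 GB of objects that stay live across several young GCs
// and therefore get promoted in one spike.
// Run with e.g.: java -Xmx8g -XX:+UseG1GC -Xlog:gc*,gc+plab=debug SpikyPromotion
public class SpikyPromotion {
    static volatile Object sink;

    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        for (int cycle = 0; ; cycle++) {
            // Phase 1: several seconds of short-lived garbage only.
            allocateGarbageFor(5_000_000_000L);
            // Phase 2: a burst of ~2 GB that is kept alive...
            for (int i = 0; i < 64 * 1024; i++) {
                retained.add(new byte[32 * 1024]);
            }
            // ...through some more young GCs so it is promoted in bulk...
            allocateGarbageFor(2_000_000_000L);
            // ...and then dropped, to be reclaimed by later mixed GCs.
            retained.clear();
            System.out.println("cycle " + cycle + " done");
        }
    }

    static void allocateGarbageFor(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) {
            sink = new byte[32 * 1024];
        }
    }
}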

Changeset: 7b5f9edb Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2022-08-22 09:07:34 +0000 URL: https://git.openjdk.org/jdk/commit/7b5f9edb59ef763acca80724ca37f3624d720d06
22-08-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9726 Date: 2022-08-03 12:46:45 +0000
03-08-2022

The attached graph shows pause time spikes in SPECjbb2015 during hbIR finding (at the beginning) and report generation (at the end), observed only on (our) aarch64 machines, before/after the suggested changes. The reason for this problem seems to be very high contention on the atomic that is used to do direct/PLAB allocation into regions when the old gen PLAB prediction is untrained or mistrained (SPECjbb2015 seems to promote a few GB now and then, followed by promoting nothing for a few GCs, resulting in something like 500k+ PLAB and direct allocations at that point).

Adding some "PLAB boosting" fixes the issue, although care must be taken not to be too aggressive (like the current Shenandoah code, which seems to just try to double the PLAB on every refill), as this can cause a lot of wasted space and hence increased GC activity.
03-08-2022
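As a rough sketch of what bounded "PLAB boosting" could look like (illustration only; the constants, names, and policy below are assumptions, not the code of the actual change that went into JDK 20):

// Sketch of a bounded boosting policy: when the previous GC promoted far more
// than the trained PLAB size predicted, temporarily scale the PLAB up for the
// next GC, but only within a hard cap, so a misprediction cannot waste
// arbitrary amounts of space (unlike doubling the PLAB on every refill).
public class PlabBoost {
    static final long MIN_PLAB = 4 * 1024;         // hypothetical bounds (bytes)
    static final long MAX_PLAB = 4 * 1024 * 1024;
    static final int  MAX_BOOST_FACTOR = 8;        // hypothetical cap on the boost

    // trainedPlab:    PLAB size suggested by the normal (averaged) sizing
    // promotedLastGc: bytes actually promoted in the previous GC
    // workers:        number of parallel GC workers
    static long boostedPlabSize(long trainedPlab, long promotedLastGc, int workers) {
        long neededPerWorker = promotedLastGc / workers;
        if (neededPerWorker <= trainedPlab) {
            return clamp(trainedPlab);              // prediction was fine; no boost
        }
        // Boost toward what was actually needed, capped at MAX_BOOST_FACTOR
        // times the trained size.
        long boosted = Math.min(neededPerWorker, trainedPlab * MAX_BOOST_FACTOR);
        return clamp(boosted);
    }

    static long clamp(long size) {
        return Math.max(MIN_PLAB, Math.min(MAX_PLAB, size));
    }

    public static void main(String[] args) {
        // After a quiet period the trained PLAB is small (16 KB here); a 2 GB
        // spike across 16 workers then gets a bounded boost (to 128 KB) instead
        // of being served entirely with 16 KB buffers.
        System.out.println(boostedPlabSize(16 * 1024, 2L << 30, 16));
    }
}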