JDK-8310031 : Parallel: Implement better work distribution for large object arrays in old gen
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17, 21
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-06-14
  • Updated: 2024-03-11
  • Resolved: 2023-10-24
Fix Versions
  • JDK 17: 17.0.12 (Fixed)
  • JDK 21: 21.0.3 (Fixed)
  • JDK 22: 22 b21 (Fixed)
Sub Tasks
JDK-8322645
Description
Currently, Parallel GC's young collection distributes the work of finding dirty cards in the old generation on a per-stripe basis (stripes of 64 KiB of memory): if an object starts in a stripe assigned to a thread, that thread exclusively owns the job of finding dirty cards in that object.

This limits parallelism for large objArrays: a single worker thread owns the entire objArray, limiting throughput.

This should also fix the 4-5x+ difference in pause times between Parallel GC and G1 for DelayInducer (JDK-8062128) found in JDK-8309960.
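
As a rough illustration of the pre-fix ownership rule, here is a toy Java model (the names StripeModel, STRIPE_SIZE, and ownerOf are hypothetical, not HotSpot code) showing how every stripe spanned by a large array is still processed by the single worker that owns the stripe in which the array starts:

    // Toy model of stripe-based card-scanning ownership (hypothetical names,
    // not HotSpot code): old-gen memory is cut into fixed-size stripes that
    // are handed out to workers round-robin.
    class StripeModel {
        static final int STRIPE_SIZE = 64 * 1024; // 64 KiB work units

        // Pre-fix rule: an object is scanned entirely by the worker owning
        // the stripe that contains the object's start address.
        static int ownerOf(long objStart, int numWorkers) {
            return (int) ((objStart / STRIPE_SIZE) % numWorkers);
        }

        public static void main(String[] args) {
            long arrayStart = 10L * STRIPE_SIZE;  // array begins in stripe 10
            long arraySize  = 512L * 1024 * 1024; // a 512 MiB objArray
            int workers = 10;
            // The array spans 8192 stripes, yet one worker scans all of them:
            System.out.println("All " + (arraySize / STRIPE_SIZE)
                    + " stripes scanned by worker " + ownerOf(arrayStart, workers));
        }
    }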
Comments
[21u, 17u] [~rrich] Do we need release notes for this?
06-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/2230 Date: 2024-02-24 10:00:00 +0000
25-02-2024

Fix request (17u)

I would like to backport this as a performance bug fix. We received bug reports from users who see some young pauses of 30s, and even up to 50s (normally <1s), running large Gerrit instances (200GB heap, 100 gc threads). We have tried to tune ParallelGC: reducing the number of gc threads helps to make the pause time spikes smaller, but this makes average pause times longer.

Requires backports of:
https://bugs.openjdk.org/browse/JDK-8280030 (https://github.com/openjdk/jdk17u-dev/pull/2226)
https://bugs.openjdk.org/browse/JDK-8278893 (https://github.com/openjdk/jdk17u-dev/pull/2227)
https://bugs.openjdk.org/browse/JDK-8282094 (https://github.com/openjdk/jdk17u-dev/pull/2228)

All hunks except the following two applied after a trivial preparation change. The first hunk of psCardTable.hpp did not apply because of different context; resolved by inserting the new lines. The second hunk of psScavenge.cpp did not apply because of different context; resolved by inserting the new lines. Finally, a few trivial changes were required (renaming and the like).

Risk is medium. We did the downstream backport many weeks ago already.

I've tested on x86_64:
jdk:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
langtools:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"

Local CI testing: the fix passed our CI testing (e.g. 2024-02-25): JTReg tests: tier1-4 of hotspot and jdk; all of langtools and jaxp; JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP-specific tests (also with ParallelGC). Testing was done with fastdebug builds on the main platforms and also on Linux/PPC64le.
25-02-2024

@zgu yes, I'm planning a 17u backport.
07-02-2024

@rrich Any plan to backport to 17u?
07-02-2024

Fix request (21u)

EDIT (2024-01-16): as advised, I've cancelled the 21u PR and created a new PR targeting 21u-dev.

I would like to backport this as a performance bug fix. We received bug reports from users who see some young pauses of 30s, and even up to 50s (normally <1s), running large Gerrit instances (200GB heap, 100 gc threads). We have tried to tune ParallelGC: reducing the number of gc threads helps to make the pause time spikes smaller, but this makes average pause times longer.

The backport applies cleanly. The risk is low because of this and the thorough testing and reviewing in head.

Manual testing on x86_64:
make test TEST=langtools:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
make test TEST=jdk:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
make test TEST=hotspot:tier1

The fix passed our CI testing: JTReg tests: tier1-4 of hotspot and jdk; all of langtools and jaxp; JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP-specific tests (also with ParallelGC). All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le.
16-01-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/160 Date: 2024-01-12 09:19:27 +0000
16-01-2024

This needs a jdk21u-dev PR before being applicable for approval. Please re-apply for approval once the PR is there.
14-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/328 Date: 2023-11-06 16:49:57 +0000
08-11-2023

Changeset: 4bfe2268 Author: Richard Reingruber <rrich@openjdk.org> Date: 2023-10-24 07:05:56 +0000 URL: https://git.openjdk.org/jdk/commit/4bfe226870a15306b1e015c38fe3835f26b41fe6
24-10-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/14846 Date: 2023-07-12 08:05:59 +0000
12-07-2023

There is (at least) one other issue in task distribution, JDK-8311163, that might cause this for object arrays in the young gen. There may be other issues that have been improved in G1 but not in Parallel GC. Reapplying the backed-out JDK-8309960 may also help.
03-07-2023

The inverse scaling might be caused by the overhead of task stealing or by the card marks done by different threads for adjacent array elements.
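
To make the second hypothesis concrete, here is a minimal Java sketch of the card arithmetic (assuming the common 512-byte card size; the class and method names are illustrative, not HotSpot code): references stored into adjacent slots of the same array map to the same or neighbouring card-table bytes, so marks by different threads land on the same cache line.

    // Why adjacent array elements can contend on card marks (assumes the
    // common 512-byte card size; illustrative names, not HotSpot code).
    class CardMarkSketch {
        static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards

        static long cardFor(long fieldAddr) {
            return fieldAddr >>> CARD_SHIFT; // one card-table byte per 512 heap bytes
        }

        public static void main(String[] args) {
            long base = 1L << 20; // arbitrary array-element base address
            for (long i : new long[] {0, 63, 64, 8191}) {
                System.out.printf("element %4d -> card %d%n", i, cardFor(base + i * 8));
            }
            // 64 eight-byte elements share one card, and a 64-byte cache line
            // of the card table covers 64 cards (32 KiB of heap), so threads
            // marking cards for nearby elements write to the same cache line.
        }
    }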
26-06-2023

Hi Thomas, if you don't mind I would like to take this item. I already have a working implementation: https://github.com/openjdk/jdk/compare/master...reinrich:jdk:ps_parallel_scanning_of_large_arrays_in_old It needs some more stress testing, and I want to run the Renaissance benchmark to show that it does not introduce a performance regression.

We've encountered extremely long scavenge pauses in a very large Gerrit/Git instance (256 cores, 256 GB RAM, I think). The pauses were up to 50s with 100 gc threads; shorter with fewer threads, but then the short pauses got longer.

I'll attach a micro benchmark that reproduces the issue: 10x-50x longer scavenge pauses (depending on the system) when running with 10 gc threads instead of just one thread, i.e. scavenge performance scales inversely with the number of threads. With the fix mentioned above, performance increases when adding threads (about 4x).

Testing so far: GHA (jdk11, jdk17, jdk22), hotspot/gc, jdk:tier1, langtools:tier1. The latter two with -XX:+UseParallelGC.
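
The attached benchmark itself is not part of this report text; the following is only a hedged guess at its shape (the class name LargeArrayScan and the sizes are hypothetical): a huge old-gen Object[] whose slots are overwritten with young objects, so every scavenge must scan the dirty cards of one enormous array.

    // Hypothetical sketch of such a reproducer (not the actual attachment):
    // a huge reference array settled in the old gen is peppered with
    // references to young objects, so each young GC must scan the dirty
    // cards of a single enormous array.
    // Example run: java -XX:+UseParallelGC -Xms8g -Xmx8g -Xlog:gc LargeArrayScan
    import java.util.concurrent.ThreadLocalRandom;

    public class LargeArrayScan {
        static final int SLOTS = 64 * 1024 * 1024;  // 64M reference slots
        static final Object[] BIG = new Object[SLOTS];

        public static void main(String[] args) {
            System.gc(); // full collection settles BIG in the old gen
            for (int iter = 0; iter < 1_000; iter++) {
                // Dirty cards all over the array with young references;
                // the allocations also drive young collections.
                for (int i = 0; i < 1_000_000; i++) {
                    BIG[ThreadLocalRandom.current().nextInt(SLOTS)] = new byte[16];
                }
            }
        }
    }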
26-06-2023