JDK-8310031 : Parallel: Implement better work distribution for large object arrays in old gen
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 17, 21
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-06-14
  • Updated: 2024-03-11
  • Resolved: 2023-10-24
Fix Versions
  • JDK 17: 17.0.12 (Fixed)
  • JDK 21: 21.0.3 (Fixed)
  • JDK 22: 22 b21 (Fixed)
Sub Tasks
JDK-8322645
Description
Currently, Parallel GC's young collection distributes the work of finding dirty cards in the old generation on a per-stripe basis (stripes of 64 KiB of memory): if an object starts in a stripe assigned to a thread, that thread exclusively owns the job of finding dirty cards in that object.

This limits parallelism for large objArrays: a single worker thread owns the entire objArray, limiting throughput.

This should also fix the 4-5x+ difference in pause times between Parallel GC and G1 for DelayInducer (JDK-8062128) found in JDK-8309960.
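
As a rough illustration of the pre-fix ownership rule, here is a toy Java model (the names StripeModel, STRIPE_SIZE, and ownerOf are hypothetical, not HotSpot code) showing how every stripe spanned by a large array is still processed by the single worker that owns the stripe in which the array starts:

    // Toy model of stripe-based card-scanning ownership (hypothetical names,
    // not HotSpot code): old-gen memory is cut into fixed-size stripes that
    // are handed out to workers round-robin.
    class StripeModel {
        static final int STRIPE_SIZE = 64 * 1024; // 64 KiB work units

        // Pre-fix rule: an object is scanned entirely by the worker owning
        // the stripe that contains the object's start address.
        static int ownerOf(long objStart, int numWorkers) {
            return (int) ((objStart / STRIPE_SIZE) % numWorkers);
        }

        public static void main(String[] args) {
            long arrayStart = 10L * STRIPE_SIZE;  // array begins in stripe 10
            long arraySize  = 512L * 1024 * 1024; // a 512 MiB objArray
            int workers = 10;
            // The array spans 8192 stripes, yet one worker scans all of them:
            System.out.println("All " + (arraySize / STRIPE_SIZE)
                    + " stripes scanned by worker " + ownerOf(arrayStart, workers));
        }
    }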
Comments
[21u, 17u] [~rrich] Do we need release notes for this?
06-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/2230 Date: 2024-02-24 10:00:00 +0000
25-02-2024

Fix request (17u)

I would like to backport this as a performance bug fix. We received bug reports from users who see some young pauses of 30s, and even up to 50s (normally <1s), running large Gerrit instances (200GB heap, 100 gc threads). We have tried to tune ParallelGC: reducing the number of gc threads helps to make the pause time spikes smaller, but this makes average pause times longer.

Requires backports of:
https://bugs.openjdk.org/browse/JDK-8280030 (https://github.com/openjdk/jdk17u-dev/pull/2226)
https://bugs.openjdk.org/browse/JDK-8278893 (https://github.com/openjdk/jdk17u-dev/pull/2227)
https://bugs.openjdk.org/browse/JDK-8282094 (https://github.com/openjdk/jdk17u-dev/pull/2228)

All hunks except the following two applied after a trivial preparation change. The first hunk of psCardTable.hpp did not apply because of different context; resolved by inserting the new lines. The second hunk of psScavenge.cpp did not apply because of different context; resolved by inserting the new lines. Finally, a few trivial changes were required (renaming and the like).

Risk is medium. We did the downstream backport many weeks ago already.

I've tested on x86_64:
jdk:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
langtools:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"

Local CI testing: the fix passed our CI testing (e.g. 2024-02-25): JTReg tests: tier1-4 of hotspot and jdk; all of langtools and jaxp; JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP-specific tests (also with ParallelGC). Testing was done with fastdebug builds on the main platforms and also on Linux/PPC64le.
25-02-2024

@zgu yes, I'm planning a 17u backport.
07-02-2024

@rrich Any plan to backport to 17u?
07-02-2024

Fix request (21u)

EDIT (2024-01-16): as advised, I've cancelled the 21u PR and created a new PR targeting 21u-dev.

I would like to backport this as a performance bug fix. We received bug reports from users who see some young pauses of 30s, and even up to 50s (normally <1s), running large Gerrit instances (200GB heap, 100 gc threads). We have tried to tune ParallelGC: reducing the number of gc threads helps to make the pause time spikes smaller, but this makes average pause times longer.

The backport applies cleanly. The risk is low because of this and the thorough testing and reviewing in head.

Manual testing on x86_64:
make test TEST=langtools:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
make test TEST=jdk:tier1 TEST_VM_OPTS="-XX:+UseParallelGC"
make test TEST=hotspot:tier1

The fix passed our CI testing: JTReg tests: tier1-4 of hotspot and jdk; all of langtools and jaxp; JCK, SPECjvm2008, SPECjbb2015, Renaissance Suite, and SAP-specific tests (also with ParallelGC). All testing was done with fastdebug and release builds on the main platforms and also on Linux/PPC64le.
16-01-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/160 Date: 2024-01-12 09:19:27 +0000
16-01-2024

This needs a jdk21u-dev PR before being applicable for approval. Please re-apply for approval once the PR is there.
14-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/328 Date: 2023-11-06 16:49:57 +0000
08-11-2023

Changeset: 4bfe2268 Author: Richard Reingruber <rrich@openjdk.org> Date: 2023-10-24 07:05:56 +0000 URL: https://git.openjdk.org/jdk/commit/4bfe226870a15306b1e015c38fe3835f26b41fe6
24-10-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/14846 Date: 2023-07-12 08:05:59 +0000
12-07-2023

There is (at least) one other issue in task distribution, JDK-8311163, that might cause this for object arrays in the young gen. There may be other issues that have been improved in G1 but not in Parallel GC. Reapplying the backed-out JDK-8309960 may also help.
03-07-2023

The inverse scaling might be caused by the overhead of task stealing or by the card marks done by different threads for adjacent array elements.
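
To make the second hypothesis concrete, here is a minimal Java sketch of the card arithmetic (assuming the common 512-byte card size; the class and method names are illustrative, not HotSpot code): references stored into adjacent slots of the same array map to the same or neighbouring card-table bytes, so marks by different threads land on the same cache line.

    // Why adjacent array elements can contend on card marks (assumes the
    // common 512-byte card size; illustrative names, not HotSpot code).
    class CardMarkSketch {
        static final int CARD_SHIFT = 9; // 2^9 = 512-byte cards

        static long cardFor(long fieldAddr) {
            return fieldAddr >>> CARD_SHIFT; // one card-table byte per 512 heap bytes
        }

        public static void main(String[] args) {
            long base = 1L << 20; // arbitrary array-element base address
            for (long i : new long[] {0, 63, 64, 8191}) {
                System.out.printf("element %4d -> card %d%n", i, cardFor(base + i * 8));
            }
            // 64 eight-byte elements share one card, and a 64-byte cache line
            // of the card table covers 64 cards (32 KiB of heap), so threads
            // marking cards for nearby elements write to the same cache line.
        }
    }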
26-06-2023

Hi Thomas, if you don't mind I would like to take this item. I already have a working implementation: https://github.com/openjdk/jdk/compare/master...reinrich:jdk:ps_parallel_scanning_of_large_arrays_in_old It needs some more stress testing, and I want to run the Renaissance benchmark to show that it does not introduce a performance regression.

We've encountered extremely long scavenge pauses in a very large Gerrit/Git instance (256 cores, 256 GB RAM, I think). The pauses were up to 50s with 100 gc threads; shorter with fewer threads, but then the short pauses got longer.

I'll attach a micro benchmark that reproduces the issue: 10x-50x longer scavenge pauses (depending on the system) when running with 10 gc threads instead of just one thread, i.e. scavenge performance scales inversely with the number of threads. With the fix mentioned above, performance increases when adding threads (about 4x).

Testing so far: GHA (jdk11, jdk17, jdk22), hotspot/gc, jdk:tier1, langtools:tier1. The latter two with -XX:+UseParallelGC.
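
The attached benchmark itself is not part of this report text; the following is only a hedged guess at its shape (the class name LargeArrayScan and the sizes are hypothetical): a huge old-gen Object[] whose slots are overwritten with young objects, so every scavenge must scan the dirty cards of one enormous array.

    // Hypothetical sketch of such a reproducer (not the actual attachment):
    // a huge reference array settled in the old gen is peppered with
    // references to young objects, so each young GC must scan the dirty
    // cards of a single enormous array.
    // Example run: java -XX:+UseParallelGC -Xms8g -Xmx8g -Xlog:gc LargeArrayScan
    import java.util.concurrent.ThreadLocalRandom;

    public class LargeArrayScan {
        static final int SLOTS = 64 * 1024 * 1024;  // 64M reference slots
        static final Object[] BIG = new Object[SLOTS];

        public static void main(String[] args) {
            System.gc(); // full collection settles BIG in the old gen
            for (int iter = 0; iter < 1_000; iter++) {
                // Dirty cards all over the array with young references;
                // the allocations also drive young collections.
                for (int i = 0; i < 1_000_000; i++) {
                    BIG[ThreadLocalRandom.current().nextInt(SLOTS)] = new byte[16];
                }
            }
        }
    }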
26-06-2023