JDK-8336707 : Contention of ForkJoinPool grows when stealing works
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.concurrent
  • Affected Version: 23,24
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2024-07-18
  • Updated: 2024-11-01
Related Reports
Relates :  
Description
Since JDK-8322732, the number of calls to Unsafe.loadFence in ForkJoinPool.runWorker has grown. On architectures with weak memory models, a load fence is a memory barrier that requires core-to-core traffic, so some test cases show regressions.

The attached files are flame graphs of a MultiJVM run of SPECjbb2015. Compared with flamegraph-before-8322732.svg, the CPU cycles spent in runWorker itself increase noticeably in flamegraph-since-8322732.svg.

The attached patch is a simple workaround for this example: it changes two continue statements to break statements when the current thread fails to steal work, so there are fewer loop iterations and fewer calls to Unsafe.loadFence. With the patch, the CPU cycles are largely reduced, as shown in flamegraph-patched-8322732.svg. A thorough fix may require more work.
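
To make the mechanism concrete, here is a purely illustrative sketch (hypothetical types, not the actual ForkJoinPool code or the attached patch) of a steal loop in which every retry issues a load fence before re-reading shared state; backing out with break after a failed steal, instead of retrying with continue, is what reduces the number of fences:

    import java.lang.invoke.VarHandle;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative only: Victim is a hypothetical stand-in for another worker's queue.
    final class StealSketch {
        static final class Victim {
            final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
        }

        static void scan(Victim[] victims) {
            for (Victim q : victims) {
                for (;;) {
                    // Public analogue of Unsafe.loadFence(); on weak memory models
                    // this becomes a real barrier, and such calls are what grew.
                    VarHandle.acquireFence();
                    Runnable t = q.tasks.poll();
                    if (t == null)
                        break;   // failed to steal: give up on this queue;
                                 // a "continue" here would re-issue the fence
                    t.run();
                }
            }
        }
    }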
Comments
Thanks [~lliu]. The changes led to more shutdown contention, which I'll also address, but they should not negatively impact anything else.
01-11-2024

Hi [~dl], thanks for your great work. I think the PR at 2efcb44 solves the regression I captured.
01-11-2024

Hi [~lliu]. The updates should now be stable. Could you retest performance?
29-10-2024

Thanks [~lliu]. The version you tested still needs work to avoid potential stalls for ExecutorService tasks. I'll commit and test more before asking you to recheck.
23-10-2024

Hi [~dl], your PR at 0e3e9de recovers the performance of the SPECjbb2015 workload mentioned in the description. However, there are multiple workloads I care about, and some of them regress with your PR. I found that if we add the rescan back after w.topLevelExec on top of your PR, all known regressions disappear. I'm going to post the patch as 0001-Take-the-reduction-of-interference-back.patch for you to review, as I'm not sure whether the condition of the rescan can be improved, for example by checking some local variables before accessing q.base.
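
As a purely hypothetical illustration of the idea above (not the actual patch or ForkJoinPool internals; VictimQueue and its methods are stand-ins): after executing a stolen task, rescan the same victim only when a guard condition holds, and the open question is whether that guard can avoid re-reading contended shared state.

    import java.util.concurrent.ConcurrentLinkedDeque;

    // Hypothetical sketch of "rescan after topLevelExec"; not ForkJoinPool code.
    final class RescanSketch {
        static final class VictimQueue {
            final ConcurrentLinkedDeque<Runnable> tasks = new ConcurrentLinkedDeque<>();
            Runnable steal() { return tasks.pollFirst(); }   // touches shared state
            int sizeEstimate() { return tasks.size(); }      // also touches shared state
        }

        // After running a stolen task (the analogue of w.topLevelExec), look at the
        // same victim once more, but only when a guard condition holds. The question
        // raised above is whether this guard could be computed from values the
        // scanner already holds locally, instead of re-reading the contended
        // q.base field (approximated here by sizeEstimate()).
        static void topLevelExecAndRescan(VictimQueue q, Runnable stolen) {
            stolen.run();
            if (q.sizeEstimate() > 0) {          // guard before the extra scan
                Runnable next = q.steal();
                if (next != null)
                    next.run();
            }
        }
    }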
23-10-2024

[~lliu] could you retest the latest commit to see if the changes are heading in the right direction? Thanks!
18-10-2024

Thanks for the SPECjbb results. This clearly needs more work. The underlying issue is that to conform to the ExecutorService interface (especially under Loom), there are cases in which scanning must be exhaustive rather than adapting under contention (as it previously did). There should be ways of doing so that do not impact usages that do not need it, but it is taking a while to come up with a scheme that also meets all of the other constraints and expectations on FJP.
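
For context, a minimal sketch (an assumed scenario, not taken from the issue) of the ExecutorService contract that forces exhaustive scanning: every submitted task must eventually run, and the pool must be able to quiesce and terminate, so a worker cannot simply give up on a queue it failed to scan under contention.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Sketch of the contract the scanning logic must honor: the submitted task
    // must run (f.get() must not stall), and shutdown()/awaitTermination() must
    // complete once all tasks have finished.
    public class EsContractSketch {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = new ForkJoinPool(4);
            Future<?> f = pool.submit(() -> System.out.println("task ran"));
            f.get();                                      // must not stall indefinitely
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS))
                System.err.println("pool did not quiesce");
        }
    }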
17-10-2024

Hi [~dl], sorry for replying late. I collected the percentage of CPU cycles spent in ForkJoinPool on the SPECjbb2015 workload. The method involved is mainly runWorker, with a small amount of deactivate/awaitWork. The four versions of ForkJoinPool and their percentages are:
- Pre-8322732: ~2.26%
- 8322732: ~8.86%
- 8322732 + 0001-Minor-changes-when-t-is-null-v2.patch: ~3.9%
- 8322732 + PR-21507 at a1ad02f: ~8.28%
So I would expect the cost of ForkJoinPool itself to be close to ~4%, or ideally ~2%.
17-10-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/21507 Date: 2024-10-14 20:58:43 +0000
14-10-2024

I committed what seems to be the best compromise wrt over/under scanning/signalling. If other tests by Eric and/or others confirm, I'll file the PR. https://github.com/DougLea/jdk/tree/JDK-8336707
10-10-2024

Hi [~dl] I will do my best to test this in our in-house perf suite in the next couple days.
24-09-2024

I created a trial version (without internal doc changes yet) in https://github.com/DougLea/jdk/tree/JDK-8336707 Any help testing this would be very welcome.
23-09-2024

We can reproduce some regressions with JDK-8322732 using some Renaissance tests. We use the JMH version (https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.15.0/renaissance-jmh-0.15.0.jar). Especially with FjKmeans and FutureGenetic, we see regressions of 0-13% depending on the platform: Mac minis (both x64 and ARM), and both Intel and AMD VMs and bare-metal OCI servers of various core counts. If more come up, I will update this JBS issue.
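
As a hedged sketch of how such a run might be reproduced (the classpath setup and the use of the programmatic JMH runner are assumptions, not taken from the issue), the two benchmarks mentioned above could be launched through the standard JMH API with the renaissance-jmh jar on the classpath:

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    // Sketch: run only the FjKmeans and FutureGenetic benchmarks via the JMH API.
    // Assumes renaissance-jmh-0.15.0.jar (and JMH) are on the classpath.
    public class RunRenaissanceSubset {
        public static void main(String[] args) throws RunnerException {
            Options opts = new OptionsBuilder()
                    .include("FjKmeans")       // benchmark name regexes
                    .include("FutureGenetic")
                    .forks(1)
                    .build();
            new Runner(opts).run();
        }
    }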
27-08-2024

The main cause is possible O(#workers-squared) flailing during ramp-down/up patterns, in logic that was put into place to ensure that pools with a small number of workers stay live and responsive. Your patch disables this. But I've been experimenting with compromises and will have a trial version out soon.
20-08-2024

Hi [~dl], what's your opinion on the patch I posted above? I found it helps SPECjbb2015 and "reactors" in Renaissance on AArch64 platforms, but there are negative effects on some microbenchmarks. One is ForkJoinPoolThresholdAutoSurplus with threshold=2, where contention on "ctl" grows in signalWork called from ForkJoinPool.push. Do you have any suggestions about this case? Thanks.
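
For context, a minimal sketch of the kind of workload that benchmark exercises (an assumed reconstruction, not the actual JMH source): a recursive task that forks only while the surplus of locally queued tasks is at or below a threshold, so each fork goes through ForkJoinPool.push and may call signalWork.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Sketch of a surplus-threshold fork pattern (threshold = 2, as in the comment).
    // Names and structure are assumptions for illustration, not the benchmark itself.
    public class SurplusThresholdSketch extends RecursiveTask<Long> {
        static final int THRESHOLD = 2;
        final long n;

        SurplusThresholdSketch(long n) { this.n = n; }

        @Override
        protected Long compute() {
            if (n <= 1)
                return n;
            // Fork only while few of our own tasks are still queued; otherwise
            // compute directly. Each fork() pushes a task and may call signalWork,
            // which is where the "ctl" contention shows up.
            if (getSurplusQueuedTaskCount() <= THRESHOLD) {
                SurplusThresholdSketch left = new SurplusThresholdSketch(n - 1);
                left.fork();
                long right = new SurplusThresholdSketch(n - 2).compute();
                return left.join() + right;
            }
            return new SurplusThresholdSketch(n - 1).compute()
                 + new SurplusThresholdSketch(n - 2).compute();
        }

        public static void main(String[] args) {
            System.out.println(new ForkJoinPool().invoke(new SurplusThresholdSketch(25)));
        }
    }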
20-08-2024

Thanks for the report. Spin/scan performance tradeoffs vary across platforms. It would be great if others could help check the performance of updates when I start a PR (hopefully soon).
30-07-2024