JDK-8043575 : Dynamically parallelize reference processing work
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8u20,9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2014-05-20
  • Updated: 2023-08-21
  • Resolved: 2018-06-18

JDK 11
11 b19 (Fixed)
Description
The reference processing phases can take a significant amount of time, depending on the number of soft references and the amount of retained objects.

To decrease the duration of these phases it is possible to use the ParallelRefProcEnabled switch; however, that currently needs to be done manually.

The goal of this CR is to dynamically turn on parallel reference processing for the different GC phases.

One way to do this is to estimate the reference processing work as a fraction of the total GC pause time and, if that fraction crosses a threshold, enable parallel reference processing automatically.
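The threshold idea above could look roughly like the following sketch. This is illustrative only, not HotSpot code: the RefProcHeuristic name, the predicted-time inputs, and the 10% threshold are all assumptions.

```cpp
#include <cassert>

// Hypothetical sketch of the proposed heuristic: enable parallel
// reference processing when the predicted reference-processing time
// exceeds a fixed fraction of the predicted total pause time.
// Names and the threshold value are illustrative assumptions.
struct RefProcHeuristic {
  double pause_threshold_fraction;  // e.g. 0.10 == 10% of the pause

  bool should_parallelize(double predicted_ref_proc_ms,
                          double predicted_pause_ms) const {
    if (predicted_pause_ms <= 0.0) {
      return false;  // no usable prediction yet; stay single-threaded
    }
    return (predicted_ref_proc_ms / predicted_pause_ms)
           > pause_threshold_fraction;
  }
};
```

In a real collector the predicted times would come from the pause-time prediction machinery; here they are simply passed in.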
Comments
URL: http://hg.openjdk.java.net/jdk/jdk/rev/8f1d5d706bdd User: tschatzl Date: 2018-06-18 10:12:25 +0000
18-06-2018

JDK-7068229, which was closed as a duplicate of this issue, contains a few more thoughts.
13-06-2018

Excellent! We have worked around the zero-task case in Shenandoah: http://mail.openjdk.java.net/pipermail/shenandoah-dev/2017-June/002611.html. But a more comprehensive fix for small numbers of references, like Jon suggested above, would benefit this even more. Our benchmarks do care about pause times on this scale, so I would happily test any suggested patch.
06-06-2017

Any progress on this? We have already hit it, but the other way around: with parallel ref processing turned on by default, we sometimes lose valuable pause time when there is no work (JDK-8181214), which seems to be a special case of what is proposed here.
06-06-2017

I have a prototype but I want to refine it. And yes, if there are not enough references to process (not only the zero case), a single thread shows better performance.
06-06-2017

I meant that I prefer not to address Kim's concern in this CR, but I didn't say how to follow up on it either. I filed JDK-8173211 for that.
23-01-2017

Considering that ParallelRefProcEnabled also parallelizes reference enqueuing, despite only having "RefProc" in its name, and given arguments like reference enqueuing being an integral part of reference processing, I would consider automatically enabling parallel enqueuing part of the more general "automatically enable parallel reference processing" work. Improving the parallelization of reference enqueuing may be best done in a different CR though, as there are several options for it: instead of straightforwardly improving the existing enqueuing parallelization code (e.g. removing the DCQS enqueuing bottleneck, which may already yield most of the benefit), it may be useful to move that work into the reference processing phases. Btw, there is a similar issue in the evacuation code already, see JDK-8162929.
23-01-2017

My understanding is that this CR is about parallelizing the work for discovered references only (ReferenceProcessor::process_discovered_references). BTW, good point about enqueue_discovered_references.
20-01-2017

For G1 there is a problem with parallelizing enqueue_discovered_references. That enqueue applies the barrier set's write_ref_field to the discovered links being added to the pending list. If the reference is young, this isn't a problem, because the write barrier doesn't need to do anything in that case. But if the reference is old, it's going to record a dirty card, and that card will go into the shared global DirtyCardQueue, since the call is made from a non-Java thread. So there is a locking DCQ.enqueue per reference, and severe lock contention when parallelized. One possibility might be a special-purpose variant of write_ref_field for use here, with most collectors just forwarding to their normal write_ref_field implementation, but G1 using per-worker DCQs.
19-01-2017
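The per-worker-queue idea in the comment above can be illustrated with a simplified sketch. All types and names here are hypothetical stand-ins, not HotSpot's DirtyCardQueue code: each worker appends to a private buffer, and the buffers are merged under the lock once per phase instead of taking the lock once per reference.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Shared global queue: one lock acquisition per enqueued card.
// This models the contention problem described above.
struct SharedCardQueue {
  std::mutex lock;
  std::vector<const void*> cards;

  void enqueue(const void* card) {
    std::lock_guard<std::mutex> g(lock);
    cards.push_back(card);
  }
};

// Per-worker buffers: lock-free appends during the phase,
// a single locked merge at the end.
struct PerWorkerCardQueues {
  std::vector<std::vector<const void*>> local;  // one buffer per worker

  explicit PerWorkerCardQueues(size_t workers) : local(workers) {}

  void enqueue(size_t worker, const void* card) {
    local[worker].push_back(card);  // no lock on the hot path
  }

  void flush_to(SharedCardQueue& shared) {
    std::lock_guard<std::mutex> g(shared.lock);  // one lock per phase
    for (auto& buf : local) {
      shared.cards.insert(shared.cards.end(), buf.begin(), buf.end());
      buf.clear();
    }
  }
};
```

The design trade-off is the usual one: per-worker buffering costs memory proportional to the worker count but removes the per-item lock from the hot path.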

The lengths of the discovered lists are available and could be used to estimate the amount of work. The divisor of 1000 was picked out of the air.

diff --git a/src/share/vm/gc/shared/referenceProcessor.cpp b/src/share/vm/gc/shared/referenceProcessor.cpp
--- a/src/share/vm/gc/shared/referenceProcessor.cpp
+++ b/src/share/vm/gc/shared/referenceProcessor.cpp
@@ -845,13 +845,16 @@
   // of the test.
   bool must_balance = _discovery_is_mt;

+  size_t total_list_count = total_count(refs_lists);
+
+  uint number_of_workers = num_q();
+  set_active_mt_degree(total_list_count / 1000 + 1);
+
   if ((mt_processing && ParallelRefProcBalancingEnabled) || must_balance) {
     balance_queues(refs_lists);
   }

-  size_t total_list_count = total_count(refs_lists);
-
   if (PrintReferenceGC && PrintGCDetails) {
     gclog_or_tty->print(", " SIZE_FORMAT " refs", total_list_count);
   }
@@ -899,6 +902,7 @@
     }
   }

+  set_active_mt_degree(number_of_workers);
   return total_list_count;
 }
11-06-2015
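The sizing rule in the patch above (one extra worker per 1000 discovered references) can be sketched in isolation as follows. The clamp to a maximum worker count is an assumption added for illustration; in the patch itself, set_active_mt_degree() is responsible for bounding the value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Sketch of the sizing rule from the patch above: one worker per
// 1000 discovered references, plus one, clamped to the configured
// maximum. The clamp parameter is an illustrative assumption.
static unsigned active_degree(size_t total_refs, unsigned max_workers) {
  unsigned wanted = static_cast<unsigned>(total_refs / 1000 + 1);
  return std::min(wanted, max_workers);
}
```

With this rule, short discovered lists stay single-threaded (avoiding the parallel-setup overhead noted in the later comments), while long lists scale up to the full worker count.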