Bug ID: JDK-8162929 Enqueuing dirty cards into a single DCQS during GC does not scale

Type: Enhancement
Component: hotspot
Sub-Component: gc
Affected Version: 9

Priority: P4
Status: Resolved
Resolution: Fixed

Submitted: 2016-08-02
Updated: 2020-12-12
Resolved: 2019-07-19

Versions (Unresolved/Resolved/Fixed)

JDK 14: 14 b07 (Fixed)

Related Reports

Blocks :	JDK-8227719 - G1 Pending cards estimation too conservative in cost prediction
Relates :	JDK-8209974 - Eliminate shared PtrQueues
Relates :	JDK-8212826 - Make PtrQueue free list lock-free
Relates :	JDK-8230327 - Make G1DirtyCardQueueSet free-id init unconditional
Relates :	JDK-8173211 - G1: Enqueuing dirty cards during reference enqueuing in phase 3 does not scale
Relates :	JDK-8237143 - Eliminate DirtyCardQ_cbl_mon
Relates :	JDK-8258142 - Simplify G1RedirtyCardsQueue
Relates :	JDK-8076584 - Parallelism used for redirty logged cards needs better control.
Relates :	JDK-8230332 - G1DirtyCardQueueSet _notify_when_complete is always true

Description

While looking at some more demanding large microbenchmarks (e.g. BigRamtester, 20g heap, 1M regions), enqueuing dirty cards during GC in G1ParScanThreadState::update_rs incurs a significant amount of wait (idle) time.

The reason is that enqueuing completed buffers takes a global lock (basically ending up in PtrQueue::handle_zero_index() and PtrQueueSet::enqueue_complete_buffer(); there is also some strange locking/unlocking going on in PtrQueue::locking_enqueue_completed_buffer()).

That does not scale beyond a few threads.

The problem is harder than it seems because, after providing a per-thread DCQS, performance does not improve much: the stalling moves to the malloc() calls made when allocating new DCQ buffers.
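
To illustrate the locking pattern described above, here is a minimal sketch, not HotSpot code. All names (Buffer, SharedSet, run) are hypothetical. It contrasts taking one global lock per completed buffer with collecting buffers per worker and merging them under a single lock acquisition. It only models the lock traffic and says nothing about the malloc() stalls mentioned above.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a completed dirty card buffer.
struct Buffer { std::size_t worker; };

struct Result {
  std::size_t buffers;            // buffers that reached the shared list
  std::size_t lock_acquisitions;  // times enqueue/merge took the global lock
};

// Hypothetical shared set guarded by one global lock.
class SharedSet {
  std::mutex lock_;
  std::vector<Buffer> completed_;
  std::size_t acquisitions_ = 0;

public:
  // One lock acquisition per completed buffer.
  void enqueue(Buffer b) {
    std::lock_guard<std::mutex> guard(lock_);
    ++acquisitions_;
    completed_.push_back(b);
  }

  // One lock acquisition for a whole per-worker list.
  void merge(std::vector<Buffer>& local) {
    std::lock_guard<std::mutex> guard(lock_);
    ++acquisitions_;
    completed_.insert(completed_.end(), local.begin(), local.end());
    local.clear();
  }

  Result result() {
    std::lock_guard<std::mutex> guard(lock_);
    return Result{completed_.size(), acquisitions_};
  }
};

// Runs `workers` threads, each producing `per_worker` buffers.
// With `use_local`, each worker collects locally and merges once at the end.
Result run(std::size_t workers, std::size_t per_worker, bool use_local) {
  SharedSet shared;
  std::vector<std::thread> threads;
  for (std::size_t w = 0; w < workers; ++w) {
    threads.emplace_back([&shared, w, per_worker, use_local] {
      std::vector<Buffer> local;
      for (std::size_t i = 0; i < per_worker; ++i) {
        if (use_local) {
          local.push_back(Buffer{w});
        } else {
          shared.enqueue(Buffer{w});
        }
      }
      if (use_local) {
        shared.merge(local);
      }
    });
  }
  for (std::thread& t : threads) {
    t.join();
  }
  return shared.result();
}
```

In this sketch, run(4, 1000, false) takes the global lock 4000 times, while run(4, 1000, true) takes it 4 times for the same 4000 buffers.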

Comments

URL: https://hg.openjdk.java.net/jdk/jdk/rev/8ae33203d600 User: kbarrett Date: 2019-07-19 21:54:42 +0000
19-07-2019
The strange locking/unlocking in locking_enqueue_completed_buffer was eliminated by JDK-8182703, and that (now vestigial) function was eliminated by JDK-8214144.
25-01-2019
JDK-8212826 made the allocation of new buffers lock-free when there are buffers in the free list. If the free list is empty so that malloc is called, that call benefits from the thread-local optimizations often made there. The completed buffer lock remains a problem.
25-01-2019
Providing a per-thread DCQS with a new free list for every thread has the additional problem that it causes unbounded assignment of these buffers to the "main" list, making it grow and grow. I.e., the main list's free DCQ buffers are not often reused.
18-10-2016