JDK-8162929 : Enqueuing dirty cards into a single DCQS during GC does not scale
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2016-08-02
  • Updated: 2020-12-12
  • Resolved: 2019-07-19
JDK 14: 14 b07, Fixed
Description
In some more demanding, large microbenchmarks (e.g. BigRamtester, 20g heap, 1M regions), enqueuing dirty cards during GC in G1ParScanThreadState::update_rs incurs a significant amount of wait (idle) time.

The reason is that enqueuing completed buffers takes a global lock (the code path ends up in PtrQueue::handle_zero_index() and PtrQueueSet::enqueue_complete_buffer(); there is also some strange locking/unlocking going on in PtrQueue::locking_enqueue_completed_buffer()).

That does not scale beyond a few threads.

The problem is harder than it seems: after providing a per-thread DCQS, performance does not improve much, because the stalling moves to the malloc() calls made when allocating new DCQ buffers.
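
The contention pattern described above can be illustrated with a minimal, standalone sketch. It is not HotSpot code: SharedCompletedBuffers, LocalCardQueue and run_workers are hypothetical names, and std::mutex and std::vector merely stand in for the real lock and buffer types. Each worker fills a private buffer and takes the one shared lock whenever that buffer is full, so all hand-offs from all workers serialize on that lock.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a shared dirty card queue set: one mutex guards
// the list of completed buffers, so every hand-off from any worker serializes.
class SharedCompletedBuffers {
public:
  void enqueue(std::vector<std::size_t>&& buffer) {
    std::lock_guard<std::mutex> guard(_lock);  // the single global lock
    _completed.push_back(std::move(buffer));
  }
  std::size_t count() {
    std::lock_guard<std::mutex> guard(_lock);
    return _completed.size();
  }
private:
  std::mutex _lock;
  std::vector<std::vector<std::size_t>> _completed;
};

// Hypothetical per-worker queue: cards go into a private buffer, and the
// shared lock is only taken when that buffer fills up or is flushed.
class LocalCardQueue {
public:
  LocalCardQueue(SharedCompletedBuffers& set, std::size_t capacity)
    : _set(set), _capacity(capacity) { _buffer.reserve(capacity); }
  void enqueue(std::size_t card) {
    _buffer.push_back(card);
    if (_buffer.size() == _capacity) flush();
  }
  void flush() {
    if (_buffer.empty()) return;
    _set.enqueue(std::move(_buffer));
    _buffer = std::vector<std::size_t>();  // start a fresh buffer
    _buffer.reserve(_capacity);
  }
private:
  SharedCompletedBuffers& _set;
  std::size_t _capacity;
  std::vector<std::size_t> _buffer;
};

// Runs `workers` threads that each enqueue `cards_per_worker` cards and
// returns how many completed buffers reached the shared set.
std::size_t run_workers(std::size_t workers, std::size_t cards_per_worker,
                        std::size_t capacity) {
  SharedCompletedBuffers set;
  std::vector<std::thread> threads;
  for (std::size_t w = 0; w < workers; ++w) {
    threads.emplace_back([&set, cards_per_worker, capacity] {
      LocalCardQueue queue(set, capacity);
      for (std::size_t c = 0; c < cards_per_worker; ++c) queue.enqueue(c);
      queue.flush();  // hand off the partially filled last buffer
    });
  }
  for (auto& t : threads) t.join();
  return set.count();
}
```

In this sketch the number of lock acquisitions grows with the number of workers and shrinks with the buffer capacity, which is why smaller buffers or more GC threads make the single lock more of a bottleneck.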
Comments
URL: https://hg.openjdk.java.net/jdk/jdk/rev/8ae33203d600 User: kbarrett Date: 2019-07-19 21:54:42 +0000
19-07-2019

The strange locking/unlocking in locking_enqueue_completed_buffer was eliminated by JDK-8182703, and that (now vestigial) function was eliminated by JDK-8214144.
25-01-2019

JDK-8212826 made the allocation of new buffers lock-free when there are buffers in the free list. If the free list is empty and malloc is called, the allocation benefits from the thread-local optimizations that malloc implementations often provide. The completed buffer lock remains a problem.
25-01-2019

Providing a per-thread DCQS with a new free list for every thread has an additional problem: it causes unbounded assignment of these buffers to the "main" list, making it grow and grow. I.e. the main list's free DCQ buffers are not often reused.
18-10-2016