JDK-8278475 : G1 dirty card refinement by Java threads may get unnecessarily paused
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 18
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2021-12-09
  • Updated: 2022-02-07
  • Resolved: 2022-02-01
  • Fix Version: JDK 19 (fixed in build 19 b08)
Related Reports
Relates :  
Relates :  
Description
When doing dirty card refinement, if an STS (SuspendibleThreadSet) suspension request is active, processing of a card buffer is aborted and the remainder of the buffer is pushed onto the DCQS (G1DirtyCardQueueSet) paused list. It's not just G1ConcurrentRefineThreads that check the STS state there; Java threads do too.
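The abort-and-pause path can be modeled with a toy queue set. This is a minimal single-threaded sketch, not HotSpot code: `CardBuffer`, `ToyQueueSet`, and `refine_buffer` are hypothetical names, and the `should_yield` callback is an assumed stand-in for checking the STS suspension request before each card.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical stand-in for a buffer of dirty cards.
struct CardBuffer {
  std::vector<int> cards;
  std::size_t next = 0;  // index of the first card not yet refined
};

// Hypothetical single-threaded model of a dirty card queue set.
struct ToyQueueSet {
  std::deque<CardBuffer> paused;  // remainders of aborted buffers
  std::size_t refined = 0;        // total cards refined so far

  // Called by whichever thread is refining: a concurrent refinement thread
  // or a Java thread. `should_yield` models the STS suspension check and
  // is consulted before each card.
  bool refine_buffer(CardBuffer buf, const std::function<bool()>& should_yield) {
    while (buf.next < buf.cards.size()) {
      if (should_yield()) {
        paused.push_back(buf);  // abort: keep the unrefined remainder
        return false;
      }
      ++buf.next;  // stand-in for refining cards[next]
      ++refined;
    }
    return true;  // whole buffer processed
  }
};
```

Because the check does not depend on which kind of thread is refining, a Java thread that refines while the request is set takes the same abort path as a refinement thread.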

When coming out of a safepoint, Java threads are activated before the STS suspension request is cleared. So a Java thread could be activated, enqueue a buffer, decide it needs to do some refinement work and so obtain a buffer from the queue, notice the STS suspension request, and pause that buffer until after the *next* safepoint. And of course that could keep happening until the STS suspension request is finally cleared.
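The timing problem above can be illustrated by tagging each paused buffer with the number of completed safepoints at the moment it was paused, and releasing it only once a later safepoint has completed. This is a sketch under assumptions: `PausedList` and its tagging rule are hypothetical simplifications, not the actual G1 paused-buffer implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: a paused buffer is released only after a safepoint
// completes that ended after the buffer was paused.
struct PausedList {
  struct Entry {
    int buffer_id;
    std::uint64_t paused_at;  // safepoints completed when it was paused
  };
  std::vector<Entry> entries;
  std::uint64_t safepoints_completed = 0;

  void pause(int buffer_id) {
    entries.push_back({buffer_id, safepoints_completed});
  }

  void complete_safepoint() { ++safepoints_completed; }

  // Return buffers paused before the most recent completed safepoint;
  // buffers paused after it stay inaccessible.
  std::vector<int> take_released() {
    std::vector<int> released;
    std::vector<Entry> kept;
    for (const Entry& e : entries) {
      if (e.paused_at < safepoints_completed) {
        released.push_back(e.buffer_id);
      } else {
        kept.push_back(e);
      }
    }
    entries.swap(kept);
    return released;
  }
};
```

A buffer paused while going into a safepoint comes back when that safepoint ends. A buffer paused just after the safepoint ends, while the stale request is still visible, is stranded until the following safepoint.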

I haven't found anything that "fails" because of that oddity. But it could lead to paused buffers that are inaccessible to refinement from the end of the safepoint where that happens until the next safepoint. And paused buffer contents are counted in num_cards, to encourage refinement work to move them back to the normal queue. This could lead to spinning by refinement threads: there are enough oddly paused cards to indicate there is work to do, but no work can be obtained from either the empty normal queue or the inaccessible "next" paused list. That seems pretty unlikely, but still.
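The spinning scenario comes from a mismatch between the work estimate and the work that is actually reachable. A minimal sketch, assuming hypothetical names (`ToyRefineState`, `try_refine_step`, `count_idle_spins`) and a single threshold comparison in place of G1's real refinement-thread control logic:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the counts a refinement thread consults.
struct ToyRefineState {
  std::size_t queued_cards = 0;           // cards in the normal queue
  std::size_t previous_paused_cards = 0;  // paused, reachable now
  std::size_t next_paused_cards = 0;      // paused "for the next safepoint"

  // Paused cards are included so refinement is encouraged to move them back.
  std::size_t num_cards() const {
    return queued_cards + previous_paused_cards + next_paused_cards;
  }

  // One simplified refinement step: processes all reachable cards at once
  // and returns how many it processed (0 means no progress).
  std::size_t try_refine_step() {
    std::size_t done = queued_cards + previous_paused_cards;
    queued_cards = 0;
    previous_paused_cards = 0;
    return done;
  }
};

// Counts loop iterations that saw work indicated but made no progress,
// bounded by `max_iterations` so the demo terminates.
std::size_t count_idle_spins(ToyRefineState& s, std::size_t threshold,
                             std::size_t max_iterations) {
  std::size_t spins = 0;
  for (std::size_t i = 0; i < max_iterations && s.num_cards() > threshold; ++i) {
    if (s.try_refine_step() == 0) ++spins;
  }
  return spins;
}
```

When all pending cards sit in the "next" paused list, `num_cards()` stays above the threshold while every step makes no progress, so the loop only stops at its iteration bound.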

A possible fix might be to change the STS suspension request flag to be based on the safepoint counter. That way we don't have two different request "flags". That would require some reordering in safepoint entry that would need careful examination.
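The counter-based idea can be sketched as a single atomic counter that is odd while a safepoint is requested or in progress, so that ending the safepoint and clearing the request are one step. This is only an illustration under assumptions: `ToySafepointState` and its methods are hypothetical, and the sketch ignores the safepoint-entry reordering the description says would need careful examination.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model: one counter replaces a separate STS request flag.
// Even value: no suspension requested. Odd value: suspension requested.
class ToySafepointState {
  std::atomic<std::uint64_t> counter_{0};

 public:
  void begin_safepoint() {
    std::uint64_t old = counter_.fetch_add(1, std::memory_order_acq_rel);
    assert(old % 2 == 0);  // must not already be in a safepoint
    (void)old;
  }

  // Ending the safepoint also clears the request, so a thread resumed only
  // after this call returns cannot observe a stale request.
  void end_safepoint() {
    std::uint64_t old = counter_.fetch_add(1, std::memory_order_acq_rel);
    assert(old % 2 == 1);  // must be inside a safepoint
    (void)old;
  }

  bool suspension_requested() const {
    return counter_.load(std::memory_order_acquire) % 2 == 1;
  }

  std::uint64_t completed_safepoints() const {
    return counter_.load(std::memory_order_acquire) / 2;
  }
};
```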

Comments
Changeset: 1f6fcbe2 Author: Kim Barrett <kbarrett@openjdk.org> Date: 2022-02-01 15:44:26 +0000 URL: https://git.openjdk.java.net/jdk/commit/1f6fcbe2f3da4c63976b1564ec2170e4757fadcc
01-02-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7148 Date: 2022-01-19 22:31:19 +0000
19-01-2022

There is an easier way to avoid this problem than messing with the STS suspension request flag. When handling a full buffer, we decide whether to do mutator refinement. In addition to the existing tests, don't do mutator refinement if there is an active STS suspension request. Going into a safepoint, this avoids wasting time transferring buffers from the queue to the next paused list. Coming out of a safepoint, it prevents the problem discussed in this bug.
14-12-2021
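The adopted approach adds one more condition to the mutator-refinement decision. A minimal sketch, assuming hypothetical names: `MutatorRefineInputs` and `should_mutator_refine` are not the identifiers used in the linked changeset, and the "existing tests" are reduced to a single threshold comparison.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical inputs to the decision made when a Java thread hands off a
// full buffer.
struct MutatorRefineInputs {
  std::size_t num_cards;          // pending cards, including paused ones
  std::size_t mutator_threshold;  // stand-in for the existing tests
  bool sts_suspension_requested;  // active STS suspension request
};

bool should_mutator_refine(const MutatorRefineInputs& in) {
  // New check: while a suspension request is active, don't refine at all.
  // Going into a safepoint this avoids moving buffers to the paused list;
  // coming out of one it avoids pausing buffers until the next safepoint.
  if (in.sts_suspension_requested) return false;
  // Existing test, simplified to a single threshold comparison.
  return in.num_cards > in.mutator_threshold;
}
```

Checking the request before taking a buffer means a Java thread never starts work it would immediately have to pause.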