JDK-8205353 : SATB compaction hides unmarked objects until final-mark
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 11
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2018-06-19
  • Updated: 2021-01-27
Description
The SATB queue compaction feature can keep unmarked objects in the queue until the pause. This may result in very long "Finalize Mark" times.

Verified for SPECjbb2015 (1.22s-long Finalize Marks with Shenandoah, but G1 uses the exact same code); it is also the suspected cause of the spurious 7s+ long Finalize Marks in Kitchensink/ReferenceStress (with G1).
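The mechanism can be sketched as follows (a minimal illustration under assumed names, not the actual HotSpot code): after dead entries are filtered out of a thread-local SATB buffer, the buffer is enqueued for the marker only if it is still sufficiently full, mirroring the role of G1SATBBufferEnqueueingThresholdPercent; otherwise it is handed back to the mutator, and its surviving, still-unmarked entries stay invisible to concurrent marking until the pause.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the filter-and-compact decision.
struct SatbBuffer {
  std::vector<const void*> entries;  // oops surviving the filtering step
  std::size_t capacity;              // total buffer capacity
};

// Mirrors G1SATBBufferEnqueueingThresholdPercent (default 60): enqueue the
// buffer for concurrent marking only if it is still at least this full;
// otherwise hand it back, hiding the surviving entries from the marker.
bool should_enqueue(const SatbBuffer& buf, int threshold_percent) {
  std::size_t fill_percent = buf.entries.size() * 100 / buf.capacity;
  return fill_percent >= static_cast<std::size_t>(threshold_percent);
}
```

With low SATB traffic, a buffer can stay below the threshold indefinitely, which is exactly the pathology this issue describes.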
Comments
In Shenandoah, we discovered that for some workloads the SATB traffic is too low for the forced enqueues to ever fire. I think this would affect any heuristic that decides on forcing the enqueue without knowing about time. We changed it to time-based triggers that work better: http://mail.openjdk.java.net/pipermail/shenandoah-dev/2018-July/006630.html http://hg.openjdk.java.net/shenandoah/jdk/rev/480dbbcc9dae
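A minimal sketch of such a time-based trigger (illustrative names and bounds, not the actual patch): regardless of buffer fill, force an enqueue once the buffer has been retained longer than a fixed interval, so entries cannot stay hidden when traffic is low.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical policy: a retained buffer records when it was last flushed;
// if that was longer ago than max_retention, the next compaction forces an
// enqueue even for a mostly-empty buffer.
struct FlushPolicy {
  Clock::duration max_retention;  // e.g. 1 ms; an assumed tuning knob
  Clock::time_point last_flush;   // when this buffer was last enqueued

  bool must_force_flush(Clock::time_point now) const {
    return (now - last_flush) >= max_retention;
  }
};
```

The downside discussed later in this thread is that this reads a timer on the buffer-handling path, which can be costly on some systems.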
05-07-2018

This is day-one behavior and not causing crashes, so changing to enhancement.
21-06-2018

fyi: ZGC does a handshake to force flushing out remaining buffers, and (time-)limited marking.
20-06-2018

Oh, I am not arguing that there are better solutions. The question for me right now is whether we should fix it now in a potentially hacky way (and which way), and do the right thing later (i.e. time-based, and/or just continue with another marking phase if Final Mark takes too long). I am not in any way arguing that decrementing G1SATBBufferEnqueueingThresholdPercent by one is "natural", btw. :) Or we could just ignore this problem for now (for G1 at least), since in the worst case you can also change G1SATBBufferEnqueueingThresholdPercent manually to make the issue disappear (I would think). As for the contention on the enqueueing, and the enqueueing itself: the queue code is known to have several deficiencies, and at least I know from old experiments of mine that there is a lot to gain from more clever enqueueing (both implementation-wise and frequency-wise).
20-06-2018

Yeah, but decreasing G1SATBBufferEnqueueingThresholdPercent actually introduces an implicit/explicit decrement step anyway. The fact that it may be hidden -- by the assumption that a decrement of one is "natural" (why would you think it is?) -- does not make the actual parameter go away. The good thing about "skips" is that we can crank it up to thousands if SATB traffic is high and most oops are dead, without compromising enqueueing performance. In retrospect, we really want to avoid contention on enqueueing (which is guarded by the mutex), which probably means we want to cap the absolute frequency of forced enqueues; this is why a time-bound cap feels more intuitive.
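The "skips" counter referred to here (cf. ShenandoahSATBBufferMaxEnqueueSkips) can be sketched roughly like this, with illustrative names rather than the real HotSpot code: a below-threshold buffer may be handed back at most N times before it is enqueued unconditionally.

```cpp
#include <cassert>

// Hypothetical sketch of a bounded-skips policy.
struct SkippyBuffer {
  int skips = 0;  // how many times this buffer was handed back below threshold
};

// Returns true if the buffer should be enqueued for concurrent marking.
// A full-enough buffer resets the counter; a below-threshold buffer is
// handed back at most max_skips times, then enqueued by force.
bool decide_enqueue(SkippyBuffer& b, bool above_threshold, int max_skips) {
  if (above_threshold) { b.skips = 0; return true; }
  if (++b.skips > max_skips) { b.skips = 0; return true; }  // forced enqueue
  return false;  // hand back to the mutator once more
}
```

As the comment notes, the right max_skips depends on SATB traffic and buffer size, which is why it can be cranked up when most oops are dead.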
20-06-2018

The suggestion of adapting G1SATBBufferEnqueueingThresholdPercent over time was an alternative to adding another (imo equally unintuitive) option, since the number of skips has the same issue as just decaying G1SATBBufferEnqueueingThresholdPercent: it is not directly bound to time either. As for how to decrease that threshold, I would just linearly decrease G1SATBBufferEnqueueingThresholdPercent by one every time - that would, given current defaults (G1SATBBufferEnqueueingThresholdPercent is 60, can be up to 100), have roughly the same impact as the default value of ShenandoahSATBBufferMaxEnqueueSkips (=50). All these quick fixes are just that, compared to continuing with another concurrent phase.
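The linear decay proposed here is simple to state as code (a sketch, not a patch): each time a below-threshold buffer is handed back, its effective threshold drops by one, so with the default of 60 a buffer is enqueued unconditionally after at most 60 hand-backs, comparable in effect to ShenandoahSATBBufferMaxEnqueueSkips=50.

```cpp
#include <cassert>

// Hypothetical per-buffer decay of the enqueueing threshold: starts at
// G1SATBBufferEnqueueingThresholdPercent (default 60) and decreases by one
// on every hand-back, bottoming out at 0 (always enqueue).
int decay_threshold(int threshold_percent) {
  return threshold_percent > 0 ? threshold_percent - 1 : 0;
}
```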
20-06-2018

I do agree that connecting the flushes to some notion of time is more intuitive. Counting "skips" does feel incomplete, and the optimal default for that option really depends on the SATB traffic, SATB buffer size, etc. I think we actually want SATB buffers to be enqueued at least every N ms. In the Shenandoah review thread [1], I mentioned a first prototype with a timestamp, and it worked, but I was wary of adding timestamps on the hot path, because some systems have scalability problems with timers. We could use logical time instead, e.g. some epoch counter that the SATB compaction path checks against, while a periodic task increments/decrements that counter. It would still introduce an option, but one now denominated in more intuitive "time", which we can guess easily: for example, it seems intuitive that we would want to flush at least every 1 ms to make sure concurrent marks longer than 10 ms are able to see the hidden objects. Decreasing G1SATBBufferEnqueueingThresholdPercent feels like a more complicated version of what we have with the "skips" counter, and it still leaves the question "by how much to decrease", which again does not avoid the option. Deciding based on oop types in the buffer seems awkward, because we can always hide large graphs behind innocuous objects. [1] http://mail.openjdk.java.net/pipermail/shenandoah-dev/2018-June/006410.html
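The logical-time idea can be sketched like this (all names hypothetical): a periodic watcher bumps a global epoch counter, and the compaction path forces an enqueue whenever the buffer was last retained in an older epoch, so the hot path never reads a timer.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical global epoch, bumped by a periodic task (e.g. every 1 ms).
std::atomic<unsigned> g_satb_epoch{0};

void periodic_tick() {
  g_satb_epoch.fetch_add(1, std::memory_order_relaxed);
}

struct Buffer {
  unsigned retained_epoch = 0;  // epoch when this buffer was last handed back
};

// Compaction-path check: any buffer retained in an earlier epoch is flushed,
// bounding how long entries can stay hidden to roughly one tick interval.
bool force_enqueue_on_compaction(const Buffer& b) {
  return g_satb_epoch.load(std::memory_order_relaxed) != b.retained_epoch;
}
```

The only cost on the hot path is a relaxed atomic load, which is what makes this attractive compared to a timestamp.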
20-06-2018

The ultimate fix would probably be, if final mark takes too long, go back into concurrent operation.
20-06-2018

Or, for a given buffer, we could decrease G1SATBBufferEnqueueingThresholdPercent every time it gets pushed back to the application. (Just wary of adding another option :))
20-06-2018

The patch suggested by [~shade] may still suffer from the same issue, although it won't keep objects unmarked for as long; an alternative could be, while compacting, to look at whether we have (large?) j.l.O arrays in the queue. This would of course mean more overhead during filtering, but would guarantee that j.l.O arrays are processed asap, for obvious reasons. It would not help with single references that keep a large object graph alive. Another option, instead of taking the number of retries as a measure of "an old buffer", would be to use an actual timestamp in the buffer; that could even be checked regularly by some watcher thread that somehow forces emptying.
20-06-2018

This issue can be worked around by lowering G1SATBBufferEnqueueingThresholdPercent.
20-06-2018

Reported by [~shade], proposed fix for Shenandoah at http://cr.openjdk.java.net/~shade/shenandoah/satb-prompt/webrev.02/
19-06-2018