JDK-8230706 : Waiting on completion of strong nmethod processing causes long pause times with G1
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 11,14
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-09-06
  • Updated: 2020-07-15
  • Resolved: 2019-10-24
JDK 11: Fixed in 11.0.6-oracle
JDK 14: Fixed in 14 b21
Description
On our internal Cassandra benchmark setup, pause times are exceptionally long once or twice per run.

This always seems to be caused by a single thread in the Thread Roots scan phase, i.e. the log for that phase looks as follows:

[2019-09-05T13:22:07.788+0100][942.036s][1567686127788ms][trace][gc,phases      ] GC(271)       Thread Roots (ms):        Min:  0.0, Avg: 85.2, Max: 1954.3, Diff: 1954.3, Sum: 1959.1, Workers: 23
[2019-09-05T13:22:07.788+0100][942.036s][1567686127788ms][trace][gc,phases,task ] GC(271)                                  0.0  0.3  0.3  0.2  0.2  1.0  0.3 1954.3  0.2  0.2  0.2  0.3  0.2  0.3  1.1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

This seems to occur since JDK 11 (not in jdk8u231 at least), and is only seen with G1, not with other collectors such as ZGC.

On that setup, THP is off, and the issue also occurs if logs are written to a ramdisk. Sys time is zero.
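
For reference, per-worker phase timings like the ones above come from unified logging with the gc+phases tags at trace level. A configuration along these lines (the exact selectors, decorators and output file used on the benchmark setup are an assumption, not taken from this report) produces that output format:

java -Xlog:gc*,gc+phases*=trace:file=gc.log:time,uptime,timemillis,level,tags ...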
Comments
Hi Thomas, is there any chance we can get the change you pushed to 11.0.6-oracle? That would be really helpful! Thanks, Goetz.
11-11-2019

[~shade]: sorry for the late reply: a different, much simpler change that fixes most of the issue, with some (minor?) drawbacks, has been pushed to 11.0.6-oracle. The reason is that there have been some changes in the relevant code in nmethod.?pp that would have caused too much backporting effort. The review process for the 11.0.6 fix, due to its simplicity, completed much faster.
01-11-2019

That's cool, can we see the changeset then, in the interest of backporting to open 11.0.6?
01-11-2019

URL: https://hg.openjdk.java.net/jdk/jdk/rev/18c246ad2ff9 User: tschatzl Date: 2019-10-24 09:08:48 +0000
24-10-2019

There is an additional test support patch at http://cr.openjdk.java.net/~tschatzl/8230706/webrev.2.testing/ that aids in testing this change by increasing the likelihood of strong and weak processing occurring at the same time.
24-10-2019

The backporting status for this issue is unclear: there is no fix in jdk/jdk, but there is an 11.0.6-oracle backport of it (JDK-8231994)? Was some variant of the fix committed straight to the private Oracle 11u tree?
14-10-2019

A better option is to remove the wait barrier for strong/weak root processing by modifying G1CodeBlobClosure; one option is to have two variants of it (see the sketch below):

- one that applies the *weak* closure (not marking) only if the nmethod is still unclaimed
- one that, in addition to that, unconditionally (independent of the nmethod claim) marks the embedded oops as live; it would be nice to know which nmethods have already been "marked through" to avoid duplicate marking

This should be much more scalable with *lots* of threads available than waiting on the barrier, as the number of embedded oops per stack trace (and in total) is, from initial measurements, not very high (in the few hundreds). We could use one bit in nmethod::_oops_do_mark_link for that, I think :)
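
A minimal, self-contained sketch of the idea (not HotSpot code; the type and member names such as FakeNmethod and claim_bits are hypothetical stand-ins for the real nmethod claiming machinery, and the bit layout merely mirrors the suggestion of reusing a bit in nmethod::_oops_do_mark_link):

#include <atomic>
#include <cstdio>
#include <vector>

struct FakeOop { bool live = false; };

struct FakeNmethod {
  // Bit 0: "claimed for this GC"; bit 1: "embedded oops already marked through".
  std::atomic<unsigned> claim_bits{0};
  std::vector<FakeOop*> embedded_oops;

  bool try_claim()              { return (claim_bits.fetch_or(1u) & 1u) == 0; }
  bool try_set_marked_through() { return (claim_bits.fetch_or(2u) & 2u) == 0; }
};

// Weak variant: process the nmethod only if this thread claims it first;
// it does not mark the embedded oops as live.
struct WeakCodeBlobClosure {
  void do_code_blob(FakeNmethod* nm) {
    if (!nm->try_claim()) return;
    for (FakeOop* o : nm->embedded_oops) {
      (void)o;  // evacuation/forwarding work would go here, no marking
    }
  }
};

// Strong variant: additionally marks embedded oops as live independent of the
// claim, using the "marked through" bit to avoid duplicate marking.
struct StrongCodeBlobClosure {
  void do_code_blob(FakeNmethod* nm) {
    if (nm->try_set_marked_through()) {
      for (FakeOop* o : nm->embedded_oops) {
        o->live = true;  // keep embedded oops alive for concurrent marking
      }
    }
    if (nm->try_claim()) {
      for (FakeOop* o : nm->embedded_oops) {
        (void)o;  // regular evacuation work
      }
    }
  }
};

int main() {
  FakeOop a, b;
  FakeNmethod nm;
  nm.embedded_oops = {&a, &b};
  StrongCodeBlobClosure().do_code_blob(&nm);  // marks a and b, claims nm
  WeakCodeBlobClosure().do_code_blob(&nm);    // no-op: nm is already claimed
  std::printf("a.live=%d b.live=%d\n", a.live, b.live);
  return 0;
}

In this sketch duplicate evacuation is prevented by the claim bit and duplicate marking by the marked-through bit, so neither variant ever has to wait for the other.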
18-09-2019

This is not caused by taking the heap region remembered set locks to add to the current region's nmethod remembered set.

The problem looks like the unbounded trimming during root scanning that we allow the collector to do, in combination with root scanning during concurrent start requiring the threads to wait together at the end. I.e. the worker threads trim their queues (start evacuating objects) when the local task queues reach the high threshold, and do so until the low threshold is reached. However, it is not guaranteed that we quickly reach the low threshold (or reach it at all before the entire heap has been evacuated). While this is no problem in regular young gcs, during concurrent mark start all threads need to wait together until they can continue. This wastes a lot of time because the other threads are just waiting for these thread(s) and not helping.

The reason why I believe this is the case (and that it could be worked around by increasing the global threshold(s)) is that I am seeing instances of JDK-8201313 during regular young gcs, i.e. the sum of the sub-phase timings takes longer than the total (which excludes the ObjCopy times). This also makes sense wrt the jdk versions in which this behavior occurs: we added this task queue trimming to G1 in jdk11 (JDK-6672778).

After verification, this issue can be worked around by setting -XX:GCDrainStackTargetSize (or by disabling class unloading too, I think); an example command line is given below.
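
As a concrete illustration of that workaround (the value below is only an example, not a recommendation; the flag's default is 64 and a suitable value is application-dependent), the drain-stack target can be raised on the command line so that per-thread queue trimming kicks in less often during root scanning:

java -XX:+UseG1GC -XX:GCDrainStackTargetSize=256 ...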
13-09-2019

Interestingly this issue only occurs with longer pause times; at shorter pause times the draining never seems to be that problematic, so it is certainly application and timing specific.
12-09-2019

When reducing the gc interval these hiccups go away or are much less pronounced, apparently due to less per-gc opportunity to stall. These long Thread Roots scans occur more often than expected; however, they are only very problematic in Concurrent Start pauses, because only in that case do we need to wait for all threads at the end of the root marking phase due to class unloading. One mitigation for Cassandra in particular may be disabling class unloading (an example command line is given below).
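
For completeness, a hedged sketch of that mitigation (these are standard HotSpot product flags; whether giving up class unloading is acceptable is application-specific and not evaluated here):

java -XX:+UseG1GC -XX:-ClassUnloadingWithConcurrentMark ...   (skip class unloading during concurrent mark only)
java -XX:+UseG1GC -XX:-ClassUnloading ...                     (disable class unloading entirely)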
11-09-2019

From code inspection, one potential suspect is taking a (per-heap-region) lock when adding the nmethod to the current region's nmethod remembered set (G1CodeBlobClosure::HeapRegionGatheringOopClosure::do_oop_work(), the call to HeapRegion::add_strong_code_root()). However, the time this takes seems too long, and this needs to be verified.
09-09-2019