JDK-8133051 : Concurrent refinement threads may be activated and deactivated at random
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8,9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2015-08-05
  • Updated: 2017-07-26
  • Resolved: 2016-04-18

JDK 9
9 b116: Fixed
Description
While running the attached test on a very large machine with -XX:+TraceConcRefinement, refinement thread activation and deactivation appear to be erratic, leading to:

- many threads being deactivated although there is still lots of work to do (search for "deactivated worker 51, off threshold 1, current 98")
- many threads being activated although there is not much work to do (search for "activated worker 20, on threshold 1, current 2")
- the number of activated threads frequently being larger than the number of refinement buffers (see attached log; search for "activated worker 20, on threshold 1, current 5", where worker 20 is activated when there are only 5 buffers)

This can hurt performance in both directions: too many threads active, or too few.

The former burns CPU time, degrading overall performance; the latter hurts the regularity of pauses, because running too few refinement threads lets refinement buffers pile up, which in turn leads to overly long pauses.
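
For orientation, the stepped threshold scheme these messages come from behaves roughly as sketched below (a minimal model with illustrative names such as RefinementWorker and completed_buffers, not the actual HotSpot code): each worker has an "on" and an "off" threshold, a worker wakes the next one when the backlog of completed buffers rises past that worker's "on" threshold, and puts itself to sleep when the backlog drops back to its own "off" threshold.

#include <atomic>
#include <cstddef>
#include <cstdio>

// Illustrative model only: more completed buffers should mean more active
// refinement workers, and vice versa.
struct RefinementWorker {
  int               id;
  size_t            on_threshold;   // activate when buffers > on_threshold
  size_t            off_threshold;  // deactivate when buffers <= off_threshold
  bool              active;
  RefinementWorker* next;           // worker with the next higher thresholds
};

static std::atomic<size_t> completed_buffers{0};

static void activate(RefinementWorker* w) {
  w->active = true;
  std::printf("activated worker %d, on threshold %zu, current %zu\n",
              w->id, w->on_threshold, completed_buffers.load());
}

static void refinement_loop(RefinementWorker* self) {
  for (;;) {
    size_t current = completed_buffers.load();
    if (current <= self->off_threshold) {
      self->active = false;
      std::printf("deactivated worker %d, off threshold %zu, current %zu\n",
                  self->id, self->off_threshold, completed_buffers.load());
      return;
    }
    if (self->next != nullptr && !self->next->active &&
        current > self->next->on_threshold) {
      activate(self->next);          // backlog grew: wake the next worker
    }
    completed_buffers.compare_exchange_strong(current, current - 1);  // "refine" one buffer
  }
}

int main() {
  RefinementWorker w1{1, 3, 2, false, nullptr};
  RefinementWorker w0{0, 1, 0, true, &w1};
  completed_buffers = 10;   // simulate a backlog of completed buffers
  refinement_loop(&w0);     // worker 0 drains it, activating worker 1 on the way
  return 0;
}

The bug is that this scheme misbehaves in practice: workers go to sleep while a large backlog remains, wake up when almost nothing is queued, and more workers wake than there are buffers to refine.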
Comments
Regression test exists.
26-07-2017

I've filed JDK-8151670 to cover the thrashing of unusual deactivations followed by reactivations. A separate problem is that many useless activations can occur because the thresholds are so low that more threads may be activated than there is work to do; I'll continue to use this bug to cover that.
11-03-2016

On further consideration, I don't think the "bounce after abort request" scenario I described yesterday is actually possible. An abort request (actually, a yield request) can cause early deactivation. A later enqueue can then activate thread 0 (the primary thread). But it would then block on the SuspendibleThreadSetJoiner (as discussed in JDK-8151594) and would not reach the point of activating any other threads, nor would it reach the deactivation point and log that. So there must be something else going on to account for these unusual deactivate-then-reactivate sequences.
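
To make the control flow concrete, here is a rough sketch of the behavior being described, using a simplified stand-in for the suspendible thread set rather than the real SuspendibleThreadSetJoiner: a thread activated during a pause blocks in the join step and never reaches the code that activates further workers or logs a deactivation.

#include <condition_variable>
#include <mutex>

// Simplified stand-in (not the HotSpot class): joining blocks while a
// suspension (GC pause) is in progress; leaving is immediate.
class SuspendibleSetStub {
  std::mutex              _lock;
  std::condition_variable _cv;
  bool                    _suspended = false;
 public:
  void set_suspended(bool s) {
    { std::lock_guard<std::mutex> g(_lock); _suspended = s; }
    _cv.notify_all();
  }
  void join() {
    std::unique_lock<std::mutex> g(_lock);
    _cv.wait(g, [this] { return !_suspended; });   // blocks during a pause
  }
  void leave() {}
};

static SuspendibleSetStub sts;

// One iteration of the primary refinement thread: if it is woken while a
// pause is in progress, it parks in join() before it can activate other
// workers or log a deactivation.
void primary_thread_iteration() {
  sts.join();
  // ... refine buffers, possibly activating workers 1, 2, ... ...
  // ... eventually deactivate and log it ...
  sts.leave();
}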
10-03-2016

There's definitely an issue with thread 0; I mentioned this in the review for JDK-8150134: "One side effect of the logging change: it is now apparent that the primary refinement thread gets activated during a pause, though fortunately immediately blocks on the SuspendibleThreadSetJoiner." See JDK-8151594.
10-03-2016

It may just be a problem with ordering of the log messages.
10-03-2016

There are other oddities in the log (which may be yet another issue):
- at the GC around 1322.178, refinement thread 0 seems to keep running through the GC pause: it is deactivated, later activated, but never deactivated again.
- at the GC around 1323.239, there is a lone activation message after the first GC pause messages, when everything should already be synchronized.
But maybe thread 0 is somewhat special.
10-03-2016

Attached log file mentioned in the recent comment.
10-03-2016

I see a couple of similar sequences in a log I produced locally. All of the cases I've found have been a sequence of (1) unusual deactivations, (2) activations, (3) unusual deactivations, (4) GC pause. It's as if there were an activation "bounce" after an abort request. Given that the abort logic is quite separate from the activation logic, that seems possible. It could be that we have something like:
(1) request abort for a pending GC
(2) deactivate for abort
(3) some still-running Java thread enqueues a buffer, triggering activations
(4) deactivate for abort again
(5) GC pause
Thomas - does yours look like that? If so, I think it should be treated as a separate issue.
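
To illustrate why such a bounce seems plausible: if the mutator-side enqueue path drives activation purely from the buffer count, nothing prevents it from re-activating workers that were just deactivated for a pending abort. A rough sketch with invented names (abort_requested, activate_workers_up_to), not the actual HotSpot code:

#include <atomic>
#include <cstddef>

static std::atomic<size_t> completed_buffers{0};
static std::atomic<bool>   abort_requested{false};   // set when a pause is pending

static void activate_workers_up_to(size_t count) {
  // ... wake every worker whose "on" threshold is below count ...
}

// Worker-side abort handling is separate from the activation path.
static bool should_deactivate_for_abort() {
  return abort_requested.load();
}

// Mutator-side enqueue of a completed buffer: the activation decision looks
// only at the count, so between "deactivate for abort" (step 2) and the pause
// (step 5) a still-running Java thread can trigger fresh activations (step 3),
// which are then deactivated for the abort again (step 4).
static void enqueue_completed_buffer(size_t first_on_threshold) {
  size_t count = completed_buffers.fetch_add(1) + 1;
  if (count > first_on_threshold) {     // note: abort_requested is not consulted
    activate_workers_up_to(count);
  }
}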
10-03-2016

Thomas - That log snippet was generated with a version including the recent change for JDK-8150134? I earlier thought there was a mixup in the use of the stop_at parameter for DCQS::apply_closure_to_completed_buffer, but I was confused. I don't presently have an explanation for that log snippet.
09-03-2016

Yes, that build contains JDK-8150134.
09-03-2016

It might be the case that the logging is wrong, but see this log snippet (I can attach the file tomorrow) from a specjbb2015 run taken just now:

[1345.013s][info ][gc,metaspace] GC(583) Metaspace: 24093K->24093K(1071104K)
[1345.013s][info ][gc ] GC(583) Pause Young (G1 Evacuation Pause) 7421M->2483M(10240M) (1344.944s, 1345.013s) 69.
[1345.013s][info ][gc,cpu ] GC(583) User=1.79s Sys=0.00s Real=0.07s
[1345.014s][debug][gc,refine ] Deactivated 0, off threshold: 441, current: 181
[1348.689s][debug][gc,refine ] Activated 0, on threshold: 471, current: 662
[1348.689s][debug][gc,refine ] Activated 1, on threshold: 501, current: 667
[1348.689s][debug][gc,refine ] Activated 2, on threshold: 531, current: 667
[1348.689s][debug][gc,refine ] Activated 3, on threshold: 561, current: 668
[1348.689s][debug][gc,refine ] Activated 4, on threshold: 591, current: 667
[1348.689s][debug][gc,refine ] Activated 5, on threshold: 621, current: 666
[1348.689s][debug][gc,refine ] Deactivated 1, off threshold: 471, current: 669
[1348.689s][debug][gc,refine ] Deactivated 3, off threshold: 531, current: 671
[1348.689s][debug][gc,refine ] Deactivated 2, off threshold: 501, current: 670
[1348.689s][debug][gc,refine ] Deactivated 4, off threshold: 561, current: 672
[1348.689s][debug][gc,refine ] Deactivated 0, off threshold: 441, current: 673
[1348.696s][debug][gc,refine ] Activated 0, on threshold: 471, current: 673
[1348.696s][debug][gc,refine ] Activated 6, on threshold: 651, current: 673
[1348.696s][debug][gc,refine ] Activated 1, on threshold: 501, current: 672
[1348.696s][debug][gc,refine ] Activated 2, on threshold: 531, current: 672
[1348.696s][debug][gc,refine ] Activated 3, on threshold: 561, current: 672
[1348.697s][debug][gc,refine ] Activated 4, on threshold: 591, current: 671
[1348.698s][debug][gc,refine ] Activated 7, on threshold: 681, current: 685

In particular, look at the large difference in the "Deactivated" messages, where the off threshold is around 470 but the current count is around 670. Could the messages be extended to print both the current value of the count and the count that caused the (de-)activation?
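
A sketch of the sort of message being asked for (hypothetical helper names, not the current logging code or a proposed patch): capture the count that actually drove the decision and print it next to a fresh read, so the two values can be compared directly.

#include <atomic>
#include <cstddef>
#include <cstdio>

static std::atomic<size_t> completed_buffers{0};

// "triggering" is the value read when the deactivation decision was made;
// "current" is re-read at logging time and may already differ.
static void log_deactivation(int worker_id, size_t off_threshold,
                             size_t triggering_count) {
  std::printf("Deactivated %d, off threshold: %zu, triggering: %zu, current: %zu\n",
              worker_id, off_threshold, triggering_count,
              completed_buffers.load());
}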
09-03-2016

In the attached log, I think the places where workers are being deactivated even though a small number of additional buffers are still available (such as off threshold = 1, current = 2) are artifacts of the deactivation test being performed before the log message is printed, using a separate fetch of the number of enqueued buffers. Some other thread could enqueue a buffer in between. These are OK.
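
The benign interleaving described here can be sketched as follows (illustrative code, not the HotSpot sources): the deactivation decision and the log message each do their own read of the buffer count, and a mutator thread can enqueue a buffer between the two reads, so the logged "current" value can legitimately sit a little above the "off" threshold.

#include <atomic>
#include <cstddef>
#include <cstdio>

static std::atomic<size_t> completed_buffers{0};

void maybe_deactivate(int worker_id, size_t off_threshold) {
  size_t at_decision = completed_buffers.load();    // read #1: drives the decision
  if (at_decision <= off_threshold) {
    // <-- another thread may enqueue a buffer right here
    size_t at_logging = completed_buffers.load();   // read #2: what gets printed
    std::printf("deactivated worker %d, off threshold %zu, current %zu\n",
                worker_id, off_threshold, at_logging);
  }
}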
09-03-2016

In the attached log, I think the places where workers are being deactivated - even though there are many more buffers available than the off threshold - are due to abort requests for a pending pause. These are ok.
09-03-2016

Another issue is that even when trying to disable concurrent refinement (by specifying -XX:-G1UseAdaptiveConcRefinement and setting the thresholds to high values), there is still refinement going on, wasting CPU resources for no particular gain.
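
For reference, the configuration being described is roughly of the following shape; the zone flags are the standard G1 concurrent refinement options, and the values here are just arbitrarily large examples:

java -XX:+UseG1GC -XX:-G1UseAdaptiveConcRefinement \
     -XX:G1ConcRefinementGreenZone=100000 \
     -XX:G1ConcRefinementYellowZone=200000 \
     -XX:G1ConcRefinementRedZone=300000 \
     <application>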
07-08-2015

This has been an issue since the initial import.
05-08-2015