JDK-8133051 : Concurrent refinement threads may be activated and deactivated at random
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8,9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2015-08-05
  • Updated: 2017-07-26
  • Resolved: 2016-04-18

JDK 9
9 b116: Fixed
Description
While running the attached test on a very large machine with -XX:+TraceConcRefinement, refinement thread activation and deactivation appear to be erratic, leading to:

- many threads being deactivated although there is still lots of work to do (search for "deactivated worker 51, off threshold 1, current 98")
- many threads being activated although there is not much work to do (search for "activated worker 20, on threshold 1, current 2")
- the number of activated threads frequently being larger than the number of refinement buffers (see attached log; search for "activated worker 20, on threshold 1, current 5", where worker 20 is activated when there are only 5 buffers)

This can hurt performance in both directions: too many threads active, or too few.

The former burns CPU time, degrading overall performance; the latter hurts the regularity of pauses, because running too few refinement threads lets refinement buffers pile up, which in turn leads to overly long pauses.
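
For orientation, the stepped threshold scheme these messages come from behaves roughly as sketched below (a minimal model with illustrative names such as RefinementWorker and completed_buffers, not the actual HotSpot code): each worker has an "on" and an "off" threshold, a worker wakes the next one when the backlog of completed buffers rises past that worker's "on" threshold, and puts itself to sleep when the backlog drops back to its own "off" threshold.

#include <atomic>
#include <cstddef>
#include <cstdio>

// Illustrative model only: more completed buffers should mean more active
// refinement workers, and vice versa.
struct RefinementWorker {
  int               id;
  size_t            on_threshold;   // activate when buffers > on_threshold
  size_t            off_threshold;  // deactivate when buffers <= off_threshold
  bool              active;
  RefinementWorker* next;           // worker with the next higher thresholds
};

static std::atomic<size_t> completed_buffers{0};

static void activate(RefinementWorker* w) {
  w->active = true;
  std::printf("activated worker %d, on threshold %zu, current %zu\n",
              w->id, w->on_threshold, completed_buffers.load());
}

static void refinement_loop(RefinementWorker* self) {
  for (;;) {
    size_t current = completed_buffers.load();
    if (current <= self->off_threshold) {
      self->active = false;
      std::printf("deactivated worker %d, off threshold %zu, current %zu\n",
                  self->id, self->off_threshold, completed_buffers.load());
      return;
    }
    if (self->next != nullptr && !self->next->active &&
        current > self->next->on_threshold) {
      activate(self->next);          // backlog grew: wake the next worker
    }
    completed_buffers.compare_exchange_strong(current, current - 1);  // "refine" one buffer
  }
}

int main() {
  RefinementWorker w1{1, 3, 2, false, nullptr};
  RefinementWorker w0{0, 1, 0, true, &w1};
  completed_buffers = 10;   // simulate a backlog of completed buffers
  refinement_loop(&w0);     // worker 0 drains it, activating worker 1 on the way
  return 0;
}

The bug is that this scheme misbehaves in practice: workers go to sleep while a large backlog remains, wake up when almost nothing is queued, and more workers wake than there are buffers to refine.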
Comments
Regression test exists.
26-07-2017

I've filed JDK-8151670 to cover the thrashing of unusual deactivations followed by reactivations. A separate problem is that many useless activations can occur because the thresholds are so low that more threads may be activated than there is work to do; I'll continue to use this bug to cover that.
11-03-2016

On further consideration, I don't think the "bounce after abort request" scenario I described yesterday is actually possible. An abort request (actually, a yield request) can cause early deactivation. A later enqueue can then activate thread 0 (the primary thread). But it would then block on the SuspendibleThreadSetJoiner (as discussed in JDK-8151594) and would not reach the point of activating any other threads, nor would it reach the deactivation point and log that. So there must be something else going on to account for these unusual deactivate-then-reactivate sequences.
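
To make the control flow concrete, here is a rough sketch of the behavior being described, using a simplified stand-in for the suspendible thread set rather than the real SuspendibleThreadSetJoiner: a thread activated during a pause blocks in the join step and never reaches the code that activates further workers or logs a deactivation.

#include <condition_variable>
#include <mutex>

// Simplified stand-in (not the HotSpot class): joining blocks while a
// suspension (GC pause) is in progress; leaving is immediate.
class SuspendibleSetStub {
  std::mutex              _lock;
  std::condition_variable _cv;
  bool                    _suspended = false;
 public:
  void set_suspended(bool s) {
    { std::lock_guard<std::mutex> g(_lock); _suspended = s; }
    _cv.notify_all();
  }
  void join() {
    std::unique_lock<std::mutex> g(_lock);
    _cv.wait(g, [this] { return !_suspended; });   // blocks during a pause
  }
  void leave() {}
};

static SuspendibleSetStub sts;

// One iteration of the primary refinement thread: if it is woken while a
// pause is in progress, it parks in join() before it can activate other
// workers or log a deactivation.
void primary_thread_iteration() {
  sts.join();
  // ... refine buffers, possibly activating workers 1, 2, ... ...
  // ... eventually deactivate and log it ...
  sts.leave();
}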
10-03-2016

There's definitely an issue with thread 0; I mentioned this in the review for JDK-8150134: "One side effect of the logging change: it is now apparent that the primary refinement thread gets activated during a pause, though fortunately immediately blocks on the SuspendibleThreadSetJoiner." See JDK-8151594.
10-03-2016

It may just be a problem with ordering of the log messages.
10-03-2016

There are other oddities in the log (which may be yet another issue):
- at the GC around 1322.178, refinement thread 0 seems to keep running through the GC pause: it is deactivated, later activated, but never deactivated again.
- at the GC around 1323.239, there is a lone activation message after the first GC pause messages, when everything should already be synchronized.
But maybe thread 0 is somewhat special.
10-03-2016

Attached log file mentioned in the recent comment.
10-03-2016

I see a couple of similar sequences in a log I produced locally. All of the cases I've found have been a sequence of (1) unusual deactivations, (2) activations, (3) unusual deactivations, (4) GC pause. It's as if there were an activation "bounce" after an abort request. Given that the abort logic is quite separate from the activation logic, that seems possible. It could be that we have something like:
(1) request abort for a pending GC
(2) deactivate for abort
(3) some still-running Java thread enqueues a buffer, triggering activations
(4) deactivate for abort again
(5) GC pause
Thomas - does yours look like that? If so, I think it should be treated as a separate issue.
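
To illustrate why such a bounce seems plausible: if the mutator-side enqueue path drives activation purely from the buffer count, nothing prevents it from re-activating workers that were just deactivated for a pending abort. A rough sketch with invented names (abort_requested, activate_workers_up_to), not the actual HotSpot code:

#include <atomic>
#include <cstddef>

static std::atomic<size_t> completed_buffers{0};
static std::atomic<bool>   abort_requested{false};   // set when a pause is pending

static void activate_workers_up_to(size_t count) {
  // ... wake every worker whose "on" threshold is below count ...
}

// Worker-side abort handling is separate from the activation path.
static bool should_deactivate_for_abort() {
  return abort_requested.load();
}

// Mutator-side enqueue of a completed buffer: the activation decision looks
// only at the count, so between "deactivate for abort" (step 2) and the pause
// (step 5) a still-running Java thread can trigger fresh activations (step 3),
// which are then deactivated for the abort again (step 4).
static void enqueue_completed_buffer(size_t first_on_threshold) {
  size_t count = completed_buffers.fetch_add(1) + 1;
  if (count > first_on_threshold) {     // note: abort_requested is not consulted
    activate_workers_up_to(count);
  }
}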
10-03-2016

Thomas - That log snippet was generated with a version including the recent change for JDK-8150134? I earlier thought there was a mixup in the use of the stop_at parameter for DCQS::apply_closure_to_completed_buffer, but I was confused. I don't presently have an explanation for that log snippet.
09-03-2016

Yes, that build contains JDK-8150134.
09-03-2016

It might be the case that the logging is wrong, but see this log snippet (I can attach the file tomorrow) from a specjbb2015 run taken just now:

[1345.013s][info ][gc,metaspace] GC(583) Metaspace: 24093K->24093K(1071104K)
[1345.013s][info ][gc ] GC(583) Pause Young (G1 Evacuation Pause) 7421M->2483M(10240M) (1344.944s, 1345.013s) 69.
[1345.013s][info ][gc,cpu ] GC(583) User=1.79s Sys=0.00s Real=0.07s
[1345.014s][debug][gc,refine ] Deactivated 0, off threshold: 441, current: 181
[1348.689s][debug][gc,refine ] Activated 0, on threshold: 471, current: 662
[1348.689s][debug][gc,refine ] Activated 1, on threshold: 501, current: 667
[1348.689s][debug][gc,refine ] Activated 2, on threshold: 531, current: 667
[1348.689s][debug][gc,refine ] Activated 3, on threshold: 561, current: 668
[1348.689s][debug][gc,refine ] Activated 4, on threshold: 591, current: 667
[1348.689s][debug][gc,refine ] Activated 5, on threshold: 621, current: 666
[1348.689s][debug][gc,refine ] Deactivated 1, off threshold: 471, current: 669
[1348.689s][debug][gc,refine ] Deactivated 3, off threshold: 531, current: 671
[1348.689s][debug][gc,refine ] Deactivated 2, off threshold: 501, current: 670
[1348.689s][debug][gc,refine ] Deactivated 4, off threshold: 561, current: 672
[1348.689s][debug][gc,refine ] Deactivated 0, off threshold: 441, current: 673
[1348.696s][debug][gc,refine ] Activated 0, on threshold: 471, current: 673
[1348.696s][debug][gc,refine ] Activated 6, on threshold: 651, current: 673
[1348.696s][debug][gc,refine ] Activated 1, on threshold: 501, current: 672
[1348.696s][debug][gc,refine ] Activated 2, on threshold: 531, current: 672
[1348.696s][debug][gc,refine ] Activated 3, on threshold: 561, current: 672
[1348.697s][debug][gc,refine ] Activated 4, on threshold: 591, current: 671
[1348.698s][debug][gc,refine ] Activated 7, on threshold: 681, current: 685

In particular, look at the large difference in the "Deactivated" messages, where the off threshold is around 470 but the current count is around 670. Could the messages be extended to print both the current value of the count and the count that caused the (de-)activation?
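
A sketch of the sort of message being asked for (hypothetical helper names, not the current logging code or a proposed patch): capture the count that actually drove the decision and print it next to a fresh read, so the two values can be compared directly.

#include <atomic>
#include <cstddef>
#include <cstdio>

static std::atomic<size_t> completed_buffers{0};

// "triggering" is the value read when the deactivation decision was made;
// "current" is re-read at logging time and may already differ.
static void log_deactivation(int worker_id, size_t off_threshold,
                             size_t triggering_count) {
  std::printf("Deactivated %d, off threshold: %zu, triggering: %zu, current: %zu\n",
              worker_id, off_threshold, triggering_count,
              completed_buffers.load());
}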
09-03-2016

In the attached log, I think the places where workers are being deactivated even though a small number of additional buffers are still available (such as off threshold = 1, current = 2) are artifacts of the deactivation test being performed before the log message is printed, using a separate fetch of the number of enqueued buffers. Some other thread could enqueue a buffer in between. These are OK.
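
The benign interleaving described here can be sketched as follows (illustrative code, not the HotSpot sources): the deactivation decision and the log message each do their own read of the buffer count, and a mutator thread can enqueue a buffer between the two reads, so the logged "current" value can legitimately sit a little above the "off" threshold.

#include <atomic>
#include <cstddef>
#include <cstdio>

static std::atomic<size_t> completed_buffers{0};

void maybe_deactivate(int worker_id, size_t off_threshold) {
  size_t at_decision = completed_buffers.load();    // read #1: drives the decision
  if (at_decision <= off_threshold) {
    // <-- another thread may enqueue a buffer right here
    size_t at_logging = completed_buffers.load();   // read #2: what gets printed
    std::printf("deactivated worker %d, off threshold %zu, current %zu\n",
                worker_id, off_threshold, at_logging);
  }
}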
09-03-2016

In the attached log, I think the places where workers are being deactivated - even though there are many more buffers available than the off threshold - are due to abort requests for a pending pause. These are ok.
09-03-2016

Another issue is that even when trying to disable concurrent refinement (by specifying -XX:-G1UseAdaptiveConcRefinement and setting the thresholds to high values), there is still refinement going on, wasting CPU resources for no particular gain.
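
For reference, the configuration being described is roughly of the following shape; the zone flags are the standard G1 concurrent refinement options, and the values here are just arbitrarily large examples:

java -XX:+UseG1GC -XX:-G1UseAdaptiveConcRefinement \
     -XX:G1ConcRefinementGreenZone=100000 \
     -XX:G1ConcRefinementYellowZone=200000 \
     -XX:G1ConcRefinementRedZone=300000 \
     <application>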
07-08-2015

This has been an issue since the initial import.
05-08-2015