JDK-8215047 : Task terminators do not complete termination in consistent state
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 8-shenandoah,11-shenandoah,12,13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2018-12-08
  • Updated: 2019-06-18
  • Resolved: 2019-01-29
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 12 JDK 13
12.0.2Fixed 13 b06Fixed
Related Reports
Blocks :  
Blocks :  
Blocks :  
Duplicate :  
Relates :  
Relates :  
Relates :  
Description
Assertion in ParallelTaskTerminator::reset_for_reuse() suggests that terminator should complete termination in consistent state. e.g. _offered_termination == _n_threads for "succeeded termination" or _offered_termination == 0 for "aborted termination"

Newly upstreamed OWST task terminator has a rare race window that results it to terminate in bad state.

When termination carries TerminatorTerminator, and its decision (TerminatorTerminator::should_exit_termination) depends on external condition, the race window could result some workers see "succeeded termination", while others see "aborted termination". The workers see "succeeded termination" do not decrease _offered_termination count, while the others, who see "aborted termination", do, that leaves 0 < _offered_termination < _n_threads.

The inconsistent state is due to spin master detecting termination condition outside lock.

This bug is not fatal in any means, since when TerminatorTerminator::should_exit_termination() returns true, it usually means to abort the termination, but it can trigger assertion failure in reset_for_reuse() method.

We did not encounter this bug in Shenandoah for so long, because we use Terminator as the way it was defined, a stack object, we don't reuse them.


Comments
Fix Request: I would like to backport this patch to JDK12u. This patch improves task termination latency by eliminating false positive result during termination check.
28-02-2019

Code review: http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2019-January/024404.html
09-01-2019

Removing side-effect from G1CMTask::should_exit_termination(), seems to fix G1 tests (G1 patch attached).
12-12-2018

I think I have clues why my debugging "guarantee" fails and why current implementation never hits retry on completed terminator (assert(_offered_termination < _n_threads) failed: Invariant). G1CMTask::should_exit_termination() has side-effect of setting task's aborted flag before returning true, and current task terminator always aborts termination if should_exit_termination() returns true, even if termination has been reached. Above behaviors conflict with the patch I proposed, that is, once terminator reaches termination state, it ignores should_exit_termination() to ensure that all threads, who participate termination protocol, see the same result.
12-12-2018

Fixes for ParallelTaskTerminator and OWSTTaskTerminator: http://cr.openjdk.java.net/~zgu/JDK-8215220/webrev.00/ But both terminators failed some of G1 tests due to problem mentioned above.
12-12-2018

I really don't think no offering termination in aborted paths, is a real issue. I believe the inconsistent state, is really due to race. For example, above loop can be re-executed while terminator in weird state. I tried following patch, newly added guarantee failed. diff --git a/src/hotspot/share/gc/g1/g1ConcurrentMark.cpp b/src/hotspot/share/gc/g1/g1ConcurrentMark.cpp --- a/src/hotspot/share/gc/g1/g1ConcurrentMark.cpp +++ b/src/hotspot/share/gc/g1/g1ConcurrentMark.cpp @@ -2792,6 +2792,7 @@ guarantee(_cm->mark_stack_empty(), "only way to reach here"); guarantee(_task_queue->size() == 0, "only way to reach here"); guarantee(!_cm->has_overflown(), "only way to reach here"); + guarantee(!has_aborted(), "Conflicted result"); } else { // Apparently there's more work to do. Let's abort this task. It // will restart it and we can hopefully find more things to do. My OWST fix enforces that, if one worker sees "succeeded" termination, all workers see "succeeded termination", which results following assertion failure in G1: # Internal Error (/home/zgu/workspace/jdk/src/hotspot/share/gc/shared/owstTaskTerminator.cpp:35), pid=24205, tid=24212 # assert(_offered_termination < _n_threads) failed: Invariant #
11-12-2018

Yes, I totally agree, this is a G1 issue where some threads may not always participate in the task termination protocol at all, which means that obviously the task termination protocol will not be in a consistent state afterwards. And it's not dependend on OWST. :) I think a quick fix would be, as you suggested, have G1 re-execute offer_termination() if it prematurely aborted (in the else-part of the code snippet you posted). Note that I am not completely sure this is sufficient.
10-12-2018

I think ParallelTaskTerminator suffers the same problem as OWSTTaskTerminator, described by this bug. There is always a possibility that one worker detects "aborted" termination, e.g. tt->should_exit_termination() returns true, before it has chance to decrease _offered_termination, another worker arrives, increases _offered_termination and detects "succeeded" termination (_offered_termination == _n_threads). If early worker exits protocol, then detects "aborted" condition outside the protocol and just leaves, the terminator will be in bad state (_offered_termination != 0 && _offered_termination != _n_threads). From the assertion inside reset_for_reuse(), it suggests that termination should always finish in good state, regardless it is a "succeeded" or "aborted" termination, which is what this RFE tries to accomplish. It can be done with OWSTTaskTerminator easily, because it can always increase/decrease _offered_termination counter behind _blocker lock, therefore, it closes the race window. I tested my fix with gc/g1/humongousObjects/TestNoAllocationsInHRegions.java, I got following assertion failure. # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/zgu/workspace/jdk/src/hotspot/share/gc/shared/owstTaskTerminator.cpp:35), pid=24205, tid=24212 # assert(_offered_termination < _n_threads) failed: Invariant # Which suggests that someone reuses a terminator, which completed termination in previous cycle, without resetting. Following code is suspicious: G1CMConcurrentMarkingTask::work(uint worker_id) { .... if (!_cm->has_aborted()) { do { task->do_marking_step(G1ConcMarkStepDurationMillis, true /* do_termination */, false /* is_serial*/); _cm->do_yield_check(); } while (!_cm->has_aborted() && task->has_aborted()); } ... Inside task->do_marking_step(), it executes termination protocol, would it possible to re-execute termination protocol after terminator successfully terminated in previous cycle?
08-12-2018

[~tschatzl Did you see G1 problem with ParallelTaskTerminator? I can reproduce what you described with "gc/g1/humongousObjects/TestNoAllocationsInHRegions.java" test using ParallelTaskTerminator, after replacing assignment with reset_for_reuse(). Now, I wonder how Google's patch can work. Thanks.
08-12-2018

[~tschatzl Yes, I think it is the bug what you saw, because it causes some workers see "succeeded termination", others see "aborted termination" I don't think what you described "G1 problem", is actually a problem, that says, not all workers have to offer termination in "aborted" cycle, but workers offer termination should see the same termination condition, aborted or succeeded. For workers offer termination, they all decrease offered_termination counter when they exit termination, while workers that do not offer termination, do not contribute offered_termination count, so for "aborted termination", the offered_termination counter should go down to zero at the end. For "succeeded termination", offered_termination count should equal to n_threads. Another evidence, Google's patch [1] actually uses "reset_for_reuse()" in G1ConcurrentMark::set_concurrency(), where you saw the failure. [1] http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2018-December/024254.html
08-12-2018

Does this refer to the issue I mentioned in the review thread? If so, I do not understand this problem description, or at least it refers to a different issue with similar result. The situation in the bug described during review is that G1 in particular does not adhere to the task termination protocol, i.e. in my view all threads at some point must call offer_termination(). However, there is a situation with G1 where one or more threads do _not_ call offer_termination(): i.e. some threads detect the exit condition by other means and just skip the offer_termination() call. This is a violation of the general task termination protocol imho, that is not specific to the OWST terminator. In fact, the ParallelTaskTerminator obviously shows the same issue. (In both cases though these seem to be benign accounting issues)
08-12-2018