Bug ID: JDK-8065402 G1 does not expand marking stack when mark stack overflow happens during concurrent marking

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 9

Priority: P4
Status: Resolved
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2014-11-19
Updated: 2018-06-21
Resolved: 2017-05-09

Fixed in: JDK 10, build b21

The attached spreadsheet summarizes Intel's experiment, which compares manually increased MarkStackSize values against the number of
'[GC concurrent-mark-reset-for-overflow]' messages.

The observation is that G1 does not expand the mark stack from MarkStackSize toward MarkStackSizeMax when a concurrent mark overflow happens. Users have to increase MarkStackSize manually.

The expand flag is set when concurrent-mark-reset-for-overflow happens.
The issue is that we only try to expand the mark stack in void ConcurrentMark::checkpointRootsFinal(bool clear_all_soft_refs).
If there is no overflow during remark, at the end of that method we call set_non_marking_state() and then try to expand the mark stack.

set_non_marking_state() calls reset_marking_state(), which resets the expand flag based on _cm->has_overflown(). The _cm overflow flag has already been cleared during marking, so when we check if (_markStack.should_expand()), it is always false.
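The sequence described above can be modeled in a few lines. This is only a toy sketch: ToyMarkStack, ToyConcurrentMark, and their member names are hypothetical stand-ins for the HotSpot classes, and the point where the overflow flag is cleared is simplified to "before marking restarts".

```cpp
#include <cassert>

// Toy model only: these types are simplified stand-ins, not HotSpot classes.
struct ToyMarkStack {
  bool should_expand = false;
};

struct ToyConcurrentMark {
  bool has_overflown = false;
  ToyMarkStack mark_stack;

  // The expand flag is recomputed from the overflow flag on every reset.
  void reset_marking_state() {
    mark_stack.should_expand = has_overflown;
  }

  // Overflow during concurrent marking: the flag is set, the state is reset
  // (so should_expand becomes true), and the overflow flag is cleared again
  // before marking restarts (simplifying assumption).
  void overflow_during_concurrent_mark() {
    has_overflown = true;
    reset_marking_state();
    has_overflown = false;
  }

  // End of a remark pause: the state is reset once more, which overwrites
  // should_expand with the current (already cleared) overflow flag.
  void set_non_marking_state() {
    reset_marking_state();
  }

  // Returns what the later "if (_markStack.should_expand())" check would see
  // after a remark pause that had no overflow of its own.
  bool finish_remark_without_overflow() {
    set_non_marking_state();
    return mark_stack.should_expand;
  }
};
```

In this model the flag is true right after the concurrent overflow and false again by the time the expansion check runs, which matches the behavior reported here.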

Constructed test b8065402.java that does not require any VM modifications to reproduce the issue.

Testing with this file:
    jdk-base/bin/java -Xmx32g -server -XX:+UseG1GC -Xlog:gc=debug b8065402 150
    Duration - 182.831 sec (with 5 full gc)

Proposed fix:
    jdk-fix/bin/java -Xmx32g -server -XX:+UseG1GC -Xlog:gc=debug b8065402 150
    Duration - 68.267 sec (with 0 full gc)
    Mark Stack was expanded (twice): 4M -> 8M -> 16M
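The 4M -> 8M -> 16M sequence is consistent with doubling the capacity on each expansion. A minimal sketch of such a step follows, assuming a doubling policy capped at an upper limit such as MarkStackSizeMax. The function name is hypothetical and this is not the code of the actual fix.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the next mark stack capacity, doubling the
// current one but never exceeding max_capacity. Units are arbitrary (the
// comment above reports sizes in "M").
std::size_t next_mark_stack_capacity(std::size_t current, std::size_t max_capacity) {
  if (current >= max_capacity) {
    return max_capacity;  // already at the limit, nothing left to grow
  }
  if (current > max_capacity / 2) {
    return max_capacity;  // doubling would overshoot, so clamp to the limit
  }
  return current * 2;
}
```

Starting from 4 with a limit of 512, two calls give 8 and then 16, mirroring the two expansions reported for the test run.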
11-04-2017
Added test for reproducing issue
11-04-2017
I simulated this issue with the following debugging change, which decreases the size of G1CMTaskQueue:

    --- a/src/share/vm/gc/g1/g1ConcurrentMark.hpp Tue Mar 14 22:14:33 2017 -0700
    +++ b/src/share/vm/gc/g1/g1ConcurrentMark.hpp Wed Mar 22 13:30:14 2017 -0400
    @@ -93,7 +93,7 @@
     #pragma warning(pop)
     #endif
    -typedef GenericTaskQueue<G1TaskQueueEntry, mtGC> G1CMTaskQueue;
    +typedef GenericTaskQueue<G1TaskQueueEntry, mtGC, 1024> G1CMTaskQueue;
     typedef GenericTaskQueueSet<G1CMTaskQueue, mtGC> G1CMTaskQueueSet;
     // Closure used by CM during concurrent reference discovery
    @@ -221,7 +221,7 @@
     class G1CMMarkStack VALUE_OBJ_CLASS_SPEC {
     public:
       // Number of TaskQueueEntries that can fit in a single chunk.
    -  static const size_t EntriesPerChunk = 1024 - 1 /* One reference for the next pointer */;
    +  static const size_t EntriesPerChunk = 64 - 1 /* One reference for the next pointer */;
     private:
       struct TaskQueueEntryChunk {
         TaskQueueEntryChunk* next;

After that, GCBasher was used as a test with the flag "-XX:MarkStackSize=1K". I ran the test on a machine with 68 cores.

The attached file gclogs.tar.gz contains two log files: base.log, produced with the current way of expanding the mark stack (only if overflow happened in remark), and fix.log, produced with the proposed fix (expand the mark stack if overflow happened during concurrent mark).

base.log has no mark stack expansion, two "Concurrent Mark Abort" messages, and two matching Full GCs.
fix.log has a few "Expanded mark stack" messages, no "Concurrent Mark Abort", and no Full GC.
23-03-2017
A related issue is that the marking cycle takes too long. Mixed GCs cannot start, so a full GC kicks in. Big data workloads (Intel, Oracle NoSQL) work around this by increasing the number of concurrent mark threads. We probably need a separate bug to keep track of this.
10-03-2015
Costs of the current mechanism:
- you really need to start over, with the existing marks on the bitmap kept intact though (afair). This will eventually converge to a successful marking of the entire object graph.
- the above procedure may take so long that you will run into a full gc (confusing to the user)

Costs of expansion:
- using more memory than necessary

One could compare the length of concurrent markings with and without a big enough mark stack on a per-application basis. In the case of this benchmark I do not expect a big performance hit. The full gc due to not completing concurrent marking is a considerable risk though, and most users, particularly those with really large heaps, will probably not care about a few additional MB of mark stack (assumption).
24-11-2014
The current implementation is confusing for users. When they see the overflow message, they expect the marking stack to expand up to the maximum marking stack size. What is the cost of ignoring those overflows and starting over? I do not have data for that. What happens if we keep hitting those overflows during concurrent marking and cannot recover?
21-11-2014
Given the following log output:

    699.092: #320: [GC concurrent-mark-reset-for-overflow]
    700.875: #320: [GC concurrent-mark-end, 2.2655883 secs]
    700.875: #320: [GC remark 700.875: #320: [Finalize Marking, 0.0004584 secs] 700.876: #320: [GC ref-proc700.876: #320: [SoftReference, 0 refs, 0.0005507 secs]700.876: #320: [WeakReference, 266 refs, 0.0003223 secs]700.877: #320: [FinalReference, 31 refs, 0.0002600 secs]700.877: #320: [PhantomReference, 10 refs, 0.0003397 secs]700.877: #320: [JNI Weak Reference, 0.0000309 secs], 0.0017227 secs] 700.877: #320: [Unloading, 0.0026094 secs] 700.880: #320: [GC aggregate-data, 0.0187864 secs]

What happens here is that there is a mark stack overflow during marking. However, G1 can continue and complete marking fine, as indicated by the "concurrent-mark-end" message. The remark phase does not indicate a mark stack overflow either, so the current implementation simply assumes that there is no need to increase the mark stack.

In this case, the concurrent-mark-reset-for-overflow is mostly benign, as marking completes about 1.8 seconds later (699.092 to 700.875). It would be different if G1 marking ran into mark stack overflows continuously so that it could not complete the marking.

The question is, what is the expected behavior of G1:
a) when the mark stack overflows during concurrent mark, just continue/restart marking (and print the message) and hope that it completes, and at the following remark pause expand the stack.
b) when the mark stack overflows during concurrent mark, increase the mark stack and continue, trying to avoid marking restarts in the future. (And do nothing special in the remark pause.)
c) when the mark stack overflows during concurrent mark, just continue/restart marking (and print the message) and hope that it completes, and at the following remark pause do nothing special either if there is no overflow during remark (the current behavior).
d) something else

If a), then it is true that the overflow flag should probably not be cleared after marking is complete, so that remark expands the stack. If b), then the stack needs to expand when the message is printed. If c), nothing further needs to be done.
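To compare options a) to c), it can help to write them down as a decision function. The sketch below is a toy encoding, not HotSpot code: the Policy and Point enums and the function name are hypothetical, and it assumes that under every option an overflow during the remark pause itself still triggers expansion, as in the current behavior.

```cpp
#include <cassert>

// Hypothetical encoding of options a), b) and c) from the comment above.
enum class Policy { A, B, C };

// The two places where a mark stack expansion could be triggered.
enum class Point { ConcurrentOverflow, RemarkPause };

// Returns true if the given policy expands the mark stack at this point.
// Assumption: an overflow during remark itself always leads to expansion.
bool expands_mark_stack(Policy policy, Point point,
                        bool overflowed_in_concurrent_mark,
                        bool overflowed_in_remark) {
  switch (policy) {
    case Policy::A:
      // Remember the concurrent overflow and act on it in the remark pause.
      return point == Point::RemarkPause &&
             (overflowed_in_concurrent_mark || overflowed_in_remark);
    case Policy::B:
      // Expand right when the concurrent overflow is detected; the remark
      // pause gets no additional handling for the concurrent overflow.
      if (point == Point::ConcurrentOverflow) {
        return true;
      }
      return overflowed_in_remark;
    case Policy::C:
      // Current behavior: only an overflow during remark counts.
      return point == Point::RemarkPause && overflowed_in_remark;
  }
  return false;
}
```

For the log above (overflow during concurrent mark, none during remark), option c) never expands, option a) expands at the remark pause, and option b) expands at the moment of the overflow.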
21-11-2014
IMO, the code in concurrentMarkThread.cpp is OK. It just loops while cm()->restart_for_overflow() is true.
20-11-2014
It is true that the code calls reset_marking_state(), which sets the should_expand() flag. I was trying to trace further why should_expand() was true when the concurrent marking overflow happened, but false when we check should_expand() later. It was because set_non_marking_state() calls reset_marking_state(). By that time, the _cm overflow flag is already cleared.
20-11-2014
JDK-8004669 states that JDK-8000244 automatically expands the marking stack during concurrent mark. This is not true.
20-11-2014
I think the thoughts presented in the original summary are misleading. They are about mark stack overflow handling during GC remark only. The code for that during concurrent marking is in concurrentMarkThread.cpp:163-172:

    if (cm()->restart_for_overflow()) {
      if (G1TraceMarkStackOverflow) {
        gclog_or_tty->print_cr("Restarting conc marking because of MS overflow "
                               "in remark (restart #%d).", iter);
      }
      if (G1Log::fine()) {
        gclog_or_tty->gclog_stamp(cm()->concurrent_gc_id());
        gclog_or_tty->print_cr("[GC concurrent-mark-restart-for-overflow]");
      }
    }

which basically does nothing but print at most two messages.
20-11-2014
If we overflow (has_overflown() is set), the code calls only reset_marking_state(), not set_non_marking_state(). reset_marking_state() sets the should_expand() flag in the mark stack according to the has_overflown() flag, which is true at this point. So _markStack.should_expand() should be true here imo.
20-11-2014