JDK-8136854 : G1 ConcurrentG1RefineThread::stop delays JVM shutdown for >150ms
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 9
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2015-09-21
  • Updated: 2018-06-21
  • Resolved: 2016-02-10
Fixed in: JDK 9 b108
Description
G1 introduces >150 ms shutdown lag, allegedly because of Terminator_lock wait in ConcurrentG1RefineThread::stop. This can be demonstrated with a trivial HelloWorld application and G1 instrumentation:
  http://cr.openjdk.java.net/~shade/8136854/Hello.java
  http://cr.openjdk.java.net/~shade/8136854/tracing.patch

Timing on a 1x4x2 i7-4790K, Linux x86_64, jdk9/dev as of Sep 21, 2015.

$ time java -XX:+UseParallelGC Hello
Hello World

real	0m0.030s
user	0m0.036s
sys	0m0.000s

$ time java -XX:+UseConcMarkSweepGC Hello
Hello World

real	0m0.032s
user	0m0.028s
sys	0m0.008s

$ time java Hello
Hello World
ConcurrentG1RefineThread waited for Terminator_lock for 168 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec
ConcurrentG1RefineThread waited for Terminator_lock for 0 msec

real	0m0.208s
user	0m0.040s
sys	0m0.004s

We can clearly see that waiting on Terminator_lock wastes ~150 ms. This issue blocks the development of startup-performance-sensitive JDK features, as their startup overheads drown in this G1 delay.
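To make the shutdown pattern concrete, here is a minimal standalone C++ sketch of the flag-plus-timed-wait handshake described above (illustrative only, not HotSpot code; the mutex, condition variable, and flag names merely stand in for Terminator_lock and _should_terminate): the stopping thread sets a flag and blocks until the worker acknowledges, but the worker only rechecks the flag when its fixed wait interval expires, so shutdown stalls for up to one full interval.

  // Standalone illustration of the shutdown handshake; not HotSpot code.
  #include <chrono>
  #include <condition_variable>
  #include <cstdio>
  #include <mutex>
  #include <thread>

  std::mutex              terminator_lock;   // stand-in for Terminator_lock
  std::condition_variable terminator_cv;
  bool should_terminate = false;             // stand-in for _should_terminate
  bool has_terminated   = false;

  // Worker modeled on a refinement thread: it sleeps in a fixed timed wait
  // and only rechecks the termination flag when the wait expires.
  void worker() {
    std::unique_lock<std::mutex> lk(terminator_lock);
    while (!should_terminate) {
      // 200 ms poll interval: a stop request arriving just after this wait
      // begins is noticed up to ~200 ms late.
      terminator_cv.wait_for(lk, std::chrono::milliseconds(200));
    }
    has_terminated = true;
    terminator_cv.notify_all();              // acknowledge termination
  }

  // Stopping side: set the flag, then wait for the acknowledgement.
  // Note: nothing notifies the worker after the flag is set -- it only wakes
  // up when its timed wait runs out, which is the delay this bug is about.
  void stop_worker() {
    {
      std::lock_guard<std::mutex> lk(terminator_lock);
      should_terminate = true;
    }
    std::unique_lock<std::mutex> lk(terminator_lock);
    while (!has_terminated) {
      terminator_cv.wait(lk);
    }
  }

  int main() {
    std::thread t(worker);
    // Give the worker time to block in its timed wait.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    auto start = std::chrono::steady_clock::now();
    stop_worker();
    t.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    // Prints roughly the remainder of the 200 ms poll interval.
    std::printf("shutdown took %lld ms\n", (long long)ms);
    return 0;
  }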
Comments
Please commit the workaround ASAP, and then we can work on a "signalling" solution (which, I think, would have its own slab of performance problems).
25-01-2016

Some internal discussion a while ago indicated that it would be better to replace the wait-loop-with-arbitrary-delay style of synchronization with regular signalling. Not sure whether it might be better to provide the attached change as a "temporary" fix instead.
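As a rough standalone C++ illustration of the signalling approach being discussed (a sketch under assumed names, not the attached change): once the stopping thread notifies the monitor after setting the flag, the waiter wakes immediately instead of sitting out an arbitrary delay.

  // Standalone illustration of the signalling variant; not HotSpot code.
  #include <condition_variable>
  #include <cstdio>
  #include <mutex>
  #include <thread>

  std::mutex              terminator_lock;
  std::condition_variable terminator_cv;
  bool should_terminate = false;

  // Worker side: a plain wait with no arbitrary timeout -- it is woken by
  // the notify in request_stop() as soon as the flag is set.
  void worker() {
    std::unique_lock<std::mutex> lk(terminator_lock);
    terminator_cv.wait(lk, [] { return should_terminate; });
    std::puts("worker observed the stop request immediately");
  }

  // Stopping side: set the flag under the lock, then signal the monitor.
  void request_stop() {
    {
      std::lock_guard<std::mutex> lk(terminator_lock);
      should_terminate = true;
    }
    terminator_cv.notify_all();
  }

  int main() {
    std::thread t(worker);
    request_stop();
    t.join();
    return 0;
  }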
25-01-2016

Is there any plan to get this fixed in JDK 9 soon? Just asking because it makes it hard to measure startup for very short-running tests (I realize this is a shutdown issue rather than a startup issue, and it can be sidestepped by specifying the throughput collector).
20-01-2016

Fix prototyped by Thomas Schatzl.
06-01-2016

Tried Thomas' patch, it improves the startup time significantly!

$ for I in `seq 1 10`; do time java Hello; done 2>&1 | grep real

baseline/patched, Parallel: 0m0.054s, 0m0.034s, 0m0.033s, 0m0.034s, 0m0.033s, 0m0.033s, 0m0.033s, 0m0.033s, 0m0.033s, 0m0.033s
baseline, G1:               0m0.223s, 0m0.210s, 0m0.208s, 0m0.212s, 0m0.208s, 0m0.209s, 0m0.208s, 0m0.208s, 0m0.208s, 0m0.208s
patched, G1:                0m0.041s, 0m0.052s, 0m0.040s, 0m0.050s, 0m0.040s, 0m0.040s, 0m0.040s, 0m0.040s, 0m0.040s, 0m0.042s
21-10-2015

Can you try the attached patch? It reduces the delay for which the refinement threads wait on the CGC_lock before rechecking the abort condition. The additional volatile declarations are just to make sure the compiler does not do anything odd. It reduces the total runtime ("real") of an empty Java program by 60 ms here.
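A standalone C++ sketch of the shortened-delay idea (illustrative only, not the attached patch; std::atomic<bool> stands in for the patch's volatile flags, and the 10 ms interval is an assumption): the refinement thread still polls, but in much shorter slices, so a stop request is noticed within milliseconds rather than a couple of hundred.

  // Standalone illustration of a shorter recheck interval; not HotSpot code.
  #include <atomic>
  #include <chrono>
  #include <condition_variable>
  #include <mutex>
  #include <thread>

  std::mutex              cgc_lock;              // stand-in for CGC_lock
  std::condition_variable cgc_cv;
  std::atomic<bool>       should_abort{false};   // analogue of the patch's volatile flag

  // Worker: waits on the monitor in short slices and rechecks the abort
  // condition each time around.
  void refinement_worker() {
    std::unique_lock<std::mutex> lk(cgc_lock);
    while (!should_abort.load()) {
      cgc_cv.wait_for(lk, std::chrono::milliseconds(10));
    }
  }

  int main() {
    std::thread t(refinement_worker);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    should_abort.store(true);   // no notify needed; the short poll picks it up within ~10 ms
    t.join();
    return 0;
  }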
21-10-2015

Sounds like the refinement threads are waiting in ConcurrentGCThread::wait_for_universe_init(). This method takes the CGC_lock, tests the conditions
  is_init_completed()
  _should_terminate
and then waits on the CGC_lock. Note that the change to is_init_completed() does not signal any monitor, and _should_terminate is protected by the Terminator_lock but does not signal any monitor either. To clean up termination we should decide which monitor controls _should_terminate, and do matched wait/notify on that monitor. That will make termination more responsive, but still leaves a 200 ms delay on starting some of these GC service threads. Another thing to do is add an increasing delay:
  delay = 25;
  wait(delay);
  delay = MIN(delay * 2, 200);
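A standalone C++ sketch of the increasing-delay suggestion above (illustrative only; the mutex and condition variable stand in for the CGC_lock monitor, and nothing signals the waiter, mirroring the situation described in this comment): early wake-ups are cheap, and the interval doubles toward 200 ms so an idle thread does not spin.

  // Standalone illustration of the back-off wait; not HotSpot code.
  #include <algorithm>
  #include <chrono>
  #include <condition_variable>
  #include <mutex>
  #include <thread>

  std::mutex              cgc_lock;
  std::condition_variable cgc_cv;
  bool init_completed   = false;
  bool should_terminate = false;

  // Startup wait with an increasing back-off: start with a short wait so an
  // early shutdown is noticed quickly, and grow the interval toward 200 ms.
  void wait_for_init_with_backoff() {
    std::unique_lock<std::mutex> lk(cgc_lock);
    int delay_ms = 25;
    while (!init_completed && !should_terminate) {
      cgc_cv.wait_for(lk, std::chrono::milliseconds(delay_ms));
      delay_ms = std::min(delay_ms * 2, 200);
    }
  }

  int main() {
    std::thread t(wait_for_init_with_backoff);
    std::this_thread::sleep_for(std::chrono::milliseconds(60));
    {
      std::lock_guard<std::mutex> lk(cgc_lock);
      should_terminate = true;   // deliberately not notified, as in the scenario above
    }
    t.join();                    // the worker notices at its next timeout (~75 ms in this run)
    return 0;
  }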
20-10-2015

patch for the instrumentation
20-10-2015

Could you attach your patch so that others can decode what your instrumentation means?
20-10-2015

Instrumented concurrentG1RefineThread.cpp. It seems the delay is not in stopping the concurrentRefinementThread but in starting it.

time $JAVA_HOME/bin/java -XX:+UseG1GC -XX:G1ConcRefinementThreads=1 -XX:ParallelGCThreads=1 -Xmx602624k -XX:ConcGCThreads=1 Hello
Hello World
terminator_lock_start 0.1236
start run_young_rs_sampleing 0.2129
sample_young_list_start 0.2129, sample_young_list_time 0.0000, monitor_wait_time 0.0000
start concurrentRefinementThread 0.2131
end 0.2132, diff 0.0001
terminator_lock_time 0.0000, monitor_notify_time 0.0000, terminator_wait_time 0.0896
terminator_lock_start 0.2134
wait_service_interval 0.0005
terminator_lock_time 0.0000, monitor_notify_time 0.0000, terminator_wait_time 0.0001

real	0m0.228s
user	0m0.168s
sys	0m0.060s

terminator_lock_start is the timestamp at which we try to shut down the JVM. The lines

start run_young_rs_sampleing 0.2129
sample_young_list_start 0.2129, sample_young_list_time 0.0000, monitor_wait_time 0.0000
start concurrentRefinementThread 0.2131
end 0.2132, diff 0.0001

show when young_rs_sampling and the concurrentRefinementThread start. Only then do they detect should_terminate and terminate.
19-10-2015

This doesn't look like a problem with the Mutexes, but rather the fact that the code is waiting for the ConcurrentG1RefineThread to stop executing. So I don't think a Semaphore would help here. Could you also instrument what the thread is doing by measuring the time spent in void ConcurrentG1RefineThread::run()?
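A standalone C++ sketch of the kind of phase timing being asked for (illustrative only; this is not the instrumentation patch attached to this issue, and the phase names are placeholders): timestamp the interesting points of the thread's run loop against a common start time and print the deltas.

  // Standalone illustration of phase timing in a run loop; not HotSpot code.
  #include <chrono>
  #include <cstdio>

  // Seconds elapsed since a reference point, matching the style of the
  // timestamps in the instrumentation output above.
  static double seconds_since(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
  }

  // Hypothetical run() body with timing around its two phases.
  void instrumented_run() {
    auto t0 = std::chrono::steady_clock::now();

    // ... phase 1: wait for initialization ...
    double after_init_wait = seconds_since(t0);

    // ... phase 2: main refinement loop ...
    double after_work = seconds_since(t0);

    std::printf("init wait ended at %.4f, work ended at %.4f\n",
                after_init_wait, after_work);
  }

  int main() {
    instrumented_run();
    return 0;
  }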
08-10-2015

That acknowledgement wait could be replaced by either a semaphore or separate per-thread locks.
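A standalone C++20 sketch of the semaphore-based acknowledgement idea (illustrative only; std::counting_semaphore stands in for whatever primitive HotSpot would actually use, and the worker loop is a placeholder): each terminating worker signals once, and the shutdown path waits for one signal per worker instead of polling a monitor.

  // Standalone illustration of semaphore acknowledgement (C++20); not HotSpot code.
  #include <atomic>
  #include <semaphore>
  #include <thread>
  #include <vector>

  constexpr int kWorkers = 4;

  std::atomic<bool>                 should_terminate{false};
  std::counting_semaphore<kWorkers> terminated{0};   // acknowledgement counter

  // Worker: on observing the stop flag, signal the semaphore once.
  void worker() {
    while (!should_terminate.load()) {
      std::this_thread::yield();   // placeholder for real refinement work
    }
    terminated.release();          // acknowledge termination
  }

  // Shutdown: set the flag, then wait for one acknowledgement per worker.
  void stop_all() {
    should_terminate.store(true);
    for (int i = 0; i < kWorkers; i++) {
      terminated.acquire();
    }
  }

  int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < kWorkers; i++) {
      threads.emplace_back(worker);
    }
    stop_all();
    for (auto& t : threads) {
      t.join();
    }
    return 0;
  }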
24-09-2015

Could potentially be fixed by using a Semaphore instead of the Monitor for the Terminator_lock. Note that this is just a half-baked idea; not sure if that is actually possible.
23-09-2015

On intelsdv44, 4-socket, 10-core/socket:

==== default G1 GC
Hello World
Heap
 garbage-first heap   total 2068480K, used 4096K [0x0000000082000000, 0x0000000082400fc8, 0x0000000800000000)
  region size 4096K, 2 young (8192K), 0 survivors (0K)
 Metaspace       used 2884K, capacity 4486K, committed 4864K, reserved 1056768K
  class space    used 303K, capacity 386K, committed 512K, reserved 1048576K

real	0m0.701s
user	0m0.462s
sys	0m0.335s

==== limit heap size
Hello World

real	0m0.256s
user	0m0.160s
sys	0m0.105s

==== limit heap size and threads to 10
Hello World

real	0m0.231s
user	0m0.165s
sys	0m0.061s

==== limit heap size and threads to 1 thread
Hello World

real	0m0.224s
user	0m0.165s
sys	0m0.047s

==== Parallel GC, limit threads to 10
Hello World
Heap
 PSYoungGen      total 602624K, used 41329K [0x0000000580100000, 0x00000005aa180000, 0x0000000800000000)
  eden space 516608K, 8% used [0x0000000580100000,0x000000058295c438,0x000000059f980000)
  from space 86016K, 0% used [0x00000005a4d80000,0x00000005a4d80000,0x00000005aa180000)
  to   space 86016K, 0% used [0x000000059f980000,0x000000059f980000,0x00000005a4d80000)
 ParOldGen       total 1377792K, used 0K [0x0000000080200000, 0x00000000d4380000, 0x0000000580100000)
  object space 1377792K, 0% used [0x0000000080200000,0x0000000080200000,0x00000000d4380000)
 Metaspace       used 2881K, capacity 4486K, committed 4864K, reserved 1056768K
  class space    used 303K, capacity 386K, committed 512K, reserved 1048576K

real	0m0.135s
user	0m0.137s
sys	0m0.059s

Part of the delay comes from G1 trying to start with a much bigger heap. Reducing the heap size to the Parallel GC default helps, but the time is still more than with Parallel GC, and it is not related to the number of threads. On the other hand, G1 has to stop the parallel GC worker threads, concurrent refinement threads, concurrent marking threads, and the StringDedupThread: many more threads than Parallel GC has to stop.
22-09-2015