JDK-6919638 : CMS: ExplicitGCInvokesConcurrent misinteracts with gc locker
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs14.3,6u21
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic,solaris_10
  • CPU: generic,sparc
  • Submitted: 2010-01-25
  • Updated: 2019-10-31
  • Resolved: 2010-05-18
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6 JDK 7 Other
6u20-revFixed 7Fixed hs16.3Fixed
Related Reports
Relates :  
Relates :  
Description
User reports that when they use NIO, the implementation of the
gc locker with CMS is causing periods when threads doing get()s and put()s
on NIO buffers are getting blocked for several seconds at a time.

They have provided a test case that purportedly reproduces the
issues that they are having.

They claim that rewriting their application to avoid the use of NIO
is avoiding the blockage of threads involved in IO.

An email trail of the incident and its description and follow-up can
be found starting at the following URL :-

http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-January/000486.html

Comments
EVALUATION http://hg.openjdk.java.net/hsx/hsx17/baseline/rev/e90d89dab490
22-04-2011

SUGGESTED FIX See Evaluation section for changeset.
04-05-2010

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/bb843ebc7c55
04-05-2010

EVALUATION JNI Critical Section entries were being "fenced" for the entire duration of the concurrent collection, causing some application threads to be blocked until the concurrent collection finished. This included the last thread exiting the JNI critical section before a scavenge was induced. While fixing this bug will allow somewhat smoother operation of JNI critical section entries, there will still be cases when excessive or long uses of JNI critical sections may still cause entering threads to block for periods of time that are longer than the duration of a single scavenge. The worst case such blockage will be proportional to the longest JNI critical sections, and to the number of threads that may concurrently be in such critical sections. Thus, fixing this bug will still not match the performance improvement obtained for NIO get/put from fixing 6570619. See also Comments field, entry #10
26-04-2010

SUGGESTED FIX The change below to vmCMSOperation.cpp fixes the bug by allowing the _gc_locker precipitated concurrent full gc not to block the caller until it's completed. The change to genCollectedHeap.cpp further below is still being debated since it decouples the gc locker policy for CMS from ExplicitGCInvokesConcurrent (since no need is seen for linking the two together in any way). Whether the gc locker should cause a concurrent full collection should be more nuanced than the current somewhat ad-hoc bang-bang policy. diff -r 6ecb6e6de3d6 src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp --- a/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp Mon Apr 19 05:40:21 2010 -0700 +++ b/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp Thu Apr 22 22:35:14 2010 -0700 @@ -190,7 +190,7 @@ void VM_GenCollectFullConcurrent::doit() CMSCollector::disable_icms(); // In case CMS thread was in icms_wait(), wake it up. CMSCollector::start_icms(); - // Nudge the CMS thread to start a concurrent collection + // Nudge the CMS thread to start a concurrent collection. CMSCollector::request_full_gc(_full_gc_count_before); } else { FullGCCount_lock->notify_all(); // Inform the Java thread its work is done @@ -231,7 +231,8 @@ void VM_GenCollectFullConcurrent::doit_e // e.g. at the rate of 1 full gc per ms, this could // overflow in about 1000 years. GenCollectedHeap* gch = GenCollectedHeap::heap(); - if (gch->total_full_collections_completed() <= _full_gc_count_before) { + if (_gc_cause != GCCause::_gc_locker && + gch->total_full_collections_completed() <= _full_gc_count_before) { // Now, wait for witnessing concurrent gc cycle to complete, // but do so in native mode, because we want to lock the // FullGCEvent_lock, which may be needed by the VM thread =============================== still under debate: ------------------- diff -r 6ecb6e6de3d6 src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp --- a/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp Mon Apr 19 05:40:21 2010 -0700 +++ b/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp Thu Apr 22 22:58:50 2010 -0700 @@ -126,8 +126,7 @@ class VM_GenCollectFullConcurrent: publi GCCause::Cause gc_cause) : VM_GC_Operation(gc_count_before, full_gc_count_before, true /* full */) { _gc_cause = gc_cause; - assert(FullGCCount_lock != NULL && UseConcMarkSweepGC && - ExplicitGCInvokesConcurrent, "Otherwise shouldn't be here"); + assert(FullGCCount_lock != NULL, "Error"); assert(UseAsyncConcMarkSweepGC, "Else will hang caller"); } ~VM_GenCollectFullConcurrent() {} diff -r 6ecb6e6de3d6 src/share/vm/memory/genCollectedHeap.cpp --- a/src/share/vm/memory/genCollectedHeap.cpp Mon Apr 19 05:40:21 2010 -0700 +++ b/src/share/vm/memory/genCollectedHeap.cpp Thu Apr 22 22:35:14 2010 -0700 @@ -410,9 +410,9 @@ bool GenCollectedHeap::must_clear_all_so } bool GenCollectedHeap::should_do_concurrent_full_gc(GCCause::Cause cause) { - return (cause == GCCause::_java_lang_system_gc || - cause == GCCause::_gc_locker) && - UseConcMarkSweepGC && ExplicitGCInvokesConcurrent; + return UseConcMarkSweepGC && + (cause == GCCause::_gc_locker || + (cause == GCCause::_java_lang_system_gc && ExplicitGCInvokesConcurrent)); } void GenCollectedHeap::do_collection(bool full,
23-04-2010

SUGGESTED FIX <<deleted>>
22-04-2010

SUGGESTED FIX <<deleted>>
05-04-2010

SUGGESTED FIX <<deleted>>
01-04-2010

EVALUATION ###@###.### informs me that in JDK 7, NIO bulk copies do not use JNI critical sections, which explains why the blocking behaviour is not seen in JDK 7. Cross out my observation about large copies in JDK 7 being broken into smaller JNI critical sections (above, display 2). While that is certainly a difference, those routines are not used in JDK 7 to implement NIO bulk copying; rather the NIO bulk copies use Unsafe copyMemory which entirely avoids the gc locker issues above (which issues should nonetheless be fixed for any remianing current and future clients of the JNI critical section interfaces; for instance jdk's older than 7).
25-01-2010

SUGGESTED FIX <<deleted>>
25-01-2010

WORK AROUND -XX:-ExplicitGCInvokesConcurrent
25-01-2010

EVALUATION User observed the crucial fact that -XX:+ExplicitGCInvokesConcurrent is essential to reproduce the blockage. With that crucail piece of information, it was easy to see the root cause. The bug is in the gc locker malinteracting with +ExplicitGCInvokesconcurrent, so that while a concurrent background collection is in progress, the "ongoing gc" flag stays set, so JNI critical sections stay blocked for that duration. This malinteraction will have to be fixed so that even while the bkgrd collection is in progress, JNI critical sections are allowed (which really was the original intent, except for the incorrect implementation).
25-01-2010

EVALUATION One difference between 7 and 6uXX is the amount of copying done per JNI critical section. In jdk 7, large copies are broken into smaller critical sections. This can prevent starvation of some get()'s or put()'s because of the gc locker fencing off new entries into JNI CS while Eden is full and allocation is proceeding slow path. The root cause of the blockage of the customer application must lie in this. I'll find the relevant JDK CR and link this up to it. I'll also see if the JDK 6'ism of large monolithic copies cannot be somehow mitigated in the JVM, although i doubt it. What is still not clear is why the blockages seem to coincide with CMS cycles. It must be that the CMS cycles are somehow slowing down or preventing the exit of the threads doing memcpy inside the JNI critical section. Investigation ongoing.
25-01-2010

EVALUATION So far I have been able to reproduce the problem with 6u17 and 6u18, but not with the latest jdk 7. Still investigating as to this difference and indeed the root cause.
25-01-2010