United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-6919638 : CMS: ExplicitGCInvokesConcurrent misinteracts with gc locker

Details
Type:
Bug
Submit Date:
2010-01-25
Status:
Resolved
Updated Date:
2011-04-22
Project Name:
JDK
Resolved Date:
2010-05-18
Component:
hotspot
OS:
generic,solaris_10
Sub-Component:
gc
CPU:
sparc,generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
hs14.3,6u21
Fixed Versions:
hs19 (b01)

Related Reports
Backport:
Backport:
Backport:
Backport:
Backport:
Backport:
Backport:
Backport:
Backport:
Backport:

Sub Tasks

Description
User reports that when they use NIO, the implementation of the
gc locker with CMS is causing periods when threads doing get()s and put()s
on NIO buffers are getting blocked for several seconds at a time.

They have provided a test case that purportedly reproduces the
issues that they are having.

They claim that rewriting their application to avoid the use of NIO
is avoiding the blockage of threads involved in IO.

An email trail of the incident and its description and follow-up can
be found starting at the following URL :-

http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2010-January/000486.html

                                    

Comments
EVALUATION

http://hg.openjdk.java.net/hsx/hsx17/baseline/rev/e90d89dab490
                                     
2011-04-22
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/bb843ebc7c55
                                     
2010-05-04
SUGGESTED FIX

See Evaluation section for changeset.
                                     
2010-05-04
EVALUATION

JNI Critical Section entries were being "fenced" for the entire duration of the
concurrent collection, causing some application threads to be blocked until
the concurrent collection finished. This included the last thread exiting
the JNI critical section before a scavenge was induced.

While fixing this bug will allow somewhat smoother operation of JNI
critical section entries, there will still be cases when excessive
or long uses of JNI critical sections may still cause entering threads
to block for periods of time that are longer than the duration of a single
scavenge. The worst case such blockage will be proportional to the longest
JNI critical sections, and to the number of threads that may concurrently
be in such critical sections.

Thus, fixing this bug will still not match the performance improvement
obtained for NIO get/put from fixing 6570619.

See also Comments field, entry #10
                                     
2010-04-26
SUGGESTED FIX

The change below to vmCMSOperation.cpp fixes the bug by
allowing the _gc_locker precipitated concurrent full gc not
to block the caller until it's completed.  The change to
genCollectedHeap.cpp further below is still being debated since it decouples
the gc locker policy for CMS from ExplicitGCInvokesConcurrent
(since no need is seen for linking the two together in any way).
Whether the gc locker should cause a concurrent full collection
should be more nuanced than the current somewhat ad-hoc bang-bang policy.

diff -r 6ecb6e6de3d6 src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp
--- a/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp    Mon Apr 19 05:40:21 2010 -0700
+++ b/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.cpp    Thu Apr 22 22:35:14 2010 -0700
@@ -190,7 +190,7 @@ void VM_GenCollectFullConcurrent::doit()
     CMSCollector::disable_icms();
     // In case CMS thread was in icms_wait(), wake it up.
     CMSCollector::start_icms();
-    // Nudge the CMS thread to start a concurrent collection
+    // Nudge the CMS thread to start a concurrent collection.
     CMSCollector::request_full_gc(_full_gc_count_before);
   } else {
     FullGCCount_lock->notify_all();  // Inform the Java thread its work is done
@@ -231,7 +231,8 @@ void VM_GenCollectFullConcurrent::doit_e
   // e.g. at the rate of 1 full gc per ms, this could
   // overflow in about 1000 years.
   GenCollectedHeap* gch = GenCollectedHeap::heap();
-  if (gch->total_full_collections_completed() <= _full_gc_count_before) {
+  if (_gc_cause != GCCause::_gc_locker &&
+      gch->total_full_collections_completed() <= _full_gc_count_before) {
     // Now, wait for witnessing concurrent gc cycle to complete,
     // but do so in native mode, because we want to lock the
     // FullGCEvent_lock, which may be needed by the VM thread


===============================
still under debate:
-------------------

diff -r 6ecb6e6de3d6 src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp
--- a/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp    Mon Apr 19 05:40:21 2010 -0700
+++ b/src/share/vm/gc_implementation/concurrentMarkSweep/vmCMSOperations.hpp    Thu Apr 22 22:58:50 2010 -0700
@@ -126,8 +126,7 @@ class VM_GenCollectFullConcurrent: publi
                               GCCause::Cause gc_cause)
     : VM_GC_Operation(gc_count_before, full_gc_count_before, true /* full */) {
     _gc_cause = gc_cause;
-    assert(FullGCCount_lock != NULL && UseConcMarkSweepGC &&
-           ExplicitGCInvokesConcurrent, "Otherwise shouldn't be here");
+    assert(FullGCCount_lock != NULL, "Error");
     assert(UseAsyncConcMarkSweepGC, "Else will hang caller");
   }
   ~VM_GenCollectFullConcurrent() {}

diff -r 6ecb6e6de3d6 src/share/vm/memory/genCollectedHeap.cpp
--- a/src/share/vm/memory/genCollectedHeap.cpp  Mon Apr 19 05:40:21 2010 -0700
+++ b/src/share/vm/memory/genCollectedHeap.cpp  Thu Apr 22 22:35:14 2010 -0700
@@ -410,9 +410,9 @@ bool GenCollectedHeap::must_clear_all_so
 }

 bool GenCollectedHeap::should_do_concurrent_full_gc(GCCause::Cause cause) {
-  return (cause == GCCause::_java_lang_system_gc ||
-          cause == GCCause::_gc_locker) &&
-         UseConcMarkSweepGC && ExplicitGCInvokesConcurrent;
+  return UseConcMarkSweepGC &&
+         (cause == GCCause::_gc_locker ||
+          (cause == GCCause::_java_lang_system_gc && ExplicitGCInvokesConcurrent));
 }

 void GenCollectedHeap::do_collection(bool  full,
                                     
2010-04-23
SUGGESTED FIX

<<deleted>>
                                     
2010-04-22
SUGGESTED FIX

<<deleted>>
                                     
2010-04-05
SUGGESTED FIX

<<deleted>>
                                     
2010-04-01
EVALUATION

So far I have been able to reproduce the problem with 6u17 and 6u18, but not
with the latest jdk 7. Still investigating as to this difference and indeed the root
cause.
                                     
2010-01-25
EVALUATION

One difference between 7 and 6uXX is the amount of copying done
per JNI critical section. In jdk 7, large copies are broken
into smaller critical sections. This can prevent starvation of
some get()'s or put()'s because of the gc locker fencing off
new entries into JNI CS while Eden is full and allocation
is proceeding slow path. The root cause of the blockage of
the customer application must lie in this.

I'll find the relevant JDK CR and link this up to it.

I'll also see if the JDK 6'ism of large monolithic copies cannot
be somehow mitigated in the JVM, although i doubt it.

What is still not clear is why the blockages seem to coincide
with CMS cycles. It must be that the CMS cycles are somehow
slowing down or preventing the exit of the threads doing
memcpy inside the JNI critical section. Investigation ongoing.
                                     
2010-01-25
SUGGESTED FIX

<<deleted>>
                                     
2010-01-25
WORK AROUND

-XX:-ExplicitGCInvokesConcurrent
                                     
2010-01-25
EVALUATION

User observed the crucial fact that -XX:+ExplicitGCInvokesConcurrent
is essential to reproduce the blockage. With that crucail piece of information,
it was easy to see the root cause. The bug is in the gc locker malinteracting
with +ExplicitGCInvokesconcurrent, so that while a concurrent background
collection is in progress, the "ongoing gc" flag stays set, so
JNI critical sections stay blocked for that duration. This malinteraction
will have to be fixed so that even while the bkgrd collection is in
progress, JNI critical sections are allowed (which really was the
original intent, except for the incorrect implementation).
                                     
2010-01-25
EVALUATION

###@###.### informs me that in JDK 7, NIO bulk copies
do not use JNI critical sections, which explains why the blocking
behaviour is not seen in JDK 7. 

Cross out my observation about large copies in JDK 7 being broken
into smaller JNI critical sections (above, display 2). While that is
certainly a difference, those routines are not used in JDK 7 to
implement NIO bulk copying; rather the NIO bulk copies use
Unsafe copyMemory which entirely avoids the gc locker issues
above (which issues should nonetheless be fixed for any remianing
current and future clients of the JNI critical section interfaces;
for instance jdk's older than 7).
                                     
2010-01-25



Hardware and Software, Engineered to Work Together