JDK-6186200 : RFE: Stall allocation requests while heap is full and GC locker is held

Details
Type:
Enhancement
Submit Date:
2004-10-28
Status:
Resolved
Updated Date:
2010-05-10
Project Name:
JDK
Resolved Date:
2005-05-18
Component:
hotspot
OS:
solaris_9,generic
Sub-Component:
gc
CPU:
sparc,generic
Priority:
P2
Resolution:
Fixed
Affected Versions:
1.3.1_11,1.3.1_13,6
Fixed Versions:
1.3.1_17,1.4.2_11,5.0u7,6

Description
As implied in the synopsis above, suppose the heap is nearly full
and a GC will shortly be needed to satisfy any new allocation
requests. Suppose that, at this juncture, one thread T1 in a multi-threaded
application acquires the GC locker in order to perform a fairly
short operation. At this point, let us say T1 gets descheduled and
another thread T2 makes a Java heap allocation request that cannot
be satisfied without a GC.
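
For concreteness, the GC locker is held from the application side via the
JNI Get*Critical functions; a hypothetical native method along the following
lines (class and method names are made up for illustration) would hold the
locker for the duration of the critical section:

    // Hypothetical JNI native method: entering the critical section
    // acquires the GC locker until the matching Release call.
    #include <jni.h>

    extern "C" JNIEXPORT jlong JNICALL
    Java_Example_sumBytes(JNIEnv* env, jclass, jbyteArray src)
    {
        jlong sum = 0;
        jsize len = env->GetArrayLength(src);
        jbyte* p = static_cast<jbyte*>(env->GetPrimitiveArrayCritical(src, NULL));
        if (p != NULL) {   // the GC locker is now held by this thread
            // If this thread (T1) is descheduled here while another thread (T2)
            // needs a GC to satisfy an allocation, the scenario above arises.
            for (jsize i = 0; i < len; i++) sum += p[i];
            env->ReleasePrimitiveArrayCritical(src, p, JNI_ABORT);   // locker released
        }
        return sum;
    }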

As currently implemented, we may bring the world to a safepoint,
attempt a GC, realize that the GC locker has locked out GC, and
return NULL, signifying our failure to allocate the requested
storage. T2 will then get an OutOfMemoryError, which it can catch
and deal with as appropriate, or fail to catch and be terminated.

A different implementation possibility is to hold thread T2's
request in abeyance (not returning NULL) until such time as we are
able to do a garbage collection (i.e. when the GC locker is
vacated). Would that be a friendlier or more useful behaviour
for applications?

Of course, in such a design we would need to consider the possibility
that a thread that holds the GC locker is itself making the allocation
request. That should be easy to track with just a little state
(I think already present) in the thread, and in such cases we can and
would return NULL, because we would otherwise risk deadlock.
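
In outline, the difference between the two policies might look like the
following compilable sketch; all names here (Thread, Heap, attempt_allocation,
wait_until_gc_locker_clear, and so on) are hypothetical stand-ins, not the
actual HotSpot types or entry points:

    // A sketch contrasting the current and proposed allocation policies.
    #include <cstddef>

    struct Thread {
        int jni_critical_count = 0;                  // > 0 while inside a Get*Critical pair
        bool in_critical() const { return jni_critical_count > 0; }
    };

    struct Heap {
        bool  gc_locker_active = false;              // is any thread in a critical section?
        void* attempt_allocation(std::size_t) { return nullptr; }  // stub fast-path attempt
        void  collect() {}                                          // stub stop-the-world GC
        void  wait_until_gc_locker_clear() {}        // block until the locker is vacated
    };

    // Current policy: if the GC locker is held, give up at once; the caller
    // turns the NULL result into an OutOfMemoryError.
    void* allocate_or_fail(Heap* heap, std::size_t size) {
        void* result = heap->attempt_allocation(size);
        if (result != nullptr) return result;
        if (heap->gc_locker_active) return nullptr;  // GC locked out: spurious OOM
        heap->collect();
        return heap->attempt_allocation(size);
    }

    // Proposed policy: stall the request until the locker is vacated, unless
    // the requesting thread itself holds the locker (that would self-deadlock).
    void* allocate_or_stall(Heap* heap, Thread* self, std::size_t size) {
        for (;;) {
            void* result = heap->attempt_allocation(size);
            if (result != nullptr) return result;
            if (heap->gc_locker_active) {
                if (self->in_critical()) return nullptr;  // reflexive case: fail instead
                heap->wait_until_gc_locker_clear();       // hold the request in abeyance
                continue;                                 // retry once a GC is possible
            }
            heap->collect();
            return heap->attempt_allocation(size);        // NULL here means genuine OOM
        }
    }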

However, more subtle deadlocks are possible if T1's
operation has a circular dependency on T2's allocation via
a dependency chain not directly visible to the JVM. Clearly,
that would be a violation by the programmer of the JNI spec's
restrictions on the use of critical sections, and the
JVM could reasonably disclaim responsibility for such user violations.
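
For illustration, such a violation might look like the following contrived
native method (all names are made up): it blocks on a native lock while inside
the critical section, so if the lock's owner is itself stalled waiting for the
GC locker to clear, the two threads deadlock.

    #include <jni.h>
    #include <pthread.h>

    static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

    // Violates the documented Get*Critical restrictions: the thread may block
    // indefinitely while the GC locker is held.
    extern "C" JNIEXPORT void JNICALL
    Java_Example_badCritical(JNIEnv* env, jclass, jbyteArray arr)
    {
        void* p = env->GetPrimitiveArrayCritical(arr, NULL);
        if (p != NULL) {
            pthread_mutex_lock(&shared_lock);    // disallowed: arbitrary blocking call
            pthread_mutex_unlock(&shared_lock);
            env->ReleasePrimitiveArrayCritical(arr, p, JNI_ABORT);
        }
    }
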
###@###.### 10/28/04 19:13 GMT

                                    

Comments
EVALUATION

To the SDN user who asked which versions this is fixed in:
----------------------------------------------------------

This is a bug in at least each of the 1.3.1, 1.4.2, 5.0 trains,
and is fixed in 1.3.1_17, 1.4.2_11 and 5.0u7 releases of those
trains. It's also fixed in Mustang (6.0) beta b37. This information
is also visible in the "Fixed in" field of the bug report.
                                     
2006-04-25
EVALUATION

The bug has been fixed, and tested successfully with big apps and with
targeted regression tests. The fix will be put back to
Mustang upon CCC approval; see http://ccc.sfbay/6186200.


###@###.### 2005-04-18 21:06:41 GMT
                                     
2005-03-16
SUGGESTED FIX

See http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev
###@###.### 2005-04-18 21:06:42 GMT

Event:            putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
                  (jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace:  /net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace
                  (prt-web:/net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace)
User:             ysr

Comment:

---------------------------------------------------------

Original workspace:     neeraja:/net/spot/scratch/ysr/gclocker
Submitter:              ysr
Archived data:          /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/
Webrev:                 http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/workspace/webrevs/webrev-2005.04.20/index.html


Fixed 6186200: RFE: Stall allocation requests while heap is full and GC locker is held

http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev

The problem was that if the heap is getting close to full
and a thread enters a JNI critical section, then an
allocation request that exceeds the available space
will fail, because GC is not allowed and the request
cannot be immediately satisfied. This means that
applications that use JNI critical sections, even for
fairly short durations, may be susceptible to strange
and fleeting OOM errors if a critical section is
entered at just the wrong time. Several such reports have recently
surfaced in the field (Wachovia, Instinet, SAP, SCT, etc.),
some of them resulting in escalations.

Our fix is to stall such requests until the critical
section has cleared, making a GC possible. For defensive
reasons, if the allocating thread is itself in the critical
section, we do not stall. This avoids self-deadlocks, but
of course does not rule out deadlocks arising from
transitive dependencies, not directly or easily visible
or inferable by the VM, from a thread in a JNI critical
section to an allocating thread thus stalled. Clearly,
such dependencies of threads in JNI critical sections
violate the conditions documented in JNI_Get*Critical().
It could be argued that the reflexive dependency is also
such a violation and need not be checked. That is certainly
a reasonable stance; we now flag such reflexive deadlock
possibilities under the -Xcheck:jni flag. I chose
to be liberal here, since the check involves only
state that is already available in the allocating thread.
(I could easily be persuaded not to make a concession for such
self-deadlocks. After all, it could be argued, it's a
good thing not to encourage users to write bad code,
indeed code that violates documented restrictions.)
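
For example, such reflexive cases are surfaced at run time simply by enabling
the checked JNI mode (the application name below is hypothetical):

    java -Xcheck:jni MyCriticalApp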

JVMPI offers an API for disabling GC, which can of course
be used to cause further deadlocks under such circumstances.
The GC locker is also used in certain JVMPI functions to
prevent GC while events are posted. There are example
code paths in the JVM where threads may suspend while holding
the GC locker lock, in response to JVMPI/JVMDI suspension.
These introduce further situations where the application can
deadlock. In all these cases, we will have replaced a potential
OOM error with a new deadlock. However, JVMPI is going away
in Mustang, so we need not worry about these new deadlock
scenarios. In any event, there are deadlocking modes
possible with these and other JVMPI interfaces even without
the use of GC locker. So I do not believe this is a
major issue, albeit one that should be run by JVMPI users
(tool vendors) before (if/when) being back-ported to Tiger.

This change request is currently before the CCC.
We are putting this fix back so as to
allow nightly GC testing before the next integration.
If the CCC recommends modifications, we'll do so
in a future putback as necessary.

A further GC-locker-related fix and a code clean-up
are forthcoming in another putback later today under bug id
4828899.

Reviewed by: Alan Bateman, Paul Hohensee

Fix verified: yes

Verification test: A modified version of Mingayo's JNI locker
   test run with a small heap so as to increase the
   cross-section of the described window of vulnerability

Other testing:
  PRT
  big apps testing (36 hours with all collectors; thanks LiFeng/June)
  runThese -full
  refworkload

Files:
update: src/share/vm/gc_implementation/parallelScavenge/parallelScavengeHeap.cpp
update: src/share/vm/memory/collectorPolicy.cpp
update: src/share/vm/memory/gcLocker.cpp
update: src/share/vm/memory/gcLocker.hpp

Examined files: 3240

Contents Summary:
       4   update
    3236   no action (unchanged)

###@###.### 2005-04-20 21:31:58 GMT
                                     
2005-04-18


