JDK-6186200 : RFE: Stall allocation requests while heap is full and GC locker is held

Details
Type:
Enhancement
Submit Date:
2004-10-28
Status:
Resolved
Updated Date:
2010-05-10
Project Name:
JDK
Resolved Date:
2005-05-18
Component:
hotspot
OS:
solaris_9,generic
Sub-Component:
gc
CPU:
sparc,generic
Priority:
P2
Resolution:
Fixed
Affected Versions:
1.3.1_11,1.3.1_13,6
Fixed Versions:
1.3.1_17,1.4.2_11,5.0u7,6

Description
As implied in the synopsis above, suppose the heap is nearly full
and a GC will shortly be needed to satisfy any new allocation
requests. Suppose that, at this juncture, one thread T1 in a multi-threaded
application acquires the GC locker in order to perform a fairly
short operation. At this point, let us say T1 gets descheduled and
another thread T2 makes a Java heap allocation request that cannot
be satisfied without a GC.
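
For concreteness, the GC locker is held from the application side via the
JNI Get*Critical functions; a hypothetical native method along the following
lines (class and method names are made up for illustration) would hold the
locker for the duration of the critical section:

    // Hypothetical JNI native method: entering the critical section
    // acquires the GC locker until the matching Release call.
    #include <jni.h>

    extern "C" JNIEXPORT jlong JNICALL
    Java_Example_sumBytes(JNIEnv* env, jclass, jbyteArray src)
    {
        jlong sum = 0;
        jsize len = env->GetArrayLength(src);
        jbyte* p = static_cast<jbyte*>(env->GetPrimitiveArrayCritical(src, NULL));
        if (p != NULL) {   // the GC locker is now held by this thread
            // If this thread (T1) is descheduled here while another thread (T2)
            // needs a GC to satisfy an allocation, the scenario above arises.
            for (jsize i = 0; i < len; i++) sum += p[i];
            env->ReleasePrimitiveArrayCritical(src, p, JNI_ABORT);   // locker released
        }
        return sum;
    }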

As currently implemented, we may bring the world to a safepoint,
attempt a GC, realize that the GC locker has locked out GC, and
return NULL, signifying our failure to allocate the requested
storage. T2 will then get an OutOfMemoryError, which it can catch
and deal with as appropriate, or fail to catch and be terminated.

A different implementation possibility is to hold thread T2's
request in abeyance (not returning NULL) until such time as we are
able to do a garbage collection (i.e. when the GC locker is
vacated). Would that be a friendlier or more useful behaviour
for applications?

Of course, in such a design we would need to consider the possibility
that a thread that holds the GC locker is itself making the allocation
request. That should be easy to track with just a little state
(I think already present) in the thread, and in such cases we can and
would return NULL, because we would otherwise risk deadlock.
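
In outline, the difference between the two policies might look like the
following compilable sketch; all names here (Thread, Heap, attempt_allocation,
wait_until_gc_locker_clear, and so on) are hypothetical stand-ins, not the
actual HotSpot types or entry points:

    // A sketch contrasting the current and proposed allocation policies.
    #include <cstddef>

    struct Thread {
        int jni_critical_count = 0;                  // > 0 while inside a Get*Critical pair
        bool in_critical() const { return jni_critical_count > 0; }
    };

    struct Heap {
        bool  gc_locker_active = false;              // is any thread in a critical section?
        void* attempt_allocation(std::size_t) { return nullptr; }  // stub fast-path attempt
        void  collect() {}                                          // stub stop-the-world GC
        void  wait_until_gc_locker_clear() {}        // block until the locker is vacated
    };

    // Current policy: if the GC locker is held, give up at once; the caller
    // turns the NULL result into an OutOfMemoryError.
    void* allocate_or_fail(Heap* heap, std::size_t size) {
        void* result = heap->attempt_allocation(size);
        if (result != nullptr) return result;
        if (heap->gc_locker_active) return nullptr;  // GC locked out: spurious OOM
        heap->collect();
        return heap->attempt_allocation(size);
    }

    // Proposed policy: stall the request until the locker is vacated, unless
    // the requesting thread itself holds the locker (that would self-deadlock).
    void* allocate_or_stall(Heap* heap, Thread* self, std::size_t size) {
        for (;;) {
            void* result = heap->attempt_allocation(size);
            if (result != nullptr) return result;
            if (heap->gc_locker_active) {
                if (self->in_critical()) return nullptr;  // reflexive case: fail instead
                heap->wait_until_gc_locker_clear();       // hold the request in abeyance
                continue;                                 // retry once a GC is possible
            }
            heap->collect();
            return heap->attempt_allocation(size);        // NULL here means genuine OOM
        }
    }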

However, more subtle deadlocks are possible if T1's
operation has a circular dependency on T2's allocation via
a dependency chain not directly visible to the JVM. Clearly,
that would be a violation by the programmer of the JNI spec's
restrictions on the use of critical sections, and the
JVM could reasonably disclaim responsibility for such user violations.
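
For illustration, such a violation might look like the following contrived
native method (all names are made up): it blocks on a native lock while inside
the critical section, so if the lock's owner is itself stalled waiting for the
GC locker to clear, the two threads deadlock.

    #include <jni.h>
    #include <pthread.h>

    static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

    // Violates the documented Get*Critical restrictions: the thread may block
    // indefinitely while the GC locker is held.
    extern "C" JNIEXPORT void JNICALL
    Java_Example_badCritical(JNIEnv* env, jclass, jbyteArray arr)
    {
        void* p = env->GetPrimitiveArrayCritical(arr, NULL);
        if (p != NULL) {
            pthread_mutex_lock(&shared_lock);    // disallowed: arbitrary blocking call
            pthread_mutex_unlock(&shared_lock);
            env->ReleasePrimitiveArrayCritical(arr, p, JNI_ABORT);
        }
    }
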
###@###.### 10/28/04 19:13 GMT

                                    

Comments
EVALUATION

To the SDN user who asked which versions this is fixed in:
----------------------------------------------------------

This is a bug in at least each of the 1.3.1, 1.4.2, 5.0 trains,
and is fixed in 1.3.1_17, 1.4.2_11 and 5.0u7 releases of those
trains. It's also fixed in Mustang (6.0) beta b37. This information
is also visible in the "Fixed in" field of the bug report.
                                     
2006-04-25
EVALUATION

The bug has been fixed, and tested successfully with big apps and with
targeted regression tests. The fix will be put back to
Mustang upon CCC approval; see http://ccc.sfbay/6186200.


###@###.### 2005-04-18 21:06:41 GMT
                                     
2005-03-16
SUGGESTED FIX

See http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev
###@###.### 2005-04-18 21:06:42 GMT

Event:            putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
                  (jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace:  /net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace
                  (prt-web:/net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace)
User:             ysr

Comment:

---------------------------------------------------------

Original workspace:     neeraja:/net/spot/scratch/ysr/gclocker
Submitter:              ysr
Archived data:          /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/
Webrev:                 http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/workspace/webrevs/webrev-2005.04.20/index.html


Fixed 6186200: RFE: Stall allocation requests while heap is full and GC locker is held

http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev

The problem was that if the heap is getting close to full
and a thread enters a JNI critical section, then an
allocation request that exceeds the available space
will fail, because GC is not allowed and the request
cannot be immediately satisfied. This means that
applications that use JNI critical sections, even for
fairly short durations, may be susceptible to strange
and fleeting OOM errors if a critical section is
entered at just the wrong time. Several such reports have recently
surfaced in the field (Wachovia, Instinet, SAP, SCT, etc.),
some of them resulting in escalations.

Our fix is to stall such requests until the critical
section has cleared, making a GC possible. For defensive
reasons, if the allocating thread is itself in the critical
section, we do not stall. This avoids self-deadlocks, but
of course does not rule out deadlocks arising from
transitive dependencies, not directly or easily visible
or inferable by the VM, from a thread in a JNI critical
section to an allocating thread thus stalled. Clearly,
such dependencies of threads in JNI critical sections
violate the conditions documented in JNI_Get*Critical().
It could be argued that the reflexive dependency is also
such a violation and need not be checked. That is certainly
a reasonable stance; we now flag such reflexive deadlock
possibilities under the -Xcheck:jni flag. I chose
to be liberal here, since the check involves only
state that is already available in the allocating thread.
(I could easily be persuaded not to make a concession for such
self-deadlocks. After all, it could be argued, it's a
good thing not to encourage users to write bad code,
indeed code that violates documented restrictions.)
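
For example, such reflexive cases are surfaced at run time simply by enabling
the checked JNI mode (the application name below is hypothetical):

    java -Xcheck:jni MyCriticalApp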

JVMPI offers an API for disabling GC, which can of course
be used to cause further deadlocks under such circumstances.
The GC locker is also used in certain JVMPI functions to
prevent GC while events are posted. There are example
code paths in the JVM where threads may suspend while holding
the GC locker lock, in response to JVMPI/JVMDI suspension.
These introduce further situations where the application can
deadlock. In all these cases, we will have replaced a potential
OOM error with a new deadlock. However, JVMPI is going away
in Mustang, so we need not worry about these new deadlock
scenarios. In any event, there are deadlocking modes
possible with these and other JVMPI interfaces even without
the use of GC locker. So I do not believe this is a
major issue, albeit one that should be run by JVMPI users
(tool vendors) before (if/when) being back-ported to Tiger.

This change request is currently before the CCC.
We are putting this fix back so as to
allow nightly GC testing before the next integration.
If the CCC recommends modifications, we'll do so
in a future putback as necessary.

A further GC-locker-related fix and a code clean-up
are forthcoming in another putback later today under bug id
4828899.

Reviewed by: Alan Bateman, Paul Hohensee

Fix verified: yes

Verification test: A modified version of Mingayo's JNI locker
   test run with a small heap so as to increase the
   cross-section of the described window of vulnerability

Other testing:
  PRT
  big apps testing (36 hours with all collectors; thanks LiFeng/June)
  runThese -full
  refworkload

Files:
update: src/share/vm/gc_implementation/parallelScavenge/parallelScavengeHeap.cpp
update: src/share/vm/memory/collectorPolicy.cpp
update: src/share/vm/memory/gcLocker.cpp
update: src/share/vm/memory/gcLocker.hpp

Examined files: 3240

Contents Summary:
       4   update
    3236   no action (unchanged)

###@###.### 2005-04-20 21:31:58 GMT
                                     
2005-04-18


