J2SE Version (please include all output from java -version flag):
6u29
Does this problem occur on J2SE 1.4.x, 1.5 or 6? Yes / No (pick one)
Yes
Bug Description:
A customer heap dump clearly shows that over 80% of the heap is being
consumed by a private static map within sun.rmi.transport.DGCAckHandler.
This map maps from a java.rmi.server.UID to a sun.rmi.transport.DGCAckHandler
instance. There are 11,300,771 entries in this map consuming 880,764,432
bytes of heap memory.
Currently they have specified -Dsun.rmi.dgc.client.gcInterval=3600000 and
-Dsun.rmi.dgc.server.gcInterval=3600000 to reduce DGC overhead.
And they remove these as a troubleshooting measure. That said,
don't see any direct link between these settings and the issue at hand
from reading JVM source code.Instead, it seems be in the bowels of
the JVM RMI code.
Reading the decompiled Java 6 Update 29 code, it appears that
DGCAckHandlers are created and inserted into the map by
sun.rmi.transport.ConnectionOutputStream.saveObject(),
which in turn appears to be called by sun.rmi.transport.LiveRef
whenever writing a live Remote object to an RMI response.
Once an entry is added to the map, it will only be removed by
DGCAckHandler.received(UID), which is called upon receipt of
a DGCAck from the client. A DGC ack timeout (5 minutes by default)
will null out the non-final fields of DGCAckHandler -- but won't
remove the entry from the list! So it would appear that if many
disparate live references are returned to clients which don't respond
with a DGC ack, that this map will grow without bounds.
the JVM should do better than that, though.
It appears that all the DGCAckHandler instances have a null objList,
which seems to imply that all of these have timed out!
Thought pass this along as thought this might ring some bells there.
the anonymous Runnable passed to schedule() in DGCAckHandler.startTimer()
should also call idTable.remove(). The code feels fragile otherwise,
More information from the submitter:
A customer is encountering enormous memory usage as per below within the bowels
of DGCAckHandler. The customer in question is using Java 6 Update 29 on Solaris
10 and seeing the issue when doing a stress test with OracleATS
(Advanced Test Suite).
More information from the submitter:
We have finally gotten enough information from the customer to reproduce the issue.
We have RMI server processes which when they discover they are overloaded do the following:
if ( needToRedirect(...) )
{
MethodServer next_server = nextMethodServer(); // MethodServer is a java.rmi.Remote interface
if (next_server != null)
{
...
throw new ServerLoadException(next_server); // TO_DO: ponder where/how/if we might decide that no one wants this request and return no-next server or some such...
}
}
where nextMethodServer internally does:
nextMethodServer = serverManager.getNextServer( thisMethodServerImpl );
plus some caching of this result with a time-to-live, where serverManager is a stub to another RMI server (kind of an overseer), nextMethodServer is a stub to another MethodServer and thisMethodServerImpl is the MethodServer implementation within this RMI server process.
It appears that either the generation of the ServerLoadException or invocation of getNextServer() is triggering DGCAckHandler instances to be created which will never be DGC ack'ed.�� Given the time-to-live caching, the ServerLoadException is the issue.
The number of leaked DGCAckHandlers appears to be exactly equal to the number of calls to ServerLoadException(), so I think that points quite convincingly to this being the cause of the issue.
Modifying ServerLoadException to hide the MethodServer stub in a byte[] (as per attached) rather than leaving it as a simple non-transient field seems to completely workaround the issue.
That raises the question as to what should be fixed in the RMI internals, as I don't see anything clearly wrong with the RMI usage pattern here that should trigger a DGCAckHandler leak.