JDK-6231517 : flawed reconnect logic in sun.rmi.transport.DGCClient's makeDirtyCall()
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.rmi
  • Affected Version: 6
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2005-02-22
  • Updated: 2010-04-03
  • Resolved: 2005-07-29
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other JDK 6
5.0u8Fixed 6 betaFixed
Description
I have discovered a problem in sun.rmi.transport.DGCClient while
running on our platform.  I encountered this problem using Java 1.3.1,
but the code is identical in Java 1.4.2, 5.0, and 6.0, so I'm sure the 
problem exists there as well.

We are running a first copy of JBoss application server.  It uses RMI
to connect to a second copy of JBoss application server.  If the
second copy of JBoss terminates abruptly, the first copy of JBoss
starts attempting to reconnect to the second copy using the RMI port.
These reconnection attempts are very rapid fire, in a tight loop.

After analyzing the problem, I determined that the problem is in
sun.rmi.transport.DGCClient.  In particular, the method makeDirtyCall
is called periodically to renew the lease on remote objects so they
are not garbage collected in the remote JVM.  If the socket "connect"
to the RMI port fails, the code in makeDirtyCall goes into a reconnect
algorithm.  The time to wait between retries is computed by taking the
difference between the timestamps before and after the last connect
attempt.  Since the granularity of System.currentTimeMillis() may be
less than milliseconds, it is not at all unusual for this difference
to be zero.  Therefore, zero is used to compute the wait time.  This
wait time is doubled after each retry, but of course doubling zero
doesn't do much good!  These retries occur until the lease time is up.
The typical lease time is 10 minutes, so the tight loop can continue
for quite a while.

I would like to suggest the following fix: [see Suggested Fix]
###@###.### 2005-2-22 03:14:44 GMT

Comments
EVALUATION Regarding the JDC/SDN comment of 2006-02-12, the fix for this bug in the Mustang release can be found in the source snapshots available from the Mustang project on java.net (https://mustang.dev.java.net/). It was roughly the following patch to src/share/classes/sun/rmi/transport/DGCClient.java: @@ -366,11 +374,12 @@ /* * Calculate the initial retry delay from the * average time elapsed for each of the first - * two failed dirty calls. + * two failed dirty calls. The result must be + * at least 1000ms, to prevent a tight loop. */ dirtyFailureDuration = - (dirtyFailureDuration + - (endTime - startTime)) >> 1; + Math.max((dirtyFailureDuration + + (endTime - startTime)) >> 1, 1000); } long newRenewTime = endTime + (dirtyFailureDuration << n);
22-02-2006

EVALUATION Fixed similar to "Suggested Fix".
28-07-2005

EVALUATION Yes, this bug appears to exist as described, and when it occurs, the effect would indeed seem to be rather unpleasant. At a miniumum, something like the Suggested Fix seems reasonable. Moreover, I would guess that this bug is most likely to be triggered when the dirty call failure does not involve any network operations, such as when a dirty call fails with a SecurityException because a custom client socket factory does not have permission to connect to the particular remote endpoint. But in such a case, retry seems pointless anyway-- perhaps the retry algorithm should only be attempted when the failure is a RemoteException, or some set of RemoteException subclasses (like ConnectException, ConnectIOException, MarshalException, UnmarshalException, etc.). ###@###.### 2005-2-24 00:11:20 GMT
23-02-2005

SUGGESTED FIX Below is the code for method sun.rmi.transport.DGCClient.makeDirtyCall with some annotations (lines starting with "->") from licensee: /** * Make a DGC dirty call to this entry's endpoint, for the ObjIDs * corresponding to the given set of refs and with the given * sequence number. * * This method should NOT be called while synchronized on this entry. */ private void makeDirtyCall(Set refEntries, long sequenceNum) { ObjID[] ids; if (refEntries != null) { ids = createObjIDArray(refEntries); } else { ids = emptyObjIDArray; } long startTime = System.currentTimeMillis(); try { Lease lease = dgc.dirty(ids, sequenceNum, new Lease(vmid, leaseValue)); long duration = lease.getValue(); long newRenewTime = computeRenewTime(startTime, duration); long newExpirationTime = startTime + duration; synchronized (this) { dirtyFailures = 0; setRenewTime(newRenewTime); expirationTime = newExpirationTime; } } catch (Exception e) { long endTime = System.currentTimeMillis(); synchronized (this) { dirtyFailures++; if (dirtyFailures == 1) { /* * If this was the first recent failed dirty call, * reschedule another one immediately, in case there * was just a transient network problem, and remember * the start time and duration of this attempt for * future calculations of the delays between retries. */ dirtyFailureStartTime = startTime; dirtyFailureDuration = endTime - startTime; setRenewTime(endTime); } else { /* * For each successive failed dirty call, wait for a * (binary) exponentially increasing delay before * retrying, to avoid network congestion. */ int n = dirtyFailures - 2; if (n == 0) { /* * Calculate the initial retry delay from the * average time elapsed for each of the first * two failed dirty calls. */ dirtyFailureDuration = (dirtyFailureDuration + (endTime - startTime)) >> 1; -> /* -> * I think something like the following should be -> * added: -> * if (dirtyFailureDuration < 1000) -> * dirtyFailureDuration = 1000; -> */ } long newRenewTime = endTime + (dirtyFailureDuration << n); /* * Continue if the last known held lease has not * expired, or else at least a fixed number of times, * or at least until we've tried for a fixed amount * of time (the default lease value we request). */ if (newRenewTime < expirationTime || dirtyFailures < dirtyFailureRetries || newRenewTime < dirtyFailureStartTime + leaseValue) { setRenewTime(newRenewTime); } else { /* * Give up: postpone lease renewals until next * ref is registered for this endpoint. */ setRenewTime(Long.MAX_VALUE); } } if (refEntries != null) { /* * Add all of these refs to the set of refs for this * endpoint that may be invalid (this VM may not be in * the server's referenced set), so that we will * attempt to explicitly dirty them again in the * future. */ invalidRefs.addAll(refEntries); /* * Record that a dirty call has failed for all of these * refs, so that clean calls for them in the future * will be strong. */ Iterator iter = refEntries.iterator(); while (iter.hasNext()) { RefEntry refEntry = (RefEntry) iter.next(); refEntry.markDirtyFailed(); } } /* * If the last known held lease will have expired before * the next renewal, all refs might be invalid. */ if (renewTime >= expirationTime) { invalidRefs.addAll(refTable.values()); } } } } ###@###.### 2005-2-22 03:14:45 GMT
22-02-2005