JDK-6275081 : Significant slowdown when creating/releasing a JMX rmi/jrmp connector server starting with b36
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.rmi
  • Affected Version: 6
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2005-05-24
  • Updated: 2010-07-29
  • Resolved: 2005-09-20
Fixed in: JDK 6 (6 beta)
Related Reports
Relates :  
Description
Since Mustang b36 we have seen a JMX test failing on all platforms with
a timeout.
Analysis shows that it has become roughly 80 times slower to create a
JMX connector server, start it, and stop it.
Most of the JMX code doing this spends its time allocating and freeing
java.net-level objects.
An important detail is that RMI over JRMP is affected, while RMI over
IIOP is just as fast with b36 as it was with b35.
The JMX engineering team did not put back anything in b36, and a search
in Bugster turned up nothing, so I wonder what is going on.
The attached Java code is all you need to demonstrate the issue.

###@###.### 2005-05-24 09:03:28 GMT
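
[Editor's note: the attached code is not reproduced in this report. A minimal
sketch of what such a reproducer might look like, assuming a default rmi/jrmp
connector bound through the platform MBean server (the class name, service
URL, and timing output are illustrative, not the attached code itself):]

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.remote.JMXConnectorServer;
    import javax.management.remote.JMXConnectorServerFactory;
    import javax.management.remote.JMXServiceURL;

    public class Main {
        public static void main(String[] args) throws Exception {
            int cycles = Integer.parseInt(args[0]);
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // "service:jmx:rmi://" selects the rmi/jrmp transport.
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi://");
            long start = System.currentTimeMillis();
            for (int i = 0; i < cycles; i++) {
                JMXConnectorServer cs =
                        JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);
                cs.start();
                cs.stop();
            }
            System.out.println(cycles + " create/start/stop cycles in "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }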

Comments
EVALUATION Fixed via general cleanup of the server socket accept loop implementation.
16-09-2005

WORK AROUND Set a failure handler with RMISocketFactory.setFailureHandler, and have its failure method return true, or at least sometimes return true (see the sketch below). This is obviously nasty. ###@###.### 2005-05-24 13:04:17 GMT
24-05-2005
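
[Editor's note: a minimal sketch of such a handler, using only the public
java.rmi.server API; the class name is illustrative:]

    import java.rmi.server.RMIFailureHandler;
    import java.rmi.server.RMISocketFactory;

    public class InstallFailureHandler {
        public static void main(String[] args) throws Exception {
            // Installing a failure handler replaces the built-in
            // accept-failure back-off policy in RMI's TCP transport.
            RMISocketFactory.setFailureHandler(new RMIFailureHandler() {
                public boolean failure(Exception ex) {
                    // Returning true asks RMI to retry the server socket
                    // accept immediately instead of sleeping ten seconds.
                    return true;
                }
            });
            // ... create, start, and stop the connector server as usual ...
        }
    }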

EVALUATION It would appear that we're falling over some code in RMI that handles TCP accept failures. I think this code is being triggered now because of the recent fix for 4457683 ("port used by remote object is not freed up after remote object is unexported"). What I'm seeing is that the system is hitting the Thread.sleep(10000) in sun.rmi.transport.tcp.TCPTransport.continueAfterAcceptFailure:

    /* Default behavior if no failure handler is installed:
     * if we get a burst of NFAIL failures in NMSEC milliseconds,
     * then wait for ten seconds.  This is to ensure that
     * individual failures don't cause hiccups, but sustained
     * failures don't hog the CPU in futile accept-fail-retry
     * looping.
     */
    final int NFAIL = 10;
    final int NMSEC = 5000;

We could be getting a ten-second wait every ten times we call RMIConnectorServer.start, which would add an average penalty of one second. If starting normally takes about 1/80th of a second, that would account for the slowdown.

The attached program (connectorservertime.Main) can be used to reproduce the problem. Run it with an argument that says how many times to create and immediately destroy a connector server. I find (on Solaris 9) that with argument 11 everything works fine, but with argument 12 it hits the delay.

It may be that this sort of rapid creation and deletion won't happen in practice, but I think the question should be studied by the RMI team. ###@###.### 2005-05-24 13:04:17 GMT
24-05-2005

EVALUATION The observed behavior is indeed a bug that was introduced with the fix for 4457683. Now that a sun.rmi.transport.tcp.TCPTransport object can have multiple server sockets in its lifetime, the way that recent server socket accept failures are recorded as TCPTransport instance variables is flawed:

    private transient long acceptFailureTime = 0L;
    private transient int acceptFailureCount;

Every time a server socket is closed, an accept failure is recorded; after 10 of these in a burst (if there is no RMIFailureHandler installed), a 10-second sleep is performed by the accept loop for the most recent server socket. That by itself might have been harmless, except that the method continueAfterAcceptFailure synchronizes on the TCPTransport instance (why?), which the listen method also does before creating a new server socket, and that delays the subsequent export.

The server socket accept failure history should be specific to a given server socket; it would seem cleaner to move these variables, along with the TCPTransport.run() method, into a separate inner implementation of Runnable, which can be synchronized on separately (if it needs synchronization at all, which I don't think it does, because these variables should only be accessed from a single accept loop anyway).

[Note that in Mustang build b40, the attached test case will no longer fail because of the fix for 6269166, which involved partially backing out the fix for 4457683 temporarily. This bug remains, however, and it can be demonstrated with sequences of exporting/unexporting remote objects on explicit ports.] ###@###.### 2005-06-03 02:56:33 GMT
03-06-2005
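
[Editor's note: the export/unexport demonstration mentioned in the note above
might be sketched as follows; the Ping interface, port number, and loop count
are illustrative assumptions. On an affected build, cycles after roughly the
tenth would be expected to hit the ten-second stall:]

    import java.rmi.Remote;
    import java.rmi.server.UnicastRemoteObject;

    public class ExportLoop {
        // Trivial remote interface and implementation for the demonstration.
        interface Ping extends Remote {}
        static class PingImpl implements Ping {}

        public static void main(String[] args) throws Exception {
            int port = 10999;  // explicit, fixed port
            for (int i = 0; i < 20; i++) {
                PingImpl impl = new PingImpl();
                long t0 = System.currentTimeMillis();
                // Each unexport closes the server socket, which is recorded
                // as an accept "failure"; a burst of ten such failures
                // triggers the ten-second sleep described above, delaying
                // the next export on the same port.
                UnicastRemoteObject.exportObject(impl, port);
                UnicastRemoteObject.unexportObject(impl, true);
                System.out.println("cycle " + i + ": "
                        + (System.currentTimeMillis() - t0) + " ms");
            }
        }
    }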