Bug ID: JDK-6697180 JMX query results in java.io.IOException: Illegal state

JDK-6697180 : JMX query results in java.io.IOException: Illegal state - also a deadlock can also be seen

Type: Bug
Component: core-svc
Sub-Component: javax.management
Affected Version: 5.0

Priority: P2
Status: Closed
Resolution: Fixed
OS: linux_redhat_4.0
CPU: generic

Submitted: 2008-05-02
Updated: 2011-03-08
Resolved: 2011-03-08

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

Other	Other	JDK 6	JDK 7
5.0u16-revFixed	5.0u17Fixed	6u12Fixed	7 b37Fixed

Description

problem is observed with JRE 1.5.0_09 on RHEL4u3 32-bit

A customer has a JMX manager software which uses Notifications from
agents as well as JMX query.  The manager software encountered two 
unexpected situations (A and B below). The manager deals with three 
JMX agents, thus the logs often confuse us to tell what was really 
happing.

(A) ClientNotifFowarder thread died, but RMIConnector did not catch
error. Later, JMX query failed, but restart operation ended up with
IOException("Illegal state.").

// ClientNotifForwarder.java
public synchronized ClientListenerInfo[] preReconnection() throws
IOException {
if (state == TERMINATED || beingReconnected) { // should never
throw new IOException("Illegal state.");
}

2008/04/12 20:06:13 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.io.IOException: Not connected
2008/04/13 9:07:24 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.io.IOException: Illegal state.
2008/04/13 9:07:24 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.io.IOException: The ClientNotifForwarder
has been terminated.
2008/04/13 17:37:17 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.rmi.NoSuchObjectException: no such object
in table
2008/04/13 17:37:17 RMIConnector RMIClientCommunicatorAdmin-doStop
WARNING: Failed to call the method close():java.rmi.ConnectIOException:
error during JRMP connection establishment; nested exception is:
java.net.SocketException: Connection reset

There was a message which tells ClientNotifForwarder thread was stopped
on 2008/04/12.

2008/03/14 10:01:45 RMIConnector RMIClientCommunicatorAdmin-doStop
WARNING: Failed to call the method close():java.rmi.ConnectIOException:
error during JRMP connection establishment; nested exception is:
java.net.SocketException: Connection reset
2008/04/12 20:06:13 ClientNotifForwarder NotifFetcher-run
SEVERE: Failed to fetch notification, stopping thread. Error is:
java.rmi.ConnectException: Connection refused to host: 10.251.200.61;
nested exception is:
java.net.ConnectException: Connection refused
java.rmi.ConnectException: Connection refused to host: 10.251.200.61;
nested exception is:
java.net.ConnectException: Connection refused
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:574)

at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:185)

* We initially thought there was a short network problem which resulted in ClientNotifFowarder thread to terminate.  

2008/04/13 9:07:24 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.io.IOException: The ClientNotifForwarder

This seems to have caused "Illegal state" much later when ClientCommunicatorAdmin
tried to restart (we guess there may have been another netowrk problem at that time...)

2008/04/13 9:07:24 ClientCommunicatorAdmin restart
WARNING: Failed to restart: java.io.IOException: Illegal state.

(B) It looks like both JMX query operation and ClientNotifForwarder
thread got IOException almost at the same time. In that case, these two
threads cause a kind of dead lock.

Thread in JMX query:

"Thread-3039643" prio=1 tid=0x39a005e8 nid=0x3f0a in Object.wait() [0x34b8e000..0x34b8f130]
��������at java.lang.Object.wait(Native Method)
��������- waiting on <0x532a7958> (a javax.management.remote.rmi.RMIConnector$RMINotifClient)
��������at java.lang.Object.wait(Object.java:474)
��������at
com.sun.jmx.remote.internal.ClientNotifForwarder.postReconnection(ClientNotifForwarder.java:314)
��������- locked <0x532a7958> (a javax.management.remote.rmi.RMIConnector$RMINotifClient)
��������at
javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.reconnectNotificationListeners(RMIConnector.java:1512)
��������at
javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.doStart(RMIConnector.java:1577)
��������at
com.sun.jmx.remote.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:106)
��������at
com.sun.jmx.remote.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
��������at
javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1413)
��������at
javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.getMBeanInfo(RMIConnector.java:1033)


��������at
jp.co.ntt.lab.soa.crane.manager.base.mbs.ProxyMBean$GetMBeanInfoThread.run(ProxyMBean.java:476)

ClientNotifFowarder thread:

"ClientNotifForwarder-2" daemon prio=1 tid=0x08684a98 nid=0x4e61 in Object.wait()
[0x3e5da000..0x3e5db0b0]
��������at java.lang.Object.wait(Native Method)
��������- waiting on <0x532e02e8> (a [I)
��������at java.lang.Object.wait(Object.java:474)
��������at
com.sun.jmx.remote.internal.ClientCommunicatorAdmin.restart(ClientCommunicatorAdmin.java:82)
��������- locked <0x532e02e8> (a [I)
��������at
com.sun.jmx.remote.internal.ClientCommunicatorAdmin.gotIOException(ClientCommunicatorAdmin.java:34)
��������at
javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1413)
��������at
javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1293)
��������at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:508)
��������at
com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:399)
��������at
com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:83)

* It looks like these thread prevent each other to complete restart
operations.

It looks like current JMX implementation does not handle these cases well.
The customer is asking for a workaround or a fix for these.

The customer investigated more and guess that this was not caused by network unreliability.  Unstable state between the finish of ClientNotifForwarder 
and the finish of RMIConnector in server.  When client communicate with the 
server during this period, NoSuchObjectException (extends IOException) 
occured in earlier timing and ConnectException (connection refused) occured 
in later timing.

Full logs as well as a test case for the Illegal State with reproduction instructions in (README.txt) have been attached to this bug.

Comments

SUGGESTED FIX During reconnection, if the fetching job is in the state "STARTED", we should not ask it to end, the job should be either busy on distributing notifications received to listeners, or blocked on waiting reconnection. We only need to do specially clean job for this case when the reconnection is finished. A test is attached to reproduce the problem and to verify the solution suggested.
08-07-2008
EVALUATION Today we end the fetching job because we need to do some clean after the reconnection, that's why the method "preReconnection()" set the flag "STOPPING" for the fetching job, and the method "preReconnection()" needs waiting the job end. Obviously this logic makes deadlock. One solution is to not ask the job to end during the reconnection, but in the method "postReconnection" we do necessary cleaning, we should be very careful because the clean job is different according to the state of the fetching job.
25-06-2008
EVALUATION If my previous evaluation is correct, the solution is: when the fetching thread finds the state = RE_CONNECTING, instead of waiting the state change, it may check whether another state is equal to STOPPING, if yes it should do stopping to free the thread doing the reconnection. We need the test to verify the previous evaluation and this solution.
21-06-2008
EVALUATION For B), the reason of blocking is: One user thread was doing query (querying thread) and got an IOException, this thread was used to do reconnection process, we have another thread created by the client connector doing notif feching (fetching thread), it got IOException too and found that another thread was doing reconnection so it was waiting. After the connection was reestablished, the user thread doing reconnection process was used to 1) re-add automatically all listeners to the remote client, 2) set a flag to tell the fetching thread to stop because a new fetching thread would be created after the reconnection was completed. The problem seems clear now: the user thread was waiting the fetching thread to die before creating a new fetching thread, but the current fetching thread was waiting the user thread to finish reconnection.
07-05-2008