Bug ID: JDK-8049303 Transient network problems cause JMX thread to fail silently

Type: Bug
Component: core-svc
Sub-Component: javax.management
Affected Version: 6u32

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2014-07-03
Updated: 2015-06-04
Resolved: 2014-09-12

JDK 8	JDK 9
8u40 Fixed	9 b32 Fixed

By default the JMX client side notification fetch timeout
(jmx.remote.x.notification.fetch.timeout) is 1 minute and the default server
connection timeout (jmx.remote.x.server.connection.timeout) is 2 minutes.
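For illustration, the two timeouts named above can be overridden through the environment maps given to the connector client and server. This is a minimal sketch: the class and method names (JmxTimeoutEnv, clientEnv, serverEnv) are hypothetical, and the values are assumed to be expressed in milliseconds. Only the two property names come from this report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds environment maps carrying the two timeout
// properties discussed in this report. Values are assumed to be milliseconds.
final class JmxTimeoutEnv {
    static final String FETCH_TIMEOUT =
            "jmx.remote.x.notification.fetch.timeout";
    static final String SERVER_CONNECTION_TIMEOUT =
            "jmx.remote.x.server.connection.timeout";

    // Environment for the client side connector.
    static Map<String, Object> clientEnv(long fetchTimeoutMillis) {
        Map<String, Object> env = new HashMap<>();
        env.put(FETCH_TIMEOUT, fetchTimeoutMillis);
        return env;
    }

    // Environment for the connector server.
    static Map<String, Object> serverEnv(long connectionTimeoutMillis) {
        Map<String, Object> env = new HashMap<>();
        env.put(SERVER_CONNECTION_TIMEOUT, connectionTimeoutMillis);
        return env;
    }

    public static void main(String[] args) {
        // The defaults described above: 1 minute and 2 minutes.
        System.out.println(clientEnv(60_000L));
        System.out.println(serverEnv(120_000L));
    }
}
```

Such maps would typically be passed as the environment argument of JMXConnectorFactory.connect on the client and JMXConnectorServerFactory.newJMXConnectorServer on the server.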

If the client side connector thread makes a notification fetch request to
the server, but a transient network problem prevents the server response
from reaching the client, the client side connector will wait for a
response until the timeout period (1 minute) has expired before throwing an
IOException.

The client side RMIConnector implementation handles the IOException by
re-checking the connection status to determine whether or not it is
broken.  If the connection is available at that moment, the connector fails
by re-throwing the initial IOException. The problem is that this re-check
of the connection passes because the server side of the connection doesn't
time out until 2 minutes have passed (by default). As a result, the
NotifFetcher thread dies without posting a failed notification, and the
client application does not get a chance to recover.
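To make "posting a failed notification" concrete, here is a minimal sketch of the kind of listener a client application might register in order to recover. The class name ConnectionFailureListener and its sawFailure method are hypothetical. JMXConnectionNotification.FAILED and the NotificationListener interface are part of the standard JMX remote API.

```java
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

// Hypothetical client-side listener: records whether the connector has
// reported a FAILED connection notification.
final class ConnectionFailureListener implements NotificationListener {
    private volatile boolean failed;

    @Override
    public void handleNotification(Notification n, Object handback) {
        if (n instanceof JMXConnectionNotification
                && JMXConnectionNotification.FAILED.equals(n.getType())) {
            // A real application would reconnect or alert here.
            failed = true;
        }
    }

    boolean sawFailure() {
        return failed;
    }
}
```

A listener like this would be registered with JMXConnector.addConnectionNotificationListener. In the scenario described above it is never invoked, because the fetching thread ends without emitting the notification.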

The suggested fix is to modify RMIConnector.RMINotifClient.fetchNotifs. If the fetchNotifs request gets an IOException, we examine the chain of exceptions to determine whether this is a deserialization issue. If so, we propagate the appropriate exception to the caller, who will then proceed with fetching notifications one by one. Otherwise we call communicatorAdmin.gotIOException(ioe), which has two possible outcomes:
1) The call returns normally, which means the connection has been re-established; we call fetchNotifs again.
2) The call throws an IOException; we check the connection status:
2-1) "terminated": the connection is closed, so we re-throw the original IOException and the caller ends silently.
2-2) not "terminated": we use a "retried" flag for this situation. If the flag is false, we set it to true and redo the fetchNotifs request, which helps with a transient network problem. Otherwise we close the connection and re-throw the original IOException; this is where the bug is fixed.
We do not modify communicatorAdmin.gotIOException(ioe), because it is also called by all other remote requests. It is not easy to write a test that reproduces the bug.
11-09-2014
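The retry logic proposed in the comment dated 11-09-2014 can be sketched as follows. This is a self-contained illustration, not the actual RMIConnector code: the Fetcher and Admin interfaces, the StubAdmin class, and the String result type are hypothetical stand-ins, and the check for a deserialization problem (looking for ClassNotFoundException or NotSerializableException in the cause chain) is an assumption.

```java
import java.io.IOException;
import java.io.NotSerializableException;

final class FetchRetrySketch {
    // Hypothetical stand-in for the remote fetch request.
    interface Fetcher { String fetch() throws IOException; }

    // Hypothetical stand-in for communicatorAdmin plus connection state.
    interface Admin {
        void gotIOException(IOException e) throws IOException;
        boolean isTerminated();
        void close();
    }

    static String fetchNotifs(Fetcher fetcher, Admin admin) throws IOException {
        boolean retried = false;
        while (true) {
            try {
                return fetcher.fetch();
            } catch (IOException ioe) {
                if (isDeserializationProblem(ioe)) {
                    throw ioe; // caller falls back to fetching one by one
                }
                try {
                    admin.gotIOException(ioe);
                    // 1) returned normally: connection is usable, fetch again
                } catch (IOException recheckFailure) {
                    if (admin.isTerminated()) {
                        throw ioe; // 2-1) connection closed
                    }
                    if (retried) {
                        admin.close(); // 2-2) second failure: close and fail
                        throw ioe;
                    }
                    retried = true; // 2-2) first failure: retry once
                }
            }
        }
    }

    // Assumed heuristic: walk the cause chain for serialization-related types.
    static boolean isDeserializationProblem(Throwable e) {
        for (Throwable t = e; t != null; t = t.getCause()) {
            if (t instanceof ClassNotFoundException
                    || t instanceof NotSerializableException) {
                return true;
            }
        }
        return false;
    }

    // Stub whose connection re-check always fails; records close() calls.
    static final class StubAdmin implements Admin {
        final boolean terminated;
        boolean closed;
        StubAdmin(boolean terminated) { this.terminated = terminated; }
        public void gotIOException(IOException e) throws IOException { throw e; }
        public boolean isTerminated() { return terminated; }
        public void close() { closed = true; }
    }
}
```

In this sketch the flag is local to a single fetchNotifs call, so each call gets one retry of its own. Whether the flag should live longer than one call is a design choice the comment leaves open.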
Yes, a transient network problem could make the notification fetching thread die silently, but this has nothing to do with the JMX client side notification fetch timeout (jmx.remote.x.notification.fetch.timeout). That timeout is how long the client's fetch request waits for notifications on the server side; if no notification arrives within the timeout, the request returns with an empty response, not an IOException. The issue could happen like this:
1) RMIConnector.RMINotifClient.fetchNotifs got an IOException.
2) communicatorAdmin.gotIOException(ioe) was called and checked the connection; it did not close the connection because the connection was OK by then.
3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original exception, found that it was not a deserialization exception, and re-threw the original IOException.
4) The caller, ClientNotifForwarder, did not know how to treat this exception and decided to end silently.
The fix suggested by the reporter (adding a flag reFetch) would make the fetching continue, but it would also make the fetching lose notifications, which could be why the fix failed some tests related to notification serialization.
10-09-2014
We should have a solution that keeps handling serialization exceptions and, in case of a transient network problem, re-fetches notifications if possible, as other remote calls do.
08-09-2014
Shanliang, can you propose a valid solution to this problem? The ClientNotifForwarder shouldn't just die silently.
08-09-2014