JDK-8236498 : Improve Winsock error handling for closed connections (connection reset)
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.net
  • Priority: P4
  • Status: New
  • Resolution: Unresolved
  • OS: windows_10
  • CPU: generic
  • Submitted: 2019-12-23
  • Updated: 2020-05-18
Related Reports
Relates :  
Description
I am opening this issue on behalf of my Lucene committer colleague Dawid Weiss:

Short background: Apache Solr (part of Apache Lucene project) uses Jetty 9 for communication between nodes. Recently we added support for TLS connections and HTTP/2 support (using the Jetty libraries). So this issue also affects users of Jetty, using their client libs.

We have a multithreaded test setup where the test framework starts several Solr instances (each with a jetty) on localhost and then executing queries and index updates on it. This causes a lot of communication between the nodes. It also enables TLS connections. Since Java 9+, Solr also uses HTTP2 on TLS connections (as there is native ALPN support).

To also simulate some errors, sometimes solr nodes are killed during test run to figure out how the system behaves.

During debugging some random test failures on Windows, we figured out that sometimes we get completely unexpected "recv failed" errors. Here is the analysis by Dawid (see Lucene/Solr issue: [https://issues.apache.org/jira/browse/SOLR-13778]):

{quote}
Ok, so here it comes. I started from looking at the stack trace of those nested "recv failed" exceptions:
{code}
java.net.SocketException: Software caused connection abort: recv failed
   [junit4]   2>   	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]   2>   	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
{code}
the native method in question is different on Windows and on Linux. On Windows the core we're interested in is here:

https://github.com/openjdk/jdk14/blob/f58a8cbed2ba984ceeb9a1ea59f917e3f9530f1e/src/java.base/windows/native/libnet/SocketInputStream.c#L120-L154

The core part is a switch on WSAGetLastError:
{code}
int err = WSAGetLastError();
...
switch (err) {
...
default:
  NET_ThrowCurrent(env, "recv failed");
}
{code}

here is when I needed to recompile the JDK to actually see what the error code returned from Winsocks is. And it's this one: 

WSAGetLastError returns 10053 (WSAECONNABORTED)

https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2

If you take a closer look at the source code of SocketInputStream.c this return value is not covered in the switch -- that's why we're getting the exception. 

And here comes the worst part: the reasons as to WHY winsocks aborts a recv with this result type are fairly vague.  Microsoft says this:

"Connection reset by peer. An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET."

I tried to reproduce the same exception on a simple(r) example but totally failed. It SHOULD be possible to get the same exception using plain sockets (no SSL involved) but I always ended up getting regular connection-closed... I wish I could provide an example of this because it'd be a more concrete proof of the problem (and a signal that the JDK implementation should return a socket closed exception for this condition). Can't figure out a way to do it though, argh.

Going back to failures in Solr tests: I think the reason is that we shutdown jetty in the middle of the test but then reuse the same client that was previously connected to an existing instance. If it's an SSL connection then there may be SSL comms flying around in addition to user messages and if they're issued on a closed socket connection they trigger this enigmatic recv failed error.

I think the client should be reinstantiated (or at least any existing connections dropped) for the tests to work reliably. If we want a more connection-drop resilient client we could try to look into SSLExceptions/ SocketException and try to parse the 'recv failed' but I think it makes little practical sense and is really hacky. Better to drop the request in real life and properly reinitialize the client in tests.
{quote}

In short, the suggestion is to fix the switch statement in SocketInputStream.c to either do the following:
- add another case statement for WSAECONNABORTED (e.g. next to the already existing ones WSAECONNRESET and WSAESHUTDOWN). Because according to Mircosoft documentation this is basically more or less the same.
- alternatively in the default statement somehow report the WSA code, so it does not only return "recv failed", but maybe "recv failed (WSA error %d)", 10053 - in this case)

I'd prefer the first one.

The analysis has shown why this is critical during the usage of TLS connections: TLS sometimes sends behind the user some messages over the underlying socket. If the user socket was somehow closed it may asynchronously still send some messages during the close event over the wire that may trigger this unexpected exception. So it is very important to get a clear error message like "socket was closed".
Comments
The purpose for the special handling for "connection reset" in libnio/ch/SocketDispatcher.c (or libnet/SocketInputStream.c when running with -Djdk.net.usePlainSocketImpl) is so that subsequent reads will throw IOException rather than return bytes or -1/EOF. The API docs for Socket::getInputStream have more on this. Technically this special handling is not needed on Windows as I don't know of any case where it is possible to read bytes or -1/EOF after Windows reports an error. Since JDK 13 the exception messages are system messages. This is a point that we called out in the "Risks and Assumptions" section of JEP 353. In this case, the system message is a bit better than before but you are right that we should examine these cases to see if improvements should be made.
28-12-2019

Thanks Alan, you are right the error message is better in Java 14. We were testing with Java 11 (the Solr release version) and we only got "recv failed". Nevertheless, it would be good to have a special case handling in SocketInputStream, because WSAECONNABORTED is really similar to the other ones handled in the local switch statement. So maybe just add WSAECONNABORTED as third option next to WSAECONNRESET and WSAESHUTDOWN. I refer to the comment in code: "Windows sometimes reports the reset as a shutdown error." - IMHO the same statement also applies to WSAECONNABORTED, it also happens sometimes - looks like as Dawid found out - when something was sent before to the already closed Socket using SocketOutputStream: // Emit some data to already-closed socket. This succeeds (!) socket.getOutputStream().write("will succeed?!".getBytes("UTF-8"));
27-12-2019

The implementation has changed in JDK 13 so you'll get the same error messages as native Windows applications ("An established connection was aborted by the software in your host machine"). I agree that the error is confusing and we should look at WSAECONNABORTED in more detail to see if a less confusing error would be better here.
27-12-2019

I attached a reproducer that similates the inconsistens socket state with the generic error message on windows. It should really just say ConnectionResetException.
27-12-2019