I am opening this issue on behalf of my Lucene committer colleague Dawid Weiss:
Short background: Apache Solr (part of Apache Lucene project) uses Jetty 9 for communication between nodes. Recently we added support for TLS connections and HTTP/2 support (using the Jetty libraries). So this issue also affects users of Jetty, using their client libs.
We have a multithreaded test setup where the test framework starts several Solr instances (each with a jetty) on localhost and then executing queries and index updates on it. This causes a lot of communication between the nodes. It also enables TLS connections. Since Java 9+, Solr also uses HTTP2 on TLS connections (as there is native ALPN support).
To also simulate some errors, sometimes solr nodes are killed during test run to figure out how the system behaves.
During debugging some random test failures on Windows, we figured out that sometimes we get completely unexpected "recv failed" errors. Here is the analysis by Dawid (see Lucene/Solr issue: [https://issues.apache.org/jira/browse/SOLR-13778]):
{quote}
Ok, so here it comes. I started from looking at the stack trace of those nested "recv failed" exceptions:
{code}
java.net.SocketException: Software caused connection abort: recv failed
[junit4] 2> at java.base/java.net.SocketInputStream.socketRead0(Native Method)
[junit4] 2> at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
{code}
the native method in question is different on Windows and on Linux. On Windows the core we're interested in is here:
https://github.com/openjdk/jdk14/blob/f58a8cbed2ba984ceeb9a1ea59f917e3f9530f1e/src/java.base/windows/native/libnet/SocketInputStream.c#L120-L154
The core part is a switch on WSAGetLastError:
{code}
int err = WSAGetLastError();
...
switch (err) {
...
default:
NET_ThrowCurrent(env, "recv failed");
}
{code}
here is when I needed to recompile the JDK to actually see what the error code returned from Winsocks is. And it's this one:
WSAGetLastError returns 10053 (WSAECONNABORTED)
https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2
If you take a closer look at the source code of SocketInputStream.c this return value is not covered in the switch -- that's why we're getting the exception.
And here comes the worst part: the reasons as to WHY winsocks aborts a recv with this result type are fairly vague. Microsoft says this:
"Connection reset by peer. An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET."
I tried to reproduce the same exception on a simple(r) example but totally failed. It SHOULD be possible to get the same exception using plain sockets (no SSL involved) but I always ended up getting regular connection-closed... I wish I could provide an example of this because it'd be a more concrete proof of the problem (and a signal that the JDK implementation should return a socket closed exception for this condition). Can't figure out a way to do it though, argh.
Going back to failures in Solr tests: I think the reason is that we shutdown jetty in the middle of the test but then reuse the same client that was previously connected to an existing instance. If it's an SSL connection then there may be SSL comms flying around in addition to user messages and if they're issued on a closed socket connection they trigger this enigmatic recv failed error.
I think the client should be reinstantiated (or at least any existing connections dropped) for the tests to work reliably. If we want a more connection-drop resilient client we could try to look into SSLExceptions/ SocketException and try to parse the 'recv failed' but I think it makes little practical sense and is really hacky. Better to drop the request in real life and properly reinitialize the client in tests.
{quote}
In short, the suggestion is to fix the switch statement in SocketInputStream.c to either do the following:
- add another case statement for WSAECONNABORTED (e.g. next to the already existing ones WSAECONNRESET and WSAESHUTDOWN). Because according to Mircosoft documentation this is basically more or less the same.
- alternatively in the default statement somehow report the WSA code, so it does not only return "recv failed", but maybe "recv failed (WSA error %d)", 10053 - in this case)
I'd prefer the first one.
The analysis has shown why this is critical during the usage of TLS connections: TLS sometimes sends behind the user some messages over the underlying socket. If the user socket was somehow closed it may asynchronously still send some messages during the close event over the wire that may trigger this unexpected exception. So it is very important to get a clear error message like "socket was closed".