JDK-8049846 : SocketInputStream.socketRead0 hangs even with soTimeout under high load
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.net
  • Affected Version: 8u5
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • OS: linux
  • CPU: x86
  • Submitted: 2014-07-02
  • Updated: 2016-10-20
  • Resolved: 2016-10-20
Related Reports
Duplicate :  
Description
FULL PRODUCT VERSION :
java version "1.8.0_05"
Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Linux ip-10-234-136-61 3.2.0-53-virtual #81-Ubuntu SMP Thu Aug 22 21:21:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

EXTRA RELEVANT SYSTEM CONFIGURATION :
We run a Java program on Amazon EC2 and download data from Amazon S3. Network protocol is HTTP (no SSL). We establish 100 or more connections in parallel and download data using Java thread for each connection.

A DESCRIPTION OF THE PROBLEM :
Reading data from a socket hangs for at least 1 hour even if I set Socket.setSoTimeout to non-0 value. It happens very rarely (once a month under 100 or more connections in parallel with continuously 10MB/s or more data transfer).

Although it rarely happens, it leads significant impact to applications if it happens.

When it hanged, thread dump includes following stack trace:

----
"pool-3-thread-6" prio=10 tid=0x00007fd43006e800 nid=0xdf23 runnable [0x00007fd424e56000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:129)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
	at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
	at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
	- locked <0x00000000fed8a010> (a java.lang.Object)
	at com.treasure_data.storage.s3.S3ReadableByteChannel.read(S3ReadableByteChannel.java:80)
	at com.treasure_data.storage.prefetch.Prefetch$Prefetcher.run(Prefetch.java:48)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
----

When it hanged, I attached gdb to the java process and found that O_NONBLOCK is not set to the socket's file descriptor. SO_RCVTIMEO is set to 0 which means no timeout.

I downloaded source code of OpenJDK 6, 7, 8 and 9 and found that SocketInputStream/socketRead0 calls poll(2) and then call recv(2) on Linux. It's a commonly known issue on Linux that poll(2) could report that a socket file descriptor is ready to read even if it's actually not ready to read. This could happen when data has arrived but a packet has wrong checksum and is discarded. Here is the source code from OpenJDK 9 commit 2df45ac1bf49:

----
JNIEXPORT jint JNICALL
Java_java_net_SocketInputStream_socketRead0(JNIEnv *env, jobject this,
                                            jobject fdObj, jbyteArray data,
                                            jint off, jint len, jint timeout)
{
    ...

    if (timeout) {
        nread = NET_Timeout(fd, timeout);
        if (nread <= 0) {
             ...
             return -1;
        }
    }

    nread = NET_Read(fd, bufP, len);

    ...
}
----

NET_Timeout, which calls poll(2), could return with return code 1 even if following read blocks if a packet's checksum is wrong. Then NET_Read, which calls recv(2), blocks because O_NONBLOCK flag is not set to the file descriptor and SO_RCVTIMEO is 0.

According to source code, this problem exists in Java 6, 7, 8 and 9.

I found similar reports on OpenJDK bug database:
* https://bugs.openjdk.java.net/browse/JDK-7132648
* https://bugs.openjdk.java.net/browse/JDK-8032910

I found another recent report on Apache HTTP Client's mailing list:
* 2014-04-14 "I have set read time out but thread hang on socketRead0": http://mail-archives.apache.org/mod_mbox/hc-dev/201404.mbox/%3CCAFAd71WYx0UX7Mpgof2Mj=f1080CqEmR+=bRDZ9fjivZQUqV+g@mail.gmail.com%3E



STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
It's difficult to reproduce this problem in limited amount of time. If my understanding is correct, this problem happens only if TCP socket state was left in waiting state due to packet loss or checksum error at the exact moment when poll(3) returned false-positively.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
SocketInputStream.read() times out regardless of the load or network condition.
ACTUAL -
SocketInputStream.read() hangs for more than 1 hour very rarely.

ERROR MESSAGES/STACK TRACES THAT OCCUR :
We don't get error message because the thread hangs.
When I get thread dump, it shows that a thread is hanged at SocketInputStream.socketRead0.

REPRODUCIBILITY :
This bug can be reproduced rarely.

CUSTOMER SUBMITTED WORKAROUND :
I couldn't find workaround. Because SocketInputStream.socketRead0 is a primitive JNI method, I couldn't find ways to read data from a socket without calling the method.


Comments
Closing out as a duplicate of JDK-8075484
20-10-2016

http://hg.openjdk.java.net/jdk9/dev/jdk/rev/af17b6bc08dd
13-09-2016