JDK-6944020 : transferTo() over the network: Sometimes nothing is written (EPIPE)
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.net
  • Affected Version: 6u18
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: solaris_nevada
  • CPU: generic
  • Submitted: 2010-04-15
  • Updated: 2014-07-17
Related Reports
Relates :  
Description
transferTo() returns successfully, but nothing is written to the network (checked with tcpdump). Observed on multiple platforms.

the attached Reproduce.java does the following:
- Start a import com.sun.net.httpserver.HttpServer
- PUT files to it using HttpURLConnection and transferTo

We are seeing a problem on OS-X and Solaris X86 (all the platforms we have tried), where data is not written to the network when we use FileChannel.transferTo() to a network stream, in 5+/- of 8000 file transfers.

Log file and source code to reproduce attached.

Example on how to read the log:

The HTTP server times out after 20 seconds, because it fails to read the expected amount of data from the network (the example program writes this to the console):
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
Thr[27]: jag-file-transfer-thread-5 11:26:50.011 SEVERE SERVER: 500: Failed to write data to file '/Users/austvik/NetBeansProjects/ReproduceTransferToCrash/./file/Users/austvik/test.file-5159': no data received, expected 32425 bytes.
java.net.SocketTimeoutException: no data received
        at sun.net.httpserver.Request$ReadStream.block(Request.java:310)
        at sun.net.httpserver.Request$ReadStream.read(Request.java:250)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at sun.net.httpserver.FixedLengthInputStream.readImpl(FixedLengthInputStream.java:37)
        at sun.net.httpserver.LeftOverInputStream.drain(LeftOverInputStream.java:94)
        at sun.net.httpserver.LeftOverInputStream.close(LeftOverInputStream.java:51)
        at reproducetransfertocrash.Reproduce$WebServer$FileHandler.handlePut(Reproduce.java:422)
        at reproducetransfertocrash.Reproduce$WebServer$FileHandler.handle(Reproduce.java:336)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:65)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:68)
        at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:555)
        at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:65)
        at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:527)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:637)
Thr[16]: client-copy-to-thread-9 11:26:50.011 WARNING CLIENT: Error from SERVER: Failed to write data to file '/Users/austvik/NetBeansProjects/ReproduceTransferToCrash/./file/Users/austvik/test.file-5159': no data received, expected 32425 bytes.
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
Observe that this time it is test.file-5159 that fails, the threads involved is thread 27 - the server thread, and thread 16 - the client thread.

When I look in my logfile, I will see the server saying:
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
Thr[27]: jag-file-transfer-thread-5 11:26:10.011 FINE File './file/Users/austvik/test.file-5159' transfered between '/127.0.0.1:49718' (client) and '/127.0.0.1:54894' (server)
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
The interesting part here is the port number on the server - 54894.

I will also find output from the client in the log file:
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
Thr[16]: client-copy-to-thread-9 11:26:10.011 FINEST CLIENT: Copying '/Users/austvik/test.file' to '/Users/austvik/test.file-5159' at 'http://localhost:49718/file//Users/austvik/test.file-5159'
Thr[19]: client-copy-to-thread-2 11:26:10.011 FINE sun.net.www.MessageHeader@4d40cc822 pairs: {null: HT
TP/1.1 200 OK}{Transfer-encoding: chunked}
Thr[19]: client-copy-to-thread-2 11:26:10.011 FINEST CLIENT: Copying '/Users/austvik/test.file' to '/Users/austvik/test.file-5160' at 'http://localhost:49718/file//Users/austvik/test.file-5160'
Thr[16]: client-copy-to-thread-9 11:26:10.011 FINE sun.net.www.MessageHeader@72b5fdb9 pairs: {PUT /file
//Users/austvik/test.file-5159 HTTP/1.1: null}{Content-Type: text/plain}{Cache-Control: no-cache}{Pragma: no-cache}{User-Agent: Java/1.6.0_17}{Host: localhost:49718}{Accept: text/html, image/gif, image/jpeg
, *; q=.2, */*; q=.2}{Connection: keep-alive}{Content-Length: 32425}
Thr[16]: client-copy-to-thread-9 11:26:10.011 FINE CLIENT: Wrote '32425' of '32425' bytes from /Users/austvik/test.file-5159' to the network layer
Thr[16]: client-copy-to-thread-9 11:26:10.011 FINER CLIENT: written and flused and closed '/Users/austvik/test.file-5159'
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
The interesting data is the line that somes after transfereTo() has finished, that returns that I have written the correct amount of data, and the line that says that I have flushed and closed all connections.

We can see from the timestamps that this happens 40 seconds before the stacktrace in the top - the stacktrace on the top is a SocketTimeoutException. The client thinks that it has written everything to the network and closed all connections 40 seconds before the timeout.

If I look at the network traffic between the port numbers noted above, I see normal TCP SYN/SYN&ACK/ACK connect, and then I see a TCP PSH&ACK with all the headers from the client (from put to Content-Length), and an ACK on that from the server. On normal transfers, the content of the file is now transfered in the next ACK from the client to the server, but on the files that fail, the server never sends anything more, and the next thing that is sent is the "HTTP 500 Internal Server Error" with the "failed to write data" error message from the server to the client. tcpdump.txt attached. Frame 1089 contains the PUT and the headers (ends on 0d 0a 0d 0a). Frame 44846 contains the "HTTP 500 Internal Server Error", and 44848 the error message that is printed in the log ("Failed to write data to file ...").

We see this problem on local and remote interfaces on different operating systems, but never with only one thread. The file it happens to is different every time.

Another thing to observe, is that retries has a high possibility of failing (with less threads, a much percentage):
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
Thr[28]: jag-file-transfer-thread-3 11:26:27.011 SEVERE SERVER: 500: Failed to write data to file '/Users/austvik/NetBeansProjects/ReproduceTransferToCrash/./file/Users/austvik/test.file-2011': no data received, expected 32425 bytes.
...
Thr[28]: jag-file-transfer-thread-3 11:27:08.011 SEVERE SERVER: 500: Failed to write data to file '/Users/austvik/NetBeansProjects/ReproduceTransferToCrash/./file/Users/austvik/test.file-2011': no data received, expected 32425 bytes.
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----

In another run, we truss'ed the process under Solaris/Java 6u18 (everything lwp has been removed). There, the name of the file that won't transfere is lock_multi_bug38691.test. We see that the PUT headers are written at t=9.9089. (fd8 and 64 is loggers.) At t=9.9096, we are trying to send the content of the file, and get an SIGPIPE/EPIPE. We then close the file and the socket and retry, and send the headers, but don't give the program an opportunity to send the content of the file. 20 seconds later we time out.
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----
2654/71:         9.9056 stat64("/home/vo136787/mysql/mysql-advanced-5.1.39-solaris10-x86_64/mysql-test/t/lock_multi_bug38691.test", 0xB4E806F0) = 0
2654/71:         9.9057 access("/home/vo136787/mysql/mysql-advanced-5.1.39-solaris10-x86_64/mysql-test/t/lock_multi_bug38691.test", X_OK) Err#13 EACCES
2654/71:         9.9058 access("/home/vo136787/mysql/mysql-advanced-5.1.39-solaris10-x86_64/mysql-test/t/lock_multi_bug38691.test", R_OK) = 0
2654/71:         9.9059 access("/home/vo136787/mysql/mysql-advanced-5.1.39-solaris10-x86_64/mysql-test/t/lock_multi_bug38691.test", W_OK) = 0
2654/71:         9.9062 open64("/home/vo136787/mysql/mysql-advanced-5.1.39-solaris10-x86_64/mysql-test/t/lock_multi_bug38691.test", O_RDONLY) = 43
2654/71:         9.9062 fstat64(43, 0xB4E806F0)                         = 0
2654/71:         9.9063 fstat64(43, 0xB4E80770)                         = 0
2654/71:         9.9067     Incurred fault #6, FLTBOUNDS  %pc = 0xFE95C7A5
2654/71:              siginfo: SIGSEGV SEGV_ACCERR addr=0xFE190900
2654/71:         9.9067     Received signal #11, SIGSEGV [caught]
2654/71:              siginfo: SIGSEGV SEGV_ACCERR addr=0xFE190900
2654/71:         9.9068 setcontext(0xB4E80044)
2654/71:         9.9089 send(46, " P U T   / f i l e / / h".., 505, 0) = 505
2654/71:         9.9089 fstat64(43, 0xB4E80720)                         = 0
2654/71:         9.9090 pread64(43, " #\n #   B u g # 3 8 6 9".., 2837, 0) = 2837
2654/71:         9.9094 write(8, " 1 8 / 0 1   1 2 : 5 3 :".., 200) = 200
2654/71:         9.9095 write(64, " 1 8 / 0 1   1 2 : 5 3 :".., 200) = 200
2654/71:         9.9096 send(46, " #\n #   B u g # 3 8 6 9".., 2837, 0) Err#32 EPIPE
2654/71:         9.9096     Received signal #13, SIGPIPE [caught]
2654/71:         9.9097 setcontext(0xB4E7E054)
2654/71:         9.9098 send(46, " #\n #   B u g # 3 8 6 9".., 2837, 0) Err#32 EPIPE
2654/71:         9.9099     Received signal #13, SIGPIPE [caught]
2654/71:         9.9100 setcontext(0xB4E7E0E4)
2654/71:         9.9101 fcntl(10, F_DUP2FD, 0x0000002B)                 = 43
2654/71:         9.9101 close(43)                                       = 0
2654/71:         9.9108 read(46, 0xB4E7E5D0, 8192) Err#131 ECONNRESET
2654/71:         9.9109 read(46, 0xB4E7E5D0, 8192)                      = 0
2654/71:         9.9110 fcntl(15, F_DUP2FD, 0x0000002E)                 = 46
2654/71:         9.9111 close(46)                                       = 0
2654/71:         9.9112 so_socket(PF_INET6, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 43
2654/71:         9.9113 connect(43, 0xB4E804D0, 32, SOV_DEFAULT)        = 0
2654/71:         9.9113 getsockname(43, 0xB4E804D0, 0xB4E80508, SOV_DEFAULT) = 0
2654/71:         9.9114 setsockopt(43, tcp, TCP_NODELAY, 0xB4E80668, 4, SOV_DEFAULT) = 0
2654/71:         9.9115 send(43, " P U T   / f i l e / / h".., 505, 0) = 505
2654/71:        read(43, 0xB4E7E590, 8192)      (sleeping...)
2654/71:        29.9322 read(43, " H T T P / 1 . 1   5 0 0".., 8192)    = 59
2654/71:        29.9324 llseek(5, 0x02213F36, SEEK_SET) = 0x02213F36
2654/71:        29.9325 read(5, " P K0304\n\0\0\0\0\0 = r".., 30)       = 30
2654/71:        29.9326 llseek(5, 0x02213F7A, SEEK_SET) = 0x02213F7A
2654/71:        29.9326 read(5, "CAFEBABE\0\0\0 1\0A7\b\0".., 3524) = 3524
2654/71:        29.9328 llseek(5, 0x022066B4, SEEK_SET) = 0x022066B4
2654/71:        29.9328 read(5, " P K0304\n\0\0\0\0\0 < r".., 30)       = 30
2654/71:        29.9329 llseek(5, 0x022066F1, SEEK_SET) = 0x022066F1
2654/71:        29.9332 read(5, "CAFEBABE\0\0\0 1\0 h\b\0".., 2382) = 2382
2654/71:        29.9343 llseek(5, 0x02215342, SEEK_SET) = 0x02215342
2654/71:        29.9345 read(5, " P K0304\n\0\0\0\0\0 = r".., 30)       = 30
2654/71:        29.9346 llseek(5, 0x0221538D, SEEK_SET) = 0x0221538D
------------8<-----------------8<-----------------8<-----------------8<-----------------8<-----

We believe there are one or two errors here:
#1: Errors writing to network are not propagated correctly when transferTo() is used, so that the client is not informed when something wrong happens (transferTo returns correct complete count and we are allowed to flush and close connections). Maybe the system is more stressed with multiple streams/threads, and this happens under stress, maybe this is a thread safety issue.
#2: (maybe because of #1) some caches are not invalidated properly when sockets time out, and are reused when they should not be (The "retry does not help" issue).

If you want help to reproduce or test out patches, we can assist. 

Alan Bateman have tried to reproduce with this attached script, but has not succeeded in it, even though we can do it on any platform we try, jdk6u17 and jdk6u17. Maybe it is the parameters we pass? I run with 10 threads and 8000 files like this: java -jar jarfile 10 8000 test_file.test. With just one thread it happens very seldom or not at all.

The test file I use when reproducing is 2837 bytes. It could be that Alan tried to reproduce with a bigger file.

Alan Says:
> With jdk6 (original GA release from 2006) then it fails quickly with
> exceptions like this:
>
> Thr[15]: jag-file-transfer-thread-0 09:32:20.009 SEVERE SERVER: 500:
> Failed to write data to file '/tmp/./file/tmp/m123.22496.zip-18': Stream
> is closed, expected 444831 bytes.
> java.io.IOException: Stream is closed
> at
> sun.net.httpserver.FixedLengthInputStream.readImpl(FixedLengthInputStream.java:31)
>
> at
> sun.net.httpserver.LeftOverInputStream.drain(LeftOverInputStream.java:91)
> at
> sun.net.httpserver.LeftOverInputStream.close(LeftOverInputStream.java:48)
> at Reproduce$WebServer$FileHandler.handlePut(Reproduce.java:420)
> at Reproduce$WebServer$FileHandler.handle(Reproduce.java:334)

I say:
Yeah, this happens after 2xtimeout (1 timeout from transferFrom, one from close/drain). 


From my logs it looks like this for me:
- Client writes headers
- Client transferTo fails with EPIPE (don't know why, could be server issue)
- Client reconnects, retries, and sends headers, but not the transferTo/file/outputStream that I have written
- Client waits for response from Server
- Server reads headers, content length, and times out trying to read the file which is never written to the network
- During error handling (finally), server times out in close/drain, and that is the exception you see above.

There are two issues, I think:
1. EPIPE when multi threaded: HttpServer issue? (doesn't look like it from the truss logs - can't see find close() between writing headers and trying to write file content.) The way we are using the OS network layer? (strange, it fails on both OS-X and Solaris, but I am not able to trace on OS-X to check that we get EPIPE there too - dtruss makes java not run the code).
2. Retry doesn't send the output stream. Even with the EPOLLS happening some times (2-6 per 8000 file transfers with 10 threads), if the retry sent the output stream, we wouldn't have noticed. 

Michael McMahon forwarded my problem to net-dev (http://mail.openjdk.java.net/pipermail/net-dev/2010-January/thread.html) in January, but I didn't get any response there. Now I am filing the problem here, so that I get a error number and don't forget the issue. We stumbled upon it again yesterday after having turned down our copyer to just one thread.