Bug ID: JDK-4425033 TCP test 'HalfWriteIgnored' of sqe test suite failed in full look

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 1.3.1,1.4.0

Priority: P3
Status: Closed
Resolution: Duplicate
OS: windows_nt
CPU: x86

Submitted: 2001-03-13
Updated: 2001-09-07
Resolved: 2001-09-07

TCP test 'HalfWriteIgnored' of sqe test suite failed in full look when server on Win 2000/WinNT/Win98 and client on Solaris 8/Sol8 x86

In this test the client ('HalfWriteIgnored') writes a string to TcpServer and
then waits to read from the server. The server just writes a single NUL byte
and closes its output stream. When the client then reads from its input
stream, it throws an IOException with 'Broken pipe' in the message.

How to reproduce it:
1. Get tests from /net/sqesvr/export/vsn/NET/merlin_jdk_net_promoted

2. Set JAVA_HOME to jdk1.4 and STABLE_JAVA_HOME to jdk1.3

3. Execute server on Win NT/Win 2000 machine with full look

/tests/tcp/run_tcp_server -full

4. Execute at least 4 clients on Sol 8/Sol8 x86 machines

/tests/tcp/run_tcp_client -full -1 serverName

5. Check the result logs created in the same folder that contains 'tests'.

Other Information:

I got the following messages:

VERBOSE: byte write threw 'Broken pipe' IO exception.
VERBOSE: loop #39: connecting to duotronic:25000
DEBUG: Sending first line to server.
DEBUG: Waiting for server NUL byte.
ERROR: Cannot read server NUL byte.
ERROR: Interrupted system call
FINALSTATUS:HalfWriteIgnored:EXIT_ERROR:2:Number of ERRORS:1:TEST INCOMPLETE

pratul.wadher@Eng 2001-03-30

Ladybird build 20

TCP test 'HalfWriteIgnored' of sqe test suite failed in full look when server on Win NT and clients on Solaris 8/Sol2.6 x86

How to reproduce it:
1. Get tests from /net/sqesvr/export/vsn/NET/ladybird_jdk_net

2. Set JAVA_HOME to jdk1.3.1 (ladybird build 20 ) and STABLE_JAVA_HOME to jdk1.3

3. Execute server on Win NT machine with full look

/tests/tcp/sh run_tcp_server -full

4. Execute at least 4 clients on Sol 8/Sol2.6 x86 machine

/tests/tcp/run_tcp_client -full -1 serverName

5. Check the result logs created in the same folder under 'tests'.

The following result was displayed:

ERROR: Cannot read server NUL byte.
ERROR: Interrupted system call
FINALSTATUS:HalfWriteIgnored:EXIT_ERROR:2:Number of ERRORS:1:TEST INCOMPLETE

pratul.wadher@Eng 2001-04-02

I have modified run_tcp_client to include only the failing test and attached the resulting file, run_tcp_bug, to this report. Use it to run this particular test.

SUGGESTED FIX

Make change in merlin.nightly/src/os/solaris/vm/hpi_solaris.hpp
Suggested fix came from Alan Bateman. Validated fix with testcase specified in bug report.

    inline int hpi::timeout(int fd, long timeout) {
      struct timeval tv;
      long prevTime;
      static const char* aNull = 0;

    #ifndef USE_SELECT
      struct pollfd pfd;
      pfd.fd = fd;
      pfd.events = POLLIN;
    #else
      fd_set tbl;
      struct timeval select_tv;
      select_tv.tv_sec = timeout / 1000;
      select_tv.tv_usec = (timeout % 1000) * 1000;
      FD_ZERO(&tbl);
      FD_SET(fd, &tbl);
    #endif

      // need time in case we must restart
      gettimeofday(&tv, &aNull);
      prevTime = ((tv.tv_sec * 1000) + (tv.tv_usec / 1000));

      while (1) {
        int res;
        long newTime;

    #ifndef USE_SELECT
        INTERRUPTIBLE(::poll(&pfd, 1, timeout), res,
                      os::Solaris::clear_interrupted);
    #else
        INTERRUPTIBLE(::select(fd + 1, &tbl, 0, 0, &select_tv), res,
                      os::Solaris::clear_interrupted);
    #endif

        if (res != OS_ERR) {
          return res;
        }
        if (errno != EINTR) {
          return res;
        }

        // adjust timeout for restart
        gettimeofday(&tv, &aNull);
        newTime = ((tv.tv_sec * 1000) + (tv.tv_usec / 1000));
        timeout -= (newTime - prevTime);
        if (timeout <= 0) {
          return 0;
        }
        prevTime = newTime;
      }
    }

gary.collins@East 2001-04-16

16-04-2001

EVALUATION

What happens is that, under some specific circumstances, we get the following exception:

    java.net.SocketException: Interrupted system call
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:119)
        at java.net.SocketInputStream.read(SocketInputStream.java:147)
        at java.io.FilterInputStream.read(FilterInputStream.java:69)
        at HalfWriteIgnored.doWriteChecks(HalfWriteIgnored.java:185)
        at HalfWriteIgnored.main(HalfWriteIgnored.java:110)

This is generated because JVM_Timeout(fd, timeout) returns JVM_IO_ERR with errno == 4 (Interrupted system call). This happens long before the actual timeout value is reached. Hotspot used to call sysTimeout(fd, timeout), but doesn't anymore, so I can't really investigate further. There is now an hpi::timeout() call. It seems that it should handle the interrupted system call better, maybe by retrying.

Note: I have been able to reproduce this bug only when the client runs on Solaris and the server on Windows 2000. I'm reassigning the bug to the hotspot group.

jean-christophe.collet@Eng 2001-03-29

This bug alluded to a problem in hotspot because hotspot changed the interface:

    return hpi::timeout(....)

This interface does the following: it used to

    return os::jvm_timeout(fd,timeout);

SCCS comment: todo: worry about interruptable io

It makes a call to the native libraries' machine-dependent code. This is where we should be handling such retry issues. Please be aware that adding retries could overload a server machine and cause the system to stop all new connections and wait. On Solaris we handle polling a little better, and that is why you don't see these issues there, though they are possible. This is why the current implementation fails for PC. I thought the NIO code was supposed to fix such socket connection problems with polling. On win32 we may have a race condition on initialization.

Please look at the code:

    ./solaris/hpi/native_threads/src/sys_api_td.c:sysTimeout(int fd, long timeout) {
    ./win32/hpi/src/socket_md.c:sysTimeout(int fd, long timeout) {

Looking through the win32 code, it appears that we don't handle interrupts like we do on Solaris. From looking at the native libraries code, I would say that we may have a bug there instead. Please have the JDK libraries team look into this further and change the status of the bug back to the JDK libraries (java\classes_io or classes_net). I could be way off base, and if that is true then hotspot will look into this further. At this time we should be looking at the code mentioned above to see if we are handling interruptible I/O correctly.

gary.collins@East 2001-04-04

---

The issue is that poll is being interrupted by a SIGPIPE and isn't being restarted automatically, as poll isn't one of the system functions that get restarted; see sigaction(2) regarding SA_RESTART. It appears that this issue has been in JDK/J2SE for many years. This should get fixed in beta refresh.

alan.bateman@ireland 2001-04-05

05-04-2001
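
For readers outside the HotSpot tree, the retry-on-EINTR idea in the suggested fix can be sketched in plain C. This is only an illustration under stated assumptions, not the code that was integrated: the function name wait_readable is hypothetical, the HotSpot-specific INTERRUPTIBLE macro and os::Solaris::clear_interrupted are left out, and only the poll branch is shown.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/time.h>

/* Current wall-clock time in milliseconds, computed as in the suggested fix. */
static long now_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long)tv.tv_sec * 1000L + (long)(tv.tv_usec / 1000);
}

/*
 * Hypothetical helper (not a HotSpot API): wait up to timeout_ms for fd to
 * become readable. If poll() is interrupted by a signal (EINTR), restart it
 * with the remaining time instead of reporting an error to the caller.
 * Returns >0 if fd is ready, 0 on timeout, -1 on any error other than EINTR.
 */
int wait_readable(int fd, long timeout_ms) {
    struct pollfd pfd;
    long prev = now_ms();

    pfd.fd = fd;
    pfd.events = POLLIN;

    for (;;) {
        int res = poll(&pfd, 1, (int)timeout_ms);
        if (res >= 0 || errno != EINTR) {
            return res;              /* ready, timed out, or a real error */
        }

        /* Interrupted: subtract the time already spent, then retry. */
        long now = now_ms();
        timeout_ms -= (now - prev);
        if (timeout_ms <= 0) {
            return 0;                /* budget used up: report a timeout */
        }
        prev = now;
    }
}
```

Tracking the remaining time keeps the overall wait bounded by the caller's timeout even if signals arrive repeatedly, which speaks to the concern above about retries making a server wait indefinitely.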