Bug ID: JDK-6670408 testcase panics 1.5.0_12&_14 JVM when java.net.PlainSocketImpl trying to throw an exception

Type: Bug
Component: core-libs
Sub-Component: java.net
Affected Version: 5.0u14

Priority: P2
Status: Closed
Resolution: Fixed
OS: solaris
CPU: sparc

Submitted: 2008-03-03
Updated: 2011-05-18
Resolved: 2011-05-18

Versions (Unresolved/Resolved/Fixed)

Other	Other	Other	JDK 6	JDK 7
1.4.2_19-rev (Fixed)	1.4.2_20 (Fixed)	5.0u17 (Fixed)	6u11-rev (Fixed)	7 b29 (Fixed)

Related Reports

Relates :	JDK-6343810 - connect in java/net/PlainSocketImpl.c should handle EALREADY
Relates :	JDK-6382902 - VM interrupted I/O feature put on an option switch (sol)
Relates :	JDK-6680485 - Wrong error-handling with Solaris-specific interruptible I/O (Solaris)
Relates :	JDK-6704896 - FD_SET usage can cause stack corruption (sol)

Description

A customer's application crashes on 1.5.0_12 in various java.net.PlainSocketImpl functions,
whereas it ran fine on 1.5.0_11. The crashes do not follow a simple pattern.

The problem is easily reproducible.

Please run through the following steps:

1. Testcase
-----------
Please find attached the following test case:
5607 Mar 3 16:00 SocketTest.java
5328 Mar 3 16:02 cms_test_client.jar

Please note: you will need a web server running locally on port 80.

2. Run
------
java -classpath </path/to/>cms_test_client.jar testclient.SocketTest 256 5000 10.13

3. Crashes appear on 1.5.0_12, _13, _14, and _15
------------------------------------------------
[ ... ]
Got exception with localhost Invalid argument

Got 179 hanging threads

Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Active = 256 getCompletedTaskCount 35154 getTaskCount 64516 getPoolSize 256
[thread 266 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#

4. 1.5.0_11 is fine
-------------------
/data/jdk1.5.0_11/bin/java -classpath /net/redback.germany/data/38045863/testcase/cms_test_client.jar testclient.SocketTest 256 5000 10.13
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
[ ... ]
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
SucessFul localhost 80
Active = 0 getCompletedTaskCount 64516 getTaskCount 64516 getPoolSize 256
Finished
%

Comments

SUGGESTED FIX

--- PlainSocketImpl.c-	2008-05-08 22:54:05.296670972 +0400
+++ PlainSocketImpl.c	2008-05-08 22:54:05.192796472 +0400
@@ -345,15 +345,29 @@
          * See 6343810.
          */
         while (1) {
-            fd_set wr, ex;
+#ifndef USE_SELECT
+            {
+                struct pollfd pfd;
+                pfd.fd = fd;
+                pfd.events = POLLOUT;
+
+                errno = 0;
+                connect_rv = NET_Poll(&pfd, 1, -1);
+            }
+#else
+            {
+                fd_set wr, ex;

-            FD_ZERO(&wr);
-            FD_SET(fd, &wr);
-            FD_ZERO(&ex);
-            FD_SET(fd, &ex);
+                FD_ZERO(&wr);
+                FD_SET(fd, &wr);
+                FD_ZERO(&ex);
+                FD_SET(fd, &ex);
+
+                errno = 0;
+                connect_rv = NET_Select(fd+1, 0, &wr, &ex, 0);
+            }
+#endif

-            errno = 0;
-            connect_rv = NET_Select(fd+1, 0, &wr, &ex, 0);
             if (connect_rv == JVM_IO_ERR) {
                 if (errno == EINTR) {
                     continue;
09-05-2008
EVALUATION Yes, this is a clear and well-known limitation of the select system call. select should be replaced with poll in this case to avoid the limit of 1024 file descriptors. This would be preferable to redefining FD_SETSIZE. It looks like this issue is a direct result of the library changes for CR 6343810, and any fix for this CR should be backported to update releases where 6343810 has also been fixed.
08-05-2008
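To illustrate the evaluation above, here is a minimal sketch of waiting for a socket to become writable with poll() rather than select(). It is not the JDK code: wait_writable is a hypothetical helper name, and the sketch assumes a POSIX system. Because poll() takes the descriptor as a plain integer field, it does not depend on FD_SETSIZE.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the JDK): wait until fd is writable.
 * Returns 1 if poll() reported an event on fd, 0 on timeout, -1 on error.
 * poll() has no FD_SETSIZE restriction, so fd may be 1024 or higher.
 */
int wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rv;

    assert(fd >= 0);

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    /* Simplification: an interrupted call is restarted with the full timeout. */
    do {
        errno = 0;
        rv = poll(&pfd, 1, timeout_ms);
    } while (rv < 0 && errno == EINTR);

    return rv;
}
```

A caller completing a non-blocking connect would typically follow a positive return with a getsockopt(SO_ERROR) check to learn whether the connection actually succeeded.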
EVALUATION

Quoting Steve Goldman on this.

-------- Original Message --------
Subject: Re: 6670408: testcase panics 1.5.0_12&_14 JVM when java.net.PlainSocketImpl trying to throw an exception
Date: Tue, 06 May 2008 15:19:04 -0400
From: steve goldman <###@###.###>

Ok I found the bug. Dave Dice surmised the problem on Friday. So the problem is in this code in PlainSocketImpl.c:

    while (1) {
        fd_set wr, ex;

        FD_ZERO(&wr);
        FD_SET(fd, &wr);
        FD_ZERO(&ex);
        FD_SET(fd, &ex);

The fd goes well past the end of the bitvectors wr/ex. The limit on the size on 32 bits is 1024 bits. If I truss the program I see it get socket descriptors well past 1024. It finally trips my memory protection check when it was around 3000. If I hadn't messed up my protection code I would have found this on Friday. I looked at the java/io/FileDescriptor and the fd is in fact too large for the statically allocated bitmap.
07-05-2008
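The overflow described in the message above can be guarded against with a range check before FD_SET. The following is only a sketch under that assumption: fd_set_checked is a hypothetical helper, not a function from the JDK or from libc, and the fix suggested in this report switches to poll instead.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/*
 * Hypothetical helper: add fd to *set only if the fd_set can represent it.
 * Returns 0 on success, -1 if fd is outside [0, FD_SETSIZE).
 * Calling FD_SET with fd >= FD_SETSIZE writes past the end of the bitmap,
 * which is the stack corruption described in the evaluation.
 */
int fd_set_checked(int fd, fd_set *set)
{
    assert(set != NULL);

    if (fd < 0 || fd >= FD_SETSIZE) {
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}
```

A caller that receives -1 has to fall back to another mechanism such as poll, since the descriptor itself is valid and only the fd_set representation is too small.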
EVALUATION The fix above needs to go into Hotspot as a separate bug, but it isn't relevant to this problem. This problem is about something going wrong with Hotspot when the network code tries to throw an exception.
26-03-2008
EVALUATION Instead we may need to verify a first system call actually happened and actually got interrupted before a 2nd connect is attempted on Solaris.
11-03-2008
SUGGESTED FIX

Here's an alternative suggested fix (see comments and eval) using the 1.5.0_12 source:

--- ../old/os_solaris.inline.hpp	Tue Mar 11 17:40:38 2008
+++ os_solaris.inline.hpp	Tue Mar 11 18:07:55 2008
@@ -89,10 +89,11 @@
   _setup; \
   _before; \
   OSThread* _osthread = _thread->osthread(); \
   if (_thread->has_last_Java_frame()) { \
     /* this is java interruptible io stuff */ \
+    errno = 0; \
     if ((os::is_interrupted(_thread, _clear)) \
         || ((_cmd) < 0 && errno == EINTR \
             && os::is_interrupted(_thread, _clear))) { \
       _result = OS_INTRPT; \
     } \
--- ../old/hpi_solaris.hpp	Tue Mar 11 17:40:37 2008
+++ hpi_solaris.hpp	Tue Mar 11 18:07:38 2008
@@ -75,11 +75,11 @@
     prevtime = ((julong)t.tv_sec * 1000) + t.tv_usec / 1000;
     for(;;) {
       INTERRUPTIBLE_NORESTART(::poll(&pfd, 1, timeout), res, os::Solaris::clear_interrupted);
-      if(res == OS_ERR && errno == EINTR) {
+      if(res < 0 && errno == EINTR) {
         gettimeofday(&t, &aNull);
         newtime = ((julong)t.tv_sec * 1000) + t.tv_usec /1000;
         timeout -= newtime - prevtime;
         if(timeout <= 0)
           return OS_OK;
11-03-2008
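The two ideas in the patch above (clear errno before the call so a stale EINTR is never misread, and shrink the timeout after a genuine interruption) can be sketched in plain C. This is an illustration only, not the HotSpot code: poll_with_deadline and now_ms are hypothetical names, and unlike the quoted hunk this sketch also advances the previous timestamp on each retry.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/* Current wall-clock time in milliseconds (hypothetical helper). */
static long long now_ms(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (long long)t.tv_sec * 1000 + t.tv_usec / 1000;
}

/*
 * Hypothetical helper: poll one descriptor for at most timeout_ms.
 * Returns >0 if an event is pending, 0 on timeout, -1 on a non-EINTR error.
 */
int poll_with_deadline(int fd, short events, int timeout_ms)
{
    struct pollfd pfd;
    long long prev = now_ms();

    assert(timeout_ms >= 0);
    pfd.fd = fd;
    pfd.events = events;

    for (;;) {
        int res;

        pfd.revents = 0;
        errno = 0;                      /* discard any stale EINTR */
        res = poll(&pfd, 1, timeout_ms);
        if (res < 0 && errno == EINTR) {
            /* Interrupted: charge the elapsed time and retry. */
            long long now = now_ms();
            timeout_ms -= (int)(now - prev);
            prev = now;
            if (timeout_ms <= 0) {
                return 0;               /* budget used up: report timeout */
            }
            continue;
        }
        return res;
    }
}
```

Testing `res < 0` rather than a specific error constant mirrors the second hunk: the raw poll() result is what indicates failure, and errno is only meaningful once that has been established.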
EVALUATION Need to check the UseVMInterruptibleIO value before testing _result against OS_INTRPT.
05-03-2008
SUGGESTED FIX

The following two changes were red herrings: the third change to the socket impl code is the real fix for this CR. See notes for the history.

--- src/os/solaris/vm/hpi_solaris.hpp-	2007-05-15 22:29:42.012602000 +0400
+++ src/os/solaris/vm/hpi_solaris.hpp	2008-03-05 16:52:06.950605000 +0300
@@ -104,7 +104,10 @@
                 os::Solaris::clear_interrupted);
   // Depending on when thread interruption is reset, _result could be
   // one of two values when errno == EINTR
-  if (((_result == OS_INTRPT) || (_result == OS_ERR)) && (errno == EINTR)) {
+  if ((UseVMInterruptibleIO == true &&
+       _result == OS_ERR && errno == EINTR) ||
+      (UseVMInterruptibleIO == false &&
+       ((_result == OS_INTRPT || _result == OS_ERR) && errno == EINTR))) {
     /* restarting a connect() changes its errno semantics */
     INTERRUPTIBLE(::connect(fd, him, len), _result,
                   os::Solaris::clear_interrupted);
05-03-2008