Bug ID: JDK-6670408 testcase panics 1.5.0_12&_14 JVM when java.net.PlainSocketImpl trying to throw an exception

Details
Type: Bug
Submit Date: 2008-03-03
Status: Closed
Updated Date: 2011-05-18
Project Name: JDK
Resolved Date: 2011-05-18
Component: core-libs
OS: solaris
Sub-Component: java.net
CPU: sparc
Priority: P2
Resolution: Fixed
Affected Versions: 5.0u14
Fixed Versions:

Description
The customer's application crashes on 1.5.0_12 in various java.net.PlainSocketImpl
functions, whereas it ran fine on 1.5.0_11. There is no simple crash pattern.

The problem is easily reproducible.

Please run through the following steps:

1. Testcase
-----------
Please find attached the following test case:
 5607 Mar  3 16:00 SocketTest.java
 5328 Mar  3 16:02 cms_test_client.jar

Please note: you will need a web server running locally on port 80.


2. Run
------
java -classpath </path/to/>cms_test_client.jar testclient.SocketTest 256 5000 10.13

3. Crashes appear on 1.5.0_12, _13, _14, and _15
------------------------------------------------
[ ... ]
Got exception with localhost Invalid argument

 Got 179 hanging threads 

Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Got exception with localhost Invalid argument
Active = 256 getCompletedTaskCount 35154 getTaskCount 64516 getPoolSize 256
[thread 266 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#


4. 1.5.0_11 is fine
-------------------
/data/jdk1.5.0_11/bin/java -classpath /net/redback.germany/data/38045863/testcase/cms_test_client.jar testclient.SocketTest 256 5000 10.13
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
[ ... ]
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
SucessFul localhost  80
Active = 0 getCompletedTaskCount 64516 getTaskCount 64516 getPoolSize 256
Finished
%

                                    

Comments
EVALUATION

Need to check the UseVMInterruptibleIO value before testing _result against OS_INTRPT.
                                     
2008-03-05
SUGGESTED FIX

The following two changes were red herrings: the third change to the socket impl code is the real fix for this CR. See notes for the history.

--- src/os/solaris/vm/hpi_solaris.hpp-  2007-05-15 22:29:42.012602000 +0400
+++ src/os/solaris/vm/hpi_solaris.hpp   2008-03-05 16:52:06.950605000 +0300
@@ -104,7 +104,10 @@
                             os::Solaris::clear_interrupted);
     // Depending on when thread interruption is reset, _result could be
     // one of two values when errno == EINTR
-    if (((_result == OS_INTRPT) || (_result == OS_ERR)) && (errno == EINTR)) {
+    if ((UseVMInterruptibleIO == true &&
+         _result == OS_ERR && errno == EINTR) ||
+       (UseVMInterruptibleIO == false &&
+         ((_result == OS_INTRPT || _result == OS_ERR) && errno == EINTR))) {
       /* restarting a connect() changes its errno semantics */
         INTERRUPTIBLE(::connect(fd, him, len), _result,
                       os::Solaris::clear_interrupted);
                                     
2008-03-05
EVALUATION

Instead we may need to verify a first system call actually happened and actually got interrupted before a 2nd connect is attempted on Solaris.
                                     
2008-03-11
SUGGESTED FIX

Here's an alternative suggested fix (see comments and eval) using the 1.5.0_12 source:

--- ../old/os_solaris.inline.hpp        Tue Mar 11 17:40:38 2008
+++ os_solaris.inline.hpp       Tue Mar 11 18:07:55 2008
@@ -89,10 +89,11 @@
   _setup; \
   _before; \
   OSThread* _osthread = _thread->osthread(); \
   if (_thread->has_last_Java_frame()) { \
     /* this is java interruptible io stuff */ \
+      errno = 0; \
       if ((os::is_interrupted(_thread, _clear)) \
        || ((_cmd) < 0 && errno == EINTR \
          && os::is_interrupted(_thread, _clear))) { \
         _result = OS_INTRPT; \
       } \


--- ../old/hpi_solaris.hpp      Tue Mar 11 17:40:37 2008
+++ hpi_solaris.hpp     Tue Mar 11 18:07:38 2008
@@ -75,11 +75,11 @@
   prevtime = ((julong)t.tv_sec * 1000)  +  t.tv_usec / 1000;

   for(;;) {

     INTERRUPTIBLE_NORESTART(::poll(&pfd, 1, timeout), res, os::Solaris::clear_interrupted);
-    if(res == OS_ERR && errno == EINTR) {
+    if(res < 0 && errno == EINTR) {
        gettimeofday(&t, &aNull);
        newtime = ((julong)t.tv_sec * 1000)  +  t.tv_usec /1000;
        timeout -= newtime - prevtime;
        if(timeout <= 0)
          return OS_OK;
                                     
2008-03-11
EVALUATION

The fix above needs to go into Hotspot as a separate bug, but it isn't relevant to this problem. This problem is about something going wrong with Hotspot when the network code tries to throw an exception.
                                     
2008-03-26
EVALUATION

Quoting Steve Goldman on this.

-------- Original Message --------
Subject: Re: 6670408: testcase panics 1.5.0_12&_14 JVM when java.net.PlainSocketImpl trying to throw an exception
Date: Tue, 06 May 2008 15:19:04 -0400
From: steve goldman <###@###.###>

Ok I found the bug. Dave Dice surmised the problem on Friday. So the 
problem is in this code

PlainSocketImpl.c

            while (1) {
                fd_set wr, ex;

                FD_ZERO(&wr);
                FD_SET(fd, &wr);
                FD_ZERO(&ex);
                FD_SET(fd, &ex);


The fd goes well past the end of the bit vectors wr/ex. The size limit 
on 32-bit is 1024 bits. If I truss the program I see it get socket 
descriptors well past 1024. It finally trips my memory protection check 
when it was around 3000. If I hadn't messed up my protection code I 
would have found this on Friday.

I looked at the java/io/FileDescriptor and the fd is in fact too large 
for the statically allocated bitmap.
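
The overflow described above comes from FD_SET performing no bounds check: a descriptor at or beyond FD_SETSIZE sets a bit outside the fd_set. A minimal sketch of a range check, assuming a POSIX system; fd_fits_in_fd_set and checked_fd_set are hypothetical helper names and are not part of the JDK sources:

```c
#include <stdbool.h>
#include <sys/select.h>

/* True only when fd can be stored in an fd_set without writing
 * outside the structure. */
static bool fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Adds fd to *set, or returns -1 when fd is out of range.
 * Calling FD_SET directly with such an fd would touch memory
 * next to *set, for example other variables on the stack. */
static int checked_fd_set(int fd, fd_set *set)
{
    if (!fd_fits_in_fd_set(fd)) {
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}
```

A check like this only turns the memory corruption into a reportable error; it does not lift the descriptor limit itself.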
                                     
2008-05-07
EVALUATION

Yes, this is a clear and well known problem/limitation with the select system call. select should be replaced with poll in this case to avoid the limitation of 1024 file descriptors. This would be the preferred solution rather than defining FD_SETSIZE.

It looks like this issue is a direct result of the library changes for CR 6343810, and any fix for this CR should be backported to update releases where 6343810 has also been fixed.
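
For illustration, waiting for a descriptor to become writable with poll() might look like the sketch below. It assumes a POSIX system and uses a hypothetical helper name, wait_writable; the JDK's own NET_Poll wrapper is not reproduced here. Because poll() takes the descriptor as a plain int in a struct pollfd, the descriptor's numeric value is not limited by FD_SETSIZE.

```c
#include <errno.h>
#include <poll.h>

/* Waits until fd is writable or in an error state.
 * Returns the revents mask (> 0) when poll() reports an event,
 * 0 on timeout, or -1 with errno set on failure.
 * A timeout_ms of -1 waits indefinitely. In this sketch the timeout
 * starts over whenever poll() is interrupted by a signal. */
static int wait_writable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rv;

    pfd.fd = fd;
    pfd.events = POLLOUT;

    do {
        pfd.revents = 0;
        rv = poll(&pfd, 1, timeout_ms);
    } while (rv < 0 && errno == EINTR);   /* retry after a signal */

    if (rv <= 0) {
        return rv;                        /* 0 = timeout, -1 = error */
    }
    return pfd.revents;                   /* may include POLLERR or POLLHUP */
}
```

A caller waiting on a non-blocking connect would still need to inspect the result, for example via getsockopt with SO_ERROR, since writability alone does not prove the connection succeeded.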
                                     
2008-05-08
SUGGESTED FIX

--- PlainSocketImpl.c-	2008-05-08 22:54:05.296670972 +0400
+++ PlainSocketImpl.c	2008-05-08 22:54:05.192796472 +0400
@@ -345,15 +345,29 @@
              * See 6343810.
              */
             while (1) {
-                fd_set wr, ex;
+#ifndef USE_SELECT
+                {
+                    struct pollfd pfd;
+                    pfd.fd = fd;
+                    pfd.events = POLLOUT;
+
+                    errno = 0;
+                    connect_rv = NET_Poll(&pfd, 1, -1);
+                }
+#else
+                {
+                    fd_set wr, ex;
 
-                FD_ZERO(&wr);
-                FD_SET(fd, &wr);
-                FD_ZERO(&ex);
-                FD_SET(fd, &ex);
+                    FD_ZERO(&wr);
+                    FD_SET(fd, &wr);
+                    FD_ZERO(&ex);
+                    FD_SET(fd, &ex);
+
+                    errno = 0;
+                    connect_rv = NET_Select(fd+1, 0, &wr, &ex, 0);
+                }
+#endif
 
-                errno = 0;
-                connect_rv = NET_Select(fd+1, 0, &wr, &ex, 0);
                 if (connect_rv == JVM_IO_ERR) {
                     if (errno == EINTR) {
                         continue;
                                     
2008-05-09


