Bug ID: JDK-6447412 Issue with socket.close() for ssl sockets when poweroff on other system


Type: Bug
Component: security-libs
Sub-Component: javax.net.ssl
Affected Version: 5.0u7,6u2

Priority: P2
Status: Closed
Resolution: Fixed
OS: linux_redhat_6.1
CPU: generic

Submitted: 2006-07-11
Updated: 2011-05-18
Resolved: 2011-05-18

Versions (Unresolved/Resolved/Fixed)


Other	Other	Other	JDK 6	JDK 7	Other
1.4-pool,OpenJDK6 Resolved	1.4.2_18-rev Fixed	1.4.2_19 Fixed	6u2 Fixed	7 b25 Fixed	OpenJDK6 Fixed

Related Reports

Relates :	JDK-2151940 - Two 1.4.2 JNDI NONBLITS testcases fail: Unsupported ciphersuite SSL_RSA_WITH_RC4_128_MD5
Relates :	JDK-6668261 - Appli. hangs because SSLSocket in Client side can not be closed

Description

Synopsis : Issue w/ socket.close() for ssl sockets when poweroff on other system.


FULL PRODUCT VERSION :

java version "1.4.2_09"

Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05)

Java HotSpot(TM) Client VM (build 1.4.2_09-b05, mixed mode)

FULL OPERATING SYSTEM VERSION :

2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux

Red Hat Enterprise Linux AS release 3 (Taroon Update 5)

glibc-common-2.3.2-95.33

glib2-2.2.3-2.0

glibc-headers-2.3.2-95.33

glibc-2.3.2-95.33

glib-1.2.10-11.1

glibc-kernheaders-2.4-8.34.1

glibc-devel-2.3.2-95.33


DESCRIPTION OF THE PROBLEM:

    Socket.close() on an SSL socket on Linux hangs if a computer is abruptly removed from the network because of a power failure or shutdown -f.  The socket.close() method hangs for 30 minutes or more and then returns normally.  If the computer is powered back on, the method completes as soon as the MAC address is rediscovered.  The same scenario works on Solaris.

Modifying the TCP parameters (soTimeout / SoLinger) has no effect, and shutdownInput and shutdownOutput (which may be a workaround for normal TCP sockets) are not implemented on SSL sockets.

I've tested this with 1.4 (up to Rel 11) and 1.5 (up to Rel 6).

I also tried loading the software on the 1.6 beta; however, there are issues loading the JSSE methods.

I see from bug report 4726957 that a similar problem may be fixed in Mustang (1.6)?

Please advise on what data is required.

Whatever the fix is, we need it released in JDK 1.4.2.


> Tested Bug ID # 4726957 and it does now pass with 1.5.0_07. 
Finished the port and tested the software with 1.5.0_07, however it does not fix the problem.  
Here is the thread dump of the thread that appears to be locking up.  
I've cut out our code, but we are calling the close.
>
> "DefaultTaskRunnerGroup Priority 10 TaskRunner 2" prio=1 tid=0x87c26150 nid=0x7b0c waiting for monitor entry [0x863cf000..0x863cf6f0]
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:662)
>
>             - waiting to lock <0x8ed1d688> (a java.lang.Object)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.sendAlert(SSLSocketImpl.java:1622)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.warning(SSLSocketImpl.java:1475)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.closeInternal(SSLSocketImpl.java:1315)
>
>             - locked <0x8eb15b68> (a com.sun.net.ssl.internal.ssl.SSLSocketImpl)
>
>             - locked <0x8ed1d698> (a java.lang.Object)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.close(SSLSocketImpl.java:1219)
>

==================
Available Dumps :
==================

On machine titan.sfbay.sun.com at location /home/sd158479/Nortel/830777
File Name : 

-rwxrwxrwx   1 sd158479 staff    82128021 Jun 22 16:17 nt42sun.zip
-rwxrwxrwx   1 sd158479 staff    82128021 Jun 22 16:16 nt4sun.zip
-rwxrwxrwx   1 sd158479 staff         10 Jun 22 16:18 nt4sun2.zip


> The file contains :
>
> Jcore (java core file)
    and td.txt (java thread dump)
>
>  
>
From the Thread Dump it would appear that the problem is 
with com.sun.net.ssl.internal.ssl.SSLSocketImpl.close() instead of socket.close().
>
> Also again:
>
>             This only happens on Linux. It works perfectly on Solaris.
>                         And
>             The case where it fails is when you "pull the plug" on the server and then try to close() the connection on the client.
>
>             If you gracefully shutdown the server (init 0 for instance) everything works fine.

>
> In the thread dump look for:
>
> "DefaultTaskRunnerGroup Priority 10 TaskRunner 2" prio=1 tid=0x87c26150 nid=0x7b0c waiting for monitor entry [0x863cf000..0x863cf6f0]
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:662)
>
>             - waiting to lock <0x8ed1d688> (a java.lang.Object)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.sendAlert(SSLSocketImpl.java:1622)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.warning(SSLSocketImpl.java:1475)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.closeInternal(SSLSocketImpl.java:1315)
>
>             - locked <0x8eb15b68> (a com.sun.net.ssl.internal.ssl.SSLSocketImpl)
>
>             - locked <0x8ed1d698> (a java.lang.Object)
>
>             at com.sun.net.ssl.internal.ssl.SSLSocketImpl.close(SSLSocketImpl.java:1219)

 That is the thread that is blocked waiting on the close().

Comments

EVALUATION The networking team will not support a write timeout because of its complexity, so the fix for JDK 7 is the same as for JDK 6.
17-03-2008
EVALUATION --Christopher.Hegarty-- Solution: The general consensus was that we need to change the lock around writing records to something like a java.util.concurrent.locks.ReentrantLock. That way we could have something like: Thread A calls close and tries to acquire the write lock. If it cannot acquire the lock within SoLinger, then closeSocket.
--Xuelei.Fan-- I have run into a corner case on this issue. The following scenario seems fine:
    if ( get the write lock in SO_LINGER time) {
        // send the SSL/TLS required data, and then [*]
        // close the socket.
    } else {
        // close the socket immediately
    }
But the above scenario only works when the output stream is already blocked or there is enough buffer space left to hold the SSL/TLS required data. There is a risk that the SSL/TLS data will fill the output stream and block indefinitely, so it does not solve the issue. One possible workaround is for the customer to set SO_LINGER to zero, so that the socket closes immediately without any further action. But that means the application is in danger of losing data in the normal situation, which I don't think is acceptable.
--Christopher.Hegarty-- After discussing the issue that requires the send buffer to be increased:
    setSendBufferSize(getSendBufferSize() + Record.maxAlertRecordSize);
As I mentioned when suggesting that you try this workaround, it is not guaranteed to always work. For example, the buffer may already be as large as possible. We (the Networking team) feel that unless the customer is specifically encountering this issue, adding this workaround is not a good idea. It will only make a corner case more obscure and harder to diagnose. My assumption is that this is an escalated bug and will be required to be backported. The fix without increasing the send buffer should be sufficient for 99.9% of cases. For Java SE 7 we are looking at possibly adding a write timeout to Socket, and if this happens then you could use this to avoid the above problem.
--Xuelei.Fan-- We will use two different fixes for tiger/mustang and jdk7. For tiger/mustang, the reentrant lock is used to wait for the so_linger timeout; for jdk7, we will try to address the issue with a write timeout, which is a much more stable and reliable solution.
19-01-2007
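The lock-with-timeout idea discussed in this evaluation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual SSLSocketImpl change: the class LingerBoundedClose, its fields and its helper methods are hypothetical names, the blocked writer is simulated with a latch rather than a real socket, and sending close_notify and closing the transport are reduced to boolean flags.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model only: not the real SSLSocketImpl.
final class LingerBoundedClose {
    private final ReentrantLock writeLock = new ReentrantLock();
    private boolean closeNotifySent;
    private boolean transportClosed;

    // Simulates a writer stuck in write(): it takes the write lock and
    // keeps it until 'release' is counted down.
    void startBlockedWriter(CountDownLatch release) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread writer = new Thread(() -> {
            writeLock.lock();
            try {
                lockHeld.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                writeLock.unlock();
            }
        }, "blocked-writer");
        writer.setDaemon(true);
        writer.start();
        lockHeld.await(); // return only once the writer owns the lock
    }

    // Waits at most lingerMillis for the write lock. Returns true if the
    // close_notify placeholder ran, false if the wait timed out.
    boolean close(long lingerMillis) {
        try {
            if (writeLock.tryLock(lingerMillis, TimeUnit.MILLISECONDS)) {
                try {
                    closeNotifySent = true; // placeholder: write the alert record
                } finally {
                    writeLock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        transportClosed = true; // placeholder: close the underlying socket
        return closeNotifySent;
    }

    boolean isTransportClosed() {
        return transportClosed;
    }

    // Demo helper: close() against a stuck writer gives up after the linger period.
    static boolean closeNotifySentWhileWriterBlocked(long lingerMillis) {
        LingerBoundedClose conn = new LingerBoundedClose();
        CountDownLatch release = new CountDownLatch(1);
        try {
            conn.startBlockedWriter(release);
            return conn.close(lingerMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the simulated writer finish
        }
    }

    public static void main(String[] args) {
        System.out.println("idle: " + new LingerBoundedClose().close(200));
        System.out.println("blocked: " + closeNotifySentWhileWriterBlocked(200));
    }
}
```

In this model close() always returns within roughly the linger period, whether or not the alert could be sent. As noted in the discussion above, a real implementation still carries the risk that writing the alert itself blocks once the lock has been acquired; the sketch does not address that case.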
EVALUATION The Socket.shutdownOutput() spec says that "For a TCP socket, any previously written data will be sent followed by TCP's normal connection termination sequence." That means that implementing shutdownOutput() does nothing to solve the issue, because the previously written data will continue to be blocked and close() will still have to wait for the socket to unblock.
23-11-2006
EVALUATION When the server is abruptly removed from the network, the client may fail to get any alert, so the socket stays alive (I think a Solaris system may get the report within a short time, while Linux does NOT get any information about the break even after 30 minutes, as in the description). Until the client learns of the break, it will continue to send messages to the server if there is any data; once the socket output buffer is full, send()/write() blocks. For SSL, calling close() requires sending a close_notify alert before closing the write side of the connection. Because the socket is blocked, the close_notify has to wait in line, so close() will not return until the socket is unblocked. The ideal solution would be for the client to get an alert shortly after the connection breaks, as Solaris does (if the bug only happens on Linux). Alternatively, we will think about a workaround using SoLinger or shutdownOutput.
22-11-2006
EVALUATION Managed to get the same stack trace from a simple test, attached. The SSL spec states that "...Each party is required to send a close_notify alert before closing the write side of the connection." So when an SSLSocket is closed asynchronously, the already blocked writing thread holds a write lock, which the closing thread also wants to acquire in order to send the close_notify message. Hence the deadlock.
13-10-2006
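The lock contention described in this evaluation can be reproduced in miniature without any sockets. The sketch below is illustrative only: BlockedCloseDemo and its method are hypothetical names, the write lock is assumed to be a plain Java monitor (as the "a java.lang.Object" entries in the thread dump suggest), and the stuck write() is simulated with a latch. It shows only that the closing thread ends up in the BLOCKED state, which a thread dump reports as "waiting for monitor entry".

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo: a "writer" holds the write monitor while a "closer"
// tries to enter it, as in the reported thread dump.
final class BlockedCloseDemo {

    // Returns the closer thread's state while the writer still holds the monitor.
    static Thread.State closerStateWhileWriterHoldsLock() {
        final Object writeLock = new Object();
        final CountDownLatch writerHasLock = new CountDownLatch(1);
        final CountDownLatch releaseWriter = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            synchronized (writeLock) {
                writerHasLock.countDown();
                try {
                    releaseWriter.await(); // models write() stuck on a full send buffer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "writer");

        Thread closer = new Thread(() -> {
            synchronized (writeLock) {
                // placeholder: the close_notify alert would be written here
            }
        }, "closer");

        writer.setDaemon(true);
        closer.setDaemon(true);
        try {
            writer.start();
            writerHasLock.await();
            closer.start();
            // Poll for up to two seconds until the closer is parked on the monitor.
            long deadline = System.nanoTime() + 2_000_000_000L;
            while (closer.getState() != Thread.State.BLOCKED
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            return closer.getState();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return closer.getState();
        } finally {
            releaseWriter.countDown(); // unblock the writer so both threads can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("closer state: " + closerStateWhileWriterHoldsLock());
    }
}
```

In the real report the writer never gets released until the network stack gives up on the peer, so the closer stays in this state for the full 30 minutes or more.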