JDK-8182757 : JDWP: Socket Transport handshake hangs on Solaris
  • Type: Bug
  • Component: core-svc
  • Sub-Component: debugger
  • Affected Version: 6,7,8,9,10
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris
  • CPU: sparc,x86_64
  • Submitted: 2017-06-23
  • Updated: 2019-05-22
  • Resolved: 2017-08-03
JDK 10
10 b19: Fixed
Description
Setting priority to P2 to match the original bug:

JDK-6303969 JDWP: Socket Transport handshake fails rarely on InstancesTest.java

The purpose of this new bug is to extract sighting information for ONE
failure mode described in JDK-6303969.

The failure mode is a hang between the debugger and debuggee
on Solaris SPARC or Solaris X64 systems only. This failure mode
has not been seen on any other platform.

The debugger pstack trace looks like this:

-----------------  lwp# 2 / thread# 2  --------------------
 ffff80ffbf51e35a pollsys  (ffff80ffbf13e418, 1, 0, 0)
 ffff80ffbf4bef93 poll () + 5f
 ffff80f1b8814ad8 NET_Timeout0 () + b8
 ffff80f1b881428a NET_Timeout () + 2a
 ffff80f1b8810fa7 Java_java_net_PlainSocketImpl_socketAccept () + 247
 ffff80ffa2030f31 * java/net/PlainSocketImpl.socketAccept(Ljava/net/SocketImpl;)V+-30776
 ffff80ffa200b7e3 * java/net/AbstractPlainSocketImpl.accept(Ljava/net/SocketImpl;)V+23568 (line 922)
 ffff80ffa200b7e3 * java/net/ServerSocket.implAccept(Ljava/net/Socket;)V+3232 (line 1155)
 ffff80ffa200b7e3 * java/net/ServerSocket.accept()Ljava/net/Socket;+3584 (line 1034)
 ffff80ffa200b560 * nsk/share/jdwp/SocketTransport.accept()V+7768 (line 209)
 ffff80ffa200b7e3 * nsk/share/jdwp/Debugee.connect()Lnsk/share/jdwp/Transport;+-7592 (line 301)
 ffff80ffa200b560 * nsk/share/jdwp/Binder.bindToDebugee(Ljava/lang/String;)Lnsk/share/jdwp/Debugee;+9296 (line 185)
 ffff80ffa200b560 * nsk/jdwp/ObjectReference/InvokeMethod/invokemeth001.runIt([Ljava/lang/String;Ljava/io/PrintStream;)I+-15856 (line 354)
 ffff80ffa200b220 * nsk/jdwp/ObjectReference/InvokeMethod/invokemeth001.run([Ljava/lang/String;Ljava/io/PrintStream;)I+-15576 (line 188)
 ffff80ffa200b220 * nsk/jdwp/ObjectReference/InvokeMethod/invokemeth001.main([Ljava/lang/String;)V+-15304 (line 175)
 ffff80ffa2000d2d * nsk/jdwp/ObjectReference/InvokeMethod/invokemeth001.main([Ljava/lang/String;)V+-14032 (line 175)
 ffff80f1bac5e357 __1cJJavaCallsLcall_helper6FpnJJavaValue_rknMmethodHandle_pnRJavaCallArguments_pnGThread__v_ () + 507
 ffff80f1bad1bbd7 __1cRjni_invoke_static6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallType_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ () + 4b7
 ffff80f1bad43ec7 jni_CallStaticVoidMethod () + 577
 ffff80f1bc206d5e JavaMain () + 30e
 ffff80ffbf515221 _thrp_setup () + a5
 ffff80ffbf5154c0 _lwp_start ()

The key attributes of the above pstack output:

- The debugger is in a java/net/ServerSocket.accept() call.
- The accept() call has called NET_Timeout() which results
  in a poll() and then a pollsys() call.
- Basically, the ServerSocket is waiting for a connect event
  to come in on the socket.
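
To illustrate the debugger-side wait described above, here is a minimal sketch of an accept() that gives up after a timeout instead of polling forever. It is not the nsk/share/jdwp test library code; the class name AcceptTimeoutSketch and the helper acceptWithin() are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class AcceptTimeoutSketch {
    /**
     * Waits for one incoming connection on the given listener.
     * Returns true if a peer connected within timeoutMillis, false on timeout.
     */
    static boolean acceptWithin(ServerSocket server, int timeoutMillis) throws IOException {
        // With SO_TIMEOUT set, accept() throws instead of blocking indefinitely.
        server.setSoTimeout(timeoutMillis);
        try (Socket peer = server.accept()) {
            return true;
        } catch (SocketTimeoutException e) {
            // No connect event arrived: this is the state the debugger is stuck in.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Port 0 asks the OS for an ephemeral port, as a non-fixed-port test would.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            System.out.println("listening on port " + server.getLocalPort());
            System.out.println("connected: " + acceptWithin(server, 500));
        }
    }
}
```

A timeout like this only turns the hang into a visible failure; it does not explain where the connect went.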


The debuggee pstack trace looks like this:

-----------------  lwp# 2 / thread# 2  --------------------
 ffff80ffbf51dbfa recv     (6, ffff80ffbf13e250, e, 0)
 ffff80ffbf67fe2e recv () + 12
 ffff80f1ac403e5e dbgsysRecv () + 2e
 ffff80f1ac403702 recv_fully () + 32
 ffff80f1ac402a08 handshake () + 68
 ffff80f1ac40336e socketTransport_attach () + ce
 ffff80f1ac63ac7a transport_startTransport () + 7a
 ffff80f1ac622e5a startTransport () + 6a
 ffff80f1ac61fbaf bagEnumerateOver () + 3f
 ffff80f1ac6233bd initialize () + 1dd
 ffff80f1ac622629 cbEarlyVMInit () + 79
 ffff80f1bb21a011 __1cLJvmtiExportTpost_vm_initialized6F_v_ () + 581
 ffff80f1bb7cf184 __1cHThreadsJcreate_vm6FpnOJavaVMInitArgs_pb_i_ () + 7d4
 ffff80f1bad6b7eb __1cWJNI_CreateJavaVM_inner6FppnHJavaVM__ppv3_i_ () + bb
 ffff80f1bad6bcb9 JNI_CreateJavaVM () + 9
 ffff80f1bc20967b InitializeJVM () + 11b
 ffff80f1bc206aa5 JavaMain () + 55
 ffff80ffbf515221 _thrp_setup () + a5
 ffff80ffbf5154c0 _lwp_start ()

The key attributes of the above pstack output:

- The debuggee agent is in cbEarlyVMInit() which
  is the event handler for the VM_INIT event.
- The agent is in socketTransport_attach() and is in
   the JDWP handshake() code.
- The handshake() code is trying to recv() data from
   the socket.
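
Note that the recv() in the trace asks for 0xe (14) bytes, which matches the length of the "JDWP-Handshake" string that the JDWP specification defines for the handshake. The following is a rough Java sketch of the debuggee-side exchange, assuming the agent reads the debugger's hello first and then echoes it. The real agent code is C in the socket transport; the class name JdwpHandshakeSketch and the method debuggeeHandshake() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class JdwpHandshakeSketch {
    /** The 14 ASCII bytes exchanged by both sides during the JDWP handshake. */
    static final byte[] HELLO = "JDWP-Handshake".getBytes(StandardCharsets.US_ASCII);

    /**
     * Debuggee side (assumed order): read 14 bytes from the peer, and echo
     * the hello back only if they match. Returns true on a good handshake.
     */
    static boolean debuggeeHandshake(InputStream in, OutputStream out) throws IOException {
        byte[] received = new byte[HELLO.length];
        // Blocks until all 14 bytes arrive, like recv_fully() in the trace.
        // If the peer never sends anything, this is where the thread hangs.
        new DataInputStream(in).readFully(received);
        if (!Arrays.equals(received, HELLO)) {
            return false;
        }
        out.write(HELLO);
        out.flush();
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the socket in this sketch.
        ByteArrayOutputStream reply = new ByteArrayOutputStream();
        boolean ok = debuggeeHandshake(new ByteArrayInputStream(HELLO), reply);
        System.out.println("handshake ok: " + ok + ", bytes echoed: " + reply.size());
    }
}
```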

DO NOT add any entries to this bug report that do not meet the
exact failure mode described by this bug.

So we have the debugger side waiting for an incoming connect
while the debuggee side has already returned from its connect()
and is trying to receive data from the socket. The question
is what happened to the connect event? Did it get dropped?
Did it get snarfed by another ServerSocket listening on the
same port?

The remaining notes that I'm adding to this bug are from my
personal e-mail archive for JDK-6303969. When the bug was
imported from the older bug system to JBS, date and comment
author information was stripped, so the original 20 description
notes were all munged together into the mess that is the
description note for JDK-6303969.

Update: Adding a DKFL rule entry to match the above pstack
output. It's a ridiculously broad rule, but it's what we have:

RULE nsk/jdwp/ObjectReference/InvokeMethod/invokemeth001 Timeout none

If you get a timeout that matches this rule, you have to look at
the pstack output for BOTH the debugger and debuggee and
make sure they look very similar to the above examples.
Comments
If a socket is being set up without a fixed port, using the SO_REUSEADDR flag can allow other processes to interfere with the poll/receive sequence that the debugger and debuggee use to establish their connection. When SO_REUSEADDR is used, another process can listen() on the same port and receive the connect from the debuggee. This causes the debugger to stay in poll() waiting for a connect while the debuggee stays in recv() waiting for data from the "rogue" process, which will never send it. This can also lead to connections being terminated early on the debuggee side when the "rogue" process closes the connection because it did not receive what it expected from the client process (i.e. the debuggee). The fix is to not use the SO_REUSEADDR flag for non-fixed port sockets. This keeps "rogue" processes from reusing the port address and from stealing the connects sent by the debuggee.
27-07-2017
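
For reference, the idea in the comment above can be sketched in Java as follows. This is not the actual patch, only an illustration of enabling SO_REUSEADDR for fixed ports and leaving it off for ephemeral ones; the class name ListenSocketSketch and the helper openListener() are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ListenSocketSketch {
    /**
     * Opens a listening socket on the loopback address.
     * port == 0 means "let the OS pick a free port" (a non-fixed port).
     */
    static ServerSocket openListener(int port) throws IOException {
        // Create the socket unbound so the option can be set before bind().
        ServerSocket server = new ServerSocket();
        try {
            // Only allow address reuse when the caller asked for a specific port.
            // For an OS-assigned port, reuse would let another process bind the
            // same port and steal the incoming connect.
            server.setReuseAddress(port != 0);
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return server;
        } catch (IOException e) {
            server.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = openListener(0)) {
            System.out.println("port=" + server.getLocalPort()
                    + " reuseAddress=" + server.getReuseAddress());
        }
    }
}
```

Setting the option before bind() matters here: the behavior of changing SO_REUSEADDR on an already bound socket is not defined by the ServerSocket API.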