SHORT SUMMARY: Possible race condition which may cause the
socket close failure.
INDICATORS:
Threads stuck in socket read and they correspond to the ESTABLISHED
connections.
These sockets are closed from other java threads and the customer suspects a race
condition.
"OUT Socket Receiver (/10.4.66.8:39918)" prio=1 tid=0x3220ac78
nid=0x1bab runnable [2e1ff000..2e1ff86c]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.DataInputStream.readFully(DataInputStream.java:266)
at java.io.DataInputStream.readFully(DataInputStream.java:242)
at
com.x.y.integration.systemconnection.messagedefinition.MessageSepara
tor.separateByteMessage(MessageSeparator.java:213)
at
com.x.y.integration.systemconnection.outbound.socket.SocketEisContro
ller.run(SocketEisController.java:316)
NET_Read calls recv(2) in the startOp and endOp pair. startOp registers
current thread to entry fdEntry and endOp checks the fdEntry's intr field.
In closefd, fdEntry's intr is set to 1 and pthread_kill is called to the
thread. However, this can happen before calling recv. In that case, recv is
not interrupted by the pthread_kill and it blocks (unless close(2)/dup2(2)
interrupts the recv() system call and it looks it does not happen).
#define BLOCKING_IO_RETURN_INT(FD, FUNC) { \
int ret; \
threadEntry_t self; \
fdEntry_t *fdEntry = getFdEntry(FD); \
if (fdEntry == NULL) { \
errno = EBADF; \
return -1; \
} \
do { \
startOp(fdEntry, &self); \
ret = FUNC; \
endOp(fdEntry, &self); \
} while (ret == -1 && errno == EINTR); \
return ret; \
}
int NET_Read(int s, void* buf, size_t len) {
BLOCKING_IO_RETURN_INT( s, recv(s, buf, len, 0) );
}
COUNTER INDICATORS: na
TRIGGERS: na
KNOWN WORKAROUND: na
PRESENT SINCE: 1.4.2
HOW TO VERIFY:
It looks it is difficult to create the suspected timing window from a java
program.
Support tried OpenJDK jdk7u-dev to widen the timing window.
He put usleep to two places to make the timing window bigger.
1. added usleep to closefd()
static int closefd(int fd1, int fd2) {
:
threadEntry_t *curr = fdEntry->threads;
while (curr != NULL) {
curr->intr = 1;
pthread_kill( curr->thr, sigWakeup );
if (getenv("SOCKET_CLOSE_DEBUG")) printf("D: kill %#x", curr->thr);
curr = curr->next;
}
if (getenv("SLEEP_BEFORE_CLOSE")) {
usleep(200000); // ### sleep 200msec after pthread_kill, but before dup2
}
2. added usleep just ofter startOp in modified version of
BLOCKING_IO_RETURN_INT, which is called by NET_Read.
#define BLOCKING_IO_RETURN_INT_READ(FD, FUNC) { \
int ret; \
threadEntry_t self; \
fdEntry_t *fdEntry = getFdEntry(FD); \
int blockingIODebug = (int)getenv("BLOCKING_IO_DEBUG"); \
if (fdEntry == NULL) { \
errno = EBADF; \
return -1; \
} \
do { \
startOp(fdEntry, &self); \
if (blockingIODebug) { \
printf("Calling usleep in Net_X"); \
} \
usleep(200000); \
if (blockingIODebug) { \
printf("usleep done in Net_X"); \
} \
ret = FUNC; \
if (blockingIODebug) { \
printf("FUNC returned ret %d errno %d", ret, errno); \
} \
endOp(fdEntry, &self); \
} while (ret == -1 && errno == EINTR); \
return ret; \
}
int NET_Read(int s, void* buf, size_t len) {
if (getenv("SOCKET_READ_DEBUG")) printf("D: NET_Read(%d,,%d)
", s, len);
BLOCKING_IO_RETURN_INT_READ( s, recv(s, buf, len, 0) );
}
$ SLEEP_BEFORE_CLOSE=1 BLOCKING_IO_DEBUG=1 SOCKET_CLOSE_DEBUG=1
SOCKET_READ_DEBUG=1 ~/local/openjdk/jdk7u-dev/build/linux-i586/bin/java
-Dsocket.close.debug=1 AsyncCloseEx
D: Client Calling read
D: NET_Read(13,,128)
Calling usleep in Net_X
D: main calling close.
D: PlainSocket close fdUseCount 1 closePending false
D: PlainSocket calling socketPreClose() for fdUseCount != 0
D: socketClose0 0xb779cf58 deferred 1 fd 13 use-dup2 1
D: kill 0x8efffb70
usleep done in Net_X
D: dup2(11, 13) called
D: socketClose0 0xb779cf58 deferred 1 fd 13 use-dup2 1 returning
D: main close done.
### Ctrl- to show thread dump.
^2013-01-10 17:29:33
Full thread dump OpenJDK Server VM (24.0-b28 mixed mode):
"DestroyJavaVM" prio=10 tid=0x091d4000 nid=0x639 waiting on condition
[0x00000000]
java.lang.Thread.State: RUNNABLE
"client" prio=10 tid=0x09364800 nid=0x647 runnable [0x8effe000] <<< blocked
in socket read
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at AsyncCloseEx$1.run(AsyncCloseEx.java:29)
"server" prio=10 tid=0x09307c00 nid=0x646 runnable [0x8f1cc000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:522)
at java.net.ServerSocket.accept(ServerSocket.java:490)
at AsyncCloseEx$2.run(AsyncCloseEx.java:53)
"Service Thread" daemon prio=10 tid=0x092e9c00 nid=0x644 runnable
[0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x092e7c00 nid=0x643 waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x092e5400 nid=0x642 waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x092e3800 nid=0x641 waiting on
condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x092ce800 nid=0x640 in Object.wait()
[0x8f538000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa9f35400> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0xa9f35400> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
"Reference Handler" daemon prio=10 tid=0x092c9c00 nid=0x63e in Object.wait()
[0x8f588000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xa9f34fd8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0xa9f34fd8> (a java.lang.ref.Reference$Lock)
"VM Thread" prio=10 tid=0x092c8400 nid=0x63c runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x091dd800 nid=0x63a runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x091df000 nid=0x63b runnable
"VM Periodic Task Thread" prio=10 tid=0x092f3c00 nid=0x645 waiting on
condition
JNI global references: 137
Heap
PSYoungGen total 9408K, used 983K [0xa9f30000, 0xaa9a0000, 0xb4680000)
eden space 8128K, 12% used [0xa9f30000,0xaa025f78,0xaa720000)
from space 1280K, 0% used [0xaa860000,0xaa860000,0xaa9a0000)
to space 1280K, 0% used [0xaa720000,0xaa720000,0xaa860000)
ParOldGen total 21376K, used 0K [0x95080000, 0x96560000, 0xa9f30000)
object space 21376K, 0% used [0x95080000,0x95080000,0x96560000)
PSPermGen total 16384K, used 1878K [0x91080000, 0x92080000,
0x95080000)
object space 16384K, 11% used [0x91080000,0x91255a48,0x92080000)
So, it was verified that the timing window really exists.
NOTES FOR SE:
Support notes :
---------------
Using OpenJDK jdk7u-dev, I implemented an experimental fix. If it is
enabled, the read socket thread won't stack in SocketInputStream.socketRead0
even with the intentionally extended timing window we are suspecting.
Changes are below in solaris/native/java/net/linux_close.c:
1. Added a function isInterrupted() which checks if the current thread entry
has intr flag set.
This must be called after locking corresponding fdEntry's lock.
2. Changed BLOCKING_IO_RETURN_INT macro to use isInterrupted() after locking
self.lock.
It calls FUNC (in this case recv()), only if the current thread is not interrupted.
closefd() calls pthread_kill and sets threadEntry's intr filed. Enclosed incr field change with lock/unlock with curr->lock.
Guarding the recv with the same lock should make the macro not to miss the pthread_kill and thread entry's intr flag.
REGRESSION: no
}