JDK-6671051 : (process) Runtime.exec() hangs if signalled during fork/exec
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 6u3
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris_10
  • CPU: generic
  • Submitted: 2008-03-04
  • Updated: 2011-05-18
  • Resolved: 2011-05-18
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6 JDK 7 Other
6u7Fixed 7 b25Fixed OpenJDK6Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
We are seeing threads hung in java.lang.Runtime.exec() on Solaris 10. The child process is created, but the Runtime.exec() call never returns.

In this example the calling thread is 103(0x67) and the child pid is 14159(0x374f) - I didn't capture the output that shows that.

Here is the backtrace (from jstack). 

"CacheManagerTaskExecutor-15" prio=10 tid=0x0840d400 nid=0x67 sleeping[0xb48a7000..0xb48a7b70]
   java.lang.Thread.State: RUNNABLE
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:431)
        at java.lang.Runtime.exec(Runtime.java:328)
        at com.sun.hss.domain.util.misc.ExpectCommandExecutor.execute(ExpectCommandExecutor.java:48)
        - locked <0xbc1e72d0> (a com.sun.hss.domain.util.misc.ExpectCommandExecutor)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:168)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:145)
        at com.sun.hss.domain.driver.generic.GenericServerMXBeanHelper.execIPMICommand(GenericServerMXBeanHelper.java:406)
        at com.sun.hss.domain.util.ipmi.AbstractIPMICommandHelper.getFruMap(AbstractIPMICommandHelper.java:279)
        at com.sun.hss.domain.driver.v20z.V20zServerMXBeanHelper.getSerialNumber(V20zServerMXBeanHelper.java:211)
        at com.sun.hss.domain.driver.generic.AbstractServerMXBean.getSerialNumber(AbstractServerMXBean.java:87)
        at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
        at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
        at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
        at com.sun.hss.domain.AbstractResourceMXBean.getAttribute(AbstractResourceMXBean.java:209)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
        at com.sun.jdmk.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ConcurrentMBeanServerInterceptor.getAttribute(ConcurrentMBeanServerInterceptor.java:837)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.cacao.agent.DispatchInterceptor.getAttribute(DispatchInterceptor.java:506)
        at com.sun.cacao.agent.auth.impl.AccessControlInterceptor.getAttribute(AccessControlInterceptor.java:310)
        at com.sun.jdmk.JdmkMBeanServerImpl.getAttribute(JdmkMBeanServerImpl.java:640)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.rereadAttribute(ServerCachedMBeanProxy.java:326)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.refreshCachedAttributes(ServerCachedMBeanProxy.java:295)
        at com.sun.cacao.cached.impl.server.RefreshAttributeTask.execute(RefreshAttributeTask.java:60)
        at com.sun.cacao.cached.impl.server.TaskPool$TaskExecutor.run(TaskPool.java:334)


pstack shows the following for the thread (note that the 374f parameter corresponds to the child in question):
-----------------  lwp# 103 / thread# 103  --------------------
 fef30ee7 waitid   (0, 374f, b48a6fc0, 3)
 fef25b52 waitpid  (374f, 0, 0) + 75
 fe488242 Java_java_lang_UNIXProcess_forkAndExec (840d4ec, b48a7170, b48a716c, b48a7168, 7, 0) + 6ca
 fb40a598 ???????? (0, fb408219, bc1e7260, bc1e7250, bc1e7240, 0)
 fb402de9 ???????? (0, 0, 0, 0, 7, bc1e71f0)
 fb7f62b0 ???????? ()

This appears to be stuck in the following code in Java_java_lang_UNIXProcess_forkAndExec():
    close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */
    if (read(fail[0], &errnum, sizeof(errnum)) != 0) {
        waitpid(resultPid, NULL, 0);
        throwIOException(env, errnum, "Exec failed");
        goto Catch;


Although this is highly reproducible in our lab stress tests, it was hard to produce when running truss. However, in this case I was able to capture it with truss -l -t read,waitid. Here is the truss output from the time of the failure (note the lwp id is 103, consistent with the above):
/103:         siginfo: SIGJVM1 pid=13520 uid=0 code=-1
/103:    0.0024 read(309, 0xB48A70B4, 4)                        Err#4 EINTR
/103:   waitid(P_PID, 14159, 0xB48A6FC0, WEXITED|WTRAPPED) (sleeping...)

It looks like there is a bug in the code above. It assumes that a non-zero return from the read system call represents a successful read of errnum, while in this case, the return value is actually -1. This causes the calling thread to wait for the child process to complete while, in our example, the child process has successfully exec'ed and is actually waiting to communicate with the hung thread upon its return.

Comments
EVALUATION It appears that this bug is a regression introduced by the fix for 4052517: (process) Runtime.exec won't execute programs belonging to other groups on Unix 4811767: (process) Runtime.exec should throw IOException when workdir does not exist (Unix) 5033302: (process) Can't execute Solaris NFS programs with uid>64k on Linux-amd64 in mustang-b81
14-03-2008

EVALUATION Submitter has verified the fix.
06-03-2008

EVALUATION I am embarrassed to admit that I fell into the canonical "Worse" trap described by the famous paper: http://www.dreamsongs.com/WorseIsBetter.html I am a "Right Thing" kind of guy myself, even more so after thinking about this bug.
05-03-2008

EVALUATION Here is my (untested) attempt to get the logic completely right. Does programming always have to be so hard? diff --git a/src/solaris/native/java/lang/UNIXProcess_md.c b/src/solaris/native/java/lang/UNIXProcess_md.c --- a/src/solaris/native/java/lang/UNIXProcess_md.c +++ b/src/solaris/native/java/lang/UNIXProcess_md.c @@ -479,6 +479,37 @@ closeSafely(int fd) close(fd); } +/* + * Reads nbyte bytes from file descriptor fd into buf, + * The read operation is retried in case of EINTR or partial reads. + * + * Returns number of bytes read (normally nbyte, but may be less in + * case of EOF). In case of read errors, returns -1 and sets errno. + */ +static ssize_t +readFully(int fd, void *buf, size_t nbyte) +{ + ssize_t remaining = nbyte; + for (;;) { + ssize_t n = read(fd, buf, remaining); + if (n == 0) { + return nbyte - remaining; + } else if (n > 0) { + remaining -= n; + if (remaining <= 0) + return nbyte; + /* We were interrupted in the middle of reading the bytes. + * Unlikely, but possible. */ + buf = (void *) (((char *)buf) + n); + } else if (errno == EINTR) { + /* Strange signals like SIGJVM1 are possible at any time. + * See http://www.dreamsongs.com/WorseIsBetter.html */ + } else { + return -1; + } + } +} + #ifndef __solaris__ #undef fork1 #define fork1() fork() @@ -606,9 +637,15 @@ Java_java_lang_UNIXProcess_forkAndExec(J /* parent process */ close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */ - if (read(fail[0], &errnum, sizeof(errnum)) != 0) { + + switch (readFully(fail[0], &errnum, sizeof(errnum))) { + case 0: break; /* Exec succeeded */ + case sizeof(errnum): waitpid(resultPid, NULL, 0); throwIOException(env, errnum, "Exec failed"); + goto Catch; + default: + throwIOException(env, errno, "Read failed"); goto Catch; }
05-03-2008

EVALUATION The read in question from a private pipe assumes either success or EOF, but an EINTR is not impossible, and so should be handled by retrying the operation.
04-03-2008