Relates :
|
|
Relates :
|
|
Relates :
|
We are seeing threads hung in java.lang.Runtime.exec() on Solaris 10. The child process is created, but the Runtime.exec() call never returns. In this example the calling thread is 103(0x67) and the child pid is 14159(0x374f) - I didn't capture the output that shows that. Here is the backtrace (from jstack). "CacheManagerTaskExecutor-15" prio=10 tid=0x0840d400 nid=0x67 sleeping[0xb48a7000..0xb48a7b70] java.lang.Thread.State: RUNNABLE at java.lang.UNIXProcess.forkAndExec(Native Method) at java.lang.UNIXProcess.<init>(UNIXProcess.java:53) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) at java.lang.Runtime.exec(Runtime.java:593) at java.lang.Runtime.exec(Runtime.java:431) at java.lang.Runtime.exec(Runtime.java:328) at com.sun.hss.domain.util.misc.ExpectCommandExecutor.execute(ExpectCommandExecutor.java:48) - locked <0xbc1e72d0> (a com.sun.hss.domain.util.misc.ExpectCommandExecutor) at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:168) at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:145) at com.sun.hss.domain.driver.generic.GenericServerMXBeanHelper.execIPMICommand(GenericServerMXBeanHelper.java:406) at com.sun.hss.domain.util.ipmi.AbstractIPMICommandHelper.getFruMap(AbstractIPMICommandHelper.java:279) at com.sun.hss.domain.driver.v20z.V20zServerMXBeanHelper.getSerialNumber(V20zServerMXBeanHelper.java:211) at com.sun.hss.domain.driver.generic.AbstractServerMXBean.getSerialNumber(AbstractServerMXBean.java:87) at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167) at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96) at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65) at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216) at javax.management.StandardMBean.getAttribute(StandardMBean.java:358) at com.sun.hss.domain.AbstractResourceMXBean.getAttribute(AbstractResourceMXBean.java:209) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666) at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638) at com.sun.jdmk.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:178) at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178) at com.sun.hss.domain.ConcurrentMBeanServerInterceptor.getAttribute(ConcurrentMBeanServerInterceptor.java:837) at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178) at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178) at com.sun.cacao.agent.DispatchInterceptor.getAttribute(DispatchInterceptor.java:506) at com.sun.cacao.agent.auth.impl.AccessControlInterceptor.getAttribute(AccessControlInterceptor.java:310) at com.sun.jdmk.JdmkMBeanServerImpl.getAttribute(JdmkMBeanServerImpl.java:640) at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.rereadAttribute(ServerCachedMBeanProxy.java:326) at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.refreshCachedAttributes(ServerCachedMBeanProxy.java:295) at com.sun.cacao.cached.impl.server.RefreshAttributeTask.execute(RefreshAttributeTask.java:60) at com.sun.cacao.cached.impl.server.TaskPool$TaskExecutor.run(TaskPool.java:334) pstack shows the following for the thread (note that the 374f parameter corresponds to the child in question): ----------------- lwp# 103 / thread# 103 -------------------- fef30ee7 waitid (0, 374f, b48a6fc0, 3) fef25b52 waitpid (374f, 0, 0) + 75 fe488242 Java_java_lang_UNIXProcess_forkAndExec (840d4ec, b48a7170, b48a716c, b48a7168, 7, 0) + 6ca fb40a598 ???????? (0, fb408219, bc1e7260, bc1e7250, bc1e7240, 0) fb402de9 ???????? (0, 0, 0, 0, 7, bc1e71f0) fb7f62b0 ???????? () This appears to be stuck in the following code in Java_java_lang_UNIXProcess_forkAndExec(): close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */ if (read(fail[0], &errnum, sizeof(errnum)) != 0) { waitpid(resultPid, NULL, 0); throwIOException(env, errnum, "Exec failed"); goto Catch; Although this is highly reproducible in our lab stress tests, it was hard to produce when running truss. However, in this case I was able to capture it with truss -l -t read,waitid. Here is the truss output from the time of the failure (note the lwp id is 103, consistent with the above): /103: siginfo: SIGJVM1 pid=13520 uid=0 code=-1 /103: 0.0024 read(309, 0xB48A70B4, 4) Err#4 EINTR /103: waitid(P_PID, 14159, 0xB48A6FC0, WEXITED|WTRAPPED) (sleeping...) It looks like there is a bug in the code above. It assumes that a non-zero return from the read system call represents a successful read of errnum, while in this case, the return value is actually -1. This causes the calling thread to wait for the child process to complete while, in our example, the child process has successfully exec'ed and is actually waiting to communicate with the hung thread upon its return.
|