JDK-4843136 : (process) pipe file descriptor from Runtime.exec not being closed
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 1.4.1_02, 1.4.2
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic, linux, solaris_9
  • CPU: x86, sparc
  • Submitted: 2003-04-04
  • Updated: 2016-04-15
  • Resolved: 2003-04-29
Fixed in: 1.4.1_05, 1.4.2
Description
I'm running:

datsun$ /opt/jdk1.4/bin/java -version
java version "1.4.2-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-beta-b16)
Java HotSpot(TM) Client VM (build 1.4.2-beta-b16, mixed mode)

and the following problem occurs.  Here's what's happening...

I have a server process that runs forever.  It sometimes uses Runtime.exec
to start subprocesses.

datsun$ ps ax | grep java
   4100 ?        S  3:42 /opt/jdk1.4/bin/java -Djava.ext.dirs=/home/shannon/ext

This is my server process.  It's been running since Mar 17.

datsun$ pfiles 4100
4100:   /opt/jdk1.4/bin/java -Djava.ext.dirs=/home/shannon/ext -jar /home/shan
   Current rlimit: 1024 file descriptors
    0: S_IFCHR mode:0666 dev:136,0 ino:60568 uid:0 gid:3 rdev:13,2
       O_RDONLY|O_LARGEFILE
    1: S_IFREG mode:0600 dev:32,16 ino:745340 uid:1011 gid:10 size:6037573
       O_WRONLY|O_LARGEFILE
    2: S_IFIFO mode:0000 dev:231,0 ino:131335 uid:0 gid:25 size:0
       O_RDWR
    3: S_IFCHR mode:0666 dev:136,0 ino:60570 uid:0 gid:3 rdev:13,12
       O_RDWR
    4: S_IFDOOR mode:0444 dev:236,0 ino:61825 uid:0 gid:0 size:0
       O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[223]
    5: S_IFIFO mode:0664 dev:32,16 ino:745581 uid:1011 gid:10 size:780
       O_RDONLY|O_LARGEFILE
    6: S_IFREG mode:0600 dev:32,16 ino:746048 uid:1011 gid:10 size:7607
       O_RDONLY|O_LARGEFILE
    7: S_IFIFO mode:0664 dev:32,16 ino:745581 uid:1011 gid:10 size:780
       O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
    8: S_IFREG mode:0600 dev:32,16 ino:745922 uid:1011 gid:10 size:1034402
       O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE
   10: S_IFREG mode:0600 dev:32,16 ino:745922 uid:1011 gid:10 size:1034402
       O_WRONLY|O_APPEND|O_CREAT|O_LARGEFILE

Note that fd 9 and fd 11 are the two lowest unused fds.  If I were to
create a pipe now, the read end would become fd 9 and the write end fd 11.

datsun$ ps alx | grep 4100
  8  1011  4100  4097  0  29 3011865664992 0002d934 S ?         3:42 /opt/jdk1.4/
  8  1011 17934  4100  0  49 23 1096  744 300037f4c40 S ?         0:00 /bin/sh /hom
datsun$ ps alx | grep 17934
  8  1011 17934  4100  0  49 23 1096  744 300037f4c40 S ?         0:00 /bin/sh /hom
  8  1011 17936 17934  0  49 23  976  704 300043d8a80 S ?         0:00 cat

17934 is the shell started by 4100.

datsun$ pfiles 17934
17934:  /bin/sh /home/shannon/.mail/filtermsg metadata-jsr +In/metadata
   Current rlimit: 1024 file descriptors
    0: S_IFIFO mode:0000 dev:231,0 ino:178470 uid:1011 gid:10 size:0
       O_RDWR
    1: S_IFREG mode:0600 dev:0,2 ino:1624593 uid:1011 gid:10 size:7
       O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
    2: S_IFREG mode:0600 dev:0,2 ino:1624593 uid:1011 gid:10 size:7
       O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
   11: S_IFIFO mode:0000 dev:231,0 ino:178470 uid:1011 gid:10 size:0
       O_RDWR
   19: S_IFREG mode:0775 dev:32,16 ino:745551 uid:1011 gid:10 size:933
       O_RDONLY|O_LARGEFILE FD_CLOEXEC

Note that both fd 0 and fd 11 are the same pipe.

fd 1 and fd 2 were redirected to a debug output file by the shell script.

17936 is a "cat" command capturing the data from the pipe.
It should terminate when the pipe is closed.

datsun$ pfiles 17936
17936:  cat
   Current rlimit: 1024 file descriptors
    0: S_IFIFO mode:0000 dev:231,0 ino:178470 uid:1011 gid:10 size:0
       O_RDWR
    1: S_IFREG mode:0600 dev:0,2 ino:4450889 uid:1011 gid:10 size:7649
       O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
    2: S_IFREG mode:0600 dev:0,2 ino:1624593 uid:1011 gid:10 size:7
       O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE
   11: S_IFIFO mode:0000 dev:231,0 ino:178470 uid:1011 gid:10 size:0
       O_RDWR

Note that it also has the pipe open on fd 11.

Because so many processes have the pipe open, the fact that the JVM
end closed the pipe makes no difference.  The pipe doesn't get an EOF
and so the cat process hangs forever.

It looks like there might be some sort of bug in the JVM where it
doesn't always close the write end of the pipe before starting the
subprocess that should only get the read end of the pipe.  The parent
should keep the write end open and close the read end, and the child
should keep the read end open and close the write end.

And the weird thing is this doesn't happen every time.  It only seems
to fail after running for a long time, and even then it only fails
sometimes.

This also fails on 1.4.1.

Comments
CONVERTED DATA
BugTraq+ Release Management Values
    COMMIT TO FIX:  1.4.1_05 mantis-rc tiger
    FIXED IN:       1.4.1_05 mantis-rc tiger
    INTEGRATED IN:  1.4.1_05 mantis-b22 mantis-rc tiger tiger-b07
14-06-2004

SUGGESTED FIX

*** /tmp/geta2406	2003-04-22 10:58:12.965745474 -0700
--- UNIXProcess_md.c.linux	2003-04-21 15:51:08.805167000 -0700
***************
*** 224,230 ****
      long desc, dirdesc;
      pid_t pid;
  
!     if ((pid = getppid()) < 0) return 0;
  
      sprintf(procdir, "/proc/%d/fd", pid);
      close(3); /* to be sure a descriptor is available for opendir */
--- 224,230 ----
      long desc, dirdesc;
      pid_t pid;
  
!     if ((pid = getpid()) < 0) return 0;
  
      sprintf(procdir, "/proc/%d/fd", pid);
      close(3); /* to be sure a descriptor is available for opendir */

*** /tmp/geta2414	2003-04-22 10:59:47.485482077 -0700
--- UNIXProcess_md.c.solaris	2003-04-21 15:51:26.165099000 -0700
***************
*** 188,194 ****
      char procdir[20];
      long desc;
      pid_t pid;
  
!     if ((pid = getppid()) < 0) return 0;
      sprintf(procdir, "/proc/%d/fd", pid);
      close(3); /* to be sure a descriptor is available for opendir */
--- 188,194 ----
      char procdir[20];
      long desc;
      pid_t pid;
  
!     if ((pid = getpid()) < 0) return 0;
      sprintf(procdir, "/proc/%d/fd", pid);
      close(3); /* to be sure a descriptor is available for opendir */
11-06-2004

PUBLIC COMMENTS Reading from the stream obtained from Process.getInputStream() would sometimes not get an EOF after the process had closed the write end of that pipe.
10-06-2004

EVALUATION

We will attempt to resolve this for Tiger.
-- iag@sfbay 2003-04-07

This appears to be a Solaris bug in the implementation of /proc/<pid>/fd.
The bug is hard to reproduce, and I have not been able to reproduce it
consistently.  However, the following program

-------------------------------------------------------------------
public class PipeHang {
    public static void main(String[] args) throws Exception {
        Process cat = Runtime.getRuntime().exec(new String[] {"cat"});

        // Attack of the file descriptor snatchers
        for (int i = 0; i < 50; ++i)
            Runtime.getRuntime().exec(new String[] {"/bin/sleep", "999"});

        cat.getOutputStream().write(new byte[] {'M','e','o','w','\n'});
        cat.getOutputStream().close();

        byte[] buffer = new byte[100];
        int len;
        while ((len = cat.getInputStream().read(buffer)) > 0)
            System.out.print(new String(buffer, 0, len));
    }
}
-------------------------------------------------------------------

hangs more than 50% of the time when run on a Solaris 7 or Solaris 8
machine.  You might have to fiddle with the number of sleep processes to
reproduce the hang.

The traditional way to close all file descriptors is a loop like this:

    int max_fds = SYSTEM_DEPENDENT_VALUE();
    for (int i = 3; i <= max_fds; ++i)
        close(i);

There are two problems with this:

- max_fds can be a large number, making this quite inefficient.
- The system might have no limit whatsoever on the number of file
  descriptors, e.g. via (as root)

      ulimit -Hn unlimited; ulimit -Sn unlimited

  In this case the loop is impractical: inefficient and a stress test
  for the OS, which we should avoid.  See bug 4413680.

This is why we want to use the approach of reading the directory entries
for /proc/getpid()/fd/.  Unfortunately, this method is buggy on Solaris.
(This was already known.
See this comment from UNIXProcess_md.c.solaris:

    /* Close pipe fds here since they don't always show up
     * in /proc/<pid>/fd */
)

Another approach is to use the close-on-exec flag on file descriptors to
explicitly specify that they should not be open in the child.  See
fcntl(2).  The difficulty with that approach is that we would need to
audit *all* operations in the "java" process that could create file
descriptors.  This cannot be a complete solution because file descriptors
can be created by user-defined native code.

Here is an interesting comment from 4413680 "Runtime exec hangs if ulimit
nofiles is unlimited", for future reference to those who may encounter
bugs like this: libc (starting in Solaris 9) implements closefrom(3C),
which efficiently closes all fds greater than a given number.

There is no 100% guaranteed reliable solution.  We should:

- Set the close-on-exec flag on the parent end of pipe file descriptors
  created by Runtime.exec().  This will probably make the specific
  reported bug just go away.
- Close all file descriptors using the /proc/<pid>/fd method.
- Because we don't trust the previous step, supplement it with additional
  bug workaround code.  For example, we could use the /proc/<pid>/fd
  method twice, or remember the maximum fd listed in /proc/<pid>/fd and
  brute-force close all file descriptors 3 <= fd <= max_fd_found + 256.
  Experimentation is required to find a bug workaround that is effective
  in practice.
- Perhaps consider using closefrom() in Solaris 9 via libdl.

###@###.### 2003-04-18

The above analysis is wrong.  The implementation of /proc on Solaris is
NOT buggy.  Instead, it was the function closeDescriptors() in
UNIXProcess_md.c.solaris and UNIXProcess_md.c.linux that was buggy.
Here's what was happening: after we fork(), the child process wants to
close all of the file descriptors greater than 3.  To do this, it
enumerates the entries in the directory /proc/PID/fd.
Of course, it should call getpid() to get its own pid (actually, even
better would be to use /proc/self/fd, but we can leave that improvement
for when 4790606 "(process) Native forkAndExec code should be unified
(sol, lnx)" is addressed), but instead it calls getppid() (get PARENT
pid).  Many engineer eyeballs have read that code, and their brains
filtered out the extra 'p', since it makes no sense to enumerate the
PARENT's fds.

Of course, at the exact moment of the fork(), the parent and child have
identical sets of file descriptors, so /proc/getpid()/fd and
/proc/getppid()/fd have identical contents.  But both processes then
execute independently.  In particular, the parent dutifully closes file
descriptors it doesn't need, and each of those closes causes a FAILURE
to close in the child.  All of this is highly dependent on timing and on
the availability of CPUs to run the two processes concurrently.

See bug 4850368 "(process) getInputStream() attaches to forked background
processes (Linux)" for an example of a good way to tickle this bug on
uniprocessor Linux.  The test case reported in this bug is better for
reproducing it on Solaris.

###@###.### 2003-04-22
22-04-2003