JDK-8343285 : java.lang.Process is unresponsive and CPU usage spikes to 100%
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21,23,24
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: os_x
  • CPU: generic
  • Submitted: 2024-10-29
  • Updated: 2024-11-08
JDK 24: Unresolved
Related Reports
Relates :  
Description
ADDITIONAL SYSTEM INFORMATION :
A DESCRIPTION OF THE PROBLEM :
In JDK 8, using the Process class to execute the command "/Applications/LibreOffice.app/Contents/MacOS/soffice --help" works as expected. However, after switching to JDK 21, the thread executing the command becomes unresponsive, and the forked process causes CPU usage to spike to 100%.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Install LibreOffice (soffice).
2. Run the code below with JDK 8.
3. Switch to JDK 21 and run the same code again.

EXPECTED VERSUS ACTUAL BEHAVIOR :
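EXPECTED -
The command prints its help text and returns immediately.
ACTUAL -
No result: the calling thread becomes unresponsive and the soffice process uses 100% CPU.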

---------- BEGIN SOURCE ----------
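import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Wrapper class name is illustrative; the original report gave only the method body.
public class SofficeHelp {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder processBuilder = new ProcessBuilder(
                "/Applications/LibreOffice.app/Contents/MacOS/soffice",
                "--help"
        );

        Process process = processBuilder.start();

        // Echo the child's standard output line by line until it closes the stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        int exitCode = process.waitFor();
        System.out.println("Process exited with code: " + exitCode);
    }
}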
---------- END SOURCE ----------
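
Because the reproducer above blocks in readLine() for as long as the child keeps its output open, a bounded variant can make the hang easier to observe. The following is a minimal sketch, not part of the original report: the class name BoundedRun, the TIMED_OUT sentinel and the temporary-file redirection are all assumptions chosen for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class BoundedRun {
    /** Sentinel returned when the child had to be killed (hypothetical convention). */
    static final int TIMED_OUT = Integer.MIN_VALUE;

    /**
     * Runs the command with stdout and stderr sent to a temporary file, so the
     * parent never blocks in readLine(), and waits at most timeoutSeconds.
     * The captured output is appended to the given StringBuilder.
     */
    static int run(List<String> command, long timeoutSeconds, StringBuilder output) {
        try {
            File log = File.createTempFile("child-output", ".log");
            try {
                Process process = new ProcessBuilder(command)
                        .redirectErrorStream(true)
                        .redirectOutput(log)
                        .start();
                boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
                if (!finished) {
                    // The child is still running: kill it and reap it.
                    process.destroyForcibly().waitFor();
                }
                output.append(new String(Files.readAllBytes(log.toPath())));
                return finished ? process.exitValue() : TIMED_OUT;
            } finally {
                log.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

Calling BoundedRun.run with the soffice path and "--help" and, say, a 30 second timeout would return TIMED_OUT on an affected JDK if the child does not finish in time, instead of hanging the calling thread.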


Comments
A pull request was submitted for review.
Branch: master
URL: https://git.openjdk.org/jdk/pull/21992
Date: 2024-11-08 19:08:28 +0000
08-11-2024

It looks like on Linux the default value for the number of file descriptors is 0x100000 (i.e. 1,048,576), which takes under 1 second to iterate through on an old x86_64 machine on both Linux and macOS. That value is still much larger than the old max value that we used to have on macOS (i.e. 10,240), by 2 orders of magnitude.
08-11-2024
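
The arithmetic behind the "2 orders of magnitude" remark can be checked with a few lines. This is only a worked calculation using the two values quoted in the comment above; the class and method names are hypothetical.

```java
final class FdLimitRatio {
    static final long LINUX_DEFAULT = 0x100000; // 1,048,576, as quoted in the comment
    static final long OLD_MACOS_MAX = 10_240;   // old macOS maximum, as quoted in the comment

    /** How many times larger the Linux default is than the old macOS maximum. */
    static double ratio() {
        return (double) LINUX_DEFAULT / OLD_MACOS_MAX;
    }

    /** Whole powers of ten contained in the ratio. */
    static int ordersOfMagnitude() {
        return (int) Math.floor(Math.log10(ratio()));
    }

    public static void main(String[] args) {
        System.out.println(LINUX_DEFAULT);       // prints 1048576
        System.out.println(ratio());             // prints 102.4
        System.out.println(ordersOfMagnitude()); // prints 2
    }
}
```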

We need to figure this out, without forcing users to adapt since it is the default behavior.
08-11-2024

I missed that. I would definitely call this a regression then. At this point I am leaning towards lowering the value to something where anyone who uses similar logic in their code to iterate over the allowed range of file descriptors does not run into this. I will look into a reasonable value to use...
08-11-2024

> But this is not default behavior, a client must explicitly ask for "MaxFDLimit" to encounter this.
[~gziemski] That is not correct - MaxFDLimit is true by default.
08-11-2024

Removed the "regression" label, if the function works as designed. Is there any workaround to suggest for users of soffice until a new version is released? Or perhaps only a release note suggesting patience during the startup of soffice.
07-11-2024
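
On the workaround question, one option on the Java side might be to lower the soft file descriptor limit for the child only. This is not something proposed in this thread. The sketch below assumes that soffice sizes its loop from the inherited soft RLIMIT_NOFILE, and that /bin/sh supports "ulimit -S -n". The class name LowFdLimitLauncher and its methods are hypothetical. Starting the parent JVM with -XX:-MaxFDLimit would be another route to try.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class LowFdLimitLauncher {
    /**
     * Wraps a command so that /bin/sh first lowers the soft RLIMIT_NOFILE and
     * then exec's the real program. In the script, "$0" is the program and
     * "$@" are its arguments, so no extra quoting of the command is needed.
     */
    static List<String> wrap(int softLimit, List<String> command) {
        List<String> wrapped = new ArrayList<>();
        wrapped.add("/bin/sh");
        wrapped.add("-c");
        wrapped.add("ulimit -S -n " + softLimit + " && exec \"$0\" \"$@\"");
        wrapped.addAll(command);
        return wrapped;
    }

    /** Runs the wrapped command and returns its combined stdout and stderr. */
    static String runAndCapture(int softLimit, List<String> command) {
        try {
            Process process = new ProcessBuilder(wrap(softLimit, command))
                    .redirectErrorStream(true)
                    .start();
            StringBuilder out = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            process.waitFor();
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

For the command in this report, the call would be LowFdLimitLauncher.runAndCapture(10240, Arrays.asList("/Applications/LibreOffice.app/Contents/MacOS/soffice", "--help")), where 10240 is the old macOS limit mentioned in this thread.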

BTW, I'm not sure how to run dtruss on macOS with SIP ON, but an easy way to sample a "hung" process is to use the "sample PID" command. You get a report that shows something like:
Call graph:
    8299 Thread_4069956 DispatchQueue_1: com.apple.main-thread (serial)
    + 8299 start (in dyld) + 1909 [0x200f87345]
    + 8299 main (in soffice) + 11 [0x100e71f5b]
    + 8286 sal_detail_initialize (in libuno_sal.dylib.3) + 178 [0x109978c82]
    + ! 8226 fstat$INODE64 (in libsystem_kernel.dylib) + 10 [0x7ff812440852]
    + ! : 8051 ??? (in <unknown binary>) [0x7ff8a272ea78]
    + ! : 38 ??? (in <unknown binary>) [0x7ff8a272ea08]
    + ! : 30 ??? (in <unknown binary>) [0x7ff8a272ec60]
    + ! : 24 ??? (in <unknown binary>) [0x7ff8a272e69c]
    + ! : 23 ??? (in <unknown binary>) [0x7ff8a272c21c]
    + ! : 20 ??? (in <unknown binary>) [0x7ff8a272e5dc]
    + ! : 16 ??? (in <unknown binary>) [0x7ff8a272ea70]
    + ! : 14 ??? (in <unknown binary>) [0x7ff8a272e5cc]
    + ! : 9 ??? (in <unknown binary>) [0x7ff8a272e698]
    + ! : 1 ??? (in <unknown binary>) [0x7ff8a272e678]
    + ! 33 DYLD-STUB$$fstat$INODE64 (in libuno_sal.dylib.3) + 0 [0x109986cbe]
    + ! 16 cerror_nocancel (in libsystem_kernel.dylib) + 36,6 [0x7ff8124407e4,0x7ff8124407c6]
    + ! 11 cerror_nocancel (in libsystem_kernel.dylib) + 6 [0x7ff8124407c6]
    + ! 11 ??? (in <unknown binary>) [0x7ff8a272d324]
    + 13 sal_detail_initialize (in libuno_sal.dylib.3) + 178 [0x109978c82]
    8299 Thread_4069959: com.apple.rosetta.exceptionserver
    8299 ??? (in runtime) load address 0x7ff7ffd47000 + 0x4414 [0x7ff7ffd4b414]
07-11-2024

Behaves as expected.
07-11-2024

LibreOffice starts the JVM with "MaxFDLimit", which is described as: "Bump the number of file descriptors to maximum (Unix only)".

That limit used to be 10240 in jdk22 and before. For jdk23 and going forward, we changed it to the maximum that macOS allows. However, it turned out that ksh had an issue where the maximum value overflowed an "int", so we rounded it down to MAX_INT. Later, we found out that in the jdwp agent, even with the rounded-down value, code was timing out because it was trying to close all possible file descriptors, just like LibreOffice. We found a way to close only those file descriptors that were actually in use. That fix is here:
https://github.com/openjdk/jdk/commit/a6632487863db5ff3136cdcc76b7440c15ce6be9#diff-1a48137c6688c91d10f931b3e37e4b961b24748fbcb2906d629807aea53db80fR71

3 ways LibreOffice can fix it on their end:
- do not use "MaxFDLimit"
- always use the workaround that limits the MAX to 100,000
- only close file descriptors that the process is using, just like in src/jdk.jdwp.agent/unix/native/libjdwp/exec_md.c

So, it is a regression in the sense that the behavior has changed. But this is not default behavior: a client must explicitly ask for "MaxFDLimit" to encounter this. So it is up to the client to handle the higher limit, if that's what they ask for.
07-11-2024
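
The idea of visiting only the descriptors that are actually open, rather than every number up to the limit, can be illustrated from Java by listing /dev/fd. This is a minimal sketch under the assumption that /dev/fd exists and lists the current process's descriptors (as it does on macOS and Linux). It is not the jdwp agent code, and the class name OpenFdLister is hypothetical. Java cannot close arbitrary descriptors, so the sketch only enumerates them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OpenFdLister {
    /**
     * Returns the descriptor numbers currently open in this process, sorted.
     * The cost is proportional to the number of open descriptors, not to the
     * RLIMIT_NOFILE value. The listing itself briefly holds one descriptor
     * open (the directory stream), which may appear in the result.
     */
    static List<Integer> openFds() {
        List<Integer> fds = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(Paths.get("/dev/fd"))) {
            for (Path entry : entries) {
                try {
                    fds.add(Integer.parseInt(entry.getFileName().toString()));
                } catch (NumberFormatException ignored) {
                    // Skip anything that is not a plain descriptor number.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(fds);
        return fds;
    }

    public static void main(String[] args) {
        System.out.println("open descriptors: " + openFds());
    }
}
```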

We could decrease the default limit to a smaller, but still reasonable value: larger than the old one, but small enough that walking those fds does not appear to hang the process (though that depends on machine performance). I will double check whether there is an API like "closefrom()" on macOS...
06-11-2024

From https://www.gnu.org/software/gnulib/manual/html_node/closefrom.html “The [POSIX] standard developers rejected a proposal to add closefrom() to the [POSIX] standard. Because the standard permits implementations to use inherited file descriptors as a means of providing a conforming environment for the child process, it is not possible to standardize an interface that closes arbitrary file descriptors above a certain value while still guaranteeing a conforming environment.”
06-11-2024

I see the same buggy application logic (in gdb) has been reported in
https://bugs.openjdk.org/browse/JDK-8324577?focusedId=14718514&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14718514
The only solution would be to revert the fix.
04-11-2024

Not sure there is really anything we can or should do about this. The user running soffice could just as easily change the rlimit value and cause soffice to make excessive stat calls. But I'm unclear if the problem is that we have now exceeded the 100000 cap, or are now enforcing it (rather than some lower limit)? Calling stat 100000 times does not seem like a reasonable thing to do at all.
03-11-2024

It looks like a regression and as the runtime team is responsible, it's our responsibility. I will assign it to myself and start working on it.
01-11-2024

The change for JDK-8324577 has brought about this issue by changing RLIMIT_NOFILE. [~gziemski][~dholmes] Please review and comment. Should this be considered a regression? Should the issue be re-assigned to hotspot/runtime?
Soffice has code to guard against an excessive range of fds. However, the change to VM initialization has foiled the sanity check, resulting in soffice exhaustively checking every fd up to init_max. The code in soffice is:
    #if defined MACOSX && !HAVE_FEATURE_MACOSX_SANDBOX
        // On macOS when not sandboxed, soffice can restart itself via exec (see
        // restartOnMac in desktop/source/app/app.cxx), which leaves all file
        // descriptors open, which in turn can have unwanted effects (see
        // <https://bugs.libreoffice.org/show_bug.cgi?id=50603> "Unable to update
        // LibreOffice without resetting user profile"). But closing fds in
        // restartOnMac before calling exec does not work, as additional threads
        // might still be running then, which can still use those fds and cause
        // crashes. Therefore, the simplest solution is to close fds at process
        // start (as early as possible, so that no other threads have been created
        // yet that might already have opened some fds); this is done for all kinds
        // of processes here, not just soffice, but hopefully none of our processes
        // rely on being spawned with certain fds already open. Unfortunately, Mac
        // macOS appears to have no better interface to close all fds (like
        // closefrom):
        long openMax = sysconf(_SC_OPEN_MAX);
        // When LibreOffice restarts itself on macOS 11 beta on arm64, for
        // some reason sysconf(_SC_OPEN_MAX) returns 0x7FFFFFFFFFFFFFFF,
        // so use a sanity limit here.
        if (openMax == -1 || openMax == std::numeric_limits<long>::max()) {
            openMax = 100000;
        }
        assert(openMax >= 0 && openMax <= std::numeric_limits< int >::max());
        for (int fd = 3; fd < int(openMax); ++fd) {
            struct stat s;
            if (fstat(fd, &s) != -1 && S_ISREG(s.st_mode))
                close(fd);
        }
    #endif
01-11-2024
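
To see why the sanity check in the quoted soffice code is bypassed, the check can be transliterated into Java and fed a few values. This is a sketch for illustration only. The class and method names are hypothetical, and Integer.MAX_VALUE as an input is an assumption based on the remark elsewhere in this thread that the limit was rounded down to MAX_INT.

```java
final class OpenMaxSanityCheck {
    /** The fallback used by the quoted soffice code. */
    static final long SANITY_LIMIT = 100_000;

    /** Mirrors the quoted check: only -1 and the maximum long are replaced. */
    static long effectiveOpenMax(long openMax) {
        if (openMax == -1 || openMax == Long.MAX_VALUE) {
            return SANITY_LIMIT;
        }
        return openMax;
    }

    /** Number of fstat calls the quoted loop makes: fd runs from 3 to openMax - 1. */
    static long fstatCalls(long openMax) {
        return Math.max(0, effectiveOpenMax(openMax) - 3);
    }

    public static void main(String[] args) {
        System.out.println(fstatCalls(10_240));            // old macOS limit: prints 10237
        System.out.println(fstatCalls(Long.MAX_VALUE));    // caught by the check: prints 99997
        System.out.println(fstatCalls(Integer.MAX_VALUE)); // not caught: prints 2147483644
    }
}
```

Any large value other than the two special cases passes through unchanged, so the loop length follows the raised limit directly.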

Using dtruss, it appears that soffice is spinning in fstat calls, checking which fds (file descriptors) should be closed. The fd numbers increase sequentially. I think it will eventually continue once the loop finishes. There's more to discover.
31-10-2024

The observations on macOS:
JDK 21.0.4.0.2: Passed.
JDK 21.0.5+1: Failed, no response.
JDK 23: Failed.
JDK 24ea+10: Failed.
30-10-2024
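
When comparing JDK builds such as the ones listed above, it may help to print the file descriptor limit that each JVM reports for itself. The sketch below uses com.sun.management.UnixOperatingSystemMXBean, which is available on Unix-like platforms. The class name FdLimitProbe is hypothetical, and the assumption that the reported value matches what a child process sees through sysconf(_SC_OPEN_MAX) has not been verified here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class FdLimitProbe {
    /** Returns the JVM's maximum file descriptor count, or -1 if the platform bean is unavailable. */
    static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version")
                + " max fds: " + maxFileDescriptors());
    }
}
```

Running the same class under each JDK, with and without -XX:-MaxFDLimit, should show whether the limit differs between the passing and failing builds.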