JDK-8354560 : Exponentially delay subsequent native thread creation in case of EAGAIN
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21,25
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2025-04-15
  • Updated: 2025-05-29
  • Resolved: 2025-05-19
JDK 25: 25 b24 (Fixed)
Description
Filed on behalf of Yannik Stradmann:

https://mail.openjdk.org/pipermail/hotspot-runtime-dev/2025-April/077952.html

I'd like to propose a change to HotSpot's error handling when spawning native threads in os::create_thread().

Currently, if EAGAIN is encountered, we retry three times back-to-back.

In recent years, I've experienced instabilities on certain systems where back-to-back (re-)requests for native threads kept hitting the depleted resource pool and eventually failed.

I therefore propose introducing an exponential backoff when hitting EAGAIN during native thread creation. HotSpot will thereby be kinder to an already depleted resource, reduce stress on the kernel, and become more robust on systems under high load.

For reference, I am attaching a patch against os_linux.cpp, which has been running in production on a mid-scale Jenkins cluster over the past three years. If you approve the modification, I'm happy to create a pull request that includes the other platforms (where applicable).
The current choice of constants is arbitrary and I'd welcome any suggestions here.


Please note that this is my first time contributing to OpenJDK; please excuse any unfamiliarity with the process.

Yannik


diff --git a/src/hotspot/os/linux/os_linux.cpp b/src/hotspot/os/linux/os_linux.cpp
index 4e26797cd5b..2858fbba247 100644
--- a/src/hotspot/os/linux/os_linux.cpp
+++ b/src/hotspot/os/linux/os_linux.cpp
@@ -1064,10 +1064,28 @@ bool os::create_thread(Thread* thread, ThreadType thr_type,
     ResourceMark rm;
     pthread_t tid;
     int ret = 0;
-    int limit = 3;
-    do {
+    int limit = 5;
+    useconds_t delay = 1'000;
+    constexpr useconds_t max_delay = 1'000'000;
+
+    while (true) {
       ret = pthread_create(&tid, &attr, (void* (*)(void*)) thread_native_entry, thread);
-    } while (ret == EAGAIN && limit-- > 0);
+
+      if (ret != EAGAIN) {
+          break;
+      }
+
+      if (limit-- <= 0) {
+          break;
+      }
+
+      log_warning(os, thread)("Failed to start native thread (%s), retrying after %uus.", os::errno_name(ret), delay);
+      ::usleep(delay);
+      delay *= 2;
+      if (delay > max_delay) {
+          delay = max_delay;
+      }
+    }
 
     char buf[64];
     if (ret == 0) {
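
The retry schedule in the patch above can be explored outside of HotSpot with a small standalone sketch. The names `RetryTrace` and `create_with_backoff` are hypothetical and exist only for illustration; the thread-creation call and the sleep are injected as callables so the loop can be exercised without creating threads or actually waiting. The default constants mirror the attached patch (limit 5, initial delay 1000 us, cap 1,000,000 us); the change that was eventually committed may use different values.

```cpp
#include <cerrno>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical record of one run of the retry loop (not part of HotSpot).
struct RetryTrace {
  int result;                            // last value returned by try_create
  int attempts;                          // how many times try_create was called
  std::vector<std::uint32_t> delays_us;  // delay slept before each retry
};

// Mirrors the control flow of the patched loop: retry only on EAGAIN,
// sleep between attempts, double the delay each time, clamp it to a maximum.
RetryTrace create_with_backoff(const std::function<int()>& try_create,
                               const std::function<void(std::uint32_t)>& sleep_us,
                               int limit = 5,
                               std::uint32_t delay_us = 1000,
                               std::uint32_t max_delay_us = 1000000) {
  RetryTrace trace{0, 0, {}};
  while (true) {
    trace.result = try_create();
    ++trace.attempts;

    if (trace.result != EAGAIN) {
      break;  // success, or an error that retrying will not fix
    }
    if (limit-- <= 0) {
      break;  // retry budget exhausted; report EAGAIN to the caller
    }

    sleep_us(delay_us);
    trace.delays_us.push_back(delay_us);

    delay_us *= 2;
    if (delay_us > max_delay_us) {
      delay_us = max_delay_us;
    }
  }
  return trace;
}
```

With these defaults, a creation call that always returns EAGAIN is attempted six times, with sleeps of 1000, 2000, 4000, 8000 and 16000 us in between. The failure is therefore reported roughly 31 ms later than without the delays, and the 1-second cap is never reached unless the limit is raised.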
Comments
Changeset: 27a42435 Branch: master Author: Yannik Stradmann <yjs@stradmann.name> Committer: David Holmes <dholmes@openjdk.org> Date: 2025-05-19 21:28:02 +0000 URL: https://git.openjdk.org/jdk/commit/27a4243561e31d6f2858dd0c0bd356e2849ed87c
19-05-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/24682 Date: 2025-04-16 10:34:18 +0000
16-04-2025

This is not an unreasonable idea, but it is very hard to evaluate the effectiveness of such a change. Do you have any actual data on how many retries were needed before thread creation succeeded? When the retries were added in https://bugs.openjdk.org/browse/JDK-8268773, there was some discussion across a number of bug reports and two PRs about whether even a basic retry would be useful, as the error condition was considered unlikely to be self-correcting. But as per that original change, adding a delay between retries does no harm other than delaying the ultimate reporting of an error, so it may be okay to put in place if it will do some good.
15-04-2025