JDK-8217475 : Unexpected StackOverflowError in "process reaper" thread
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 11,12,13,14,15,16
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-01-22
  • Updated: 2023-12-08
  • Resolved: 2020-07-10
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 11: 11.0.23-oracle (Fixed)
JDK 15: b32 (Fixed)
JDK 16: Fixed
Description
Test shows:

Exception in thread "process reaper" java.lang.StackOverflowError
	at java.base/java.util.concurrent.ConcurrentHashMap.fullAddCount(ConcurrentHashMap.java:2576)
	at java.base/java.util.concurrent.ConcurrentHashMap.addCount(ConcurrentHashMap.java:2326)
	at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1075)
	at java.base/java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1541)
	at java.base/java.lang.invoke.MethodType$ConcurrentWeakInternSet.add(MethodType.java:1380)
	at java.base/java.lang.invoke.MethodType.makeImpl(MethodType.java:327)
	at java.base/java.lang.invoke.MethodHandleNatives.findMethodHandleType(MethodHandleNatives.java:377)
	at java.base/java.util.concurrent.CompletableFuture.completeValue(CompletableFuture.java:305)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2143)
	at java.base/java.lang.ProcessHandleImpl$1.run(ProcessHandleImpl.java:162)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:835)

which is a very small stack to be exhibiting an SOE.

Testcase:

jdk/modules/scenarios/container/ContainerTest.java

Run command:

Command line: [/scratch/opt/mach5/mesos/work_dir/jib-master/install/jdk13-jdk.189/linux-x64-debug.jdk/jdk-13/fastdebug/bin/java -cp /scratch/opt/mach5/mesos/work_dir/slaves/2dd962d0-8988-479b-a804-57ab764ada59-S1209/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/cace545e-997e-47cd-b730-7ffcd73c5bc8/runs/2cface49-ac1e-4f35-90ec-7fbeb6b79638/testOutput/test-support/jtreg_open_test_jdk_jdk_lang/classes/2/jdk/modules/scenarios/container/ContainerTest.d:/scratch/opt/mach5/mesos/work_dir/jib-master/install/jdk13-jdk.189/src.full/open/test/jdk/jdk/modules/scenarios/container:/scratch/opt/mach5/mesos/work_dir/slaves/2dd962d0-8988-479b-a804-57ab764ada59-S1209/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/cace545e-997e-47cd-b730-7ffcd73c5bc8/runs/2cface49-ac1e-4f35-90ec-7fbeb6b79638/testOutput/test-support/jtreg_open_test_jdk_jdk_lang/classes/2/test/lib:/scratch/opt/mach5/mesos/work_dir/jib-master/install/jdk13-jdk.189/src.full/open/test/lib:/scratch/opt/mach5/mesos/work_dir/jib-master/install/java/re/jtreg/4.2/promoted/all/b13/bundles/jtreg_bin-4.2.zip/jtreg/lib/testng.jar:/scratch/opt/mach5/mesos/work_dir/jib-master/install/java/re/jtreg/4.2/promoted/all/b13/bundles/jtreg_bin-4.2.zip/jtreg/lib/javatest.jar:/scratch/opt/mach5/mesos/work_dir/jib-master/install/java/re/jtreg/4.2/promoted/all/b13/bundles/jtreg_bin-4.2.zip/jtreg/lib/jtreg.jar -Xmx512m -XX:MaxRAMPercentage=6 -ea -esa -Xcomp -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+IgnoreUnrecognizedVMOptions -XX:+DeoptimizeALot --module-path mlib -m container ]
Comments
Fix request [11u]: I backport this for parity with 11.0.23-oracle. It adjusts the stackSize. Clean backport. SAP nightly testing passed.
07-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/2341 Date: 2023-12-06 06:31:25 +0000
06-12-2023

URL: https://hg.openjdk.java.net/jdk/jdk15/rev/224aa251bffb User: rriggs Date: 2020-07-10 13:34:40 +0000
10-07-2020

Right, that makes sense. But the mention of 8k does not seem to be in line with the size of the shadow pages, which are 80+k (x86). I would have thought that the red and yellow zones would be sufficient to cover the SOE exception-handler case. Anyway, it's a sensitive area, so I'll go ahead with the workaround.
08-07-2020

This is the check that is performed when the VM has to call into Java code. The code has this comment:

    // Returns true if the current stack pointer is above the stack shadow
    // pages, false otherwise.
    bool os::stack_shadow_pages_available(Thread *thread, const methodHandle& method, address sp) {
      if (!thread->is_Java_thread()) return false;
      // Check if we have StackShadowPages above the yellow zone. This parameter
      // is dependent on the depth of the maximum VM call stack possible from
      // the handler for stack overflow. 'instanceof' in the stack overflow
      // handler or a println uses at least 8k stack of VM and native code
      // respectively.

I think what it is trying to help with is code like this:

    try {
        new Foo(); // triggers class initialization for Foo
    } catch (StackOverflowError soe) {
        // we want enough stack here to do something with soe
    }
08-07-2020
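
For illustration, a minimal runnable sketch (not from the bug report; the class-initialization trigger is replaced by plain recursion) of the pattern the comment above describes - an SOE handler that itself needs stack headroom, e.g. for a println:

    public class SoeHandlerDemo {
        static int depth = 0;

        static void recurse() {
            depth++;
            recurse();
        }

        public static void main(String[] args) {
            try {
                recurse();
            } catch (StackOverflowError soe) {
                // println itself consumes stack; the VM's shadow pages are sized
                // so that this handler can run without a second, fatal overflow.
                System.out.println("caught SOE at depth " + depth);
            }
        }
    }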

Is there any point in asking why, in this particular method-call flow, the shadow stack size is included in the stack-size check? There must be many other paths that call methods without pre-checking the availability of the stack shadow pages. (Asking before I put in a workaround for debug mode.)
08-07-2020

As seen in the stack trace hs_err_pid3358.log, javaCalls.call_helper invokes os::stack_shadow_pages_available to check the stack. From a core dump:

    (gdb) p thread._stack_red_zone_size
    $15 = 4096
    (gdb) p thread._stack_yellow_zone_size
    $16 = 8192
    (gdb) p thread._stack_reserved_zone_size
    $17 = 4096
    (gdb) p thread._stack_shadow_zone_size
    $18 = 90112

4k + 8k + 4k + 88k = 104k

These zone sizes are computed in cpu/xxx/globals_xxx.hpp; the big question is about the shadow zone. For x86 it is:

    cpu/x86/globals_x86.hpp:#define DEFAULT_STACK_SHADOW_PAGES (NOT_WIN64(20) WIN64_ONLY(7) DEBUG_ONLY(+2))

Note that in the computation of the default shadow stack size, it is 8k larger in debug mode, which may explain why this failure is only seen in debug builds. So, of the 128k requested by Java for the stack size, all but a little is reserved. At the point of that particular crash, the hs_err file asserts that 103k is free:

    Stack: [0x00007f7b3405b000,0x00007f7b3407f000], sp=0x00007f7b34074fd0, free space=103k

At the point of checking the stack the values are:

    sp: 7f7b34074fc0, fsize: 00000108, limit: 7f7b34075000

The size of the shadow pages is excessive but is explained by the apparent need for a 64k stack-allocated buffer for networking (globals_x86.hpp:64-66). For the purposes of the ProcessHandle reaper thread, it's a complete waste. If there is a continued justification for wasting that much space in every thread, then the process reaper should raise its request by 8k in debug builds to avoid this problem.
07-07-2020
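
As a sanity check on the arithmetic above, a small sketch (values copied from the gdb output in that comment; DEFAULT_STACK_SHADOW_PAGES is 20, +2 in debug builds, with 4 KiB pages):

    public class StackZoneMath {
        public static void main(String[] args) {
            final int page = 4 * 1024;
            int red      = 1 * page;        //  4k  (_stack_red_zone_size = 4096)
            int yellow   = 2 * page;        //  8k  (_stack_yellow_zone_size = 8192)
            int reserved = 1 * page;        //  4k  (_stack_reserved_zone_size = 4096)
            int shadow   = (20 + 2) * page; // 88k  (_stack_shadow_zone_size = 90112, debug build)
            int total = red + yellow + reserved + shadow;
            // Prints "104k reserved" - of the 128k the reaper thread requests,
            // only ~24k remains usable in debug builds (8k less than in product builds).
            System.out.println((total / 1024) + "k reserved");
        }
    }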

It appears that the class being initialized is java.util.Random, the superclass of ThreadLocalRandom. It is also very odd that ThreadLocalRandom does not show up in -verbose:class logging. I don't know the full significance of the comment in getProbe() saying it does not force initialization. In a non-SOE run, Random does not show up in class-load logging.
07-07-2020

Attached hs_err_pid3358. I have a core dump from a modified javaCalls.call_helper that prints the sp, frame size, and the limit just before calling fatal(...). The "free space=103k" looks incorrect based on the actual sp and limit. I'm not sure what to look for in the core dump.
01-07-2020

Appendix patching is the patching of an optional argument in MH calls that is unknown during compilation - it is C1-specific:

    http://hg.openjdk.java.net/jdk/jdk/file/de6ad5f86276/src/hotspot/share/c1/c1_GraphBuilder.cpp#l1901
    http://hg.openjdk.java.net/jdk/jdk/file/de6ad5f86276/src/hotspot/share/c1/c1_Runtime1.cpp#l1123

With thanks to [~kvn].
01-07-2020

Yes, you are right, Peter - I flagged the wrong code. I was mistakenly thinking the lambda expression would result in the $1 nested class, but that is completely wrong - and we already have the actual source of the failure - line 162 - from the exception. I don't think there is any issue with having a wrong, or incomplete, stack in the hs-err file (though the hs_err file shows compiled Java frames, so it doesn't exactly align with the exception stacktrace). It is curious that 103KB is not enough stack to continue with the <clinit> execution, but even if it were, we may well need even more stack after that whilst executing the <clinit>. The "appendix patching" is certainly consuming additional stack space, and seems related to the method/var-handle mechanics used by CompletableFuture.
01-07-2020
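
For context on those method/var-handle mechanics: CompletableFuture.complete stores its result through a VarHandle, and the first invocation of a VarHandle access method with a given signature triggers the linkage path (MethodType.makeImpl, MethodHandleNatives.findMethodHandleType) visible in the trace at the top of this report. A minimal sketch of that pattern (simplified; not the JDK source itself):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class ResultHolder {
        volatile Object result;

        private static final VarHandle RESULT;
        static {
            try {
                RESULT = MethodHandles.lookup()
                        .findVarHandle(ResultHolder.class, "result", Object.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        boolean complete(Object r) {
            // The first compareAndSet call with this signature links the
            // call site - that one-time work is what needs the extra stack.
            return RESULT.compareAndSet(this, null, r);
        }
    }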

These are -Xcomp runs so there is a lot of compilation going on. The hs_err files document that there is quite a bit of stack available (103k). I'll try to reproduce and find out the sp vs limit.
30-06-2020

...Is it possible that something happens in the thread started by processReaperExecutor, but before the above Runnable.run() is called, such that execution of code in that Runnable by that thread triggers this StackOverflowError, which shows the Runnable.run() on the stack trace followed by some frames that are not actually part of the current call stack but originate from execution pre-dating the execution of the run() method by the same thread? There's substantial code executed by the newly started thread in ThreadPoolExecutor (starting with Worker.run()) that may involve initialization of other classes such as ForkJoinPool (which initializes VarHandle(s) in its <clinit>, etc.). So what does "move_appendix_patching" actually do, and is it possible that when it bails out in this place:

    if (!os::stack_shadow_pages_available(THREAD, method, sp)) {
      // Throw stack overflow exception with preinitialized exception.
      Exceptions::throw_stack_overflow_exception(THREAD, __FILE__, __LINE__, method);
      return;

...the call stack is in such a state that throw_stack_overflow_exception enlists frames composed from later and earlier execution somehow glued together?
30-06-2020

Well, I don't think that lambda creation shown above is in the stack trace. From the typical name of the class and method shown in the stack trace:

    java.lang.ProcessHandleImpl$1.run()

...I would say it is the run() method of an anonymous inner Runnable subclass at line 135 that is being executed:

    // spawn a thread to wait for and deliver the exit value
    processReaperExecutor.execute(new Runnable() {  // Use inner class to avoid lambda stack overhead
        public void run() {
            int exitValue = waitForProcessExit0(pid, shouldReap);
            if (exitValue == NOT_A_CHILD) {
                // pid not alive or not a child of this process
                // If it is alive wait for it to terminate
                long sleep = 300;     // initial milliseconds to sleep
                int incr = 30;        // increment to the sleep time
                long startTime = isAlive0(pid);
                long origStart = startTime;
                while (startTime >= 0) {
                    try {
                        Thread.sleep(Math.min(sleep, 5000L)); // no more than 5 sec
                        sleep += incr;
                    } catch (InterruptedException ie) {
                        // ignore and retry
                    }
                    startTime = isAlive0(pid);  // recheck if it is alive
                    if (startTime > 0 && origStart > 0 && startTime != origStart) {
                        // start time changed (and is not zero), pid is not the same process
                        break;
                    }
                }
                exitValue = 0;
            }
            newCompletion.complete(exitValue);
            // remove from cache afterwards
            completions.remove(pid, newCompletion);
        }
    });

...this Runnable is executed by processReaperExecutor, which was constructed in the static initializer shown earlier, so initialization and execution of the lambda had long been done successfully before this happens here. So my guess is that something in this Runnable.run() is causing the "load_appendix_patching" followed by MethodHandleNatives.findMethodHandleType etc... But what?
30-06-2020

In the hs-err file cases we appear to be triggering the SOE during static initialization of a class, in response to an attempt to call a static method on that class:

    V  [libjvm.so+0xa7c9de]  Exceptions::throw_stack_overflow_exception(Thread*, char const*, int, methodHandle const&)+0xde
    V  [libjvm.so+0xd008af]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x39f
    V  [libjvm.so+0xcc1c2f]  InstanceKlass::call_class_initializer(Thread*)+0x1bf
    V  [libjvm.so+0xcc3118]  InstanceKlass::initialize_impl(Thread*)+0x628
    V  [libjvm.so+0xcc2e7d]  InstanceKlass::initialize_impl(Thread*)+0x38d
    V  [libjvm.so+0x112705f]  LinkResolver::resolve_static_call(CallInfo&, LinkInfo const&, bool, Thread*)+0xcf
    V  [libjvm.so+0x112df03]  LinkResolver::resolve_invoke(CallInfo&, Handle, constantPoolHandle const&, int, Bytecodes::Code, Thread*)+0x183
    V  [libjvm.so+0x153f363]  SharedRuntime::find_callee_info_helper(JavaThread*, vframeStream&, Bytecodes::Code&, CallInfo&, Thread*)+0x5f3
    V  [libjvm.so+0x154172e]  SharedRuntime::resolve_sub_helper(JavaThread*, bool, bool, Thread*)+0x18e
    V  [libjvm.so+0x1541d2e]  SharedRuntime::resolve_helper(JavaThread*, bool, bool, Thread*)+0x4e
    V  [libjvm.so+0x1542031]  SharedRuntime::resolve_static_call_C(JavaThread*)+0x131
    v  ~RuntimeStub::resolve_static_call

In call_helper we have:

    if (!os::stack_shadow_pages_available(THREAD, method, sp)) {
      // Throw stack overflow exception with preinitialized exception.
      Exceptions::throw_stack_overflow_exception(THREAD, __FILE__, __LINE__, method);
      return;

so we don't think we have enough stack to make the call into Java to run <clinit>. So what are we trying to initialize? Unfortunately that's not discernible from the hs_err file, but looking at fullAddCount we would have to suspect this:

    if ((h = ThreadLocalRandom.getProbe()) == 0) {

But that is the "end game". The more interesting question is how we got into that code, which as Roger indicates stems back to appendix patching (whatever that is!):

    V  [libjvm.so+0x72652c]  Runtime1::patch_code(JavaThread*, Runtime1::StubID)+0x14fc
    V  [libjvm.so+0x728e77]  Runtime1::move_appendix_patching(JavaThread*)+0x37
    v  ~RuntimeStub::load_appendix_patching Runtime1 stub
    J 13558 c1 java.lang.ProcessHandleImpl$1.run()V java.base@16-internal (136 bytes) @ 0x00007f3a69f559fa [0x00007f3a69f54b40+0x0000000000000eba]

Looking in ProcessHandleImpl.java I see:

    private static final Executor processReaperExecutor =
        doPrivileged((PrivilegedAction<Executor>) () -> {
            ThreadGroup tg = Thread.currentThread().getThreadGroup();
            while (tg.getParent() != null) tg = tg.getParent();
            ThreadGroup systemThreadGroup = tg;

            final long stackSize = Boolean.getBoolean("jdk.lang.processReaperUseDefaultStackSize")
                    ? 0 : REAPER_DEFAULT_STACKSIZE;

            ThreadFactory threadFactory = grimReaper -> {
                Thread t = new Thread(systemThreadGroup, grimReaper,
                                      "process reaper", stackSize, false);
                t.setDaemon(true);
                // A small attempt (probably futile) to avoid priority inversion
                t.setPriority(Thread.MAX_PRIORITY);
                return t;
            };

            return Executors.newCachedThreadPool(threadFactory);
        });

So my take here is that (as has been mentioned above by Martin, IIRC) the issue may well be the use of lambda expressions resulting in a ton of internal method-handle-related initialization. So my suggestion is to get rid of the use of lambda in this context.
30-06-2020
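
A hedged sketch of that suggestion - replacing the lambda-based ThreadFactory with an anonymous inner class so that building the reaper executor involves no invokedynamic/method-handle linkage. (Illustrative only; the names mirror the code quoted above, and this is not the fix that was ultimately committed, which adjusted the requested stack size instead.)

    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    class ReaperExecutorSketch {
        static Executor build(ThreadGroup systemThreadGroup, long stackSize) {
            // Anonymous inner class instead of the lambda "grimReaper -> ...".
            ThreadFactory threadFactory = new ThreadFactory() {
                @Override
                public Thread newThread(Runnable grimReaper) {
                    Thread t = new Thread(systemThreadGroup, grimReaper,
                                          "process reaper", stackSize, false);
                    t.setDaemon(true);
                    // A small attempt (probably futile) to avoid priority inversion
                    t.setPriority(Thread.MAX_PRIORITY);
                    return t;
                }
            };
            return Executors.newCachedThreadPool(threadFactory);
        }
    }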

There is one explicit call to Exceptions::throw_stack_overflow_exception in javaCalls.call_helper. Common to the two hs_err files is a call to RuntimeStub::load_appendix_patching, but that may be ordinary.
29-06-2020

Setting -XX:AbortVMOnException=java.lang.StackOverflowError produced the hs_err file attached, running java.lang.System.LoggerFinder.modules.UnnamedLoggerForImageTest. The head of the stack is:

    Current thread (0x00007f3a809508f0): JavaThread "process reaper" daemon [_thread_in_vm, id=23466, stack(0x00007f3a15600000,0x00007f3a15624000)]

    Stack: [0x00007f3a15600000,0x00007f3a15624000], sp=0x00007f3a15619e10, free space=103k
    Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0xa7a164]  Exceptions::debug_check_abort_helper(Handle, char const*)+0x184
    V  [libjvm.so+0xa7ae41]  Exceptions::_throw(Thread*, char const*, int, Handle, char const*)+0x121
    V  [libjvm.so+0xa7c9de]  Exceptions::throw_stack_overflow_exception(Thread*, char const*, int, methodHandle const&)+0xde
    V  [libjvm.so+0xd008af]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x39f
    V  [libjvm.so+0xcc1c2f]  InstanceKlass::call_class_initializer(Thread*)+0x1bf
    V  [libjvm.so+0xcc3118]  InstanceKlass::initialize_impl(Thread*)+0x628
    V  [libjvm.so+0xcc2e7d]  InstanceKlass::initialize_impl(Thread*)+0x38d
    V  [libjvm.so+0x112705f]  LinkResolver::resolve_static_call(CallInfo&, LinkInfo const&, bool, Thread*)+0xcf
    V  [libjvm.so+0x112df03]  LinkResolver::resolve_invoke(CallInfo&, Handle, constantPoolHandle const&, int, Bytecodes::Code, Thread*)+0x183
    V  [libjvm.so+0x153f363]  SharedRuntime::find_callee_info_helper(JavaThread*, vframeStream&, Bytecodes::Code&, CallInfo&, Thread*)+0x5f3
    V  [libjvm.so+0x154172e]  SharedRuntime::resolve_sub_helper(JavaThread*, bool, bool, Thread*)+0x18e
    V  [libjvm.so+0x1541d2e]  SharedRuntime::resolve_helper(JavaThread*, bool, bool, Thread*)+0x4e
    V  [libjvm.so+0x1542031]  SharedRuntime::resolve_static_call_C(JavaThread*)+0x131
    v  ~RuntimeStub::resolve_static_call
    J 14116 c1 java.util.concurrent.ConcurrentHashMap.fullAddCount(JZ)V java.base@16-internal (462 bytes) @ 0x00007f3a6a52fa4c [0x00007f3a6a52f9e0+0x000000000000006c]
    J 4789 c2 java.util.concurrent.ConcurrentHashMap.addCount(JI)V java.base@16-internal (280 bytes) @ 0x00007f3a70ad9d34 [0x00007f3a70ad9ce0+0x0000000000000054]
    ...
29-06-2020

Added a hs_err_pid22832.log with similar symptoms.
29-06-2020

I don't think we can actually see where/why the original SOE is triggered in the hs_err file. What we are seeing is the stack where we create the SOE instance - which is a different thing. We need to abort when we get the page fault (assuming this is a case where the SOE is actually triggered by a page fault).
29-06-2020

We saw a sudden spike in these occurrences in the past few CI runs. I just tagged 4 different occurrences.
25-05-2020

Another instance: JDK-8230299
30-08-2019

[~plevart], I'm just lurking here and don't have much background info. However, I just decided to try out the Test program from your comment above on my setup. It worked/passed fine with 128000, both with and without the "new CompletableFuture<>().complete(null);" commented out, and with no other changes. I'm on macOS 10.14.1 and have tried this with a pre-built JDK 11:

    openjdk 11.0.1 2018-10-16
    OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)

as well as a locally built version of the latest upstream jdk repo:

    openjdk 14-internal 2020-03-17
    OpenJDK Runtime Environment (build 14-internal+0-adhoc.jaikiran.jdk)
    OpenJDK 64-Bit Server VM (build 14-internal+0-adhoc.jaikiran.jdk, mixed mode, sharing)
12-07-2019

I agree, the command line does not reveal any special native tool being used. But something is either eating some of the stack, or it is not giving the reaper thread the full 128k of stack that it is asking for...
28-06-2019

[~plevart] AFAIK there is no TLS involved here.
26-06-2019

Probably relates to (will be fixed by) JDK-8225035
19-06-2019

To put things into perspective, I created a small experiment. Here's a simple test:

    import java.util.concurrent.CompletableFuture;

    public class Test {
        static void recurse(CompletableFuture<Boolean> cf, int i) {
            if (i == 0) {
                cf.complete(Boolean.TRUE);
            } else {
                recurse(cf, i - 1);
            }
        }

        public static void main(String[] args) throws Exception {
            int stackSize = Integer.parseInt(args[0]);
            System.out.println("stackSize=" + stackSize);
            Boolean.TRUE.booleanValue(); // pre-load Boolean class
            new CompletableFuture<>().complete(null);
            var cf = new CompletableFuture<Boolean>();
            new Thread(
                Thread.currentThread().getThreadGroup(),
                () -> {
                    try {
                        recurse(cf, 310);
                    } catch (StackOverflowError e) {
                        System.out.println("Stack overflow at depth: " + e.getStackTrace().length);
                        e.printStackTrace();
                        cf.complete(Boolean.FALSE);
                    }
                },
                "SmallStackThread",
                stackSize
            ).start();
            if (!cf.get()) throw new AssertionError("Failure");
        }
    }

This test passes when run with a stackSize of 128000 (the same as used for the process reaper thread) on JDK 11 on Linux, but fails when the line with new CompletableFuture<>().complete(null) is commented out. The following is the stack trace of such a failure:

    stackSize=128000
    Stack overflow at depth: 321
    java.lang.StackOverflowError
        at java.base/java.lang.ref.Reference.<init>(Reference.java:395)
        at java.base/java.lang.ref.WeakReference.<init>(WeakReference.java:57)
        at java.base/java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry.<init>(MethodType.java:1342)
        at java.base/java.lang.invoke.MethodType$ConcurrentWeakInternSet.get(MethodType.java:1289)
        at java.base/java.lang.invoke.MethodType.makeImpl(MethodType.java:299)
        at java.base/java.lang.invoke.MethodHandleNatives.findMethodHandleType(MethodHandleNatives.java:376)
        at java.base/java.util.concurrent.CompletableFuture.completeValue(CompletableFuture.java:305)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2072)
        at Test.recurse(Test.java:7)
        at Test.recurse(Test.java:9)
        ... 307 lines skipped ...
        at Test.recurse(Test.java:9)
        at Test.recurse(Test.java:9)
        at Test.lambda$main$0(Test.java:26)
        at java.base/java.lang.Thread.run(Thread.java:834)

...but the test also passes when the line with new CompletableFuture<>().complete(null) is left commented out and the invocation recurse(cf, 310) is changed to recurse(cf, 294), which suggests that 128k of stack should be plenty more than is needed to link the VarHandle call site. The original stack trace attached to this issue suggests that in that particular case there is much less than 128k of stack available to the reaper thread. The cure for this issue therefore lies in identifying why that is so and fixing it. Even pre-linking the VarHandle might not be enough - the real issue might be elsewhere (maybe it is related to glibc's treatment of thread-local storage, which is allocated on the thread's stack?)
18-06-2019

Paul Sandoz would be the right person to ask whether the 1st invocation of a particular VarHandle polymorphic method with a particular "signature" performs the linking of the method/callsite to the target. Intuitively it has to be so, otherwise the performance of VarHandle(s) would be inadequate. So intuitively, pre-invoking new CompletableFuture<>().complete(null) does that pre-linking. I'm not sure whether this lazy linking happens once per method/signature or once per call site, but it doesn't matter: the call site is in CompletableFuture.complete(), which we would like to pre-link. I don't know what "deoptimizeALot" is for, but it sounds like it has to do with JIT, which comes later, after linking. Invoking new CompletableFuture<>().complete(null) in CompletableFuture.<clinit> seems like a more general solution than invoking it in ProcessHandleImpl.<clinit>, but also less precise. Imagine what is more likely to be triggered from a stack-constrained thread: CompletableFuture.<clinit> or ProcessHandleImpl.<clinit>? I think keeping CompletableFuture lazy or making it more eagerly initialized does not make a particular difference in general - you can get into problems either way. ProcessHandleImpl, OTOH, is initialized just before it is 1st needed to start the stack-constrained reaper thread, which might need to invoke CompletableFuture.complete(), so that seems like a more suitable point to pre-initialize CompletableFuture's infrastructure.
18-06-2019
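
A minimal sketch of the pre-linking idea discussed above (illustrative, not the committed fix; 128000 is the reaper stack size quoted elsewhere in this thread): completing a throwaway CompletableFuture once on a normal-stack thread links the VarHandle call site, so a later complete() on a stack-constrained thread skips the deep findMethodHandleType path.

    import java.util.concurrent.CompletableFuture;

    public class PrelinkSketch {
        public static void main(String[] args) throws Exception {
            // Warm-up on the main thread: links the call site inside
            // CompletableFuture.complete before any small-stack thread runs.
            new CompletableFuture<>().complete(null);

            CompletableFuture<Boolean> cf = new CompletableFuture<>();
            new Thread(Thread.currentThread().getThreadGroup(),
                       () -> cf.complete(Boolean.TRUE),
                       "SmallStackThread", 128000).start();
            System.out.println("completed: " + cf.get());
        }
    }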

Peter's hack seems likely to work. But unlike semi-transparent <clinit>, where a user can explicitly load a class to run <clinit>, there doesn't seem to be any guaranteed way to run "j.l.invoke initialization". Maybe deoptimizeALot will cause the init code to be re-run? Is there any way we could fix CompletableFuture's <clinit> so that VarHandle initialization gets further? How crazy is it to put new CompletableFuture<>().complete(null); at the end of CompletableFuture's <clinit> ?
17-06-2019

A quick-and-dirty fix could be to pre-initialize CompletableFuture's RESULT VarHandle in a thread that happens to initialize the ProcessHandleImpl class:

    Index: src/java.base/share/classes/java/lang/ProcessHandleImpl.java
    ===================================================================
    --- src/java.base/share/classes/java/lang/ProcessHandleImpl.java  (revision 55398:e53ec3b362f42ca94b120141b6da6dcfeba346f2)
    +++ src/java.base/share/classes/java/lang/ProcessHandleImpl.java  (revision 55398+:e53ec3b362f4+)
    @@ -74,6 +74,11 @@
             initNative();
             long pid = getCurrentPid0();
             current = new ProcessHandleImpl(pid, isAlive0(pid));
    +
    +        // pre-initialize CompletableFuture.RESULT VarHandle so that we don't get
    +        // StackOverflowError later when CompletableFuture.complete is 1st called
    +        // from a stack-constrained reaper thread ...
    +        new CompletableFuture<>().complete(null);
         }

         private static native void initNative();

This would prevent the following call frames from being executed as part of the process reaper's CF.complete call:

    at java.base/java.util.concurrent.ConcurrentHashMap.fullAddCount(ConcurrentHashMap.java:2576)
    at java.base/java.util.concurrent.ConcurrentHashMap.addCount(ConcurrentHashMap.java:2326)
    at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1075)
    at java.base/java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1541)
    at java.base/java.lang.invoke.MethodType$ConcurrentWeakInternSet.add(MethodType.java:1380)
    at java.base/java.lang.invoke.MethodType.makeImpl(MethodType.java:327)
    at java.base/java.lang.invoke.MethodHandleNatives.findMethodHandleType(MethodHandleNatives.java:377)

There is, however, other code that may get executed synchronously as part of the next completion stage, like the following:

    ProcessHandleImpl.completion(pid, true).handle((exitcode, throwable) -> {
        synchronized (this) {
            this.exitcode = (exitcode == null) ? -1 : exitcode.intValue();
            this.hasExited = true;
            this.notifyAll();
        }
        if (stdout instanceof ProcessPipeInputStream)
            ((ProcessPipeInputStream) stdout).processExited();
        if (stderr instanceof ProcessPipeInputStream)
            ((ProcessPipeInputStream) stderr).processExited();
        if (stdin instanceof ProcessPipeOutputStream)
            ((ProcessPipeOutputStream) stdin).processExited();
        return null;
    });

...but that hasn't caused problems before, so it apparently needs less stack.
17-06-2019

When I first implemented "process reaper" long ago, I gave it a small stack size because it didn't do very much - just call waitpid. It seems that over the years more machinery got added and now the thread is busy doing "findMethodHandle", which didn't exist back then. There seems to be a high risk that any small-stack-size thread will somehow trigger java.lang.invoke machinery, and that will somehow trigger one-time setup, similar to <clinit>, with resulting StackOverflowError. I'm sorry that "process reaper" caused so many problems over the years, but it does highlight the thorny fixed-size stack problem that continues to not get addressed.
14-06-2019

Added [~rriggs] and [~martin] to the Watchers due to their knowledge and history in this area. This seems very similar to JDK-8173817 (as Dean noted a few months ago!). Have changes in the j.u.c code caused a potential increase in stack requirements?
14-06-2019

Correction: looks like DeoptimizeALot may be the common factor.
15-04-2019

This is not restricted to -Xcomp and -XX:+DeoptimizeALot as far as I can see. We are seeing fairly regular occurrences of this failure mode in the CI testing - see for example JDK-8222441 (which I just closed as a duplicate of this).
15-04-2019

Looks like a repeat of JDK-8173817.
23-02-2019