Bug ID: JDK-8364654 Add diagnostic option to use sigaltstack on Linux for Hotspot created threads

Type: Enhancement
Component: hotspot
Sub-Component: runtime

Priority: P4
Status: Open
Resolution: Unresolved

Submitted: 2025-08-04
Updated: 2025-09-02

Other: tbd, Unresolved

Background:

In complex environments that mix Java and native code, the Linux kernel can force-kill the process with signal 11 (SIGSEGV). This occurs when the stack has grown very close to the yellow zone and the JVM calls a native function with a large stack frame. The access may jump past the yellow zone entirely, landing either in the red zone or past the end of the stack. If it lands past the end of the stack, the kernel force-kills the process with almost no information (no stack trace, hs-err log, or core dump), because no alternate signal stack is present for the handler to run on. If it lands in the red zone, the signal handler faults again, the red zone is unprotected, and execution resumes from the red zone, after which the native code may run past the end of the stack. In these extreme cases almost no information is retrievable, making the root cause un-debuggable. This type of un-debuggable stack overflow crash could also happen on JVM internal threads, e.g. JDK-8366118.

Proposal:

Add diagnostic options to use sigaltstack on Linux for Hotspot-created threads through two new flags: "-XX:+UseSigAltStack" and "-XX:SigAltStackSize=<size>". The former enables the alternate signal stack, and the latter sets its size, starting at SIGSTKSZ. Additionally, when UseSigAltStack is enabled, we should pass SA_ONSTACK when setting up signal handlers.

Due to the complicated interaction with the JVM's signal handling and stack-overflow detection mechanisms, as well as issues such as choosing a sufficient SigAltStackSize and the potential memory overhead, UseSigAltStack and SigAltStackSize will remain diagnostic options and will not be turned on by default. This feature is intended only to help debug the un-debuggable crashes described above.
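
As a rough illustration of the mechanism proposed above (not the actual patch), the following C sketch installs an alternate signal stack for the calling thread with sigaltstack, registers a handler with SA_ONSTACK, and checks that the handler really ran on the alternate stack. The function names, the use of malloc, SIGUSR1 as the test signal, and the 256K size are assumptions made for this example. The alternate stack set by sigaltstack applies only to the calling thread, so a real implementation would repeat the setup for each thread. The teardown disables the alternate stack with SS_DISABLE before freeing it, which is the ordering raised in the comments.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size for illustration only (256K). */
#define ALT_STACK_SIZE ((size_t)256 * 1024)

static char *alt_base = NULL;                          /* start of the alternate stack */
static volatile sig_atomic_t handler_on_alt_stack = 0; /* set by the handler */

/* Records whether the handler's own frame lies inside the alternate stack. */
void probe_handler(int sig, siginfo_t *info, void *ucontext) {
    (void)sig; (void)info; (void)ucontext;
    char probe;                          /* a local in the handler's frame */
    uintptr_t p  = (uintptr_t)&probe;
    uintptr_t lo = (uintptr_t)alt_base;
    handler_on_alt_stack = (p >= lo && p < lo + ALT_STACK_SIZE) ? 1 : 0;
}

/* Allocates memory outside the thread stack and registers it as the
 * alternate signal stack of the calling thread. Returns 0 on success. */
int install_alt_stack(void) {
    assert((long)ALT_STACK_SIZE >= (long)MINSIGSTKSZ);
    alt_base = (char *)malloc(ALT_STACK_SIZE);
    if (alt_base == NULL) return -1;

    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(alt_base);
        alt_base = NULL;
        return -1;
    }
    return 0;
}

/* Installs the handler; SA_ONSTACK asks the kernel to run it on the
 * alternate stack when one is registered. Returns 0 on success. */
int install_handler(int signo) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = probe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    return sigaction(signo, &sa, NULL);
}

/* Disables the alternate stack first, then frees its memory, so a late
 * signal cannot be delivered onto freed memory. Returns 0 on success. */
int remove_alt_stack(void) {
    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_flags = SS_DISABLE;
    if (sigaltstack(&ss, NULL) != 0) return -1;
    free(alt_base);
    alt_base = NULL;
    return 0;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if it did not,
 * and -1 if any setup or teardown step failed. */
int run_demo(void) {
    if (install_alt_stack() != 0) return -1;
    if (install_handler(SIGUSR1) != 0) return -1;
    raise(SIGUSR1);                      /* handler completes before raise returns */
    int result = handler_on_alt_stack;
    if (remove_alt_stack() != 0) return -1;
    return result;
}
```

On Linux, run_demo() should return 1 when the handler executed on the alternate stack. The sketch uses SIGUSR1 only to keep the demonstration harmless. The case this issue targets is SIGSEGV raised by a stack overflow, where the thread's own stack has no room left for the signal frame.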

Thanks. Let's treat this RFE as similar to, but distinct from, JDK-7154055 and JDK-7109520. Those two old issues did not propose adding sigaltstack as a diagnostic-only feature, and they targeted JDK 6/7.
02-09-2025
Is this still a duplicate of JDK-7154055? It's closed as Won't Fix. I changed the link to "relates".
01-09-2025
> It sounds like we have an agreement for adding this feature as a diagnostic option.
You read an awful lot into "A diagnostic use may be acceptable"! If this feature were to go in, then IMO it could only go in as a diagnostic feature. But that doesn't mean it should go in. That is yet to be determined. At the moment the proposal seems incomplete.
27-08-2025
Thank you for the feedback! It sounds like we have an agreement for adding this feature as a diagnostic option. I reopened the bug and updated the description. I'll try to resolve the issues with stack-walking for hs_err and with thread termination, and will add a test and better documentation. Then I'll send out the PR for review.
26-08-2025
This patch helped immensely in debugging https://bugs.openjdk.org/browse/JDK-8366118. For the patch, the latest version is in https://github.com/caoman/jdk/tree/JDK-8364654-sigaltstack. With +UseSigAltStack and SigAltStackSize=256K, on linux-x64 9 tests are failing, mainly due to hs-err log lacking some information. (My previous testing was in our internal runs of jtreg tier1 instead of GitHub pre-submit, and we did not encounter these errors.) The result still shows that the JVM works well in most cases with +UseSigAltStack, except with some differences in crash logs. Anyway, how about the proposal to add this feature via DIAGNOSTIC or EXPERIMENTAL flags?
26-08-2025
Apologies that I have not had time to investigate your reproducer yet. That is an interesting Proof-of-Concept but I don't understand how we can use the alt-stack for general purpose signal-handling: how does stack-walking for hs_err production work? I expected the altstack to only be used for the specific cases that need it. Also I think it needs a bit more work to be robust i.e. don't you need to disable the alt-stack during thread termination before deleting it? The thread could still hit a fault on its way to proper termination. A diagnostic use may be acceptable, though we still need to have test code for it to ensure it does not bit-rot. You will also need documentation explaining when to use it and what it can help with.
26-08-2025
Thank you for digging those up. Unfortunately, increasing StackShadowPages does not help. I attached a repro case, and it keeps crashing without any stack trace even with -XX:StackShadowPages=50. Additionally, I have implemented a prototype at https://github.com/caoman/jdk/commit/1dc6f4bab11e59f304ab85b0e5c9a05cef18f2be. It does not look too complex and solves the problem with the attached repro case. I did additional testing by flipping `UseSigAltStack` to true by default and increasing the default -XX:SigAltStackSize to 256KB, and all tier1 jtreg tests passed on Linux. I agree there are hard-to-debug issues when the alt-stack itself is too small. Could we consider providing the `UseSigAltStack` feature as a DIAGNOSTIC or EXPERIMENTAL JVM flag? It could be supported on a best-effort basis. E.g. if a bug only happens with `-XX:+UseSigAltStack`, fixing it will not be a high priority.
19-08-2025
Here is some previous discussion in relation to this: https://mail.openjdk.org/pipermail/hotspot-runtime-dev/2011-August/002354.html
I also found some later internal discussion from 2013 that raised a few other points. Many of the issues related to this no longer apply, e.g. issues with the Solaris T1 thread library and issues with the x86 "current thread" cache. But there was also an issue that our stack-walkers couldn't walk the alt-stack, so they would need updating. Also I note this comment from Coleen: "To handle large native stacks, you have to increase the StackShadowPages so that they cover the estimated size of the native stacks. StackRedPages and StackYellowPages should stay the same. That's how the design is supposed to work, ..."
18-08-2025
Sorry for the delayed response (vacation). My recollection is that someone did investigate such an alt-stack implementation 10-15 years ago, and that we rejected it at the time because it was too complex and potentially introduced a whole swag of new bugs if the alt-stack was itself too small. Unfortunately I don't think we have the related discussion and findings preserved anywhere. I still feel that increasing yellow/red pages should be able to provide some relief here. There will always be bugs in native code that our stack guards can't help with because native code can potentially jump over anything, but for the more usual cases of simply rolling over the end of the stack due to excessive recursion, it should help.
18-08-2025
Interestingly, osThread_linux|bsd|aix.hpp have members _alt_sig_stack, set_alt_sig_stack(address val), and alt_sig_stack() that appear unused. We could use them for this RFE if we could move forward.
13-08-2025
Thanks. It looks like the issues in JDK-4852809 were due to the alt-stack being part of the thread's stack (-Xss):
> we install alternate signal stack for SIGSEGV at the lower end of thread stack
This RFE will propose using a chunk of memory separate from the thread's stack for alt-stacks (such as malloc'ed memory), to provide better error handling for stack overflows. Have we explored such an approach before? Were there problems encountered?
06-08-2025
@manc I have made JDK-4852809 viewable.
06-08-2025
> and got rid of them (JDK-4852809).
Could you provide a summary of the issues previously encountered with alt-stacks? We cannot view JDK-4852809.
05-08-2025
We tried increasing the yellow/red zone sizes, and it wasn't enough. If you are close enough to the yellow/red zone in native code and touch it, the signal frame that is pushed onto the stack can push you over. When the signal handler attempts to run, it will SIGSEGV before the JVM signal handler can run and `mprotect` the pages. This ends with the kernel force-killing the process. There are other scenarios where this can happen. The reason I was suggesting sigaltstack is that it is the only comprehensive solution that works in all cases (unless your signal handler does something extremely ill-advised). Though I can see why apprehension exists due to the complexity. Is there a world where this can be revisited and re-added? Maybe with more comprehensive test coverage? The main problem areas I can think of are making sure the stack unwinding code can handle signal frames, including nested ones, and can correctly identify the signal stack and the main stack. When trying to identify the root cause, if increasing the yellow/red zone sizes doesn't work, the only recourse is hacking in sigaltstack. When you need it, you really need it. Though I agree that actually requiring it should be rare.
05-08-2025
@jcking As mentioned in JDK-7154055, we used alt-stacks many years ago and got rid of them (JDK-4852809). There have been a number of issues filed over the years about this and the consensus has been that the purported benefit from using this - in the one case where you may need it - is not worth the complexity and potential for other bugs that it introduces.
05-08-2025
BTW, if you encounter a situation in the field where you think this may be occurring, then increasing the size of the yellow and red zones is probably the best recourse.
05-08-2025

Relates :	JDK-7154055 - Please add alternate signal stacks to Linux JVM for better error reporting
Relates :	JDK-8366118 - DontCompileHugeMethods is not respected with -XX:-TieredCompilation
Relates :	JDK-7109520 - Can't get hs_err log on native stack overflow on Linux