JDK-8319090 : StackOverflow from ThreadLocal.set might shadow the root cause
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 11,17,21,22
  • Priority: P4
  • Status: New
  • Resolution: Unresolved
  • Submitted: 2023-10-30
  • Updated: 2023-11-10
Related Reports
Relates :  
Relates :  
Description
One of the cases for ThreadLocal is to pass context via try-finally.

For instance:
https://github.com/spring-projects/spring-framework/blob/8770544769367b780b3667b832920c78d4a87611/spring-aop/src/main/java/org/springframework/aop/interceptor/ExposeInvocationInterceptor.java#L94-L101

private static final ThreadLocal<MethodInvocation> invocation =
	new NamedThreadLocal<>("Current AOP method invocation");

public Object invoke(MethodInvocation mi) throws Throwable {
	MethodInvocation oldInvocation = invocation.get();
	invocation.set(mi);
	try {
		return mi.proceed();
	}
	finally {
		invocation.set(oldInvocation);
	}
}

The code assumes it will "always" restore the original invocation value, however, it fails to do so in the case of StackOverflowError in the "finally" block.

There's a similar case in OpenJDK as well: https://github.com/openjdk/jdk/blob/3934127b087ade1c1286008df3497ca6d84778a5/src/jdk.dynalink/share/classes/jdk/dynalink/LinkerServicesImpl.java#L161-L172

I suggest adding @ReservedStackAccess to ThreadLocal#set, setInitialValue, remove so the code that uses finally { threadLocal.set(...) } doesn't fail with StackOverflowError

The case has been raised in https://mail.openjdk.org/pipermail/core-libs-dev/2023-September/112702.html , and I think adding @RSA is a reasonable improvement targeting a common case which is otherwise hard to diagnose.
Comments
>- size your stack properly That is not possible with recursive functions :-/ > unable to terminate That would be devastating indeed
10-11-2023

The recommendation to the application developers is always the same: - size your stack properly - when the JVM tells you that the stack is too small, make sure you don't miss it. Supporting ReservedStackAccess is hard, as you found out already in JDK-8318888; we don't want to add it anywhere where it's not absolutely necessary. We added it on locks because a missing unlock could leave the JVM in a state where it's unable to report the problem and unable to terminate. As you may see, it's much more severe than an invalid ThreadLocal.
10-11-2023

The problematic case was that `ExposeInvocationInterceptor` got SOE in production while trying to "revert ThreadLocal" to its "previous" value, so **subsequent** application calls executed in the problematic thread resulted in unexpected "401 Unauthorized" while the authorization was 100% active. The case was hard to diagnose as the problematic "401 Unauthorized" errors did not reproduce always, and the SOE was rotated from the logs. SOE in ThreadLocal is harmful as its effect lasts over time, it makes subsequent requests within the same thread fail, and it makes the errors hard to diagnose. In other words, I completely agree that a single application request might fail due to SOE, and it does not justify RSA alone. However, it is bad to have a corruption in ThreadLocal that would poison the subsequent requests executing within the same thread. Of course, migrating from ThreadLocal to ScopedValue would be great in the long run, however, there are a lot of existing applications and libraries that have been compiled long ago, and it would be challenging to replace all the ThreadLocal-as-scopedvalue usages with ScopedValue. That is why I believe adding RSA to ThreadLocal is justified. Of course, the applications should monitor errors, however, I am leaning towards that JVM could do better here to help users avoid "poisoned ThreadLocal" issues. {quote}StackOverflowError was supposed to be hard enough to miss, and errors are not supposed to be recoverable.{quote} What is the recommendation to the application developers then? If SOE is not supposed to be recoverable, then I am puzzled about the reasons for having a "reserved stack area". Does that mean all the applications should go for -XX:+UnlockDiagnosticVMOptions -XX:AbortVMOnException=java.lang.StackOverflowError to ensure the application does not go into an invalid state in case of SOE? Does it make sense to add a flag like CrashOnStackOverflowError similar to the currently existing CrashOnOutOfMemoryError?
09-11-2023

This was discussed on core-libs-dev but there wasn't agreement to add ReservedStackAccess to the ThreadLocal code. In passing, ExposeInvocationInterceptor could potentially migrate to ScopedValue in the future. SV has VM support for recovering from SO.
03-11-2023

What kind of problems with diagnosing this situation do you experience? StackOverflowError was supposed to be hard enough to miss, and errors are not supposed to be recoverable. Adding ReservedStackAccess here is not an option we want to explore.
03-11-2023