Bug ID: JDK-8318888 ReentrantLock results in a locked state after StackOverflowError

Other
tbdUnresolved

[~dholmes] I took some time to change HotSpot to not throw the delayed SOE but that in itself is not enough to address the situation, even with ample sprinkling of @RSA and @ForceInline or @DontInline. As there's interplay with inlining as well so I presume that there's significant, additional, work to arrive at a solution which would make the example code behave as the reporter expects. Someone'll have to definitely need to sit down to do a re-take on a possible solution, which is likely going to need an associated JEP, or at least a CSR. It's a well-scoped problem enough to be a possible intern project, of course with ample access to mentorship.
15-03-2024
[~vklang] There is definitely an issue here, but it is a very complex issue that may not have a solution. The deferred SOE is problematic for correctly written code. It may well take a JEP to resolve this, if for no reason than it will undo parts of an existing JEP. But I'm concerned that RSA has reached its limits and we cannot make lock/unlock completely SOE safe, particularly when virtual threads could be involved. That is not to say we can't make any improvements though, but it will take a very detailed analysis of the RL code and the RSA placement to try and ensure e.g. that SOE will only be thrown if the lock has not been acquired. Might be an interesting intern project?
15-03-2024
[~dholmes] I'm inclined to close this as "Not a bug" since it seems to behaving just as spec:ed. Changing the behavior of RSA in relation to SOE and evaluation of logic annotated with @ReservedStackAccess would likely be another JEP. Does this resonate with you?
14-03-2024
[~dholmes] I'd expect the reserved area to act like the yellow area unless there's an RSA already on the stack. But I see what you mean, you need to check the dispatch target before the call if it is RSA-annotated otherwise you wouldn't be able to call an RSA-annotated method if you're already at the end of the regular stack area.
21-11-2023
[~vklang] It is not an issue for lock() - no problem if that throws SOE as it is outside the try block. But yes unlock() is critical itself. Though now I would have to check the details of exactly how RSA works. Does it actually guard the call to the top-level RSA method or does it only come into play once there is an RSA method on the stack? If the latter then that is a problem.
21-11-2023
[~dholmes] Additionally, I think we should look into whether `lock()` and `unlock()` themselves should be RSA-annotated (instead of right now where the method they call internally is) -- as I think the problem might also surface when we're right before the reserved stack space and the call to `lock()`/`unlock()` itself would SOE.
21-11-2023
[~vsitnikov] I think you misread what I wrote but what you are describing is basically what the RSA mechanism does. It reserves a few pages which are assumed/expected to be sufficient for all RSA call chains. If stack bang detects overflow would occur and we are in a RSA call chain then we unguard the reserved pages so the call can proceed and succeed, then we reguard them. But in addition, and what seems completely wrong, we then throw the SOE anyway. But removing the delayed SOE is what I think needs to happen here. Though we also have to verify the current stack requirements of the lock/unlock call chains, especially in relation to virtual threads.
21-11-2023
> IF an SOE is thrown by lock() or unlock() then you'd have to (defensively) presume that the lock is in an inconsistent state. That is right. However, I always thought that the treatment there was to avoid or at least significantly reduce the chances of running into SOE in lock() and unlock(). I guess reserving some pages should be good enough. Of course, there might be cases like deoptimization (or a debugger) triggering in the middle of lock(), so it could request unanticipated extra stack space, however a few reserved pages should heal a great amount of SOE in Locks in production systems. I wonder if somebody could schedule a CI run that would "remove deferred SOE" and check if there are tests that would fail. If we are lucky, it could give us some use cases that rely on "deferred SOE".
20-11-2023
[~vsitnikov] Presuming that execution of lock() and unlock() are done successfully under RSA I think it would make sense to not delay an SOE at all. From that perspective, IF an SOE is thrown by lock() or unlock() then you'd have to (defensively) presume that the lock is in an inconsistent state.
20-11-2023
This is a bit of a tricky problem in general. As JEP 270 was specifically targeting java.util.concurrent, I think it needs to be viewed through that lens. Currently, if ´lock.lock()` throws a StackOverflowError the user still doesn't know whether the lock was successfully acquired or not, and how should they know whether to try to unlock it? Also, when attempting to unlock it, it might also throw a StackOverflowError, and how does the user know if it was successfully unlocked before the SOE was thrown? JEP-270: "If the protection of the reserved zone has been removed to allow a critical section to complete its execution, the protection must be restored and the delayed StackOverflowError thrown as soon as the thread exits the critical section." This above means that even if a lock is acquired in the @ReservedStackAccess scope, you get an SOE. And if the SOE happened in the invocation to `lock()` (which is not marked as @ReservedStackAccess) the lock was not acquired: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantLock.java#L321 In both these cases, what would correct code look like? Currently lock acquisition is likely not done within a try-block, so that means that if an SOE is thrown by `lock()` then it must not have acquired the lock, otherwise it would leave the program in inconsistent state with unmatched lock-unlock pairs. Would it be a (partial) solution for deferred SOEs to be thrown on the next invoke-instruction which is not within, or itself marked with, @ReservedStackAccess? Then: void foo() { lock.lock(); // Trips SOE here, but as it happens within @ReservedStackAccess it is not rethrown yet try { foo(); // Next invoke-instruction after tripping SOE, rethrow here } finally { lock.unlock(); // Marked as @ReserveStackAccess, so do not throw SOE after tripping here, as there's an SOE in flight already } }
20-11-2023
> I'm not sure what you are suggesting. Even if the method requires a small amount of stack, that amount of stack must be available If that worked then JEP 270 would not be needed in the first place. Imagine "stack size" is 1024 pages. Imagine the application calls a recursive method it exhausts all the stack pages leaving just 2 bytes free on the stack. If ReentrantLock was unlucky enough to start at that moment, it would fail with SOE even though ReentrantLock does not require much stack on its own. My suggestion "to prevent many SOE when JVM executes a method that is known to require a small amount of stack" could work as follows. Imagine the JVM leaves the last 4 pages as the reserved area while it keeps 1020 pages as the "regular stack area". Then it could eliminate SOE failures in "known to be safe" methods at the cost of failing "unknown methods" with slightly greater probability. Imagine JVM leaves 1020 for the application use and keeps the last 4 pages for "method that is known to require a small amount of stack". Then: 1) If the stack pointer is within 1...1020 page, no SOE is thrown 2) If the stack pointer is about to exceed 1020, and the call stack does not include a method that is "known to require a small amount of stack", then JVM throws SOE 3) If the stack pointer is about to exceed 1020, and the call stack includes a method marked as "known to require a small amount of stack", then JVM skips SOE 4) If the stack pointer is about to exceed 1024, then JVM throws SOE anyway
20-11-2023
> Would it be a (partial) solution for deferred SOEs to be thrown on the next invoke*-instruction which is not within, or itself marked with, @ReservedStackAccess? Since the reserved area allows "lock" and "unlock" to execute successfully, the stack does not overflow, so it does not sound like raising SOE is needed.
20-11-2023
After (re)reading the JEP: https://openjdk.org/jeps/270 It seems the delayed SOE was actually the key design point. And I'm now reminded off my own comments at the time: https://bugs.openjdk.org/browse/JDK-8046936?focusedId=13572634&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13572634 "We can set the reserved space such that all direct uses of ReentrantLock in core library code will not encounter StackOverflow. So this does help to make the core libraries more robust - which was the motivation for this. It can't help general user code that uses ReentrantLock. " But 9 years on I'm certainly wondering why we bother throwing the delayed SOE instead of just letting execution proceed.
19-11-2023
> it could automatically prevent many SOE when JVM executes a method that is known to require a small amount of stack I'm not sure what you are suggesting. Even if the method requires a small amount of stack, that amount of stack must be available. The RSA mechanism is designed to ensure that a certain amount of stack is available by unprotecting a few guard pages that would normally be protected. Hence reserving sufficient stack for a RSA call chain. Of course that call chain must fit within the reserved number of pages. The flaw with the current system is the delayed SOE. It assumes that if we had to defer the SOE for the RSA call chain, then we are really out of stack but we didn't want to corrupt the operation of the RSA call chain, and so we throw the SOE immediately afterwards. But that breaks the use to try/finally with locking. If we are truly out of stack then we should simply hit a regular SOE on the next call after the RSA call chain - which in correctly written locking code would be inside the try block.
19-11-2023
I suspect that Virtual Threads have had an impact on the code paths and stack depth even for regular threads.
19-11-2023
Since @RSA-annotated methods are guaranteed to fit within the reserved stack area, then throwing a delayed SOE does not provide much value. The JVM should ignore throwing SOE when it is within the reserved area and it is executing a method that was marked as "guaranteed to require a small amount of stack". The execution was "guaranteed by OpenJDK developer to succeed", so what is the reason to throw SOE? Removing redundant code might even increase cache locality and performance. It might be an interesting enhancement if there was a static analysis that collected a conservative estimation of the required stack usage during the build time. For instance, the current @RSA placement requires human review, and somebody else might unknowingly violate assumptions of the @RSA when they modify lock, unlock, or its implementation details. At the same time, if the build scripts could collect the call graph for `.unlock`, then it could deduce that OpenJDK's implementation of `.unlock` calls to small private methods only, and it could deduce that "unlock fits within 100 (or whatever) bytes on the stack". Of course, if the method calls something like Object#toString, then the estimator would assume it would require an infinite amount of stack since there might be overrides that can be recursive. Such "call graph analysis" might help to validate if @RSA-annotated method indeed fits within the small amount of stack, and if the analysis results are stored with the JVM build, then it could automatically prevent many SOE when JVM executes a method that is known to require a small amount of stack. For instance, I guess static analysis could automatically deduce that ThreadLocal.set can fit within the reserved area, so JVM could automatically skip generating SOE when running ThreadLocal.set, thus it would automatically prevent SOE like in JDK-8319090.
17-11-2023
Many years of thought - no satisfying solution - JDK-8028686
17-11-2023
Hmmm throwing a delayed StackOverflowError makes RSA effectively useless for locking as you note. I'd have to reach out to [~fparain] as the original designer of RSA for his thoughts. Can we determine that a delayed SOE is causing the current problem? Regardless I think we have to change things to ensure delayed SOE is not a thing - we either have to throw before we do anything for the RSA-annotated lock(), or else it must not throw. I suspect with RSA we focussed on not corrupting the lock state, but that doesn't help with not being able to write code that guarantees lock/unlock work as required.
17-11-2023
>Can we determine that a delayed SOE is causing the current problem? Yes. Exception in thread "main" java.lang.StackOverflowError: Delayed StackOverflowError due to ReservedStackAccess annotated method at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1059) at java.base/java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:494) at LockVsStackOverflow.doSmth(LockVsStackOverflow.java:30) Now, the exception currently manifests in the release-part, but, if I make sure to add enough @ReservedStackAccess in AQS.release and AQS.getState, then the problem starts manifesting in the lock acquisition: Exception in thread "main" java.lang.StackOverflowError: Delayed StackOverflowError due to ReservedStackAccess annotated method at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:152) at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
17-11-2023
After some further investigation, here are my findings: First, even if the correct methods are annotated with @ReservedStackAccess, the documentation for it states that "Even if access to this reserved area has been granted, the JVM might decide to throw a delayed StackOverflowError when the thread exits the annotated method." This means that acquiring the lock needs to be performed within the try-block, otherwise the unlocks won't be paired with the locks. Since I don't think people write code like that (since if we get a StackOverflowError during `lock()` then we don't know if the lock was locked or not), we need to either: A) make sure that invocations to lock() and unlock() always succeed (stack-wise) by allowing them to execute under reserved stack space but WITHOUT throwing a delayed StackOverflowError. B) make sure that if lock() throws an StackOverflowError that it never succeeded in incrementing the counter either by rollback, or by initial stack probing. Since this issue is in the intersection between API, JVM, and Concurrency -- I'd be interested in hearing more points of view on the issue.
16-11-2023
Weirdly enough I didn't see any improvement with "@ReservedStackAccess" on `unlock()`. Will have to investigate further.
14-11-2023
But this is not a hotspot issue, so moving to core-libs.
27-10-2023
I agree, it is not sufficient to have it on tryRelease, as the unlock() itself could throw before we get there.
27-10-2023
Looks like `unlock` needs `@ReservedStackAccess` as well? [~fparain], what do you think?
26-10-2023

Duplicate :	JDK-8028686 - ReentrantLock calls can hang from lack of stack space
Duplicate :	JDK-8319309 - Lock not released after stack overflow
Relates :	JDK-8319090 - StackOverflow from ThreadLocal.set might shadow the root cause
Relates :	JDK-8046936 - JEP 270: Reserved Stack Areas for Critical Sections
Relates :	JDK-8028686 - ReentrantLock calls can hang from lack of stack space