Bug ID: JDK-8318895 Deoptimization results in incorrect lightweight locking stack

JDK-8318895 : Deoptimization results in incorrect lightweight locking stack

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version:
21,22,repo-lilliput-17,repo-lilliput-21 21,22,repo-lilliput-17,repo-lilliput-21

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2023-10-26
Updated: 2024-07-10
Resolved: 2023-11-13

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 21	JDK 22
21.0.2Fixed	22 b24Fixed

Related Reports

Relates :	JDK-8227745 - Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents
Relates :	JDK-8291555 - Implement alternative fast-locking scheme
Relates :	JDK-8324881 - ObjectSynchronizer::inflate(Thread* current...) is invoked for non-current thread
Relates :	JDK-8320196 - assert(cur != anon_owner_ptr()) failed: no anon owner here with Graal
Relates :	JDK-8315880 - Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT

Description

While working on recursive-locking support for lightweight locking (-XX:LockingMode=2), we hit an assert that we have an inconsistent lock stack.

One of the problematic tests is the following:
```
class EARelockingArgEscapeLWLockedInCalleeFrame_2Target extends EATestCaseBaseTarget {
...
    public void dontinline_testMethod() {
        XYVal l1 = new XYVal(1, 1);       // ArgEscape
        XYVal l2 = new XYVal(4, 2);       // NoEscape, scalar replaced
        synchronized (l1) {                   // eliminated
            synchronized (l2) {               // eliminated
                l1.dontinline_sync_method(this);  // l1 escapes
            }
        }
        iResult = l2.x + l2.y;
    }
```

The test runs the above method enough time to for it to get C2 compiled. Then it attaches a debugger and adds a breakpoint deeper inside dontlinine_sync_method.

What we see in gdb is that `l1` is an existing object that has been pushed onto the deoptee thread's locking stack because of the dontinline_sync_method call. We then try to "replay" `synchronized (l1)` and `synchronized (l2)` in Deoptimization::relock_objects. This gives us the incorrect lock order [l1, l1, l2] while the actual code has the [l1, l2, l1] lock order. This then messes up the lock stack in the unlock path when we perform our recursive lock handling.

The lightweight code in openjdk/jdk works because the recursive lock gets inflated and removed from the lock stack. However, I can get the upstream lightweight locking to assert by removing the recursive locking in the test case:

```
class EARelockingArgEscapeLWLockedInCalleeFrame_2Target extends EATestCaseBaseTarget {
...
    public void dontinline_testMethod() {
        XYVal l1 = new XYVal(1, 1);
        XYVal l2 = new XYVal(4, 2);
        XYVal l3 = new XYVal(5, 3);
        synchronized (l1) {
            synchronized (l2) {
                l3.dontinline_sync_method(this);
            }
        }
        iResult = l2.x + l2.y;
    }
```

This asserts with the following:
```
#  Internal Error (/home/stefank/git/jdk/open/src/hotspot/share/runtime/lockStack.inline.hpp:86), pid=265437, tid=265440
#  assert(contains(o)) failed: entry must be present: 0x00000000ffa00130
...
V  [libjvm.so+0x1743709]  LockStack::remove(oop)+0x3e9  (lockStack.inline.hpp:86)
V  [libjvm.so+0x173f78d]  ObjectSynchronizer::exit(oop, BasicLock*, JavaThread*)+0x2fd  (synchronizer.cpp:590)
V  [libjvm.so+0xe7667c]  InterpreterRuntime::monitorexit(BasicObjectLock*)+0x18c  (interpreterRuntime.cpp:779)
j  EARelockingArgEscapeLWLockedInCalleeFrame_2Target.dontinline_testMethod()V+47
j  EATestCaseBaseTarget.run()V+64

Comments

A pull request was submitted for review. URL: https://git.openjdk.org/lilliput-jdk17u/pull/77 Date: 2024-04-09 07:17:10 +0000
09-04-2024
This fix seems incorrect - see JDK-8324881. Correction: I see now that this fix simply duplicates pre-existing "brokenness".
30-01-2024
A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u/pull/400 Date: 2023-11-24 11:50:40 +0000
24-11-2023
[jdk21u-fix-request] Approval Request from Roman Kennke This change fixes a critical bug in lightweight-locking. It has been in mainline for more than 2 weeks and has not accrued any bug-tail. The fix only affects lightweight locking code-paths. Lightweight locking has been introduced in jdk21, but is not enabled by default.
24-11-2023
Fix was pushed while main issue was targeted to '23'. Reset the main issue to fixed in '22' and copied the Robo Duke entry here.
13-11-2023
Dukebot added a comment - 2 days ago Changeset: ea1ffa34 Author: Roman Kennke <rkennke@openjdk.org> Date: 2023-11-10 15:28:27 +0000 URL: https://git.openjdk.org/jdk/commit/ea1ffa34192448317ce9a61a3588b0dee3a2ef44
13-11-2023
OK, that makes sense. Thanks.
09-11-2023
> ... Storing it into a static ... That store is done using JVMTI. Escape analysis ignores JVMTI. It does so for good reason because every object can escape through JVMTI. https://github.com/reinrich/jdk/commit/b72b3b3d7d1b5927811ae49e3ddea01d298dcb85 is a test case that would fail without eager relocking. L1722 does the store into the field (in that example it's not static). With https://github.com/reinrich/jdk/commit/f7a90c13e27e9fe38c892f069bd8d58484f59445 you can disable eager relocking with -XX:+UseNewCode. The following will provoke the failure: make test TEST="test/jdk/com/sun/jdi/EATests.java" TEST_VM_OPTS="-XX:+UseNewCode"
09-11-2023
I don't think that scenario is possible, but if it was possible, it could happen without JVMTI being involved. In order for the lock to be eliminated, the object can't escape. Storing it into a static so another thread can find it is the definition of escape.
09-11-2023
> > The eager relocking is needed because it has to be done before the object escapes the current thread through JVMTI. > > [~rrich] I can see why we need to allocate the eliminated locals so JVMTI to access them, but I don't see why we need to relock eagerly. Can you elaborate a little more on what could go wrong if we don't? E.g. the reference of an object could be stored into a static field where another application thread finds it. The second thread could use synchronized methods to change the state of the object. The first thread might observe an inconsistent state of the object because of the eliminated locking. I think I could add a test case for this to EATests.java
09-11-2023
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16568 Date: 2023-11-08 19:00:53 +0000
08-11-2023
> The eager relocking is needed because it has to be done before the object escapes the current thread through JVMTI. [~rrich] I can see why we need to allocate the eliminated locals so JVMTI to access them, but I don't see why we need to relock eagerly. Can you elaborate a little more on what could go wrong if we don't?
08-11-2023
I see, thanks. > If it 's the case that with `exec_mode == Unpack_none` there are no locked objects in any callee frames It's the other way round: just with `exec_mode == Unpack_none` there might be callee frames with locked objects and therefore it is necessary to inflate with LM_LIGHTWEIGHT. With the other modes the relocking is only ever done in the context of the top frame. It should not be necessary to inflate as far as I can tell.
08-11-2023
> I thought inflating would only be needed for `exec_mode == Unpack_none` though(?) This was purely a defensive implementation, I have very little knowledge of the deoptimization code. Inflating a locked object is always correct, and restores a correct lock stack. If it's the case that with `exec_mode == Unpack_none` there are no locked objects in any callee frames then then inflating should not be necessary as long as the re-locks are done in the correct order. The current code (comments) does not make this clear, as it only fixes eliminated recursive locks and mentions that they only exists with `exec_mode == Unpack_none`. It was not clear to me that there are no locked objects further up/down (the callee direction) the call stack.
08-11-2023
Hi [~rkennke] can you take over this fix? I have some note that there might be more to it (I wrote om.wait enter...)
08-11-2023
We've discussed this internally and came to the same conclusion that relocking with LM_LIGHTWEIGHT should inflate the locks. Seeing that it's already implemented delights me very much :) Thanks a lot! I thought inflating would only be needed for `exec_mode == Unpack_none` though(?)
08-11-2023
I believe the best that we can do is to inflate the rematerialized lock and remove all related entries from the lock-stack. I don't think this would be much of a performance concern, lightweight (or legacy) locking is only beneficial when locks are uncontended and churned (i.e. short-lived and created en-masse), and it helps mostly because it avoids the overhead of allocating the lock and reducing the load on deflation. But we are already deopting - that path is slow anyway and inflating the monitor there, once, surely doesn't hurt much. (Uncontended) locking monitors as such should be as fast as LW locks because we have fast-paths for that in C2, and it's really only a CAS. If you agree with that assessment, I would take this bug and provide a fix. Please let me know!
08-11-2023
In the current recursive ligthweight prototype we have been using https://github.com/openjdk/jdk/commit/2a3d50ccba46b005d3944d76597a797ad3a9a130 as a fix for this issue.
08-11-2023
> > That said, I think I don't understand enough about how we eliminate in the first place.and why we need to re-materialize. Is it the case that any locks in earlier frames cannot be held by any other thread, so the objects being locked are isolated in accessibility to the one thread? If that were the case then re-locking out of order would be harmless because the locking would effectively be a no-op anyway. > Yes, either the owner is an object local to the current thread (no other thread knows about it because it didn't escape), or it's a nested lock so the thread already owns it. The re-locking is still needed because the interpreter is going to try to unlock those monitors. What isn't needed, in my opinion, is doing the relocking eagerly with callees still on the stack. The eager relocking is needed because it has to be done before the object escapes the current thread through JVMTI. > Why is this not a problem with legacy locking? Legacy locking eliminates locking operations but not the locks in the frames. It might be possible to do the same with light weight locking: allocate all locks needed when entering a compiled method and pop them when leaving again.
06-11-2023
An Example Frame M1 (C2 compiled) elim. locking of o1 Frame M2 <- top of stack locks o1 - M1 allocates o1 - M1 calls M2 and passes o1 as parameter - C2 (or another jit) proves by escape analysis that o1 is local to the current thread but passed as argument to M2 (`ArgEscape`). C2 eliminates locking of o1 based on this. - M2 locks o1 (non recursively) Now JVMTI agent retrieves o1 from frame M2. This invalidates the assumption that o1 is local to the thread which allocated it. o1 is relocked before it is passed to the JVMTI agent. For relocking the stack is walked. At frame M1 it is found that `ArgEscape` parameters are passed. This is the condition to trigger relocking (`cvf->arg_escape()` at https://github.com/openjdk/jdk/blob/c146685ca9354ce4bf99c9b262119a4643df1e69/src/hotspot/share/runtime/escapeBarrier.cpp#L101-L102). But o1 is locked in M2 which is a callee frame. With legacy locking the lock in frame M2 is converted to a recursive legacy lock by setting the displaced header to null. Before the original displaced header is copied to the legacy lock in M1. The address of the legacy lock in M1 is installed in o1's mark word (see https://github.com/openjdk/jdk/blob/c146685ca9354ce4bf99c9b262119a4643df1e69/src/hotspot/share/runtime/deoptimization.cpp#L1621-L1627).
06-11-2023
> Do you mean that we can rematerialize Y with rematerializing frames preceding Y? That sounds completely broken. Yes, that is the way the code currently works, unless I read it wrong. See VM_GetOrSetLocal::doit_prologue(), where it causes eager deoptimization of a specific depth (frame). > If we've eliminated locks for X, Y and Z and we need to re-materialize them, the surely we need to re-materialze from the outer frames down to the one we need? Yes, that would make sense, if we truly cared about lock order at all times. But it appears lightweight locking is the only locking mode that really cares. >That said, I think I don't understand enough about how we eliminate in the first place.and why we need to re-materialize. Is it the case that any locks in earlier frames cannot be held by any other thread, so the objects being locked are isolated in accessibility to the one thread? If that were the case then re-locking out of order would be harmless because the locking would effectively be a no-op anyway. Yes, either the owner is an object local to the current thread (no other thread knows about it because it didn't escape), or it's a nested lock so the thread already owns it. The re-locking is still needed because the interpreter is going to try to unlock those monitors. What isn't needed, in my opinion, is doing the relocking eagerly with callees still on the stack. > Why is this not a problem with legacy locking? Legacy doesn't track lock order anywhere, other than BasicObjectLock records on the stack, and those records are in the correct order. It's only lightweight LockStack that gets out of sync. Or maybe there is a deadlock-detection feature that might detect this problem in legacy mode? I haven't looked to see howThreadMXBean.findMonitorDeadlockedThreads() gets its ordered list of monitors. > If the order of locks is legitimately not critical, should we disable the checking until we have a better idea? Right, it wasn't critical for legacy, or this problem would have been noticed sooner. It still might be considered a desirable santity check. I think it is worth investigating a solution where JVMTI only rematerializes the locals eagerly, and allows the relocking to be done lazily, but I feel like that could be considered an enhancement.
02-11-2023
Why is this not a problem with legacy locking? If the order of locks is legitimately not critical, should we disable the checking until we have a better idea?
02-11-2023
That said, I think I don't understand enough about how we eliminate in the first place.and why we need to re-materialize. Is it the case that any locks in earlier frames cannot be held by any other thread, so the objects being locked are isolated in accessibility to the one thread? If that were the case then re-locking out of order would be harmless because the locking would effectively be a no-op anyway.
02-11-2023
I think this is re-exposing the concerns I had about EA and lock elimination that I should have, but didn't, investigate further back in 2019: https://mail.openjdk.org/pipermail/hotspot-runtime-dev/2019-December/037583.html
02-11-2023
Do you mean that we can rematerialize Y with rematerializing frames preceding Y? That sounds completely broken. If we've eliminated locks for X, Y and Z and we need to re-materialize them, the surely we need to re-materialze from the outer frames down to the one we need?
02-11-2023
> If it knows the set of monitors it needs to re-lock, surely we can access that in the correct order ?? Yes, the C2 frames know which C2 locks go in which order, but how do we find locks for an arbitrary frame in LockStack, which has no frame boundaries? If X calls Y calls Z, and Y is the one being rematerialized by JVMTI, then Y needs to insert its locks after those of X but before those of Z.
02-11-2023
> Any C2 monitors that were eliminated are relocked eagerly at the time of the JVMTI call If it knows the set of monitors it needs to re-lock, surely we can access that in the correct order ??
01-11-2023
FWIW, we inflate these problematic locks in the branch that adds support for recursive lightweight locking.
01-11-2023
It looks like there are a couple different problems here. JDK-8227745 allows JVMTI calls to rematerialize compiled frame state at an arbitrary call depth. Any C2 monitors that were eliminated are relocked eagerly at the time of the JVMTI call, rather than lazily when control reaches that frame. This allows broken lock order, as seen in the lightweight locking LockStack, between compiled frames. Even if we fix the JVMTI problem, there is also a lock order problem when relocking monitors within compiled frames. However, compiled frames know the lock order, so in theory the locks could be inserted into the LockStack in the correct order if there was an API to do that. So we have both an inter- and intra-compiled-frame problem, and it is only visible to lightweight locking. I don't see any easy way to insert locks into the LockStack in the correct order for JVMTI. The entire LockStack would probably need to be cleared and recomputed. The easiest solution is probably to inflate the locks, so they are removed from the LockStack.
01-11-2023
ILW = HLM = P3
31-10-2023
[~dholmes] Yes, you are correct. My attempt to copy-n-paste failed me. I'll update the summary.
27-10-2023
[~stefank] is there a typo in the example - the second code fragment, purportedly without the recursive locking, appears identical to the first? So IIUC given: synchronized (l1) { // eliminated synchronized (l2) { // eliminated the deoptimizer reifies these elimiated actions in the wrong order.
27-10-2023
A reproducer can be found here: https://github.com/openjdk/jdk/compare/master...stefank:jdk:8318895_reproducer
26-10-2023