JDK-8276851 : [AArch64] Improve fast_unlock C2 intrinsic
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 18,19
  • Priority: P3
  • Status: Closed
  • Resolution: Not an Issue
  • CPU: aarch64
  • Submitted: 2021-11-09
  • Updated: 2024-10-02
  • Resolved: 2024-10-02
Description
The JVM's locking protocol includes a successor protocol, which requires the unlocker to check whether a successor has been racingly set right after storing NULL to the ObjectMonitor owner field. Failing to do so can result in nobody waking up the successor until the next lock operation on the same ObjectMonitor. The result is not quite an indefinite hang: the runtime code ensures that at least one "responsible" thread blocks for a limited amount of time with exponential backoff. So the end result of not honoring the successor protocol can be a long stall, but not an indefinite one.

We do not perform this check at all in the AArch64 fast_unlock C2 intrinsic, so we are susceptible to this kind of stall on that platform. Implementing the check will require an ldar on succ.
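
The check in question is a classic Dekker duality: the owner field must be stored before the successor field is (re)loaded, with a full fence in between. A minimal sketch using std::atomic, with illustrative field names rather than HotSpot's actual ObjectMonitor layout:

  #include <atomic>

  std::atomic<void*> owner{nullptr};  // stand-in for ObjectMonitor::_owner
  std::atomic<void*> succ{nullptr};   // stand-in for ObjectMonitor::_succ

  // Returns true if the unlocker must fall into the slow path so that
  // somebody wakes the waiters; false if a successor is visible and will
  // take over on its own.
  bool unlock_needs_slow_path() {
    // ST owner = NULL; MEMBAR; LD succ -- the Dekker pivot point.
    owner.store(nullptr, std::memory_order_seq_cst);
    // A seq_cst load compiles to ldar on AArch64, which is why the
    // description says the check requires ldar on succ.
    return succ.load(std::memory_order_seq_cst) == nullptr;
  }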
Comments
This was fixed by JDK-8320318, so it's no longer an issue.
02-10-2024

Back to you Fredrik.
31-07-2024

This is related to a lot of runtime code that we're working on. Can I reassign this to the runtime subcomponent?
06-03-2024

I added a 'sync' keyword used by the runtime team so I can find this with the rest of these issues.
17-11-2023

So the X64 code that we're talking about is here: src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp:

  #else // _LP64
  // It's inflated
  Label CheckSucc, LNotRecursive, LSuccess, LGoSlowPath;

  cmpptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)), 0);
  jccb(Assembler::equal, LNotRecursive);

  // Recursive inflated unlock
  decq(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)));
  jmpb(LSuccess);

  bind(LNotRecursive);
  movptr(boxReg, Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(cxq)));
  orptr(boxReg, Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(EntryList)));
  jccb  (Assembler::notZero, CheckSucc);
  // Without cast to int32_t this style of movptr will destroy r10 which is typically obj.
  movptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), NULL_WORD);
  jmpb  (DONE_LABEL);

  // Try to avoid passing control into the slow_path ...
  bind  (CheckSucc);

  // The following optional optimization can be elided if necessary
  // Effectively: if (succ == null) goto slow path
  // The code reduces the window for a race, however,
  // and thus benefits performance.
  cmpptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(succ)), NULL_WORD);
  jccb  (Assembler::zero, LGoSlowPath);

  xorptr(boxReg, boxReg);
  // Without cast to int32_t this style of movptr will destroy r10 which is typically obj.
  movptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), NULL_WORD);

  // Memory barrier/fence
  // Dekker pivot point -- fulcrum : ST Owner; MEMBAR; LD Succ
  // Instead of MFENCE we use a dummy locked add of 0 to the top-of-stack.
  // This is faster on Nehalem and AMD Shanghai/Barcelona.
  // See https://blogs.oracle.com/dave/entry/instruction_selection_for_volatile_fences
  // We might also restructure (ST Owner=0;barrier;LD _Succ) to
  // (mov box,0; xchgq box, &m->Owner; LD _succ) .
  lock(); addl(Address(rsp, 0), 0);

  cmpptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(succ)), NULL_WORD);
  jccb  (Assembler::notZero, LSuccess);

  // Rare inopportune interleaving - race.
  // The successor vanished in the small window above.
  // The lock is contended -- (cxq|EntryList) != null -- and there's no apparent successor.
  // We need to ensure progress and succession.
  // Try to reacquire the lock.
  // If that fails then the new owner is responsible for succession and this
  // thread needs to take no further action and can exit via the fast path (success).
  // If the re-acquire succeeds then pass control into the slow path.
  // As implemented, this latter mode is horrible because we generated more
  // coherence traffic on the lock *and* artificially extended the critical section
  // length while by virtue of passing control into the slow path.
  // box is really RAX -- the following CMPXCHG depends on that binding
  // cmpxchg R,[M] is equivalent to rax = CAS(M,rax,R)
  lock();
  cmpxchgptr(r15_thread, Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)));
  // There's no successor so we tried to regrab the lock.
  // If that didn't work, then another thread grabbed the
  // lock so we're done (and exit was a success).
  jccb  (Assembler::notEqual, LSuccess);
  // Intentional fall-through into slow path

  bind  (LGoSlowPath);
  orl   (boxReg, 1);   // set ICC.ZF=0 to indicate failure
  jmpb  (DONE_LABEL);

  bind  (LSuccess);
  testl (boxReg, 0);   // set ICC.ZF=1 to indicate success
  jmpb  (DONE_LABEL);
  #endif

The end of fast_unlock() in c2_MacroAssembler_aarch64.cpp does not have all this X64 logic that's checking the successor field. (Both have logic to check the queues.)
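
For readers not fluent in x86 assembler, here is a rough C++ rendering of that control flow, with illustrative types and names (not HotSpot code; recursion handling elided):

  #include <atomic>

  struct Monitor {
    std::atomic<void*> owner;  // _owner: current owner thread, or nullptr
    std::atomic<void*> succ;   // _succ: heir presumptive, set by contenders
    std::atomic<void*> queue;  // stand-in for cxq | EntryList
  };

  // Returns true when the fast path fully handled the unlock, false when
  // the VM slow path must run to wake a waiter.
  bool fast_unlock(Monitor* m, void* self) {
    if (m->queue.load(std::memory_order_relaxed) == nullptr) {
      m->owner.store(nullptr, std::memory_order_release);
      return true;                    // no waiters: release and be done
    }
    if (m->succ.load(std::memory_order_relaxed) == nullptr) {
      return false;                   // waiters but no heir: slow path
    }
    // ST owner = NULL; MEMBAR; LD succ -- the Dekker pivot point.
    m->owner.store(nullptr, std::memory_order_seq_cst);
    if (m->succ.load(std::memory_order_seq_cst) != nullptr) {
      return true;                    // heir still there: it will wake up
    }
    // The heir vanished in the small window: try to regrab the lock.
    void* expected = nullptr;
    if (!m->owner.compare_exchange_strong(expected, self)) {
      return true;                    // new owner is responsible for succession
    }
    return false;                     // we re-own the lock: slow path wakes a waiter
  }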
One thing that stuck out when I just did a read-through were these two sections:

  // Without cast to int32_t this style of movptr will destroy r10 which is typically obj.
  movptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), NULL_WORD);

and:

  // Without cast to int32_t this style of movptr will destroy r10 which is typically obj.
  movptr(Address(tmpReg, OM_OFFSET_NO_MONITOR_VALUE_TAG(owner)), NULL_WORD);

I suspect the casts were removed as part of the NULL -> NULL_WORD work, so now I'm wondering if we still destroy r10 (and thus obj)...
17-10-2023

I agree with the assessment. Should be an RFE to better support the successor protocol to reduce bounded but possibly long spurious stalls. Or something like that.
09-03-2022

It looks to me like x86 (the #ifndef _LP64 section in c2_MacroAssembler_x86.cpp) is doing the same as AArch64. [~eosterlund], based on Patric's evaluation, do you still consider this a bug, or should it be an RFE with a less dangerous-sounding synopsis?
09-03-2022

After some thought and discussion with Erik (Erik also consulting D. Dice), it is apparent that there might be additional considerations to be made. While we could change the AArch64 code to double-check the successor (_succ), it does not seem to be necessary (*). There are two (optimising) mechanisms at play here: the successor protocol and the responsible (_Responsible) protocol. The responsible protocol will ensure progress even in cases where the successor protocol fails to handle races. Both of these protocols attempt to improve on situations where spin-locks have failed, trying to "optimise" a situation where congestion is already a fact. Fast unlocking _could_ be as simple as checking the lock queue and, if no queue is present, releasing the lock (fast path) in JIT code, otherwise handling the queue and wakeup (slow path) through VM code.

Q1. Would the simplified approach (above) be too slow a path in the congested scenario? Do we know?
Q2. Should we try to seal the races in the successor protocol, thus making the responsible protocol superfluous?
Q3. Should we rely on the responsible protocol alone, adjusting the back-off to a regular exponential approach, thus allowing us to remove the successor protocol altogether?

(*) The same seems to be the case for PPC and S390.
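
For concreteness, a minimal sketch of the simplified approach in Q1, with illustrative names (not HotSpot code). Storing NULL to owner before loading the queue, across a full fence, is what keeps the sketch race-safe; the enqueue path must correspondingly re-check the owner after enqueueing, as the real protocol already does:

  #include <atomic>

  struct Monitor {
    std::atomic<void*> owner;
    std::atomic<void*> queue;  // stand-in for cxq | EntryList
  };

  // true  -> released on the fast path (no waiters)
  // false -> caller must enter the VM slow path for queue handling/wakeup
  bool simple_fast_unlock(Monitor* m, void* self) {
    // ST owner = NULL; MEMBAR; LD queue -- same Dekker ordering as before,
    // but on the queue instead of a separate successor field.
    m->owner.store(nullptr, std::memory_order_seq_cst);
    if (m->queue.load(std::memory_order_seq_cst) == nullptr) {
      return true;   // nobody waiting: done
    }
    // Waiters exist: regrab if possible and let the slow path wake one.
    void* expected = nullptr;
    if (!m->owner.compare_exchange_strong(expected, self)) {
      return true;   // another thread owns the lock now; it handles succession
    }
    return false;    // we re-own the lock: slow path handles queue and wakeup
  }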
11-01-2022

No reproducer and no "sane" way to test the change; risk too high for JDK 18 (RDP1). Moving to JDK 19.
16-12-2021

Now, where's that reproducer... ;-)
15-11-2021

Sorry Patric! ;-)
12-11-2021

ILW = Possible stall of application, never observed yet, possibly disable fast_unlock intrinsic = HLM = P3
09-11-2021