Bug ID: JDK-8316921 Perf regressions up to 14% in b16 many benchmarks all platforms

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 21,22,repo-lilliput-17

Priority: P4
Status: Closed
Resolution: Duplicate

Submitted: 2023-09-25
Updated: 2024-03-20
Resolved: 2024-02-28

JDK 23: Resolved

Affected benchmarks include DaCapo fop, xalan, SPECjvm2008-XML.transform-G1 and various J2dBench tests.
I tracked the CI results build by build, and the regression is related to JDK-8315880.

Also duplicates JDK-8320934
28-02-2024
Comparing LM_LEGACY to LM_LIGHTWEIGHT in our performance tests after integrating recursive locking shows only a small regression for SPECjbb2005 on Intel-based machines (JDK-8320934). Promoted build JDK 23+11-809:
  linux-aarch64   -1.34%
  linux-x64       -3.21%
  macosx-aarch64  -1.55%
  windows-x64     -6.01%
Comparing SPECjbb2005 locally on Linux AMD didn't show a significant difference for me. Since we have a bug to cover the SPECjbb2005 case, I'm closing this as ? Delivered.
27-02-2024
Runtime Triage: likelihood is now low because LM_LIGHTWEIGHT is not the default for JDK 22. With this change, users have to explicitly turn this feature on, which lowers the likelihood of encountering this issue. Updating triage ILW: ILW = HLL = P4
30-01-2024
We are currently investigating all regressions to figure out if there's anything we can do to mitigate them. We have a couple of weeks left to do so, but if we can't resolve them in the near future we'll likely have to revert to the legacy locking mode for JDK 22 and try to resolve the issues in JDK 23 instead.
One reason for some of the regressions seems to be that -XX:+UseLSE performs poorly for failing CASes on macOS AArch64 (only tested on M1 Macs). Just changing the cmpxchg in the lock paths to use the -XX:-UseLSE version regains some of the lost performance. We also see that -XX:+UseLSE is causing regressions for legacy locking. j2dbench is severely affected by this.
Some benchmarks seem to regress because of the lack of recursive locking support in lightweight locking. We're investigating alternatives for that, but don't yet have a stable implementation. We're looking into whether the implementation is broken or whether it triggers a pre-existing bug.
SPECjbb2005 regresses by a few percent. That needs to be investigated.
It seems like C2 bails out when it sees recursive locking in our microbenchmarks:
https://github.com/openjdk/jdk/blob/9cf334fb6488188ea4236e5d156b11245bace88f/test/micro/org/openjdk/bench/vm/lang/LockUnlock.java#L84
By removing the local variable in this micro we get C2 to compile the method again.
19-10-2023
Yeah. See JDK-8316880
28-09-2023
While experimenting with different implementations of lightweight recursion we've found that the existing lock code uses the rscratch1 register as an input or output register for MacroAssembler::cmpxchg. This is problematic because rscratch1 is clobbered when we run with -XX:-UseLSE. This seems to at least break the C2 ObjectMonitor recursion check. It might have other consequences as well. We have started a patch that both adds some strict asserts to identify the potentially problematic places and works around the problematic paths identified so far:
https://github.com/openjdk/jdk/compare/master...stefank:jdk:aarch64_locking_registers
This is WIP but might be good for others to take a look at.
28-09-2023
ILW = HML = P2
26-09-2023
BTW, I've also got some observations that might be useful:
- IIRC, the SPECjvm xalan and transform benchmarks are also mostly running C1-compiled code. IIRC, they keep compiling new bytecode which is generated by XSLT, and it rarely reaches the point where C2 kicks in. (You might want to verify this, my memory may be wrong.)
- All my ideas for implementing recursive locks would have significantly impacted the non-recursive case, mostly because they need to inspect the top of the lock-stack (or even the whole lock-stack).
- It may be worth investigating how reliably C2 can eliminate recursive locks. If it can reliably eliminate many/most such locks, then maybe we can make it so that C2-generated code doesn't have to implement recursion (other than dealing with recursive locks that come from elsewhere) and thus doesn't affect the non-recursive fast path all that much.
26-09-2023
Ok, let me know when I can help with the recursive locking impl. BTW, I've found that it is a fairly narrow characteristic behaviour that leads to performance regressions. It is not recursive locking per se that is problematic. It has to be in conjunction with a high churn of lock objects. In other words, a performance impact only happens when the application:
- Allocates a lot of lock objects
- And uses them only a few times
- And does recursive locking (which StringBuffer and friends tend to do because all methods are synced and they call each other)
Longer-lived (IOW, well-behaved) lock objects don't expose the problem, because the first time we encounter a recursive lock, it would be inflated and subsequently we would use the very optimized OM paths. It is the high churn which leads to the OM inflation not being amortized and adding up to a performance problem. That's why StringBuffer is such a big problem: instances are usually very short-lived and they do a ton of recursive locking for no good reason. I could solve a lot of problems by replacing a single StringBuffer in XSLT libs with StringBuilder, and maybe we should also consider going down that route. I find it hard to think of situations where StringBuffer's synchronization is genuinely useful. Vector, Stack, etc. are often longer-lived, but not always. C2 is likely able to inline the cases where sync'ed methods call each other, and then eliminate the inner recursive locks. Benchmarks which are bimodal because JIT heuristics decide one way or the other suck, but I've no idea how to change that.
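To make the pattern concrete, here is a minimal sketch of the kind of code described above. SyncBuffer is a hypothetical stand-in for a StringBuffer-like class (it is not taken from the JDK or from any of the benchmarks); it only illustrates a fresh lock object that is locked a handful of times, recursively, and then thrown away, next to an unsynchronized StringBuilder equivalent.

```java
class LockChurnSketch {
    /** Hypothetical StringBuffer-like class: every method is synchronized. */
    static final class SyncBuffer {
        private final StringBuilder chars = new StringBuilder();

        synchronized SyncBuffer append(String s) {
            chars.append(s);
            return this;
        }

        // Calls other synchronized methods on 'this' while already holding
        // the monitor, so each inner call is a recursive lock.
        synchronized SyncBuffer appendLine(String s) {
            return append(s).append("\n");
        }

        synchronized String result() {
            return chars.toString();
        }
    }

    /** High churn: a new lock object per call, used only a few times. */
    static String withSyncBuffer(int i) {
        SyncBuffer sb = new SyncBuffer();
        return sb.appendLine("item " + i).result();
    }

    /** Same result without any monitor operations. */
    static String withStringBuilder(int i) {
        return new StringBuilder().append("item ").append(i).append("\n").toString();
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 100_000; i++) {
            total += withSyncBuffer(i).length();
        }
        System.out.println("total chars: " + total);
    }
}
```

Whether the JIT removes some or all of these locks depends on inlining and compilation tier, so this is only an illustration of the access pattern, not a benchmark.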
26-09-2023
FWIW, we have also observed that many of the regressing benchmarks are due to recursive locking, and therefore Axel is experimenting with various ways to implement recursive locking for lightweight locking. I've also made one interesting observation for DaCapo fop: there's a large variance in the score and it is almost bimodal. Sometimes the test starts to spew out a massive number of monitors because of recursive locking and sometimes it doesn't. It all seems to depend on whether some of the functions are compiled with C1 or C2. I can force the issue by running with -XX:+TieredCompilation and -XX:TieredStopAtLevel=3. I guess that C2 manages to remove / fuse locks. One question is why that doesn't always happen. It could be worth digging into.
26-09-2023
I suppose we could support recursive locking relatively easily and similar to how the old stack-locking does it:
- Upon locking, inspect the top-of-lockstack, and if equal to lock-obj, then simply push a NULL on top of the lock-stack.
- Upon unlocking, pop the top-of-lockstack, and if it is NULL, then don't unlock the obj.
This would implement 'adjacent' recursive locking, e.g. if we lock A-B-B, but not interleaved recursive locking, e.g. A-B-A. I think the latter is not very frequent, at least not in the scenarios where lightweight-locking would be beneficial (StringBuffer, Vector, etc). However, it would add more work to the common locking path and thus pessimize the non-recursive case.
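A rough Java model of the two steps above, purely as an illustration. The class and method names (LockStackSketch, lock, unlock, isLocked, depth) are made up for this sketch, the "locked" state of an object is simulated with an identity set instead of the mark word, and a false return simply stands for "take the slow path". It is not HotSpot code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class LockStackSketch {
    private final Object[] stack = new Object[8];   // per-thread lock-stack (capacity is arbitrary)
    private int top = 0;
    // Stand-in for "the object header says fast-locked by this thread".
    private final Set<Object> fastLocked =
            Collections.newSetFromMap(new IdentityHashMap<>());

    /** Returns true if handled on the fast path, false if a slow path would be needed. */
    boolean lock(Object obj) {
        if (top == stack.length) {
            return false;                           // lock-stack full
        }
        if (top > 0 && stack[top - 1] == obj) {
            stack[top++] = null;                    // adjacent recursion: push a NULL marker
            return true;
        }
        if (fastLocked.contains(obj)) {
            return false;                           // interleaved recursion (A-B-A): not covered
        }
        fastLocked.add(obj);                        // models the CAS on the object header
        stack[top++] = obj;
        return true;
    }

    /** Pops the top entry; only a non-NULL entry actually unlocks an object. */
    void unlock() {
        if (top == 0) {
            throw new IllegalStateException("lock-stack is empty");
        }
        Object popped = stack[--top];
        stack[top] = null;
        if (popped != null) {
            fastLocked.remove(popped);
        }
    }

    boolean isLocked(Object obj) { return fastLocked.contains(obj); }

    int depth() { return top; }

    public static void main(String[] args) {
        LockStackSketch s = new LockStackSketch();
        Object a = new Object();
        Object b = new Object();
        System.out.println(s.lock(a));  // true: first lock of A
        System.out.println(s.lock(b));  // true: first lock of B
        System.out.println(s.lock(b));  // true: adjacent recursion, NULL pushed
        System.out.println(s.lock(a));  // false: interleaved recursion on A
    }
}
```

Taken literally, the check compares only against the top entry, so in this model a third consecutive lock of B (A-B-B-B) finds the NULL marker on top and also ends up on the slow path.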
26-09-2023
It looks like those are the usual suspects: programs which do single-threaded uncontended locking by means of StringBuffer and maybe j.u.Vector and the like. A potential problem which may affect the performance of such code is the lack of support for recursive locking in the new lightweight locking protocol.
26-09-2023
Assigning to Axel because we think he's looking at these performance regressions.
25-09-2023

Duplicate :	JDK-8319796 - Recursive lightweight locking
Relates :	JDK-8320934 - SPECjbb2005 performance with LM_LIGHTWEIGHT and recursive locking changes
Relates :	JDK-8320322 - Should UseLSE be default for macosx?
Relates :	JDK-8315880 - Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT
Relates :	JDK-8316880 - AArch64: "stop: Header is not fast-locked" with -XX:-UseLSE since JDK-8315880
Relates :	JDK-8319251 - [REDO] Change LockingMode default from LM_LEGACY to LM_LIGHTWEIGHT
Relates :	JDK-8291555 - Implement alternative fast-locking scheme