JDK-8320934 : SPECjbb2005 performance with LM_LIGHTWEIGHT and recursive locking changes
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 22
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • Submitted: 2023-11-28
  • Updated: 2024-03-20
  • Resolved: 2024-03-20
Related Reports
Blocks :  
Relates :  
Relates :  
Description
With the recursive lightweight locking change in JDK-8319796, there is still a 3-6% regression in SPECjbb2005 on AMD (*not* on Intel) with LM_LIGHTWEIGHT. This needs to be resolved before returning the default to LM_LIGHTWEIGHT.
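For reference, the two implementations can be compared directly by pinning the locking mode on the command line. This is a sketch: the benchmark invocation is a placeholder, but the -XX:LockingMode values match the flags used in the benchmark runs later in this report (1 = LM_LEGACY, 2 = LM_LIGHTWEIGHT).

```shell
# Sketch of an A/B comparison of the two locking modes.
# The SPECjbb2005 harness invocation is a placeholder; the flags are real.
java -XX:LockingMode=1 -cp specjbb2005.jar spec.jbb.JBBmain   # baseline: legacy locking
java -XX:LockingMode=2 -cp specjbb2005.jar spec.jbb.JBBmain   # lightweight locking
```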

Charlie Hunt:
In short, Erik is right that there is a higher regression with LW locking as the hyper-thread of a core becomes active. There is also high variability in frontend stalls along with high variability in branches, and the lack of a large difference in branch-misses (less than 5%) may imply that the implementation differences between Legacy and LW locking on SPECjbb2005 lead to a different sequence, or priority, of if-conditions being executed, and that depending on the sequence or condition being checked, some branch(es) fall out into a shorter instruction path.

Roman:
I did some more experiments: I added a very simple recursive implementation on top of PR16606, only adding a few instructions to check the entries on the lock-stack. This caused a regression of ~10%. So I took this a little further and added the same (or similar) instructions on top of mainline, where they really do nothing, because mainline lacks recursive support. I also narrowed it down by selectively adding/removing some of those instructions, and it looks very much like it is the loading of the lock-stack in the LW-unlock path that is causing most of the AMD regressions. It is not the branches (I removed them) and not the instructions in the locking path (I removed those, too). My theory is that loading the lock-stack evicts something else from the L1 cache, which causes stalls somewhere else - this is also what the profiles look like.
I don’t know a way around that problem, though.

Stefan:
I remember seeing similar odd things. Just adding an unnecessary load in the unlock path caused disproportionate regressions.

Charlie:
Nice experiment by Roman. His observation, along with what you recalled, could offer an explanation for the high variability I noticed in both frontend stalls and backend stalls.
I am assuming that we wouldn't always be loading the "lock-stack stuff" on every recursive lock execution in SPECjbb2005. Are there some cases where we wouldn't be doing the "lock-stack stuff" and others where we would?
Comments
[~coleenp] I've run SPECjbb2005 again on the same machine that exhibited the problematic regression. Scores:
Legacy locking: 1753311
LW locking (with recursive): 1736471
That's a regression of ~1% and much better than what we observed initially.
20-03-2024

Our Oracle performance team has helped us do more performance analysis of SPECjbb2005. Thanks to [~huntch] and [~ecaspole]. They've found the following results with more fine-grained hardware counters and some experiments with an optimized version of jbb2005. Charlie's optimized version of jbb2005 removes some large sources of CPU cache misses and unnecessary String allocations, but keeps the unnecessary use of synchronized locks. This version shows a regression of about 3%-3.5%. The hardware counters are more realistic and show ~10% more instructions / a longer path length (52,217 vs 57,950) and ~5% lower CPU cycles per instruction (0.770 vs 0.810).

In his words: "Those two observations suggest two things to me. On this machine with this optimized jbb2005, lightweight locking has more instructions to execute. But, although there are more instructions, lightweight locking is a little more efficient in executing instructions per CPU cycle. But not efficient enough to overcome the additional number of instructions. All of the above said, my opinion remains the same about whether a 3.5% regression on this machine with the optimized jbb2005 is concerning. jbb2005 does not represent a very large population of Java apps. It overuses synchronization, its hottest method is a HashMap.get() on a HashMap with Integer keys ranging from 1 to 20,000 (it should have used an ArrayList instead), and it uses TreeMap where it should have used a Queue. IIRC, something like 80% of its object allocations are Strings and char[]. And, when we introduce smaller object headers we may likely gain that 3.5% back."

Charlie also ran more experiments where he removed some more unnecessary locks, improving both Legacy and Lightweight locking performance, so that Lightweight was left with a 1%-2% regression. gprofng confirmed that Lightweight has a slightly longer path length. His last experiment was an attempt to see what performance looked like at 40% CPU usage vs. 100% CPU usage, since 40% is more representative of what most production apps run, with maybe the exception of batch-processing apps that are compute bound. At 40% CPU usage, Lightweight locking is as good as, and actually slightly faster than, Legacy locking by about 0.5%. This result is corroborated by some results that Eric and I found, where the performance regression wasn't reproduced on smaller configurations: 14-core vs. 72-core hyper-threaded Intel machines. [~rkennke]'s recent results also support that this regression is higher only on some systems.

We are closing this issue as WNF. The benefits of moving to Lightweight locking as the default locking implementation are that it prepares us for two things: 1. moving to OM-World locking, which has a basis in the Lightweight locking implementation, and 2. eventually deprecating and removing the Legacy mode of locking, reducing the maintenance burden for a very complex part of the Java implementation, unless some unforeseen problems arise. In order to do this, we will accept the small regression on certain platforms, and will await and monitor customer feedback.
20-03-2024

We've done some extensive performance testing with Linux perf and found that the lightweight locking code executes 4% more instructions than the legacy code, but we measured no real increase in cache misses or stalls from these additional instructions. We are considering accepting the 3-4% performance regression that we see on some machines (large 72-processor hyper-threaded Intel machines). The performance regression is a lot smaller for SPECjbb2005 on other configurations, including aarch64. Still running one more sample on M1. [~rkennke] Can you rerun some performance testing on your end and let us know what you find?
14-03-2024

SPECjbb2005 has a call stack with MonitorEnter (contended monitor):
{code}
jdk.JavaMonitorEnter {
  startTime = 20:08:23.435 (2024-03-05)
  duration = 20.8 ms
  monitorClass = java.io.PrintStream (classLoader = bootstrap)
  previousOwner = "Thread-51" (javaThreadId = 118)
  address = 0x7F0234002460
  eventThread = "Thread-61" (javaThreadId = 128)
  stackTrace = [
    spec.jbb.DeliveryTransaction.display(PrintStream)
    spec.jbb.DeliveryHandler.handleDelivery(DeliveryTransaction)
    spec.jbb.DeliveryTransaction.process()
    spec.jbb.TransactionManager.runTxn(Transaction, long, long, double)
    spec.jbb.TransactionManager.goManual(int, TimerData)
    ...
  ]
}
{code}
Both DeliveryTransaction methods, display and process, are synchronized methods. The method in between doesn't appear to call display with a lock held, so this has recursive locks. JMC shows the java.io.PrintStream monitor blocked for 50s with 64 threads.
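The shape of that stack can be reduced to a small sketch (the class and method names below are hypothetical stand-ins for the SPECjbb sources): a synchronized method calls through an unlocked helper back into another synchronized method on the same object, so the second monitor enter is recursive.

```java
// Minimal sketch of the recursive-locking pattern in the stack trace above;
// names are illustrative, not the actual SPECjbb classes.
public class RecursiveLockDemo {
    static class Transaction {
        private final Handler handler = new Handler();

        // First monitor enter on `this` (like DeliveryTransaction.process()).
        synchronized void process() {
            handler.handle(this); // helper takes no lock of its own
        }

        // Second, recursive monitor enter on the same `this`
        // (like DeliveryTransaction.display(PrintStream)).
        synchronized void display() {
            // The current thread now holds the Transaction monitor twice.
            if (!Thread.holdsLock(this)) {
                throw new AssertionError("monitor should be held here");
            }
        }
    }

    static class Handler {
        void handle(Transaction txn) {
            txn.display(); // called while txn's monitor is still held
        }
    }

    public static void main(String[] args) {
        new Transaction().process();
        System.out.println("recursive monitor enter succeeded");
    }
}
```

With LM_LIGHTWEIGHT, the inner monitor enter in display() is exactly the case that needs the recursive lock-stack check discussed in the comments below.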
08-03-2024

On 2023-11-13 19:34, Kennke, Roman wrote:
> I am not totally sure if my measurements and conclusions are correct, but it currently looks as if re-instating the monitor-check here:
> https://github.com/xmas92/jdk/blob/39b421c1a09bf2728a79db8ce94a7230e2befec5/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1085C8-L1085C8
> helps a little bit.
> Could it be that specjbb2005 leads to a surprising amount of unlock-failures, and thus gets penalised by your changes so much, because the unlock path is done optimistically and needs to work extra to recover when the unlock fails?
> Does anybody know how to get perf to drill (annotate) into JITed code? I know that I did this before, but can't seem to figure out how. So far I'm running with the extra options -XX:+UnlockDiagnosticVMOptions -XX:+DumpPerfMapAtExit -XX:+PreserveFramePointer and this helps to see the Java methods in the call-graph, but it can't drill into the generated assembly code...

This is what I've written down:
{code}
perf record -k 1 <...>/java -agentpath:<...>/libperf-jvmti.so
perf inject -i perf.data --jit -o perf.data.jitted
{code}
StefanK
07-03-2024

[~rkennke] I'm going to assign myself to this issue. Let me know if you've had further thoughts and experiments.
28-02-2024

Running SPECjbb2005 locally with our new tool that has the compare function, it is not reporting a significant performance difference on AMD Linux. I only ran 4 samples though.
{code}
Checking for updates
Baseline: SPECjbb2005.2024-02-26-21_36_31 -b SPECjbb2005 -j $clean/build/linux-x64/images/jdk -i 4 --jvm-flags="-XX:LockingMode=1"
Test:     SPECjbb2005.2024-02-26-22_07_22 -b SPECjbb2005 -j $clean/build/linux-x64/images/jdk -i 4 --jvm-flags="-XX:LockingMode=2"

Name                Cnt         Base       Error         Test       Error   Unit  Change
SPECjbb2005           4  1787482.439 ± 152565.046  1764071.075 ± 184850.641  hbOps  0.99x (p = 0.255)
  :first                   74701.589 ±  12036.975    70320.315 ±   4007.226  hbOps  0.94x (p = 0.014)
  :interval_average        18427.750 ±   1572.149    18186.250 ±   1904.723  hbOps  0.99x (p = 0.255)
  :last                  1787482.439 ± 152565.046  1764071.075 ± 184850.641  hbOps  0.99x (p = 0.255)
  :last_warehouse             96.000 ±      0.000       96.000 ±      0.000  lbOps  1.00x (p = 0.000)
  :overall_average       1562019.936 ± 138837.841  1526548.968 ± 159434.789  hbOps  0.98x (p = 0.074)
  :peak                  1876894.128 ± 198692.803  1849491.975 ± 173923.416  hbOps  0.99x (p = 0.229)
  :peak_warehouse             54.000 ±     44.770       48.000 ±      0.000  lbOps  0.89x (p = 0.182)
  :rw.runtime                456.000 ±      0.000      456.250 ±      3.231  lbOps  1.00x (p = 0.391)
* = significant
{code}
Running with our performance testing framework, I see up to a 3.21% (marked significant) regression on Linux Intel x64. This is running "Tuned" with server -XX:+UseParallelGC -Xmx30g -Xms30g -Xmn26g -XX:+UseNUMA -XX:-PrintWarnings -XX:+UseLargePages -XX:-UseAdaptiveSizePolicy -XX:-UseAdaptiveNUMAChunkSizing.
28-02-2024

Assigning to [~rkennke] - Can you post your experiments here and data so we can work together on this? Is the regression for just AMD or also Intel? Thank you!
04-12-2023

> I am assuming that we wouldn't always be loading the "lock-stack stuff" on every recursive lock execution in SPECjbb2005. Are there some cases where we wouldn't be doing the "lock-stack stuff" and others where we would?

Without recursive locking, we don't need to load the top of the lock-stack at all. We only need to decrement the top-of-stack offset, which is harmless. With recursive LW locking we need to load and check the top of the lock-stack and inspect the top-most entries to decide whether or not we are unlocking a recursive lock.
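That difference can be modeled with a minimal sketch (illustrative only; the real implementation lives in HotSpot's per-thread LockStack and the generated fast paths, not in Java): without recursion support, unlock only decrements the top-of-stack offset, while recursive unlock must additionally load the top entry and compare it against the object being unlocked.

```java
// Illustrative model of a per-thread lock stack; not HotSpot code.
public class LockStackModel {
    private final Object[] entries = new Object[8];
    private int top = 0; // top-of-stack offset

    void lock(Object o) {
        entries[top++] = o; // push; a recursive lock pushes the same object again
    }

    // Non-recursive scheme: no load of the lock-stack needed on unlock,
    // just decrement the offset (the cheap exit described above).
    void unlockNonRecursive() {
        top--;
    }

    // Recursive scheme: must load and compare the top-most entry to decide
    // whether the lock being released is still held recursively.
    boolean unlockRecursive(Object o) {
        top--;                                        // pop the current entry
        boolean stillHeld = top > 0 && entries[top - 1] == o; // extra load + compare
        entries[top] = null;
        return stillHeld;                             // true => monitor stays owned
    }
}
```

The extra load in unlockRecursive stands in for the lock-stack access that Roman's experiments identified as the likely source of the L1-cache pressure on the affected AMD systems.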
29-11-2023