JDK-8263864 : ~17% regression for DaCapo h2 after JDK-8253064
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 16,17,18
  • Priority: P4
  • Status: Closed
  • Resolution: Duplicate
  • Submitted: 2021-03-19
  • Updated: 2022-02-01
  • Resolved: 2022-01-31
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 18 : Resolved
Related Reports
Duplicate :  
Duplicate :  
Duplicate :  
Relates :  
Relates :  
Description
All GCs (Serial, Parallel, G1, Shenandoah, Z) are affected, with a 15-20% regression depending on the GC.

```
java -Xms4g -Xmx4g -jar dacapo-9.12-MR1-bach.jar h2 -s huge -t 1 -n 1
```

Example of output using G1:

Before
```
===== DaCapo 9.12-MR1 h2 PASSED in 124958 msec =====
```

After
```
===== DaCapo 9.12-MR1 h2 PASSED in 151726 msec =====
```

So far the regression has been observed only on Intel, not on AMD.

More testing shows that this regression only occurs with a single thread. If I use more threads (e.g. -t 2), the regression seems to go away.
Comments
Only issues for which a changeset was pushed under that bug ID should be marked as fixed. Re-opening to close as a duplicate of JDK-8277180.
31-01-2022

The regression has been dealt with on x86_64 and AArch64 via JDK-8277180.
22-11-2021

The regression has now been fixed on Oracle supported platforms via JDK-8277180. I am just going to ping [~mdoerr] and [~shade] before closing this, in case there is interest in fixing this for other platforms, and then go ahead and close this bug.
22-11-2021

Another way of looking at the problem is that it is a bit odd that inflated locks are so much slower to lock and unlock that it causes a 17% regression when they are used more. In fact, it turns out the reason is that the C2 fast locking/unlocking intrinsics deal with recursive stack locks, but not with recursive ObjectMonitor locking. Sprinkling a few more instructions into that intrinsic to deal with the recursive ObjectMonitor case also completely removes the regression, by letting us stay in the C2 code instead of spilling everything to call into the VM, which appears to account for the overwhelming majority of the performance difference here. It seems reasonable that recursive locking should be fast both for stack locks and ObjectMonitors, if that buys us the freedom to base the deflation heuristics purely on memory overhead.
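As a hedged Java-level illustration of the pattern being described (the names below are invented, not taken from the benchmark): recursive acquisition of a lock the thread already owns is cheap while the lock is a stack lock, but once the monitor is inflated the pre-fix C2 intrinsic bailed out to the runtime for every nested acquisition.

```java
// Hypothetical illustration; "lock", "outer" and "inner" are invented names.
final Object lock = new Object();

void outer() {
    synchronized (lock) {   // first acquisition by this thread
        inner();
    }
}

void inner() {
    synchronized (lock) {   // recursive acquisition of the same lock:
        // handled in the C2 fast path for stack locks, but (before the fix)
        // a runtime call once the monitor has been inflated
    }
}
```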
09-11-2021

The default value for AvgMonitorsPerThreadEstimate is 1024, and MonitorUsedDeflationThreshold is 90. When making a decision about deflation, we hence check whether the number of used monitors exceeds 90% of 1024 * threads. Given that the number of used monitors in this workload ranges from 4 to 6, that will never come remotely close to happening. The deflation heuristics are now purely based on keeping the number of monitors down if they become too many; there is no notion of deflating anyway, even when there is no footprint need for it, just because we expect a performance win from doing so.

It is not clear to me that doing that makes sense in the general case, or whether it is just a quirk of running DaCapo h2 with -t 1 that it happens to benefit from it. In fact, in this benchmark the higher the frequency of deflation the better, because all it seems to do is take a lock over and over again, almost always from the same thread. But aggressively deflating in the general case does not necessarily result in a win. It all comes down to whether the cost of doing a thread-local handshake with all threads, plus walking all the monitors and deflating them, is cheaper than repeatedly taking an unnecessarily slow path for locking potentially almost-single-threaded locks that could otherwise have been optimized. There is no simple way of knowing that, and there are (too?) many factors involved, which depend on the scale of the application. In the general case, I am more concerned that deflating too aggressively when there is no need for it would have a negative impact on larger workloads with more threads involved and more standard locking practices (where the application does not spend an unreasonable amount of time taking the same lock over and over again).
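As a rough illustration of the arithmetic described above (a simplified sketch, not HotSpot's actual implementation; the names mirror the flags mentioned in this comment):

```java
// Simplified sketch of the deflation trigger described above; not HotSpot's
// actual code. The values are the defaults mentioned in this comment.
static boolean monitorUsageAboveThreshold(long inUseMonitors, long threadCount) {
    final long avgMonitorsPerThreadEstimate = 1024; // AvgMonitorsPerThreadEstimate
    final long monitorUsedDeflationThreshold = 90;  // MonitorUsedDeflationThreshold (%)
    long ceiling = avgMonitorsPerThreadEstimate * threadCount;
    // Deflation is only considered once in-use monitors exceed 90% of the ceiling.
    return inUseMonitors * 100 > ceiling * monitorUsedDeflationThreshold;
}
```

With two threads that gives a ceiling of 2048 monitors and a trigger point of roughly 1843 in-use monitors, so the 4-6 monitors in this workload never come anywhere near it.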
08-11-2021

I guess the master thread might notify the worker threads to perform work, on the same monitor that the workers use for synchronizing work stealing. So the hot loop of the worker threads would take the lock and pick the next task. If the time to execute a task is relatively trivial, then a large portion of the time will be spent taking the lock. And it only got inflated because of the notification to run the single worker.
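A hedged sketch of the kind of master/worker coordination being hypothesized here (class and method names are invented, not taken from DaCapo h2): the notify() forces the monitor to inflate, and the worker's hot loop then pays the inflated-lock cost for every task it picks up.

```java
import java.util.ArrayDeque;

// Hypothetical coordination pattern; names are invented for illustration.
class WorkQueue {
    private final ArrayDeque<Runnable> tasks = new ArrayDeque<>();

    // Master thread: hand over a task and wake the worker.
    // The wait()/notify() pair is what forces the monitor to inflate.
    synchronized void submit(Runnable task) {
        tasks.add(task);
        notify();
    }

    // Worker hot loop: takes the (now inflated) lock for every task.
    // If tasks are trivial, a large share of the time goes into locking/unlocking.
    synchronized Runnable take() throws InterruptedException {
        while (tasks.isEmpty()) {
            wait();
        }
        return tasks.poll();
    }
}
```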
08-11-2021

[~dholmes] When you run with -t 1, the benchmark counter-intuitively appears to be working with 2 threads. It seems like the -t flag says how many worker threads should be used, but there is also a master thread, and when you run perf you see that both have significant amounts of CPU time. So I suppose it must be those threads coordinating: either there is very mild contention between the two threads, or they communicate with wait/notify, which forces inflation.
08-11-2021

Just curious, but with `-t 1` how is inflation actually triggered in the first place? Does the benchmark do that on purpose?
02-11-2021

Wonder if some deflation will help this timeout. JDK-8273107
02-11-2021

I have now looked into why we regress. The answer is in the monitorinflation logs. Before the blamed patch, the heuristics would do some random deflation from time to time, even though there seemingly wasn't any need for it (very few monitors in the system - fewer than 5). After the patch, the new heuristic is a bit more relaxed and waits for there to be an actual need for deflation. In this workload it ends up not doing any deflation at all (while the benchmark is running), because there are so few monitors. It would appear that with -t 1 we are therefore missing out on deflation, which would re-enable stack locks. Changing the heuristics to do some sporadic deflation, whether it seems needed or not, removes the regression in this benchmark. Re-enabling stack locks seems important for this benchmark to perform well in the very mildly contended case.
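For reference, the monitorinflation logs referred to here can be captured by adding the unified logging tag to the original reproducer; something along these lines (the chosen log level is a suggestion):

```
java -Xms4g -Xmx4g -Xlog:monitorinflation=debug -jar dacapo-9.12-MR1-bach.jar h2 -s huge -t 1 -n 1
```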
02-11-2021

ILW=MLM=P4
01-06-2021

So yeah, we did run a whole bunch of benchmarks before integrating, including DaCapo h2. Results were typically on par, sometimes with a little boost from the new approach. However, it seems we never tried custom settings for the h2 benchmark to provoke the scenario where only a single thread is used.
27-05-2021

ILW = MMM = P3
23-03-2021

[~eosterlund] - I thought we did DaCapo h2 testing before we integrated JDK-8253064?
19-03-2021