JDK-8270323 : Regression > 3% in Perfstartup-Noop-G1 in 18-b4
Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 18, 19
Priority: P4
Status: Open
Resolution: Unresolved
CPU: x86_64
Submitted: 2021-07-12
Updated: 2022-03-01
Seems to be related to JDK-8268276.
Perfstartup-Noop-G1 -3.13%
We discussed this with Sandhya; I will assign it to her for the time being.
I will add some more profiling info this week.
Comments
Thanks for the updates! I will defer it to 19 for now as [~kvn] is currently on vacation. If you still want to fix this/JDK-8231349 in 18 before the fork, please move it to 18 again.
19-11-2021
Assigning to [~kvn] since he's assigned to the enhancement that'll likely fix this, JDK-8231349
I'll bring it up with the team, but I think we're ok with deferring this to 19 at this point.
19-11-2021
Hi Christian, [~redestad] was going to look at it so I have assigned the bug to him.
16-11-2021
Hi [~sviswanathan], as the fork is coming up soon in early December, are you planning to get this fixed in 18? Otherwise, it needs to be deferred to 19 once RDP 1 starts (because it's a P4).
16-11-2021
I filed a low-priority RFE (JDK-8231349) to make stub generation lazy a couple of years back. This would reduce cycles burned on bootstrap, but at the time it would have had little effect on "real" startup timings. With a real, observable regression, implementing that RFE is now a much higher priority. I'll see if I can wrap my head around it.
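A minimal C++ sketch (not HotSpot code and not the actual JDK-8231349 change) of the "generate on first use" idea: the published entry point starts out as a resolver that generates the real stub on the first call and then patches itself out of the way. The names base64_entry, lazy_resolver and optimized_stub are made up for illustration; in the real VM the resolver step would run the StubGenerator and publish the generated entry address.

#include <atomic>
#include <cstdio>

// Stand-in for an expensive-to-generate stub (e.g. an AVX-512 Base64 encoder).
static void optimized_stub() {
  std::puts("optimized stub running");
}

using stub_entry_t = void (*)();
static std::atomic<stub_entry_t> base64_entry;  // address callers jump through

// First call lands here: "generate" the stub, publish it, then run it.
static void lazy_resolver() {
  // In a real VM this is where the stub generation work would happen.
  base64_entry.store(&optimized_stub, std::memory_order_release);
  optimized_stub();
}

int main() {
  base64_entry.store(&lazy_resolver, std::memory_order_relaxed);  // cheap at startup
  base64_entry.load(std::memory_order_acquire)();  // first call pays the generation cost
  base64_entry.load(std::memory_order_acquire)();  // later calls go straight to the stub
  return 0;
}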
11-08-2021
This is a good candidate
09-08-2021
Yes, lazy stub generation is a better idea!
14-07-2021
[~ecaspole] Thanks a lot for the detailed analysis. A waiver would be very helpful for now, until we are able to implement a solution in discussion with Vladimir Kozlov.
14-07-2021
[~kvn] Or lazy stub generation for some of the large stubs, if such a thing is possible: generate the optimized stub on first use. We could take a look when someone is available; I will reach out to you separately.
14-07-2021
:(
I would like to ask the Performance group to "waive" this startup regression.
Most likely all of the recent new AVX512 stub implementations have the same effect, and so do the SSE/AVX implementations for the math functions.
Maybe we should consider AOTing this and other stub code during the JDK build by running the VM in a special mode to produce an object file for the stubs and statically linking it. But the build machine may not have AVX512 - that is the main issue. So it would be some kind of "cross compilation": generating AVX512 instructions on a machine which does not have them. We did an experiment several years ago to pre-generate template interpreter code for iOS, so it is doable.
[~sviswanathan] Can your group consider working on such a project? We can discuss details separately.
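A small C++ sketch (assuming Linux/x86_64 and POSIX mmap; not the proposed build mode itself) of why pre-generating code for a feature the current machine lacks is doable: the instruction bytes are produced as plain data - at build time they could just as well be written into an object file and statically linked - and only the machine that finally maps them executable needs to support them.

#include <sys/mman.h>
#include <cstring>
#include <cstdio>

int main() {
  // Hand-encoded x86_64 for: mov eax, 42 ; ret
  // Producing these bytes is pure data generation - the "build machine"
  // never executes them, so it would not need AVX-512 to emit AVX-512
  // stubs either.
  const unsigned char code[] = {0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3};

  void* buf = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) { std::perror("mmap"); return 1; }
  std::memcpy(buf, code, sizeof(code));

  // "Target machine" step: map the pre-generated bytes executable and call them.
  if (mprotect(buf, sizeof(code), PROT_READ | PROT_EXEC) != 0) {
    std::perror("mprotect"); return 1;
  }
  int (*stub)() = reinterpret_cast<int (*)()>(buf);
  std::printf("pre-generated stub returned %d\n", stub());
  return 0;
}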
14-07-2021
I think the explanation is that this is a pretty huge intrinsic, and even though it is not used in running this "hello, world" type app, the time to emit the intrinsic is noticeable. I pasted some perf record lines from both builds at the bottom. Looking at the -XX:+PrintStubCode output, there are about 800 lines for this change, about 3% of the total stub output.
For the previous build 82:
Performance counter stats for './doit.sh' (200 runs):
28.61 msec task-clock # 1.058 CPUs utilized ( +- 0.12% )
92 context-switches # 0.003 M/sec ( +- 0.29% )
1 cpu-migrations # 0.037 K/sec ( +- 4.41% )
3,898 page-faults # 0.136 M/sec ( +- 0.03% )
90,902,667 cycles # 3.177 GHz ( +- 0.11% )
88,352,776 instructions # 0.97 insn per cycle ( +- 0.01% )
16,804,504 branches # 587.304 M/sec ( +- 0.01% )
397,130 branch-misses # 2.36% of all branches ( +- 0.09% )
0.0270419 +- 0.0000936 seconds time elapsed ( +- 0.35% )
For this build with change 83:
Performance counter stats for './doit.sh' (200 runs):
29.40 msec task-clock # 1.058 CPUs utilized ( +- 0.13% )
91 context-switches # 0.003 M/sec ( +- 0.30% )
1 cpu-migrations # 0.041 K/sec ( +- 5.26% )
3,897 page-faults # 0.133 M/sec ( +- 0.03% )
93,477,816 cycles # 3.180 GHz ( +- 0.10% )
88,580,795 instructions # 0.95 insn per cycle ( +- 0.02% )
16,848,675 branches # 573.106 M/sec ( +- 0.01% )
399,708 branch-misses # 2.37% of all branches ( +- 0.10% )
0.0277898 +- 0.0000964 seconds time elapsed ( +- 0.35% )
build 82:
+ 7.34% 0.00% java libjvm.so [.] StubGenerator_generate ▒
...
+ 6.78% 0.00% java libjvm.so [.] StubRoutines::initialize2 ▒
+ 6.78% 0.00% java libjvm.so [.] StubGenerator::generate_all ▒
build 83:
+ 8.60% 0.00% java libjvm.so [.] StubGenerator_generate ▒
...
+ 8.06% 0.00% java libjvm.so [.] StubRoutines::initialize2 ▒
+ 8.06% 0.00% java libjvm.so [.] StubGenerator::generate_all ▒
When I ran with -XX:-UseBASE64Intrinsics, the times were equal.
14-07-2021
ILW = MLH = P4
12-07-2021
[~ecaspole] If you could also share the steps to reproduce, that would be very helpful.