JDK-8294741 : poor C1 performance on linux-aarch64 with VM option -XX:TieredStopAtLevel=2,3
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 20
  • Priority: P5
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: aarch64
  • Submitted: 2022-10-03
  • Updated: 2022-12-19
Description
C1 performance on linux-aarch64 is poor.
It is observed with the VM option -XX:TieredStopAtLevel=2,3 for the following JVM TI test:
  serviceability/jvmti/vthread/SuspendResume1

This is a comment from Chris P. with performance measurements:
------
Here are the results of running 5 times on each of our 5 platforms with `-XX:TieredStopAtLevel=2`

macosx-aarch64-debug 22s
windows-x64-debug 23s
windows-x64-debug 23s
macosx-aarch64-debug 23s
macosx-aarch64-debug 24s
macosx-aarch64-debug 24s
windows-x64-debug 24s
macosx-x64-debug 24s
macosx-aarch64-debug 25s
macosx-x64-debug 26s
macosx-x64-debug 26s
macosx-x64-debug 26s
macosx-x64-debug 26s
windows-x64-debug 28s
windows-x64-debug 33s
linux-x64-debug 43s
linux-x64-debug 45s
linux-aarch64-debug 2m 17s
linux-x64-debug 2m 20s
linux-x64-debug 2m 42s
linux-x64-debug 2m 59s
linux-aarch64-debug 3m 22s
linux-aarch64-debug 4m 31s
linux-aarch64-debug 6m 30s
linux-aarch64-debug 9m 22s

The last one was a timeout. So it seems that linux-aarch64 is consistently slow, and linux-x64 is also somewhat slow in some cases.
It would be worth understanding these performance differences.
-----

Feel free to close this if you already have a bug filed on this problem.
Comments
Hi all~ As this issue is related to vthread/loom, I suppose the execution time/performance latency is related to the number of CPU cores (native OS threads) we use. Below are my test results on Linux-aarch64, macOS-aarch64, and Linux-x86. We see timeout failures on all three platforms when a small number of CPU cores is used. Is there anything I missed? Thanks.

1) Linux-aarch64-debug (with 160 CPU cores)
   8 cores: 12m24.715s (timeout)
   10 cores: 12m23.920s (timeout)
   16 cores: 0m53.686s
   20 cores: 0m53.327s
   32 cores: 0m54.068s
   64 cores: 0m48.149s
   100 cores: 0m54.355s

2) macOS-aarch64-debug (M1 with 8 CPU cores)
   9m7.042s (timeout)

3) Linux-x86-debug (with 64 CPU cores)
   8 cores: 11m27.765s (timeout)
   10 cores: 0m58.059s
   16 cores: 0m49.118s
   20 cores: 0m49.077s
   32 cores: 0m49.019s
16-12-2022
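The core-count dependence above can be probed without the full jtreg test. A minimal sketch (the class name is hypothetical): print the processor count the JVM sees, and constrain it externally with the real `-XX:ActiveProcessorCount=<n>` flag (or `taskset` on Linux) to reproduce the "N cores" configurations from the measurements.

```java
// CpuCount.java - report how many CPUs the JVM will use.
// The JVM honors -XX:ActiveProcessorCount=<n>, so for example
//   java -XX:ActiveProcessorCount=8 CpuCount
// mimics the 8-core configuration that timed out above.
public class CpuCount {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + cpus);
    }
}
```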

Since this issue happens with a vthread/loom test perhaps there's something related to Loom itself.
07-10-2022

macosx-aarch64 (M1) is also aarch64, and its time is similar to x64's. We do need to investigate what is happening on linux-aarch64 (Ampere CPU). We know about the bad performance of 2-socket Ampere systems when accessing cache from a different socket. But I hope the VM slices we have in OCI are on the same socket. There could still be issues with "ping-pong" cache accesses on the same socket but across different core groups. As Igor asked, we need to see `-XX:TieredStopAtLevel=1` results to make sure that profiling (concurrent memory and cache updates) is the cause, or some kind of C1 compilation quality issue.
04-10-2022

The problem is that linux-aarch64 is much slower than the other architectures.
04-10-2022

ILW = test runs slowly; sometimes; no workaround = LMH = P5
04-10-2022

It is, to some extent, supposed to be slower. Levels 2 and 3 do profiling and disable some optimizations (like range check predication) to make the profiling possible. How much faster is Level 1 on the same machines where the measurements were taken?
04-10-2022
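One way to quantify the Level 1 vs. Level 2 gap the comment asks about is a small standalone workload run twice with different flags. A rough sketch (class name and loop sizes are arbitrary; a debug build and the actual jtreg test would be needed for a faithful comparison):

```java
// TierProbe.java - a deterministic hot loop whose wall-clock time can be
// compared across compilation levels, for example:
//   java -XX:TieredStopAtLevel=1 TierProbe   (C1 without profiling)
//   java -XX:TieredStopAtLevel=2 TierProbe   (C1 with basic profiling)
public class TierProbe {
    static long work(int n) {
        long sum = 0;
        int[] a = new int[1024];
        for (int i = 0; i < n; i++) {
            a[i & 1023] = i;                 // stores keep the array hot
            sum += a[(i * 31) & 1023];       // loads keep range checks in play
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = work(50_000_000);
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum = " + checksum + ", elapsed = " + ms + " ms");
    }
}
```

The checksum is deterministic, so differing run times across flag settings reflect code quality rather than differing work.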