JDK-8317806 : Aarch64 10x+ slower at clearing IC callsites than x64 causes long code cache unloading times
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 20,21,22
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • CPU: aarch64
  • Submitted: 2023-10-10
  • Updated: 2024-07-04
Other: tbd (Unresolved)
Description
In a code unloading stress benchmark, unloading nmethods takes much longer on aarch64 than on x86: about 6000 ms of CPU time versus a few hundred ms for 90k nmethods.

The cause seems to be hardware instruction cache flushing (ICache::invalidate_range). In a test where this flush is skipped (an incorrect change, made only to demonstrate the issue), the time spent is comparable to x86.

Code unloading patches all CompiledICs in that phase and issues an instruction cache flush for every single patched location, which is apparently slow.

In such situations this can significantly lengthen the remark pause in G1, as well as full GC pauses in other STW collectors.

Some options:
 * For the STW GCs, one could forgo the per-location instruction cache flushing and instead flush (conservatively) the whole code cache once. That might be faster than flushing thousands of very small locations separately.
This only works with STW collectors.

 * There is no need to patch code that is never executed again. If it were possible to identify the locations that actually need patching and only do the icache flush for those, the number of icache flushes could likely be reduced significantly.

This would also improve the situation when unloading code concurrently.
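
A middle ground between the two options can be sketched as follows: record the patched locations, coalesce them into cache-line-aligned ranges, and flush each range once instead of once per call site. This is only an illustration under stated assumptions, not HotSpot code. `FlushRange`, `coalesce_patch_sites` and `flush_ranges` are hypothetical names, the patch size and cache line size are passed in by the caller, and `__builtin___clear_cache` is the GCC/Clang builtin mentioned in the comments. Like the first option, deferring the flush is only plausible while no thread can execute the patched code, for example inside an STW pause, and whether it is actually faster on a given AArch64 CPU would have to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a half-open address range [start, end).
struct FlushRange {
  std::uintptr_t start;
  std::uintptr_t end;
};

// Coalesce patched locations into cache-line-aligned ranges.
// Ranges that overlap or touch are merged, so each cache line is
// covered by exactly one range. line_size must be a power of two.
std::vector<FlushRange> coalesce_patch_sites(std::vector<std::uintptr_t> sites,
                                             std::size_t patch_size,
                                             std::size_t line_size) {
  assert(line_size > 0 && (line_size & (line_size - 1)) == 0);
  std::sort(sites.begin(), sites.end());
  const std::uintptr_t mask = ~static_cast<std::uintptr_t>(line_size - 1);
  std::vector<FlushRange> ranges;
  for (std::uintptr_t site : sites) {
    // Round the patched bytes [site, site + patch_size) out to whole lines.
    std::uintptr_t start = site & mask;
    std::uintptr_t end = (site + patch_size + line_size - 1) & mask;
    if (!ranges.empty() && start <= ranges.back().end) {
      ranges.back().end = std::max(ranges.back().end, end);
    } else {
      ranges.push_back({start, end});
    }
  }
  return ranges;
}

// Flush each coalesced range once. The addresses must refer to mapped
// code memory; this uses the GCC/Clang builtin, not a HotSpot API.
void flush_ranges(const std::vector<FlushRange>& ranges) {
  for (const FlushRange& r : ranges) {
    __builtin___clear_cache(reinterpret_cast<char*>(r.start),
                            reinterpret_cast<char*>(r.end));
  }
}
```

With four patched sites at 0x1000, 0x1004, 0x1040 and 0x2000, a 4-byte patch and 64-byte lines, the sketch produces two ranges, so two flushes would be issued instead of four.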

This issue became apparent with JDK-8290025, which removed the code cache sweeper. Previously this work was done concurrently, so unnecessary icache flushes may only have reduced execution performance; now they affect pause times too.
Comments
If it is Linux, what version of gcc did you use to build jdk? In gcc10 `__builtin___clear_cache` was optimized: https://github.com/gcc-mirror/gcc/commit/761e6bb9f7d2bd782d93e46baebade2eb1f7d16e
24-10-2023

What OS? For example, on Linux AArch64 `ICache::invalidate_range` uses `__builtin___clear_cache`.
24-10-2023

[~tschatzl] What CPU implementing AArch64 did you use?
24-10-2023

Related to JDK-8303971, [~dfenacci].
10-10-2023