In a code unloading stress benchmark that unloads nmethods, this process takes much longer on aarch64 than on x86 (~6000 ms of CPU time vs. a few hundred ms for 90k nmethods).
The reason seems to be hardware instruction cache flushing (ICache::invalidate_range). In a test where this flushing is skipped (an incorrect change, only done to demonstrate the issue!), the time spent in this phase is comparable to x86.
Code unloading patches all CompiledICs in that phase and issues an instruction cache flush for every single patched location, which is apparently slow.
This can significantly lengthen the remark pause in G1 (but also the full gc pause in other STW collectors) in such situations.
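For a rough feel of why per-location flushes hurt so much more on aarch64, below is a standalone micro-benchmark sketch (not HotSpot code; the region size and site count are made-up values roughly matching the 90k-nmethod scenario). It issues one tiny instruction cache flush per "patched" location via the GCC/Clang __builtin___clear_cache builtin, followed by a single flush over the whole region for comparison:

  // Standalone sketch (GCC/Clang only): many tiny icache flushes vs. one big one.
  #include <chrono>
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  int main() {
    const size_t kRegion = 64 * 1024 * 1024;   // stand-in for the code cache
    const size_t kSites  = 90000;              // roughly the number of patched locations
    std::vector<char> region(kRegion);

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kSites; i++) {
      char* p = region.data() + (i * 512) % kRegion;
      __builtin___clear_cache(p, p + 8);       // one flush per tiny "IC" range
    }
    auto t1 = std::chrono::steady_clock::now();
    __builtin___clear_cache(region.data(), region.data() + kRegion);  // one flush over the whole range
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
      return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    printf("per-site flushes: %lld ms, single flush: %lld ms\n", ms(t1 - t0), ms(t2 - t1));
    return 0;
  }

On x86 the builtin is essentially a no-op because the hardware keeps the instruction cache coherent, which matches the observation that this phase is cheap there.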
Some options:
* for the STW gcs, one could still forgo the per-location instruction cache flushing and do it (conservatively) for the whole code cache once; that might be faster than flushing thousands of very tiny ranges separately (see the sketch after this list).
This only works with STW collectors.
* there is no need to patch code that is never executed again - if it were possible to identify the locations that actually need patching and only issue icache flushes for those, the number of icache flushes could likely be reduced significantly.
This also improves the situation when unloading code concurrently.
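As a rough sketch of the first option (hypothetical names: clean_all_ic_sites, ICSite and patch_site_to_clean do not exist in HotSpot, and whether CodeCache::low_bound()/high_bound() are the right bounds and whether one big flush actually wins on a given aarch64 implementation would need checking and measuring), the per-site flushes are simply deferred and replaced by one conservative flush at the end of the STW phase:

  // Illustrative sketch only: since no compiled code runs during the STW pause,
  // the per-site icache flushes can be deferred and replaced by a single
  // conservative flush over the whole code cache.
  void clean_all_ic_sites(const GrowableArray<ICSite>& sites) {
    for (int i = 0; i < sites.length(); i++) {
      patch_site_to_clean(sites.at(i).addr());   // patch, but do not flush yet
    }
    // One flush covering the whole code cache instead of ~90k tiny ones.
    // Assumes the committed code cache range fits the int size parameter.
    ICache::invalidate_range(CodeCache::low_bound(),
                             (int)(CodeCache::high_bound() - CodeCache::low_bound()));
  }

A real implementation might instead use a full instruction cache invalidate rather than a range flush; the point is only that cache maintenance happens once, after all patching is done.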
This issue became apparent with JDK-8290025, which removed the code cache sweeper: previously this work was done concurrently, and while the unnecessary icache flushes may only have reduced execution performance there, they now affect pause times too.