JDK-8365926 : RISC-V: Performance regression in renaissance (chi-square)
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24,25,26
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: riscv
  • Submitted: 2025-08-21
  • Updated: 2025-09-23
  • Resolved: 2025-09-12
JDK 25: 25.0.2 (Fixed)
JDK 26: 26 b16 (Fixed)
Description
When running e.g. chi-square, a large performance regression can be seen on some hardware (in this case the P550).
These Renaissance benchmarks are highly compiler-dependent, meaning results can vary by 30% from run to run due to differences in the code cache (both from profiling and from placement of code).

One major factor is that pre-24 rv64 used trampoline calls:
##################
  0x00007ff43025ee8c:   jal	ra,0x00007ff43025f16c  // if the target is reachable this is a direct call, otherwise it goes via the trampoline
...
  0x00007ff43025f16c:   auipc	t1,0x0                      ;   {trampoline_stub}
  0x00007ff43025f170:   ld	t1,12(t1) # 0x00007ff43025f178
  0x00007ff43025f174:   jalr	zero,0(t1)
  0x00007ff43025f178:   <8-byte address> // atomically patchable
##################

Due to issues with the load from within the instruction stream (intra-cache) and an unneeded extra jump, this was changed in "8332689: RISC-V: Use load instead of trampolines":

##################
  0x00007ff3b4342c30:   auipc	t1,0x0                          // pc-relative base
  0x00007ff3b4342c34:   ld	t1,832(t1) # 0x00007ff3b4342f70 // load destination from stub
  0x00007ff3b4342c38:   jalr	ra,0(t1)                        // always an indirect call
...
  0x00007ff3b4342f70:   <8-byte address> // atomically patchable
##################

But this implementation has no direct calls, as in practice they are rare.
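
For context: a direct jal encodes a 21-bit signed, 2-byte-aligned offset, so it can only reach targets within roughly +/- 1 MiB of the call site, while the auipc/ld/jalr sequence loads a full 8-byte destination from the stub and is not range-limited. A minimal sketch of the reachability test (jal_reachable is a hypothetical name, not a HotSpot function):

##################
#include <cstdint>

// Hypothetical helper: can a call at 'pc' reach 'target' with a single
// RV64 jal? The J-type immediate is 21 bits, signed, with bit 0 implied
// zero, giving a reach of -2^20 .. 2^20-2 bytes (about +/- 1 MiB).
static bool jal_reachable(uint64_t pc, uint64_t target) {
  int64_t offset = (int64_t)(target - pc);
  const int64_t reach = INT64_C(1) << 20;
  return offset >= -reach && offset < reach && (offset & 1) == 0;
}
##################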
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk25u/pull/207 Date: 2025-09-17 00:19:54 +0000
17-09-2025

[jdk25u-fix-request] Approval Request from Fei Yang
Same issue exists in the jdk25u repo as well. Backport to fix the performance regression on the linux-riscv64 platform. This is a RISC-V-specific change which has been verified with tier1-tier3 tests.
17-09-2025

OK. It should be a clean backport. Let me launch some tiered tests on my SG2042 machine.
16-09-2025

Yes, we should most definitely backport. I'm very busy at the moment.
16-09-2025

I assume jdk25u repo is affected by the same issue as well?
16-09-2025

Changeset: 5c1865a4 Branch: master Author: Robbin Ehn <rehn@openjdk.org> Date: 2025-09-12 08:01:50 +0000 URL: https://git.openjdk.org/jdk/commit/5c1865a4fcd5da80ddcc506f4e41aada0fb93970
12-09-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/26944 Date: 2025-08-26 14:43:05 +0000
29-08-2025

PS: When comparing performance across different JDK versions, please note that we've disabled several intrinsics on RISC-V platforms with slow misaligned memory accesses, like the SiFive P550/Unmatched, in JDK mainline and JDK 25. These changes were made due to performance issues under COH (Compact Object Headers). Examples: https://bugs.openjdk.org/browse/JDK-8351145, https://bugs.openjdk.org/browse/JDK-8359218, etc. Not sure if these make a difference in the performance numbers shown in today's meeting. Just reminding :-)
26-08-2025

I'll let nmethod relocation go in first, as it's reviewed and ready to go AFAICT, and pick up any pieces from there. The release in make_jal_opt is there to make sure the store to the instruction stream happens before the I-cache flush:
1: store destination to stub
2: release
3: store destination to instruction stream
4: release
5: i-cache flush
Thanks!
26-08-2025
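
A minimal C++ sketch of the five-step order above, assuming plain aligned stores plus release fences model the HotSpot barriers; patch_call_site and flush_icache are hypothetical names, and __builtin___clear_cache stands in for ICache::invalidate_range:

##################
#include <atomic>
#include <cstddef>
#include <cstdint>

// Stand-in for ICache::invalidate_range (hypothetical helper).
static void flush_icache(char* start, size_t len) {
  __builtin___clear_cache(start, start + len);
}

// Hypothetical sketch of the five-step patching order described above.
void patch_call_site(uint64_t* stub_slot, uint32_t* insn_slot,
                     uint64_t dest, uint32_t new_insn) {
  *stub_slot = dest;                                    // 1: store destination to stub
  std::atomic_thread_fence(std::memory_order_release);  // 2: release
  *insn_slot = new_insn;                                // 3: store to instruction stream
  std::atomic_thread_fence(std::memory_order_release);  // 4: release
  flush_icache((char*)insn_slot, sizeof(uint32_t));     // 5: i-cache flush
}
##################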

Is `NativeCall::optimize_branch` a better name than `NativeCall::make_jal_opt`? Also, I noticed a redundant `OrderAccess::release()` in `NativeCall::make_jal_opt`, as `set_stub_address_destination_at(stub_addr, dest)` in `NativeCall::set_destination_mt_safe` will issue the same release barrier. I will take a closer look when you have a pull request up for review.
26-08-2025

Thanks [~rehn] for the extra information. Another question: I see people are working on relocation of nmethods within the CodeCache (https://github.com/openjdk/jdk/pull/23573). I suppose this won't conflict with that?
26-08-2025

Linux ftrace does NOP<->JALR patching (the auipc is left intact in the code stream).
25-08-2025

"We are assuming that a natural aligned 4-byte instruction can be atomically (from the I-fetcher point of view) changed. Meaning I-fetcher would either see the old instruction or the new, and it will see the new instruction eventually." -- Yes, that's our assumption. I am wondering if similar assumption is there in other programs. I guess maybe similar things happen in the linux kernel on RISC-V. As I remembered, hot-patching mechanism in the kernel space involves instruction patching on function entry point. But I am not sure about the details.
25-08-2025

Benchmark on P550:
File: chi-square.jdk-23.prep  Mean: 3189.5827  Standard Deviation: 284.6478
File: chi-square.jdk-25.prep  Mean: 3424.8905  Standard Deviation: 222.2208
File: chi-square.jdk-26.prep  Mean: 3144.8535  Standard Deviation: 229.2577
Note: as the code cache layout differs between runs, this is 100 runs of 20 iterations each, with the first 10 iterations of each run removed.
25-08-2025

Yes, there is no spec for this; Zjid is still delayed. We are assuming that a naturally aligned 4-byte instruction can be atomically (from the I-fetcher's point of view) changed, meaning the I-fetcher would either see the old instruction or the new, and it will see the new instruction eventually. We already have that assumption, e.g. in CompiledDirectCall::set_to_interpreted. Now, arguably, there is a difference between changing only an immediate, or changing a nop, compared to jal vs jalr.
Also note that on I/D-coherent machines we can skip ICache::invalidate_range, as the store will invalidate the cache line and it should be evicted from L1I, to be refetched. Uop caches may interfere with this on some machines; calling the old dest should flush this out (we could help by adding a fence.i in ic_miss/resolve). But let's take that discussion up if I or someone else tries to loosen this for some hardware :)
We certainly can change the make_jal_opt name.
25-08-2025
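
A minimal C++ sketch of the assumption above (patch_one_insn is a hypothetical name; std::atomic_ref is C++20, and __builtin___clear_cache stands in for the fence.i/I-cache flush): only a single, naturally aligned 4-byte instruction word is ever rewritten, which is why just the final branch (jalr <-> jal) is toggled and the auipc/ld before it are left untouched.

##################
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical illustration: replace one naturally aligned 4-byte
// instruction with a single 32-bit store, so the I-fetcher observes
// either the old or the new encoding, never a torn mix.
void patch_one_insn(uint32_t* insn, uint32_t new_insn) {
  assert(((uintptr_t)insn & 3) == 0);  // natural 4-byte alignment required
  std::atomic_ref<uint32_t>(*insn).store(new_insn, std::memory_order_release);
  __builtin___clear_cache((char*)insn, (char*)insn + 4);  // fence.i equivalent
}
##################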

Hi [~rehn]: I went through your changes, and your new approach seems interesting and kind of different from the old LTS versions. And we are doing CMC (cross-modifying code), turning the indirect branch jalr into a direct branch jal, for both. It seems safe to do such a thing on RISC-V. I have been trying to find more details in the RV specs about the side effects of instruction patching, but it seems that's not specified anywhere. Maybe I missed something? Please let me know if you have more on that. Thanks. BTW: I personally don't really like the name `NativeCall::make_jal_opt`. Maybe we can have a better name for it?
25-08-2025

I'm looking at adding back jal direct calls: https://github.com/openjdk/jdk/compare/master...robehn:8365926?expand=1
Note that we cannot remove the ld.
21-08-2025
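
For reference, a sketch of how the patched-in direct call could be encoded; encode_jal is a hypothetical name, but the bit layout is the standard RISC-V J-type encoding (opcode 0x6f, rd = ra for a call):

##################
#include <cstdint>

// Hypothetical encoder for a jal instruction. Standard RISC-V J-type
// bit layout: imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | 0x6f.
uint32_t encode_jal(uint32_t rd, int32_t offset) {
  uint32_t imm20    = ((uint32_t)offset >> 20) & 0x1;
  uint32_t imm10_1  = ((uint32_t)offset >> 1)  & 0x3ff;
  uint32_t imm11    = ((uint32_t)offset >> 11) & 0x1;
  uint32_t imm19_12 = ((uint32_t)offset >> 12) & 0xff;
  return (imm20 << 31) | (imm10_1 << 21) | (imm11 << 20) |
         (imm19_12 << 12) | (rd << 7) | 0x6f;
}

// Example: encode_jal(1, 0x2e0) yields 0x2e0000ef, i.e. "jal ra,0x2e0";
// the offset must be even and within +/- 1 MiB of the call site.
##################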