JDK-8307137 : [AArch64] MacroAssembler::lookup_interface_method could use conditional compare instead of branch
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: aarch64
  • Submitted: 2023-04-28
  • Updated: 2024-07-12
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Description
src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp MacroAssembler::lookup_interface_method currently loops over the itable list with

bind(search);
  // Check that the previous entry is non-null.  A null entry means that
  // the receiver class doesn't implement the interface, and wasn't the
  // same as when the caller was compiled.
  cbz(method_result, L_no_such_interface);
  if (itableOffsetEntry::interface_offset_in_bytes() != 0) {
    add(scan_temp, scan_temp, scan_step);
    ldr(method_result, Address(scan_temp, itableOffsetEntry::interface_offset_in_bytes()));
  } else {
    ldr(method_result, Address(pre(scan_temp, scan_step)));
  }
  cmp(intf_klass, method_result);
  br(Assembler::NE, search);

using two branches.  That could be replaced with

  bind(search);
  if (itableOffsetEntry::interface_offset_in_bytes() != 0) {
    add(scan_temp, scan_temp, scan_step);
    ldr(method_result, Address(scan_temp, itableOffsetEntry::interface_offset_in_bytes()));
  } else {
    ldr(method_result, Address(pre(scan_temp, scan_step)));
  }
  cmp(intf_klass, method_result);
  // Bits are: N Z V C.
  ccmp(method_result, 0, 0b0100, Assembler::NE);
  br(Assembler::NE, search);

with only one branch, at the cost of having to test why the loop exited after the loop.

That change trades a branch for an integer operation in each iteration of the loop, and on platforms with spare integer execution units and limited branch execution units (and branch predictors, and instruction prefetchers), the trade-off is a win.
Comments
Hi Peter, I ran your CCMP.tar.gz benchmark and observe the following results on different CPUs. As I can see, the ccmp approach is not always better for scanning through a zero-ending table. Ampere CPU ---------- processor : 0 BogoMIPS : 50.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x3 CPU part : 0xd0c // Neoverse N1 CPU revision : 1 | existing_search | proposed_search | ratio zero | 29692139 | 30429464 | 1.02 one | 31502592 | 32562520 | 1.03 two | 46775825 | 43481641 | 0.93 // since this line, proposed_search one is faster three | 50283291 | 46957186 | 0.93 four | 53612275 | 50135930 | 0.94 five | 56953940 | 53600235 | 0.94 six | 64656276 | 56949500 | 0.88 seven | 73482181 | 63483108 | 0.86 eight | 81650842 | 69480472 | 0.85 nine | 84786865 | 74166027 | 0.87 ten | 91891958 | 79643747 | 0.87 eleven | 98751448 | 84344742 | 0.85 twelve | 105126295 | 89420739 | 0.85 thirteen | 113453596 | 95036820 | 0.84 fourteen | 126232570 | 100951025 | 0.80 fifteen | 124139515 | 108082476 | 0.87 sixteen | 132047053 | 116328298 | 0.88 seventeen | 141586484 | 122911066 | 0.87 eighteen | 143982061 | 128212825 | 0.89 nineteen | 150786592 | 133038581 | 0.88 twenty | 159191413 | 138536941 | 0.87 twenty-one | 166767029 | 143029855 | 0.86 twenty-two | 180412249 | 147983330 | 0.82 twenty-three| 177480628 | 153742374 | 0.87 AWS c7g.2xlarge instance ------------------------ processor : 0 BogoMIPS : 2100.00 Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng CPU implementer : 0x41 CPU architecture: 8 CPU variant : 0x1 CPU part : 0xd40 // Neoverse V1 CPU revision : 1 | existing_search | proposed_search | ratio zero | 26942720 | 23299044 | 0.86 // proposed_search is better for zero offset only one | 26941080 | 26945224 | 1.00 two | 34640471 | 34641280 | 1.00 three | 38494086 | 38489235 | 1.00 four | 42339297 | 42336900 | 1.00 five | 46188041 | 46189127 | 1.00 six | 50035373 | 50038710 | 1.00 seven | 53885276 | 53883339 | 1.00 eight | 57730715 | 57758737 | 1.00 nine | 61583293 | 61827049 | 1.00 ten | 65433277 | 66770397 | 1.02 eleven | 69359088 | 72295437 | 1.04 twelve | 73130891 | 78200578 | 1.07 thirteen | 77168822 | 83937395 | 1.09 fourteen | 80828778 | 88867585 | 1.10 fifteen | 84708648 | 94186669 | 1.11 sixteen | 90597623 | 100342334 | 1.11 seventeen | 95907572 | 106567689 | 1.11 eighteen | 101318246 | 112572277 | 1.11 nineteen | 104804872 | 118290296 | 1.13 twenty | 104110630 | 123575994 | 1.19 twenty-one | 107947563 | 128643011 | 1.19 twenty-two | 111710656 | 134403456 | 1.20 twenty-three| 115920496 | 140373009 | 1.21
19-05-2023

Andrew Haley complained (https://mail.openjdk.org/pipermail/hotspot-dev/2023-May/073358.html) that the CCMP instruction is difficult to read and write. To help with that I have attached CCMP.tar.gz that contains a C++ program that compares the same key-value search implemented in four ways: (1) in C++, (2) using a simulation of the CCMP instruction to show how CCMP works, (3) using the asm from the existing itable stub, and (4) using the asm I am proposing for the itable stub. Untar the tarball, cd to the CCMP/ directory and run `make run`. After some functional tests to make sure all 4 searches find the same values (and don't find the keys that are not present in the key-value array), the program times searches of various depths in the key-value array. On my Ampere Altra aarch64 machine, searches for index 0 and 1 are 3%~5% faster using the existing itable stub search than the proposed itable stub search, but for longer searches the proposed search can be as much as 15% faster than the existing search. The C++ search is always faster than the existing asm search on my machine. But not as fast as the proposed search for longer searches. If readability (and writability) is the goal, and if we believe that itable stub searches are usually short, then maybe it would be best to write this search in C++. My ulterior motive for writing the simulation of the CCMP instruction is to show how to think about the instruction, and by showing how CCMP can be used to advantage in general searches to inspire people to use CCMP for general key-value searches. I believe key-value searches are a common operation, and making them go faster would be a good thing. In fact, CCMP can be useful for testing any conjunction, but the performance difference probably only shows up in loops.
06-05-2023

Andrew Haley asked (https://mail.openjdk.org/pipermail/hotspot-dev/2023-May/073375.html) for benchmark I used to measure the performance of various depths of interface dispatch. I have attached InterfaceStubTest.tar.gz that contains that program. Untar it, cd to the InterfaceStubTest/ directory, set JAVA_HOME to your favorite Java build, and run `make vm_version run`. If you are on an Ampere Altra aarch64 machine, you should get results quite like the table I posted, but with all the values divided by 10, because 1/10th the iterations was good enough.
06-05-2023

I think this is better off in hotspot/runtime.
02-05-2023

Setting the sub-component to 'compiler' for initial triage. Both the Compiler team and the Runtime team work on the MacroAssembler stuff so I tossed a coin...
29-04-2023