JDK-8274564 : Add VM options to control Thread.onSpinWait intrinsic code for AArch64
  • Type: CSR
  • Component: hotspot
  • Sub-Component: compiler
  • Priority: P4
  • Status: Provisional
  • Resolution: Unresolved
  • Fix Versions: 11-pool,17-pool,18
  • Submitted: 2021-09-30
  • Updated: 2021-10-09
Related Reports
CSR :  
Description
Summary
-------

Add new VM options `OnSpinWaitInst` and `OnSpinWaitInstCount` to control the code generated for the JVM's `Thread.onSpinWait` intrinsic on AArch64.

Problem
-------

The AArch64 ISA provides the `YIELD` [instruction](https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/YIELD--YIELD-). According to the specification, the instruction can be used to implement spin pauses, including the `Thread.onSpinWait` intrinsic. However, most known hardware implementations of the AArch64 ISA treat the instruction as a `NOP`, so no pause actually happens.
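
For context, the kind of loop that relies on this intrinsic can be sketched as follows. This is a minimal illustration, not code from the change itself: the class name `SpinWaitExample`, the `ready` flag, and the `awaitReady` helper are hypothetical. Only `Thread.onSpinWait()` is the real API under discussion.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitExample {
    // Hypothetical flag set by a producer thread; the consumer spins until it flips.
    static final AtomicBoolean ready = new AtomicBoolean(false);

    // Busy-waits until the flag is set and returns the number of spin iterations.
    static long awaitReady() {
        long spins = 0;
        while (!ready.get()) {
            // Hint that this thread is in a spin loop. The JIT compiler replaces
            // this call with the platform's spin-pause code, if it has any.
            Thread.onSpinWait();
            spins++;
        }
        return spins;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ready.set(true);
        });
        producer.start();
        long spins = awaitReady();
        producer.join();
        System.out.println("spun " + spins + " times before the flag was set");
    }
}
```

If the intrinsic emits nothing, or an instruction that the hardware treats as a `NOP`, the loop body still runs correctly but the core gets no pause between reads of the flag.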

Experiments with sequences of `NOP` or `ISB` instructions have shown that they can be used to implement spin pauses:

 - http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004870.html
 - https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-August/054033.html
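
A crude way to check whether the generated code produces a measurable pause is to time repeated calls. The sketch below is an assumption-laden illustration, not the methodology used in the linked threads: the class name `SpinPauseTiming` and the helper `averageNanos` are hypothetical, the figure includes loop overhead, and a proper harness such as JMH would give more trustworthy numbers.

```java
public class SpinPauseTiming {
    // Returns the average wall-clock time, in nanoseconds, of one
    // Thread.onSpinWait() call over the given number of iterations.
    // Loop overhead is included in the result.
    static double averageNanos(int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Thread.onSpinWait();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        // Warm-up pass so the JIT compiler has a chance to compile the loop
        // and substitute the intrinsic before the measured pass.
        averageNanos(1_000_000);
        System.out.printf("average: %.2f ns per call%n", averageNanos(10_000_000));
    }
}
```

Running the same class on different hardware, or with different spin-pause code, and comparing the averages gives a first impression of how long each variant pauses.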

The problem is that `NOP`s work best on some AArch64 microarchitectures and `ISB`s on others. The number of instructions needed can also depend on the microarchitecture. Future microarchitectures may implement the `YIELD` instruction properly.

Solution
--------

Add VM options to control which instruction, and how many of them, to use in the `Thread.onSpinWait` intrinsic code on AArch64.

Specification
-------------

VM option to specify the instruction: `OnSpinWaitInst=inst`, where `inst` can be one of the following on AArch64:

- `none`: no implementation for spin pauses. This is the default value.
- `nop`: use the `nop` instruction for spin pauses.
- `isb`: use the `isb` instruction for spin pauses.
- `yield`: use the `yield` instruction for spin pauses.

VM option to specify the number of instructions: `OnSpinWaitInstCount=count`, where `count` is the number of `OnSpinWaitInst` instructions to generate and must be in the range `1..99`.

It is an error to use `OnSpinWaitInstCount` when `OnSpinWaitInst` is `none`.

```
  product(ccstr, OnSpinWaitInst, "none",
          "Use instructions to implement java.lang.Thread.onSpinWait()."
          "Options: none, nop, isb, yield.")
  product(uint, OnSpinWaitInstCount, 1,
          "Use a number of OnSpinWaitInst.")
          range(1, 99)
```
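
The rules in this specification can be modelled in a few lines of Java. This is only a sketch of the stated constraints, not the HotSpot implementation (which is C++ and lives in the linked pull request). The class `OnSpinWaitOptions` and the method `check` are hypothetical names, and a `null` count is an assumed convention meaning that `OnSpinWaitInstCount` was not given on the command line.

```java
import java.util.Set;

public class OnSpinWaitOptions {
    // Instruction names accepted by OnSpinWaitInst, per the specification above.
    static final Set<String> INSTRUCTIONS = Set.of("none", "nop", "isb", "yield");
    // Range accepted by OnSpinWaitInstCount, per the specification above.
    static final int MIN_COUNT = 1;
    static final int MAX_COUNT = 99;

    // Throws IllegalArgumentException if the combination violates the rules.
    // A null count means OnSpinWaitInstCount was not specified (assumption of this sketch).
    static void check(String inst, Integer count) {
        if (!INSTRUCTIONS.contains(inst)) {
            throw new IllegalArgumentException(
                "OnSpinWaitInst must be one of none, nop, isb, yield: " + inst);
        }
        if (count == null) {
            return; // the default count applies
        }
        if (inst.equals("none")) {
            throw new IllegalArgumentException(
                "OnSpinWaitInstCount cannot be used with OnSpinWaitInst=none");
        }
        if (count < MIN_COUNT || count > MAX_COUNT) {
            throw new IllegalArgumentException(
                "OnSpinWaitInstCount must be in " + MIN_COUNT + ".." + MAX_COUNT + ": " + count);
        }
    }

    public static void main(String[] args) {
        check("isb", 1);     // accepted
        check("none", null); // accepted: the defaults
        try {
            check("none", 3); // rejected: a count given without an instruction
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```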

See: https://github.com/openjdk/jdk/pull/5562/
Comments
> It is however an extremely low-level detail that will be way beyond the ability of most users to effectively measure and optimise.

I think that's overstating things a little bit: all it takes is a fairly simple JMH benchmark and look at the numbers. It might be that the benchmark doesn't exactly reproduce what you need in your application, but it would be a good starting point. We should ensure that a simple JMH benchmark is committed along with onSpinWait support.
09-10-2021

> A typical use case is that performance engineers use the options to find the best instruction and its count. Then compiler engineers set this as default. In rare cases the default can be changed with options but this would be either to disable the functionality or to change instruction count. Based on these cases being product might be too strong for the options. Maybe they should be experimental.

If that is the expected use-case then diagnostic would be more appropriate.
09-10-2021

Hi Joe and David,

Thank you for reviewing.

> Is a dedicated option needed for this functionality? Shouldn't an intrinsic "do the right thing" for a given AArch64 micro-architecture?

Even for a given micro-architecture, for example Neoverse N1, hardware implementations of the micro-architecture might have differences. The exact behaviour of the instructions we want to use is not specified in the micro-architecture documentation. Benchmarking on Graviton2 (one implementation of Neoverse N1) gave a solution. We think it should work for all implementations of Neoverse N1. However, the solution might not be good for some CPUs of the Ampere Altra family (another implementation of Neoverse N1). There might also be use cases or workloads for which the solution we found does not work. Yes, the option is needed.

> It is however an extremely low-level detail that will be way beyond the ability of most users to effectively measure and optimise.

Yes, the options are not for most users. A typical use case is that performance engineers use the options to find the best instruction and its count. Then compiler engineers set this as default. In rare cases the default can be changed with options but this would be either to disable the functionality or to change instruction count. Based on these cases being product might be too strong for the options. Maybe they should be experimental.

> That said I assume the benchmarking already gave some insight into the number of each instruction that seems to be best for that microarchitecture, so is a default of 1 really right, or should it default to whatever the microbenchmarks determined was best?

The default of `OnSpinWaitInstCount` is `1` because this is what was found for `isb` and what is assumed for `yield`.

- For `OnSpinWaitInst=none`, `OnSpinWaitInstCount` must be `0`.
- For `OnSpinWaitInst=isb`, `OnSpinWaitInstCount` makes sense to be `1..3`, based on benchmarking results. We can set `1..10` to allow experiments.
- For `OnSpinWaitInst=yield`, `OnSpinWaitInstCount` is supposed to be `1`. As no implementation of the instruction exists we don't know the range.
- For `OnSpinWaitInst=nop`, `OnSpinWaitInstCount` makes sense to be `1..99`.

The current specification merges all of the ranges into one: `1..99`. Maybe a better specification is:

```
  product(uint, OnSpinWaitInstCount, 0,
          "The number of OnSpinWaitInst instructions to generate."
          "It cannot be used with OnSpinWaitInst=none."
          "Options:"
          "For OnSpinWaitInst=nop, 1..99"
          "For OnSpinWaitInst=isb, 1..10"
          "For OnSpinWaitInst=yield, 1"
          )
```
08-10-2021

[~darcy] I've only been taking a casual glance at the traffic on this as it is so Aarch64 specific. I find it unfortunate that the relevant microarchitecture versions can't be readily detected such that this would by default use the best form for that microarchitecture, but even then the ability to override that could be important. It is however an extremely low-level detail that will be way beyond the ability of most users to effectively measure and optimise.

I do have some grammatical nits with the proposed wording and suggest the following:

```
  product(ccstr, OnSpinWaitInst, "none",
          "The instruction to use to implement java.lang.Thread.onSpinWait()."
          "Options: none, nop, isb, yield.")
  product(uint, OnSpinWaitInstCount, 1,
          "The number of OnSpinWaitInst instructions to generate.")
          range(1, 99)
```

That said I assume the benchmarking already gave some insight into the number of each instruction that seems to be best for that microarchitecture, so is a default of 1 really right, or should it default to whatever the microbenchmarks determined was best?
08-10-2021

Moving to Provisional, not Approved for JDK 18. [~eastigeevich], if you want CSR review of release trains done now, please add additional fixVersion (17-pool, etc.) before the request is re-Finalized. Is a dedicated option needed for this functionality? Shouldn't an intrinsic "do the right thing" for a given AArch64 micro-architecture? [~dholmes], any comments on this?
08-10-2021