JDK-8352082 : Avoid scalar cmov in extreme int min/max reduction loop scenarios
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 25
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2025-03-14
  • Updated: 2025-03-17
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Description
One of the scenarios mentioned in JDK-8351409 affects int min/max loops:

Given a loop with a int min/max reduction pattern with one side of branch taken near 100% of time, when Supeword finds the pattern not profitable, then HotSpot will use scalar instructions (cmov) and performance will regress.

Example java code:
```
import java.util.Random;

public class TestIntMaxEmanuel {
  private static Random RANDOM = new Random();

  public static void main(String[] args) {
    int[] a = new int[64 * 1024];
    for (int i = 0; i < a.length; i++) {
      a[i] = RANDOM.nextInt();
    }


    {
      System.out.println("Warmup");
      for (int i = 0; i < 10_000; i++) { test1(a); }
      System.out.println("Run");
      long t0 = System.nanoTime();
      for (int i = 0; i < 10_000; i++) { test1(a); }
      long t1 = System.nanoTime();
      System.out.println("Time: " + (t1 - t0));
    }

    {
      System.out.println("Warmup");
      for (int i = 0; i < 10_000; i++) { test2(a); }
      System.out.println("Run");
      long t0 = System.nanoTime();
      for (int i = 0; i < 10_000; i++) { test2(a); }
      long t1 = System.nanoTime();
      System.out.println("Time: " + (t1 - t0));
    }
  }

  public static int test1(int[] a) {
    int x = Integer.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
      x = Math.max(x, a[i]);
    }
    return x;
  }

  public static int test2(int[] a) {
    int x = Integer.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
      x = (x >= a[i]) ? x : a[i];
    }
    return x;
  }
}
```

Running:
```
$ java -XX:CompileCommand=compileonly,TestIntMaxEmanuel::test* -XX:CompileCommand=printcompilation,TestIntMaxEmanuel::test* TestIntMaxEmanuel.java
CompileCommand: compileonly TestIntMaxEmanuel.test* bool compileonly = true
CompileCommand: PrintCompilation TestIntMaxEmanuel.test* bool PrintCompilation = true
Warmup
3774   93 %     3       TestIntMaxEmanuel::test1 @ 5 (27 bytes)
3775   94       3       TestIntMaxEmanuel::test1 (27 bytes)
3776   95 %     4       TestIntMaxEmanuel::test1 @ 5 (27 bytes)
3782   96       4       TestIntMaxEmanuel::test1 (27 bytes)
Run
Time: 699 568 502
Warmup
5193  101 %     3       TestIntMaxEmanuel::test2 @ 5 (34 bytes)
5193  102       3       TestIntMaxEmanuel::test2 (34 bytes)
5194  103 %     4       TestIntMaxEmanuel::test2 @ 5 (34 bytes)
5198  104       4       TestIntMaxEmanuel::test2 (34 bytes)
Run
Time: 243 859 789
```

Tracing the auto vectorizer we observe that `test1` was not vectorized because it's not profitable:

```
WARNING: Removed pack: not profitable:
    0:  577  MaxI  === _ 602 578  [[ 574 ]]  !orig=519,448,199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    1:  574  MaxI  === _ 577 575  [[ 567 ]]  !orig=516,199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    2:  567  MaxI  === _ 574 568  [[ 564 ]]  !orig=448,199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    3:  564  MaxI  === _ 567 565  [[ 519 ]]  !orig=199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    4:  519  MaxI  === _ 564 520  [[ 516 ]]  !orig=448,199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    5:  516  MaxI  === _ 519 517  [[ 448 ]]  !orig=199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    6:  448  MaxI  === _ 516 449  [[ 199 ]]  !orig=199,175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)
    7:  199  MaxI  === _ 448 200  [[ 602 357 252 ]]  !orig=175 !jvms: TestIntMaxEmanuel::test1 @ bci:15 (line 37)

After Superword::filter_packs_for_profitable

PackSet::print: 0 packs

SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
```

`test2` is not vectorized because it contains control flow.

The assembly for `test1` looks like this:

```
  0x00007f3c7cad9e37:   cmpl		(%rsp), %eax
  0x00007f3c7cad9e3a:   movl		(%rsp), %r11d
  0x00007f3c7cad9e3e:   cmovll		%r11d, %eax         ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMaxEmanuel::test1@15 (line 37)
```

And for `test2`:

```
  0x00007f3c7cadb300:   cmpl		%r11d, %eax
  0x00007f3c7cadb303:   jl		0x7f3c7cadb367      ;*istore_1 {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TestIntMaxEmanuel::test2@25 (line 45)
```

This issue can also be observed with the `MinMaxVector` benchmark added in JDK-8307513. The test scenarios above are represented by `intReductionSimpleMax`. To emulate `test2`, simply disable the `_max` intrinsic (-max below). The results are on an AVX-512 machine are:

```
Benchmark                             (probability)  (range)  (seed)  (size)   Mode  Cnt       -max       +max   Units
MinMaxVector.intReductionSimpleMax              100      N/A     N/A    2048  thrpt    4    998.211    460.404  ops/ms (-53%)
```

Again we observe that `test2` equivalent is faster.

Please also note that testing int min has to be done with care, as explained in this [jmh-dev](https://mail.openjdk.org/pipermail/jmh-dev/2025-February/004094.html) mailing thread. One way to mitigate might be to use compiler threshold scaling, to make Math.min(II) be compiled later. This needs to be verified.

One important difference between int and long min/max is that the support for int vectorized min/max instructions are more widely available. So this regression should happen less often compared to long min/max.
Comments
ILW = Same as JDK-8351409 = P4
17-03-2025