Recently I've been trying to intrinsify Math.max(long, long) and Math.min(long, long) in JDK-8307513. After I sent a PR [1] for it, Jasmine recommended adding a microbenchmark for vectorized long min/max. I created the attached benchmark, and when I ran it on a non-AVX-512 machine, I discovered a regression compared to the unpatched JDK:

PR:
```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8   9131.815 ±   4.942  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8    923.389 ±   0.287  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8     92.739 ±   0.201  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8   9132.208 ±   6.734  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8    926.161 ±   2.555  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8     92.824 ±   0.104  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8   9157.142 ±  18.448  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8    925.025 ±   0.560  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8     92.772 ±   0.164  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8   8988.795 ±  89.391  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8    915.769 ±   5.681  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8     92.064 ±   0.109  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8   8962.932 ±  53.301  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8    917.635 ±   1.200  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8     92.070 ±   0.104  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8   8976.809 ±  68.086  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8    918.527 ±   0.303  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8     92.052 ±   0.069  ops/ms
```

Unpatched:
```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8   9137.237 ±   7.536  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8    923.167 ±   1.151  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8     92.779 ±   0.129  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8  10964.305 ± 290.729  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8   1156.589 ±  34.254  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8    117.060 ±   0.413  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8  10690.906 ± 605.322  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8   1185.421 ±  26.004  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8    120.284 ±   0.099  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8  10532.930 ± 616.592  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8   1196.586 ±  37.352  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8    121.613 ±   0.079  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8  10576.153 ± 703.215  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8   1175.234 ±   5.175  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8    119.226 ±   5.004  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8  10233.026 ±  18.896  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8   1206.025 ±   1.366  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8    121.684 ±   0.066  ops/ms
```

The regression shows up at the 80% and 100% probabilities, i.e. when one of the min/max branches is taken more often than the other, and the cause is the assembly being produced. With master and either of these two branch probabilities, HotSpot turns the min/max operations into a jmp+mov instruction sequence, mimicking the bytecode implementation of those methods.
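For context, that jmp+mov shape comes straight from the plain Java definition of the method. Here is a minimal sketch of the scalar code C2 is mimicking (the real method lives in `java.lang.Math`; `MaxSketch` is just an illustrative stand-in):

```java
public class MaxSketch {
    // Equivalent to java.lang.Math.max(long, long): a compare plus a
    // conditional return, which the JIT lowers to cmp + conditional jump
    // + mov when no intrinsic takes over.
    public static long max(long a, long b) {
        return (a >= b) ? a : b;
    }
}
```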
```
           ↗   0x00007f57800331a9: movq    %r13, %r8
  0.02%    │   0x00007f57800331ac: nopl    (%rax)            ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
           │                                                 ; - java.lang.Math::max@11 (line 2035)
           │                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
           │                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
          ↗│   0x00007f57800331b0: incl    %esi              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@62 (line 214)
          ││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
          ││   0x00007f57800331b2: cmpl    %ecx, %esi
  0.03%  ╭││   0x00007f57800331b4: jge     0x7f57800331e7    ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         │││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@9 (line 214)
         │││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         │││   0x00007f57800331b6: movq    %r8, %r13         ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
         │││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@12 (line 215)
         │││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.02%  │││ ↗ 0x00007f57800331b9: movq    0x10(%rdx, %rsi, 8), %r11  ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@36 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.08%  │││ │ 0x00007f57800331be: movq    0x10(%rax, %rsi, 8), %r8   ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@17 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  4.36%  │││ │ 0x00007f57800331c3: movq    0x10(%r10, %rsi, 8), %r9   ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@23 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.02%  │││ │ 0x00007f57800331c8: movq    %r8, %rdi
  0.08%  │││ │ 0x00007f57800331cb: imulq   %r11, %rdi
 26.37%  │││ │ 0x00007f57800331cf: movq    %r9, %rbp
  0.05%  │││ │ 0x00007f57800331d2: imulq   %r11, %rbp
  0.03%  │││ │ 0x00007f57800331d6: imulq   %r9, %r8
         │││ │ 0x00007f57800331da: addq    %rdi, %r8
  0.59%  │││ │ 0x00007f57800331dd: addq    %rbp, %r8         ;*ladd {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
 34.25%  │││ │ 0x00007f57800331e0: cmpq    %r8, %r13
 32.66%  │╰│ │ 0x00007f57800331e3: jl      0x7f57800331b0    ;*iflt {reexecute=0 rethrow=0 return_oop=0}
         │ │ │                                               ; - java.lang.Math::max@3 (line 2035)
         │ │ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
         │ │ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         │ ╰ │ 0x00007f57800331e5: jmp     0x7f57800331a9
         ↘   │ 0x00007f57800331e7: movq    0x450(%r15), %r11 ; ImmutableOopMap {r10=Oop rbx=Oop rdx=Oop rax=Oop xmm0=Oop xmm1=Oop xmm2=Oop }
```

With the code in the PR [1], the intrinsic kicks in, so the bytecode implementation is no longer used.
Without a backend implementation for long min/max, `PhaseMacroExpand::expand_macro_nodes()` transforms the Max/Min nodes into branchless cmov instructions:

```
 5.83%  │  0x00007fcccc032b72: cmpq    %r14, %rcx
 0.46%  │  0x00007fcccc032b75: cmovlq  %r14, %rcx
 2.72%  │  0x00007fcccc032b79: addq    %r11, %r8
        │  0x00007fcccc032b7c: imulq   0x28(%rbp, %rbx, 8), %r8
        │  0x00007fcccc032b82: cmpq    %rax, %rcx
 4.64%  │  0x00007fcccc032b85: cmovlq  %rax, %rcx          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
12.27%  │  0x00007fcccc032b89: addq    %r13, %r8           ;*ladd {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
        │  0x00007fcccc032b8c: cmpq    %r10, %rcx
 8.38%  │  0x00007fcccc032b8f: cmovlq  %r10, %rcx
12.47%  │  0x00007fcccc032b93: cmpq    %r8, %rcx
13.08%  │  0x00007fcccc032b96: cmovlq  %r8, %rcx           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
```

`cmov` instructions in loops like the one in the benchmark create long data dependency chains: each cmov must wait for the previous iteration's result. The branchy version can be predicted, which lets the CPU run ahead speculatively and gives better performance in this scenario.

What is happening at 50%? At that branch probability, unpatched HotSpot already uses cmov instead of a branch, so it performs just like the patched code. This is due to `PhaseIdealLoop::conditional_move()`:

```
# Benchmark: org.openjdk.bench.java.lang.MinMaxReductionBench.longMax
# Parameters: (probability = 50, size = 10000)
...
c2, level 4, org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub, version 5, compile id 1072

 6.89%   0x00007fcd5c032b41: imulq   0x20(%rbp, %rbx, 8), %r10
         0x00007fcd5c032b47: addq    %r14, %rax
         0x00007fcd5c032b4a: leaq    (%r9, %rsi), %r14
         0x00007fcd5c032b4e: imulq   0x10(%rbp, %rbx, 8), %r14
         0x00007fcd5c032b54: imulq   %rsi, %r9
         0x00007fcd5c032b58: imulq   %rdi, %rdx
         0x00007fcd5c032b5c: addq    %r9, %r14
         0x00007fcd5c032b5f: addq    %rdx, %r10          ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
 6.37%   0x00007fcd5c032b62: cmpq    %r14, %rcx
 0.20%   0x00007fcd5c032b65: cmovlq  %r14, %rcx
 3.08%   0x00007fcd5c032b69: addq    %r11, %r8
         0x00007fcd5c032b6c: imulq   0x28(%rbp, %rbx, 8), %r8
         0x00007fcd5c032b72: cmpq    %rax, %rcx
 4.94%   0x00007fcd5c032b75: cmovlq  %rax, %rcx          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
10.84%   0x00007fcd5c032b79: addq    %r13, %r8           ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b7c: cmpq    %r10, %rcx
 8.16%   0x00007fcd5c032b7f: cmovlq  %r10, %rcx
13.59%   0x00007fcd5c032b83: cmpq    %r8, %rcx
12.58%   0x00007fcd5c032b86: cmovlq  %r8, %rcx           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
18.94%   0x00007fcd5c032b8a: addl    $4, %ebx            ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@62 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b8d: cmpl    0x48(%rsp), %ebx
         0x00007fcd5c032b91: jl      0x7fcd5c032ae0      ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@9 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b97: movq    0x450(%r15), %r10   ; ImmutableOopMap {rbp=Oop xmm0=Oop xmm1=Oop xmm2=Oop xmm3=Oop xmm5=Oop xmm6=Oop }
                                                         ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                         ; - (reexecute) org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@65 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b9e: testl   %eax, (%r10)        ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@65 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
                                                         ; {poll}
 0.06%   0x00007fcd5c032ba1: cmpl    0x44(%rsp), %ebx
         0x00007fcd5c032ba5: jge     0x7fcd5c032bee
         0x00007fcd5c032ba7: vmovq   %xmm5, %rdi
         0x00007fcd5c032bac: movl    0x44(%rsp), %r9d
         0x00007fcd5c032bb1: vmovq   %xmm6, %rdx
         0x00007fcd5c032bb6: movl    %r9d, %r11d
         0x00007fcd5c032bb9: subl    %ebx, %r11d
         0x00007fcd5c032bbc: xorl    %r8d, %r8d
```

One way to improve this would be to change `cmovL_*` (and maybe `cmovI_*`?) to emit a branch+mov pair instead of a cmov.
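To make the dependency-chain argument concrete, here is a hypothetical Java sketch of the two shapes the reduction loop can take (class and method names are illustrative, not the actual benchmark code). In the cmov shape, the select executes unconditionally every iteration, so iteration i+1 is data-dependent on iteration i's result; in the branchy shape, a well-predicted branch can be speculated past, hiding the latency at the 80%/100% probabilities:

```java
public class ReductionSketch {
    // cmov shape: Math.max lowers to a conditional select, so each
    // iteration's accumulator is a data input to the next one,
    // forming a serial dependency chain through the whole loop.
    public static long maxSelect(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);
        }
        return acc;
    }

    // branchy shape: when "v > acc" is predictable, the CPU
    // speculates past the compare and the (rarely taken) mov,
    // so iterations overlap instead of serializing.
    public static long maxBranch(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            if (v > acc) {
                acc = v;
            }
        }
        return acc;
    }
}
```

Both compute the same result; only the shape of the generated code differs.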
In fact, this is already done for floats and doubles:

```
instruct cmovD_reg(cmpOp cop, rFlagsReg cr, regD dst, regD src)
%{
  match(Set dst (CMoveD (Binary cop cr) (Binary dst src)));

  ins_cost(200); // XXX
  format %{ "jn$cop    skip\t# signed cmove double\n\t"
            "movsd     $dst, $src\n"
    "skip:" %}
  ins_encode %{
    Label Lskip;
    // Invert sense of branch from sense of CMOV
    __ jccb((Assembler::Condition)($cop$$cmpcode^1), Lskip);
    __ movdbl($dst$$XMMRegister, $src$$XMMRegister);
    __ bind(Lskip);
  %}
  ins_pipe(pipe_slow);
%}
```

Unfortunately there is not enough source history to learn why this was done. Maybe Vladimir Kozlov knows? He was the last one to refactor that code.

I quickly modified `cmovL_reg` to do the same:

```
instruct cmovL_reg(cmpOp cop, rFlagsReg cr, rRegL dst, rRegL src)
%{
  match(Set dst (CMoveL (Binary cop cr) (Binary dst src)));

  ins_cost(200); // XXX
  format %{ "jn$cop    skip\t# signed cmove long\n\t"
            "movq      $dst, $src\n"
    "skip:" %}
  ins_encode %{
    Label Lskip;
    // Invert sense of branch from sense of CMOV
    __ jccb((Assembler::Condition)($cop$$cmpcode^1), Lskip);
    __ movq($dst$$Register, $src$$Register);
    __ bind(Lskip);
  %}
  ins_pipe(pipe_slow);
%}
```

Combined with the long max/min intrinsic, this got the following benchmark results on my non-AVX-512 system:

```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8  13824.545 ± 283.984  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8   1327.359 ±  15.786  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8    116.902 ±   2.291  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8  13777.837 ± 142.830  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8   1452.889 ±   4.550  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8    146.434 ±   0.218  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8  13782.930 ±   4.824  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8   1415.578 ±  18.276  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8    142.758 ±   0.645  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8  12027.222 ±  42.673  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8   1350.302 ±   5.152  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8    137.061 ±   0.067  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8  12246.891 ± 100.015  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8   1350.864 ±   4.566  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8    136.926 ±   0.299  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8  12257.758 ±  99.920  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8   1349.478 ±   1.063  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8    136.912 ±   0.063  ops/ms
```

The performance numbers are higher than current master across all branch probabilities and sizes.

[1] https://github.com/openjdk/jdk/pull/20098