Recently I've been trying to intrinsify Math.max(long, long) and Math.min(long, long) in JDK-8307513. After I sent a PR [1] for it, Jasmine recommended adding a microbenchmark for vectorized long min/max. I created the attached benchmark, and when I ran it on a non-AVX-512 machine, I discovered a regression compared to the unpatched JDK:

PR:
```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8   9131.815 ±   4.942  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8    923.389 ±   0.287  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8     92.739 ±   0.201  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8   9132.208 ±   6.734  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8    926.161 ±   2.555  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8     92.824 ±   0.104  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8   9157.142 ±  18.448  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8    925.025 ±   0.560  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8     92.772 ±   0.164  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8   8988.795 ±  89.391  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8    915.769 ±   5.681  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8     92.064 ±   0.109  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8   8962.932 ±  53.301  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8    917.635 ±   1.200  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8     92.070 ±   0.104  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8   8976.809 ±  68.086  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8    918.527 ±   0.303  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8     92.052 ±   0.069  ops/ms
```

Unpatched:
```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8   9137.237 ±   7.536  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8    923.167 ±   1.151  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8     92.779 ±   0.129  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8  10964.305 ± 290.729  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8   1156.589 ±  34.254  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8    117.060 ±   0.413  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8  10690.906 ± 605.322  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8   1185.421 ±  26.004  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8    120.284 ±   0.099  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8  10532.930 ± 616.592  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8   1196.586 ±  37.352  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8    121.613 ±   0.079  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8  10576.153 ± 703.215  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8   1175.234 ±   5.175  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8    119.226 ±   5.004  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8  10233.026 ±  18.896  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8   1206.025 ±   1.366  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8    121.684 ±   0.066  ops/ms
```

The regression shows up at the 80% and 100% probabilities, i.e. when one of the min/max branches is taken more often than the other, and the cause is the assembly being produced. With master and either of these two branch probabilities, HotSpot turns the min/max operations into a jmp+mov instruction sequence, mimicking the bytecode implementation of those methods.
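For context, that jmp+mov shape comes straight from the plain Java definition of the method. Here is a minimal sketch of the scalar code C2 is mimicking (the real method lives in `java.lang.Math`; `MaxSketch` is just an illustrative stand-in):

```java
public class MaxSketch {
    // Equivalent to java.lang.Math.max(long, long): a compare plus a
    // conditional return, which the JIT lowers to cmp + conditional jump
    // + mov when no intrinsic takes over.
    public static long max(long a, long b) {
        return (a >= b) ? a : b;
    }
}
```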
```
           ↗   0x00007f57800331a9: movq    %r13, %r8
  0.02%    │   0x00007f57800331ac: nopl    (%rax)            ;*lreturn {reexecute=0 rethrow=0 return_oop=0}
           │                                                 ; - java.lang.Math::max@11 (line 2035)
           │                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
           │                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
          ↗│   0x00007f57800331b0: incl    %esi              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@62 (line 214)
          ││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
          ││   0x00007f57800331b2: cmpl    %ecx, %esi
  0.03%  ╭││   0x00007f57800331b4: jge     0x7f57800331e7    ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
         │││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@9 (line 214)
         │││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         │││   0x00007f57800331b6: movq    %r8, %r13         ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
         │││                                                 ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@12 (line 215)
         │││                                                 ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.02%  │││ ↗ 0x00007f57800331b9: movq    0x10(%rdx, %rsi, 8), %r11  ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@36 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.08%  │││ │ 0x00007f57800331be: movq    0x10(%rax, %rsi, 8), %r8   ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@17 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  4.36%  │││ │ 0x00007f57800331c3: movq    0x10(%r10, %rsi, 8), %r9   ;*laload {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@23 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
  0.02%  │││ │ 0x00007f57800331c8: movq    %r8, %rdi
  0.08%  │││ │ 0x00007f57800331cb: imulq   %r11, %rdi
 26.37%  │││ │ 0x00007f57800331cf: movq    %r9, %rbp
  0.05%  │││ │ 0x00007f57800331d2: imulq   %r11, %rbp
  0.03%  │││ │ 0x00007f57800331d6: imulq   %r9, %r8
         │││ │ 0x00007f57800331da: addq    %rdi, %r8
  0.59%  │││ │ 0x00007f57800331dd: addq    %rbp, %r8         ;*ladd {reexecute=0 rethrow=0 return_oop=0}
         │││ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
         │││ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
 34.25%  │││ │ 0x00007f57800331e0: cmpq    %r8, %r13
 32.66%  │╰│ │ 0x00007f57800331e3: jl      0x7f57800331b0    ;*iflt {reexecute=0 rethrow=0 return_oop=0}
         │ │ │                                               ; - java.lang.Math::max@3 (line 2035)
         │ │ │                                               ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
         │ │ │                                               ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         │ ╰ │ 0x00007f57800331e5: jmp     0x7f57800331a9
         ↘   │ 0x00007f57800331e7: movq    0x450(%r15), %r11 ; ImmutableOopMap {r10=Oop rbx=Oop rdx=Oop rax=Oop xmm0=Oop xmm1=Oop xmm2=Oop }
```

With the code in the PR [1], the intrinsic kicks in, so the bytecode implementation is no longer used.
Without a backend implementation for long min/max, `PhaseMacroExpand::expand_macro_nodes()` transforms the Max/Min nodes into branchless cmov instructions:

```
 5.83%  │  0x00007fcccc032b72: cmpq    %r14, %rcx
 0.46%  │  0x00007fcccc032b75: cmovlq  %r14, %rcx
 2.72%  │  0x00007fcccc032b79: addq    %r11, %r8
        │  0x00007fcccc032b7c: imulq   0x28(%rbp, %rbx, 8), %r8
        │  0x00007fcccc032b82: cmpq    %rax, %rcx
 4.64%  │  0x00007fcccc032b85: cmovlq  %rax, %rcx          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
12.27%  │  0x00007fcccc032b89: addq    %r13, %r8           ;*ladd {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
        │  0x00007fcccc032b8c: cmpq    %r10, %rcx
 8.38%  │  0x00007fcccc032b8f: cmovlq  %r10, %rcx
12.47%  │  0x00007fcccc032b93: cmpq    %r8, %rcx
13.08%  │  0x00007fcccc032b96: cmovlq  %r8, %rcx           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
        │                                                  ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
        │                                                  ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
```

`cmov` instructions in loops like the one in the benchmark create long data dependency chains: each cmov must wait for the previous iteration's result. The branchy version can be predicted, which lets the CPU run ahead speculatively and gives better performance in this scenario.

What is happening at 50%? At that branch probability, unpatched HotSpot already uses cmov instead of a branch, so it performs just like the patched code. This is due to `PhaseIdealLoop::conditional_move()`:

```
# Benchmark: org.openjdk.bench.java.lang.MinMaxReductionBench.longMax
# Parameters: (probability = 50, size = 10000)
...
c2, level 4, org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub, version 5, compile id 1072

 6.89%   0x00007fcd5c032b41: imulq   0x20(%rbp, %rbx, 8), %r10
         0x00007fcd5c032b47: addq    %r14, %rax
         0x00007fcd5c032b4a: leaq    (%r9, %rsi), %r14
         0x00007fcd5c032b4e: imulq   0x10(%rbp, %rbx, 8), %r14
         0x00007fcd5c032b54: imulq   %rsi, %r9
         0x00007fcd5c032b58: imulq   %rdi, %rdx
         0x00007fcd5c032b5c: addq    %r9, %r14
         0x00007fcd5c032b5f: addq    %rdx, %r10          ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
 6.37%   0x00007fcd5c032b62: cmpq    %r14, %rcx
 0.20%   0x00007fcd5c032b65: cmovlq  %r14, %rcx
 3.08%   0x00007fcd5c032b69: addq    %r11, %r8
         0x00007fcd5c032b6c: imulq   0x28(%rbp, %rbx, 8), %r8
         0x00007fcd5c032b72: cmpq    %rax, %rcx
 4.94%   0x00007fcd5c032b75: cmovlq  %rax, %rcx          ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
10.84%   0x00007fcd5c032b79: addq    %r13, %r8           ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@52 (line 215)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b7c: cmpq    %r10, %rcx
 8.16%   0x00007fcd5c032b7f: cmovlq  %r10, %rcx
13.59%   0x00007fcd5c032b83: cmpq    %r8, %rcx
12.58%   0x00007fcd5c032b86: cmovlq  %r8, %rcx           ;*invokestatic max {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@58 (line 216)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
18.94%   0x00007fcd5c032b8a: addl    $4, %ebx            ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@62 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b8d: cmpl    0x48(%rsp), %ebx
         0x00007fcd5c032b91: jl      0x7fcd5c032ae0      ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@9 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b97: movq    0x450(%r15), %r10   ; ImmutableOopMap {rbp=Oop xmm0=Oop xmm1=Oop xmm2=Oop xmm3=Oop xmm5=Oop xmm6=Oop }
                                                         ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                         ; - (reexecute) org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@65 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
         0x00007fcd5c032b9e: testl   %eax, (%r10)        ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - org.openjdk.bench.java.lang.MinMaxReductionBench::longMax@65 (line 214)
                                                         ; - org.openjdk.bench.java.lang.jmh_generated.MinMaxReductionBench_longMax_jmhTest::longMax_thrpt_jmhStub@17 (line 121)
                                                         ; {poll}
 0.06%   0x00007fcd5c032ba1: cmpl    0x44(%rsp), %ebx
         0x00007fcd5c032ba5: jge     0x7fcd5c032bee
         0x00007fcd5c032ba7: vmovq   %xmm5, %rdi
         0x00007fcd5c032bac: movl    0x44(%rsp), %r9d
         0x00007fcd5c032bb1: vmovq   %xmm6, %rdx
         0x00007fcd5c032bb6: movl    %r9d, %r11d
         0x00007fcd5c032bb9: subl    %ebx, %r11d
         0x00007fcd5c032bbc: xorl    %r8d, %r8d
```

One way to improve this would be to change `cmovL_*` (and maybe `cmovI_*`?) to emit a branch+mov pair instead of a cmov.
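To make the dependency-chain argument concrete, here is a hypothetical Java sketch of the two shapes the reduction loop can take (class and method names are illustrative, not the actual benchmark code). In the cmov shape, the select executes unconditionally every iteration, so iteration i+1 is data-dependent on iteration i's result; in the branchy shape, a well-predicted branch can be speculated past, hiding the latency at the 80%/100% probabilities:

```java
public class ReductionSketch {
    // cmov shape: Math.max lowers to a conditional select, so each
    // iteration's accumulator is a data input to the next one,
    // forming a serial dependency chain through the whole loop.
    public static long maxSelect(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            acc = Math.max(acc, v);
        }
        return acc;
    }

    // branchy shape: when "v > acc" is predictable, the CPU
    // speculates past the compare and the (rarely taken) mov,
    // so iterations overlap instead of serializing.
    public static long maxBranch(long[] a) {
        long acc = Long.MIN_VALUE;
        for (long v : a) {
            if (v > acc) {
                acc = v;
            }
        }
        return acc;
    }
}
```

Both compute the same result; only the shape of the generated code differs.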
In fact, this is already done for floats and doubles:

```
instruct cmovD_reg(cmpOp cop, rFlagsReg cr, regD dst, regD src)
%{
  match(Set dst (CMoveD (Binary cop cr) (Binary dst src)));

  ins_cost(200); // XXX
  format %{ "jn$cop    skip\t# signed cmove double\n\t"
            "movsd     $dst, $src\n"
    "skip:" %}
  ins_encode %{
    Label Lskip;
    // Invert sense of branch from sense of CMOV
    __ jccb((Assembler::Condition)($cop$$cmpcode^1), Lskip);
    __ movdbl($dst$$XMMRegister, $src$$XMMRegister);
    __ bind(Lskip);
  %}
  ins_pipe(pipe_slow);
%}
```

Unfortunately there is not enough source history to learn why this was done. Maybe Vladimir Kozlov knows? He was the last one to refactor that code.

I quickly modified `cmovL_reg` to do the same:

```
instruct cmovL_reg(cmpOp cop, rFlagsReg cr, rRegL dst, rRegL src)
%{
  match(Set dst (CMoveL (Binary cop cr) (Binary dst src)));

  ins_cost(200); // XXX
  format %{ "jn$cop    skip\t# signed cmove long\n\t"
            "movq      $dst, $src\n"
    "skip:" %}
  ins_encode %{
    Label Lskip;
    // Invert sense of branch from sense of CMOV
    __ jccb((Assembler::Condition)($cop$$cmpcode^1), Lskip);
    __ movq($dst$$Register, $src$$Register);
    __ bind(Lskip);
  %}
  ins_pipe(pipe_slow);
%}
```

Combined with the long max/min intrinsic, this got the following benchmark results on my non-AVX-512 system:

```
Benchmark                     (probability)  (size)   Mode  Cnt      Score     Error   Units
MinMaxReductionBench.longMax             50     100  thrpt    8  13824.545 ± 283.984  ops/ms
MinMaxReductionBench.longMax             50    1000  thrpt    8   1327.359 ±  15.786  ops/ms
MinMaxReductionBench.longMax             50   10000  thrpt    8    116.902 ±   2.291  ops/ms
MinMaxReductionBench.longMax             80     100  thrpt    8  13777.837 ± 142.830  ops/ms
MinMaxReductionBench.longMax             80    1000  thrpt    8   1452.889 ±   4.550  ops/ms
MinMaxReductionBench.longMax             80   10000  thrpt    8    146.434 ±   0.218  ops/ms
MinMaxReductionBench.longMax            100     100  thrpt    8  13782.930 ±   4.824  ops/ms
MinMaxReductionBench.longMax            100    1000  thrpt    8   1415.578 ±  18.276  ops/ms
MinMaxReductionBench.longMax            100   10000  thrpt    8    142.758 ±   0.645  ops/ms
MinMaxReductionBench.longMin             50     100  thrpt    8  12027.222 ±  42.673  ops/ms
MinMaxReductionBench.longMin             50    1000  thrpt    8   1350.302 ±   5.152  ops/ms
MinMaxReductionBench.longMin             50   10000  thrpt    8    137.061 ±   0.067  ops/ms
MinMaxReductionBench.longMin             80     100  thrpt    8  12246.891 ± 100.015  ops/ms
MinMaxReductionBench.longMin             80    1000  thrpt    8   1350.864 ±   4.566  ops/ms
MinMaxReductionBench.longMin             80   10000  thrpt    8    136.926 ±   0.299  ops/ms
MinMaxReductionBench.longMin            100     100  thrpt    8  12257.758 ±  99.920  ops/ms
MinMaxReductionBench.longMin            100    1000  thrpt    8   1349.478 ±   1.063  ops/ms
MinMaxReductionBench.longMin            100   10000  thrpt    8    136.912 ±   0.063  ops/ms
```

The performance numbers are higher than current master across all branch probabilities and sizes.

[1] https://github.com/openjdk/jdk/pull/20098