Bug ID: JDK-8214751 X86: Support for VNNI Instructions

Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 12

Priority: P4
Status: Resolved
Resolution: Fixed
CPU: x86

Submitted: 2018-12-04
Updated: 2022-05-16
Resolved: 2018-12-12

Versions (Unresolved/Resolved/Fixed)

JDK 12	JDK 13
12 b24 (Fixed)	13 (Fixed)

Related Reports

Relates :	JDK-8219151 - Illegal instruction exception on JDK 12 due to incorrect CPU feature bits
Relates :	JDK-8236701 - [TESTBUG] compiler/loopopts/superword/Vec_MulAddS2I.java uses wrong flag -XX:-SuperWord
Relates :	JDK-8215353 - x86_32 build failures after JDK-8214751 (X86: Support for VNNI Instructions)
Relates :	JDK-8215891 - X86: Support for VNNI byte Instruction VPDPBUSD
Relates :	JDK-8216580 - Fix generation of VNNI vector code by allowing adjacent LoadS nodes to be isomorphic
Relates :	JDK-8216050 - Superword optimization fails with assert(0 <= i && i < _len) failed: illegal index
Relates :	JDK-8229694 - JVM crash in SWPointer during C2 OSR compilation
Relates :	JDK-8230185 - assert(is_Loop()) failed: invalid node class
Relates :	JDK-8230078 - compiler/loopopts/superword/Vec_MulAddS2I.java is unexpectedly slow in windows
Relates :	JDK-8239549 - AArch64: Backend support for MulAddVS2VI node

Description

This enhancement adds support for the VNNI VPDPWSSD instruction through auto-vectorization.

It can vectorize the following operation inside a loop:
out[i] += ((in1[2*i] * in2[2*i]) + (in1[2*i+1] * in2[2*i+1]));
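
For illustration, the loop shape above could be written as a self-contained Java kernel like the sketch below. The class name, method name, and array types (short inputs, int accumulator) are assumptions made for this example; they are inferred from the word-sized operands of VPDPWSSD and are not taken from the patch itself.

```java
// Hypothetical example class; not part of the patch.
final class MulAddS2I {
    // Multiplies adjacent pairs of 16-bit values and accumulates the
    // sum of each pair of products into a 32-bit output element.
    // Assumes in1.length and in2.length are at least 2 * out.length.
    static void mulAdd(short[] in1, short[] in2, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += ((in1[2 * i] * in2[2 * i]) + (in1[2 * i + 1] * in2[2 * i + 1]));
        }
    }

    public static void main(String[] args) {
        short[] in1 = {1, 2, 3, 4};
        short[] in2 = {5, 6, 7, 8};
        int[] out = {10, 20};
        mulAdd(in1, in2, out);
        System.out.println(out[0] + " " + out[1]); // prints 27 73
    }
}
```

Whether C2 actually vectorizes a given loop of this shape depends on the CPU, the JVM flags, and the loop bounds, so this is only a sketch of the pattern the description refers to.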

This patch is useful for AI (ML/DL) applications such as convolution-based neural networks.

More information on VNNI can be found here:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

Code contributed by: razvan.a.lupusoru@intel.com and vdeshpande (vivek.r.deshpande@intel.com)

The initial performance gain measured with a microbenchmark on Skylake with AVX3 is 10.8x, and it generates the following code:
vmovdqu                  xmm3, xmmword ptr [rbp+r8*2+0x10]  
vmovdqu                  xmm6, xmmword ptr [rdx+r8*2+0x10]  
vpmaddwd                 xmm3, xmm6, xmm3  
vpaddd                   xmm3, xmm3, xmmword ptr [r9+rdi*4+0x10]  
vmovdqu                  xmmword ptr [r9+rdi*4+0x10], xmm3  

It can generate the vpdpwssd instruction on Cascade Lake.
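
As a rough scalar model of what these instructions compute per 32-bit lane, the sketch below emulates vpmaddwd (multiply signed 16-bit pairs, add adjacent products) and vpdpwssd (the same, fused with accumulation into the destination, without saturation). The class and method names are hypothetical and exist only for this illustration; this is not how HotSpot implements the instructions.

```java
// Hypothetical scalar emulation; names are illustrative only.
final class VnniLanes {
    // Models vpmaddwd: result lane j = a[2j]*b[2j] + a[2j+1]*b[2j+1].
    static int[] pmaddwd(short[] a, short[] b) {
        int[] r = new int[a.length / 2];
        for (int j = 0; j < r.length; j++) {
            r[j] = a[2 * j] * b[2 * j] + a[2 * j + 1] * b[2 * j + 1];
        }
        return r;
    }

    // Models vpdpwssd: the pair-wise dot product is added to the
    // accumulator lane in one step (wrapping 32-bit arithmetic).
    static int[] dpwssd(int[] acc, short[] a, short[] b) {
        int[] r = acc.clone();
        for (int j = 0; j < r.length; j++) {
            r[j] += a[2 * j] * b[2 * j] + a[2 * j + 1] * b[2 * j + 1];
        }
        return r;
    }

    public static void main(String[] args) {
        short[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        short[] b = {1, 1, 1, 1, 1, 1, 1, 1};
        int[] acc = {100, 200, 300, 400};
        System.out.println(java.util.Arrays.toString(pmaddwd(a, b)));     // prints [3, 7, 11, 15]
        System.out.println(java.util.Arrays.toString(dpwssd(acc, a, b))); // prints [103, 207, 311, 415]
    }
}
```

In this model, the vpmaddwd plus vpaddd pair in the listing above and a single vpdpwssd produce the same lane values; the difference is that the latter needs one instruction instead of two.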

The webrev is here:
http://cr.openjdk.java.net/~vdeshpande/8214751/VNNI/webrev.00/

Comments

Git URL: https://github.com/openjdk/jdk/commit/05e175bf1beeaecc24c120846595347cf08dd2c0
16-05-2022
URL: http://hg.openjdk.java.net/jdk/jdk/rev/4bb6e0871bf7 User: kvn Date: 2018-12-12 22:47:53 +0000
12-12-2018
Testing webrev.03 passed with known unrelated failures.
12-12-2018
Thanks Vladimir. I tested the change on my development machine.
12-12-2018
I added is_valid_counted_loop() in convert_add_to_muladd() and refactored code a little: http://cr.openjdk.java.net/~kvn/8214751/webrev.03/
12-12-2018
The updated webrev is here: http://cr.openjdk.java.net/~vdeshpande/8214751/VNNI/webrev.02/
10-12-2018