This also adds autovectorization support for the VNNI VPDPBUSD instruction.
The autovectorizer can now vectorize a loop body of the form: out[i] += ((in1[4*i] * in2[4*i]) + (in1[4*i+1] * in2[4*i+1]) + (in1[4*i+2] * in2[4*i+2]) + (in1[4*i+3] * in2[4*i+3])); where in1[] and in2[] are byte arrays and out[] is an int array.
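For illustration, the loop shape above could be written out as the following minimal sketch. It assumes Java as the source language; the class name DotProductSketch and the method name dot4 are hypothetical and not part of the patch. It shows only the scalar pattern and does not demonstrate that any particular compiler or CPU will emit VPDPBUSD for it.

```java
public class DotProductSketch {
    // Scalar form of the loop pattern: for each output element, four
    // adjacent byte products are summed and accumulated into an int.
    // Assumes in1 and in2 each hold at least 4 * out.length elements.
    public static void dot4(byte[] in1, byte[] in2, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] += ((in1[4 * i] * in2[4 * i])
                     + (in1[4 * i + 1] * in2[4 * i + 1])
                     + (in1[4 * i + 2] * in2[4 * i + 2])
                     + (in1[4 * i + 3] * in2[4 * i + 3]));
        }
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] b = {1, 1, 1, 1, 2, 2, 2, 2};
        int[] out = new int[2];
        dot4(a, b, out);
        System.out.println(out[0] + " " + out[1]); // prints "10 52"
    }
}
```

Each out[i] consumes one group of four bytes from each input, which is the 4-to-1 byte-to-int reduction described above.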
This patch is useful for machine learning and deep learning applications such as convolution-based neural networks.
More information on VNNI can be found here: https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf