A DESCRIPTION OF THE REQUEST :
An API to access SIMD instructions.
We need an API to access SIMD instructions so that Java code can take advantage of hardware-accelerated vector math. On the latest CPUs, vector math using these instructions can be a factor of 4 faster, which makes Java look inferior. This can be fixed relatively easily by giving Java programs indirect access to these CPU instructions.
java.lang.math.SIMD.add4(
float[] op1, int off1,
float[] op2, int off2,
float[] dst, int offDst
)
java.lang.math.SIMD.add4(
FloatBuffer op1, int off1,
FloatBuffer op2, int off2,
FloatBuffer dst, int offDst
)
The default (bytecode) implementation of this method would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];
These methods would be turned into intrinsics at runtime (as is done for sun.misc.Unsafe), using the vector instructions of the current platform.
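As a rough illustration of the proposal, the pure-Java fallback for both overloads could look like the sketch below. The class name SimdFallback is hypothetical and merely stands in for the proposed java.lang.math.SIMD class, which does not exist; nothing here is an intrinsic, it only spells out the element-wise semantics a JIT would have to preserve.

```java
import java.nio.FloatBuffer;

// Hypothetical stand-in for the proposed SIMD class: plain Java, no intrinsics.
final class SimdFallback {
    private SimdFallback() {}

    // dst[offDst + k] = op1[off1 + k] + op2[off2 + k] for k = 0..3
    static void add4(float[] op1, int off1,
                     float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] + op2[off2 + k];
        }
    }

    // Same operation on buffers, using absolute get/put so that the
    // buffers' positions are left untouched.
    static void add4(FloatBuffer op1, int off1,
                     FloatBuffer op2, int off2,
                     FloatBuffer dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst.put(offDst + k, op1.get(off1 + k) + op2.get(off2 + k));
        }
    }
}
```

Using absolute indexing in the FloatBuffer overload is an assumption made here to mirror the explicit offsets of the array overload.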
JUSTIFICATION :
With SIMD instructions one can (theoretically) perform 4 operations at a time. Although most modern CPUs take 2 or more cycles internally to execute a SIMD instruction, this still yields great performance improvements. In the latest (and upcoming) x86 CPUs, these operations are performed in 1 cycle internally.
The performance of the HotSpot JIT is ever increasing, but the gap between the VM and native executables that use SIMD is widening. In vector-based code, or in other SIMD-friendly mathematical algorithms, performance can improve by a factor of 2 to 4, depending on the CPU's SIMD implementation.
__Making the JIT perform this optimisation behind the scenes is not sufficient__
Programmers can invent smart(er) ways of arranging data to make it SIMD-friendly / SIMD-optimal, while the JIT might overlook such cases or consider them too complex.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];
int end = vectors * 4;
for (int i = 0; i < end; i += 4)
{
    // dst = src * scale, then dst = dst + translate
    SIMD.mul4(src, i, scale, 0, dst, i);
    SIMD.add4(dst, i, translate, 0, dst, i);
}
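For reference, here is a self-contained sketch of the same scale-then-translate loop that runs on today's JVMs using scalar fallbacks. The class TransformSketch and its mul4/add4 helpers are hypothetical placeholders for the proposed SIMD methods. The sketch assumes the intent is dst = src * scale + translate, so the add step reads the product back from dst rather than from src.

```java
// Hypothetical scalar stand-ins for the proposed SIMD.mul4 / SIMD.add4.
final class TransformSketch {
    private TransformSketch() {}

    static void mul4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] * op2[off2 + k];
        }
    }

    static void add4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] + op2[off2 + k];
        }
    }

    // Applies dst = src * scale + translate to each packed 4-float vector.
    // Trailing elements that do not fill a whole vector are left at 0.
    static float[] transform(float[] src, float[] scale, float[] translate) {
        float[] dst = new float[src.length];
        for (int i = 0; i + 4 <= src.length; i += 4) {
            mul4(src, i, scale, 0, dst, i);     // dst = src * scale
            add4(dst, i, translate, 0, dst, i); // dst = dst + translate
        }
        return dst;
    }
}
```

With intrinsic versions of mul4 and add4, each loop iteration would ideally compile down to one vector multiply and one vector add.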