A DESCRIPTION OF THE REQUEST :
An API to access SIMD instructions.
We need an API to access SIMD instructions, to take advantage of hardware-accelerated vector math. On the latest CPUs vector math is a factor of 4 faster, which makes Java look inferior. This can be fixed relatively easily if we gain indirect access to these CPU instructions.
An example method, with an array variant and a FloatBuffer variant:

void add4(float[] op1, int off1,
          float[] op2, int off2,
          float[] dst, int offDst)

void add4(FloatBuffer op1, int off1,
          FloatBuffer op2, int off2,
          FloatBuffer dst, int offDst)
The default (bytecode) implementation of this method would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];
These methods are turned into intrinsics at runtime (like sun.misc.Unsafe), using the vector instructions of the current platform.
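A self-contained sketch of such a helper class is shown below. The class name SIMD and the add4 signature come from this request; mul4 is inferred from the usage example further down, and the scalar bodies are only the portable fallback that a VM intrinsic would replace with platform vector instructions:

```java
// Hypothetical sketch, not an existing JDK class: the scalar bodies are
// the bytecode fallback; an intrinsified version would map each method
// to one platform SIMD instruction (e.g. ADDPS / MULPS on x86 SSE).
public final class SIMD {
    private SIMD() {}

    // dst[offDst..offDst+3] = op1[off1..off1+3] + op2[off2..off2+3]
    public static void add4(float[] op1, int off1,
                            float[] op2, int off2,
                            float[] dst, int offDst) {
        dst[offDst + 0] = op1[off1 + 0] + op2[off2 + 0];
        dst[offDst + 1] = op1[off1 + 1] + op2[off2 + 1];
        dst[offDst + 2] = op1[off1 + 2] + op2[off2 + 2];
        dst[offDst + 3] = op1[off1 + 3] + op2[off2 + 3];
    }

    // dst[offDst..offDst+3] = op1[off1..off1+3] * op2[off2..off2+3]
    public static void mul4(float[] op1, int off1,
                            float[] op2, int off2,
                            float[] dst, int offDst) {
        dst[offDst + 0] = op1[off1 + 0] * op2[off2 + 0];
        dst[offDst + 1] = op1[off1 + 1] * op2[off2 + 1];
        dst[offDst + 2] = op1[off1 + 2] * op2[off2 + 2];
        dst[offDst + 3] = op1[off1 + 3] * op2[off2 + 3];
    }
}
```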
With SIMD instructions one can do (theoretically) 4 operations at a time. While most modern CPUs take 2+ cycles internally per SIMD instruction, this still yields great performance improvements; in the latest (and upcoming) x86 CPUs these operations are performed in 1 cycle internally.
The performance of the HotSpot JIT is ever increasing, but the gap between the VM and native executables using SIMD is widening. In vector-based code and other SIMD-friendly mathematical algorithms, performance can improve by 200% to 400%, depending on the CPU's SIMD implementation.
__Making the JIT perform this optimisation behind the scenes is not sufficient__
Programmers can invent smart(er) ways of laying out data to make it SIMD-friendly or SIMD-optimal, while the JIT might overlook such cases or consider them too complex.
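As an illustration (not part of the proposed API), one such programmer-side trick is packing vector components into a single float array instead of an array of vector objects, so each 4-component vector is contiguous in memory and a 4-wide SIMD load can fetch it in one instruction:

```java
// Illustrative only: an array of vector objects scatters x/y/z/w across
// the heap and defeats 4-wide SIMD loads; a packed float[] keeps each
// (x, y, z, w) vector contiguous, with vector i at packed[i*4 .. i*4+3].
int n = 1024;
float[] packed = new float[n * 4];

// write vector 0 = (1, 2, 3, 4)
packed[0] = 1f; packed[1] = 2f; packed[2] = 3f; packed[3] = 4f;
```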
EXPECTED VERSUS ACTUAL BEHAVIOR :
float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];

int end = vectors * 4;
for (int i = 0; i < end; i += 4) {
    SIMD.mul4(src, i, scale, 0, dst, i);       // dst = src * scale
    SIMD.add4(dst, i, translate, 0, dst, i);   // dst = dst + translate
}
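For comparison, the scalar loop that the JIT would have to auto-vectorize on its own is sketched below, with concrete values filled in (same scale-then-translate semantics as the example above; the values are illustrative):

```java
// Scalar baseline: dst = src * scale + translate, per 4-float vector.
// A hypothetical intrinsified mul4/add4 pair would handle each group
// of 4 floats in two SIMD instructions instead of 8 scalar operations.
int vectors = 2;
float[] scale     = {2f, 2f, 2f, 2f};
float[] translate = {1f, 1f, 1f, 1f};
float[] src = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
float[] dst = new float[vectors * 4];

int end = vectors * 4;
for (int i = 0; i < end; i += 4) {
    for (int j = 0; j < 4; j++) {
        dst[i + j] = src[i + j] * scale[j] + translate[j];
    }
}
// dst is now {3, 5, 7, 9, 11, 13, 15, 17}
```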