JDK-6526380 : Add API to access SIMD instructions
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.lang
  • Affected Version: 6
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2007-02-19
  • Updated: 2011-02-16
  • Resolved: 2007-03-01
Related Reports
Relates :  
An API to access SIMD instructions.

We need an API to access SIMD instructions, to take advantage of hardware-accelerated vector math. On the latest CPUs vector math is a factor of 4 faster, which makes Java look inferior. This can be fixed relatively easily by gaining indirect access to these CPU instructions.

public static void add4(float[] op1, int off1,
                        float[] op2, int off2,
                        float[] dst, int offDst);

public static void add4(FloatBuffer op1, int off1,
                        FloatBuffer op2, int off2,
                        FloatBuffer dst, int offDst);

The default (bytecode) implementation of this method would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];
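A minimal Java sketch of such a helper class; the class name SIMD and the static method add4 are assumptions taken from the usage example in this report, and the body is just the pure-bytecode fallback (a real implementation would be intrinsified by the JIT, e.g. to a single ADDPS on x86/SSE):

```java
// Hypothetical sketch of the proposed API; names are assumptions
// from this report, not a real JDK class.
public final class SIMD {
    private SIMD() {}

    // Default (pure-bytecode) implementation: element-wise add of
    // four floats. A JIT could replace the call with one packed add.
    public static void add4(float[] op1, int off1,
                            float[] op2, int off2,
                            float[] dst, int offDst) {
        dst[offDst + 0] = op1[off1 + 0] + op2[off2 + 0];
        dst[offDst + 1] = op1[off1 + 1] + op2[off2 + 1];
        dst[offDst + 2] = op1[off1 + 2] + op2[off2 + 2];
        dst[offDst + 3] = op1[off1 + 3] + op2[off2 + 3];
    }
}
```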

These methods would be turned into intrinsics at runtime (like sun.misc.Unsafe), using the vector instructions of the current platform.

With SIMD instructions one can (theoretically) do 4 operations at a time. Even though most modern CPUs execute a SIMD instruction in 2+ cycles internally, this yields great performance improvements. In the latest (and upcoming) x86 CPUs, these operations are performed in 1 cycle internally.

The performance of the HotSpot JIT is ever increasing, but the gap between the VM and native executables using SIMD is widening. In vector-based code, or other mathematical SIMD-friendly algorithms, performance can improve by 200% - 400%, depending on the CPU's SIMD implementation.

__Making the JIT perform this optimisation behind the scenes is not sufficient__

  Programmers can invent smart(er) ways of arranging data to make it SIMD-friendly / SIMD-optimal, while the JIT might overlook such cases or consider them too complex.

float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];

int end = vectors * 4;
for (int i = 0; i < end; i += 4) {
    SIMD.mul4(src, i, scale, 0, dst, i);      // dst = src * scale
    SIMD.add4(dst, i, translate, 0, dst, i);  // dst = dst + translate
}
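A self-contained version of the scale-and-translate loop above, with plain-Java stand-ins for the hypothetical SIMD.mul4/add4 intrinsics (all names here are assumptions from this report, not a real JDK API):

```java
// Sketch only: mul4/add4 are assumed helpers standing in for the
// proposed intrinsics; a JIT would map them to packed instructions.
public final class TransformDemo {
    static void mul4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int i = 0; i < 4; i++)
            dst[offDst + i] = op1[off1 + i] * op2[off2 + i];
    }

    static void add4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int i = 0; i < 4; i++)
            dst[offDst + i] = op1[off1 + i] + op2[off2 + i];
    }

    // Scale each 4-component vector in src, then translate it:
    // dst = src * scale + translate.
    static void transform(float[] src, float[] dst,
                          float[] scale, float[] translate, int vectors) {
        int end = vectors * 4;
        for (int i = 0; i < end; i += 4) {
            mul4(src, i, scale, 0, dst, i);
            add4(dst, i, translate, 0, dst, i);
        }
    }
}
```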

EVALUATION For the Java platform, it would be very uncharacteristic to provide an API of this sort. The details of the SIMD instructions differ across architectures (and over time), and idioms that ran faster on some platforms could run slower on others. These sorts of transformations are better left to the JVM, which can more flexibly accommodate any issues of alignment and padding, data dependencies, etc. Closing as Will Not Fix.