Bug ID: JDK-6526380 Add API to access SIMD instructions

Details
Type:
Enhancement
Submit Date:
2007-02-19
Status:
Closed
Updated Date:
2011-02-16
Project Name:
JDK
Resolved Date:
2007-03-01
Component:
core-libs
OS:
windows_xp
Sub-Component:
java.lang
CPU:
x86
Priority:
P4
Resolution:
Won't Fix
Affected Versions:
6
Fixed Versions:

Related Reports
Relates:

Sub Tasks

Description
A DESCRIPTION OF THE REQUEST :
An API to access SIMD instructions.

We need an API to access SIMD instructions and take advantage of hardware acceleration of vector math. On the latest CPUs, vector math is up to a factor of 4 faster, which makes Java look inferior. This can be fixed relatively easily by giving us indirect access to these CPU instructions.

java.lang.math.SIMD.add4(
                       float[] op1, int off1,
                       float[] op2, int off2,
                       float[] dst, int offDst
)

java.lang.math.SIMD.add4(
                       FloatBuffer op1, int off1,
                       FloatBuffer op2, int off2,
                       FloatBuffer dst, int offDst
)


The default (bytecode) implementation of this method would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];

These methods would be turned into intrinsics at runtime (like sun.misc.Unsafe), using the vector instructions of the current platform.
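A minimal sketch of the proposed fallback implementation follows. The class name SIMD and the methods add4 and mul4 are hypothetical, taken from this request; they do not exist in the JDK.

```java
// Hypothetical fallback implementation of the proposed API.
// A JIT could replace each method body with a single packed
// instruction on supporting hardware (e.g. ADDPS / MULPS on x86 SSE).
public final class SIMD {
    private SIMD() {}

    // dst[offDst..offDst+3] = op1[off1..off1+3] + op2[off2..off2+3]
    public static void add4(float[] op1, int off1,
                            float[] op2, int off2,
                            float[] dst, int offDst) {
        dst[offDst    ] = op1[off1    ] + op2[off2    ];
        dst[offDst + 1] = op1[off1 + 1] + op2[off2 + 1];
        dst[offDst + 2] = op1[off1 + 2] + op2[off2 + 2];
        dst[offDst + 3] = op1[off1 + 3] + op2[off2 + 3];
    }

    // dst[offDst..offDst+3] = op1[off1..off1+3] * op2[off2..off2+3]
    public static void mul4(float[] op1, int off1,
                            float[] op2, int off2,
                            float[] dst, int offDst) {
        dst[offDst    ] = op1[off1    ] * op2[off2    ];
        dst[offDst + 1] = op1[off1 + 1] * op2[off2 + 1];
        dst[offDst + 2] = op1[off1 + 2] * op2[off2 + 2];
        dst[offDst + 3] = op1[off1 + 3] * op2[off2 + 3];
    }
}
```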

JUSTIFICATION :
With SIMD instructions one can (theoretically) perform 4 operations at a time. Even though most modern CPUs execute a SIMD instruction in 2+ cycles internally, this yields great performance improvements. In the latest (and upcoming) x86 CPUs, these operations are performed in 1 cycle internally.

The performance of the HotSpot JIT is ever increasing, but the gap between the VM and a native executable using SIMD is widening. In vector-based code, or other SIMD-friendly mathematical algorithms, performance can improve by 200% to 400%, depending on the CPU's SIMD implementation.

__Making the JIT perform this optimisation behind the scenes is not sufficient__

  Programmers can invent smarter ways of dealing with data to make it SIMD-friendly or SIMD-optimal, while the JIT might overlook such cases, or consider them too complex.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];

int end = vectors * 4;
for(int i=0; i<end; i+=4)
{
   SIMD.mul4(src, i, scale, 0, dst, i);
   SIMD.add4(dst, i, translate, 0, dst, i);
}
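For comparison, the scalar equivalent of the scale-then-translate loop above, in plain Java with no hypothetical API (each run of four floats is treated as one 4-component vector):

```java
// Scalar version of the expected usage: dst = src * scale + translate,
// applied per 4-float vector. This is the loop the proposed SIMD
// calls would replace with packed multiply and add instructions.
public final class ScalarTransform {
    public static void transform(float[] src, float[] scale,
                                 float[] translate, float[] dst) {
        for (int i = 0; i < src.length; i += 4) {
            for (int j = 0; j < 4; j++) {
                dst[i + j] = src[i + j] * scale[j] + translate[j];
            }
        }
    }
}
```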

                                    

Comments
EVALUATION

For the Java platform, it would be very uncharacteristic to provide an API of this sort.  The details of the SIMD instructions differ across architectures (and over time), and idioms that ran faster on some platforms could run slower on others.

These sorts of transformations are better left to the JVM, which can more flexibly accommodate any issues of alignment and padding, data dependencies, etc.
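As an illustration of that point, a plain element-wise loop of the form below is the kind of code HotSpot's automatic (superword) vectorization can already target, without any SIMD-specific API; the JVM picks packed instructions suited to the actual CPU and handles alignment itself. The class and method names here are illustrative only.

```java
// A simple loop like this is a candidate for HotSpot's automatic
// superword vectorization: the compiler may emit packed adds for
// groups of array elements, falling back to scalar code otherwise.
public final class AutoVec {
    public static void add(float[] a, float[] b, float[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }
}
```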

Closing as will not fix.
                                     
2007-03-01


