A DESCRIPTION OF THE REQUEST :
On x86 processors, 128-bit SSE data should be 16-byte aligned. There is an unaligned load instruction, but it is between two and ten times slower than an aligned load. In Java applications that use native code to implement fast vector processing, there is no way to pass pre-aligned vector data to JNI code.
Previous RFE dealing with 8-byte alignment: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4659977
JUSTIFICATION :
When doing a lot of numeric processing on vector data (e.g. co-ordinate groups), the best performance is obtained by using a native method that pins the relevant arrays (with GetPrimitiveArrayCritical) while carrying out the calculation.
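As a rough illustration of this pattern, here is a minimal sketch of such a native method in C, assuming a float[] input; the VectorOps class and sum method names are invented for the example, while GetPrimitiveArrayCritical/ReleasePrimitiveArrayCritical are the real JNI calls.

#include <jni.h>

JNIEXPORT jfloat JNICALL
Java_VectorOps_sum(JNIEnv *env, jclass cls, jfloatArray arr)
{
    jsize len = (*env)->GetArrayLength(env, arr);
    jfloat *data;
    jfloat total = 0.0f;
    jsize i;

    /* Pin the array so the GC cannot move or copy it during the loop. */
    data = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
    if (data == NULL)
        return 0.0f;                 /* out of memory */

    for (i = 0; i < len; i++)        /* real code would use SSE in this loop */
        total += data[i];

    /* JNI_ABORT: the data was only read, nothing needs copying back. */
    (*env)->ReleasePrimitiveArrayCritical(env, arr, data, JNI_ABORT);
    return total;
}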
Currently, applications that use 128-bit SSE numeric code via JNI have four options:
1) Use the unaligned loadu_ps/storeu_ps intrinsics instead of load_ps/store_ps.
2) Check whether the start address is 16-byte aligned; if it is not, shift the whole array contents forward to align them, do the calculation, then shift the contents back again.
3) As above, but don't shift the array back again. Instead, pass an offset back to the Java code specifying how many elements the array has been shifted by, and take account of this when using the results. The array only has to be realigned if the VM later moves it to an unaligned address.
4) Use Get<Type>ArrayRegion and Set<Type>ArrayRegion to copy the array contents to a pre-aligned buffer and back (sketched after this list).
All of these have a significant performance penalty.
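As a rough illustration of option 4, the native method might look something like the sketch below, assuming a float[] that is scaled in place; the class and method names are invented, while the JNI region functions and the SSE intrinsics are real (the intrinsic spellings in C are _mm_load_ps, _mm_loadu_ps, etc.).

#include <jni.h>
#include <xmmintrin.h>  /* SSE intrinsics; also provides _mm_malloc/_mm_free on most compilers */

JNIEXPORT void JNICALL
Java_VectorOps_scaleViaCopy(JNIEnv *env, jclass cls, jfloatArray arr, jfloat factor)
{
    jsize len = (*env)->GetArrayLength(env, arr);
    jsize i;
    __m128 f = _mm_set1_ps(factor);

    /* A 16-byte-aligned scratch buffer on the native heap. */
    jfloat *buf = _mm_malloc(len * sizeof(jfloat), 16);
    if (buf == NULL)
        return;

    /* Copy in, process with aligned loads/stores, copy back out. */
    (*env)->GetFloatArrayRegion(env, arr, 0, len, buf);
    for (i = 0; i + 4 <= len; i += 4)
        _mm_store_ps(buf + i, _mm_mul_ps(_mm_load_ps(buf + i), f));
    for (; i < len; i++)             /* scalar tail for lengths not divisible by 4 */
        buf[i] *= factor;
    (*env)->SetFloatArrayRegion(env, arr, 0, len, buf);

    _mm_free(buf);
}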
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Integer and FP 128-bit vector native types would be ideal, but that's a lot of work and more language complexity, so it isn't really a realistic RFE. Here are a couple of more practical solutions:
1) All array objects are 8-byte aligned. AFAIK the header is 12 bytes on 32-bit VMs and 24 bytes on 64-bit ones, so the contents are in theory 4-byte aligned on 32-bit VMs and 8-byte aligned on 64-bit ones, except for long/double arrays, whose contents are always 8-byte aligned even on 32-bit VMs (wasting 4 bytes of padding). Possible solutions are:
a) Always 16-byte align long/double array contents.
Benefits: aligned SSE operations are now possible in JNI code with no extra work. This may also help if HotSpot ever starts autovectorising loops (i.e. if RFEs like 6604786 are implemented). Presumably simple to implement.
Drawbacks: wastes an average of 4 bytes per long/double array on 64-bit VMs, and increases the average waste from 4 to 8 bytes on 32-bit VMs. This is fine for SSE2 double-precision operations, and can accommodate SSE integer operations on 32-bit and smaller element types with only moderate inconvenience, as packing and unpacking the results into/from Java long arrays is relatively simple and fast. However, it isn't really useful for SSE single-precision FP operations.
b) Always 16-byte align array contents for all types except reference and boolean.
Benefits: as above, but all SSE vector operations can now be used easily (including all integer sizes and single-precision FP).
Drawbacks: would waste an average of 8 bytes per primitive array (except boolean arrays) on 32-bit VMs, and half that on 64-bit VMs. This is almost certainly unacceptable.
c) Always 16-byte align array contents for all types except reference and boolean, but only if the array length is above a reasonable threshold (e.g. 1024 elements). It does not make sense to use JNI code to process small arrays; in fact most applications will only do this for large datasets (this probably also holds for any future HotSpot autovectorisation). Thus only large arrays really benefit from alignment.
Benefits: all SSE vector operations can now be used with aligned loads/stores on large arrays of all types. The memory penalty for other applications is insignificant (<1%, and can be reduced further with a larger threshold).
Drawbacks: the main drawback is that JNI code that has to handle small arrays as an edge case still needs a second code path to cope with unaligned arrays. This isn't a big deal, as it's just a copy-paste of the same inner loops using the loadu_ps/storeu_ps intrinsics instead of load_ps/store_ps (sketched below). A minor drawback is that the HotSpot allocation and GC code will be slightly more complicated, due to the array size check (though this will only be a slight elaboration of the existing array element size check for 8-byte alignment).
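To make that second code path concrete, here is a hedged sketch of such a dual-path native method (illustrative class and method names again): under 1c, above-threshold arrays would always take the aligned branch, and only small arrays would ever hit the unaligned fallback.

#include <jni.h>
#include <stdint.h>
#include <xmmintrin.h>

JNIEXPORT void JNICALL
Java_VectorOps_scale(JNIEnv *env, jclass cls, jfloatArray arr, jfloat factor)
{
    jsize len = (*env)->GetArrayLength(env, arr);
    jsize i = 0;
    __m128 f = _mm_set1_ps(factor);
    jfloat *data;

    data = (*env)->GetPrimitiveArrayCritical(env, arr, NULL);
    if (data == NULL)
        return;

    if (((uintptr_t)data & 15) == 0) {
        /* Fast path: contents are 16-byte aligned (always true for
           above-threshold arrays under proposal 1c). */
        for (; i + 4 <= len; i += 4)
            _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), f));
    } else {
        /* Fallback: identical loop with the unaligned intrinsics. */
        for (; i + 4 <= len; i += 4)
            _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f));
    }
    for (; i < len; i++)             /* scalar tail */
        data[i] *= factor;

    (*env)->ReleasePrimitiveArrayCritical(env, arr, data, 0);
}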
2) An explicit 16-byte alignment flag could be added to the object properties, for example by creating a '16-byte aligned' equivalent (internal) type for each of the existing primitive array types. Since we're using JNI anyway, this does not need to be accessible through the Java API. Instead, equivalents of the New<Type>Array functions in the JNI API could support the creation of aligned arrays in JNI code. Typical SSE-optimised code would look like this:
DataStructure ds = createDataStructure(num_elements); // native method
ds.readData(source); // Java method
processData(ds); // native method
ds.writeResults(target); // Java method
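On the native side, createDataStructure might then look something like the sketch below. NewAlignedFloatArray does not exist in today's JNI; its name and signature are assumed purely to show the shape of the proposed API, and the DataStructure(float[]) constructor and VectorOps class are likewise invented for the example.

#include <jni.h>

/* Hypothetical JNI addition proposed here (it does NOT exist today):
   allocate a float[] whose contents are 16-byte aligned and remain so
   across GC.  The name and signature are assumptions for illustration. */
jfloatArray NewAlignedFloatArray(JNIEnv *env, jsize length, jint alignment);

JNIEXPORT jobject JNICALL
Java_VectorOps_createDataStructure(JNIEnv *env, jclass cls, jint numElements)
{
    /* Four floats per element (e.g. an x/y/z/w co-ordinate group). */
    jfloatArray arr = NewAlignedFloatArray(env, numElements * 4, 16);
    jclass dsClass;
    jmethodID ctor;

    if (arr == NULL)
        return NULL;

    /* Wrap the aligned array in the Java-side DataStructure object;
       a DataStructure(float[]) constructor is assumed for the example. */
    dsClass = (*env)->FindClass(env, "DataStructure");
    if (dsClass == NULL)
        return NULL;
    ctor = (*env)->GetMethodID(env, dsClass, "<init>", "([F)V");
    if (ctor == NULL)
        return NULL;
    return (*env)->NewObject(env, dsClass, ctor, arr);
}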
Benefits: no effect on existing applications, no per-array memory penalty, works for all SSE element types, and only requires one code path. Presumably compatible with any future HotSpot autovectorisation.
Drawbacks: slightly harder for JNI users to set up (though this is compensated for by not needing an unaligned code path). This solution involves the most additional VM complexity, and requires additional functions in the JNI API.
IMHO solution 2 is the best one for developers and the most elegant, though 1c is a reasonable compromise if 2 is too complex to implement.