JDK-6604786 : SSE optimization for basic elementwise array operations
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 7
  • Priority: P5
  • Status: Closed
  • Resolution: Fixed
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2007-09-14
  • Updated: 2016-04-19
  • Resolved: 2016-04-19
JDK 9: 9 (Fixed)
Description
A DESCRIPTION OF THE REQUEST :
Some image-processing and other algorithms are built on loops that apply a simple operation to every element of two or more arrays. For example, the elementwise unsigned maximum of byte arrays is the basis of morphological image filters:

    for (int srcPosMax = srcPos + count; srcPos < srcPosMax; srcPos++, destPos++) {
        if ((src[srcPos] & 0xFF) > (dest[destPos] & 0xFF))
            dest[destPos] = src[srcPos];
    }

The elementwise sum of int or float arrays is the basis of linear and other filters:

        for (int srcPosMax = srcPos + count; srcPos < srcPosMax; srcPos++, destPos++) {
            dest[destPos] += src[srcPos];
        }

All Intel processors since the Pentium II offer SIMD instructions that can speed up such loops considerably: MMX on the earliest of these processors, SSE and SSE2 on later ones. A single instruction can apply operations such as minimum, maximum, saturated or ordinary sum or difference to several elements at once: 8 bytes, 4 shorts, or 2 ints in a 64-bit MMX register, and twice as many integer elements, or 4 floats, in a 128-bit SSE/SSE2 register. This can make such a loop several times faster than the simple scalar version.
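
To make the per-element semantics concrete, here is a minimal scalar sketch in Java of an unsigned saturated sum of bytes, that is, what a packed instruction such as PADDUSB computes for each byte lane. The class and method names (SaturatedOps, addSaturatedUnsigned) are hypothetical and exist only for illustration:

```java
// Hypothetical helper, not part of the JDK: scalar reference for an unsigned saturated byte sum.
final class SaturatedOps {
    private SaturatedOps() {}

    /** dest[destPos + i] = min(255, dest[destPos + i] + src[srcPos + i]), treating bytes as unsigned. */
    static void addSaturatedUnsigned(byte[] src, int srcPos, byte[] dest, int destPos, int count) {
        for (int srcPosMax = srcPos + count; srcPos < srcPosMax; srcPos++, destPos++) {
            int sum = (src[srcPos] & 0xFF) + (dest[destPos] & 0xFF); // widen both bytes to unsigned ints
            dest[destPos] = (byte) Math.min(sum, 0xFF);              // clamp at 255 instead of wrapping
        }
    }
}
```

With an ordinary byte addition, 200 + 100 would wrap around to 44; the saturated form clamps the result to 255, which is usually what image filters want.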

Unfortunately, the JVM does not perform this optimization. I think it would be a good idea for the HotSpot compiler to recognize loops like those listed above and translate them into native SSE instructions on Intel processors. Loops used for elementwise array processing are usually very simple and easy to recognize, so I think the necessary change to the HotSpot optimizer should not be too complex.

Alternatively, you could implement a set of typical elementwise array operations (matching the set of SSE instructions) as native methods in the standard Math or a similar class. The great advantage of this solution would be support not only for Java arrays but also for direct XxxBuffer instances (ByteBuffer, ShortBuffer, etc.). In the current JVM, simple implementations based on the get/set methods sometimes run very slowly, even in "-server" mode:

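        for (int srcPosMax = srcPos + count; srcPos < srcPosMax; srcPos++, destPos++) {
            // Absolute get/put by index (as for IntBuffer or FloatBuffer); a put without an index would write at the buffer's current position
            dest.put(destPos, dest.get(destPos) + src.get(srcPos));
        }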
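
One possible shape for such a library method is sketched below in plain Java, using absolute get/put so that the same code accepts both heap and direct buffers. BufferOps.addInPlace is a hypothetical name, not an existing JDK API, and the sketch makes no claim about how well HotSpot optimizes this loop:

```java
import java.nio.IntBuffer;

// Hypothetical utility, not part of the JDK.
final class BufferOps {
    private BufferOps() {}

    /** Adds count ints of src (from srcPos) to dest (from destPos); buffer positions are left unchanged. */
    static void addInPlace(IntBuffer src, int srcPos, IntBuffer dest, int destPos, int count) {
        for (int srcPosMax = srcPos + count; srcPos < srcPosMax; srcPos++, destPos++) {
            dest.put(destPos, dest.get(destPos) + src.get(srcPos)); // absolute index access
        }
    }
}
```

A caller could pass IntBuffer.wrap(array) for a heap buffer or ByteBuffer.allocateDirect(n * 4).asIntBuffer() for a direct one; a real implementation would presumably provide one overload per primitive buffer type.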


JUSTIFICATION :
It is impossible to take advantage of SSE instructions in simple elementwise loops without the hard work of writing native methods in the application for every OS that runs on Intel CPUs.


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
A smarter HotSpot optimizer that recognizes elementwise loops with several standard operations (at least minimum, maximum, sum, difference, saturated sum and difference) and performs them via SSE, grouping iterations into blocks of 8 bytes and processing the first and last few bytes in the usual scalar way (to ensure good alignment). Alternatively, a ready-made set of all the necessary native methods, processing Java arrays and NIO buffers of all primitive types, in Math or a similar class.
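
The loop structure requested here can be sketched in plain Java as follows. This only illustrates the shape (scalar pre-loop, main loop over fixed-size blocks, scalar post-loop); it is not how HotSpot implements vectorization. The inner block loop stands in for a single packed instruction, and the index-modulo-BLOCK test merely stands in for a real address alignment check. BlockedMax and BLOCK are hypothetical names:

```java
// Hypothetical illustration, not HotSpot code: elementwise unsigned maximum in three phases.
final class BlockedMax {
    static final int BLOCK = 8; // assumed block width in bytes, as in the request

    private BlockedMax() {}

    static void maxUnsigned(byte[] src, int srcPos, byte[] dest, int destPos, int count) {
        int destEnd = destPos + count;
        // Pre-loop: scalar iterations until destPos reaches a multiple of BLOCK.
        while (destPos < destEnd && destPos % BLOCK != 0) {
            maxOne(src, srcPos++, dest, destPos++);
        }
        // Main loop: whole blocks; each inner loop models one packed-maximum instruction.
        for (; destEnd - destPos >= BLOCK; srcPos += BLOCK, destPos += BLOCK) {
            for (int k = 0; k < BLOCK; k++) {
                maxOne(src, srcPos + k, dest, destPos + k);
            }
        }
        // Post-loop: the remaining tail of fewer than BLOCK elements.
        while (destPos < destEnd) {
            maxOne(src, srcPos++, dest, destPos++);
        }
    }

    private static void maxOne(byte[] src, int s, byte[] dest, int d) {
        if ((src[s] & 0xFF) > (dest[d] & 0xFF)) {
            dest[d] = src[s];
        }
    }
}
```

The result is identical to that of the simple loop; only the iteration grouping differs, which is the property a compiler transformation of this kind has to preserve.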

CUSTOMER SUBMITTED WORKAROUND :
Writing native methods for the most demanding image- and video-processing applications.

Comments
I think this is covered by the SuperWord optimization (JDK-6536652), the increased superword vector size (JDK-7119644), the reduction optimization for vectorized loops (JDK-8074981), and other vectorized loop optimizations.
19-04-2016

We have some vectorization optimizations, under the flag UseSuperWord. They work, at least for limited test cases. I think they need serious broad-spectrum testing, and when we test them we will certainly find they are fragile for use cases we care about. Since streams create simpler loops, we should focus some testing and optimization on stream-based loops. So this is really a request for better testing of the super-word optimization.
02-03-2015
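
A correctness check of the kind the comment above asks for might start from loop shapes like the following, sketched in Java. The class name SuperWordLoops is hypothetical. The sketch covers an elementwise array loop, the same operation written with a stream, and a simple reduction; whether C2 actually vectorizes any of them is not verified here. Running the same program with -XX:+UseSuperWord and with -XX:-UseSuperWord would be one way to compare results and timings:

```java
import java.util.stream.IntStream;

// Hypothetical test subjects, not taken from the HotSpot test suite.
final class SuperWordLoops {
    private SuperWordLoops() {}

    /** Elementwise sum written as a plain counted loop. */
    static void addArrays(int[] src, int[] dest) {
        for (int i = 0; i < dest.length; i++) {
            dest[i] += src[i];
        }
    }

    /** The same elementwise sum expressed through an IntStream. */
    static void addArraysStream(int[] src, int[] dest) {
        IntStream.range(0, dest.length).forEach(i -> dest[i] += src[i]);
    }

    /** Sum reduction over an int array. */
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }
}
```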