JDK-8074124 : Most Unsafe.get*() access shapes are losing vs. the plain Java accesses
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2015-03-02
  • Updated: 2019-01-15

If you run a simple benchmark like this:

    public int stream_unsafe_char() {
        int s = 0;
        for (int c = 0; c < size; c++) {
            s += U.getChar(charArr, CHAR_ARR_OFFSET + CHAR_ARR_SCALE * c);
        }
        return s;
    }

...then you will notice that the generated code has a stray "movslq":

 0x00007ffad8741180: movslq %r8d,%r9  <--- unnecessary sign extension
 0x00007ffad8741183: movzwl 0x10(%r10,%r9,2)
 0x00007ffad8741189: add    %ecx,%edx         
 0x00007ffad874118b: inc    %r8d               
 0x00007ffad874118e: cmp    %r11d,%r8d
 0x00007ffad8741191: jl     0x00007ffad8741180  

Indeed, the baseline test that does the plain Java access:

    public int plain() {
        int s = 0;
        for (int c = 0; c < size; c++) {
            s += charArr[c];
        }
        return s;
    }

...does without sign extension:

 0x00007fc3d0f84ea0: movzwl 0x10(%r11,%r8,2),%ecx
 0x00007fc3d0f84ea6: add    %ecx,%edx
 0x00007fc3d0f84ea8: inc    %r8d
 0x00007fc3d0f84eab: cmp    %r10d,%r8d
 0x00007fc3d0f84eae: jl     0x00007fc3d0f84ea0  

This conversion costs some cycles on my 1x4x2 4 GHz Haswell, running with Linux x86_64, 8u40 EA:

Benchmark            (size)  Mode  Cnt    Score   Error  Units
UnsafeMovslq.plain     1000  avgt    5  343.719 ± 0.694  ns/op
UnsafeMovslq.unsafe    1000  avgt    5  362.350 ± 5.302  ns/op

JDK-8145322 should fix some of those issues.

Thanks, Aleksey. I think your evaluation makes perfect sense: the only way we could improve the existing access shapes is by trying to prove that the integer index expression does not overflow. The baseline version is able to prove this because we have a range check that guarantees that the index is in bounds. I think in most cases we are not able to prove this for the unsafe version, which has no range check.

I think we cannot optimize this in compilers for a simple reason: Java evaluation rules *would* overflow the computation of the final offset for any non-byte[] array. E.g. if you allocate new char[Integer.MAX_VALUE-2], then most access shapes would overflow the int. The only shapes that are not amenable to overflow compute things in long:

    private static final long SCALE = Unsafe.ARRAY_CHAR_INDEX_SCALE;
    private static final long BASE = Unsafe.ARRAY_CHAR_BASE_OFFSET;

    U.getChar(charArr, Unsafe.ARRAY_CHAR_BASE_OFFSET + (long)Unsafe.ARRAY_CHAR_INDEX_SCALE * c);
    U.getChar(charArr, Unsafe.ARRAY_CHAR_BASE_OFFSET + Unsafe.ARRAY_CHAR_INDEX_SCALE * (long)c);
    U.getChar(charArr, Unsafe.ARRAY_CHAR_BASE_OFFSET + (1L * Unsafe.ARRAY_CHAR_INDEX_SCALE) * c);
    U.getChar(charArr, Unsafe.ARRAY_CHAR_BASE_OFFSET + (Unsafe.ARRAY_CHAR_INDEX_SCALE * (1L * c)));
    U.getChar(charArr, BASE + SCALE * c);
    U.getChar(charArr, Unsafe.ARRAY_CHAR_BASE_OFFSET + SCALE * c);

...with significant hits on 32-bit platforms.

Plain array accesses get away with this, as John suggests above, by knowing/assuming that a complex addressing mode operand would forgive us? But the compiler cannot assume any specifics for the expressions that are used as inputs for Unsafe.get*() calls. You cannot fold an overflowing Java expression into a non-overflowing one, since that breaks language semantics. This means that either the compiler should be able to prove "c" is low enough to avoid overflow, or it should bail. This proof is potentially possible in loops, but not in one-off usages.

All of the above leads me to believe the only way out of this conundrum is adding "X Unsafe.getX(Object obj, int base, int scale, int index)" for each basic type X, and making sure compilers are able to lower this to the most optimal access on a target platform. E.g. it could be folded into "mov $base(&obj, $index, $scale), %reg" for some $scale-s.
Or, maybe a simpler form of "X Unsafe.getIndexedX(Object obj, int index)", assuming X is enough to figure out the base and scale, *and* bases for basic types X agree.
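
The overflow argument above can be checked with plain integer arithmetic, without touching Unsafe at all. The sketch below is illustrative only: the class and method names are hypothetical, and BASE = 16 and SCALE = 2 are assumed values chosen to match the 0x10 displacement and the scale of 2 visible in the disassembly, not constants read from a real JVM.

```java
public class OffsetOverflowDemo {
    // Assumed values for illustration; real code would query Unsafe for them.
    static final int BASE = 16;
    static final int SCALE = 2;

    // Shape with int operands only: SCALE * c is evaluated in 32 bits and may
    // wrap around before the result is widened to long.
    static long intShape(int c) {
        return BASE + SCALE * c;
    }

    // Shape with a long operand: the multiplication is promoted to 64 bits.
    static long longShape(int c) {
        return BASE + (long) SCALE * c;
    }

    public static void main(String[] args) {
        // Last valid index of new char[Integer.MAX_VALUE - 2].
        int c = Integer.MAX_VALUE - 3;
        System.out.println("int shape:  " + intShape(c));
        System.out.println("long shape: " + longShape(c));
    }
}
```

For that index the int shape wraps around to a small offset, while the long shape yields the intended one. Because the two expressions disagree for some inputs, the compiler may not silently replace one with the other unless it can bound "c".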

If we cannot pattern-match this reliably in compilers, then maybe we should consider adding "X Unsafe.getX(Object obj, int base, int scale, int offset)", and lower it explicitly to a complex-addressed mov, at least on x86.

Also had a quick comparative run with the latest jdk9/hs-comp that contains JDK-8136820, and it performs the same as the JDK 9b82 used in the original experiment.

Updated benchmark to cover more shapes: http://cr.openjdk.java.net/~shade/8074124/UnsafeConvBench.java
Runnable JAR: http://cr.openjdk.java.net/~shade/8074124/benchmarks.jar

It seems there is no addressing shape that allows C1 to skip a significant amount of work. There are some C2 shapes that are as optimal as plain access (notably, in 32-bit mode only!). In the end, there is no shape that works well in both C1 and C2. See the benchmark source for results and disassembly.

This should be treated as an important codegen issue, because many low-level projects, including Compact Strings and VarHandles, depend on Unsafe.get*() performance.

Come again? The "unsafe" version uses long CHAR_ARR_OFFSET and CHAR_ARR_SCALE already, so the result of (CHAR_ARR_SCALE * c) is already long. Adding (long)c has no effect on performance, and movslq is still present. In the end, the assembly above seems to suggest we are only dealing with $c counter, and the base offsets are calculated already. Why does the compiler use $c directly in "safe" case, but does the conversion in "unsafe" case? $c is "int" in both cases.
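
The point about operand types can be illustrated in isolation. In this sketch (hypothetical class name, assumed constant values of 16 and 2), the constants are declared long as in the benchmark, so Java's binary numeric promotion widens "c" before the multiply and the explicit cast changes nothing at the source level.

```java
public class LongConstantDemo {
    // Declared long, mirroring CHAR_ARR_OFFSET and CHAR_ARR_SCALE in the
    // benchmark; the values 16 and 2 are assumptions for illustration.
    static final long CHAR_ARR_OFFSET = 16L;
    static final long CHAR_ARR_SCALE = 2L;

    // No cast: c is implicitly promoted to long because the other operand is long.
    static long withoutCast(int c) {
        return CHAR_ARR_OFFSET + CHAR_ARR_SCALE * c;
    }

    // Explicit cast: the same 64-bit arithmetic, with the widening spelled out.
    static long withCast(int c) {
        return CHAR_ARR_OFFSET + CHAR_ARR_SCALE * (long) c;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 1000, Integer.MAX_VALUE};
        for (int c : samples) {
            System.out.println(c + ": " + withoutCast(c) + " / " + withCast(c)
                    + " equal=" + (withoutCast(c) == withCast(c)));
        }
    }
}
```

Both methods compute the same value for every int input, which is consistent with the observation that adding the cast does not remove the sign extension: the int-to-long widening of "c" is present in both forms.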

The conversion is necessary because the index scaling sub-expression (CHAR_ARR_SCALE * c) can overflow, whereas the corresponding "safe" subexpression cannot (and the JIT works hard at using that fact). Suggest reformulating the benchmark as all-long arithmetic:

    s += U.getChar(charArr, CHAR_ARR_OFFSET + CHAR_ARR_SCALE * (long) c);

Benchmark: http://cr.openjdk.java.net/~shade/8074124/UnsafeMovslq.java
Executable JAR: http://cr.openjdk.java.net/~shade/8074124/benchmarks.jar (run with "java -jar benchmarks.jar")
Sample perfasm output: http://cr.openjdk.java.net/~shade/8074124/output.perfasm (run with "java -jar benchmarks.jar -prof perfasm")