JDK-8150730 : Improve performance of x86_64 arraycopy stubs
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11,17,18
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • CPU: x86
  • Submitted: 2016-02-26
  • Updated: 2023-08-30
Description
If you have a dumb benchmark like this:

    import java.util.Random;

    import org.openjdk.jmh.annotations.*;

    // Enclosing class and imports are assumed JMH boilerplate; the original
    // report shows only the benchmark body. The class name is arbitrary.
    @State(Scope.Benchmark)
    public class ArrayCopyBench {

        @Param({"1024"})
        int size;

        byte[] pad;
        long[] source, destination;

        @Setup(Level.Iteration)
        public void setUp() {
            Random r = new Random(42);

            // Padding perturbs the arrays' placement on the heap
            pad = new byte[r.nextInt(1024)];
            source = new long[size];
            destination = new long[size];

            for (int i = 0; i < size; ++i) {
                source[i] = r.nextInt();
            }

            // Promote all the arrays
            System.gc();
        }

        @Benchmark
        public void arraycopy() {
            System.arraycopy(source, 0, destination, 0, size);
        }
    }

...and run it with JDK 9b107 on an i7-4790K @ 4.0 GHz (Linux x86_64), then you will see that performance fluctuates a lot:

# Warmup Iteration   1: 351.178 ns/op
# Warmup Iteration   2: 385.568 ns/op
# Warmup Iteration   3: 366.771 ns/op
# Warmup Iteration   4: 341.570 ns/op
# Warmup Iteration   5: 420.488 ns/op
Iteration   1: 309.817 ns/op
Iteration   2: 346.652 ns/op
Iteration   3: 408.156 ns/op
Iteration   4: 343.857 ns/op
Iteration   5: 137.810 ns/op
Iteration   6: 283.327 ns/op
Iteration   7: 356.355 ns/op
Iteration   8: 319.256 ns/op
Iteration   9: 136.157 ns/op
Iteration  10: 302.372 ns/op
Iteration  11: 299.792 ns/op
Iteration  12: 389.018 ns/op
Iteration  13: 329.284 ns/op
Iteration  14: 142.508 ns/op
Iteration  15: 297.566 ns/op

Since every run executes the same generated code, which ends up calling jlong_disjoint_arraycopy, and the hottest piece of code is the AVX-assisted copy:

  1.90%    0.69%  │ ↗ │││  0x00007feb44f11a70: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
 36.10%   36.21%  │ │ │││  0x00007feb44f11a76: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
 10.28%   11.38%  │ │ │││  0x00007feb44f11a7c: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
 29.87%   26.29%  │ │ │││  0x00007feb44f11a82: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
 15.40%   18.50%  ↘ │ │││  0x00007feb44f11a88: add    $0x8,%rdx
                    ╰ │││  0x00007feb44f11a8c: jle    Stub::jlong_disjoint_arraycopy+48 0x00007feb44f11a70

...the suspicion obviously falls on data alignment.
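One way to check this hypothesis is to print where the arrays actually land. A minimal sketch, assuming the JOL library (org.openjdk.jol) is on the classpath; the class name is hypothetical:

    import org.openjdk.jol.vm.VM;

    // Hypothetical check: print the address of the destination array and its
    // offset from a 32-byte boundary. Note that addressOf() returns the object
    // base; the element data starts after the array header, and GC may move
    // the object later.
    public class AlignCheck {
        public static void main(String[] args) {
            long[] destination = new long[1024];
            long addr = VM.current().addressOf(destination);
            System.out.printf("destination @ 0x%x, addr %% 32 = %d%n", addr, addr % 32);
        }
    }

Run-to-run differences in that offset would be expected to line up with the fast and slow iterations above.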

This work aims to substantially improve arraycopy performance and reduce its variance on x86_64.

The key things the new code does (a sketch of the resulting split follows the list):
  *) Aligns the destination address so that stores are aligned: the cost of misaligned stores, especially wide AVX stores, is severe;
  *) Uses the widest copy available, employing basically all the vector registers we have for the bulk copy: this minimizes ill effects from dependencies and store forwarding;
  *) Does the tail copies with the widest operations possible: this optimizes copy lengths that are not exact powers of 2.
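A minimal sketch of the resulting decomposition (plain Java; the helper is hypothetical, the real stub does this arithmetic in assembly): the copy splits into a head that brings the destination to 32-byte alignment, an aligned bulk part, and a tail.

    // Hypothetical helper illustrating the head/bulk/tail split: 'head' bytes
    // bring dst to a 32-byte boundary, 'bulk' is copied with aligned wide
    // stores, 'tail' with the widest operations that still fit.
    static long[] split(long dstAddr, long lenBytes) {
        long head = (32 - (dstAddr & 31)) & 31;  // bytes until dst is 32-byte aligned
        head = Math.min(head, lenBytes);
        long bulk = (lenBytes - head) & ~31L;    // whole 32-byte chunks
        long tail = lenBytes - head - bulk;      // leftover bytes
        return new long[] { head, bulk, tail };
    }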
Comments
Status updates: https://cr.openjdk.java.net/~shade/8150730/status.txt

The implementation is fairly complete; more performance testing is needed. Zen 2 and Haswell respond well to these improvements.
24-11-2021

Transplanted the old experimental changes onto the current mainline and put them in a Git branch here: https://github.com/openjdk/jdk/compare/master...shipilev:JDK-8150730-arraycopy
04-11-2021

Some progress: smaller arraycopies are handled better, and the new stub supports AVX{0,1,2,3}: http://cr.openjdk.java.net/~shade/8150730/webrev.01/

Sample performance on SKL-X:
http://cr.openjdk.java.net/~shade/8150730/webrev01-avx0.png
http://cr.openjdk.java.net/~shade/8150730/webrev01-avx1.png
http://cr.openjdk.java.net/~shade/8150730/webrev01-avx2.png
http://cr.openjdk.java.net/~shade/8150730/webrev01-avx3.png
30-01-2019

Updated patch for recent jdk/jdk, and implemented AVX-512 versions:
http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest64-zmm16.patch
http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest64-zmm32.patch

Still very profitable on SKL-X: http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest64-zmm32.png

Let's see what it takes to productise it.
20-11-2018

Hi [~shade], requesting your help: are you planning to get the fix for this task into JDK 11? If not, can you please change the 'Fix Version' to tbd_feature or tbd_update? Thank you.
03-05-2018

Updated patch for recent jdk/hs: http://cr.openjdk.java.net/~shade/8150730/webrev.00/

Seems like Skylake-X exhibits a more interesting baseline behavior: http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-ymm16-v3-v3-sklX.png

Average performance is still better for the patched version, and the curve is flatter. The 4K aliasing hump is still present.
10-01-2018

Any progress? It seems very promising!
27-05-2016

Well, since the bug and your comment are public, you just did.
31-03-2016

Better proof-of-concept patch: http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-ymm16-v3.patch

It still tries to do several things:
a) align the destination to 32 bytes to avoid split stores;
b) do deeper pipelining to alleviate penalties in the hot loop, using the most efficient load/store order (sketched after this comment);
c) alleviate aliasing penalties by using a 16-byte load once (speculation: my Haswell cannot take *that* many split loads);
d) (new) handle small arrays efficiently.

My attempts to further tame split loads proved unsuccessful. Even with dst aligned by 32 and src aligned by 16, individual 16-byte loads + merges are slower than a split 32-byte load, at least on my Haswell. Therefore, I believe we cannot do better than this. Next steps should include extracting the common parts from this new stub and applying them to the other stubs (other data types, and backward copy too).

The latest performance data: http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-ymm16-v3.png (again, performance results seem to be periodic with a 4K period; only one period is shown)

P.S. Somebody, tell Intel their 4K aliasing issue totally blows?
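For illustration, a plain-Java scalar stand-in (hypothetical; the patch does this with ymm registers in assembly) for the load/store grouping in (b): several loads are issued before their stores, so copies overlap instead of serializing.

    // Hypothetical scalar analogue of the pipelined vector loop: loads are
    // grouped ahead of stores, keeping several copies in flight per iteration.
    static void bulkCopy(long[] src, long[] dst, int len) {
        int i = 0;
        for (; i + 8 <= len; i += 8) {
            long a0 = src[i],     a1 = src[i + 1], a2 = src[i + 2], a3 = src[i + 3];
            long b0 = src[i + 4], b1 = src[i + 5], b2 = src[i + 6], b3 = src[i + 7];
            dst[i]     = a0; dst[i + 1] = a1; dst[i + 2] = a2; dst[i + 3] = a3;
            dst[i + 4] = b0; dst[i + 5] = b1; dst[i + 6] = b2; dst[i + 7] = b3;
        }
        for (; i < len; i++) {
            dst[i] = src[i]; // tail
        }
    }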
30-03-2016

I'd just like to emphasize the set of issues observed here:
0) Misalignment issue: a small one; misaligned loads/stores lead to ~10% performance degradation in comparison with 32-byte aligned loads/stores.
1) 4K issue: hitting a 4K boundary between source and destination causes ~3x performance degradation.
2) 1M & LP issue: hitting a 1M boundary between source and destination with Large Pages turned on (whether explicitly, or implicitly via transparent huge pages on Linux) causes ~10x performance degradation.
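A minimal sketch (assumed; in the spirit of the ArrayCopyByteScan.java workload linked in the next comment below) that can expose the 4K-periodic behavior: sweep the distance between source and destination across a 4K boundary and time each offset.

    // Hypothetical harness: copy between two regions of one backing array,
    // sweeping the src/dst distance so (dst - src) crosses 4K boundaries.
    public class AliasingScan {
        public static void main(String[] args) {
            byte[] buf = new byte[1 << 20];
            for (int off = 0; off < 8192; off += 64) {
                long t0 = System.nanoTime();
                for (int r = 0; r < 10_000; r++) {
                    System.arraycopy(buf, 0, buf, 65536 + off, 4096);
                }
                System.out.printf("off=%5d: %d ns/copy%n",
                        off, (System.nanoTime() - t0) / 10_000);
            }
        }
    }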
15-03-2016

Current set of proof-of-concept patches:
http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-xmm1-ymm7.patch
http://cr.openjdk.java.net/~shade/8150730/poc-bytecopy-dest32-xmm1-ymm15.patch

These patches try to do three things:
a) align the destination to 32 bytes to avoid split stores;
b) do deeper pipelining to alleviate penalties in the hot loop;
c) alleviate aliasing penalties by using 16-byte ops sparingly (this is a surprising result).

Performance for the 1xmm + 15ymm version is rather good: the worst performance in the patched version matches the average on the baseline: http://cr.openjdk.java.net/~shade/8150730/arraycopy-dest32-1xmm-7%2c15ymm.png (performance results seem to be periodic with a 4K period; only one period is shown)

Workload used to estimate performance improvements: http://cr.openjdk.java.net/~shade/8150730/ArrayCopyByteScan.java
11-03-2016

A more advanced proof-of-concept that handles aligned-by-8 arraycopies for all types and does not regress tiny arrays: http://cr.openjdk.java.net/~shade/8150730/poc-align-dest32.patch

However, we should really do something like stubGenerator_sparc does: accurately peel the prefix and suffix, down to single bytes, to align well for destinations that are not aligned by 8.
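A tiny sketch (hypothetical helper, plain Java) of the byte-granularity prefix peel suggested above: copy single bytes until the destination reaches 8-byte alignment, then hand the aligned middle to wider operations; the suffix is handled symmetrically at the other end.

    // Hypothetical: number of single bytes to peel before dstAddr becomes
    // 8-byte aligned, capped by the total copy length.
    static long prefixPeel(long dstAddr, long lenBytes) {
        long peel = (-dstAddr) & 7; // 0..7 bytes
        return Math.min(peel, lenBytes);
    }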
03-03-2016

Intel Optimization Manual seems to say:

a) "Assembly/Compiler Coding Rule 73. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer store alignment over load alignment." -- the proof-of-concept fix above fits that description. Dropping the alignment from 64 to 32 seems to solve the issue too.

b) "11.6.2 Consider 16-Byte Memory Access when Memory is Unaligned
For best results use Intel AVX 32-byte loads and align data to 32-bytes. However, there are cases where you cannot align the data, or data alignment is unknown. This can happen when you are writing a library function and the input data alignment is unknown. In these cases, using 16-Byte memory accesses may be the best alternative."
02-03-2016

Theoretically, there is a problem with aligning both source and destination: think about src at base 0 and dst at base 8 -- no matter how many pre-loop operations you do, either src or dst will remain misaligned at any modulus larger than 8 (see the sketch below). However, a simple experiment that aligns the destination only (it could be demonstrated with source alignment too): http://cr.openjdk.java.net/~shade/8150730/poc-align-dest64.patch

...trims down the run-to-run variance and shifts the performance up to the best mode:
http://cr.openjdk.java.net/~shade/8150730/sorted-align-dest64-minusNewCode.txt
http://cr.openjdk.java.net/~shade/8150730/sorted-align-dest64-plusNewCode.txt

There are probably other tricks we can do, but this one seems to work well already. Of course, the choice of alignment value was arbitrarily large, and we would need to follow up on the drawbacks for smaller arrays.
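A small illustration (assumed, plain Java) of why both sides cannot be aligned at once: peeling k leading bytes advances src and dst together, so their difference modulo the vector width never changes.

    // With src at base 0 and dst at base 8, (dst - src) % 32 == 8, so no peel
    // count k can make both (src + k) % 32 == 0 and (dst + k) % 32 == 0:
    // that would imply (dst - src) % 32 == 0.
    public class AlignBoth {
        public static void main(String[] args) {
            long src = 0, dst = 8;
            for (long k = 0; k < 32; k++) {
                boolean both = (src + k) % 32 == 0 && (dst + k) % 32 == 0;
                if (both) throw new AssertionError("unreachable");
            }
            System.out.println("no k aligns both src and dst to 32 bytes");
        }
    }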
29-02-2016