Bug ID: JDK-8310159 Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 21,22

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2023-06-15
Updated: 2024-02-21
Resolved: 2023-11-27

Fixed in: JDK 22 (b26)

Consider this benchmark:

https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:xor_bench?expand=1

Here, we compare the performance of code that copies arrays into off-heap storage ahead of a native call. It turns out that doing the copy using Unsafe::arrayCopy is 15-20% slower than using JNI's GetByteArrayRegion function.
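For readers who want to see the kind of copy being measured, here is a minimal sketch of moving a heap `byte[]` into off-heap memory. It assumes that the "Unsafe::arrayCopy" path in this report corresponds to `sun.misc.Unsafe.copyMemory`; the class and method names (`OffHeapCopy`, `roundTrip`) are hypothetical and are not taken from the linked benchmark.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical helper for illustration; not part of the linked benchmark.
class OffHeapCopy {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            // sun.misc.Unsafe has no public constructor; the usual workaround
            // is to read its private static "theUnsafe" field reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Copies src into freshly allocated off-heap memory, then reads it back. */
    static byte[] roundTrip(byte[] src) {
        long addr = UNSAFE.allocateMemory(Math.max(1, src.length));
        try {
            // Bulk copy: heap array -> native address (the operation under discussion).
            UNSAFE.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, addr, src.length);
            byte[] dst = new byte[src.length];
            // Bulk copy back: native address -> heap array, so the result can be checked.
            UNSAFE.copyMemory(null, addr, dst, Unsafe.ARRAY_BYTE_BASE_OFFSET, src.length);
            return dst;
        } finally {
            UNSAFE.freeMemory(addr);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        System.out.println(java.util.Arrays.equals(data, roundTrip(data)));
    }
}
```

This only demonstrates the copy itself. Timing it meaningfully would need a harness such as JMH, as the benchmark in this report uses.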

Profiling the benchmark with perfasm reveals that Unsafe::arrayCopy boils down to:

StubRoutines::jlong_disjoint_arraycopy 

Whereas for JNI we end up with this:

__memmove_avx_unaligned_erms

The latter, judging from its name, presumably uses AVX optimizations, which is the most likely explanation for the performance difference between the two code paths.

Changeset: 82967f45
Author: steveatgh <steve.dohrmann@intel.com>
Committer: Sandhya Viswanathan <sviswanathan@openjdk.org>
Date: 2023-11-27 17:35:39 +0000
URL: https://git.openjdk.org/jdk/commit/82967f45db3b9555be03fcabdba380852ea21e2c
27-11-2023
A pull request was submitted for review.
URL: https://git.openjdk.org/jdk/pull/16575
Date: 2023-11-08 23:23:48 +0000
10-11-2023
Maurizio said that an AVX2 machine was used for testing.
22-09-2023
[~mcimadamore] What CPU did you run your tests on? Does it have AVX512 instructions or only AVX2?
19-09-2023
> One thing to notice: I'm not calling memmove directly - that seems to be triggered by JNI_GetByteArrayRegion (although I admit I can't quite follow the codepath in hotspot). Isn't that a bit strange? AFAIK we don't use plain C stdlib calls to implement array copy...

[~mcimadamore] JNI goes through:

jni_GetByteArrayRegion -> ArrayAccess<>::arraycopy_to_native -> ... -> pd_disjoint_words -> ... -> memcpy

So it all seems to boil down to libc's memcpy being faster for large arrays than our arraycopy stub.
14-09-2023
If the call to `xor` is removed the differences are magnified:

```
Benchmark    (arrayKind)  (sizeKind)  Mode  Cnt  Score   Error  Units
XorTest.xor       UNSAFE       SMALL  avgt   30  0.005 ± 0.001  ms/op
XorTest.xor       UNSAFE      MEDIUM  avgt   30  0.125 ± 0.002  ms/op
XorTest.xor       UNSAFE       LARGE  avgt   30  3.637 ± 0.052  ms/op
XorTest.xor       REGION       SMALL  avgt   30  0.005 ± 0.001  ms/op
XorTest.xor       REGION      MEDIUM  avgt   30  0.379 ± 0.003  ms/op
XorTest.xor       REGION       LARGE  avgt   30  2.063 ± 0.011  ms/op
```

The interesting thing is that, depending on the size of the copy, UNSAFE can be anywhere from 3x faster (for MEDIUM) to almost 2x slower (for LARGE) than REGION. (I guess that's to be expected since the two copy operations use completely different code).
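As a quick check on the "3x faster" and "almost 2x slower" figures, the ratios can be recomputed from the scores in the table above. This is only arithmetic on the reported averages and ignores the error margins; the class name `CopyRatios` is hypothetical.

```java
// Hypothetical helper: recomputes the ratios quoted in the comment above.
class CopyRatios {
    /** Returns how many times larger a is than b (a / b). */
    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        // Scores in ms/op, copied from the table above (lower is better).
        double unsafeMedium = 0.125, regionMedium = 0.379;
        double unsafeLarge = 3.637, regionLarge = 2.063;

        // MEDIUM: REGION takes about 3.0 times as long as UNSAFE.
        System.out.printf("MEDIUM REGION/UNSAFE = %.2f%n", ratio(regionMedium, unsafeMedium));
        // LARGE: UNSAFE takes about 1.8 times as long as REGION.
        System.out.printf("LARGE  UNSAFE/REGION = %.2f%n", ratio(unsafeLarge, regionLarge));
    }
}
```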
31-08-2023
I have tried the above patch, by fetching the branch and adding my benchmark. Resulting code here:

https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:shipilev_better_unsafe_copy?expand=1

Results seem unchanged (i.e. the same as those I get against the latest JDK 22).

```
Benchmark    (arrayKind)  (sizeKind)  Mode  Cnt  Score   Error  Units
XorTest.xor       UNSAFE       SMALL  avgt   30  0.051 ± 0.001  ms/op
XorTest.xor       UNSAFE      MEDIUM  avgt   30  1.083 ± 0.017  ms/op
XorTest.xor       UNSAFE       LARGE  avgt   30  8.543 ± 0.099  ms/op
XorTest.xor       REGION       SMALL  avgt   30  0.051 ± 0.001  ms/op
XorTest.xor       REGION      MEDIUM  avgt   30  1.293 ± 0.009  ms/op
XorTest.xor       REGION       LARGE  avgt   30  7.130 ± 0.074  ms/op
```
31-08-2023
[~shade] [~kvn] it would be great if the patch could be rebased onto a more recent JDK codebase. I tried to apply it (on jdk/jdk) with no luck.
30-08-2023
One thing to notice: I'm not calling memmove directly - that seems to be triggered by JNI_GetByteArrayRegion (although I admit I can't quite follow the codepath in hotspot). Isn't that a bit strange? AFAIK we don't use plain C stdlib calls to implement array copy...
30-08-2023
The stub's copy loop uses only two 256-bit (AVX2) word copies per iteration:

https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/stubGenerator_x86_64_arraycopy.cpp#L250

[~shade] has a prototype to improve arraycopy: https://bugs.openjdk.org/browse/JDK-8150730
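To make the loop shape concrete, here is a rough Java model of a forward copy that moves 64 bytes (two 256-bit words) per main-loop iteration and finishes with a scalar tail. It is an illustration only: the real stub is generated x86_64 assembly, and the class name `UnrolledCopy`, the use of `long[]`, and the tail handling are assumptions made for this sketch.

```java
// Illustrative model only; the actual HotSpot stub is generated assembly.
class UnrolledCopy {
    /** Copies len longs from src to dst, 8 longs (64 bytes) per main-loop iteration. */
    static void copyForward(long[] src, long[] dst, int len) {
        int i = 0;
        for (; i + 8 <= len; i += 8) {
            // Stands in for the first 256-bit (32-byte) load/store pair.
            dst[i]     = src[i];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
            // Stands in for the second 256-bit (32-byte) load/store pair.
            dst[i + 4] = src[i + 4];
            dst[i + 5] = src[i + 5];
            dst[i + 6] = src[i + 6];
            dst[i + 7] = src[i + 7];
        }
        // Tail: the remaining elements that do not fill a 64-byte chunk.
        for (; i < len; i++) {
            dst[i] = src[i];
        }
    }

    public static void main(String[] args) {
        long[] src = new long[19];
        for (int i = 0; i < src.length; i++) src[i] = i * 31L;
        long[] dst = new long[src.length];
        copyForward(src, dst, src.length);
        System.out.println(java.util.Arrays.equals(src, dst));
    }
}
```

A wider unroll or larger vectors would move more bytes per iteration, which is the direction the discussion in this issue points toward for large copies.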
29-08-2023
StubRoutines::jlong_disjoint_arraycopy relies on `StubGenerator::copy_bytes_forward()` which does try to optimize for AVX-capable hardware. Probably, __memmove_avx_unaligned_erms features a more efficient code shape.
29-08-2023
This is the result I get on my machine:

```
Benchmark    (arrayKind)  (sizeKind)  Mode  Cnt          Score         Error  Units
XorTest.xor     ELEMENTS       SMALL  avgt   30  240101397.044 ± 9261809.099  ns/op
XorTest.xor     ELEMENTS      MEDIUM  avgt   30  130575283.692 ±  944813.941  ns/op
XorTest.xor     ELEMENTS       LARGE  avgt   30  757975034.267 ± 5748433.646  ns/op
XorTest.xor       REGION       SMALL  avgt   30   52554232.092 ± 1623573.931  ns/op
XorTest.xor       REGION      MEDIUM  avgt   30   66772012.642 ±  429276.782  ns/op
XorTest.xor       REGION       LARGE  avgt   30   71781713.916 ±  324783.892  ns/op
XorTest.xor     CRITICAL       SMALL  avgt   30   46570121.646 ±  653800.554  ns/op
XorTest.xor     CRITICAL      MEDIUM  avgt   30   48706705.966 ±  410383.485  ns/op
XorTest.xor     CRITICAL       LARGE  avgt   30   50525769.353 ± 3512583.237  ns/op
XorTest.xor      FOREIGN       SMALL  avgt   30   53748003.470 ±  333580.950  ns/op
XorTest.xor      FOREIGN      MEDIUM  avgt   30   58389792.915 ±  110746.006  ns/op
XorTest.xor      FOREIGN       LARGE  avgt   30   95225626.628 ±  899076.107  ns/op
```

Note that for SMALL and MEDIUM, FOREIGN fares better than REGION (which uses JNI). But for LARGE, the opposite is true. This seems to show that a completely different copy algorithm is being used.
15-06-2023

Relates :	JDK-8150730 - Improve performance of x86_64 arraycopy stubs
Relates :	JDK-8326421 - Add jtreg test for large arrayCopy disjoint case.