While working on JDK-8150730 and looking at performance results for it, I noticed a peculiarity in the current arraycopy implementation. It looks as if changing UseAVX from 0 to 1 does not improve the baseline scores:
https://cr.openjdk.java.net/~shade/8150730/i11500.png
https://cr.openjdk.java.net/~shade/8150730/tr3970x.png
The problem is that the arraycopy generators use vmovdqu only for UseAVX >= 2:
if (UseAVX >= 2) {
__ vmovdqu(xmm0, Address(end_from, qword_count, Address::times_8, -56));
...
} else {
__ movdqu(xmm0, Address(end_from, qword_count, Address::times_8, -56));
...
}
...while 256-bit vmovdqu is actually available for plain AVX(1) as well (matches VEX.256 encoding, as per Intel SDM):
// Move Unaligned 256bit Vector
void vmovdqu(Address dst, XMMRegister src);
void vmovdqu(XMMRegister dst, Address src);
void vmovdqu(XMMRegister dst, XMMRegister src);
It seems to have been that way since the initial implementation in JDK-8005544.
Relaxing the requirement to UseAVX >= 1 in that code provides substantial performance improvements:
https://github.com/openjdk/jdk/pull/6987
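To illustrate why UseAVX=0 and UseAVX=1 score the same today, here is a minimal standalone sketch of the guard. It is a simplified model, not HotSpot code: copy_vector_bytes, current_guard and relaxed_guard are hypothetical names, and the only facts assumed are that movdqu moves 16 bytes per instruction and the 256-bit vmovdqu moves 32.

```cpp
// Simplified model of the arraycopy guard; not HotSpot code.
// All function names here are hypothetical, for illustration only.

// Bytes moved by one unaligned vector move in the copy loop:
// 32 for the 256-bit vmovdqu, 16 for the 128-bit movdqu.
constexpr int copy_vector_bytes(int use_avx, int min_avx_for_vmovdqu) {
    return use_avx >= min_avx_for_vmovdqu ? 32 : 16;
}

// Guard as quoted above: vmovdqu only for UseAVX >= 2.
constexpr int current_guard(int use_avx) {
    return copy_vector_bytes(use_avx, 2);
}

// Relaxed guard: vmovdqu already for UseAVX >= 1.
constexpr int relaxed_guard(int use_avx) {
    return copy_vector_bytes(use_avx, 1);
}

// With the current guard, UseAVX=1 takes the same 16-byte path as UseAVX=0.
static_assert(current_guard(0) == 16 && current_guard(1) == 16,
              "UseAVX=1 falls back to movdqu");

// With the relaxed guard, UseAVX=1 gets the same 32-byte moves as UseAVX=2.
static_assert(relaxed_guard(1) == 32 && relaxed_guard(2) == 32,
              "UseAVX=1 uses 256-bit vmovdqu");
```

In this model the only difference between the two guards is the UseAVX=1 case, which is the configuration where the baseline scores do not improve.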