JDK-8279676 : Dubious YMM register clearing in x86_64 arraycopy stubs
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11,17,18,19
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • CPU: x86_64
  • Submitted: 2022-01-10
  • Updated: 2022-01-25
  • Resolved: 2022-01-12
  • Fixed in: JDK 19 b05
Description
In copy_bytes_forward and copy_bytes_backward, which are used by the arraycopy stubs, we have code like:

      if (UseAVX >= 2) {
        // clean upper bits of YMM registers
        __ vpxor(xmm0, xmm0);
        __ vpxor(xmm1, xmm1);
      }

This code was added by JDK-8011102 (originally with vzeroupper) and then changed to vpxor by JDK-8078113.
It raised some questions during the early review of JDK-8279621.

I believe these were added to break false dependencies that arise when the wide 256-bit YMM registers stay live across subsequent instructions that use only their 128-bit XMM halves.
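To make the hazard concrete, here is a minimal hypothetical sketch (the operands are invented for illustration, not taken from the actual stubs): a VEX.256 write leaves the full YMM register live, and a later legacy-SSE instruction that touches only the XMM half preserves bits 128..255, so the CPU treats the stale YMM value as an input.

      __ vmovdqu(xmm0, Address(from, 0));   // VEX.256 load: writes all of YMM0
      // ... wide copy loop runs here ...
      __ movdqu(xmm0, Address(from, 16));   // legacy-SSE load: bits 128..255 of YMM0
                                            // are preserved, so this instruction
                                            // carries a false dependency on old YMM0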

Note: this is still insufficient on Intel x86 implementations to recover from the "dirty" AVX state; only vzeroupper/vzeroall would solve that. But that issue might not even affect our assembler code, which AFAICS uses VEX-encoded versions when AVX > 0 (see Assembler::simd_prefix_and_encode). Every arraycopy stub has vzeroupper at the end, anyhow.

For the x86_64 version, this zeroing seems redundant: there are no XMM-using instructions after we leave copy_bytes_{forward,backward} and proceed to the stub epilogue, where we meet vzeroupper.
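For reference, the tail of the x86_64 stubs looks roughly like this (a simplified sketch; label names and the exact restore sequence vary between stubs):

    __ BIND(L_exit);
      // restore argument registers, update copy counters (elided)
      __ vzeroupper();      // resets the dirty upper halves of the YMM registers
      __ xorptr(rax, rax);  // return 0
      __ leave();           // restore rbp
      __ ret(0);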

For the x86_32 version, this zeroing seems odd. x86_32 qword copying still uses XMM registers, as the 32-bit platform has no other good way to copy 8 bytes at a time. There, using VEX.256 vpxor clears all bits, which is fine right now, but the JDK-8279621 changes would probably need to clear the upper 128 bits in AVX=1 mode as well. Also, it is only enabled for AVX == 2, ignoring AVX-512; see the sketch below.
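A hypothetical shape for the x86_32 side (an assumption about where JDK-8279621 might land, not the actual patch): use an explicitly sized vpxor, which stays correct for AVX=1 because VEX-encoded 128-bit instructions already zero bits 128..255, and gate it on UseAVX > 0 rather than pinning it to 2.

      if (UseAVX > 0) {
        // Hypothetical: VEX.128 vpxor zeroes the whole YMM register,
        // since VEX-encoded 128-bit ops clear bits 128..255.
        __ vpxor(xmm0, xmm0, xmm0, Assembler::AVX_128bit);
        __ vpxor(xmm1, xmm1, xmm1, Assembler::AVX_128bit);
      }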

Draft PR:
 https://github.com/openjdk/jdk/pull/7016
Comments
We have new JCK test failures in java.lang.Character tests (see JDK-8280544). I'm wondering if this change could be the cause.
25-01-2022

Changeset: 525b20fc Author: Aleksey Shipilev <shade@openjdk.org> Date: 2022-01-12 08:32:08 +0000 URL: https://git.openjdk.java.net/jdk/commit/525b20fce0acd393f31fd37fe1f05f9d32577f77
12-01-2022

Right, so vzeroupper was added at the end of the arraycopy stubs by JDK-8178811 in JDK 10.
12-01-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/7016 Date: 2022-01-10 16:59:18 +0000
11-01-2022

All right, I still think the x86_32 code change was sensible, but I also think that issue would go away when/if we reshape the x86_32 arraycopy code to match x86_64. I will re-purpose this patch to fix only x86_64.
11-01-2022

[~shade] I agree with the code removal for 64-bit. It was added when we did not have vzeroupper on stub exit. I am not sure about 32-bit. In both cases where xmm_copy_forward() is called, we have vzeroupper() on exit. The only use is `movq(xmm0` in the following L_copy_8_bytes code, so we may need it for xmm0. About vpxor(): it clears all 256 bits with VEX.256, and that is intended. SSE2 is 128 bits and AVX2 is 256 bits, so we do need to clear all 256 bits when transitioning from AVX to SSE. I think with avx512 there is no longer a penalty for mixing SSE and AVX instructions (I think everything uses EVEX). In short, I would leave the 32-bit code as it is.
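For context, the x86_32 tail copy in question has roughly this shape (a simplified sketch; the real stub's addressing and loop control differ):

    __ BIND(L_copy_8_bytes_loop);
      __ movq(xmm0, Address(from, count, Address::times_1));  // 8-byte load through xmm0
      __ movq(Address(to, count, Address::times_1), xmm0);    // 8-byte store through xmm0
    __ BIND(L_copy_8_bytes);
      __ subl(count, 8);                                      // hypothetical loop control
      __ jcc(Assembler::greaterEqual, L_copy_8_bytes_loop);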
10-01-2022

[~kvn], I wonder what you think about this.
10-01-2022