JDK-8201193 : Use XMM/YMM for objects initialization
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: x86
  • Submitted: 2018-04-05
  • Updated: 2019-10-12
  • Resolved: 2018-06-13
  • Fix Version: JDK 11 b18 (Fixed)
Description
Right now, for longer lengths we use "rep stos" instructions on x86, and for short lengths we use plain "mov rax" stores. Experiments show that we can use XMM/YMM registers for intermediate lengths to improve object initialization performance.
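As an illustration of the intended zeroing pattern, here is a standalone sketch (not HotSpot code) that mirrors the structure of the stub discussed in the comments below: a 64-byte YMM main loop, a 32-byte trailing block, and scalar stores for the remainder. It assumes a GCC/Clang x86-64 target compiled with -mavx2 and uses compiler intrinsics rather than the MacroAssembler:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Zero 'words' 8-byte words starting at 'base' using unaligned
// 32-byte YMM stores.
static void zero_words_ymm(uint64_t* base, size_t words) {
  const __m256i z = _mm256_setzero_si256();
  while (words >= 8) {                                // 64 bytes per iteration
    _mm256_storeu_si256((__m256i*)(base + 0), z);
    _mm256_storeu_si256((__m256i*)(base + 4), z);
    base  += 8;
    words -= 8;
  }
  if (words >= 4) {                                   // trailing 32 bytes
    _mm256_storeu_si256((__m256i*)base, z);
    base  += 4;
    words -= 4;
  }
  while (words > 0) {                                 // remaining 8-byte words
    *base++ = 0;
    words--;
  }
}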

Comments
Updated changes: http://cr.openjdk.java.net/~kvn/8201193/webrev.02/ Use pxor() instead of vpxor() when AVX is not available (hit an assert during testing). Set the UseXMMForObjInit flag ergonomically in vm_version_x86.cpp.
08-06-2018
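Ergonomic flag setting in vm_version_x86.cpp conventionally follows the pattern sketched below; this shows only the mechanism, and the actual condition chosen in webrev.02 may differ:

// Sketch: enable UseXMMForObjInit by default only when the user has not
// set it explicitly and unaligned SSE2/AVX stores are usable. The real
// ergonomic condition may additionally key off CPU family or features.
if (FLAG_IS_DEFAULT(UseXMMForObjInit) && UseUnalignedLoadStores) {
  FLAG_SET_DEFAULT(UseXMMForObjInit, true);
}

As a product flag it can still be overridden on the command line either way, e.g. java -XX:-UseXMMForObjInit.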

Latest changes: http://cr.openjdk.java.net/~kvn/8201193/webrev.01/
08-06-2018

On 4/5/18 12:19 AM, Rohit Arul Raj wrote:

I was going through the C2 object initialization (zeroing) code based on the below bug entry: https://bugs.openjdk.java.net/browse/JDK-8146801

Right now, for longer lengths we use "rep stos" instructions on x86. I was experimenting with using XMM/YMM registers (on an AMD EPYC processor) and found that they do improve performance for certain lengths:

For lengths > 64 bytes - 512 bytes: improvement is in the range of 8% to 44%.
For lengths > 512 bytes: some lengths show slight improvement in the range of 2% to 7%; others are almost the same as the "rep stos" numbers.

I have attached the complete performance data (data.txt) for reference. Can we add this as a user option similar to UseXMMForArrayCopy? I have used the same test case as in http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar with additional sizes.

Initial patch: I haven't added the check for 32-bit mode as I need some help with the code (description given below the patch). The code is similar to the one used in the array copy stubs (copy_bytes_forward).

diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
--- a/src/hotspot/cpu/x86/globals_x86.hpp
+++ b/src/hotspot/cpu/x86/globals_x86.hpp
@@ -150,6 +150,9 @@
   product(bool, UseUnalignedLoadStores, false,                        \
           "Use SSE2 MOVDQU instruction for Arraycopy")                \
                                                                       \
+  product(bool, UseXMMForObjInit, false,                              \
+          "Use XMM/YMM MOVDQU instruction for Object Initialization") \
+                                                                      \
   product(bool, UseFastStosb, false,                                  \
           "Use fast-string operation for zeroing: rep stosb")         \
                                                                       \
diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -7106,6 +7106,56 @@
   if (UseFastStosb) {
     shlptr(cnt, 3); // convert to number of bytes
     rep_stosb();
+  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
+    Label L_loop, L_sloop, L_check, L_tail, L_end;
+    push(base);
+    if (UseAVX >= 2)
+      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
+    else
+      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
+
+    jmp(L_check);
+
+    BIND(L_loop);
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base,  0), xmm10);
+      vmovdqu(Address(base, 32), xmm10);
+    } else {
+      movdqu(Address(base,  0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+      movdqu(Address(base, 32), xmm10);
+      movdqu(Address(base, 48), xmm10);
+    }
+    addptr(base, 64);
+
+    BIND(L_check);
+    subptr(cnt, 8);
+    jccb(Assembler::greaterEqual, L_loop);
+    addptr(cnt, 4);
+    jccb(Assembler::less, L_tail);
+    // Copy trailing 32 bytes
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base, 0), xmm10);
+    } else {
+      movdqu(Address(base,  0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+    }
+    addptr(base, 32);
+    subptr(cnt, 4);
+
+    BIND(L_tail);
+    addptr(cnt, 4);
+    jccb(Assembler::lessEqual, L_end);
+    decrement(cnt);
+
+    BIND(L_sloop);
+    movptr(Address(base, 0), tmp);
+    addptr(base, 8);
+    decrement(cnt);
+    jccb(Assembler::greaterEqual, L_sloop);
+
+    BIND(L_end);
+    pop(base);
   } else {
     NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
     rep_stos();

When I use XMM0 as a temporary register, the micro-benchmark crashes. Saving and restoring the XMM0 register before and after use works fine. Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the other XMM registers, is listed as a Save-On-Call register, and on the Linux ABI no XMM register is preserved across function calls, though XMM0-XMM7 might hold parameters. So I assumed using XMM0 without saving/restoring should be fine.

Is it incorrect to use XMM* registers without saving/restoring them? Using the XMM10 register as a temporary register works fine without having to save and restore it.

Please let me know your comments.

Regards,
Rohit
06-04-2018
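On the save/restore question above: spilling a caller-saved XMM register to the stack around its use in a MacroAssembler stub looks roughly like the fragment below. This is an illustrative sketch only, not part of the proposed patch, and the surrounding stub code is omitted:

// Sketch: preserve xmm0 across its use as a zeroing scratch register.
subptr(rsp, 16);                        // reserve a 16-byte stack slot
movdqu(Address(rsp, 0), xmm0);          // save xmm0
pxor(xmm0, xmm0);                       // xmm0 := 0
movdqu(Address(base, 0), xmm0);         // example 16-byte zeroing store
movdqu(xmm0, Address(rsp, 0));          // restore xmm0
addptr(rsp, 16);                        // release the stack slot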