JDK-8201193 : Use XMM/YMM for objects initialization
Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 11
Priority: P4
Status: Resolved
Resolution: Fixed
OS: generic
CPU: x86
Submitted: 2018-04-05
Updated: 2019-10-12
Resolved: 2018-06-13
Right now, for longer lengths we use "rep stos" instructions on x86, and for short lengths we use "mov rax" stores. Experiments show that we can use XMM/YMM registers for medium lengths to improve object initialization performance.
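As a rough illustration of the strategy (a standalone C++ sketch using AVX intrinsics, not the HotSpot code; the function and parameter names are hypothetical):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: zero `words` 8-byte words with 32-byte YMM
// stores, the mid-size strategy this RFE proposes for object zeroing.
static void zero_words_ymm(uint64_t* base, size_t words) {
  const __m256i z = _mm256_setzero_si256();
  size_t i = 0;
  for (; i + 8 <= words; i += 8) {   // 64 bytes per iteration
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(base + i),     z);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(base + i + 4), z);
  }
  for (; i < words; i++) {           // scalar 8-byte tail
    base[i] = 0;
  }
}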
Comments
Updated changes http://cr.openjdk.java.net/~kvn/8201193/webrev.02/
Use pxor() instead of vpxor() when AVX is not available (vpxor hit an assert during testing).
Set the UseXMMForObjInit flag ergonomically in vm_version_x86.cpp.
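For context, the guard described above amounts to something like the following MacroAssembler-style sketch (reconstructed from the comment, not the exact committed code):

// vpxor() is VEX-encoded and asserts when AVX is unavailable, so the
// zeroing of the temporary register has to dispatch on UseAVX:
if (UseAVX >= 2) {
  vpxor(xmm10, xmm10, xmm10, AVX_256bit);  // 256-bit AVX2 zero
} else if (UseAVX >= 1) {
  vpxor(xmm10, xmm10, xmm10, AVX_128bit);  // 128-bit AVX zero
} else {
  pxor(xmm10, xmm10);                      // SSE2 zero, no VEX prefix
}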
On 4/5/18 12:19 AM, Rohit Arul Raj wrote:
I was going through the C2 object initialization (zeroing) code based
on the below bug entry:
https://bugs.openjdk.java.net/browse/JDK-8146801
Right now, for longer lengths we use "rep stos" instructions on x86. I
was experimenting with using XMM/YMM registers (on an AMD EPYC processor)
and found that they do improve performance for certain lengths:
For lengths > 64 bytes and <= 512 bytes: the improvement is in the range of 8% to 44%.
For lengths > 512 bytes: some lengths show a slight improvement in the
range of 2% to 7%; others are almost the same as the "rep stos" numbers.
I have attached the complete performance data (data.txt) for reference.
Can we add this as a user option, similar to UseXMMForArrayCopy?
I have used the same test case as in
(http://cr.openjdk.java.net/~shade/8146801/benchmarks.jar) with
additional sizes.
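For reference, once the proposed flag exists it would be toggled like UseXMMForArrayCopy, e.g. (hypothetical invocation; benchmark options omitted):

java -XX:+UseXMMForObjInit -jar benchmarks.jar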
Initial Patch:
I haven't added the check for 32-bit mode as I need some help with the
code (description given below the patch).
The code is similar to the one used in array copy stubs (copy_bytes_forward).
diff --git a/src/hotspot/cpu/x86/globals_x86.hpp b/src/hotspot/cpu/x86/globals_x86.hpp
--- a/src/hotspot/cpu/x86/globals_x86.hpp
+++ b/src/hotspot/cpu/x86/globals_x86.hpp
@@ -150,6 +150,9 @@
   product(bool, UseUnalignedLoadStores, false, \
           "Use SSE2 MOVDQU instruction for Arraycopy") \
 \
+  product(bool, UseXMMForObjInit, false, \
+          "Use XMM/YMM MOVDQU instruction for Object Initialization") \
+ \
   product(bool, UseFastStosb, false, \
           "Use fast-string operation for zeroing: rep stosb") \
 \
diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -7106,6 +7106,56 @@
   if (UseFastStosb) {
     shlptr(cnt, 3); // convert to number of bytes
     rep_stosb();
+  } else if (UseXMMForObjInit && UseUnalignedLoadStores) {
+    Label L_loop, L_sloop, L_check, L_tail, L_end;
+    push(base);
+    if (UseAVX >= 2)
+      vpxor(xmm10, xmm10, xmm10, AVX_256bit);
+    else
+      vpxor(xmm10, xmm10, xmm10, AVX_128bit);
+
+    jmp(L_check);
+
+    BIND(L_loop);
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base, 0), xmm10);
+      vmovdqu(Address(base, 32), xmm10);
+    } else {
+      movdqu(Address(base, 0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+      movdqu(Address(base, 32), xmm10);
+      movdqu(Address(base, 48), xmm10);
+    }
+    addptr(base, 64);
+
+    BIND(L_check);
+    subptr(cnt, 8);
+    jccb(Assembler::greaterEqual, L_loop);
+    addptr(cnt, 4);
+    jccb(Assembler::less, L_tail);
+    // Copy trailing 32 bytes
+    if (UseAVX >= 2) {
+      vmovdqu(Address(base, 0), xmm10);
+    } else {
+      movdqu(Address(base, 0), xmm10);
+      movdqu(Address(base, 16), xmm10);
+    }
+    addptr(base, 32);
+    subptr(cnt, 4);
+
+    BIND(L_tail);
+    addptr(cnt, 4);
+    jccb(Assembler::lessEqual, L_end);
+    decrement(cnt);
+
+    BIND(L_sloop);
+    movptr(Address(base, 0), tmp);
+    addptr(base, 8);
+    decrement(cnt);
+    jccb(Assembler::greaterEqual, L_sloop);
+
+    BIND(L_end);
+    pop(base);
   } else {
     NOT_LP64(shlptr(cnt, 1);) // convert to number of 32-bit words for 32-bit VM
     rep_stos();
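To make the emitted control flow easier to follow, here is an equivalent C++ sketch of the patched path (cnt is in 8-byte words, as in the surrounding clear_mem code; an illustration only, not the generated code):

// Equivalent of the labels above: L_loop/L_check clear 64 bytes per
// pass, then up to 32 trailing bytes, then an 8-byte scalar tail.
static void clear_mem_sketch(uint64_t* base, intptr_t cnt) {
  cnt -= 8;
  while (cnt >= 0) {              // L_loop: two YMM (or four XMM) stores
    for (int i = 0; i < 8; i++) base[i] = 0;
    base += 8;
    cnt -= 8;
  }
  cnt += 4;
  if (cnt >= 0) {                 // trailing 32 bytes
    for (int i = 0; i < 4; i++) base[i] = 0;
    base += 4;
    cnt -= 4;
  }
  cnt += 4;
  while (cnt-- > 0) {             // L_sloop: 8-byte scalar stores
    *base++ = 0;
  }
}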
When I use XMM0 as a temporary register, the micro-benchmark crashes.
Saving and restoring the XMM0 register before and after use works fine.
Looking at the "hotspot/src/cpu/x86/vm/x86.ad" file, XMM0, like the
other XMM registers, is listed as a Save-On-Call register, and on the
Linux ABI no register is preserved across function calls, though
XMM0-XMM7 might hold parameters. So I assumed that using XMM0 without
saving/restoring it should be fine.
Is it incorrect to use XMM* registers without saving/restoring them?
Using the XMM10 register as a temporary works fine without having
to save and restore it.
Please let me know your comments.
Regards,
Rohit