JDK-8218192 : Remove copy constructor for MemRegion
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 11,12,13
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2019-02-01
  • Updated: 2019-12-04
  • Resolved: 2019-02-14
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 11 JDK 12 JDK 13
11.0.3Fixed 12.0.2Fixed 13 b09Fixed
Related Reports
Relates :  
Description
We observed a 5% performance regression comparing Clang-built and GCC-built HotSpot on Google's production machine with the jython benchmark in DaCapo. We identified the root cause is that LLVM's SLP vectorizer (https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer) compiles G1BarrierSet::write_region() and G1BarrierSet::write_ref_array_work() methods with SSE instructions movups and movaps for passing the parameter "MemRegion mr" to G1BarrierSet::invalidate(). However, the data for the SSE move instructions is likely not aligned, resulting in the poor performance.

Although LLVM's SLP vectorizer can be turned off with -fno-slp-vectorize, we don't think it is desirable as it may cause other performance regression with Clang. We think it is reasonable to just pass the MemRegion object by a const reference, which avoids unnecessary data movement and vectorization.

Below are performance numbers with this patch. Experiments were done with 15 trials, and the variances for each config are within 0.5%.
Clang version: trunk r351319
GCC version: 4.9

	                  GCC-default	GCC-passByRef	Clang-default	Clang-passByRef
Execution Time (ms):	12151.4	12078.7	12532.8	11957.2
Process CPU Time (ms):	12167.3	12086.7	12543.3	11975.3

Update:
Based on suggestion from Kim Barrett below, we think it is better to remove the copy constructor. Latest performance numbers are attach in the HTML file.
Comments
This should also go to 12u, right? If so, please amend Fix Request and tags to include 12u.
20-02-2019

Fix Request: This fix could improve G1 write barrier performance on some hardware for Clang-built HotSpot. The risk is minimal.
19-02-2019

I attached a more comprehensive results including 6 configs and all runnable DaCapo benchmarks. Removing the copy constructor seems having similar result as passing by ref. It resolves the jython regression, and improves YG pause time for lusearch. I'm going to change this RFE's title and remove copy constructor instead. Note that xalan in the GCC config seems having a regression with removing copy constructor or pass-by-ref. I looked at the logs and it is because the JVM detected a different and variable number of available processors during the runs for xalan for GoogGcc-default. It is a problem with our benchmarking infrastructure, so probably not a concern.
05-02-2019

Good suggestion! Yes, removing the user defined copy constructor seems to eliminate the movups and movaps instructions in Clang-built code. I will also run some experiments later but I'm optimistic it will resolve the performance issue. Note that the performance regression is highly dependent on the hardware. I couldn't reproduce the original regression on a non-production desktop machine. The production machine showing the regression uses Intel Haswell CPU.
02-02-2019

I don't know if it will make a difference, but it might be interesting to compare with MemRegion's user defined copy constructor removed, so that it is trivially copyable. That UD copy constructor isn't functionally different from the default trivial copier. The presence of the UD copy constructor may affect which part of the ABI calling convention is applicable. [Later: I'm doing some performance comparisons with that change.]
02-02-2019