We observed a 5% performance regression when comparing Clang-built HotSpot with GCC-built HotSpot on Google's production machine, using the jython benchmark in DaCapo. We identified the root cause: LLVM's SLP vectorizer (https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer) compiles the G1BarrierSet::write_region() and G1BarrierSet::write_ref_array_work() methods using the SSE instructions movups and movaps to pass the parameter "MemRegion mr" to G1BarrierSet::invalidate(). The data for these SSE move instructions is likely not aligned, which results in the poor performance.
Although LLVM's SLP vectorizer can be turned off with -fno-slp-vectorize, we don't think that is desirable, as it may cause other performance regressions with Clang. We think it is reasonable to simply pass the MemRegion object by const reference, which avoids the unnecessary data movement and the vectorization.
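The difference between the two calling styles can be sketched as follows. This is a minimal illustration and not the HotSpot code: the Region class is a hypothetical stand-in for MemRegion (a start address plus a length in words), and invalidate_by_value() and invalidate_by_ref() are made-up names that only show the shape of the signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for MemRegion: a start address plus a length in
// words. The user-provided copy constructor mirrors the situation described
// in this report; it is not copied from the HotSpot sources.
class Region {
 public:
  Region(std::uintptr_t* start, std::size_t word_size)
      : _start(start), _word_size(word_size) {}
  Region(const Region& other)
      : _start(other._start), _word_size(other._word_size) {}

  std::uintptr_t* start() const { return _start; }
  std::size_t word_size() const { return _word_size; }

 private:
  std::uintptr_t* _start;
  std::size_t _word_size;
};

// Before: each call copy-constructs the two-word object for the callee.
std::size_t invalidate_by_value(Region mr) {
  assert(mr.start() != nullptr);
  return mr.word_size();
}

// After: the callee reads the caller's object through a const reference,
// so no copy of the two fields is made at the call site.
std::size_t invalidate_by_ref(const Region& mr) {
  assert(mr.start() != nullptr);
  return mr.word_size();
}
```

Both functions return the same result for the same region; only the way the argument reaches the callee differs.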
Below are performance numbers with this patch. Each configuration was run for 15 trials, and the variance for each configuration is within 0.5%.
Clang version: trunk r351319
GCC version: 4.9
                        GCC-default  GCC-passByRef  Clang-default  Clang-passByRef
Execution Time (ms):        12151.4        12078.7        12532.8          11957.2
Process CPU Time (ms):      12167.3        12086.7        12543.3          11975.3
Update:
Based on the suggestion from Kim Barrett below, we think it is better to remove the copy constructor. The latest performance numbers are attached in the HTML file.
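One way to see why the copy constructor matters is the sketch below. It assumes the Itanium C++ ABI that GCC and Clang follow on Linux x86-64, where, as far as we understand it, a class with a user-provided copy constructor is passed through memory, while a trivially copyable two-word class can be passed in registers. RegionWithCopyCtor, RegionPlain, and region_words() are hypothetical names used only for illustration; they are not MemRegion itself.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical two-word class with a user-provided copy constructor.
struct RegionWithCopyCtor {
  void* start;
  std::size_t word_size;
  RegionWithCopyCtor(void* s, std::size_t n) : start(s), word_size(n) {}
  RegionWithCopyCtor(const RegionWithCopyCtor& o)
      : start(o.start), word_size(o.word_size) {}
};

// The same layout, relying on the implicitly generated copy constructor.
struct RegionPlain {
  void* start;
  std::size_t word_size;
  RegionPlain(void* s, std::size_t n) : start(s), word_size(n) {}
};

// A user-provided copy constructor makes the type non-trivially copyable,
// even though it does exactly what the implicit one would do.
static_assert(!std::is_trivially_copyable<RegionWithCopyCtor>::value,
              "user-provided copy constructor is non-trivial");
static_assert(std::is_trivially_copyable<RegionPlain>::value,
              "implicit copy constructor is trivial");

// Pass-by-value of the trivially copyable variant.
std::size_t region_words(RegionPlain r) {
  assert(r.start != nullptr);
  return r.word_size;
}
```

The static_asserts are checked at compile time, so the block documents the property without any runtime cost.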