JDK-8328138 : Optimize ArrayEquals on AArch64 & fix potential crash
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • CPU: aarch64
  • Submitted: 2024-03-14
  • Updated: 2024-07-23
  • Resolved: 2024-07-23
Related Reports
Relates :  
Relates :  
Description
The current implementation of ArrayEquals on AArch64 is quite complex because of its many checks for alignment, tail processing, bus locking and so on. Modern Arm processors have eased such concerns. In addition, we found a crash in array equals when Lilliput is used together with JDK-8139457. We therefore propose a simple, straightforward flow for ArrayEquals.
With this simplified ArrayEquals, we observed performance gains on recent Arm platforms (Neoverse N1 and N2). In the tables below, positive values are improvements and negative values are regressions.
Test case: org.openjdk.bench.java.util.ArraysEquals

1x vector length, 64-bit aligned array[0]
| Test Case              |    N1     |    N2     |
| testByteFalseBeginning | -21.42%   | -13.37%   |
| testByteFalseEnd       |  25.79%   |  27.45%   |
| testByteFalseMid       |  16.64%   |  16.46%   |
| testByteTrue           |  12.39%   |  24.66%   |
| testCharFalseBeginning |  -5.27%   |  -3.08%   |
| testCharFalseEnd       |  29.29%   |  35.23%   |
| testCharFalseMid       |  15.13%   |  19.34%   |
| testCharTrue           |  21.63%   |  33.73%   |
| Total                  |  11.77%   |  17.55%   |

A key factor is deciding when to use SIMD in array equals. An aggressive choice is to enable SIMD as soon as the array length exceeds the vector length (8 words). The table above shows the result of that choice: both FalseBeginning cases (testByteFalseBeginning and testCharFalseBeginning) regress. To avoid this performance impact, we can set the SIMD threshold to 3x the vector length.
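
The threshold choice can be illustrated with a small Java model. This is only a sketch, not the code from the patch: the class and method names are hypothetical, one "vector length" is assumed to be 8 words of 8 bytes, and a block-wise `Arrays.equals` call stands in for the SIMD loop that the real implementation emits as AArch64 assembly.

```java
import java.util.Arrays;

// Hypothetical model of the SIMD-threshold decision; not the HotSpot implementation.
final class SimdThresholdSketch {
    // Assumption: one "vector length" is 8 words of 8 bytes each.
    static final int VECTOR_BYTES = 8 * Long.BYTES;

    // True when the array is long enough to take the wide (SIMD-like) path.
    // multiplier = 1 models the aggressive choice, multiplier = 3 the conservative one.
    static boolean useWidePath(int lengthInBytes, int multiplier) {
        return lengthInBytes > multiplier * VECTOR_BYTES;
    }

    static boolean equalsSketch(byte[] a, byte[] b, int multiplier) {
        if (a == b) return true;
        if (a == null || b == null || a.length != b.length) return false;
        int n = a.length;
        int i = 0;
        if (useWidePath(n, multiplier)) {
            // Stand-in for the SIMD loop: compare one full block per iteration.
            for (; i + VECTOR_BYTES <= n; i += VECTOR_BYTES) {
                if (!Arrays.equals(a, i, i + VECTOR_BYTES, b, i, i + VECTOR_BYTES)) {
                    return false;
                }
            }
        }
        // Short arrays and the remaining tail are compared element by element.
        for (; i < n; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] x = new byte[100];
        byte[] y = new byte[100];
        System.out.println("1x threshold takes wide path: " + useWidePath(x.length, 1));
        System.out.println("3x threshold takes wide path: " + useWidePath(x.length, 3));
        System.out.println("arrays equal: " + equalsSketch(x, y, 3));
    }
}
```

In this model a 100-byte array takes the wide path with a 1x threshold but stays on the element-wise path with a 3x threshold, which is the kind of short input where an early mismatch makes the SIMD setup cost visible.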

3x vector length, 64-bit aligned array[0]
| Test Case              |   N1    |   N2    |
| testByteFalseBeginning |  8.28%  |  8.64%  |
| testByteFalseEnd       |  6.38%  | 12.29%  |
| testByteFalseMid       |  6.17%  |  7.96%  |
| testByteTrue           | -10.08% |  3.06%  |
| testCharFalseBeginning | -1.42%  |  7.23%  |
| testCharFalseEnd       |  4.05%  | 13.48%  |
| testCharFalseMid       |  8.79%  | 16.96%  |
| testCharTrue           | -5.66%  | 10.23%  |
| Total                  |  2.06%  |  9.98%  |


In addition to the performance improvement, this patch resolves an alignment issue in array equals. JDK-8139457 relaxes the alignment of array elements. With that relaxed alignment, reading the whole last word in array equals becomes an error when the array does not occupy the whole word and Lilliput is enabled. A detailed explanation, quoted from https://github.com/openjdk/jdk/pull/11044#issuecomment-1996771480:

> The root cause is that default behavior of MacroAssembler::arrays_equals will blindly load whole word before comparison. When the array[0] is aligned to 32-bit, the last word load will exceed the array limit and may touch the next word beyond object layout in heap memory. If the next word which doesn't belong to object self happens to be the boundary of pages and G1 heap regions, the segmentation fault will be triggered. Loading the last word blindly is benign for 64-bit aligned array because it is always inside the object self.
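
The arithmetic behind the quoted explanation can be checked with a short Java model. This is a simplified sketch under stated assumptions rather than a description of the actual stub: it assumes 8-byte object alignment, assumes the comparison reads whole 8-byte words stepping forward from array[0], and uses hypothetical base offsets of 16 bytes (64-bit aligned) and 12 bytes (32-bit aligned). The class and method names are illustrative only.

```java
// Hypothetical model of whole-word tail loads; not the HotSpot implementation.
final class TailOverreadSketch {
    static final int WORD = 8; // bytes per load, and the assumed object alignment

    static long alignUp(long value, long alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    // Number of bytes the last whole-word load reads beyond the end of the object.
    // baseOffset: offset of array[0] from the object start (assumed values, see above).
    // payloadBytes: array length in elements multiplied by the element size.
    static long overreadBytes(long baseOffset, long payloadBytes) {
        if (payloadBytes == 0) return 0;
        long objectEnd = alignUp(baseOffset + payloadBytes, WORD);
        long wordsLoaded = (payloadBytes + WORD - 1) / WORD;
        long loadEnd = baseOffset + wordsLoaded * WORD;
        return Math.max(0, loadEnd - objectEnd);
    }

    public static void main(String[] args) {
        for (long payload = 1; payload <= 8; payload++) {
            System.out.println("payload=" + payload
                    + " base=16 overread=" + overreadBytes(16, payload)
                    + " base=12 overread=" + overreadBytes(12, payload));
        }
    }
}
```

Under these assumptions a 64-bit aligned base never reads past the object, because the last load ends exactly at the padded object end. A 32-bit aligned base can read 4 bytes past it whenever the final partial word holds 1 to 4 payload bytes, and those 4 bytes may fall on the next page or heap region.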

Our patch fixes this problem. Below we again show the performance improvement, this time for the misaligned case.

1x vector length, 32-bit aligned array[0]
| Test Case              |    N1    |    N2    |
| testByteFalseBeginning | -12.96%  | -17.50%  |
| testByteFalseEnd       |  29.43%  |  32.19%  |
| testByteFalseMid       |  24.30%  |  17.54%  |
| testByteTrue           |  16.57%  |  24.40%  |
| testCharFalseBeginning |   2.40%  |   0.60%  |
| testCharFalseEnd       |  32.14%  |  32.94%  |
| testCharFalseMid       |  18.86%  |  17.60%  |
| testCharTrue           |  25.38%  |  32.62%  |
| Total                  |  17.01%  |  17.54%  |


3x vector length, 32-bit aligned array[0]
| Test Case              |    N1    |    N2    |
| testByteFalseBeginning |  10.95%  |  14.23%  |
| testByteFalseEnd       |  13.83%  |  12.35%  |
| testByteFalseMid       |  11.18%  |  10.31%  |
| testByteTrue           |  -7.13%  |   6.61%  |
| testCharFalseBeginning |   0.38%  |   7.20%  |
| testCharFalseEnd       |   2.84%  |  13.13%  |
| testCharFalseMid       |  11.17%  |  15.40%  |
| testCharTrue           |  -5.43%  |   9.52%  |
| Total                  |   4.72%  |  11.09%  |

Comments
Runtime Triage: This is not on our current list of priorities. We will consider this feature if we receive additional customer requirements.
23-07-2024

The implementation here serves as a prototype. Anyone interested is welcome to measure the performance gain by turning UseNewCode on and off.
14-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/18292 Date: 2024-03-14 06:21:56 +0000
14-03-2024