Bug ID: JDK-8189100 Improve performance of String and Array operations on AArch64

Type: JEP
Component: hotspot
Sub-Component: compiler

Priority: P3
Status: Closed
Resolution: Withdrawn

Submitted: 2017-10-10
Updated: 2019-12-05
Resolved: 2017-11-16

Summary
------

Improve the performance of the AArch64 OpenJDK port's intrinsics for load/store-heavy operations, such as the String and Array intrinsics.

Non-Goals
------

- Compare to and match the performance of other architectures for optimized operations.
- Tune generic AArch64 port intrinsics for optimal performance on only a single implementation of the AArch64 architecture.
- Port the intrinsics to the ARM CPU code branch.

Motivation
------

Specialized CPU architecture-specific code patterns improve the performance of user applications and benchmarks.

Description
------

Intrinsics leverage CPU architecture-specific assembly code, which is executed instead of generic Java code for a given method to improve performance. While most of the intrinsics are already implemented in the AArch64 OpenJDK port, the current implementation of some of them may not be optimal. Specifically, some intrinsics for AArch64 architectures may benefit from software prefetching instructions, memory address alignment, instruction placement for multi-pipeline CPUs, replacement of certain instruction patterns with faster ones, or the use of SIMD instructions.

This includes (but is not limited to) such typical operations as String::compareTo, String::indexOf, StringCoding::hasNegatives, Arrays::equals, StringUTF16::compress, StringLatin1::inflate and checksum calculations.
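For orientation, the following minimal sketch shows how a few of these operations look from the Java side. The class and method names (`IntrinsicCandidates`, `compare`, `find`, `sameBytes`) are hypothetical and exist only for illustration. Whether a given call is actually replaced by an intrinsic depends on the JVM, the platform and whether the JIT compiler has compiled the calling code.

```java
import java.util.Arrays;

public class IntrinsicCandidates {
    // Each method is an ordinary Java call. On a JVM that provides a matching
    // intrinsic, the JIT compiler may substitute architecture-specific code.
    static int compare(String a, String b) {
        return a.compareTo(b);          // String::compareTo
    }

    static int find(String haystack, String needle) {
        return haystack.indexOf(needle); // String::indexOf
    }

    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);      // Arrays::equals
    }

    public static void main(String[] args) {
        System.out.println(compare("aarch64", "aarch32"));
        System.out.println(find("load/store", "store"));
        System.out.println(sameBytes(new byte[] {1, 2, 3}, new byte[] {1, 2, 3}));
    }
}
```

The Java source is identical on every platform; only the machine code chosen by the JIT compiler differs, which is why changes to the intrinsics are invisible to application code.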

Depending on the intrinsic algorithm, the most common use case for the intrinsic, and CPU specifics, the following changes may be considered:

- Use the ARM NEON instruction set. Any such code will be placed under a flag (such as the UseSIMDForMemoryOps flag) if the existing algorithm has a non-NEON version.
- Use the prefetch hint instruction (PRFM). The effect of this instruction depends on various factors, such as the presence of a CPU hardware prefetcher and its capabilities, the CPU/memory clock ratio, memory controller specifics and the needs of the particular algorithm.
- Reorder instructions and reduce data dependencies to allow out-of-order execution where possible.
- Avoid unaligned memory access where needed. Some CPU implementations incur penalties when issuing a load/store across a 16-byte boundary or a dcache_line boundary, or have different optimal alignments for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have such penalties, provided it does not significantly increase code complexity.
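The idea common to several of these items, doing more work per load and keeping a simple tail loop for the remainder, can be sketched in plain Java. The example below is only an illustration of the technique applied to a `hasNegatives`-style check; it is not the HotSpot implementation, which is written in AArch64 assembly, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HasNegativesSketch {
    // One bit set in the sign position of each of the eight bytes in a long.
    private static final long SIGN_BITS = 0x8080808080808080L;

    // Reference version: one byte is loaded and tested per iteration.
    static boolean hasNegativesScalar(byte[] a, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (a[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Wide version: eight bytes are loaded and tested per iteration,
    // then the remaining (fewer than eight) bytes are handled one by one.
    static boolean hasNegativesWide(byte[] a, int off, int len) {
        // Byte order does not matter here because the mask is symmetric.
        ByteBuffer buf = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        int i = off;
        int end = off + len;
        for (; i + 8 <= end; i += 8) {
            if ((buf.getLong(i) & SIGN_BITS) != 0) {
                return true;
            }
        }
        for (; i < end; i++) {
            if (a[i] < 0) {
                return true;
            }
        }
        return false;
    }
}
```

An assembly intrinsic can take the same structure further, for example by using wider NEON loads, issuing PRFM ahead of the main loop, or aligning the start address before entering it. Whether each step pays off depends on the CPU, as noted in the list above.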

Testing
------

- Revised intrinsics performance will be tested on Cortex A53 and Cavium ThunderX hardware using JMH benchmarks and SPECjvm2005 where applicable.
- Functional correctness will be tested using the jtreg test suite. Additional tests might be created if the existing test base does not provide sufficient coverage.
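As an illustration of the kind of additional functional test mentioned above, the sketch below compares `Arrays.equals` against a naive byte-by-byte loop for every length up to a limit and for a mismatch at every position. It is a hypothetical standalone example, not an existing jtreg test, and the class and method names are assumptions. Note that it does not by itself guarantee that the intrinsic path is exercised, since that depends on the JIT compiler having compiled the calling code.

```java
import java.util.Arrays;
import java.util.Random;

public class ArrayEqualsCheck {
    // Straightforward reference implementation to compare against.
    static boolean naiveEquals(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns how many times Arrays.equals disagreed with the reference.
    static int countDisagreements(int maxLen, long seed) {
        Random rnd = new Random(seed);
        int failures = 0;
        for (int len = 0; len <= maxLen; len++) {
            byte[] a = new byte[len];
            rnd.nextBytes(a);

            // Identical contents.
            byte[] same = a.clone();
            if (Arrays.equals(a, same) != naiveEquals(a, same)) {
                failures++;
            }

            // A single differing byte at each possible position, so that
            // head, main-loop and tail handling are all covered.
            for (int pos = 0; pos < len; pos++) {
                byte[] diff = a.clone();
                diff[pos] ^= 1;
                if (Arrays.equals(a, diff) != naiveEquals(a, diff)) {
                    failures++;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements(130, 42L));
    }
}
```

Sweeping every length and mismatch position is a deliberate choice: intrinsics that process data in wide chunks tend to fail at chunk boundaries and in the tail, which a handful of fixed-size inputs would miss.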

Risks and Assumptions
------

- It is not possible to perform testing and measurements on all AArch64 hardware variants. We will rely on the OpenJDK community to test, when patches are submitted for review and should they find it necessary, on hardware from vendors that we do not currently have in-house.
- Efforts will be made to improve the performance of a generic AArch64 port intrinsic implementation. In cases where this is not possible, specialized versions of intrinsics for a given hardware vendor may need to be written.
- The intrinsics that are in the scope of this JEP are CPU architecture-specific, and changing them does not affect shared HotSpot code.

[~mr] can you review this JEP?
13-11-2017
Okay.
17-10-2017
[~kvn], we measured performance improvement for some of these intrinsics and attached the preliminary performance results to enhancements linked in this JEP. For example: for array_equals, for large arrays, we improved the performance up to x6 on a system without hardware prefetching and up to x1.5 on a system with hardware prefetcher; for string_compare, the case of small arrays is 10-20% faster, and for large arrays it is up to x4 faster on a system without hardware prefetcher and up to x1.8 faster on a system with hardware prefetcher. I'm currently working on further intrinsics and observe similar numbers. Hope this gives you an idea of what ballpark performance improvement we are expecting. You can find further details in the enhancements.
12-10-2017
[~dpochepk] Do you have any data which shows improvement with these changes? You will increase complexity of the code but what if benefits are not significant?
10-10-2017

Relates :	JDK-8184943 - AARCH64: Intrinsify hasNegatives
Relates :	JDK-8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays
Relates :	JDK-8189113 - AARCH64: StringLatin1 inflate intrinsic doesn't use prefetch instruction
Relates :	JDK-8189103 - AARCH64: optimize String indexOf intrinsic
Relates :	JDK-8189112 - AARCH64: optimize StringUTF16 compress intrinsic
Relates :	JDK-8189101 - AARCH32 - 'minimal' build fails because CMS bits are referred unconditionally