JDK-8189104 : JEP 315: Improve Aarch64 Intrinsics
  • Type: JEP
  • Component: hotspot
  • Sub-Component: compiler
  • Priority: P3
  • Status: Closed
  • Resolution: Delivered
  • Fix Versions: 11
  • Submitted: 2017-10-10
  • Updated: 2018-09-10
  • Resolved: 2018-09-10
Description
Summary
-------

Improve the existing string and array intrinsics, and implement new intrinsics for the `java.lang.Math` sin, cos and log functions, on AArch64 processors.

Non-Goals
---------

 - Compare to and match the performance of other architectures
 - Tune generic AArch64 port intrinsics for optimal performance on a single ARM64 architecture implementation only
 - Port intrinsics to the ARM CPU port

Motivation
----------

Specialized CPU architecture-specific code patterns improve the performance of user applications and benchmarks.

Description
-----------

Intrinsics replace the generic Java code of a given method with CPU architecture-specific assembly code in order to improve performance. While most intrinsics are already implemented in the AArch64 port, optimized intrinsics for the following `java.lang.Math` methods are still missing:

 - sin (sine trigonometric function)
 - cos (cosine trigonometric function)
 - log (logarithm of a number)

This JEP is intended to cover this gap by implementing optimized intrinsics for these methods.
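
As a minimal sketch of the kind of Java code these intrinsics target, the class below calls the three methods in a loop. The class and method names (`MathIntrinsicDemo`, `evaluate`, `sumMath`) are hypothetical and chosen only for illustration; whether the calls are actually replaced by intrinsics depends on the JVM, its flags, and the platform.

```java
// Illustrative sketch only: class and method names are hypothetical.
final class MathIntrinsicDemo {

    // Evaluates the three java.lang.Math methods named above for one input.
    static double evaluate(double x) {
        return Math.sin(x) + Math.cos(x) + Math.log(x);
    }

    // Accumulates evaluate() over n positive inputs. In a hot loop like this,
    // the JIT compiler may substitute platform-specific intrinsics for the
    // Math.sin, Math.cos and Math.log calls.
    static double sumMath(int n) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++) {
            acc += evaluate(i * 0.001);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sumMath(1_000_000));
    }
}
```

The Java source is the same on every platform; only the machine code that the JVM selects for these calls changes.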

At the same time, the current implementation of some existing intrinsics in the AArch64 port may not be optimal. Specifically, some intrinsics for AArch64 architectures may benefit from software prefetching instructions, memory address alignment, instruction placement for multi-pipeline CPUs, and the replacement of certain instruction patterns with faster ones or with SIMD instructions.

This includes (but is not limited to) such typical operations as `String::compareTo`, `String::indexOf`, `StringCoding::hasNegatives`, `Arrays::equals`, `StringUTF16::compress`, `StringLatin1::inflate`, and various checksum calculations.
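
For orientation, the sketch below shows public API calls that correspond to some of these operations. `String::compareTo`, `String::indexOf` and `Arrays::equals` are called directly. The remaining methods are JDK-internal, and the assumption made here (not stated in this JEP) is that converting between `char[]` and `String` and decoding bytes may reach `StringUTF16::compress`, `StringLatin1::inflate` and `StringCoding::hasNegatives` on JVMs with compact strings enabled. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical.
final class StringIntrinsicDemo {

    // Direct call to String::compareTo.
    static int compare(String a, String b) {
        return a.compareTo(b);
    }

    // Direct call to String::indexOf.
    static int find(String haystack, String needle) {
        return haystack.indexOf(needle);
    }

    // Direct call to Arrays::equals on byte arrays.
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    // Assumption: with compact strings, toCharArray() may inflate Latin-1
    // bytes to chars, and new String(char[]) may compress them back.
    static String roundTrip(String s) {
        char[] chars = s.toCharArray();
        return new String(chars);
    }

    // Assumption: decoding bytes may involve a check for negative (non-ASCII)
    // byte values inside the JDK.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(compare("abc", "abd"));
        System.out.println(find("hello world", "world"));
        System.out.println(roundTrip("intrinsics"));
    }
}
```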

Depending on the intrinsic algorithm, the most common intrinsic use case, and CPU specifics, the following changes may be considered:

 - Use the ARM NEON instruction set. Such code, if any is created, will be placed under a flag (such as `UseSIMDForMemoryOps`) in case the existing algorithm has a non-NEON version.
 - Use the prefetch-hint instruction (PRFM). The effect of this instruction depends on various factors such as the presence of a CPU hardware prefetcher and its capabilities, the CPU/memory clock ratio, memory controller specifics, and particular algorithm needs.
 - Reorder instructions and reduce data dependencies to allow out-of-order execution where possible.
 - Avoid unaligned memory access if needed. Some CPU implementations impose penalties when issuing load/store instructions across a 16-byte boundary, a dcache-line boundary, or have different optimal alignment for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have some penalties, provided it does not significantly increase code complexity.
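
The idea of replacing an instruction pattern with a wider one can be illustrated in plain Java. The sketch below is not the AArch64 stub code and makes no claim about how the port implements `StringCoding::hasNegatives`; it only contrasts a byte-at-a-time check with one that tests eight bytes per step by masking the sign bit of each byte in a 64-bit word. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: plain Java, not the AArch64 assembly.
final class HasNegativesSketch {

    // Sign bit of each of the eight bytes packed into a long.
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Reference version: inspects one byte per iteration.
    static boolean hasNegativesSimple(byte[] a) {
        for (byte b : a) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    // Wide version: inspects eight bytes per iteration, then finishes the
    // remaining tail (fewer than eight bytes) one byte at a time.
    static boolean hasNegativesWide(byte[] a) {
        ByteBuffer buf = ByteBuffer.wrap(a);
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            // Byte order does not matter: the mask covers every byte lane.
            if ((buf.getLong(i) & HIGH_BITS) != 0) {
                return true;
            }
        }
        for (; i < a.length; i++) {
            if (a[i] < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        data[21] = (byte) 0x80;
        System.out.println(hasNegativesSimple(data) == hasNegativesWide(data));
    }
}
```

A SIMD version follows the same principle with wider registers, and the same main-loop-plus-tail structure is where alignment and prefetch decisions would come into play.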


Testing
-------

 - Intrinsics performance will be tested on Cavium ThunderX, ThunderX2 and Cortex A53 hardware using JMH benchmarks.
 - Functional correctness will be tested using the `jtreg` test suite. Additional tests may be created if the existing test base does not provide sufficient coverage.
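
One possible shape for an additional functional check, sketched here in standalone Java rather than as an actual `jtreg` test, is to compare `Math` results against `StrictMath` as a reference and report the worst difference in units in the last place (ulps). The class name, method names, sample range, and any acceptance threshold are assumptions made for illustration and are not part of this JEP.

```java
// Illustrative sketch only: not a jtreg test; names and ranges are hypothetical.
final class MathAccuracyCheck {

    // Distance between two doubles, measured in ulps of the larger operand.
    static double ulpDistance(double actual, double reference) {
        double unit = Math.max(Math.ulp(actual), Math.ulp(reference));
        return Math.abs(actual - reference) / unit;
    }

    // Worst ulp distance between Math and StrictMath for sin, cos and log
    // over the inputs step, 2*step, ..., samples*step (all positive).
    static double worstUlpDistance(int samples, double step) {
        double worst = 0.0;
        for (int i = 1; i <= samples; i++) {
            double x = i * step;
            worst = Math.max(worst, ulpDistance(Math.sin(x), StrictMath.sin(x)));
            worst = Math.max(worst, ulpDistance(Math.cos(x), StrictMath.cos(x)));
            worst = Math.max(worst, ulpDistance(Math.log(x), StrictMath.log(x)));
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("worst ulp distance: " + worstUlpDistance(100_000, 0.001));
    }
}
```

Because both `Math` and `StrictMath` are specified to stay within about one ulp of the exact result for these methods, a difference of more than a few ulps between them would indicate a problem worth investigating.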

Risks and Assumptions
---------------------

 - Efforts will be made to implement optimally-performant generic versions of the AArch64 intrinsics. In cases where this is not possible, specialized versions of the intrinsics for a given hardware vendor may need to be written.
 - It is not possible to perform testing and performance measurements on all AArch64 hardware variants. We will rely on the OpenJDK Community to perform testing on hardware we currently do not have in-house should they find it necessary when patches are submitted for review.
 - The intrinsics in scope for this JEP are CPU architecture-specific, so changing them does not affect shared HotSpot code.

Comments
This work is nearing completion and most of it has already been reviewed. I have therefore moved it to "Proposed to Target" for JDK 11.
25-05-2018

We need to be sure that changes to intrinsics are really valuable and are correct. On more than one occasion over the last year or so, intrinsic improvements have resulted in regressions, so I want the bar to be high for any changes: performance improvements should be clear, repeatable, and significant. Also, I am concerned about the maintenance cost of making the port significantly more complex.
16-11-2017

[~mr] can you review this JEP?
13-11-2017

Okay. Thank you.
12-10-2017

[~kvn], we totally share your concern. The intent of this JEP is to create a generic version that will be widely suitable. This is exactly why we picked two different, publicly available CPUs to verify that our implementation will be generic enough: the entry-level Cortex A53 (which will support the Raspberry Pi 3 embedded community) and the server-class ThunderX (which is currently publicly available through various cloud service providers).
12-10-2017

The cases that have turned up on aarch64 have not been "Use this special version for SuperDuperChipCo", but rather a choice between different generic versions. HW-specific settings can choose one or the other. An example is array copy, where using SIMD registers is a great optimization on some HW and not on others. I think that is what is intended in this JEP (only if necessary).
11-10-2017

I would like the following to be avoided: "If this will not be possible, specialized versions of intrinsics for a given hardware vendor may need to be written." A generic version is preferable even if it is slower on particular hardware.
11-10-2017