JDK-8302736 : Major performance regression in Math.log on aarch64
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 11.0.2, 17, 19.0.2, 21
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: os_x
  • CPU: aarch64
  • Submitted: 2023-02-15
  • Updated: 2024-03-12
  • Resolved: 2023-05-24
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

JDK 17: Fixed in 17.0.9-oracle
JDK 21: Fixed in 21 b25
Description
ADDITIONAL SYSTEM INFORMATION :
aarch64, Apple M1 Max, macOS 13.2.1

A DESCRIPTION OF THE PROBLEM :
Math.log using the generic dlog intrinsic is much slower than StrictMath.log on aarch64.

Caused by JDK-8215133
Related to JDK-8210858

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
javac Main.java
java Main
java -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_dlog Main

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The elapsed time without additional options is less than or equal to the time with the _dlog intrinsic disabled.
ACTUAL -
java Main
6200ms

java -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_dlog Main
860ms

---------- BEGIN SOURCE ----------
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws Exception {
        while (true) {
            final Random random = new Random();
            final double[] values = new double[100_000_000];
            for (int i = 0; i < values.length; i++)
                values[i] = random.nextDouble();

            System.gc();

            final long start = System.nanoTime();

            double blackhole = 0;
            for (int i = 0; i < values.length; i++)
                blackhole += Math.log(values[i]);

            final long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

            System.out.println(elapsed + "ms (" + blackhole + ")");
        }
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Disable the _dlog intrinsic on aarch64 (as -XX:DisableIntrinsic=_dlog does) so that the StrictMath implementation is used.
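The workaround is numerically safe in principle: the `java.lang.Math` specification only requires `log` to be within 1 ulp of the exact result, so `Math.log` and `StrictMath.log` may legally differ by at most a few ulps. A minimal sketch to sanity-check this on a given JDK (assumes a spec-conforming implementation; the 3-ulp comparison bound is a conservative choice of mine, not a documented guarantee):

```java
import java.util.Random;

// Hedged sketch: measures how far Math.log strays from StrictMath.log.
// The Math.log spec allows results within 1 ulp of the exact value, so the
// two implementations may differ, but only by a few ulps.
public class MathVsStrictMath {
    public static void main(String[] args) {
        Random random = new Random(42);
        double maxUlps = 0;
        for (int i = 0; i < 1_000_000; i++) {
            double x = random.nextDouble();
            double m = Math.log(x);
            double s = StrictMath.log(x);
            if (m != s) {
                maxUlps = Math.max(maxUlps, Math.abs(m - s) / Math.ulp(s));
            }
        }
        System.out.println("max observed Math/StrictMath difference: " + maxUlps + " ulps");
        if (maxUlps <= 3.0) {
            System.out.println("within tolerance");
        }
    }
}
```

On JDKs where the intrinsic is disabled, `Math.log` simply delegates to `StrictMath.log` and the observed difference is 0 ulps.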

FREQUENCY : always



Comments
[11u notice] If this is addressed for 11, we should seek a more restricted solution. See also the discussion in the backport PR for 17. Also remember to take along the follow-up fix.
12-03-2024

Fix request [17u] I backport this for parity with 17.0.9-oracle. Medium risk: a rather new change, but small. Affects only macOS on aarch64. I had to resolve and skipped changes that are relevant for Loom. SAP nightly testing passed.
18-07-2023

Sure, but only once I have a review for the change.
18-07-2023

Did you mean to add a fix request tag?
17-07-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1588 Date: 2023-07-17 09:26:21 +0000
17-07-2023

Changeset: 466ec300 Author: Tobias Holenstein <tholenstein@openjdk.org> Date: 2023-05-24 07:29:25 +0000 URL: https://git.openjdk.org/jdk/commit/466ec300fc8e5702553123cf2fa4b0d8c7d552d9
24-05-2023

This is another case of a general bug in the way WX is handled. Instead of flipping WX when needed, there is a general presumption that when we're in VM code we should enable WXWrite. This is an example of temporal coupling, a classic code smell. As long as we insist on trying to maintain this convention, things will continue to break.
10-05-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13606 Date: 2023-04-24 08:10:02 +0000
10-05-2023

Looks pretty good to me. The suggested path is a quick and straightforward fix for the problem, while the more generic fix for the too many LEAF functions that handle WX will take longer and will require some effort to get right. Probably we need to move WXWrite from the LEAF entry down to NativeCall::set_*, where it feels more natural.
24-03-2023

I tested removing the `WXWrite` from `VM_LEAF_BASE`, and then building the VM fails on macOS aarch64 with:

```
Stack: [0x000000016f768000,0x000000016f96b000], sp=0x000000016f968fe0, free space=2051k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x7ccd1c]  NativeCall::set_destination_mt_safe(unsigned char*, bool)+0xc8
V  [libjvm.dylib+0x891e44]  SharedRuntime::fixup_callers_callsite(Method*, unsigned char*)+0x240
v  ~BufferBlob::I2C/C2I adapters 0x000000010a0d12a4
J 7 c1 java.lang.String.hashCode()I java.base (60 bytes) @ 0x0000000102b51f08 [0x0000000102b51d00+0x0000000000000208]
```

`SharedRuntime::fixup_callers_callsite` is `JRT_LEAF` and accesses the code cache via `call->set_destination_mt_safe(entry_point)`; therefore we need `ThreadWXEnable` there. But I tested the math functions without the `WXWrite`, and all tests passed. We could solve this bug like that and investigate which `JRT_LEAF` functions need the `WXWrite` in a different RFE. What do you think?
24-03-2023

Indeed, thanks for the detailed analysis! :) For these math functions, WX switching is not required. Probably in some other cases functions defined with VM_BASE_IMPL do access the CodeCache for writing. I need to check the code to see why it's there. Another approach would be to remove WX management from these functions and leave it only where it is really required.
24-03-2023

Couldn't we defer the WXWrite to the safepoint? And actually, no safepoint should happen when entering via VM_LEAF_BASE, right?
24-03-2023

[~akozlov], [~vkempik], [~burban], what do you think?
24-03-2023

Thanks for this thorough analysis, Toby! Here's an explanation of what "WXWrite" actually does: https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon

I assume that setting WXWrite is required because entering the VM might trigger a safepoint, and that might trigger writing into executable memory:

// JavaThread state should be changed only after taking WXWrite. The state
// change may trigger a safepoint, that would need WXWrite to do bookkeeping
// in the codecache.

I think it's best to hand this over to the macOS/AArch64 Port (JDK-8253795) experts to check if there's anything we can do about the performance of enabling/disabling write protections. If not, we should probably prefer the Java version over the intrinsics.
24-03-2023

All values in ns/op (bold = `Math` slower than `StrictMath`):

| Benchmark | Linux x64 | Linux aarch64 | macOS aarch64 |
| --- | --- | --- | --- |
| Math.exp | 4.996 | 16.444 | **75.032** |
| StrictMath.exp | 10.201 | 17.950 | 6.292 |
| Math.log | 7.106 | 12.406 | **62.073** |
| StrictMath.log | 8.881 | 13.228 | 4.512 |
| Math.log10 | 7.811 | 16.180 | **67.623** |
| StrictMath.log10 | 12.724 | 16.968 | 6.611 |
| Math.pow | 1.958 | **4.693** | **46.393** |
| StrictMath.pow | 2.516 | 3.342 | 2.052 |
| Math.ceil | **2.237** | 0.748 | 0.569 |
| StrictMath.ceil | 1.513 | 1.670 | 0.844 |
| Math.floor | **2.236** | 0.728 | 0.602 |
| StrictMath.floor | 1.051 | 1.336 | 0.771 |
| Math.rint | **2.236** | 0.728 | 0.567 |
| StrictMath.rint | 0.980 | 1.363 | 0.739 |
| Math.sin | **9.597** | 8.264 | 3.738 |
| StrictMath.sin | 8.093 | 14.754 | 8.711 |
| Math.cos | **9.424** | 8.394 | 3.427 |
| StrictMath.cos | 7.602 | 14.055 | 8.102 |
| Math.tan | 14.159 | 23.009 | **80.358** |
| StrictMath.tan | 15.769 | 28.939 | 13.166 |
23-03-2023

The class `java.lang.Math` contains methods for performing basic numeric operations such as the elementary exponential, logarithm, square root, and trigonometric functions. The numeric methods of class `java.lang.StrictMath` are defined to return bit-for-bit identical results on all platforms; the implementations of the equivalent functions in class `java.lang.Math` do not have this requirement. This relaxation permits better-performing implementations where strict reproducibility is not required. By default, most of the `java.lang.Math` methods simply call the equivalent method in `java.lang.StrictMath` for their implementation. Code generators (like C2) are encouraged to use platform-specific native libraries or microprocessor instructions, where available, to provide higher-performance implementations of `java.lang.Math` methods. Such higher-performance implementations must still conform to the specification for `java.lang.Math`.

I ran the JMH benchmarks `org.openjdk.bench.java.lang.StrictMathBench` and `org.openjdk.bench.java.lang.MathBench` on `Linux x64`, `Linux aarch64` and `macOS aarch64`. One would expect `java.lang.Math` to be equally fast or faster than `java.lang.StrictMath`, but this is not always the case: `exp`, `log`, `log10`, `pow` and `tan` in particular are around a factor of 10 slower on `macOS aarch64`.

On `macOS aarch64`, C2 generates `StubRoutines` for `Math.sin` and `Math.cos`, and for `Math.tan`, `Math.exp`, `Math.log`, `Math.pow` and `Math.log10` it emits a call to a C++ function.
This happens in `LibraryCallKit::inline_math_native` with funcAddr `CAST_FROM_FN_PTR(address, SharedRuntime::dsin)`, pointing to the shared runtime functions:

```c++
static jdouble dtan(jdouble x);
static jdouble dlog(jdouble x);
static jdouble dlog10(jdouble x);
static jdouble dexp(jdouble x);
static jdouble dpow(jdouble x, jdouble y);
```

- Which are implemented in `sharedRuntimeTrans.cpp`:

```c++
JRT_LEAF(jdouble, SharedRuntime::dlog10(jdouble x))
  return __ieee754_log10(x);
JRT_END

JRT_LEAF(jdouble, SharedRuntime::dexp(jdouble x))
  return __ieee754_exp(x);
JRT_END

JRT_LEAF(jdouble, SharedRuntime::dpow(jdouble x, jdouble y))
  return __ieee754_pow(x, y);
JRT_END

JRT_LEAF(jdouble, SharedRuntime::dlog(jdouble x))
  return __ieee754_log(x);
JRT_END
```

- And in `sharedRuntimeTrig.cpp`:

```c++
JRT_LEAF(jdouble, SharedRuntime::dtan(jdouble x))
[...]
JRT_END
```

- The `JRT_LEAF` macro:

```c++
#define JRT_LEAF(result_type, header) \
  result_type header {                \
    VM_LEAF_BASE(result_type, header) \
```

- Whereas `VM_LEAF_BASE` is:

```c++
// LEAF routines do not lock, GC or throw exceptions
// On macos/aarch64 we need to maintain the W^X state of the thread. So we
// take WXWrite on the enter to VM from the "outside" world, so the rest of JVM
// code can assume writing (but not executing) codecache is always possible
// without preliminary actions.
// JavaThread state should be changed only after taking WXWrite. The state
// change may trigger a safepoint, that would need WXWrite to do bookkeeping
// in the codecache.
#define VM_LEAF_BASE(result_type, header)                            \
  debug_only(NoHandleMark __hm;)                                     \
  MACOS_AARCH64_ONLY(ThreadWXEnable __wx(WXWrite,                    \
                                         JavaThread::current()));    \
  os::verify_stack_alignment();                                      \
```

- The reason for the 10x slowdown on macOS aarch64 seems to be `WXWrite`.
- Without it, performance is as expected (similar to StrictMath).
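To make the cost model above concrete, here is a minimal runnable Java analogy of what `VM_LEAF_BASE` does per call on macOS/AArch64. This is purely illustrative: the real mechanism is the C++ RAII class `ThreadWXEnable`, which flips the thread's W^X memory protection; `WXGuard` and the toggle counter below are hypothetical names invented for this sketch.

```java
// Illustrative sketch of the scope-based W^X toggle that VM_LEAF_BASE
// performs on macOS/AArch64. Names are hypothetical, not HotSpot API.
public class WXGuardDemo {
    static int toggles = 0; // counts the (expensive) protection flips

    // Analogous to ThreadWXEnable: flip to writable on entry, restore on exit.
    static class WXGuard implements AutoCloseable {
        WXGuard() { toggles++; }                      // enter: make code cache writable
        @Override public void close() { toggles++; } // exit: restore execute-only
    }

    // A leaf math call as currently generated: it pays two flips per
    // invocation even though it never writes to the code cache.
    static double dlogLeaf(double x) {
        try (WXGuard wx = new WXGuard()) {
            return StrictMath.log(x);
        }
    }

    public static void main(String[] args) {
        double sum = 0;
        for (int i = 1; i <= 1000; i++) sum += dlogLeaf(i);
        // 1000 calls -> 2000 protection flips. Removing the guard from the
        // leaf math functions (as the eventual fix does) removes all of them.
        System.out.println("toggles=" + toggles);
    }
}
```

The per-call toggle is cheap in this Java model but expensive on real hardware, which is why it dominates the cost of an otherwise short `__ieee754_log` call.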
23-03-2023

Toby, please have a look.
17-02-2023

ILW = Performance with intrinsic is worse than without, _dlog intrinsic on Mac M1, disable intrinsic = MMM = P3
17-02-2023

The stub for the _dlog intrinsic was disabled by JDK-8215133, which should lead to LibraryCallKit::inline_math_native emitting a direct call to SharedRuntime::dlog -> __ieee754_log. I'm not sure why that one is so slow on Mac M1.
17-02-2023

But I can reproduce this on a Mac M1 machine:

jdk-19.0.2.jdk/Contents/Home/bin/java Main
6267ms (-9.99850495092053E7)
6290ms (-1.000038166303315E8)
6290ms (-1.0001351541276565E8)

jdk-19.0.2.jdk/Contents/Home/bin/java -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_dlog Main
861ms (-9.99948939255735E7)
874ms (-9.999978828501564E7)
872ms (-1.0000045283484954E8)

Same with the latest JDK 21 (21-ea+10-LTS-784) and also with JDK 17.0.7. So this is not a (recent) regression.
17-02-2023

I can not reproduce this on Linux aarch64 (Ampere A1) with JDK 19.0.2:

19.0.2/bin/java Main
1386ms (-1.0000181323473875E8)
1431ms (-9.999069243067198E7)
1431ms (-1.0000593490621991E8)
1431ms (-1.000068413622054E8)

jdk-19.0.2/bin/java -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_dlog Main
1923ms (-1.0000122097311078E8)
1924ms (-1.000130371487033E8)
1925ms (-1.0000553223919515E8)
1923ms (-9.999498701601629E7)
17-02-2023

The issue is not reproduced on Windows.
OS: Windows 10 (x64), JDK 19.0.2: Pass. The elapsed time without additional options is less than the time with the _dlog intrinsic disabled.
The description mentions that the issue is only reproducible on Mac M1 aarch64; moving it to the dev team for further analysis.
ILW = issue on Mac M1 aarch64, reproducible with a single test, no workaround available = MLM = P4
17-02-2023