JDK-8362193 : Re-work MacOS/AArch64 SpinPause to handle SB
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 26
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: os_x
  • CPU: aarch64
  • Submitted: 2025-07-14
  • Updated: 2025-07-25
  • Resolved: 2025-07-23
JDK 26
26 b08 Fixed
Related Reports
Causes :  
Relates :  
Description
The following test failed in the JDK26 CI:

compiler/onSpinWait/TestOnSpinWaitAArch64.java

Here's a snippet from the log file:

#section:driver
----------messages:(8/310)----------
command: driver compiler.onSpinWait.TestOnSpinWaitAArch64 c1 sb 1
reason: User specified action: run driver compiler.onSpinWait.TestOnSpinWaitAArch64 c1 sb 1 
started: Mon Jul 14 19:12:42 GMT 2025
Mode: agentvm
Agent id: 8
Process id: 15597
finished: Mon Jul 14 19:12:45 GMT 2025
elapsed time (seconds): 2.823
----------configuration:(14/2012)----------

<snip>

----------System.err:(29/1945)----------
 stdout: [#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S577093/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/da0b7b0f-e736-4641-9a75-1b69e20fe2a9/runs/c2e8223d-c732-4099-8da7-827c79a47aec/workspace/open/src/hotspot/os_cpu/bsd_aarch64/os_bsd_aarch64.cpp:539), pid=16220, tid=4867
#  assert(VM_Version::spin_wait_desc().inst() >= SpinWait::NONE && VM_Version::spin_wait_desc().inst() <= SpinWait::YIELD) failed: must be
#
# JRE version:  (26.0+7) (fastdebug build )
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-ea+7-641, mixed mode, emulated-client, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, bsd-aarch64)
# Core dump will be written. Default location: core.16220
#
# An error report file with more information is saved as:
# /System/Volumes/Data/mesos/work_dir/slaves/d2398cde-9325-49c3-b030-8961a4f0a253-S577423/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/7824c4a2-9175-4074-a7ee-c70a0e27a3c3/runs/aa8baa9e-3690-49f7-843f-c5bc106129a3/testoutput/test-support/jtreg_open_test_hotspot_jtreg_tier2_compiler/scratch/1/hs_err_pid16220.log
#
#
];
 stderr: []
 exitValue = 134

java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [134]
	at jdk.test.lib.process.OutputAnalyzer.shouldHaveExitValue(OutputAnalyzer.java:522)
	at compiler.onSpinWait.TestOnSpinWaitAArch64.main(TestOnSpinWaitAArch64.java:88)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:565)
	at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335)
	at java.base/java.lang.Thread.run(Thread.java:1474)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test

result: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Expected to get exit value of [0], exit value is: [134]

Here's the crashing thread's stack trace:

---------------  T H R E A D  ---------------

Current thread (0x0000000125809410):  JavaThread "Unknown thread" [_thread_in_vm, id=4867, stack(0x000000016f5d0000,0x000000016f7d3000) (2060K)]

Stack: [0x000000016f5d0000,0x000000016f7d3000],  sp=0x000000016f7d1b30,  free space=2054k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x121db64]  VMError::report(outputStream*, bool)+0x1b00  (os_bsd_aarch64.cpp:539)
V  [libjvm.dylib+0x1221404]  VMError::report_and_die(int, char const*, char const*, char*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long)+0x55c
V  [libjvm.dylib+0x5a8a68]  print_error_for_unit_test(char const*, char const*, char*)+0x0
V  [libjvm.dylib+0xec5c0c]  _Copy_conjoint_jshorts_atomic+0x0
V  [libjvm.dylib+0x26e9e4]  ArchiveWorkers::run_task_multi(ArchiveWorkerTask*)+0x12c
V  [libjvm.dylib+0x26e7f4]  ArchiveWorkers::run_task(ArchiveWorkerTask*)+0x50
V  [libjvm.dylib+0x692c58]  FileMapInfo::relocate_pointers_in_core_regions(long)+0x404
V  [libjvm.dylib+0x6923cc]  FileMapInfo::map_regions(int*, int, char*, ReservedSpace)+0x244
V  [libjvm.dylib+0xdf6b34]  MetaspaceShared::map_archive(FileMapInfo*, char*, ReservedSpace)+0x8c
V  [libjvm.dylib+0xdf5ec8]  MetaspaceShared::map_archives(FileMapInfo*, FileMapInfo*, bool)+0x36c
V  [libjvm.dylib+0xdf56a0]  MetaspaceShared::initialize_runtime_shared_and_meta_spaces()+0x114
V  [libjvm.dylib+0xdea8a0]  Metaspace::global_initialize()+0x88
V  [libjvm.dylib+0x11ab61c]  universe_init()+0x1a8
V  [libjvm.dylib+0x8744d8]  init_globals()+0x64
V  [libjvm.dylib+0x1175840]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x2f8
V  [libjvm.dylib+0x9f1ab8]  JNI_CreateJavaVM+0x70
C  [libjli.dylib+0xa3d0]  JavaMain+0x100
C  [libjli.dylib+0xd52c]  ThreadJavaMain+0xc
C  [libsystem_pthread.dylib+0x72e4]  _pthread_start+0x88
Lock stack of current Java thread (top to bottom):
Comments
The fix for this issue is integrated in jdk-26+8-804.
23-07-2025

Changeset: 743c8212 Branch: master Author: Evgeny Astigeevich <eastigeevich@openjdk.org> Date: 2025-07-23 13:51:49 +0000 URL: https://git.openjdk.org/jdk/commit/743c821289a6562972364b5dcce8dd29a786264a
23-07-2025

> What makes this failure intermittent??
AFAICS, as Evgeny explained above, we hit the actual spin-wait from the archive workers only if the main thread is able to unpark from the final semaphore and start spin-waiting for the workers to terminate. That condition depends on timing: whether or not the workers finish before the main thread starts checking. See: https://github.com/openjdk/jdk/blob/b02c1256768bc9983d4dba899cd19219e11a380a/src/hotspot/share/cds/archiveUtils.cpp#L499. This is where the unimplemented SB would be discovered. Evgeny now has a direct gtest for spin-waits that captures the failure consistently.
23-07-2025

[~dholmes] the test failure might be considered intermittent. See my comment that on some Apple hardware, e.g. M3 Pro, the test does not fail at all.
23-07-2025

What makes this failure intermittent??
23-07-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/26387 Date: 2025-07-18 13:11:26 +0000
20-07-2025

> We see the same issue in our CI (test compiler/onSpinWait/TestOnSpinWaitAArch64.java , fastdebug binaries on macOS aarch64 ).
I have a fix and am testing it.
17-07-2025

We see the same issue in our CI (test compiler/onSpinWait/TestOnSpinWaitAArch64.java , fastdebug binaries on macOS aarch64 ).
17-07-2025

Hi [~mwthomps], Why have you changed Priority to P3?
15-07-2025

During the review of the JDK-8321371 PR I proposed having `SpinWait::exec(instr)`: https://github.com/openjdk/jdk/pull/16994#issuecomment-1865147655. I think we should do this to fix the mac SpinPause.
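To make the `SpinWait::exec(instr)` idea concrete, here is a minimal sketch of a single dispatch point that handles every instruction kind, including SB. It is not the code from the pull request: the names `Inst`, `exec_spin_inst` and `spin_pause` are hypothetical, the enum only mirrors the values quoted in this report, and the raw `.inst 0xd50330ff` encoding for SB is an assumption made so that assemblers without SB support still accept the block. On non-AArch64 hosts the instructions are simply skipped.

```cpp
#include <cassert>

// Hypothetical stand-in for HotSpot's SpinWait instruction kinds (assumption:
// same values as the enum quoted in this report). A fixed underlying type keeps
// out-of-range values well-defined for the check below.
enum class Inst : int { NONE = -1, NOP, ISB, YIELD, SB };

// Executes one spin-wait instruction of the given kind.
// Returns false if the kind is not one this function knows about.
inline bool exec_spin_inst(Inst inst) {
  switch (inst) {
    case Inst::NONE:
      return true;  // spin-wait disabled: nothing to execute
    case Inst::NOP:
#if defined(__aarch64__)
      __asm__ volatile("nop" : : : "memory");
#endif
      return true;
    case Inst::ISB:
#if defined(__aarch64__)
      __asm__ volatile("isb" : : : "memory");
#endif
      return true;
    case Inst::YIELD:
#if defined(__aarch64__)
      __asm__ volatile("yield" : : : "memory");
#endif
      return true;
    case Inst::SB:
#if defined(__aarch64__)
      // Assumed raw encoding of SB; only safe on CPUs that implement FEAT_SB.
      __asm__ volatile(".inst 0xd50330ff" : : : "memory");
#endif
      return true;
  }
  return false;  // unknown kind: the caller decides how to report it
}

// A SpinPause-like wrapper that asserts on unknown kinds instead of
// hard-coding the first and last enum values.
inline void spin_pause(Inst inst) {
  bool ok = exec_spin_inst(inst);
  assert(ok && "must be a known spin-wait instruction");
  (void)ok;
}
```

With this shape, adding a new instruction kind means adding one `case`; there is no separate range assert that can fall out of date.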
15-07-2025

On my MacBook with M3 Pro I could not get the test `compiler/onSpinWait/TestOnSpinWaitAArch64.java` to crash. It gets into `ArchiveWorkers::run_task_multi` but `spin.wait()` is not invoked. `Atomic::load(&_finish_tokens)` is always 0. I wrote a gtest which always fails.
15-07-2025

For the bootstrap sequence, we seem to be initializing in the correct order:
  VM_Version_init(); // <--- spin wait style is figured out here
  ...
  jint status = universe_init(); // <--- we go for CDS load/relocation here (also see the failing stack in the ticket)
This looks like just a simple problem in the MacOS/AArch64 spin-wait code. It would be more bullet-proof if we added a `_LIMIT` enum value and asserted against that.
15-07-2025

[~shade] Phew! ;-)
15-07-2025

> So this would only reproduce on newer Macs?
It does not require newer Macs. This should be reproducible on Apple M2+ CPUs.
15-07-2025

Also, this would be "intermittent" in the sense that SB would only be enabled -- and the SB subtest would only run -- on modern AArch64 platforms that actually support SB. Docs say "FEAT_SB is OPTIONAL from Armv8.0. FEAT_SB is mandatory from Armv8.5". So there is a chance this failure would only show up when the test is scheduled on a shiny new MacBook.
15-07-2025

[~adinn], look at my earlier comments, and Evgeny's too. MacOS/AArch64 seems to use a wholly different implementation for spin-waits, one that does not handle SB well.
15-07-2025

Oh, so this is a uniquely MacOS/AArch64 problem? So this would only reproduce on newer Macs? I removed the "os_x" designator an hour ago, thinking it was a generic aarch64 problem. I have now put it back.
15-07-2025

The problem is the implementation of `SpinPause()` in os_bsd_aarch64.cpp. Its current implementation was added by JDK-8321371. This implementation differs from os_linux_aarch64.cpp, where we call `StubRoutines::aarch64::spin_wait`. `StubRoutines::aarch64::spin_wait` uses functionality covered by the test compiler/onSpinWait/TestOnSpinWaitAArch64.java. There is no test specifically for the os_bsd_aarch64.cpp `SpinPause()`. Also, the implementation of `SpinPause()` is tied too closely to the details of the possible implementations of `VM_Version::spin_wait_desc()`.
15-07-2025

There is something very wrong in both the assert and the stack trace. The assert reports a "should not happen" case -- a SpinWait returned to the assert cannot legitimately have an instruction type outside the enum range -- indicating memory corruption in the SpinWait object returned by value, i.e. just pushed onto the stack.
The stack trace shows class FileMapInfo swizzling pointers in the archive regions (in parallel). The closure passed to ArchiveWorkers::run_task goes nowhere near _Copy_conjoint_jshorts_atomic, yet that is listed as a callee in the stack trace. The closure passed to run_task actually performs an iteration over a BitMapView. This applies another closure that swizzles marked pointers, i.e. adds an offset to a memory location indexed by the bitmap if the bit is set. Notably, the address for the frame is _Copy_conjoint_jshorts_atomic+0x0. That looks very suspicious because that function is defined immediately after the function SpinPause, which includes the assert. So, something is very rotten in the state of Denmark.
It's hard to see how this could relate to the stub generation ordering/timing changes made by [~kvn], [~adinn] and [~shade]. The obvious suspicion is that relocating generation of some of the initial stubgen stubs to preuniverse_stubs_init(), which precedes universe_init(), and locating generation of the remaining stubs in initial_stubs_init() *after* universe_init() is leading to some sort of memory ordering or synchronization issue. The obvious culprit is an error in the Atomic::PlatformXXX implementations, similar to what we saw on arm32 when the atomic stubs were defaulting to non-atomic implementations. However, that seems unlikely to be the problem. This is happening on bsd_aarch64, and for that architecture all the Atomic::PlatformXXX implementations rely on calls to OS library functions -- no stubs involved. So, if this failure has appeared after the stub reordering, it doesn't seem likely that it is to do with Atomic::PlatformXXX.
It is also hard to pin any of the problems on any remaining initial stubs that are now being generated after universe_init(). Firstly, there are no aarch64-specific initial stubs. The generic stubs include the call stub, the forward/catch exception stubs and various crypto and math stubs. If late generation of these caused an issue on bsd_aarch64 then one would expect to see the same issue on linux_aarch64. The only operation -- apart from stub generation -- that I can see has been delayed by the move of initial_stubs_init() after universe_init() is the initialization of UnsafeMemoryAccess::_table. If it is found to be null in generate_initial_stubs(), aarch64 sets it to a suitably sized table, as it seems does every other architecture. It might be appropriate to move this to preuniverse_stubs_init() so it is initialized before calling universe_init(), as was done before the reordering. This table appears only to be used in the signal handler. Could the problem be to do with this table still being null when we enter universe_init()? If so, then the error would have to be down to problematic memory accesses that only turn up on bsd_aarch64. Does that sound feasible?
15-07-2025

I think the answer is much simpler. Here is the config the test runs:
  run driver compiler.onSpinWait.TestOnSpinWaitAArch64 c1 sb 1
...and here is the assert it fails:
  assert(VM_Version::spin_wait_desc().inst() >= SpinWait::NONE && VM_Version::spin_wait_desc().inst() <= SpinWait::YIELD)
...and here is the enum definition:
  enum Inst { NONE = -1, NOP, ISB, YIELD, SB };
Notice the failing subtest runs with "SB". This is outside the [NONE;YIELD] range that the assert checks. "SB" was added by JDK-8359435. The test was problemlisted until recently, so the failure was hidden. Maybe AArch64 atomics were hidden from CDS code as well, like the issue we fixed on ARM32; I haven't checked that one. Anyway, these asserts deserve a fix following JDK-8359435. I'll assign this to Evgeny to make such a fix :)
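The range problem described above can be reproduced in isolation. The following is a small sketch, not HotSpot code: the enum copies the values quoted in this comment, while the `LIMIT` sentinel and the helper names `in_old_range`, `in_limit_range` and `check_inst` are hypothetical, following the `_LIMIT` suggestion made in another comment on this issue.

```cpp
#include <cassert>

// Mirror of the enum quoted above, extended with a hypothetical LIMIT
// sentinel that always stays last.
enum Inst { NONE = -1, NOP, ISB, YIELD, SB, LIMIT };

// The check as written in the failing assert: the upper bound is YIELD.
constexpr bool in_old_range(Inst inst) {
  return inst >= NONE && inst <= YIELD;
}

// A check against the sentinel keeps working when new values are added
// before LIMIT.
constexpr bool in_limit_range(Inst inst) {
  return inst >= NONE && inst < LIMIT;
}

static_assert(in_old_range(YIELD), "YIELD passes the old check");
static_assert(!in_old_range(SB), "SB falls outside [NONE;YIELD]");
static_assert(in_limit_range(SB), "SB passes the sentinel-based check");

// Runtime form of the sentinel-based check, in the style of the original assert.
inline void check_inst(Inst inst) {
  assert(in_limit_range(inst) && "must be");
  (void)inst;
}
```

The old check rejects `SB` purely because `SB` was appended after `YIELD`; no memory corruption is needed to explain the assert.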
15-07-2025

The issue also looks like a spin-wait called from ArchiveWorkers::run_task_multi:
  void ArchiveWorkers::run_task_multi(ArchiveWorkerTask* task) {
    ...
    SpinYield spin;
    while (Atomic::load(&_finish_tokens) != 0) {
      spin.wait();
    }
  }
And the assert is in SpinPause(). So the hs_err just misleadingly points to copy stubs:
  V [libjvm.dylib+0xec5c0c] _Copy_conjoint_jshorts_atomic+0x0
  V [libjvm.dylib+0x26e9e4] ArchiveWorkers::run_task_multi(ArchiveWorkerTask*)+0x12c
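The loop above also suggests why the crash depends on timing: the pause is executed only if the finish counter is still non-zero when the waiting thread first checks it. Below is a simplified, single-threaded sketch of that behaviour using `std::atomic`. It is not the HotSpot implementation; the function name `spin_until_finished` and the injected `pause` callback are hypothetical, introduced only so the number of pause calls can be observed.

```cpp
#include <atomic>
#include <cassert>

// Spins until finish_tokens reaches zero, calling pause() on every iteration
// in which work is still outstanding. Returns how many times pause() ran.
template <typename Pause>
int spin_until_finished(const std::atomic<int>& finish_tokens, Pause pause) {
  assert(finish_tokens.load(std::memory_order_acquire) >= 0);
  int spins = 0;
  while (finish_tokens.load(std::memory_order_acquire) != 0) {
    pause();  // in the real loop this is where the spin-wait instruction runs
    ++spins;
  }
  return spins;
}
```

If the workers have already finished, the function returns 0 and the pause code is never reached, which would match the observation that the crash does not occur on some machines. If any token is still outstanding, the pause runs at least once, and that is the point where an unhandled instruction kind would be noticed.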
15-07-2025

To trigger the bug, a user needs to explicitly set `OnSpinWaitInst` to `SB`. With the current default of `YIELD`, no crashes will happen. It is still an open question why the test `compiler/onSpinWait/TestOnSpinWaitAArch64.java` does not trigger the crash. I am looking into this.
15-07-2025

I've added [~adinn] and [~shade] to the watchers list as they have been active in this area, though I don't see a specific change that should have affected this code. Aleksey's most recent change should only affect ARM32, not AArch64. And it is hard to see how we could have a race condition this early in VM startup that would make this intermittent.
14-07-2025