JDK-8361380 : ARM32: Atomic stubs should be in pre-universe
Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 26
Priority: P4
Status: Resolved
Resolution: Fixed
OS: linux
CPU: arm
Submitted: 2025-07-03
Updated: 2025-08-22
Resolved: 2025-07-14
Since June 17, the JaCoCo integration tests executed on native arm32 builds have been crashing randomly with SIGSEGV in Klass::restore_unshareable_info. See the hs_err file attached.
Comments
[~shade] Sure I will! I need to improve my monitoring so that I can react more promptly. Thanks for picking this up so quickly!
14-07-2025
> [~shade] Looks good for fastdebug as well as release builds of your branch. I do not see crashes anymore.
Thanks for testing! The fix is in. Nightly/EA builds will eventually catch up. If you see this problem recurring, let us know.
[~shade] Looks good for fastdebug as well as release builds of your branch. I do not see crashes anymore.
14-07-2025
[~shade] Sure, builds are running... I'll post the result.
13-07-2025
Marc, I have a PR up with the fix we are about to integrate (see link above), it should fix the issue. Please test, if you have time.
13-07-2025
[~iklam] With the environment variable JAVA_TOOL_OPTIONS='-XX:+UnlockDiagnosticVMOptions -XX:-AOTCacheParallelRelocation' set, I haven't been able to reproduce the crash so far.
12-07-2025
A pull request was submitted for review.
Branch: master
URL: https://git.openjdk.org/jdk/pull/26270
Date: 2025-07-11 17:02:07 +0000
11-07-2025
Lifecycle: we map and relocate CDS archive during universe init, see universe_init() -> Metaspace::global_initialize() -> MetaspaceShared::initialize_runtime_shared_and_meta_spaces(). So the atomic stubs should be ready before that.
11-07-2025
Sure.
Your patch seems reasonable.
11-07-2025
Can I have it, though? I already spent quite some time reproducing and understanding the issue.
11-07-2025
[~adinn] This is yours.
11-07-2025
Thank you [~shade] for confirming [~adinn]'s theory. I'll assign this to him.
11-07-2025
I know Andrew Dinn wanted to do a patch. Whatever happens, I attached my version as 8361380-arm32-atomics.patch, and I am testing if it fixes the ARM32 reproducer I have :)
11-07-2025
Or, as a JDK-8358690-specific fix, we could move these to preuniverse?
StubRoutines::_atomic_add_entry = generate_atomic_add();
StubRoutines::_atomic_xchg_entry = generate_atomic_xchg();
StubRoutines::_atomic_cmpxchg_entry = generate_atomic_cmpxchg();
StubRoutines::_atomic_cmpxchg_long_entry = generate_atomic_cmpxchg_long();
StubRoutines::Arm::_atomic_load_long_entry = generate_atomic_load_long();
StubRoutines::Arm::_atomic_store_long_entry = generate_atomic_store_long();
It would take a bit of fiddling to move these from initial to preuniverse.
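For illustration, a minimal sketch of what that move could look like, assuming the ARM32 stub generator gains a preuniverse phase; the function name generate_preuniverse_stubs() and its placement are assumptions, not the actual fix:
// Sketch only: generate the ARM32 atomic stubs in the preuniverse phase,
// so they exist before universe_init() maps and relocates the CDS archive.
// generate_preuniverse_stubs() is an assumed hook name for illustration.
void StubGenerator::generate_preuniverse_stubs() {
  // ... other preuniverse stubs ...
  StubRoutines::_atomic_add_entry          = generate_atomic_add();
  StubRoutines::_atomic_xchg_entry         = generate_atomic_xchg();
  StubRoutines::_atomic_cmpxchg_entry      = generate_atomic_cmpxchg();
  StubRoutines::_atomic_cmpxchg_long_entry = generate_atomic_cmpxchg_long();
  StubRoutines::Arm::_atomic_load_long_entry  = generate_atomic_load_long();
  StubRoutines::Arm::_atomic_store_long_entry = generate_atomic_store_long();
}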
11-07-2025
Linux ARM32 is the only platform that does this bootstrap-time-actually-not-atomic oddity. Even Zero has migrated to GCC built-ins for atomics, which avoids these bootstrap circularities. It requires linking with -latomic, though, which the Zero build adds automatically. Maybe we could/should do the same for ARM32.
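As an illustration of the built-in approach (not the actual HotSpot code; the function name is made up), a cmpxchg written against the GCC __atomic built-ins needs no generated stub and no bootstrap fallback at all:
// Sketch: cmpxchg via GCC built-ins. On some targets this pulls in
// libatomic, hence the -latomic link flag mentioned above.
#include <stdint.h>

static int32_t cmpxchg_via_builtin(int32_t compare_value,
                                   int32_t exchange_value,
                                   volatile int32_t* dest) {
  int32_t expected = compare_value;
  // On failure, the built-in stores the observed value into 'expected',
  // so the caller always gets the old value back, as cmpxchg expects.
  __atomic_compare_exchange_n(dest, &expected, exchange_value,
                              /* weak */ false,
                              __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
  return expected;
}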
11-07-2025
One might think it is just the archive workers that are used too early, but I see in the hs_err files that other threads -- notably GC threads -- have also started by the time the archive workers crash. So if any of those also (unlikely, but possibly) need atomics to work in some initialization sequence, they are at risk too.
Java Threads: ( => current thread )
Total: 0
Other Threads:
0xb598fa38 WorkerThread "GC Thread#0" [id=2470, stack(0xb368c000,0xb370c000) (512K)]
0xb5996d38 ConcurrentGCThread "G1 Main Marker" [id=2471, stack(0xb360b000,0xb368b000) (512K)]
0xb5997d50 WorkerThread "G1 Conc#0" [id=2472, stack(0x76180000,0x76200000) (512K)]
0xb59eda88 ConcurrentGCThread "G1 Refine#0" [id=2473, stack(0x75e80000,0x75f00000) (512K)]
0xb59eeaf8 ConcurrentGCThread "G1 Service" [id=2474, stack(0x75c80000,0x75d00000) (512K)]
=>0xb59f3c70 (exited) Archive Worker Thread "ArchiveWorkerThread" [id=2475, stack(0x74fad000,0x7502d000) (512K)]
Total: 6
11-07-2025
This also lines up nicely with Andrew Dinn's suspicion yesterday that we use a broken cmpxchg somehow. The task distribution code in archive workers uses cmpxchg(int):
void ArchiveWorkerTask::run() {
  while (true) {
    int chunk = Atomic::load(&_chunk);
    if (chunk >= _max_chunks) {
      return;
    }
    if (Atomic::cmpxchg(&_chunk, chunk, chunk + 1, memory_order_relaxed) == chunk) {
      assert(0 <= chunk && chunk < _max_chunks, "Sanity");
      work(chunk, _max_chunks);
    }
  }
}
...which ends up calling here for ARM32:
int32_t ARMAtomicFuncs::cmpxchg_bootstrap(int32_t compare_value, int32_t exchange_value, volatile int32_t* dest) {
  // try to use the stub:
  cmpxchg_func_t func = CAST_TO_FN_PTR(cmpxchg_func_t, StubRoutines::atomic_cmpxchg_entry());
  if (func != nullptr) {
    _cmpxchg_func = func;
    return (*func)(compare_value, exchange_value, dest);
  }
  assert(Threads::number_of_threads() == 0, "for bootstrap only");
  int32_t old_value = *dest;
  if (old_value == compare_value)
    *dest = exchange_value;
  return old_value;
}
...which, as you can see, does a *non-atomic update* if `StubRoutines::atomic_cmpxchg_entry()` is not initialized. So there is a chance the task distribution code hands over the same chunk to several threads. Which leads to running relocation over some pointers twice. Which FUBARs them.
This also explains why it started to show up after JDK-8358690 -- that change likely moved the stub initialization to _after_ the archive workers needed it. I guess cmpxchg_bootstrap "believes" that bootstrap is single-threaded until all the stubs have been generated. It even does assert(Threads::number_of_threads() == 0), but that only covers Java threads, not native ones like the archive workers.
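To make the race concrete, here is a standalone toy (deliberately racy, not HotSpot code) that mimics the bootstrap fallback; with two threads it can hand out the same chunk twice, which is the same double-claim pattern visible in the sorted log further down:
// Toy demonstration: a plain load/compare/store "cmpxchg" lets two
// threads claim the same chunk. Build with -pthread.
#include <cstdio>
#include <thread>

static volatile int g_chunk = 0;

// Mimics the bootstrap fallback above: no atomicity whatsoever.
static int broken_cmpxchg(volatile int* dest, int compare, int exchange) {
  int old_value = *dest;
  if (old_value == compare)
    *dest = exchange;            // both threads may "succeed" here
  return old_value;
}

static void worker(int max_chunks, long* wins) {
  while (true) {
    int chunk = g_chunk;
    if (chunk >= max_chunks) return;
    if (broken_cmpxchg(&g_chunk, chunk, chunk + 1) == chunk) {
      ++*wins;                   // counts chunks this thread claimed
    }
  }
}

int main() {
  long wins1 = 0, wins2 = 0;
  std::thread t1(worker, 1000000, &wins1);
  std::thread t2(worker, 1000000, &wins2);
  t1.join(); t2.join();
  // With a real atomic CAS, wins1 + wins2 would be exactly 1000000;
  // any excess means some chunk was handed out more than once.
  std::printf("claimed %ld chunks in total\n", wins1 + wins2);
}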
11-07-2025
My weak hypothesis was that we end up doing the relocation twice somehow. The second relocation obviously FUBARs the pointer, which would assert in fastdebug, and crash somewhere in release. I added this logging:
void work_on(int chunk, int max_chunks, BitMapView* bm, SharedDataRelocator* reloc) {
  BitMap::idx_t size = bm->size();
  BitMap::idx_t start = MIN2(size, size * chunk / max_chunks);
  BitMap::idx_t end = MIN2(size, size * (chunk + 1) / max_chunks);
  assert(end > start, "Sanity: no empty slices");
  if (UseNewCode) {
    tty->print_cr("(%d) Working on " PTR_FORMAT " : %zu %zu %zu", os::current_process_id(), p2i(bm), start, end, size);
  }
  bm->iterate(reloc, start, end);
}
...and after a while it showed me (I sorted the output by hand):
$ grep 18862 out
(18862) Working on 0xb53709e8 : 0 115125 921000
(18862) Working on 0xb53709e8 : 115125 230250 921000
(18862) Working on 0xb53709e8 : 230250 345375 921000
(18862) Working on 0xb53709e8 : 345375 460500 921000
(18862) Working on 0xb53709e8 : 460500 575625 921000
(18862) Working on 0xb53709e8 : 575625 690750 921000
(18862) Working on 0xb53709e8 : 690750 805875 921000
(18862) Working on 0xb53709e8 : 690750 805875 921000 ; <----- DOING IT TWICE
(18862) Working on 0xb53709e8 : 805875 921000 921000
(18862) Working on 0xb53709f0 : 0 161699 1293596
(18862) Working on 0xb53709f0 : 161699 323399 1293596
(18862) Working on 0xb53709f0 : 323399 485098 1293596
(18862) Working on 0xb53709f0 : 485098 646798 1293596
(18862) Working on 0xb53709f0 : 646798 808497 1293596
(18862) Working on 0xb53709f0 : 808497 970197 1293596
# Internal Error (/home/shade/trunks/jdk/src/hotspot/share/cds/archiveUtils.inline.hpp:43), pid=18862, tid=18863
# V [libjvm.so+0x8be3f0](18862) Working on 0xb53709f0 : 970197 1131896 1293596
(18862) Working on 0xb53709f0 : 1131896 1293596 1293596
# /home/pi/hs_err_pid18862.log
11-07-2025
Ah yes, here it is!
$ seq 1 10000 | xargs -P 2 -n 1 jdk-mainline/bin/java -fastdebug -Xmx64m Hello
...
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
Hello world
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (src/hotspot/share/cds/archiveUtils.inline.hpp:43), pid=2252, tid=2259
# assert(_valid_old_base <= old_ptr && old_ptr < _valid_old_end) failed: must be
#
# JRE version: (26.0) (fastdebug build )
# Java VM: OpenJDK Server VM (fastdebug 26-testing-builds.shipilev.net-openjdk-jdk-b4814-20250710-1937, mixed mode, sharing, g1 gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x8be380]Hello world
SharedDataRelocationTask::work(int, int)+0x5d4
11-07-2025
Given that we are crashing at the very beginning, I don't think JaCoCo itself matters much in this story. What probably matters is that we run lots of JVMs, which gives us more chances to observe a low-frequency event. I just ran:
$ seq 1 10000 | xargs -P 2 -n 1 jdk-mainline/bin/java -Xmx64m Hello
...and it crashed as well, in a new way:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb6208504, pid=32705, tid=32706
#
# JRE version: (26.0) (build )
# Java VM: OpenJDK Server VM (26-testing-builds.shipilev.net-openjdk-jdk-b4814-20250710-1937, mixed mode, sharing, g1 gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x152504] AOTClassLocationConfig::validate(char const*, bool, bool*) const+0x664
#
Current thread (0xb5e19670): JavaThread "Unknown thread" [_thread_in_vm, id=32706, stack(0xb5fdf000,0xb602f000) (320K)]
Stack: [0xb5fdf000,0xb602f000], sp=0xb602b9b8, free space=306k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x152504] AOTClassLocationConfig::validate(char const*, bool, bool*) const+0x664 (aotClassLocation.cpp:976)
V [libjvm.so+0x426250] FileMapInfo::validate_class_location()+0x44 (filemap.cpp:330)
V [libjvm.so+0x890374] MetaspaceShared::map_archive(FileMapInfo*, char*, ReservedSpace)+0x90 (metaspaceShared.cpp:1903)
V [libjvm.so+0x892008] MetaspaceShared::map_archives(FileMapInfo*, FileMapInfo*, bool)+0x11c (metaspaceShared.cpp:1553)
V [libjvm.so+0x892760] MetaspaceShared::initialize_runtime_shared_and_meta_spaces()+0x3ac (metaspaceShared.cpp:1342)
V [libjvm.so+0x88b13c] Metaspace::global_initialize()+0xb4 (metaspace.cpp:743)
V [libjvm.so+0xaf5910] universe_init()+0x150 (universe.cpp:890)
V [libjvm.so+0x5685f0] init_globals()+0x6c (init.cpp:138)
V [libjvm.so+0xaccfd0] Threads::create_vm(JavaVMInitArgs*, bool*)+0x2e8 (threads.cpp:592)
V [libjvm.so+0x66aec4] JNI_CreateJavaVM+0x74 (jni.cpp:3589)
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xdd78e524
11-07-2025
Dang, it is hard to reproduce for me now as well. I remembered a bit of trivia about small boards: they might be hotplugging CPUs. So the ActiveProcessorCount differs depending on system conditions. ArchiveWorkers use that to drive the parallelism of the relocation code, so that might be a confounding factor for reproducibility as well.
EDIT: Nevermind, looks like all 4 CPUs are online in all crash logs.
11-07-2025
[~iklam] I can actually reproduce it with almost every build. If it helps, here is my setup:
https://github.com/marchof/PiCI/blob/master/jdk-jacoco/docker/Dockerfile
I just added JAVA_TOOL_OPTIONS as requested and will let you know.
11-07-2025
[~marchof] I have a hard time reproducing the crash. I am now trying to run this in a loop overnight ...
How often can you reproduce it?
Could you try adding this to your environment and see if the problem goes away?
$ export JAVA_TOOL_OPTIONS='-XX:+UnlockDiagnosticVMOptions -XX:-AOTCacheParallelRelocation'
$ mvn clean install ....
11-07-2025
Relocation should always happen. The requested base address is 0x80000000, but we always patch it to 0x75594000 (at least that's the case on my RPi; the OS doesn't really give us ASLR).
$ for i in {1..10}; do ./jdk/bin/java -Xlog:cds*,aot* --version | grep 'Reserved archive_space_rs'; done
[0.021s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.010s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.014s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.011s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
[0.010s][info][cds] Reserved archive_space_rs [0x75594000 - 0x76000000] (10928128) bytes
Since the classes.jsa file is not changing, the relocation code should do the exact same thing every time. I don't know why the relocation would sometimes assert (or end up patching the archive incorrectly, so we fail very early when trying to load the very first class)
$ ./jdk/bin/java -Xlog:aot+reloc=debug --version
[0.011s][debug][aot,reloc] SharedDataRelocator::_patch_base = 0x75595b6c
[0.011s][debug][aot,reloc] SharedDataRelocator::_patch_end = 0x758a0000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_old_base = 0x80000000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_old_end = 0x80a6c000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_new_base = 0x75594000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_new_end = 0x76000000
[0.011s][debug][aot,reloc] SharedDataRelocator::_patch_base = 0x75b04874
[0.011s][debug][aot,reloc] SharedDataRelocator::_patch_end = 0x76000000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_old_base = 0x80000000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_old_end = 0x80a6c000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_new_base = 0x75594000
[0.011s][debug][aot,reloc] SharedDataRelocator::_valid_new_end = 0x76000000
openjdk 26-testing 2026-03-17
OpenJDK Runtime Environment (build 26-testing-builds.shipilev.net-openjdk-jdk-b4816-20250710-2347)
OpenJDK Server VM (build 26-testing-builds.shipilev.net-openjdk-jdk-b4816-20250710-2347, mixed mode, sharing)
[~shade] could you check if this is related to AOTCacheParallelRelocation?
11-07-2025
I can also reproduce on my RPi (Linux raspberrypi 5.10.63-v7l+ #1459 SMP Wed Oct 6 16:41:57 BST 2021 armv7l GNU/Linux) with https://builds.shipilev.net/openjdk-jdk/openjdk-jdk-linux-arm32-hflt-server.tar.xz (build 26-testing-builds.shipilev.net-openjdk-jdk-b4816-20250710-2347).
But the crash is not consistent. I only got the crash once out of many runs.
$ time env JAVA_HOME=/home/pi/shipilev/jdk ../apache-maven-3.9.10/bin/mvn clean install -Dspotless.check.skip -Dmaven.javadoc.skip
[...]
[INFO]
[INFO] --- surefire:2.19.1:test (default-test) @ org.jacoco.core.test.validation.java5 ---
-------------------------------------------------------
T E S T S
-------------------------------------------------------
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb683f388, pid=739, tid=742
#
Stack: [0xb5fe5000,0xb6035000], sp=0xb6031b08, free space=306k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x783388] Klass::restore_unshareable_info(ClassLoaderData*, Handle, JavaThread*)+0x18c (klass.hpp:686)
V [libjvm.so+0x1743e4] ArrayKlass::restore_unshareable_info(ClassLoaderData*, Handle, JavaThread*)+0x34 (arrayKlass.cpp:235)
V [libjvm.so+0x572914] InstanceKlass::restore_unshareable_info(ClassLoaderData*, Handle, PackageEntry*, JavaThread*)+0x168 (instanceKlass.cpp:2832)
V [libjvm.so+0xb507e8] vmClasses::resolve_shared_class(InstanceKlass*, ClassLoaderData*, Handle, JavaThread*)+0xc0 (vmClasses.cpp:242)
V [libjvm.so+0xb50bb8] vmClasses::resolve_all(JavaThread*)+0x244 (vmClasses.cpp:87)
V [libjvm.so+0xa876d8] SystemDictionary::initialize(JavaThread*)+0xc4 (systemDictionary.cpp:1568)
V [libjvm.so+0xaf2cb4] Universe::genesis(JavaThread*)+0x98 (universe.cpp:443)
V [libjvm.so+0xaf4738] universe2_init()+0x24 (universe.cpp:1079)
V [libjvm.so+0x568660] init_globals2()+0x10 (init.cpp:173)
V [libjvm.so+0xacd070] Threads::create_vm(JavaVMInitArgs*, bool*)+0x388 (threads.cpp:615)
V [libjvm.so+0x66aec4] JNI_CreateJavaVM+0x74 (jni.cpp:3589)
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x6ac39428
See hs_err_pid739_iklam.log in attachment
11-07-2025
[~shade] were you able to reproduce the error with a fastdebug build? From the attachment: https://bugs.openjdk.org/secure/attachment/115229/hs_err_pid2468-fastdebug.log
7502e000-75b00000 rw-p 00001000 08:01 62393893 /jdk/lib/server/classes.jsa
7613b000-7617f000 r--p 00ad3000 08:01 62393893 /jdk/lib/server/classes.jsa
The assert happened while relocating the default archive, so in theory it should assert every time /jdk/bin/java is executed. It would be good to find out what the offending pointer is that causes this assert:
# assert(_valid_old_base <= old_ptr && old_ptr < _valid_old_end) failed: must be
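One way to surface that information (a sketch only; HotSpot's assert accepts printf-style arguments and PTR_FORMAT/p2i are the usual helpers, but the surrounding code in archiveUtils.inline.hpp may name things differently):
// Sketch: extend the assert so the hs_err shows the offending pointer
// and the expected range, instead of just "must be".
assert(_valid_old_base <= old_ptr && old_ptr < _valid_old_end,
       "must be: old_ptr=" PTR_FORMAT " valid_old=[" PTR_FORMAT ", " PTR_FORMAT ")",
       p2i(old_ptr), p2i(_valid_old_base), p2i(_valid_old_end));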
11-07-2025
Yes, I was able to reproduce it on my RPi 4 / armhf with https://builds.shipilev.net/openjdk-jdk/openjdk-jdk-linux-arm32-hflt-server.tar.xz, running the https://github.com/jacoco/jacoco build with a blanket "mvn clean install -Dspotless.check.skip -Dmaven.javadoc.skip". It takes a while, though :)
I think Andrew Dinn found an issue in atomic-long-handling stubs that might explain some of this.
10-07-2025
Sure, the project and its build is public: https://github.com/jacoco/jacoco
The specific build setup here is a 32-bit Raspberry Pi 4: first building a 32-bit JDK and then using this JDK to run the JaCoCo build.
I haven't tried to reproduce the problem on amd64 yet.
I can add whatever helps to find the root cause.
10-07-2025
Marc, is that a public project build? Can you put a build/test invocation line here?
10-07-2025
[~kvn] I'll try but can't promise: It's a complex Maven build with many JVMs started as sub-processes during the integration tests.
10-07-2025
[~iklam] Can you look into why this could have happened?
10-07-2025
[~marchof] Can you run a fastdebug VM with -Xlog:cds=debug -Xlog:aot=debug -Xlog:codecache+init=debug and attach the output?
09-07-2025
Are you sure it is JDK-8358690? From the hs_err files I see that AOT is not used.
The only thing I see is that I moved some initial stub generation to after universe init.
09-07-2025
Thanks for narrowing it down! [~kvn] could you please take a look?
09-07-2025
I think I can narrow this down to the following commit:
https://github.com/openjdk/jdk/commit/6e390ef17cf4b6134d5d53ba4e3ae8281fedb3f3
JDK-8358690: Some initialization code asks for AOT cache status way too early
With this commit I can reproduce the crash with every test run. With its parent commit I wasn't able to reproduce it. With fastdebug the error message is:
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/workspace/src/hotspot/share/cds/archiveUtils.inline.hpp:43), pid=3557, tid=3558
# assert(_valid_old_base <= old_ptr && old_ptr < _valid_old_end) failed: must be
08-07-2025
[~thartmann] Sure, I'll bisect -- will take some time.
07-07-2025
I don't see how JDK-8351645 could trigger this, so the issue was most likely introduced by something else. The failing method looks CDS-related. Could you try to narrow it down further to a specific change?
07-07-2025
I was able to reproduce the issue with a fastdebug build. See hs_err_pid2468-fastdebug.log attached:
# Internal Error (/workspace/src/hotspot/share/cds/archiveUtils.inline.hpp:43), pid=2468, tid=2475
# assert(_valid_old_base <= old_ptr && old_ptr < _valid_old_end) failed: must be
#
# JRE version: (26.0) (fastdebug build )
# Java VM: OpenJDK Server VM (fastdebug 26-internal-adhoc.root.workspace, mixed mode, sharing, g1 gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x604d58] SharedDataRelocationTask::work(int, int)+0x3a3