JDK-8284997 : arm32 build crashes since JDK-8283326
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 19
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • CPU: arm
  • Submitted: 2022-04-19
  • Updated: 2022-06-24
  • Resolved: 2022-06-24
Fix Version: JDK 20 (master) - Fixed
Description
Since the merge of JDK-8283326, native builds on arm32 crash reproducibly when the newly built JVM is used as part of the build steps.

Please find an example crash report attached.
Comments
The fix was pushed while the main bug was targeted to '19'. Reset the main bug to fixed in '20' and copied the Robo Duke entry here.
24-06-2022

> This has been fixed with JDK-8288719. I don't know why Skara did not pick up on that automatically. This bug's "Fix Version/s" was set to '19' at the time of the integration so a backport bug was created.
24-06-2022

Dukebot added a comment:
Changeset: 26c03c18
Author: Thomas Stuefe <stuefe@openjdk.org>
Date: 2022-06-23 10:15:05 +0000
URL: https://git.openjdk.org/jdk/commit/26c03c1860c6da450b5cd6a46576c78bea682f96
24-06-2022

This has been fixed with JDK-8288719. I don't know why Skara did not pick up on that automatically.
24-06-2022

A pull request was submitted for review.
URL: https://git.openjdk.org/jdk/pull/9213
Date: 2022-06-20 08:24:49 +0000
20-06-2022

PR open at https://github.com/openjdk/jdk/pull/9213
20-06-2022

We should use `.type function` anyway, for ABI compatibility.
02-06-2022

One possible solution would be to add the missing `.type function` to the assembler function; in my test this causes a GCC built with --with-mode=thumb to emit the correct call instruction (blx), and we switch to ARM mode when calling into SafeFetch. But I'm not sure this is the best solution. Should we not use the same mode for all compilation units? Why leave this up to the whims of the toolchain builder?
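A minimal sketch of what that could look like (hypothetical: the listing mirrors the SafeFetch assembly quoted below, and the exact directive placement is an assumption, not the final fix):

```
  .globl SafeFetch32_impl
  .type  SafeFetch32_impl, %function  @ declare the symbol as a function so the
                                      @ toolchain knows its target ISA and can
                                      @ emit blx or interworking veneers
SafeFetch32_impl:
  ldr r0, [r0]                        @ may fault; handled by the VM
  bx  lr
```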
27-05-2022

Picked this up again. This is either a problem with the toolchain, with the way we call the assembly, or with the way we build the assembly. TL;DR: the VM uses Thumb mode globally, the assembly is ARM, but we don't switch to ARM when calling it.

SafeFetch32_impl is implemented like this:

```
.globl SafeFetch32_impl
.globl _SafeFetch32_fault
.globl _SafeFetch32_continuation
SafeFetch32_impl:
_SafeFetch32_fault:
  ldr r0, [r0]
  bx lr
_SafeFetch32_continuation:
  mov r0, r1
  bx lr
```

The SafeFetch32 wrapper is C++ and looks like this:

```
extern "C" int SafeFetch32_impl(int* adr, int errValue);

inline int SafeFetch32(int* adr, int errValue) {
  return SafeFetch32_impl(adr, errValue);
}
```

I step through VM initialization. The VM seems to always be in Thumb mode: cpsr=xxxx030 and lr is odd-numbered. So that seems to be the standard in this build. Ok.

Now I am in the SafeFetch32 wrapper, before stepping into the SafeFetch32_impl assembly routine (named _SafeFetch32_fault since the fault label points to the start of the function):

```
(gdb)
0xb659d62a  52  return SafeFetch32_impl(adr, errValue);
8: x/3i $pc
=> 0xb659d62a <SafeFetch32(int*, int)+14>: bl 0xb665e36c <_SafeFetch32_fault>
   0xb659d62e <SafeFetch32(int*, int)+18>: mov r3, r0
   0xb659d630 <SafeFetch32(int*, int)+20>: mov r0, r3
(gdb) info registers
r0     0x7438d688  1949881992
r1     0xcafebabe  3405691582
r2     0x3         3
r3     0xdeadbeef  3735928559
r4     0xb6d47c70  3067378800
r5     0xb5d1afec  3050418156
r6     0x75609c3c  1969265724
r7     0xb5d1af40  3050417984
r8     0xb5d1ba40  3050420800
r9     0x7438ab90  1949870992
r10    0xb5b17138  3048304952
r11    0xb5d1d468  3050427496
r12    0xb6d47fec  3067379692
sp     0xb5d1af40  0xb5d1af40
lr     0xb65ba1af  -1235508817
pc     0xb659d62a  0xb659d62a <SafeFetch32(int*, int)+14>
cpsr   0x800f0030  -2146500560
fpscr  0x60000010  1610612752
```

Now I am inside _SafeFetch32_fault. We did not switch the execution mode; we are still in Thumb mode.

```
(gdb) stepi
_SafeFetch32_fault () at /shared/projects/openjdk/jdk-jdk/source/src/hotspot/os_cpu/linux_arm/safefetch_linux_arm.S:36
36  ldr r0, [r0]
8: x/3i $pc
=> 0xb665e36c <_SafeFetch32_fault>: ldr r0, [r0]
   0xb665e370 <_SafeFetch32_fault+4>: bx lr
   0xb665e374 <_SafeFetch32_continuation>: mov r0, r1
(gdb) info registers
r0     0x7438d688  1949881992
r1     0xcafebabe  3405691582
r2     0x3         3
r3     0xdeadbeef  3735928559
r4     0xb6d47c70  3067378800
r5     0xb5d1afec  3050418156
r6     0x75609c3c  1969265724
r7     0xb5d1af40  3050417984
r8     0xb5d1ba40  3050420800
r9     0x7438ab90  1949870992
r10    0xb5b17138  3048304952
r11    0xb5d1d468  3050427496
r12    0xb6d47fec  3067379692
sp     0xb5d1af40  0xb5d1af40
lr     0xb659d62f  -1235626449
pc     0xb665e36c  0xb665e36c <_SafeFetch32_fault>
cpsr   0x800f0030  -2146500560
fpscr  0x60000010  1610612752
```

I step one more instruction. The PC only advances by two bytes because the CPU thinks Thumb, but this was a four-byte instruction. The disassembler is confused now:

```
(gdb) stepi
0xb665e36e  36  ldr r0, [r0]
8: x/6i $pc
=> 0xb665e36e <_SafeFetch32_fault+2>: ; <UNDEFINED> instruction: 0xff1ee590
   0xb665e372 <_SafeFetch32_fault+6>: andeq lr, r1, pc, lsr #2
   0xb665e376 <_SafeFetch32_continuation+2>: ; <UNDEFINED> instruction: 0xff1ee1a0
   0xb665e37a <_SafeFetch32_continuation+6>: strlt lr, [r0, #303] ; 0x12f
   0xb665e37e <__static_initialization_and_destruction_0(int, int)+2>: sub sp, #8
   0xb665e380 <__static_initialization_and_destruction_0(int, int)+4>: add r7, sp, #0
```

To sum up: the VM was compiled in Thumb mode and the assembler routine in ARM mode, but we did not switch from Thumb to ARM when calling into the assembler routine. The call was generated by gcc:

```
=> 0xb659d62a <SafeFetch32(int*, int)+14>: bl 0xb665e36c <_SafeFetch32_fault>
```

It uses a "BL", not a "BLX" instruction. If I understand ARM assembly correctly [1], it should have used "BLX" to switch the instruction set. I am sure the answer is somehow in how we use the toolchain.
[1] https://developer.arm.com/documentation/dui0489/h/arm-and-thumb-instructions/b--bl--bx--blx--and-bxj
27-05-2022

I wondered why only Marc's reproduction scenario, based on an Ubuntu 18.04 docker image, shows the error. It looks like in all other build environments I tested (devkit crossbuild, Raspberry OS 32-bit), GCC generates ARM code too, so it does not clash with the ARM code in the static assembler routine. But the GCC from Ubuntu 18.04 generates Thumb code, because it itself was built with --with-mode=thumb. So this error is normally hidden by the (typical?) implicit -marm code generation used when building OpenJDK for arm32.
27-05-2022

Moved to hotspot/runtime.
16-05-2022

[~stuefe] Those tags just indicate the supported ISA. A section with $a points to ARM code, and it is highly likely SafeFetch32_impl is also in ARM mode (this can be checked with readelf --syms: if the address is even, it is an ARM function). Is it possible to share a disassembler dump of the function?
04-05-2022

[~snazarki] thanks for the hint. It remains strange. AFAICS we don't force the ISA anywhere - no -marm or -mthumb - nor do we set -mthumb-interwork. I compared the object files for the assembler file in question from both the crossbuild (which works) and the local build (which does not work). In both cases I only see a single code section with "$a", so it would be the ARM ISA? But then, the working variant has a "Tag_THUMB_ISA_use: Thumb-1", while the non-working one says "Tag_THUMB_ISA_use: Thumb-2". I attached both files. Side question: we should not even need -mthumb-interwork, should we? Should we not just simply use the same ISA for all compile units?
02-05-2022

Could this be due to thumb interworking? Some logs contain an odd value in the return register (LR), and the CPSR indicates the CPU is working in Thumb mode (bit 5 is set). Could anybody check what mode is used for the assembler file compilation?
28-04-2022

I can confirm that since the above workaround was merged to master the native build on arm32 is green again. Thanks!
27-04-2022

Temp. workaround under review: https://github.com/openjdk/jdk/pull/8399 (https://bugs.openjdk.java.net/browse/JDK-8285675)
26-04-2022

Despite the different error message, it does seem that I'm hitting the same issue. Builds on 'armhf' with the following commit fail:

8283326: Implement SafeFetch statically
https://github.com/openjdk/jdk/commit/bdf8a2a2050

while builds with its parent commit work:

8284874: Add comment to ProcessHandle/OnExitTest to describe zombie problem
https://github.com/openjdk/jdk/commit/bb7c97bddfe

I attached "before" and "after" build logs showing the working build before the commit and the failing build after the commit.

Before: openjdk-19+18-24-gbb7c97bddfe-armhf.txt.gz

From https://github.com/openjdk/jdk
Note: checking out 'bb7c97bddfe88cb3261706f5e272fd0418e5238c'.
HEAD is now at bb7c97bddfe 8284874: Add comment to ProcessHandle/OnExitTest to describe zombie problem

After: openjdk-19+18-25-gbdf8a2a2050-armhf.txt.gz

From https://github.com/openjdk/jdk
Note: checking out 'bdf8a2a2050393e91800786f8d5a5d6805f936eb'.
HEAD is now at bdf8a2a2050 8283326: Implement SafeFetch statically

So far, I get a SIGSEGV at the same location every time:

Optimizing the exploded image
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xf74db1ac, pid=64500, tid=64515
#
# JRE version: OpenJDK Runtime Environment (19.0) (build 19-internal-adhoc.root.build)
# Java VM: OpenJDK Server VM (19-internal-adhoc.root.build, mixed mode, g1 gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x6451ac] RuntimeService::init()+0x13b
23-04-2022

I'm getting the following reproducible crash when building on 32-bit ARM ("armhf" Debian architecture) when the build hits the target 'jdk__optimize_image_exec'. I assume my error is related to this bug report because of the timing, but please let me know if I should instead open a new report.

Optimizing the exploded image
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xf7a7577c, pid=64825, tid=64844
#
# JRE version: OpenJDK Runtime Environment (19.0+19) (build 19-ea+19-snap)
# Java VM: OpenJDK Server VM (19-ea+19-snap, mixed mode, g1 gc, linux-arm)
# Problematic frame:
# V [libjvm.so+0x64677c] RuntimeService::init()+0x143
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /build/snapcraft-openjdk-ada10b75ccb0e6a3649b33d25f2b67a7/parts/jdk/build/make/core.64825)
#
# An error report file with more information is saved as:
# /build/snapcraft-openjdk-ada10b75ccb0e6a3649b33d25f2b67a7/parts/jdk/build/make/hs_err_pid64825.log
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#

These are builds that I run every week successfully (until today) on the Canonical Launchpad build farm. I can also try locally on a Raspberry Pi 2 Model B Rev 1.1, if necessary, and I can change any parameters of the build to try other tests. My build configuration is in the following YAML file: https://github.com/jgneff/openjdk/blob/edge/snap/snapcraft.yaml#L146
22-04-2022

I double checked and did two fastdebug builds:

Commit bdf8a2a2050393e91800786f8d5a5d6805f936eb (your change for JDK-8283326): crashes with Internal Error (0xe0000000)
Commit bb7c97bddfe88cb3261706f5e272fd0418e5238c (parent, also by you): build succeeds
20-04-2022

Yes, weird, isn't it? My patch did not touch those stub routines. My current guess is that by removing the SafeFetch stub routine generation, my patch shook some hidden bug loose in the stub generator. SafeFetch was the last routine to be generated. Maybe an IC flush is now missing, or something similar? We now generate less code into the code blob, so it is smaller.

Yesterday night I finally got it reproduced, but only in a hand-crafted docker container based on Marc's, and by building directly on the Raspberry. My cross build works with GCC 10, and Marc's build uses GCC 9. Maybe that's the difference.

[~marchof] BTW, build times on the Raspberry are atrocious. I thought the poor thing was melting on me. You really should look into crossbuilding. I build OpenJDK arm on my x86 desktop in 2.5 minutes.
20-04-2022

The debug log is intriguing:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (0xe0000000), pid=16984, tid=16989

so we don't even have a clear error cause. The fact that this happens around the stub code does suggest a code generation fault.
19-04-2022

[~stuefe] The results with fastdebug are attached:
- debug-build.log.gz
- debug-hs_err_pid16984.log

But first, enjoy your vacation! :)
19-04-2022

This is the Dockerfile I'm using: https://github.com/marchof/PiCI/blob/master/jdk/docker/Dockerfile It uses the path /workspace, which needs to be mapped to the OpenJDK tree.
19-04-2022

Still stumped. Cannot reproduce it. The only difference is that I build locally with gcc8 instead of gcc9. I'll retry after my vacation. Is there a Docker container readily available? Or a Dockerfile to build one?
19-04-2022

[~stuefe] Sorry for the confusion. There is only one crash per build (because afterwards the build is terminated). But it looks like each build randomly crashes either at ~StubRoutines::atomic_add or at ~StubRoutines::atomic_cmpxchg. I added hs_err_pid17036.log as an example for atomic_cmpxchg.
19-04-2022

The build is started with the following commands:

bash configure --disable-warnings-as-errors --with-native-debug-symbols=none
make images

It runs on a Raspberry Pi 4 (within a docker container based on ubuntu:18.04).
19-04-2022

ILW = JVM crash; building JaCoCo on ARM32; no known workaround = MMH = P3
19-04-2022

What are the build flags you use for the crashing builds? What is the hardware?
19-04-2022

hs-err file shows crash in atomic_add(), Marc's comment under https://bugs.openjdk.java.net/browse/JDK-8283326 talks about crashing in cmpxchg, so at least two crash locations are involved.
19-04-2022

[~stuefe] I see a single crash reported during the build. Please find the build log attached. I will create a fastdebug build as requested.
19-04-2022

[~stuefe] where are you seeing the different error reports? Is there a missing link to an external report?
19-04-2022

Note that I am on vacation until next week and may only react sporadically.
19-04-2022

I'm unable to reproduce it on a Raspberry 4 4g with Raspbian 32-bit. I built natively, I tried crossbuilding, I ran a number of tests; it just works. You do a scratch build, right?

Could you please build the debug version (configure ... --with-debug-level=fastdebug) and run the actual build with "make images LOG=debug", then attach the resulting hs-err file as well as the build log? Please also sync the sources to 21ea740e1da48054ee46efda493d0812a35d786e (JDK-8284699), just to be sure we build the same sources.

Also, there seem to be several crashes here, all in various stub routines. If possible, please attach multiple hs-err files.

Have to say, so far this is very strange. The crashes are in several stub routines, but none have to do with my patch. So far I am unsure what to make of it.
19-04-2022