Bug ID: JDK-8357017 UnexpectedDeoptimizationAllTest.java crashes in C1 during nmethod registration

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 25

Priority: P3
Status: Open
Resolution: Unresolved

Submitted: 2025-05-15
Updated: 2025-07-01

Versions (Unresolved/Resolved/Fixed)

Other: tbd (Unresolved)

Related Reports

Causes :	JDK-8258229 - Crash in nmethod::reloc_string_for
Relates :	JDK-8258229 - Crash in nmethod::reloc_string_for
Relates :	JDK-8358821 - patch_verified_entry causes problems, use nmethod entry barriers instead
Relates :	JDK-8023037 - Race between ciEnv::register_method and nmethod::make_not_entrant_or_zombie

Description

The jtreg test compiler/codecache/stress/UnexpectedDeoptimizationAllTest.java triggered this issue on a 64-thread POWER10 machine running SUSE Linux Enterprise Server 15 SP6 (observed only once; it is unclear whether it is platform-specific):

#  fatal error: not a NativeCall at 0x00007fff7ca20c40

---------------  T H R E A D  ---------------
Current thread (0x00007fff2c1ca5f0):  JavaThread "C1 CompilerThread4" daemon [_thread_in_vm, id=36509, stack(0x00007fff4ce00000,0x00007fff4d200000) (4096K)]

Current CompileTask:
C1:937 1400       3       java.lang.StringCoding::countNonZeroAscii (30 bytes)
Stack: [0x00007fff4ce00000,0x00007fff4d200000],  sp=0x00007fff4d1fcc50,  free space=4083k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x16d0cdc]  NativeCall::verify()+0x8c  (nativeInst_ppc.cpp:159)
V  [libjvm.so+0x1946bac]  Relocation::pd_call_destination(unsigned char*)+0x31c  (nativeInst_ppc.hpp:177)
V  [libjvm.so+0x1945da4]  CallRelocation::value()+0x24  (relocInfo.hpp:950)
V  [libjvm.so+0x16e587c]  nmethod::verify_scopes() [clone .part.0]+0x21c  (nmethod.cpp:3052)
V  [libjvm.so+0x16f0e58]  nmethod::verify()+0x508  (nmethod.cpp:2997)
V  [libjvm.so+0x16f13e0]  nmethod::new_nmethod(methodHandle const&, int, int, CodeOffsets*, int, DebugInformationRecorder*, Dependencies*, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, CompLevel)+0x3b0  (nmethod.cpp:1226)
V  [libjvm.so+0x832e24]  ciEnv::register_method(ciMethod*, int, CodeOffsets*, int, CodeBuffer*, int, OopMapSet*, ExceptionHandlerTable*, ImplicitExceptionTable*, AbstractCompiler*, bool, bool, bool, bool, int)+0x604  (ciEnv.cpp:1062)
V  [libjvm.so+0x61e390]  Compilation::install_code(int)+0x120  (c1_Compilation.cpp:424)
V  [libjvm.so+0x623a90]  Compilation::compile_method()+0x640  (c1_Compilation.cpp:487)
...

The address which was incorrectly assumed to refer to a NativeCall points to the entry point (prolog) of the new nmethod which is being registered:
   0x7fff7ca20c40:   .long 0x0
   0x7fff7ca20c44:   addis   r11,r1,-2
   0x7fff7ca20c48:   std     r0,0(r11)
   0x7fff7ca20c4c:   std     r20,16(r1)
   0x7fff7ca20c50:   stdu    r1,-112(r1)
   0x7fff7ca20c54:   addis   r20,r29,16
   0x7fff7ca20c58:   addi    r20,r20,-19328
   0x7fff7ca20c5c:   mtctr   r20
   0x7fff7ca20c60:   lis     r20,0
   0x7fff7ca20c64:   ori     r20,r20,0
   0x7fff7ca20c68:   ld      r0,32(r16)
   0x7fff7ca20c6c:   cmpw    r0,r20
   0x7fff7ca20c70:   bnectrl

Compiled method (c1) 3201 1400       3       java.lang.StringCoding::countNonZeroAscii (30 bytes)
 total in heap  [0x00007fff7ca20b08,0x00007fff7ca21498] = 2448
 constants      [0x00007fff7ca20c00,0x00007fff7ca20c40] = 64
 main code      [0x00007fff7ca20c40,0x00007fff7ca21428] = 2024
 stub code      [0x00007fff7ca21428,0x00007fff7ca21498] = 112
 mutable data [0x00007fff101076a0,0x00007fff10107738] = 152
 relocation     [0x00007fff101076a0,0x00007fff101076d8] = 56
 metadata       [0x00007fff101076d8,0x00007fff10107738] = 96
 immutable data [0x00007fff10107260,0x00007fff10107608] = 936
 dependencies   [0x00007fff10107260,0x00007fff10107270] = 16
 nul chk table  [0x00007fff10107270,0x00007fff10107290] = 32
 scopes pcs     [0x00007fff10107290,0x00007fff101074d0] = 576
 scopes data    [0x00007fff101074d0,0x00007fff10107608] = 312

The entry point has been overwritten with 0, an illegal instruction that triggers SIGILL. That means the nmethod has been patched to be not entrant.

Seems like -XX:-DeoptimizeRandom can hit nmethods which are not yet completely installed and usable.

Comments

We are seeing a crash because the code from JDK-8258229 has patched the reloc info for the entry point to something completely wrong. I can't see what kind of reloc type we had before patching in the crash report. Could be relocInfo::none or maybe some misinterpreted data_prefix_tag thing. What I can see is that the instructions for which the reloc type was modified don't have a patchable/identifiable interface.
01-07-2025
[~mdoerr] :
> PPC doesn't use metadata relocations at the VEP
also from the PR:
> Adding a nop wouldn't fix the issue for PPC, because it already has instructions which don't need patching at the entry point
Sorry, I'm confused. Why exactly are we seeing a crash on POWER10 if the VEP doesn't need patching and doesn't use a relocation?
> The relocations can also start with relocInfo::data_prefix_tag. The reloc info belongs to something else in that case.
I must be rusty on reloc info, because I don't know what the above means, but I think your point is that if there is a prefix, then change_reloc_info_for_address is doing the wrong thing, correct?
30-06-2025
PPC doesn't use metadata relocations at the VEP. x86 does that in some cases. RelocIterator iter(this, verified_entry_point(), verified_entry_point() + 8) typically finds relocInfo::none on all platforms. The relocations can also start with relocInfo::data_prefix_tag. The reloc info belongs to something else in that case. JDK-8258229 patches ANY reloc type to runtime_call_type and that really sounds scary. I'd vote for a complete backout for jdk25. Also because of NMethodState_lock. We could later backport the better fix JDK-8358821 to jdk25u.
30-06-2025
[~mdoerr] Backporting JDK-8358821 to jdk25 seems a little risky. Maybe one of the above workarounds would be more appropriate? The NOP would only be needed if we were going to start the prolog with a metadata relocation, right?
27-06-2025
This bug only exists because the new code from JDK-8258229 has patched a stupid relocation at the verified entry point. There should never be a relocation at this point on PPC64. The same holds for aarch64 which already has a nop in MachPrologNode::emit:
   // insert a nop at the start of the prolog so we can patch in a
   // branch if we need to invalidate the method later
   __ nop();
There is another trick possible to avoid a relocation: Metadata can be used without emitting relocation information: The Metadata can be recorded by oop_recorder()->find_index(metadata_ptr) to make sure the dependency is tracked. Then, the Metadata address can be loaded as constant without relocation. I don't see any code which requires a patching interface for the Klass* in the clinit_barrier.
06-06-2025
Yes, I think any fix will involve undoing what JDK-8258229 did. This bug was reported on POWER10, so doesn't that mean that ppc64 sometimes has a relocation at the verified entry point? Right now I am investigating the use of the nmethod entry barrier as a solution. There is already logic to call SharedRuntime::get_handle_wrong_method_stub() on return from the barrier, so I just need to trigger that and keep the barrier permanently armed.
05-06-2025
On PPC64, the instructions at the beginning are arranged in a way that they don't need any relocation (C1_MacroAssembler::build_frame and MachPrologNode::emit). Seems like aarch64 and riscv are fine, but s390 may be affected, too. Seems like the metadata relocation comes from the clinit_barrier on x86. It may be possible to rearrange the code. Or to add a nop before mov_metadata. I'd vote for backing out JDK-8258229 and implementing a better solution.
05-06-2025
I don't know what kind of relocation ppc allows at the verified entry point, but on x86 metadata_type seems to be common. We could emit a "fat nop" or something similar that doesn't require relocation info.
04-06-2025
I think NativeJump::patch_verified_entry() is only needed for inline caches which are not yet updated and may still point to the non-entrant method. (OSR methods are not reached via inline caches.) https://github.com/openjdk/jdk/pull/24831 describes the problem that relocation info no longer matched after patching. I wonder why we need relocation information for the entry point at all. Does only x86 have that? If so, could that be changed?
04-06-2025
This is a tricky one. I'm not really happy with the JDK-8258229 changes. It solves the problem with print_nmethod by grabbing the NMethodState_lock. This means make_not_entrant and other code that uses NMethodState_lock might block for a long time waiting for tty output. But the bigger problem is all the other uses of RelocIterator that are still susceptible to this race. I don't think we want to add NMethodState_lock to all those places. I considered changing make_not_entrant() to call change_reloc_info_for_address() with relocInfo::none, which would probably work better than relocInfo::runtime_call_type, but there is still a race unless all users of RelocIterator grab NMethodState_lock. So I'm looking for a lock-free solution.
One possibility is to get rid of NativeJump::patch_verified_entry(). I am wondering why we are calling it at all. We don't do anything like that for OSR nmethods, and apparently nothing has gone wrong. Even if we patch the verified entry, there could still be threads executing just past entry point that sneak by. The only reliable way to make sure no thread is executing the nmethod after the call to make_not_entrant() is to do it at a safepoint, and in that case patching the entry point is not needed.
Another possibility would be to fold the verified entry trap into the GC nmethod entry barrier. Making an nmethod not-entrant would mean arming the nmethod entry barrier and never unarming it, plus return address magic to end up in SharedRuntime::get_handle_wrong_method_stub() upon return.
04-06-2025
Looks like runtime_call_type is wrong for all platforms except x64.
02-06-2025
[~dlong] Thanks for looking into this! This makes sense. Right, it crashes when we call verify() after change_reloc_info_for_address(). I'm surprised that we are changing the reloc info at all. That came in recently with JDK-8258229.
02-06-2025
The relocations base is at content_begin(), the same as constants:
 constants      [0x00007fff7ca20c00,0x00007fff7ca20c40] = 64
 main code      [0x00007fff7ca20c40,0x00007fff7ca21428] = 2024
so the entry point has relative offset 0x40, which corresponds to the 0x3010 at 0x7fff101076a0, I believe. I was expecting nmethod::make_not_entrant() to replace the reloc type at the entry point with relocInfo::none, but it uses relocInfo::relocType::runtime_call_type(), which isn't quite right for platforms that patch in an illegal instruction. This is enough to cause nmethod::verify() to crash on PPC, but only if there is a race, because verify() starts out with the following check:
   if (is_not_entrant())
     return;
If we instead called verify() right after the call to change_reloc_info_for_address() in make_not_entrant(), it seems like it would crash every time on PPC, and maybe aarch64 too.
27-05-2025
If it's not the entry point then we shouldn't be patching it, but from the description it does appear to be the entry point:
# fatal error: not a NativeCall at 0x00007fff7ca20c40
[...]
 main code [0x00007fff7ca20c40,0x00007fff7ca21428] = 2024
[...]
 0x7fff7ca20c40: .long 0x0
If there's not a call there then maybe this is a race between RelocIterator reading and relocInfo::change_reloc_info_for_address() writing the reloc.
27-05-2025
There is never a call instruction directly at the entry point. The relocation data appears to be:
0x7fff101076a0: 0x3010 0x3016 0x305f 0x3016 0x3018 0x3015 0x3013 0x300c
0x7fff101076b0: 0x5003 0x3013 0x3036 0x3036 0x201d 0x8001 0x5805 0x5807
0x7fff101076c0: 0x0077 0x7c01 0xfe8a 0x2800 0x4801 0x780c 0x6001 0x7c01
0x7fff101076d0: 0xfe8a 0x6809 0x4801 0x3009
That means the first relocation refers to a runtime_call_type at offset 16*4 if I see that correctly. No idea why offset 0 is assumed.
26-05-2025
I suspect that the nmethod is being made not_entrant by WB_DeoptimizeAll() and not GC, so CompiledICLocker may not be relevant here. However, nmethod::make_deoptimized() does grab CompiledICLocker before iterating relocs and patching NOPs, so maybe make_not_entrant() should do the same, but instead the latter uses only NMethodState_lock.
24-05-2025
Is it really true that on POWER10 the entry point can have a call instruction (before patching)? I think that would be very unusual for x86, which might explain why I'm having trouble reproducing it on x86.
24-05-2025
[~mdoerr] what is your opinion regarding Dean's comment?
> so moving the call to verify_scopes() down so it is covered by CompiledICLocker would likely also fix the problem.
Should we move the call?
20-05-2025
> We've never seen this in our testing. Is this reproducible or did you see it again in the meantime?
Since the 13th of May (when we hit the issue, which I reported later) I haven't seen it again.
20-05-2025
ILW = crash in stress test, debug build; intermittent, debug build only; maybe disable c1 = MMM = P3
20-05-2025
I don't think DeoptimizeRandom can cause this, because it operates on frames, and there should be no frames with this nmethod at this point. However, the nmethod is registered with GC, so it's possible that something triggered a call to nmethod::make_not_entrant(). There could be a race here if done by a different thread (maybe a GC thread?). If we try to look at the patched verified entry before relocInfo::change_reloc_info_for_address() has been called, we could see this kind of crash. Inserting an artificial delay in make_not_entrant() between NativeJump::patch_verified_entry() and change_reloc_info_for_address() might make it easier to reproduce. I see that make_not_entrant() holds NMethodState_lock while doing all this, so one possible fix would be to grab this same lock in nmethod::verify(). On the other hand, I see that this function already uses CompiledICLocker, which is probably enough to prevent a race with GC threads, so moving the call to verify_scopes() down so it is covered by CompiledICLocker would likely also fix the problem.
20-05-2025
[~mbaesken] We've never seen this in our testing. Is this reproducible or did you see it again in the meantime?
19-05-2025