JDK-8307549 : AsyncGetCallTrace crashes on Mac M1 after JDK-8294160
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc
  • Affected Version: 17.0.7,17.0.7-oracle
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • OS: os_x
  • CPU: aarch64
  • Submitted: 2023-05-05
  • Updated: 2023-11-30
  • Resolved: 2023-10-16
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17
17.0.9Resolved
Related Reports
Blocks :  
Duplicate :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
JDK-8294160 introduced changes for getting more information for a crash dump. The changes were then backported to 17.0.7 in JDK-8301335.

Among other things, the change included the following addition in Method::is_valid_method:

  } else if (!os::is_readable_range(m, m + 1)) {
    return false;

os::is_readable_range() is implemented with SafeFetch32.
Before JDK-8283326, SafeFetch was generated dynamically in the Code Cache area. On Apple M1/M2 architecture the memory for the dynamic code generation can be either writable or executable, but not both at the same time (W^X).

AsyncGetCallTrace, widely used for profiling (by means of async-profiler), also calls Method::is_valid_method(). Since a thread calling AsyncGetCallTrace is not guaranteed to be in "executable" JIT state, it's illegal to call dynamically generated SafeFetch.

JDK 17.0.7 started crashing on Mac M1 during profiling:

#  SIGBUS (0xa) at pc=0x000000013fd394e4, pid=80473, tid=120091

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v  ~StubRoutines::SafeFetch32
V  [libjvm.dylib+0x7aea40]  _ZN2os17is_readable_rangeEPKvS1_+0x2c
V  [libjvm.dylib+0x75d0c8]  _ZN6Method15is_valid_methodEPKS_+0x38
V  [libjvm.dylib+0x34d1d0]  _ZL31forte_fill_call_trace_given_topP10JavaThreadP15ASGCT_CallTracei5frame+0x460
V  [libjvm.dylib+0x34cd48]  AsyncGetCallTrace+0x1e0

Note: the bug applies only to 17.0.7. It's not an issue with JDK 20+ since SafeFetch was made static in JDK-8283326.
Comments
Thanks [~dholmes], done.
16-10-2023

This should be closed as a duplicate of JDK-8283326. We don't use "Delivered" for bug or enhancement issues.
16-10-2023

Issue resolved as part of the backport of the enhancement JDK-8283326
13-10-2023

[~lujaniuk] The easiest way to reproduce it is to run Java with async-profiler enabled as described at https://github.com/async-profiler/async-profiler/#launching-as-an-agent In fact, the bug has been already fixed by JDK-8283326 backport, so let me just resolve the issue. Thanks.
09-10-2023

[~apangin] Hi! I'd like to ask if you know of any reproduction steps for this issue?
04-10-2023

Sergey, I don't mind rewording the description to make it more clear - feel free to suggest an edit. I personally don't see any false or misleading statement in the description. The issue appeared after JDK-8301335 - true. The crash happens during execution of SafeFetch32 that was added in JDK-8301335 - true. It's unsafe to call dynamically generated SafeFetch32 inside AsyncGetCallTrace - true. Please note that async-profiler's JitWriteProtection trick has nothing to do with the root cause of this issue, except that it makes the crash obvious, that would be otherwise difficult to reproduce. I'm not sure what you mean by "isn't related to the VM implementation".
07-07-2023

Andrei, I'm just saying that the description of this case is wrong and confusing. The crash has a very explicit reason that isn't related to the VM implementation. I agree that ThreadWXEnable should not be involved in any branch of the AsyncGetCallTrace flow.
07-07-2023

Sergey, this *is* a VM bug. Note that `Thread::enable_wx` is not atomic with respect to getting an old state and setting a new state. AsyncGetCallTrace may be called at any point wherever a profiling signal interrupts current thread, including in the middle of `Thread::enable_wx`. Therefore, any solution involving ThreadWXEnable to temporary switch state inside AsyncGetCallTrace does not really fix the issue. Furthermore, ThreadWXEnable should not have been added on the hot path in the first place, since `pthread_jit_write_protect_np` is rather expensive. It's OK to switch state once per AsyncGetCallTrace invocation, but not for every walked frame like it is done now. So, my proposal is to either remove an expensive and faulty call to SafeFetch from AsyncGetCallTrace path or backport the fix to make SafeFetch implementation static.
06-07-2023

This was not clear from the bug description. I'd say that the reporter should make it clear that this is not a VM issue, or at least not an issue in the way it is described. A simple fix in the library sources reveals the real problem, which is fixed in JDK-8304725.
06-07-2023

Yes, you're right.
05-07-2023

Oh, this is not about ThreadWXEnable. It is the feature of AsyncProfiler https://github.com/async-profiler/async-profiler/blob/117594bb4d1ce61b073c8a9cea438b33f3f09c81/src/profiler.cpp#L390. The tool do a lot of stuff with VM internals and that may the reason to disable write protection.
05-07-2023

I think this might be related to the implementation of ThreadWXEnable. Do you observe any failed assertions with fastdebug?
05-07-2023

I look into this a bit more and to me it looks like we either have a bug or the OS is dropping the WX state when executing the signal handler. SafeFetch32 has a WX switcher (https://github.com/openjdk/jdk17u/blob/0f531dacb878efe77912688216b0f39a769259a3/src/hotspot/share/runtime/safefetch.inline.hpp#L40C5-L40C5) and should not throw an exception. The only reason for the crash is that ThreadWXEnable doesn't actually call os::current_thread_enable_wx as it thinks we are already in this state. Forcing the thread to be in the X state at the beginning of AsyncGetCallTrace results in a crash at PcDescContainer::find_pc_desc_internal as expected.
05-07-2023

I proposed backporting the SafeFetch change at the original issue. If nobody wants to do the work, I can probably backport it.
15-06-2023

Backporting is the only viable option to me. It should speed up ASGCT and allow fixes related to ASGCT safety to be backported too.
30-05-2023

Sorry, I meant AGCT calls Method::is_valid_method a lot. Right, backporting JDK-8283326 would also help. I believe it is even better, since AGCT will also benefit from additional safety checks in is_valid_method().
12-05-2023

Back-porting JDK-8283326 would also fix the problem, right?
12-05-2023

> Since AsyncGetCallTrace calls SafeFetch a lot, this would cause notable performance impact. Then I'm confused how Method::is_valid_method() caused a new problem, if AsyncGetCallTrace was already calling SafeFetch on Apple M1. Did you mean AsyncGetCallTrace calls is_valid_method a lot? My change to Method::is_valid_method() was really intended for error reporting. If AsyncGetCallTrace doesn't need to check if Method memory is bad/inaccessible then it would make sense to revert that change by moving the is_readable_range() into os::print_location().
12-05-2023

Does adding: MACOS_AARCH64_ONLY(ThreadWXEnable wx(WXExec, Thread::current())); to os::is_readable_pointer() fix the problem?
11-05-2023

I haven't tried, but I believe this does not fix the problem for the reasons discussed here - https://github.com/openjdk/jdk/pull/13144#issuecomment-1480769232 Furthermore, swtiching WX state is rather expensive (see JDK-8302736). Since AsyncGetCallTrace calls SafeFetch a lot, this would cause notable performance impact.
11-05-2023

Another crash has been just reported: https://github.com/async-profiler/async-profiler/issues/747 It has complete hs_err.log attached. The submitter confirms that switching back to JDK 17.0.6 solves the issue.
06-05-2023

The suggested fix is to revert the following lines: } else if (!os::is_readable_range(m, m + 1)) { return false;
05-05-2023