JDK-8326236 : assert(ce != nullptr) failed in Continuation::continuation_bottom_sender

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 22

Priority: P4
Status: Resolved
Resolution: Fixed

Submitted: 2024-02-20
Updated: 2025-01-20
Resolved: 2025-01-14

Versions (Unresolved/Resolved/Fixed)



JDK 25: Fixed in 25 b06

Related Reports

Relates :	JDK-8321098 - Cooperative JFR Sampling
Relates :	JDK-8168445 - make pd_get_top_frame_for_profiling more robust

Description

Test name(s): applications/skynet/SkyNet24H.java 
Product(s) tested: JDK 22.0.1 b03
OS/architecture: Linux-aarch64 
Reproducible: Highly Intermittent
Regression: Unknown (the issue is highly intermittent)
VM flag: 

Excerpts from Log: 

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (open/src/hotspot/share/runtime/continuation.cpp:290), pid=4042704, tid=4042790
#  assert(ce != nullptr) failed: callee.sp(): 0x0000fffeb7dfeff0
#
# JRE version: Java(TM) SE Runtime Environment (22.0.1+3) (fastdebug build 22.0.1+3-7)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 22.0.1+3-7, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x937a48]  Continuation::continuation_bottom_sender(JavaThread*, frame const&, long*)+0x374
#


Current thread (0x0000ffff20342960):  JfrThreadSampler "JFR Thread Sampler" [id=4042790, stack(0x0000ffff1aa0c000,0x0000ffff1ac0a000) (2040K)] _threads_hazard_ptr=0x0000fffe91ca4d30, _nested_threads_hazard_ptr_cnt=0

Stack: [0x0000ffff1aa0c000,0x0000ffff1ac0a000],  sp=0x0000ffff1ac08140,  free space=2032k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x937a48]  Continuation::continuation_bottom_sender(JavaThread*, frame const&, long*)+0x374  (continuation.cpp:290)
V  [libjvm.so+0xadf028]  frame::safe_for_sender(JavaThread*)+0x548  (frame_aarch64.cpp:167)
V  [libjvm.so+0xd7d2a8]  JavaThread::pd_get_top_frame(frame*, void*, bool)+0x208  (javaThread_linux_aarch64.cpp:73)
V  [libjvm.so+0xd81c48]  JfrGetCallTrace::get_topframe(void*, frame&)+0x28  (jfrCallTrace.cpp:103)
V  [libjvm.so+0xe24adc]  OSThreadSampler::protected_task(SuspendedThreadTaskContext const&)+0xac  (jfrThreadSampler.cpp:193)
V  [libjvm.so+0x1493a50]  SuspendedThreadTask::internal_do_task()+0x40  (signals_posix.cpp:1839)
V  [libjvm.so+0x15539b4]  SuspendedThreadTask::run()+0x14  (suspendedThreadTask.cpp:30)
V  [libjvm.so+0xe24c5c]  JfrThreadSampleClosure::sample_thread_in_java(JavaThread*, JfrStackFrame*, unsigned int)+0x78  (jfrThreadSampler.cpp:210)
V  [libjvm.so+0xe25a8c]  JfrThreadSampleClosure::do_sample_thread(JavaThread*, JfrStackFrame*, unsigned int, JfrSampleType)+0x32c  (jfrThreadSampler.cpp:408)
V  [libjvm.so+0xe28ebc]  JfrThreadSampler::task_stacktrace(JfrSampleType, JavaThread**) [clone .constprop.1]+0x2ac  (jfrThreadSampler.cpp:627)
V  [libjvm.so+0xe29370]  JfrThreadSampler::run()+0x1e0  (jfrThreadSampler.cpp:561)
V  [libjvm.so+0x15b0ee0]  Thread::call_run()+0xac  (thread.cpp:221)
V  [libjvm.so+0x1323a9c]  thread_native_entry(Thread*)+0x12c  (os_linux.cpp:789)
C  [libpthread.so.0+0x7928]  start_thread+0x188

Comments

Changeset: ec2aaaaf Branch: master Author: Patricio Chilano Mateo <pchilanomate@openjdk.org> Date: 2025-01-14 21:51:05 +0000 URL: https://git.openjdk.org/jdk/commit/ec2aaaaf83ad0553d9cb8b3a81e8214b3f5e63fe

14-01-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/23017 Date: 2025-01-09 19:48:53 +0000

09-01-2025

The JFR sampler crashes while trying to take a Java sample of one of the ForkJoinPool (FJP) workers. I investigated this issue by looking at the core file. Here is what's happening:

The FJP worker is signaled by the JFR sampler while in the compiled method VirtualThread.afterDone, at the last safepoint check right before returning, i.e., after the frame has already been popped. Since the _anchor is not set (there is no last_Java_sp), the sampler gets the top frame's sp and pc from the thread's ucontext. The pc still points into the afterDone nmethod, but the sp already points to the caller (the frame has already been popped).

So in frame::safe_for_sender(), when trying to get the sender of the top frame, we calculate sender_sp by incorrectly incrementing the sp by the size of the afterDone frame, which leaves sender_sp pointing to some arbitrary location further up the stack. We then calculate sender_pc as usual from sender_sp[-1]. Unfortunately, this sender_sp[-1] value happens to match StubRoutines::cont_returnBarrier(), so we take the branch that calls Continuation::continuation_bottom_sender(). Since there is no actual ContinuationEntry on the stack, we crash.

frame::safe_for_sender() already calls _cb->is_frame_complete_at(_pc) to verify that the frame is complete at that pc. But this only verifies that the pc is past the point where the frame is created; it does not guard against the case where the frame has already been removed. There is already a comment in codeBlob.hpp about this (https://github.com/openjdk/jdk/blob/ddb58819640dc8f1930d243d6eb07ce88ef79b22/src/hotspot/share/code/codeBlob.hpp#L120). Given that the sp we read from the thread's context already points to the caller of VirtualThread.afterDone (VirtualThread.runContinuation), adding the afterDone frame size to it again means sender_sp points either somewhere within that frame or into one of its callers.
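The over-increment can be reproduced with a toy model of the stack. This is a minimal sketch, not HotSpot code: the word-indexed stack, the frame sizes, and the constants kReturnBarrierPc and kRunContinuationPc are hypothetical, and compute_sender() only mirrors the two steps described above (sender_sp = sp + frame size, then sender_pc = sender_sp[-1]).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the address of StubRoutines::cont_returnBarrier().
constexpr std::uintptr_t kReturnBarrierPc = 0xBA44;
// Hypothetical return address into VirtualThread.runContinuation.
constexpr std::uintptr_t kRunContinuationPc = 0x1234;
// Hypothetical size of the compiled VirtualThread.afterDone frame, in words.
constexpr std::size_t kAfterDoneWords = 8;
// sp of afterDone while its frame is still on the stack: it spans [10, 18).
constexpr std::size_t kSpBeforePop = 10;

// Toy stack: index i models the word at address i. Higher indices hold older
// frames (the stack grows toward lower addresses, as on aarch64).
std::vector<std::uintptr_t> make_toy_stack() {
  std::vector<std::uintptr_t> words(64, 0);
  // Genuine return address, stored in the last word of afterDone's frame.
  words[kSpBeforePop + kAfterDoneWords - 1] = kRunContinuationPc;
  // Stale word inside runContinuation's frame, left behind by an earlier
  // virtual-thread mount whose bottom frame returned to the return barrier.
  words[kSpBeforePop + 2 * kAfterDoneWords - 1] = kReturnBarrierPc;
  return words;
}

struct Sender {
  std::size_t sp;
  std::uintptr_t pc;
};

// The sender computation: sender_sp = sp + frame size, sender_pc = sender_sp[-1].
// Nothing here checks that the frame whose size is added is still on the stack.
Sender compute_sender(const std::vector<std::uintptr_t>& words, std::size_t sp,
                      std::size_t frame_words) {
  const std::size_t sender_sp = sp + frame_words;
  assert(sender_sp >= 1 && sender_sp <= words.size());
  return Sender{sender_sp, words[sender_sp - 1]};
}
```

With the frame intact, compute_sender(make_toy_stack(), kSpBeforePop, kAfterDoneWords) returns sp 18 and kRunContinuationPc. Called with sp already at 18 (frame popped), it returns sp 26 and kReturnBarrierPc: the stale value that sends the real code down the continuation branch.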
The fact that sender_sp[-1] happens to match StubRoutines::cont_returnBarrier() means that, during a previous mount of a virtual thread, the FJP worker frames occupied less stack space, so the bottom-most virtual thread frame landed in what is now VirtualThread.runContinuation or one of its callers. This can happen because interpreted, C1-compiled, and C2-compiled frames differ in size.

This can be fixed by adding an extra check in frame::safe_for_sender() to make sure there is an actual ContinuationEntry on the stack before calling Continuation::continuation_bottom_sender().
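The proposed check can be sketched as a guard on the return-barrier branch. This is a hypothetical model, not the actual change in ec2aaaaf: ToyThread::last_continuation stands in for however the VM locates the thread's current ContinuationEntry, and the sketch assumes that when no entry exists the walk simply reports the frame as unsafe instead of calling Continuation::continuation_bottom_sender().

```cpp
#include <cstdint>

// Hypothetical stand-in for StubRoutines::cont_returnBarrier().
constexpr std::uintptr_t kReturnBarrierPc = 0xBA44;

// Simplified, hypothetical model of a ContinuationEntry frame on the stack.
struct ToyContinuationEntry {};

// Simplified, hypothetical model of a JavaThread: last_continuation is null
// when there is no ContinuationEntry on the thread's stack.
struct ToyThread {
  const ToyContinuationEntry* last_continuation = nullptr;
};

enum class SenderKind { kOrdinary, kContinuationBottom, kUnsafe };

// A sender_pc equal to the return barrier is only trusted if the thread really
// has a ContinuationEntry on its stack. Otherwise the value is treated as stale
// stack contents and the walk bails out (modeling safe_for_sender() returning
// false) instead of reaching the assert in continuation_bottom_sender().
SenderKind classify_sender(const ToyThread& thread, std::uintptr_t sender_pc) {
  if (sender_pc != kReturnBarrierPc) return SenderKind::kOrdinary;
  if (thread.last_continuation == nullptr) return SenderKind::kUnsafe;
  return SenderKind::kContinuationBottom;
}
```

In the scenario from the comment there is no ContinuationEntry on the stack, so last_continuation is null and a stale return-barrier value yields kUnsafe rather than an assertion failure.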

09-01-2025