JDK-8331735 : UpcallLinker::on_exit races with GC when copying frame anchor
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang.foreign
  • Affected Version: 21,22,23,24
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2024-05-06
  • Updated: 2025-04-15
  • Resolved: 2024-11-27
Fix versions:
  • JDK 17: 17.0.15 (Fixed)
  • JDK 21: 21.0.7 (Fixed)
  • JDK 24: 24 b26 (Fixed)
Description
A fatal error has been detected by the Java Runtime Environment:

 SIGSEGV (0xb) at pc=0x0000ffff67e651a8, pid=1499163, tid=1499236

JRE version: Java(TM) SE Runtime Environment (23.0+22) (build 23-ea+22-1781)
 Java VM: Java HotSpot(TM) 64-Bit Server VM (23-ea+22-1781, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
 Problematic frame:
 j  java.awt.Font.getFont2D()Lsun/font/Font2D;+0 java.desktop@23-ea
....
....
[warning][os] Loading hsdis library failed
Comments
[jdk17u-fix-request] Approval Request from Andrew Hughes Partial backport of a fix for a race condition in the FFM API. Can lead to crashes when the FFM code manipulates a frame anchor in native mode, which the GC does not expect to happen. Fix is to move the frame anchor copying to Java mode, where the GC will wait for the thread to get to a safepoint. Risk to other code is low as the UpcallLinker is only used by FFM, which is in incubation in 17u. Patch has been reviewed by Martin Balao.
05-04-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk17u-dev/pull/3434 Date: 2025-04-03 22:19:28 +0000
03-04-2025

[jdk21u-fix-request] Approval Request from Andrew Hughes Clean backport of a fix for a race condition in the FFM API. Can lead to crashes when the FFM code manipulates a frame anchor in native mode, which the GC does not expect to happen. Fix is to move the frame anchor copying to Java mode, where the GC will wait for the thread to get to a safepoint. Risk to other code is low as the UpcallLinker is only used by FFM.
20-02-2025

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk21u-dev/pull/1424 Date: 2025-02-20 16:23:03 +0000
20-02-2025

Changeset: 461ffafe Branch: master Author: Jorn Vernee <jvernee@openjdk.org> Date: 2024-11-27 12:20:51 +0000 URL: https://git.openjdk.org/jdk/commit/461ffafeba459c077f1c2d9c5037305b71a8bc2a
27-11-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/21742 Date: 2024-10-28 13:53:58 +0000
18-11-2024

Jorn has a (draft) PR with the proposed fix: https://github.com/openjdk/jdk/pull/21742. I have been running the reproducer against JDK mainline with that PR's changes on linux-x64 and linux-aarch64 hosts. The repeated launches of the reproducer have been going on for almost 3 days now and it hasn't crashed even once on either of those hosts. Previously, without the proposed fix, it used to crash within a few hours on aarch64 and within a few days on x64. I'll let the reproducer run for a few more days, but at this point I think the fix proposed in that PR addresses this issue.
01-11-2024

I managed to catch this with rr on Linux x64. The issue seems to be that the frame anchor is manipulated while in native. The GC uses these fields when scanning for oops in the threads, and expects that these fields are not manipulated concurrently. This seems to be the problematic code:

#1  UpcallLinker::on_exit (context=0x6d24699ee5f0) at open/src/hotspot/share/prims/upcallLinker.cpp:129
...
124   debug_only(thread->dec_java_call_counter());
125
126   // Old thread-local info. has been restored. We are now back in native code.
127   ThreadStateTransition::transition_from_java(thread, _thread_in_native);
128
129   thread->frame_anchor()->copy(&context->jfa);
130
131   // Release handles after we are marked as being in native code again, since this
132   // operation might block
133   JNIHandleBlock::release_block(context->new_handles, thread);

In my crash the scanned JavaThread Thread-10 has called copy on line 129 and has just executed:

48   void copy(JavaFrameAnchor* src) {
...
56     if (_last_Java_sp != src->_last_Java_sp)
57       _last_Java_sp = nullptr;

So, _last_Java_sp is now null. While this is happening the GC thread is scanning this JavaThread and executing this code:

1431 void JavaThread::oops_do_frames(OopClosure* f, NMethodClosure* cf) {
1432   if (!has_last_Java_frame()) {
1433     return;
1434   }

and because _last_Java_sp is null the has_last_Java_frame function will return false:

JavaThread::has_last_Java_frame (this=0x5b85fc12e1a0) at open/src/hotspot/share/runtime/javaThread.hpp:556
556   bool has_last_Java_frame() const { return _anchor.has_last_Java_frame(); }
JavaFrameAnchor::has_last_Java_frame (this=0x5b85fc12e550) at open/src/hotspot/share/runtime/javaFrameAnchor.hpp:78
78    bool has_last_Java_frame() const { return _last_Java_sp != nullptr; }

This has the effect that the GC skips scanning the oops in the frames, leading to various crashes and failures.
28-10-2024

This week I ran the same test repeatedly with a fastdebug JDK mainline build on a Linux x64 machine. After several days of repeat launches, one run ended up crashing the VM, this time with an assertion failure in the JDK code:

# Internal Error (/jdk/open/src/hotspot/share/runtime/javaThread.cpp:1347), pid=1856780, tid=1856858
# assert((!has_last_Java_frame() && java_call_counter() == 0) || (has_last_Java_frame() && java_call_counter() > 0)) failed: unexpected frame info: has_last_frame=false, java_call_counter=1
...
Current thread (0x00007f6e1c01a640): WorkerThread "GC Thread#15" [id=1856858, stack(0x00007f6e0e7ce000,0x00007f6e0e8ce000) (1024K)]

Stack: [0x00007f6e0e7ce000,0x00007f6e0e8ce000], sp=0x00007f6e0e8cc960, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xee5578] JavaThread::verify_frame_info() [clone .part.0]+0x28 (javaThread.cpp:1347)
V [libjvm.so+0xee8bdb] JavaThread::verify_frame_info()+0x3b (javaThread.cpp:1346)
V [libjvm.so+0xeef82f] JavaThread::oops_do_no_frames(OopClosure*, NMethodClosure*)+0x5f (javaThread.cpp:1387)
V [libjvm.so+0x180a156] Thread::oops_do(OopClosure*, NMethodClosure*)+0x76 (thread.cpp:448)
V [libjvm.so+0x1822284] Threads::possibly_parallel_oops_do(bool, OopClosure*, NMethodClosure*)+0x1c4 (threads.cpp:1164)
V [libjvm.so+0xd70215] G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x75 (g1RootProcessor.cpp:180)
V [libjvm.so+0xd70b71] G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x61 (g1RootProcessor.cpp:61)
V [libjvm.so+0xd81ad2] G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x22 (g1YoungCollector.cpp:669)
V [libjvm.so+0xd82019] G1EvacuateRegionsBaseTask::work(unsigned int)+0x89 (g1YoungCollector.cpp:656)
V [libjvm.so+0x19555b0] WorkerThread::run()+0x80 (workerThread.cpp:70)
V [libjvm.so+0x180aa2a] Thread::call_run()+0xba (thread.cpp:234)
V [libjvm.so+0x14eb963] thread_native_entry(Thread*)+0x123 (os_linux.cpp:858)

JavaThread 0x00007f6ec4959940 (nid = 1856820) was being processed
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
v ~RuntimeStub::nep_invoker_blob 0x00007f6eb3add4e8
J 2139 c2 sun.font.HBShaper.shape(Lsun/font/Font2D;Lsun/font/FontStrike;F[FLjava/lang/foreign/MemorySegment;[CLsun/font/GlyphLayout$GVData;IIIILjava/awt/geom/Point2D$Float;II)V java.desktop@24-internal (52 bytes) @ 0x00007f6eb3f45254 [0x00007f6eb3f44660+0x0000000000000bf4]
J 2180 c2 sun.font.GlyphLayout$EngineRecord.layout()V java.desktop@24-internal (108 bytes) @ 0x00007f6eb3f4c3f0 [0x00007f6eb3f4bfc0+0x0000000000000430]
J 1925 c1 sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector; java.desktop@24-internal (683 bytes) @ 0x00007f6eac5a7c34 [0x00007f6eac5a5300+0x0000000000002934]
J 2093 c2 FontLayoutStressTest.doLayout()D (31 bytes) @ 0x00007f6eb3f2ef58 [0x00007f6eb3f2e480+0x0000000000000ad8]
j FontLayoutStressTest.lambda$main$0(Ljava/util/concurrent/CyclicBarrier;DLjava/util/concurrent/atomic/AtomicReference;)V+23
j FontLayoutStressTest$$Lambda+0x00007f6e2f000a18.run()V+12
j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@24-internal
j java.lang.Thread.run()V+19 java.base@24-internal
v ~StubRoutines::call_stub 0x00007f6eb3742d01

The complete hs_err log file from this crash is attached to this issue (file named "hs_err_pid1856780.log").
It's unclear if this assertion failure leading to the crash is the same issue as what's causing the original crashes. But hopefully this has some hints on what's going on.
28-10-2024

[~stefank] Thanks for the analysis. I think I understand what's going on. This code was originally adapted from `JavaCallWrapper`, but that transitions to _thread_in_vm, rather than _thread_in_native, before copying the frame anchor. I think we can copy the frame anchor while still being in the Java thread state instead.
28-10-2024

Here's a summary of what we have investigated and found so far. The code in question resides in the sun.font.HBShaper class (https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/sun/font/HBShaper.java) and looks like:

static void shape(
    Font2D font2D, FontStrike fontStrike, float ptSize, float[] mat,
    MemorySegment hbface, char[] text, GVData gvData, int script,
    int offset, int limit, int baseIndex, Point2D.Float startPt,
    int flags, int slot) {

    ScopedVars vars = new ScopedVars(font2D, fontStrike, gvData, startPt);
    ScopedValue.where(scopedVars, vars)
               .run(() -> {
        try (Arena arena = Arena.ofConfined()) {
            float startX = (float)startPt.getX();
            float startY = (float)startPt.getY();
            MemorySegment matrix = arena.allocateFrom(JAVA_FLOAT, mat);
            MemorySegment chars = arena.allocateFrom(JAVA_CHAR, text);
            jdk_hb_shape_handle.invokeExact(
                ptSize, matrix, hbface, chars, text.length,
                script, offset, limit, baseIndex,
                startX, startY, flags, slot,
                hb_jdk_font_funcs_struct, store_layout_results_stub);
        } catch (Throwable t) {
        }
    });
}

The HBShaper.shape() method (shown above) constructs a ScopedValue and runs a task in that scope. That scoped task constructs a confined FFM Arena, allocates a few MemorySegment(s) and then does an FFM downcall using the "jdk_hb_shape_handle" MethodHandle. The native side of this downcall is the "jdk_hb_shape" function, which is implemented in src/java.desktop/share/native/libfontmanager/HBShaper_Panama.c (https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/native/libfontmanager/HBShaper_Panama.c). The implementation of this jdk_hb_shape function uses the harfbuzz native library and, apart from doing certain other things, ultimately does an FFM-based upcall into HBShaper's "store_layout_results" method, which looks like:

private static void store_layout_results(
    int slot, int baseIndex, int offset,
    float startX, float startY, float devScale,
    int charCount, int glyphCount,
    MemorySegment /* hb_glyph_info_t* */ glyphInfo,
    MemorySegment /* hb_glyph_position_t* */ glyphPos) {

    // accesses and uses the scoped value that was set previously in HBShaper.shape()
    GVData gvdata = scopedVars.get().gvData();
    Point2D.Float startPt = scopedVars.get().point();
    ...
    // works against the MemorySegment(s) that are passed around by FFM
    // through the upcall from the jdk_hb_shape native function
    ...
}

So essentially, at a high level, this code path can be depicted as:

HBShaper.shape() {
    ScopedValue.run() {
        try (Arena arena = FFM confined Arena through Arena.ofConfined()) {
            create MemorySegment(s);
            do FFM downcall passing the MemorySegment(s);
            native function in the downcall works with the harfbuzz library;
            native function then does an FFM upcall;
            Java method of the upcall invocation works against the MemorySegment(s);
            upcall completes normally, control reaches back into the native function,
            which then returns normally too;
        } // close the Arena
    } // ScopedValue task completion
}

This HBShaper.shape() gets invoked concurrently by numerous threads (intentionally) through the FontLayoutStressTest. So effectively we have several threads doing these confined Arena allocations, working against MemorySegment(s), doing a downcall and an upcall, and finally closing the confined Arena, all within a ScopedValue scope of each thread.

With this background of the code flow, let's take a small step back and look into what a confined Arena, as constructed in this code, consists of. Arena.ofConfined() returns an arena instance which is scoped to the Thread instance that invokes the ofConfined() method.
The arena that's returned is backed by a unique instance of jdk.internal.foreign.MemorySessionImpl. The MemorySessionImpl is what holds on to the Thread instance to which the Arena is scoped; it stores that Thread instance in an internal (final) field called "owner". When working with the Arena instance, to verify and guarantee that only the "owner" thread is allowed to operate on that Arena, there are checks at relevant operations in the MemorySessionImpl where it asserts that "owner" is the same as the current thread (i.e. owner == Thread.currentThread()). That assertion is implemented in MemorySessionImpl's checkValidStateRaw() method (which gets invoked from relevant places in the implementation):

@ForceInline
public void checkValidStateRaw() {
    if (owner != null && owner != Thread.currentThread()) {
        throw WRONG_THREAD;
    }
    ...

If that assertion fails, then the MemorySessionImpl raises a WrongThreadException. In the HBShaper code that we looked into previously, there should never be a case of WrongThreadException. However, during our investigation of this issue, we have noticed that when the JVM crashes there is almost always a WrongThreadException raised just before the crash. That WrongThreadException gets raised when the confined Arena is being closed in HBShaper's try-with-resources block, which, as noted previously, looks as follows:

try (Arena arena = FFM confined Arena through Arena.ofConfined()) {
    ....
} // close the Arena

During this close() of the Arena, the MemorySessionImpl notices that the "owner" field that stores the Thread instance to which the Arena was scoped is no longer the same as Thread.currentThread(). Additional investigation has been done and based on that I have ruled out that Thread.currentThread() is returning a wrong value. Instead, what I have noticed is that the "owner" field in the backing MemorySessionImpl instance starts off with the correct Thread instance when the Arena is created. Then, sometime while the Arena is in use and before the Arena is closed, the "owner" field gets corrupted and ends up holding an incorrect value that no longer matches the original value that was stored in that (final) field (and thus doesn't match Thread.currentThread()). It's as if something in this entire downcall/upcall sequence has ended up overwriting the contents of an address that it shouldn't have touched. I haven't yet been able to narrow this down further (my knowledge of native debugging tools is very minimal).

This entire analysis was done over the past months and is applicable even to the latest JDK mainline as of today (which contains the fixes that Jorn and Phil have done for some related code in this area). I continue to consistently reproduce this both on Linux aarch64 and Linux x64; I haven't tried other OSes. The reproducer I use is the same FontLayoutStressTest that is under investigation here and is attached to this JBS issue (file named "8331735-reproducer-platform-thread.tar.gz"). Instead of running it as a jtreg test, this reproducer repeatedly, in a sequential loop, launches the FontLayoutStressTest as a standalone java process, until the process crashes (due to this bug). The Linux aarch64 crash happens in a few hours whereas the Linux x64 crash takes several days (yes, days) to happen.
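(For reference, the confinement contract described above can be illustrated with a minimal, self-contained sketch against the public API; the class name is illustrative and the sketch is unrelated to the crash itself, it only shows when a WrongThreadException is normally expected:)

import java.lang.foreign.Arena;

public class ConfinedArenaSketch {
    public static void main(String[] args) throws InterruptedException {
        Arena arena = Arena.ofConfined();   // "owner" is the current thread
        arena.allocate(16);                 // fine: accessed by the owner thread

        Thread other = new Thread(() -> {
            try {
                arena.allocate(16);         // any access from a non-owner thread...
            } catch (WrongThreadException expected) {
                System.out.println("expected: " + expected);  // ...throws WrongThreadException
            }
        });
        other.start();
        other.join();

        arena.close();                      // fine: close() is also owner-confined
    }
}

In the HBShaper path above, owner and Thread.currentThread() should always be the same object, which is why the observed WrongThreadException on close() points at the owner field being corrupted rather than at a genuine cross-thread access.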
25-10-2024

Hello Jorn,
> [~jpai] I think you had a reproducer that you tested with my fix for JDK-8337753, and were still seeing issues with the thread pointer being corrupted, right?
Yes, I can consistently reproduce this, although it sometimes takes hours. I thought I had added the analysis from some of our internal discussion to this JBS issue, but it looks like I didn't. I'll update this issue with a summarized version of what we have narrowed down so far, in the next hour or so.
25-10-2024

FWIW, I got a crash when running with -XX:+UseParallelGC on macos-aarch64. The crash happened in non-GC code.

frame #2: 0x00000001048ff8f8 libjvm.dylib`os::start_debugging(buf="SIGSEGV (0xb) at pc=0x000000010daff6ec, pid=51020, tid=68355\n\nDo you want to debug the problem?\n\nTo debug, run 'gdb /proc/51020/exe 51020';

(lldb) x/40i 0x000000010daff6a0
0x10daff6a0: 0x95cb6178 unknown bl 0x114dd7c80
0x10daff6a4: 0xd503201f unknown nop
0x10daff6a8: 0xf280b39f unknown movk xzr, #0x59c
0x10daff6ac: 0xf280001f unknown movk xzr, #0x0
0x10daff6b0: 0xf9404fe1 unknown ldr x1, [sp, #0x98]
0x10daff6b4: 0xd343fc08 unknown lsr x8, x0, #3
0x10daff6b8: 0xb9003c28 unknown str w8, [x1, #0x3c]
0x10daff6bc: 0xd349fc22 unknown lsr x2, x1, #9
0x10daff6c0: 0xd2980003 unknown mov x3, #0xc000
0x10daff6c4: 0xf2bfddc3 unknown movk x3, #0xfeee, lsl #16
0x10daff6c8: 0x3823785f unknown strb wzr, [x2, x3, lsl #0]
0x10daff6cc: 0xd28cf904 unknown mov x4, #0x67c8
0x10daff6d0: 0xf2a80784 unknown movk x4, #0x403c, lsl #16
0x10daff6d4: 0xf2c00024 unknown movk x4, #0x1, lsl #32
0x10daff6d8: 0xb50000a0 unknown cbnz x0, 0x10daff6ec
0x10daff6dc: 0x3943e488 unknown ldrb w8, [x4, #0xf9]
0x10daff6e0: 0xb2400108 unknown orr x8, x8, #0x1
0x10daff6e4: 0x3903e488 unknown strb w8, [x4, #0xf9]
0x10daff6e8: 0x14000035 unknown b 0x10daff7bc
0x10daff6ec: 0xb9400803 unknown ldr w3, [x0, #0x8]
0x10daff6f0: 0xd25d2c63 unknown eor x3, x3, #0x7ff800000000
0x10daff6f4: 0x91042089 unknown add x9, x4, #0x108
0x10daff6f8: 0xf9400128 unknown ldr x8, [x9]
0x10daff6fc: 0xeb08007f unknown cmp x3, x8
0x10daff700: 0x540000a1 unknown b.ne 0x10daff714
0x10daff704: 0xf9408888 unknown ldr x8, [x4, #0x110]
0x10daff708: 0x91000508 unknown add x8, x8, #0x1
0x10daff70c: 0xf9008888 unknown str x8, [x4, #0x110]
0x10daff710: 0x1400001c unknown b 0x10daff780
0x10daff714: 0x91046089 unknown add x9, x4, #0x118
0x10daff718: 0xf9400128 unknown ldr x8, [x9]
0x10daff71c: 0xeb08007f unknown cmp x3, x8
0x10daff720: 0x540000a1 unknown b.ne 0x10daff734
0x10daff724: 0xf9409088 unknown ldr x8, [x4, #0x120]
0x10daff728: 0x91000508 unknown add x8, x8, #0x1
0x10daff72c: 0xf9009088 unknown str x8, [x4, #0x120]
0x10daff730: 0x14000014 unknown b 0x10daff780
0x10daff734: 0x91042089 unknown add x9, x4, #0x108
0x10daff738: 0xf9400128 unknown ldr x8, [x9]
0x10daff73c: 0xb50000c8 unknown cbnz x8, 0x10daff754

It crashes when reading the compressed klass pointer because the object pointer is 0:

0x10daff6ec: 0xb9400803 unknown ldr w3, [x0, #0x8]
0x10daff6f0: 0xd25d2c63 unknown eor x3, x3, #0x7ff800000000
(lldb) p $x0
(unsigned long) 0
(lldb) p CompressedKlassPointers::_base
(address) 0x00007ff800000000 ""
24-10-2024

I don't think this is a GC issue given that GC crashes when scanning an OopMap. That is often an indication that something is broken outside of the GC. I wouldn't mind if we create a new Bug for these latest crashes.
24-10-2024

Since the issue doesn't appear to be related to JDK-8337753, I'm not sure if the remaining issue is FFM related. Maybe someone from the GC team could take another look at this, as they probably have more experience with debugging this kind of issue. I wouldn't mind rotating back around to this eventually, but I'm a bit busy with another project at the moment. [~jpai] I think you had a reproducer that you tested with my fix for JDK-8337753, and were still seeing issues with the thread pointer being corrupted, right?
24-10-2024

> unless trim_queue_to_threshold is a function that is basically the main() of the GC it is really interesting that it keeps coming up.

It *is* the `main()` of the GC.
24-10-2024

Unless trim_queue_to_threshold is a function that is basically the main() of the GC, it is really interesting that it keeps coming up. The one FFM bug that was noted is fixed, we've beaten to death the desktop 3rd-party library code usage, and the ScopedValues (IIUC) are just a field attached to a Java Thread object, so assuming it references only another Java object it should be OK. I think on current JDK 24 tip it would be good for GC to take another look. Perhaps we should run this test 10,000 (or more?) times with a different collector?
24-10-2024

FWIW, I can reproduce a crash by running the test in a loop and spawning multiple such loops concurrently (to mess with the timing of things). From the crashes I've looked at, it seems like we are scanning this thread:

for thread: "Thread-58" #114 [1628888] prio=5 os_prio=0 cpu=49,46ms elapsed=303,70s tid=0x00007ef318183800 nid=1628888 waiting for monitor entry [0x00007ef3b38fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at sun.font.FontAccess.getFontAccess(java.desktop@24-internal/FontAccess.java:42)
	- waiting to lock <0x00000000fa803c98> (a java.lang.Class for sun.font.FontAccess)
	at sun.font.FontUtilities.getFont2D(java.desktop@24-internal/FontUtilities.java:151)
	at sun.font.StandardGlyphVector.initFontData(java.desktop@24-internal/StandardGlyphVector.java:1110)
	at sun.font.StandardGlyphVector.initGlyphVector(java.desktop@24-internal/StandardGlyphVector.java:231)
	at sun.font.StandardGlyphVector.<init>(java.desktop@24-internal/StandardGlyphVector.java:186)
	at sun.font.GlyphLayout$GVData.createGlyphVector(java.desktop@24-internal/GlyphLayout.java:612)
	at sun.font.GlyphLayout.layout(java.desktop@24-internal/GlyphLayout.java:484)
	at java.awt.Font.layoutGlyphVector(java.desktop@24-internal/Font.java:2856)
	at FontLayoutStressTest.doLayout(FontLayoutStressTest.java:51)
	at FontLayoutStressTest.lambda$main$0(FontLayoutStressTest.java:67)
	at FontLayoutStressTest$$Lambda/0x00007ef373001208.run(Unknown Source)
	at java.lang.Thread.runWith(java.base@24-internal/Thread.java:1589)
	at java.lang.Thread.run(java.base@24-internal/Thread.java:1576)

And in one of the crashes we were crashing while scanning the OopMap of an nmethod frame:

0x00007ef3dc494ab8 is at entry_point+1302 in (nmethod*)0x00007ef3dc494408
Compiled method (c2) 440902 1902 4 sun.font.StandardGlyphVector::initFontData (180 bytes)
 total in heap    [0x00007ef3dc494408,0x00007ef3dc495138] = 3376
 relocation       [0x00007ef3dc4944e0,0x00007ef3dc494570] = 144
 constants        [0x00007ef3dc494580,0x00007ef3dc4945a0] = 32
 main code        [0x00007ef3dc4945a0,0x00007ef3dc495050] = 2736
 stub code        [0x00007ef3dc495050,0x00007ef3dc495068] = 24
 oops             [0x00007ef3dc495068,0x00007ef3dc495070] = 8
 metadata         [0x00007ef3dc495070,0x00007ef3dc495138] = 200
 immutable data   [0x00007ef0a4041c30,0x00007ef0a4041f10] = 736
 dependencies     [0x00007ef0a4041c30,0x00007ef0a4041c78] = 72
 nul chk table    [0x00007ef0a4041c78,0x00007ef0a4041ca0] = 40
 handler table    [0x00007ef0a4041ca0,0x00007ef0a4041cd0] = 48
 scopes pcs       [0x00007ef0a4041cd0,0x00007ef0a4041de0] = 272
 scopes data      [0x00007ef0a4041de0,0x00007ef0a4041f10] = 304
End of assembler dump.

(gdb) disass 0x00007ef3dc494ab8 - 0x40, +100
Dump of assembler code from 0x7ef3dc494a78 to 0x7ef3dc494adc:
   0x00007ef3dc494a78:	out    %al,(%dx)
   0x00007ef3dc494a79:	xchg   %ax,%ax
   0x00007ef3dc494a7b:	call   0x7ef3dbd39760
   0x00007ef3dc494a80:	nopl   0x2000678(%rax,%rax,1)
   0x00007ef3dc494a88:	mov    $0xfffffff6,%esi
   0x00007ef3dc494a8d:	xchg   %ax,%ax
   0x00007ef3dc494a8f:	call   0x7ef3dbd39760   <= Uncommon trap blob
   0x00007ef3dc494a94:	nopl   0x300068c(%rax,%rax,1)
   0x00007ef3dc494a9c:	mov    %r14d,(%rsp)
   0x00007ef3dc494aa0:	mov    %rsi,%rbp
   0x00007ef3dc494aa3:	movabs $0xfa803c98,%rsi
   0x00007ef3dc494aad:	lea    0x30(%rsp),%rdx
   0x00007ef3dc494ab2:	nop
   0x00007ef3dc494ab3:	call   0x7ef3dbd45ae0   <= Broken oop map here
   0x00007ef3dc494ab8:	nopl   0x40006b0(%rax,%rax,1)
   0x00007ef3dc494ac0:	mov    (%rsp),%r14d
   0x00007ef3dc494ac4:	jmp    0x7ef3dc494653
   0x00007ef3dc494ac9:	movabs $0xfa803c98,%rdi
   0x00007ef3dc494ad3:	lea    0x30(%rsp),%rsi
   0x00007ef3dc494ad8:	mov    %r15,%rdx
   0x00007ef3dc494adb:	movabs $0x7ef3f1dd47f0,%r10
End of assembler dump.
24-10-2024

Here's the crashing thread's stack from the jdk-24+13-1350-tier3 sighting: java/awt/font/TextLayout/FontLayoutStressTest.java

--------------- T H R E A D ---------------
Current thread (0x0000ffff4c00a830): WorkerThread "GC Thread#4" [id=3306141, stack(0x0000fffef3208000,0x0000fffef3406000) (2040K)]

Stack: [0x0000fffef3208000,0x0000fffef3406000], sp=0x0000fffef3404520, free space=2033k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xed1890]
V [libjvm.so+0x70a260] G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+0x31c (g1ParScanThreadState.inline.hpp:60)
V [libjvm.so+0x73afb4] G1ParEvacuateFollowersClosure::do_void()+0x94 (g1YoungCollector.cpp:582)
V [libjvm.so+0x73b594] G1EvacuateRegionsTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int)+0x74 (g1YoungCollector.cpp:607)
V [libjvm.so+0x7391e0] G1EvacuateRegionsBaseTask::work(unsigned int)+0xa0 (g1YoungCollector.cpp:657)
V [libjvm.so+0xdae008] WorkerThread::run()+0x98 (workerThread.cpp:70)
V [libjvm.so+0xd01028] Thread::call_run()+0xa8 (thread.cpp:225)
V [libjvm.so+0xb853b0] thread_native_entry(Thread*)+0xdc (os_linux.cpp:858)
C [libpthread.so.0+0x7950] start_thread+0x190

siginfo: si_signo: 4 (SIGILL), si_code: 1 (ILL_ILLOPC), si_addr: 0x0000ffff95187890
26-08-2024

> [~jvernee] does `-Xlog:class+unload` show the class being unloaded?

Yes, it is being unloaded, thanks for the suggestion. I was looking at the class unload events in the hs_err log, but that doesn't seem to include everything.

I think I've figured out the issue I'm seeing: the lambda form holding the method we're calling is being customized (I thought this was impossible), which replaces the form field in the MethodHandle instance we're targeting with the upcall stub. It is in principle fine to keep using the un-customized LF like we do, but now that LF class is not referenced by anything, and is then, evidently, unloaded. Another clue which I didn't notice sooner: the Method* embedded in the upcall stub is different from the one pointed to by the receiver MH in several of these crashes where both of those values were still in the registers at the time of the crash. I'm going to file a separate issue for that, since I'm still not 100% sure it's related to this issue, but it seems likely.
02-08-2024

[~jvernee] does `-Xlog:class+unload` show the class being unloaded?
01-08-2024

Little update: from examining the many different crashes I'm seeing with the new reproducer, there are strong indications that the class pointed to by the FFM upcall stub is either being unloaded, or being corrupted. E.g. we embed the Method* of the target method in the stub, and then call into `from_compiled`:

__ ldr(rscratch1, Address(rmethod, Method::from_compiled_offset()));
__ blr(rscratch1);

It is exactly at this blr instruction that a crash occurs (SIGBUS), and from the crash log, the `Method` that rmethod points to cannot be decoded, and is just described as 'pointing into metadata'. The contents of rmethod is just loaded from a constant (using mov_metadata), and the code looks intact at the time of the crash, when looking at the disassembly. I've also seen several other crashes where the InstanceKlass or constant pool associated with this Method was corrupted. The holder class is a hidden class, referenced through a MethodHandle stashed in a global JNI ref. Maybe that is not enough to keep the class alive. Or maybe I'm misunderstanding the lifetime of Method objects in metaspace (can they be moved around?). Still investigating. I'm also seeing frequent malloc errors reported on STDERR, such as: 'int_mallinfo(): unaligned fastbin chunk detected', which seems to indicate a double free (?). I have not been able to find the source of the issue with ASAN/gdb so far. Metaspace is not really my area, so suggestions would be welcome.
01-08-2024

The current status on this is that Jorn has a test case based on java/foreign/TestUpcallStress.java that seems to reproduce the issue within a few minutes on linux-aarch64. The test doesn't use a ScopedValue.
24-07-2024

> I'm running another round of tests, this time with platform threads to rule out any virtual thread usage from the reproducer.

The issue is reproducible even with just platform threads. I've now attached 8331735-reproducer-platform-thread.tar.gz which also contains the (same set of) instructions on how to use it. This reproducer is similar to the previous virtual thread reproducer, except that it doesn't make use of any virtual threads in any part of the testing. The issue continues to reproduce and crashes the JVM. Here too we are noticing that the object instance id of the same platform thread changes unexpectedly during the FFM task execution happening within a ScopedValue block. Additional investigation is currently in progress to understand more about what causes this.
23-07-2024

I've attached 8331735-reproducer-virtual-thread.tar.gz which contains a README.md with instructions on how to use the reproducer. It's trivial and just requires setting JAVA_HOME and JT_HOME (pointing to the jtreg installation). The reproducer uses a virtual thread to launch the main() method of the java application/test.
23-07-2024

I now have a reproducer (in fact this same FontLayoutStressTest, just launched in a simpler way) which consistently reproduces this issue locally against the latest mainline master branch. I chose linux-aarch64 to reproduce it. It takes several runs (sometimes 1000 odd) before it fails with a JVM crash, but it does indeed always crash and fail, sometimes after around 30 minutes of repeated launching of the test.

After looking at the client-libs code and the ScopedValue code involved, I've been adding tiny bits of debug logging in the JDK, at relevant places, to understand what's going on. Here are the launch options I use to try and help narrow this down (only the relevant snippet):

-ea -esa -Dsun.java2d.debugfonts=true -Xlog:exceptions=info -XX:+UnlockDiagnosticVMOptions -XX:+ShowCarrierFrames -XX:+ShowHiddenFrames --add-modules java.desktop

I've been able to reproduce this consistently when virtual threads are involved. What I mean by that is, if I launch the main class' (which is FontLayoutStressTest.java) main() method through a virtual thread, then I am always able to reproduce it. How I launch the main() method of a main class through a virtual thread is borrowed from what jtreg itself does (when -Dtest.threadFactory is Virtual) - I'll be attaching the relevant scripts and details to this issue. I haven't so far been able to reproduce it when launching main() with a regular platform thread. However, I haven't completely ruled out platform threads being impacted, only because I manually stopped the platform-thread-based run after several hours of not being able to reproduce it. Given the way this is failing (which I explain below), I suspect this issue is likely very much to do with virtual threads.

Based on a few initial runs with -Xlog:exceptions=info (thanks to David Holmes for mentioning this option in a different issue), I managed to notice some interesting exceptions being reported, and looking at the code in the client-libs src/java.desktop/share/classes/sun/font/HBShaper.java, the following code seemed relevant (https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/sun/font/HBShaper.java#L462):

ScopedVars vars = new ScopedVars(font2D, fontStrike, gvData, startPt);
ScopedValue.where(scopedVars, vars)
           .run(() -> {
    try (Arena arena = Arena.ofConfined()) {
        float startX = (float)startPt.getX();
        float startY = (float)startPt.getY();
        MemorySegment matrix = arena.allocateFrom(JAVA_FLOAT, mat);
        MemorySegment chars = arena.allocateFrom(JAVA_CHAR, text);
        /*int ret =*/ jdk_hb_shape_handle.invokeExact(
            ptSize, matrix, hbface, chars, text.length,
            script, offset, limit, baseIndex,
            startX, startY, flags, slot,
            hb_jdk_font_funcs_struct, store_layout_results_stub);
    } catch (Throwable t) {
    }
});

So it uses ScopedValues to run a task which uses FFM, specifically `Arena.ofConfined()`. That code also has a `catch (Throwable t)` block which catches and eats up anything that's thrown from that task, including any exceptions that might happen when the Arena is being used or closed or any FFM operation is invoked. I am not sure if that catch block is intentional in its current form, and why any failures from the FFM call might be ignored. To gain more data from that part of the code, I added a few debug logs in that catch block and reran the reproducer.

When the JVM crashes (I haven't included the crash log because it's similar to what we have seen so far), I am noticing the following (only the relevant parts):

release0 called from thread Thread[#45,Thread-14,5,VirtualThreads] owner is Thread[#45,Thread-14,5,VirtualThreads]
...
wrong thread: Thread[#45,Thread-14,5,VirtualThreads] (objid= java.lang.Thread@3c7f8406) owner: Thread[#45,Thread-14,5,VirtualThreads] (objid=java.lang.Thread@5490003)
...
[0.899s][info][exceptions] Exception <a 'jdk/internal/misc/ScopedMemoryAccess$ScopedAccessError'{0x00000005490259c0}: Invalid memory access> thrown in interpreter method <{method} {0x0000ffff34387520} 'checkValidStateRaw' '()V' in 'jdk/internal/foreign/MemorySessionImpl'> at bci 91 for thread 0x0000ffff041f9f60 (Thread-14)
...
[0.899s][info][exceptions] Exception <a 'jdk/internal/misc/ScopedMemoryAccess$ScopedAccessError'{0x00000005490259c0}: Invalid memory access> thrown in interpreter method <{method} {0x0000ffff34387610} 'checkValidState' '()V' in 'jdk/internal/foreign/MemorySessionImpl'> at bci 1 for thread 0x0000ffff041f9f60 (Thread-14)
[0.899s][info][exceptions] Found matching handler for exception of type "jdk.internal.misc.ScopedMemoryAccess$ScopedAccessError" in method "checkValidState" at BCI: 7
[0.899s][info][exceptions] Exception <a 'java/lang/WrongThreadException'{0x000000054e3ff6b8}: Attempted access outside owning thread> thrown in interpreter method <{method} {0x0000ffff34387610} 'checkValidState' '()V' in 'jdk/internal/foreign/MemorySessionImpl'> at bci 12 for thread 0x0000ffff041f9f60 (Thread-14)
[0.899s][info][exceptions] Exception <a 'java/lang/WrongThreadException'{0x000000054e3ff6b8}: Attempted access outside owning thread> thrown in interpreter method <{method} {0x0000ffff34565230} 'justClose' '()V' in 'jdk/internal/foreign/ConfinedSession'> at bci 1 for thread 0x0000ffff041f9f60 (Thread-14)
[0.899s][info][exceptions] Exception <a 'java/lang/WrongThreadException'{0x000000054e3ff6b8}: Attempted access outside owning thread> thrown in C1 compiled method <{method} {0x0000ffff34370040} 'lambda$shape$0' '(Ljava/awt/geom/Point2D$Float;[F[CFLjava/lang/foreign/MemorySegment;IIIIII)V' in 'sun/font/HBShaper'> at PC0x0000ffff64afcad4 for thread 0x0000ffff041f9f60
[0.899s][info][exceptions] Found matching handler for exception of type "java.lang.WrongThreadException" in method "lambda$shape$0" at BCI: 129
[0.899s][info][exceptions] Thread 0x0000ffff041f9f60 continuing at PC 0x0000ffff64afdf04 for exception thrown at PC 0x0000ffff64afcad4

Thread[#45,Thread-14,5,VirtualThreads] error: java.lang.WrongThreadException: Attempted access outside owning thread
java.lang.WrongThreadException: Attempted access outside owning thread
	at java.base/jdk.internal.foreign.MemorySessionImpl.wrongThread(MemorySessionImpl.java:318)
	at java.base/jdk.internal.foreign.MemorySessionImpl$$Lambda/0x0000008001035148.get(Unknown Source)
	at java.base/jdk.internal.misc.ScopedMemoryAccess$ScopedAccessError.newRuntimeException(ScopedMemoryAccess.java:113)
	at java.base/jdk.internal.foreign.MemorySessionImpl.checkValidState(MemorySessionImpl.java:213)
	at java.base/jdk.internal.foreign.ConfinedSession.justClose(ConfinedSession.java:88)
	at java.base/jdk.internal.foreign.MemorySessionImpl.close(MemorySessionImpl.java:236)
	at java.base/jdk.internal.foreign.ArenaImpl.close(ArenaImpl.java:50)
	at java.desktop/sun.font.HBShaper.lambda$shape$0(HBShaper.java:480)
	at java.desktop/sun.font.HBShaper$$Lambda/0x00000080010d6a58.run(Unknown Source)
	at java.base/jdk.internal.vm.ScopedValueContainer.runWithoutScope(ScopedValueContainer.java:112)
	at java.base/jdk.internal.vm.ScopedValueContainer.run(ScopedValueContainer.java:98)
	at java.base/java.lang.ScopedValue$Carrier.runWith(ScopedValue.java:484)
	at java.base/java.lang.ScopedValue$Carrier.run(ScopedValue.java:468)
	at java.desktop/sun.font.HBShaper.shape(HBShaper.java:464)
	at java.desktop/sun.font.SunLayoutEngine.layout(SunLayoutEngine.java:187)
	at java.desktop/sun.font.GlyphLayout$EngineRecord.layout(GlyphLayout.java:669)
	at java.desktop/sun.font.GlyphLayout.layout(GlyphLayout.java:459)
	at java.desktop/java.awt.Font.layoutGlyphVector(Font.java:2856)
	at FontLayoutStressTest.doLayout(FontLayoutStressTest.java:51)
	at FontLayoutStressTest.lambda$main$0(FontLayoutStressTest.java:67)
	at FontLayoutStressTest$$Lambda/0x0000008001001cc8.run(Unknown Source)
	at java.base/java.lang.Thread.runWith(Thread.java:1588)
	at java.base/java.lang.Thread.run(Thread.java:1575)

So Arena.close() from that task run through a ScopedValue ends up throwing a WrongThreadException. I additionally added a few more debug logs at more relevant locations, including the code in the checkValidStateRaw() method of src/java.base/share/classes/jdk/internal/foreign/MemorySessionImpl.java, from where the WrongThreadException gets thrown. Something like:

--- a/src/java.base/share/classes/jdk/internal/foreign/MemorySessionImpl.java
+++ b/src/java.base/share/classes/jdk/internal/foreign/MemorySessionImpl.java
@@ -189,7 +189,11 @@ public boolean isAlive() {
      */
     @ForceInline
     public void checkValidStateRaw() {
-        if (owner != null && owner != Thread.currentThread()) {
+        final Thread curr;
+        if (owner != null && owner != (curr = Thread.currentThread())) {
+            System.err.println("wrong thread: " + curr
+                    + " (objid= " + Objects.toIdentityString(curr)
+                    + ") owner: " + owner + " (objid=" + Objects.toIdentityString(owner) + ")");
             throw WRONG_THREAD;
         }

You will notice in the above log snippet that this line has generated the following message:

wrong thread: Thread[#45,Thread-14,5,VirtualThreads] (objid= java.lang.Thread@3c7f8406) owner: Thread[#45,Thread-14,5,VirtualThreads] (objid=java.lang.Thread@5490003)

So it appears that although it's the "same" thread that owns the confined arena and is the one closing that Arena, the object identity check `owner != Thread.currentThread()` on that line isn't passing. The debug log message that I added also prints the object identity of the "owner" and the current thread, and it shows that the object identities are not the same (although it's the "same" thread). I haven't fully grasped what's going on here and why, but the object identity of the Thread instance changing appears odd. The other part that isn't clear to me is whether this is the root cause of the JVM crash.

Edit: I'm running another round of tests, this time with platform threads, to rule out any virtual thread usage from the reproducer.
23-07-2024

I don't know why this was made a P2. It is very rare and there isn't even a proper understanding of the cause, despite lots of people looking at it very hard.
22-07-2024

This happens to be a client test but it isn't a client bug.
22-07-2024

This one is assigned to me, but I don't think there's anything I can do. Scoped values are pure Java code, and whatever the problem is, I'm sure it could be reproduced by a test case that doesn't use scoped values at all. I'd do so if I were able to reproduce the bug. I'm unassigning myself now.
19-07-2024

No crashes with -Xint after 20k runs on linux-aarch64.
C1-only crashes (-XX:TieredStopAtLevel=1): 3/10000 on linux-aarch64.
C2-only crashes (-XX:-TieredCompilation)
12-07-2024

> "All linux-aarch64 (I always run the same number on linux-x64 at the same time)."
>
> Mmmm, but what exact processor? This might be micro-architecture dependent.

The crashes also occur on linux-x64 (AMD EPYC 9J14, from the attached hs_err file) and osx-x64 (Mac minis, i5-4308U CPU), just much less frequently. The linux-aarch64 machines are Oracle OCI AmpereOne processors.
11-07-2024

"All linux-aarch64 (I always run the same number on linux-x64 at the same time). " Mmmm, but what exact processor? This might be micro-architecture dependent.
06-07-2024

With -XX:+DeoptimizeALot (I made the flag product to be able to use it in release mode as I never managed to reproduce this issue in debug mode) there is no particular change in the number of failures - 3 in 5000. Two of those have the usual broken `scopedValueBindings` member of j.l.Thread. All linux-aarch64 (I always run the same number on linux-x64 at the same time).
03-07-2024

I've also been trying to look at this, but I'm currently stuck on trying to get a working gdb on the Linux/Aarch64 machine I'm using (keeps crashing while evaluating expressions in the debugger). One other possibility that comes to mind is that deoptimization is at fault here, as that is also quite rare. [~aph] maybe you could try running with -XX:+DeoptimizeALot to see if that triggers the failure.
01-07-2024

I've carefully stepped through (on AArch64) the code creating and scanning the stack frames associated with an upcall. I can find no fault in the way that saved oops are handled. Another possibility, given that it's said to be easier to reproduce this fault on AArch64 (although I haven't managed to do so), is that it has to do with inter-thread synchronization. From what I can see, though, there is no shortage of fences when performing a handshake at a safepoint. Having said that, I believe that G1 does some concurrent processing. I will now return to something else. As I said above, the ScopedValue implementation is plain Java. If it breaks in the way described, that's because of something done to it by the VM, not because of something it did. If anyone can tell me how to reproduce this bug, maybe I can say more.
01-07-2024

So far I've been unable to reproduce this. For anyone who has done so outside CI, what was the environment? It's been seen on Neoverse N2, right? How many cores? Does that matter? Anything that might help would be good.
29-06-2024

Let's check that FP is handled correctly, and we should make sure that the printed OOP map is not misleading, if only for the sake of future maintainers. The scoped values code that handles this is pure Java. The behaviour we're seeing here is entirely consistent with a register pointing to scopedValueBindings not being updated when it should be. We're missing a GC root somewhere.
28-06-2024

> (I hope I got the intrinsic names right - not sure if the names are actually verified to exist somewhere?)

In https://bugs.openjdk.org/browse/JDK-8334386 I'm working to introduce a VM option which will list the supported intrinsic names on a platform, along with the class and method names to which each intrinsic name is mapped.
28-06-2024

> Yes, it does have an oop map, and the frame pointer should be added to that implicitly. (that might be why it's not printed?). At least, that's how it worked several years ago when I looked at it.

Sorry, I think I'm not remembering correctly. I'll look into this a bit more tomorrow, but the way it works, IIRC, is that when we walk to the sender frame from the nep_invoker_blob frame, we update a register map with the saved location of the frame pointer (see the call to update_map_with_saved_link in frame::sender_for_compiled_frame). The caller can then inspect the saved FP location if it needs to.
28-06-2024

> we should make sure that the printed OOP map is not misleading It's handled by the oop map of the caller frame (that's the frame that knows whether there's an oop in FP/RBP after all). You should be able to see it if you print out the oop map of the frame for `TestStackWalk::payload` during the stackwalk (see whitebox.cpp::WB_VerifyFrames) in the test I pointed at. The oop map at the call to the nep_invoker_blob looks like this on my Windows-x64 machine: ImmutableOopMap {rbp=Oop [0]=Oop [8]=Oop [24]=Oop [32]=Oop }[WhiteBox::VerifyFrames]
28-06-2024

> Does nep_invoker_blob have an oopmap including FP that is scanned and updated by the GC?

Yes, it does have an oop map, and the frame pointer should be added to that implicitly. (that might be why it's not printed?). At least, that's how it worked several years ago when I looked at it. FWIW, we had issues during development with RBP going dead during GC stack walks before, and added tests for that (e.g. https://github.com/openjdk/jdk/blob/master/test/jdk/java/foreign/stackwalk/TestStackWalk.java#L145). The code has changed quite a bit since then, but the oops in the nep_invoker_blob frame were handled through frame::oops_do_internal -> frame::oops_code_blob_do, ~~but it seems that since https://bugs.openjdk.org/browse/JDK-8329629 we now silently ignore code blobs that are not nmethods? [1]. [~stefank] How are oops in the frames of non-nmethod code blobs handled after JDK-8329629?~~ Ah, never mind, that method still handles oops for non-nmethod frames as well [1]: https://github.com/openjdk/jdk/pull/18653/files#diff-47d6ef8a97116fd4facc79e558bd8bf5c7db97d36303d07238a9374a2be3aa3cR987
28-06-2024

One thing I'd like to know from the FFM people: Let's say that an OOP is in register FP. We call from Java into nep_invoker_blob. That OOP is still live in FP, and must be processed by GC. Does nep_invoker_blob have an oopmap including FP that is scanned and updated by the GC? The stub's _oop_maps->print() doesn't print it, if so. So how does GC know where to find the saved OOP that was in FP, and is now saved in the nep_invoker_blob's stack? Surely it doesn't.
27-06-2024

The odd thing about this is that the ScopedValue implementation is literally plain Java. It's not hiding anything where a GC can't see it. The only odd thing is ensureMaterializedForStackWalk, which literally does nothing - it's a NOP - but C2 doesn't know that, so it prevents scalar replacement of the scopedValueBindings. The only code that accesses scopedValueBindings is in Java, before and after the native code in jdk_hb_shape(). So whatever the real cause of the problem is, I don't think it's in the ScopedValue code.
27-06-2024

I am working on this.
27-06-2024

I attached the hs_err file for that linux-x64 crash analyzed in https://bugs.openjdk.org/browse/JDK-8331735?focusedId=14684098&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14684098 . There is no OOME or VM error I can see.
26-06-2024

I've been thinking about where there is any code in the VM that might cause the current bindings to be invalid, and I can think of only one. When we get a virtual machine error, we have to repair the scoped value context in case it was being updated at the time an out of memory error was detected. To do this, we do a stack walk: see JVM_FindScopedValueBindings. If there is something incorrect here, it might lead to an invalid OOP in scopedValueBindings. Do we know if a virtual machine error ever occurs during this test?
26-06-2024

After 20k runs with intrinsics disabled, i.e. with -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_findScopedValueBindings,_scopedValueCache,_setScopedValueCache (I hope I got the intrinsic names right - not sure if the names are actually verified to exist somewhere?), there are still failures, so unless I got the above names wrong, intrinsics do not change the situation.

Now that I know where to look, most of the (G1) crashes have a broken scopedValueBindings field of the j.l.Thread in the register output of the hs_err files, e.g.

R23=0x00000000e74b05ed is pointing into object: java.lang.Thread
{0x00000000e74b0598} - klass: 'java/lang/Thread'
 - ---- fields (total size 15 words):
 - 'threadLocalRandomProbe' 'I' @12  0 (0x00000000)
 - private volatile 'eetop' 'J' @16  0 (0x0000000000000000)
 - private final 'tid' 'J' @24  22 (0x0000000000000016)
 - 'threadLocalRandomSeed' 'J' @32  0 (0x0000000000000000)
 - injected 'jvmti_thread_state' 'J' @40  0 (0x0000000000000000)
 - 'threadLocalRandomSecondarySeed' 'I' @48  80520126 (0x04cca3be)
 - injected 'jvmti_VTMS_transition_disable_count' 'I' @52  0 (0x00000000)
 - injected 'jfr_epoch' 'S' @56  0 (0x0000)
 - volatile 'interrupted' 'Z' @58  false (0x00)
 - injected 'jvmti_is_in_VTMS_transition' 'Z' @59  false (0x00)
 - private volatile 'name' 'Ljava/lang/String;' @60  "Thread-1"{0x00000000e74b0610} (0xe74b0610)
 - private volatile 'contextClassLoader' 'Ljava/lang/ClassLoader;' @64  a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000000e752acc8} (0xe752acc8)
 - private 'inheritedAccessControlContext' 'Ljava/security/AccessControlContext;' @68  null (0x00000000)
 - private final 'holder' 'Ljava/lang/Thread$FieldHolder;' @72  a 'java/lang/Thread$FieldHolder'{0x00000000e74b0640} (0xe74b0640)
 - 'threadLocals' 'Ljava/lang/ThreadLocal$ThreadLocalMap;' @76  null (0x00000000)
 - 'inheritableThreadLocals' 'Ljava/lang/ThreadLocal$ThreadLocalMap;' @80  null (0x00000000)
 - private 'scopedValueBindings' 'Ljava/lang/Object;' @84  [error occurred during error reporting (printing register info, attempt 2), id 0xb, SIGSEGV (0xb) at pc=0x0000ffffa4d72234]   <---------- !!
R24=0x000000bb0000008d is an unknown value

Just checking this j.l.Thread: it is also in survivor space (0x00000000e74b0598), i.e. most likely just copied too.
26-06-2024

I ran open/test/jdk/java/awt/font/TextLayout/FontLayoutStressTest.java with release and debug builds, and with JTREG_REPEAT_COUNT=<n>, but couldn't duplicate the issue. If someone can duplicate it, then we can easily replace the SV in HBShaper with a TL and see if it still duplicates, as shown in the sketch below; that would at least hint at whether it's a SV or FFM issue. Also, testing with the intrinsics for Thread.{findScopedValueBindings,scopedValueCache,setScopedValueCache} disabled might yield new information. ( Some of the error logs in the CI are a SEGV in java.awt.Font.getFont2D. There's no native frame or SV in these crashes. It's glyph layout so same area but not clear if this is the same thing or an unrelated issue )
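(A generic, self-contained illustration of the suggested SV-to-TL substitution; this is not an actual HBShaper patch, only the value-carrying mechanism is shown, and all names are illustrative:)

public class CarrierVariants {
    static final ScopedValue<String> SV = ScopedValue.newInstance(); // requires --enable-preview on the JDKs discussed here
    static final ThreadLocal<String> TL = new ThreadLocal<>();

    // Variant 1: value carried via a ScopedValue binding, as HBShaper does today.
    static void withScopedValue(Runnable callback) {
        ScopedValue.where(SV, "payload").run(callback);
    }

    // Variant 2: the suggested diagnostic substitution, carrying the value in a ThreadLocal.
    static void withThreadLocal(Runnable callback) {
        TL.set("payload");
        try {
            callback.run();
        } finally {
            TL.remove();
        }
    }

    public static void main(String[] args) {
        withScopedValue(() -> System.out.println("SV sees: " + SV.get()));
        withThreadLocal(() -> System.out.println("TL sees: " + TL.get()));
    }
}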
25-06-2024

Probably the first occurrence was JDK-8321379, right after JDK-8318364 was integrated, so updating the affected versions too.
25-06-2024

[~tschatzl] No, that doesn't ring a bell but maybe [~roland] has an idea.
25-06-2024

[~thartmann]: maybe, as [~aph] suggests (to me at least), this is an issue with code generation where wrong register contents are put into that member during stub generation involving scoped values - does something come to your mind here?
25-06-2024

> If someone can duplicate then it we can easily replace the SV in HBShaper to use a TL instead of a SV and see if it duplicates as that would at least hint as to whether it's a SV or FFM issue. Also testing with the instrinics for Thread.{findScopedValueBindings,scopedValueCache,setScopedValueCache} disabled might yield new information too.

As mentioned above, the crashes reproduce "easily" after enough runs. You need at least ~50k runs of the test to be somewhat confident bad/not bad though. I agree that it is useful to perform the suggested tests with TL and with intrinsics enabled/disabled. Another option is trying to bisect the changes using this test to find a responsible change, or even retry to reproduce without the FFM code.

> ( Some of the error logs in the CI are a SEGV in java.awt.Font.getFont2D. There's no native frame or SV in these crashes. It's glyph layout so same area but not clear if this is the same thing or an unrelated issue )

If the SV reference is bad, but not bad enough (i.e. pointing to an object start that has no reference by coincidence), GC will happily pass. At most Java code (but not stubs etc) will cause a ClassCastException or such - and we have such a failure (https://bugs.openjdk.org/browse/JDK-8331735?focusedId=14675454&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14675454).
25-06-2024

> Mach5 doesn't show any test failure in mainline after that. The failures only started to (re?)appear recently. Why was there a ~5 months interval w/o any failures?

The failures are very rare. I got 13 failures out of 243k runs of that test (i.e. one after every ~20k runs), all on different machines. Some of them are attempts at being more targeted, or a debug build (which never reproduced), so that number might need to be reduced; I think it's more like 1 in 5-10k runs or so if you just run linux-aarch64. I initially could not reproduce after 22k iterations for JDK-8321379 (which is not included in those 243k runs), so I closed it as CNR. Timing on otherwise loaded machines is different than running the same test on the same machine over and over, which might have an impact too.

While some of the mach5 failures were on various platforms (OSX-x64, linux-x64), I only ever got it to reproduce on linux-aarch64, and fairly early on I just stopped trying to reproduce anywhere but there. However, the failures were always on different machines. It is possible that that linux-x64 crash was pure luck (i.e. hardware failure) or due to some other cause, but it seems fairly unlikely that exactly this (and apparently only this) memory location was affected (other addresses in that j.l.Thread object are okay, and already successfully adjusted to the new locations). I could not find an indication that this particular machine has faulty hardware either, i.e. other tests failing randomly on that machine.

Not only this crash, with its clear indication that the value had been corrupted before GC, but also the 45k non-failing runs with FFM/Scoped Values not used (with that flag) give me a fairly good indication that it is an issue with scoped values.
25-06-2024

FWIW, FFM has only 1 Java heap-related interaction in the upcall stub (none in the downcall stub): It allocates a new JNI handle block, and sets that as active. The reference to the old handle block is saved in the upcall stub frame, which is then set back to active when we are returning from Java to native after the upcall. At that point, the newly allocated handle block is also freed. This mimics what is done for JNI, and should cover the scenario where a user does a JNI downcall, an FFM upcall, and then wants to continue doing multiple JNI downcalls (for which the new handle block is then used). The thread's handle block is only used to store local JNI handles that are created during a native JNI call, AFAIK. The old handle block is seen by the GC through frame::oops_do_internal -> UpcallStub::oops_do. The newly allocated handle block is handled by the GC through the JavaThread::_active_handles field.
25-06-2024

> The only thing I can think of is that java_lang_Thread::scopedValueBindings is not being scanned by the GC and so is stale. That would certainly lead to garbage in that struct. But it's just a plain old Java object, with no special properties, so it can't really be that.

The crash in that analyzed core file occurs exactly because the field is scanned by GC, pointing inside an object (also in eden, i.e. just allocated). As mentioned in that crash, the bad oop for scopedValueBindings is already in the original object; GC "properly" copies the object including the bad value.
24-06-2024

The only thing I can think of is that java_lang_Thread::scopedValueBindings is not being scanned by the GC and so is stale. That would certainly lead to garbage in that struct. But it's just a plain old Java object, with no special properties, so it can't really be that. With regard to strong references, every scoped value binding object is referenced by a frame (of kind Thread::runWith) on the stack. This makes the whole thing robust against stack overflow, etc.: you can just walk the stack to find the nearest enclosing bindings. I'd be seriously wondering about register corruption. But anyway, let me know if you want me to have a look.
24-06-2024

I hacked up a quick test to call a native function with a downcall handle, passing it the address of an upcall stub to invoke a callback. The native function is invoked with a ScopedValue binding for the current thread and the callback uses ScopedValue.get to read its value. I couldn't provoke the crash seen here. (At this time, scopedValueBindings is just a regular field on Thread. The cache (for lookup) is an OopHandle in JavaThread. There are C2 intrinsics for the cache access).
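(For reference, a minimal, self-contained sketch along those lines - not the actual test; to stay self-contained it calls the upcall stub directly through a downcall handle rather than handing the stub address to a native helper, and all names are illustrative:)

import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class ScopedValueUpcallSketch {
    static final ScopedValue<String> SV = ScopedValue.newInstance(); // preview API on the JDKs discussed here

    // Upcall target: reads the ScopedValue bound by the thread doing the downcall.
    static void callback() {
        if (!SV.isBound() || !SV.get().startsWith("bound-")) {
            throw new AssertionError("unexpected binding in upcall");
        }
    }

    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        FunctionDescriptor fd = FunctionDescriptor.ofVoid();
        MethodHandle target = MethodHandles.lookup().findStatic(
                ScopedValueUpcallSketch.class, "callback", MethodType.methodType(void.class));
        try (Arena arena = Arena.ofConfined()) {
            // The generated upcall stub is an ordinary native function pointer, so it can
            // be invoked through a downcall handle bound to its address.
            MemorySegment stub = linker.upcallStub(target, fd, arena);
            MethodHandle downcall = linker.downcallHandle(stub, fd);
            for (int i = 0; i < 1_000_000; i++) {
                String payload = "bound-" + i;
                ScopedValue.where(SV, payload).run(() -> {
                    try {
                        downcall.invokeExact();   // native call -> upcall -> callback()
                    } catch (Throwable t) {
                        throw new RuntimeException(t);
                    }
                });
            }
        }
    }
}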
24-06-2024

A potential workaround could be to disable the use of FFM in the java/awt/font layout API by making -Dsun.font.layout.ffm=false the default.
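For example (assuming the reproducer is the FontLayoutStressTest class seen in the crash stacks and is launched standalone), the workaround amounts to running with the property set explicitly:

    java -Dsun.font.layout.ffm=false FontLayoutStressTest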
24-06-2024

I am adding jhsdb output and gdb stack traces to this issue; any attempt at linking the not-yet-initialized scopedValueBindings member to a particular stack trace would just be guessing on my part. One obvious option is that a GC happens while the scoped values are copied into the java.lang.Thread data structure for whatever reason (and for some reason there is magic in java.lang.Thread that prevents it from being zero-initialized), but I do not know (e.g. thread 3 in jhsdb.out). So far nothing points to this being a GC issue (it crashes with all STW collectors at least, that particular crash shows that the bad reference cannot have been introduced by the GC, and it does not reproduce without FFM, which uses scoped values), so I am moving it to core-libs/java.lang, which is AFAICT responsible for scoped values.
24-06-2024

There has been an interesting failure on x64-linux where the test crashes in G1ParScanThreadState::trim_queue_to_threshold(). Some stack trace:

Thread #1:
[...]
#60 0x00007f7d31b023ee in VMError::report_and_die (thread=thread@entry=0x7f7ca800b310, sig=sig@entry=11, pc=pc@entry=0x7f7d313b7282 <G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+7074> "\213H\bH\211E\260\205\311\017\216p\030", siginfo=siginfo@entry=0x7f7cb8981070, context=context@entry=0x7f7cb8980f40) at open/src/hotspot/share/utilities/vmError.cpp:1604
#61 0x00007f7d3196d79b in JVM_handle_linux_signal (sig=11, info=0x7f7cb8981070, ucVoid=0x7f7cb8980f40, abort_if_unrecognized=1) at open/src/hotspot/os/posix/signals_posix.cpp:649
#62 <signal handler called>
#63 0x00007f7d313b7282 in Klass::layout_helper (this=<optimized out>) at open/src/hotspot/share/oops/klass.hpp:295
#64 oopDesc::size_given_klass (klass=<optimized out>, this=<optimized out>) at open/src/hotspot/share/oops/oop.inline.hpp:157
#65 G1ParScanThreadState::do_copy_to_survivor_space (old_mark=..., old=0xe0221218, region_attr=..., this=<optimized out>) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:463
#66 G1ParScanThreadState::do_oop_evac<narrowOop> (p=0xda9d7d14, this=0x7f7c44003030) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:218
#67 G1ParScanThreadState::dispatch_task (task=..., this=<optimized out>) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:296
#68 G1ParScanThreadState::trim_queue_to_threshold (this=this@entry=0x7f7c44003030, threshold=threshold@entry=0) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:317
#69 0x00007f7d313bb9ba in G1ParScanThreadState::trim_queue (this=0x7f7c44003030) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.inline.hpp:60
#70 G1ParScanThreadState::steal_and_trim_queue (this=this@entry=0x7f7c44003030, task_queues=<optimized out>) at open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:328
#71 0x00007f7d313ef6e5 in G1ParEvacuateFollowersClosure::do_void (this=this@entry=0x7f7cb8981d30) at open/src/hotspot/share/gc/g1/g1YoungCollector.cpp:547
#72 0x00007f7d313efe9f in G1EvacuateRegionsBaseTask::evacuate_live_objects (termination_phase=G1GCPhaseTimes::Termination, objcopy_phase=G1GCPhaseTimes::ObjCopy, worker_id=2, pss=0x7f7c44003030, this=0x7f7cfbffe0e0) at open/src/hotspot/share/gc/g1/g1YoungCollector.cpp:602

The address it crashes on is 0xda9d7d14 (frame 66).

$ (gdb) x/20x 0xda9d7d14
0xda9d7d14: 0xe0221218 0xda823a10 0x00000000 0x00000000
0xda9d7d24: 0x00000000 0x00000000 0x00000000 0x00000000
0xda9d7d34: 0x00000000 0x00000031 0x00000000 0x00166738
0xda9d7d44: 0x00000000 0x00000000 0xda9d7d50 0x00000031
0xda9d7d54: 0x00000000 0x00161570 0x00000009 0x65726854

I.e. it refers to 0xe0221218. This is an eden(!) region, i.e. the object has just been allocated, no GC occurred on it:

| 258|0x00000000e0200000, 0x00000000e0300000, 0x00000000e0300000|100%| E|CS|TAMS 0x00000000e0200000| PB 0x00000000e0200000| Complete | 0

"E" means eden region.

Looking at the stack slots:

stack at sp + 1 slots: 0x00000000da9d7cc0 is an oop: java.lang.Thread
{0x00000000da9d7cc0} - klass: 'java/lang/Thread'
 - ---- fields (total size 15 words):
 - 'threadLocalRandomProbe' 'I' @12 0 (0x00000000)
 - private volatile 'eetop' 'J' @16 0 (0x0000000000000000)
 - private final 'tid' 'J' @24 43 (0x000000000000002b)
 - 'threadLocalRandomSeed' 'J' @32 0 (0x0000000000000000)
 - injected 'jvmti_thread_state' 'J' @40 0 (0x0000000000000000)
 - 'threadLocalRandomSecondarySeed' 'I' @48 -1170677104 (0xba38e290)
 - injected 'jvmti_VTMS_transition_disable_count' 'I' @52 0 (0x00000000)
 - injected 'jfr_epoch' 'S' @56 0 (0x0000)
 - volatile 'interrupted' 'Z' @58 false (0x00)
 - injected 'jvmti_is_in_VTMS_transition' 'Z' @59 false (0x00)
 - private volatile 'name' 'Ljava/lang/String;' @60 "Thread-13"{0x00000000da9d7d38} (0xda9d7d38)
 - private volatile 'contextClassLoader' 'Ljava/lang/ClassLoader;' @64 a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000000daa08e48} (0xdaa08e48)
 - private 'inheritedAccessControlContext' 'Ljava/security/AccessControlContext;' @68 null (0x00000000)
 - private final 'holder' 'Ljava/lang/Thread$FieldHolder;' @72 a 'java/lang/Thread$FieldHolder'{0x00000000da9d7d70} (0xda9d7d70)
 - 'threadLocals' 'Ljava/lang/ThreadLocal$ThreadLocalMap;' @76 null (0x00000000)
 - 'inheritableThreadLocals' 'Ljava/lang/ThreadLocal$ThreadLocalMap;' @80 null (0x00000000)
 - private 'scopedValueBindings' 'Ljava/lang/Object;' @84

0x00000000da9d7cc0 + 84 = 0xda9d7d14, which means that the `scopedValueBindings` member of java.lang.Thread contains garbage, since the java.lang.Thread is in Eden, no GC can have occurred.

The java.lang.Thread instance is in Survivor space:

| 169|0x00000000da900000, 0x00000000daa00000, 0x00000000daa00000|100%| S| |TAMS 0x00000000da900000| PB 0x00000000da900000| Complete | 0

"S" means survivor region.

It has just been copied (in this GC) from 0xda9d7cc0 containing the same garbage value in the `scopedValueBindings` field.

(gdb) find /w 0x00000000da800000, 0x00000000e7400000, 0xba38e290   // looking for the `threadLocalRandomSecondarySeed` value in the j.l.Thread (at offset 48)
0xda9d7cf0
0xddccbab0

(gdb) x/30w 0xddccbab0-48   // original j.l.Thread in Eden, i.e.
| 220|0x00000000ddc00000, 0x00000000ddd00000, 0x00000000ddd00000|100%| E|CS|TAMS 0x00000000ddc00000| PB 0x00000000ddc00000| Complete | 0
0xddccba80: 0xda9d7cc3 0x00000000 0x0016db90 0x00000000
0xddccba90: 0x00000000 0x00000000 0x0000002b 0x00000000
0xddccbaa0: 0x00000000 0x00000000 0x00000000 0x00000000
0xddccbab0: 0xba38e290 0x00000000 0x00000000 0xddcccb40
0xddccbac0: 0xddc005f0 0x00000000 0xddcccbb8 0x00000000
0xddccbad0: 0x00000000 >0xe0221218 0xddcccec0 0x00000000
0xddccbae0: 0x00000000 0x00000000 0x00000000 0x00000000
0xddccbaf0: 0x00000000 0x00000000

(gdb) x/30w 0xda9d7cf0-48   // j.l.Thread GC crashed on
0xda9d7cc0: 0x00000031 0x00000000 0x0016db90 0x00000000
0xda9d7cd0: 0x00000000 0x00000000 0x0000002b 0x00000000
0xda9d7ce0: 0x00000000 0x00000000 0x00000000 0x00000000
0xda9d7cf0: 0xba38e290 0x00000000 0x00000000 0xda9d7d38
0xda9d7d00: 0xdaa08e48 0x00000000 0xda9d7d70 0x00000000
0xda9d7d10: 0x00000000 >0xe0221218 0xda823a10 0x00000000
0xda9d7d20: 0x00000000 0x00000000 0x00000000 0x00000000
0xda9d7d30: 0x00000000 0x00000000

Looking at 0xe0221218 and surroundings:

(gdb) x/80x 0xe0221218-32*8
0xe0221118: 0x00000055 0x00000001 0x00000042 0x0000004e
0xe0221128: 0x00000046 0x00000055 0x0000000d 0x00000001
0xe0221138: 0x0000000f 0x0000000f 0x0000000f 0x00000000
0xe0221148: >0x00000001< 0x00000000 0x00161170 0x00000040   // seems to be an object start
0xe0221158: 0x00000000 0x00000000 0x40e00000 0x00000000
0xe0221168: 0x41500000 0x00000000 0x41880000 0x00000000
0xe0221178: 0x41b00000 0x00000000 0x41f80000 0x00000000
0xe0221188: 0x42080000 0x00000000 0x42140000 0x00000000
0xe0221198: 0x422c0000 0x00000000 0x42400000 0x00000000
0xe02211a8: 0x42580000 0x00000000 0x427c0000 0x00000000
0xe02211b8: 0x42840000 0x00000000 0x42900000 0x00000000
0xe02211c8: 0x429c0000 0x00000000 0x42a20000 0x00000000
0xe02211d8: 0x42ae0000 0x00000000 0x42b60000 0x00000000
0xe02211e8: 0x42bc0000 0x00000000 0x42c60000 0x00000000
0xe02211f8: 0x42cc0000 0x00000000 0x42d20000 0x00000000
0xe0221208: 0x42d80000 0x00000000 0x42e20000 0x00000000
0xe0221218: >0x42f40000< 0x00000000 0x42fe0000 0x00000000   // the broken pointer pointing here

So the `scopedValueBindings` seems to point into the middle of some object, probably just garbage; it is interesting that all the addresses in that array increase by exactly 12kb (= 3 * 4kb).
24-06-2024

> > All Java objects on the Java heap directly accessed by native code must be registered with the GC
>
> The code here *using* FFM, does not do anything like that.

It did not look that way to me either when inspecting the code. My best _guess_, given the crashes (a bad reference in the handle area, the crash upcalling to Java, and a crash accessing the Font2D object), is that they may or may not indicate an issue with the scoped values, because they also contain one of those and most likely they are implemented using these per-thread handles (or not). I need to get more information on their actual implementation. The additional tests I did that increased the GC frequency by simply causing GCs in the upcalls did not increase the crash frequency either.

> It can't because FFM doesn't provide any way to do it.
> That's why scoped values, or bound vars in method handles need to be used when doing upcalls.
> You can see the native code in src/java.desktop/share/native/libfontmanager/HBShaper.c
> Not a Java object reference in sight.
> Now the 'upcall stubs' generated by Panama in HBShaper.java which are a MemorySegment
> there are somehow mapped to the native C function pointers that you'll see in HBShaper.c
> and then when called, somehow know which Java method to call but this is a basic part
> of the Panama functionality so I would be surprised if there's a problem there.

AFAIU, for every such upcall a C->Java stub (which reorders arguments, saves registers, ...) is generated, and you get a pointer to that one. The C code can then simply call it as a normal function (and probably a similar stub is called when returning from Java). That is all I understand so far; I do not know either, right now, where the bad references come from (also because I could not reproduce the issue with verification yet).

> > Do you have a CR for this "other test" available?
>
> I was referring to the (2nd level) linked bug
> https://bugs.openjdk.org/browse/JDK-8320253
> G1: SIGSEGV in G1ParScanThreadState::trim_queue_to_threshold

As mentioned above, garbage collectors tend to surface these issues frequently because of broken references from any cause, not necessarily (and nowadays actually most of the time not) because of changes in G1 (or any other garbage collector). It happens, but it is a minority of cases. A crash in G1ParScanThreadState::trim_queue_to_threshold(), or in the methods with the same purpose (following the object graph) in the other collectors, means nothing more than that there is a bad reference somewhere in the object graph. We advise people to run with verification (-XX:+VerifyBeforeGC / -XX:+VerifyAfterGC) first before reporting these issues on the GC component. Depending on when it crashes (before or after the GC pauses) one can fairly easily narrow down whether the cause is the GC or not. Unfortunately verification is expensive, so it changes the timing a lot, and it might just make the problem disappear (as in this case).

> The test is unrelated to any of this client code or FFM and was submitted a week or so earlier.
> The timing made me suppose something had changed in G1 to trigger this.

No. AFAICS that particular crash at the same location happened because of a machine with defective memory, as commented in that CR. Some references were garbage after some bit flips, and the collector came across them first.
21-06-2024

> All Java objects on the Java heap directly accessed by native code must be registered with the GC

The code here *using* FFM does not do anything like that. It can't, because FFM doesn't provide any way to do it. That's why scoped values, or bound vars in method handles, need to be used when doing upcalls. You can see the native code in src/java.desktop/share/native/libfontmanager/HBShaper.c; not a Java object reference in sight. Now, the 'upcall stubs' generated by Panama in HBShaper.java, which are a MemorySegment, are somehow mapped to the native C function pointers that you'll see in HBShaper.c, and then when called somehow know which Java method to call, but this is a basic part of the Panama functionality, so I would be surprised if there's a problem there.

> Do you have a CR for this "other test" available?

I was referring to the (2nd level) linked bug https://bugs.openjdk.org/browse/JDK-8320253 "G1: SIGSEGV in G1ParScanThreadState::trim_queue_to_threshold". The test is unrelated to any of this client code or FFM and was submitted a week or so earlier. The timing made me suppose something had changed in G1 to trigger this.
20-06-2024

This seems to be an issue with FFM: with -Dsun.font.layout.ffm=false the test does not fail after 45k iterations; the same configuration with FFM enabled fails three times in 15k iterations.
19-06-2024

Thanks [~prr] for this background. This helps a lot. Some initial comments:

> I can imagine that if GC has relocated some Java object because it
> has no knowledge that some code using FFM has a direct reference to the
> storage for the object, that bad things will happen, but I'm not sure where
> to look for the problem.

All Java objects on the Java heap directly accessed by native code must be registered with the GC (`CollectedHeap::pin_object()` called on it) and later unregistered (`CollectedHeap::unpin_object()`) to let the GC know that it needs to be careful here. [...]

> The crashes involving this test whilst few date from shortly after that was
> integrated - although I also note that the earliest crash in this related group
> is for another test and pre-dates this code being integrated.

Do you have a CR for this "other test" available? This points to something else being responsible for the crashes, which I would like to investigate.

> But I don't see how this client (as in client of FFM) code has anything
> directly to do with the VM oops and GCs.

Not directly, but if native code corrupts Java object metadata or references (by using an outdated reference to a Java object), the GC is typically the component of the VM that will come across these corrupted objects. It performs actions on every reference of "every" live object, so if there is a broken one, it will typically crash (which is actually better than the other options, silent corruption or passing garbage on to callers). G1ParScanThreadState::trim_queue_to_threshold() is one central method in the G1 collector that reads metadata of live objects and follows references. Just in case, I did start tests with the -Dsun.font.layout.ffm=false option to check whether FFM could be causing this.

> Generally this code makes a downcall, which can make multiple upcalls
> from the downcall, including one to store the results, but once the downcall is done,
> there's no FFM-related 'state' that's kept.
> I will throw in that this code also used Scoped Values because FFM has no support
> for Java objects. These are set up before the downcall and used in the upcalls.

If these scoped values are not used by native code, they should be fine.

> But most of the backtraces show crashes after use of these is already finished.

As above, the native code might overwrite some unrelated data with that outdated reference, causing random other data overwrites until the application or the GC comes across them.
18-06-2024

I can imagine that if the GC has relocated some Java object because it has no knowledge that some code using FFM has a direct reference to the storage for the object, bad things will happen, but I'm not sure where to look for the problem. The test itself is 5 years old, but there have been recent changes in the code under test:

(1) The 3rd-party native library was upgraded.
    Date: Tue Oct 31 19:01:15 2023 +0000
    8313643: Update HarfBuzz to 8.2.2

(2) We started to use FFM/Panama to access it.
    Date: Tue Nov 21 17:46:29 2023 +0000
    8318364: Add an FFM-based implementation of harfbuzz OpenType layout

I'm inclined to guess it is related to (2), not (1). The crashes involving this test, whilst few, date from shortly after that was integrated - although I also note that the earliest crash in this related group is for another test and pre-dates this code being integrated. But I don't see how this client (as in client of FFM) code has anything directly to do with the VM oops and GCs. Generally this code makes a downcall, which can make multiple upcalls from the downcall, including one to store the results, but once the downcall is done, there's no FFM-related 'state' that's kept. I will throw in that this code also uses Scoped Values, because FFM has no support for Java objects. These are set up before the downcall and used in the upcalls. But most of the backtraces show crashes after use of these is already finished.
17-06-2024

--------------- T H R E A D --------------- Current thread (0x0000ffff28272c50): JavaThread "Thread-127" [_thread_in_Java, id=2584393, stack(0x0000fffe57a24000,0x0000fffe57c22000) (2040K)] Stack: [0x0000fffe57a24000,0x0000fffe57c22000], sp=0x0000fffe57c1f720, free space=2029k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) J 1618 c2 java.lang.invoke.LambdaForm$MH+0x00000f80010b9000.invoke(Ljava/lang/Object;JJIJ)I java.base@24-internal (46 bytes) @ 0x0000ffff703876f0 [0x0000ffff70387640+0x00000000000000b0] J 1607 c2 java.lang.invoke.LambdaForm$MH+0x00000f8001096400.invoke(Ljava/lang/Object;JJIJ)I java.base@24-internal (50 bytes) @ 0x0000ffff7038c8a0 [0x0000ffff7038c800+0x00000000000000a0] v blob 0x0000ffff6feafac8 C [libfontmanager.so+0x4e040] hb_font_get_glyph_h_advances_default(hb_font_t*, void*, unsigned int, unsigned int const*, unsigned int, int*, unsigned int, void*)+0x70 (hb-font.hh:308) C [libfontmanager.so+0xe06d4] _hb_ot_shape+0x15a0 (hb-font.hh:326) C [libfontmanager.so+0x1151c4] hb_shape_plan_execute+0x84 (hb-shaper-list.hh:47) C [libfontmanager.so+0x1156cc] hb_shape_full+0x78 (hb-shape.cc:148) C [libfontmanager.so+0x8f20] jdk_hb_shape+0x1a0 (HBShaper_Panama.c:130) v ~RuntimeStub::nep_invoker_blob 0x0000ffff6feb1df8 J 1732 c1 jdk.internal.foreign.abi.DowncallStub+0x00000f800108bc00.invoke(Ljava/lang/foreign/SegmentAllocator;Ljava/lang/foreign/MemorySegment;FLjava/lang/foreign/MemorySegment;Ljava/lang/foreign/MemorySegment;Ljava/lang/foreign/MemorySegment;IIIIIFFIILjava/lang/foreign/MemorySegment;Ljava/lang/foreign/MemorySegment;)V java.base@24-internal (514 bytes) @ 0x0000ffff68a29344 [0x0000ffff68a28c00+0x0000000000000744] J 1848 c2 sun.font.HBShaper.shape(Lsun/font/Font2D;Lsun/font/FontStrike;F[FLjava/lang/foreign/MemorySegment;[CLsun/font/GlyphLayout$GVData;IIIILjava/awt/geom/Point2D$Float;II)V java.desktop@24-internal (52 bytes) @ 0x0000ffff703a3264 [0x0000ffff703a2980+0x00000000000008e4] J 1820 c2 sun.font.GlyphLayout$EngineRecord.layout()V java.desktop@24-internal (108 bytes) @ 0x0000ffff703a9728 [0x0000ffff703a9180+0x00000000000005a8] J 1641 c1 sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector; java.desktop@24-internal (683 bytes) @ 0x0000ffff689fe878 [0x0000ffff689fcd40+0x0000000000001b38] J 1625 c1 java.awt.Font.layoutGlyphVector(Ljava/awt/font/FontRenderContext;[CIII)Ljava/awt/font/GlyphVector; java.desktop@24-internal (32 bytes) @ 0x0000ffff68a1e9b8 [0x0000ffff68a1e8c0+0x00000000000000f8] J 1715 c1 FontLayoutStressTest.doLayout()D (31 bytes) @ 0x0000ffff68a2bc54 [0x0000ffff68a2bbc0+0x0000000000000094] J 1908% c1 FontLayoutStressTest.lambda$main$0(Ljava/util/concurrent/CyclicBarrier;DLjava/util/concurrent/atomic/AtomicReference;)V (60 bytes) @ 0x0000ffff68a41de0 [0x0000ffff68a41cc0+0x0000000000000120] j FontLayoutStressTest$$Lambda+0x00000f80010018d0.run()V+12 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@24-internal j java.lang.Thread.run()V+19 java.base@24-internal v ~StubRoutines::call_stub 0x0000ffff6fd6e114 V [libjvm.so+0x7e3db8] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x218 (javaCalls.cpp:415) V [libjvm.so+0x7e5344] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x184 (javaCalls.cpp:329) V [libjvm.so+0x8ace7c] thread_entry(JavaThread*, JavaThread*)+0x8c (jvm.cpp:2937) V [libjvm.so+0x7fb7b4] JavaThread::thread_main_inner() 
[clone .part.0]+0xa0 (javaThread.cpp:759) V [libjvm.so+0xcfa848] Thread::call_run()+0xa8 (thread.cpp:225) V [libjvm.so+0xb7f340] thread_native_entry(Thread*)+0xdc (os_linux.cpp:849) C [libpthread.so.0+0x7950] start_thread+0x190 siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x00000000f8facbe0 Crash in panama related code 0x00000000f8facbe0 is an address in unallocated java heap.
17-06-2024

Another interesting failure: Oop handle area contains references to stale objects? Someone holding oops across GCs?

Current thread (0x0000ffff54007d20): WorkerThread "GC Thread#1" [id=2390714, stack(0x0000fffef3606000,0x0000fffef3804000) (2040K)]
Stack: [0x0000fffef3606000,0x0000fffef3804000], sp=0x0000fffef38023e0, free space=2032k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x70ad84] G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x2164 (klass.hpp:295)
V [libjvm.so+0x728da0] G1ParCopyClosure<(G1Barrier)0, false>::do_oop(oopDesc**)+0x90 (g1ParScanThreadState.inline.hpp:53)
V [libjvm.so+0x77eef8] HandleArea::oops_do(OopClosure*)+0x48 (handles.cpp:109)
V [libjvm.so+0x8004e4] JavaThread::oops_do_no_frames(OopClosure*, NMethodClosure*)+0x24 (javaThread.cpp:1379)
V [libjvm.so+0xcfa344] Thread::oops_do(OopClosure*, NMethodClosure*)+0xa4 (thread.cpp:439)
V [libjvm.so+0xd06370] Threads::possibly_parallel_oops_do(bool, OopClosure*, NMethodClosure*)+0x10c (threads.cpp:1154)
V [libjvm.so+0x72b7b0] G1RootProcessor::process_java_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x80 (g1RootProcessor.cpp:180)
V [libjvm.so+0x72b8b4] G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x64 (g1RootProcessor.cpp:61)
V [libjvm.so+0x73d704] G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x24 (g1YoungCollector.cpp:664)
V [libjvm.so+0x73d904] G1EvacuateRegionsBaseTask::work(unsigned int)+0x84 (g1YoungCollector.cpp:651)
V [libjvm.so+0xda60f8] WorkerThread::run()+0x98 (workerThread.cpp:70)
17-06-2024

Crashes do not seem to be limited to linux-aarch64.
17-06-2024

Resetting the bug to give GC triage a chance to give this the proper triage treatment.
13-06-2024

Crash in the same place:

# V [libjvm.so+0x708044] G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x2164

Odds now are on a GC bug. Will reassign.
13-06-2024

Here is another variation: # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x0000ffff89b089f0, pid=2690501, tid=2690590 # # JRE version: Java(TM) SE Runtime Environment (23.0+26) (build 23-ea+26-2183) # Java VM: Java HotSpot(TM) 64-Bit Server VM (23-ea+26-2183, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) # Problematic frame: # V [libjvm.so+0x7089f0] G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x3b70 --------------- T H R E A D --------------- Current thread (0x0000ffff48008860): WorkerThread "GC Thread#2" [id=2690590, stack(0x0000ffff34519000,0x0000ffff34717000) (2040K)] Stack: [0x0000ffff34519000,0x0000ffff34717000], sp=0x0000ffff34715520, free space=2033k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x7089f0] G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x3b70 (oop.inline.hpp:196) V [libjvm.so+0x70acf0] G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+0x310 (g1ParScanThreadState.inline.hpp:60) V [libjvm.so+0x73b984] G1ParEvacuateFollowersClosure::do_void()+0x94 (g1YoungCollector.cpp:577) V [libjvm.so+0x73bf64] G1EvacuateRegionsTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int)+0x74 (g1YoungCollector.cpp:602) V [libjvm.so+0x739b7c] G1EvacuateRegionsBaseTask::work(unsigned int)+0x9c (g1YoungCollector.cpp:652) V [libjvm.so+0xd9b6f8] WorkerThread::run()+0x98 (workerThread.cpp:70) V [libjvm.so+0xcf51c8] Thread::call_run()+0xa8 (thread.cpp:225) V [libjvm.so+0xb79b40] thread_native_entry(Thread*)+0xdc (os_linux.cpp:846) C [libpthread.so.0+0x7950] start_thread+0x190 This all suggests a memory stomp to me.
04-06-2024

Seen once more on OL 8, AARCH64. The trace this time is different but we still look to be in VM code. But the runnable in which it occurs does use FFM and ScopedValues, which are new enough both in themselves that there might be a small problem there somewhere. It looks like the VM is trying to generate ClassCastException but I don't see what in the Java code being executed might cause this. # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x0000ffffbbff5128, pid=1211942, tid=1212009 # # JRE version: Java(TM) SE Runtime Environment (23.0+23) (build 23-ea+23-1918) # Java VM: Java HotSpot(TM) 64-Bit Server VM (23-ea+23-1918, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64) # Problematic frame: # V [libjvm.so+0xcb6128] Symbol::as_klass_external_name() const+0x18 # .... ... -------------- T H R E A D --------------- Current thread (0x0000ffff4c1dfee0): JavaThread "Thread-9" [_thread_in_vm, id=1212009, stack(0x0000ffff33a04000,0x0000ffff33c02000) (2040K)] Stack: [0x0000ffff33a04000,0x0000ffff33c02000], sp=0x0000ffff33bff290, free space=2028k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xcb6128] Symbol::as_klass_external_name() const+0x18 (symbol.hpp:139) V [libjvm.so+0xc2ea0c] SharedRuntime::generate_class_cast_message(Klass*, Klass*, Symbol*)+0x2c (sharedRuntime.cpp:1843) V [libjvm.so+0xc31210] SharedRuntime::generate_class_cast_message(JavaThread*, Klass*)+0xd0 (sharedRuntime.cpp:1835) V [libjvm.so+0x4a8204] Runtime1::throw_class_cast_exception(JavaThread*, oopDesc*)+0x70 (c1_Runtime1.cpp:735) v ~RuntimeStub::throw_class_cast_exception Runtime1 stub 0x0000ffffa3f35834 J 1586 c1 java.lang.ScopedValue.scopedValueBindings()Ljava/lang/ScopedValue$Snapshot; java.base@23-ea (65 bytes) @ 0x0000ffff9cad5bdc [0x0000ffff9cad5a00+0x00000000000001dc] j java.lang.ScopedValue$Carrier.run(Ljava/lang/Runnable;)V+12 java.base@23-ea j sun.font.HBShaper.shape(Lsun/font/Font2D;Lsun/font/FontStrike;F[FLjava/lang/foreign/MemorySegment;[CLsun/font/GlyphLayout$GVData;IIIILjava/awt/geom/Point2D$Float;II)V+48 java.desktop@23-ea j sun.font.SunLayoutEngine.layout(Lsun/font/FontStrikeDesc;[FFIILsun/font/TextRecord;ILjava/awt/geom/Point2D$Float;Lsun/font/GlyphLayout$GVData;)V+75 java.desktop@23-ea j sun.font.GlyphLayout$EngineRecord.layout()V+102 java.desktop@23-ea J 1713 c1 sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector; java.desktop@23-ea (683 bytes) @ 0x0000ffff9cadaa08 [0x0000ffff9cad8ec0+0x0000000000001b48] j java.awt.Font.layoutGlyphVector(Ljava/awt/font/FontRenderContext;[CIII)Ljava/awt/font/GlyphVector;+19 java.desktop@23-ea j FontLayoutStressTest.doLayout()D+15 j FontLayoutStressTest.lambda$main$0(Ljava/util/concurrent/CyclicBarrier;DLjava/util/concurrent/atomic/AtomicReference;)V+23 j FontLayoutStressTest$$Lambda+0x00007f80010018d0.run()V+12 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23-ea j java.lang.Thread.run()V+19 java.base@23-ea v ~StubRoutines::call_stub 0x0000ffffa3e1e114 V [libjvm.so+0x7e0dd8] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x218 (javaCalls.cpp:415) V [libjvm.so+0x7e2364] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x184 (javaCalls.cpp:329) V [libjvm.so+0x8a9f5c] thread_entry(JavaThread*, JavaThread*)+0x8c (jvm.cpp:2937) V [libjvm.so+0x7f8874] 
JavaThread::thread_main_inner() [clone .part.0]+0xa0 (javaThread.cpp:759) V [libjvm.so+0xcf4ac8] Thread::call_run()+0xa8 (thread.cpp:225) V [libjvm.so+0xb78520] thread_native_entry(Thread*)+0xdc (os_linux.cpp:846) C [libpthread.so.0+0x7950] start_thread+0x190 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::throw_class_cast_exception Runtime1 stub 0x0000ffffa3f35834 J 1586 c1 java.lang.ScopedValue.scopedValueBindings()Ljava/lang/ScopedValue$Snapshot; java.base@23-ea (65 bytes) @ 0x0000ffff9cad5bdc [0x0000ffff9cad5a00+0x00000000000001dc] j java.lang.ScopedValue$Carrier.run(Ljava/lang/Runnable;)V+12 java.base@23-ea j sun.font.HBShaper.shape(Lsun/font/Font2D;Lsun/font/FontStrike;F[FLjava/lang/foreign/MemorySegment;[CLsun/font/GlyphLayout$GVData;IIIILjava/awt/geom/Point2D$Float;II)V+48 java.desktop@23-ea j sun.font.SunLayoutEngine.layout(Lsun/font/FontStrikeDesc;[FFIILsun/font/TextRecord;ILjava/awt/geom/Point2D$Float;Lsun/font/GlyphLayout$GVData;)V+75 java.desktop@23-ea j sun.font.GlyphLayout$EngineRecord.layout()V+102 java.desktop@23-ea J 1713 c1 sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector; java.desktop@23-ea (683 bytes) @ 0x0000ffff9cadaa08 [0x0000ffff9cad8ec0+0x0000000000001b48] j java.awt.Font.layoutGlyphVector(Ljava/awt/font/FontRenderContext;[CIII)Ljava/awt/font/GlyphVector;+19 java.desktop@23-ea j FontLayoutStressTest.doLayout()D+15 j FontLayoutStressTest.lambda$main$0(Ljava/util/concurrent/CyclicBarrier;DLjava/util/concurrent/atomic/AtomicReference;)V+23 j FontLayoutStressTest$$Lambda+0x00007f80010018d0.run()V+12 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23-ea j java.lang.Thread.run()V+19 java.base@23-ea v ~StubRoutines::call_stub 0x0000ffffa3e1e114
22-05-2024

Ok, this is an indication of a JDK 23 regression, so keeping this targeted to be fixed in 23.
06-05-2024

[~vdyakov]

> Does it affect JDK 22? 21? 17?

The test failure on CI-jdk23 is new. This test failed in the past, but for other issues, not the same as the recent issue observed on CI-jdk23.
06-05-2024

This is very odd. The whole call stack is Java code, and we've not changed anything in this area in some time. This could be a hotspot bug. --------------- T H R E A D --------------- Current thread (0x0000ffff1022a810): JavaThread "Thread-4" [_thread_in_Java, id=1499236, stack(0x0000ffff0f805000,0x0000ffff0fa03000) (2040K)] Stack: [0x0000ffff0f805000,0x0000ffff0fa03000], sp=0x0000ffff0fa00cb0, free space=2031k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) j java.awt.Font.getFont2D()Lsun/font/Font2D;+0 java.desktop@23-ea j java.awt.Font$FontAccessImpl.getFont2D(Ljava/awt/Font;)Lsun/font/Font2D;+1 java.desktop@23-ea j sun.font.FontUtilities.getFont2D(Ljava/awt/Font;)Lsun/font/Font2D;+4 java.desktop@23-ea j sun.font.StandardGlyphVector.initFontData()V+5 java.desktop@23-ea j sun.font.StandardGlyphVector.initGlyphVector(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[I[F[II)V+39 java.desktop@23-ea j sun.font.StandardGlyphVector.<init>(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[I[F[II)V+14 java.desktop@23-ea j sun.font.GlyphLayout$GVData.createGlyphVector(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;Lsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector;+274 java.desktop@23-ea j sun.font.GlyphLayout.layout(Ljava/awt/Font;Ljava/awt/font/FontRenderContext;[CIIILsun/font/StandardGlyphVector;)Lsun/font/StandardGlyphVector;+675 java.desktop@23-ea j java.awt.Font.layoutGlyphVector(Ljava/awt/font/FontRenderContext;[CIII)Ljava/awt/font/GlyphVector;+19 java.desktop@23-ea j FontLayoutStressTest.doLayout()D+15 j FontLayoutStressTest.lambda$main$0(Ljava/util/concurrent/CyclicBarrier;DLjava/util/concurrent/atomic/AtomicReference;)V+23 j FontLayoutStressTest$$Lambda+0x000003fe010018d0.run()V+12 j java.lang.Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V+5 java.base@23-ea j java.lang.Thread.run()V+19 java.base@23-ea v ~StubRoutines::call_stub 0x0000ffff67e4f114 V [libjvm.so+0x7e0618] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x218 (javaCalls.cpp:415) V [libjvm.so+0x7e1ba4] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x184 (javaCalls.cpp:329) V [libjvm.so+0x8a988c] thread_entry(JavaThread*, JavaThread*)+0x8c (jvm.cpp:2937) V [libjvm.so+0x7f80b4] JavaThread::thread_main_inner() [clone .part.0]+0xa0 (javaThread.cpp:761) V [libjvm.so+0xcf4b18] Thread::call_run()+0xa8 (thread.cpp:221) V [libjvm.so+0xb78290] thread_native_entry(Thread*)+0xdc (os_linux.cpp:846) C [libpthread.so.0+0x7950] start_thread+0x190 siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000001052d98
06-05-2024

Does it affect JDK 22? 21? 17?
06-05-2024