JDK-8340586 : JdkJfrEvent::get_all_klasses stores non-strong oops in JNI handles
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 23,24
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2024-09-23
  • Updated: 2024-11-19
  • Resolved: 2024-11-07
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 24
24 b24Fixed
Related Reports
Relates :  
Relates :  
Description
Test jdk/jfr/api/metadata/annotations/TestStackFilter.java with '-Xcomp' hit assertion:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S151444/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/37caeecc-de78-4839-81e0-361230c33f9b/runs/f2a054ce-ef19-45e0-893f-d926a879fbc8/workspace/open/src/hotspot/share/oops/compressedKlass.inline.hpp:46), pid=726126, tid=726763
#  assert(check_alignment(result)) failed: address not aligned: 0x00007fe594afbabe
#
# JRE version: Java(TM) SE Runtime Environment (24.0) (fastdebug build 24-internal-2024-09-22-1906186.leonid.mesnik.jdk-jfr)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 24-internal-2024-09-22-1906186.leonid.mesnik.jdk-jfr, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x14b4621]  oopDesc::print_value_on(outputStream*) const+0x311
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/a4a7850a-7c35-410a-b879-d77fbb2f6087-S15432/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/ab26347e-2427-402c-841b-f3b4370f4e48/runs/ecf46603-5cbd-48d8-88da-0f182398117e/testoutput/test-support/jtreg_open_test_jdk_jdk_jfr/scratch/0/core.726126)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
.....
Current thread (0x00007fe4ac0217d0):  WorkerThread "GC Thread#7"    [id=726763, stack(0x00007fe5054f6000,0x00007fe5055f6000) (1024K)]

Stack: [0x00007fe5054f6000,0x00007fe5055f6000],  sp=0x00007fe5055f4780,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x14b4621]  oopDesc::print_value_on(outputStream*) const+0x311  (compressedKlass.inline.hpp:46)
V  [libjvm.so+0x14896bf]  ObjArrayKlass::oop_print_on(oop, outputStream*)+0x1af  (objArrayKlass.cpp:390)
V  [libjvm.so+0x14b3f0e]  oopDesc::print_on(outputStream*) const+0x6e  (oop.cpp:47)
V  [libjvm.so+0xceea86]  G1VerifyLiveAndRemSetClosure::Checker<narrowOop>::print_containing_obj(outputStream*, G1HeapRegion*)+0xc6  (g1HeapRegion.cpp:506)
V  [libjvm.so+0xcef375]  G1VerifyLiveAndRemSetClosure::LiveChecker<narrowOop>::report_error()+0x125  (g1HeapRegion.cpp:572)
V  [libjvm.so+0xcefdb4]  void G1VerifyLiveAndRemSetClosure::do_oop_work<narrowOop>(narrowOop*)+0x4b4  (g1HeapRegion.cpp:650)
V  [libjvm.so+0xcf00df]  void OopOopIterateDispatch<G1VerifyLiveAndRemSetClosure>::Table::oop_oop_iterate<ObjArrayKlass, narrowOop>(G1VerifyLiveAndRemSetClosure*, oop, Klass*)+0x14f  (g1HeapRegion.cpp:667)
V  [libjvm.so+0xced02a]  G1HeapRegion::verify_liveness_and_remset(VerifyOption) const+0x2fa  (iterator.inline.hpp:295)
V  [libjvm.so+0xced433]  G1HeapRegion::verify(VerifyOption) const+0x33  (g1HeapRegion.cpp:708)
V  [libjvm.so+0xd04e25]  VerifyRegionClosure::do_heap_region(G1HeapRegion*)+0x125  (g1HeapVerifier.cpp:268)
V  [libjvm.so+0xcf6b9f]  G1HeapRegionManager::par_iterate(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int) const+0x6f  (g1HeapRegionManager.cpp:577)
V  [libjvm.so+0xd030c8]  G1VerifyTask::work(unsigned int)+0x38  (g1HeapVerifier.cpp:309)
V  [libjvm.so+0x1946950]  WorkerThread::run()+0x80  (workerThread.cpp:70)
V  [libjvm.so+0x17fc8c6]  Thread::call_run()+0xb6  (thread.cpp:225)
V  [libjvm.so+0x14e5da7]  thread_native_entry(Thread*)+0x127  (os_linux.cpp:858)
Registers:
RAX=0x00007fe5363fb000, RBX=0x00007fe53577cbb6, RCX=0x00007fe594afbabe, RDX=0x0000000000000001
RSP=0x00007fe5055f4780, RBP=0x00007fe5055f47c0, RSI=0x00007fe535770704, RDI=0x00000000d498c078
R8 =0x00007fe594afbabe, R9 =0x00007fe5357701b8, R10=0x00007fe5357701b0, R11=0x0000000000000000
R12=0x00000000d498c078, R13=0x00007fe5356f89c8, R14=0x00007fe5055f4960, R15=0x0000000000000015
RIP=0x00007fe534ae2621, EFLAGS=0x0000000000010206, CSGSFS=0x002b000000000033, ERR=0x0000000000000006
  TRAPNO=0x000000000000000e


Register to memory mapping:

RAX=0x00007fe5363fb000 in mmap'd memory region [0x00007fe5363fb000 - 0x00007fe5363fc000], tag mtInternal
RBX=0x00007fe53577cbb6: <offset 0x000000000214ebb6> in /opt/mach5/mesos/work_dir/jib-master/install/2024-09-22-1906186.leonid.mesnik.jdk-jfr/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so at 0x00007fe53362e000
RCX=0x00007fe594afbabe is an unknown value
RDX=0x0000000000000001 is an unknown value
RSP=0x00007fe5055f4780 in mmap'd memory region [0x00007fe5054f6000 - 0x00007fe5055f6000], tag mtThreadStack
RBP=0x00007fe5055f47c0 in mmap'd memory region [0x00007fe5054f6000 - 0x00007fe5055f6000], tag mtThreadStack
RSI=0x00007fe535770704: <offset 0x0000000002142704> in /opt/mach5/mesos/work_dir/jib-master/install/2024-09-22-1906186.leonid.mesnik.jdk-jfr/linux-x64-debug.jdk/jdk-24/fastdebug/lib/server/libjvm.so at 0x00007fe53362e000
RDI=0x00000000d498c078 is pointing into object: [Ljdk.internal.vm.FillerElement; 
{0x00000000d4966558} - klass: {type array int} - flags: is_cloneable_fast 
Comments
Thanks for the clarification. Okay, it's an OopHandle, but it is different from "normal ones" because it is not placed in OopStorage; instead, it is associated with the CLD. A practical example of this difference would be to compare the OopHandle for the _threadObj (backed in OopStorage) in a JavaThread, to this _java_mirror OopHandle for a Klass, which is supported instead by CLD metaspace?
08-11-2024

> How is the field of the Klass _java_mirror of type OopHandle classified as a "CLD handle"? Take a look at how it is created: void Klass::set_java_mirror(Handle m) { assert(!m.is_null(), "New mirror should never be null."); assert(_java_mirror.is_empty(), "should only be used to initialize mirror"); _java_mirror = class_loader_data()->add_handle(m); } "Normal" OopHandles are associated with an OopStorage, but the once in the CLDs are not.
08-11-2024

"Not sure what this means. A resolve of a WeakHandle always means add the referent to the snapshot (make it strongly reachable). And a peek would not do anything here w.r.t. keeping anything alive / strongly reachable." It means that it is counter-intuitive to have something classified as Weak, to carry semantics that you essentially "restore", "resurrect" or make an object reference "alive" by accessing it. Its the whole purpose of 'Weak' classification, in general, that you don't / can't.
08-11-2024

How is the field of the Klass _java_mirror of type OopHandle classified as a "CLD handle"? Klass: // java/lang/Class instance mirroring this class OopHandle _java_mirror; ClassLoaderData: WeakHandle _holder; // The oop that determines lifetime of this class loader OopHandle _class_loader; // The instance of java/lang/ClassLoader associated with // this ClassLoaderData
08-11-2024

[jdk23u-fix-request] Approval Request from Axel Boldt-Christmas Clean backport. This bug is more likely to occur after JDK-8326820, which is backported to JDK 23. This P2 should be backported as well.
07-11-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk23u/pull/218 Date: 2024-11-07 10:55:05 +0000
07-11-2024

> You talk about "loading CLD Handle oops," which I don't see this code doing anywhere, not even in the ClassHierarchyIterator. > We are loading the InstanceKlass::_java_mirror, which is distinct from the CLD holder oop (a WeakHandle). > "... but that the load from the CLD handle is loaded without any guarantee that the holder is strongly reachable." What load is this? `Klass::_java_mirror` is a CLD Handle. > And do you only need to do this if you intend to write the java_mirror into the object graph? A quick version of SATB (Snapshot-at-the-beginning). At some point you start the snapshot by capturing all the roots. We then trace through the object graph from these roots marking all objects as part of the snapshot. For GC that do this tracing concurrently, we allow the mutators (java threads) to load any object from the graph without marking as we know these are part of the snapshot and the GC will trace through them eventually. However if a mutator wants to modify the object graph, by performing a store, we capture this in our store barriers, and make sure that the object that is being removed is marked as part of the snapshot. We do not however care about what object is being added, because we assume that this object is either newly allocated, and thus implicitly marked (more technically newly allocated objects are not marked as part of the snapshot, but the GC will not consider these for collection until the next GC cycle), or the object was part of the snapshot, as the mutator must have loaded it by traversing the object graph through some root. And when it comes to non-strong loads, as the non-strong edges in the graph are not part of the snapshot, any such load must extend the snapshot by marking the referent. However inside the VM we can break this by retrieving oops that were not accessed by traversing the object graph through a root, or through a peak of a non-strong reference. In this specific example we walk the ClassHierarchy, and load the java mirror. CLDs are part of the snapshot through objects Klass* which all has an associated CLD. But the Klass* you get from ClassHierarchy may not be part of the snapshot, so we are effectively peeking at them. But to uphold the SATB invariants when performing the store of the java mirror (in this case in a root, but could have been deeper in the object graph) we must make sure that this object is part of the snapshot. For Klass* it is enough to do a load of the holder, which is a WeakHandle (PHANTOM_REFERENCE) and as such will extend the snapshot with the holder object, and behind this holder object the Klass* and the java mirror will be strongly reachable. > Are you saying that to load the _java_mirror from an InstanceKlass, you first need to perform a load of the InstanceKlass's ClassLoaderData _holder oop to ensure that the _java_mirror of the InstanceKlass is not unloaded? It is not really about not being unloaded. In the general case, just loading the _holder will not stop the Klass from being unloaded at the next safepoint, however it means that the _holder is currently and until the next safepoint strongly reachable. So it is safe to store it in the object graph. Here we also want this object to stay alive across a safepoint, so we store it in a JNI local, which is a root. > It seems unintuitive that to keep something alive, you need to load it using a WeakHandle; it's like reverse WeakHandle semantics. Not sure what this means. A resolve of a WeakHandle always means add the referent to the snapshot (make it strongly reachable). And a peek would not do anything here w.r.t. keeping anything alive / strongly reachable. I think maybe that the term "keep_alive" is a bad one, and we should have used something along the lines of "ensure_strongly_reachable" (until next safepoint)
07-11-2024

Changeset: 97b681e9 Branch: master Author: Axel Boldt-Christmas <aboldtch@openjdk.org> Date: 2024-11-07 06:28:02 +0000 URL: https://git.openjdk.org/jdk/commit/97b681e93a9469d8d16122dc10bbf2f5b5fe2266
07-11-2024

There are details here that might only be immediate to GC developers. You talk about "loading CLD Handle oops," which I don't see this code doing anywhere, not even in the ClassHierarchyIterator. We are loading the InstanceKlass::_java_mirror, which is distinct from the CLD holder oop (a WeakHandle). "... but that the load from the CLD handle is loaded without any guarantee that the holder is strongly reachable." What load is this? Are you saying that to load the _java_mirror from an InstanceKlass, you first need to perform a load of the InstanceKlass's ClassLoaderData _holder oop to ensure that the _java_mirror of the InstanceKlass is not unloaded? And do you only need to do this if you intend to write the java_mirror into the object graph? It seems unintuitive that to keep something alive, you need to load it using a WeakHandle; it's like reverse WeakHandle semantics.
05-11-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/21893 Date: 2024-11-05 06:34:02 +0000
05-11-2024

> What is this construct: "Universe::heap()->keep_alive(java_mirror); " It tells the GC that a peek / no_keepalive loaded object must be kept alive. As it was loaded without marking barriers. > and why has it become necessary now after years of this code working? Isn't the thread's local jni handle's GC roots? The handles are roots. But for roots to work, the oops that are stored must be strongly reachable. Our SATB barriers only care about mutations to the object graphs, so only stores do any meaningful work w.r.t. marking. Our loads are always effectivly no_keepalive. We rely on the fact that to load an object you must have reached it through a root, and thus it is strong. This is not the case for CLD handle oops, those loads are only as strong as their holder. The issue here is not that the jni handle is not handled correctly, but that the load from CLD handle is loaded without any guarantee that the holder is strongly reachable. CLD handle oops has worked this way for as long as I have worked with the codebase, and when I have spoken with [~eosterlund] he has said that the last time there was different semantics here was with CMS. This test is only about a year old, and (before JDK-8326820) JFR checkpoints would keep all CLDs alive. (Which caused issues, as if the JFR checkpoints interval was on the same or shorter order of magnitude than the GC marking time, it would effectively disable classunloading.) So when the test was created its implementation also made it so that the holders were always kept alive when writing to JNI handles, because of the checkpoints. A side note, the reason we have peak / no_keepalive loads on the CLD handles is because of single generational ZGC (which is now removed). As it does not use SATB, but have mark barriers in its loads. So for single generational ZGC a load does in keep an oop alive, so we needed the no_keepalive. Also if we ever start using WeakHandles and oops storage insteasd of the CLD handles, we again get that load barriers will mark, and peeks are necessary for code which does not want to keep objects alive.
04-11-2024

What is this construct: "Universe::heap()->keep_alive(java_mirror); " and why has it become necessary now after years of this code working? Isn't the thread's local jni handle's GC roots?
21-10-2024

JDK-8326820 was backported to JDK 23 b30 and JDK 23.0.1 b02, so I'm adding the corresponding affects versions although I did not try to reproduce it with these versions yet.
16-10-2024

I believe this is a preexisting bug which JDK-8326820 simply brought to light. While there still are some events in JFR I want to double check w.r.t. JDK-8326820, I am fairly confident that this is not caused by JDK-8326820. The problem here seems to be the same as JDK-8339725, and is exemplified by this comment https://github.com/openjdk/jdk/pull/20907#pullrequestreview-2290148202 : > I think this is wrong. We want 'resolve' to read the oop and tell the GC that this oop should be alive. If GetMethodDeclaringClass is crashing, make sure the holder is alive before the iteration could safepoint. > > This code does keep it in a Handle, so that should be sufficient. The problem here is `jfr_get_all_event_classes` which returns all sub-Classes of `jdk/internal/event/Event`. It does walking the class hierarchy and stores the relevant klasses' java mirrors inside handles. It then store these handles' oops in an ArrayList and returns them to Java. The issue is that you are only ever allowed to write java mirrors into the object graph (including jni_locals) if you know that the holder is kept alive, which is not true in this test. And not true in general for `jfr_get_all_event_classes` (with custom events, not loaded through the null classloader). Running with the following patch stops this crash from occurring. ``` diff --git a/src/hotspot/share/jfr/support/jfrJdkJfrEvent.cpp b/src/hotspot/share/jfr/support/jfrJdkJfrEvent.cpp index 0c2fb0206ec..b8985a15e53 100644 --- a/src/hotspot/share/jfr/support/jfrJdkJfrEvent.cpp +++ b/src/hotspot/share/jfr/support/jfrJdkJfrEvent.cpp @@ -100,7 +100,9 @@ static void transform_klasses_to_local_jni_handles(GrowableArray<const void*>& e for (int i = 0; i < event_subklasses.length(); ++i) { const InstanceKlass* k = static_cast<const InstanceKlass*>(event_subklasses.at(i)); assert(is_allowed(k), "invariant"); - event_subklasses.at_put(i, JfrJavaSupport::local_jni_handle(k->java_mirror(), thread)); + oop java_mirror = k->java_mirror(); + Universe::heap()->keep_alive(java_mirror); + event_subklasses.at_put(i, JfrJavaSupport::local_jni_handle(java_mirror, thread)); } } ``` At the time of JDK-8339725 I started to look at if we could satisfy this dream of "just put it in a handle". Here is some early p.o.c. I worked on that back then https://github.com/openjdk/jdk/compare/master...xmas92:jdk:instanceMirrorOoop (only builds in debug). It is not fool proof, may do unnecessary work, but is less of a whack'a'mole than our current situation. Then there is the project I put on hold over a year ago to remove the CLD handle area and put all the oops directly in the object graph. This also solves this by no longer having these weird CLD oops, and instead have the runtime go through the less confusing (and less error prone) OopStorage. And instead move the scary loads to the interpreter and compiled code, where we already today are much more strict about keeping holders alive if we uses, classes, constant pools, resolved references etc. (Latest rebase: https://github.com/openjdk/jdk/compare/master...xmas92:jdk:cld_handle_area_removal_v7) This is nothing that is going in anytime soon, but maybe interesting to push for once leyden is more stable.
15-10-2024

This is a regression from JDK-8326820 in JDK 24 b05 ([~aboldtch]). I can reproduce it reliably on Mach5 with changeset 5909d54147355dd7da5786ff39ead4c15816705c but not with the changeset just before (d5375c7db658de491c1f5bad053040d21b82941e). Moving to hotspot/gc.
15-10-2024

This first reproduces with JDK 24 b5. Narrowing it down ...
15-10-2024

Note that we also have this bug: JDK-8336164
14-10-2024

Yes, thus it is a HLH -> P2.
01-10-2024

I can locally reproduce this intermittently when running with -Xcomp. Initial ILW = Assert in G1 heap verification most likely due to a broken oop, single test, no known workaround = HLH = P2
01-10-2024

The bug happend during heap verification, not during printing Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x14b4621] oopDesc::print_value_on(outputStream*) const+0x311 (compressedKlass.inline.hpp:46) V [libjvm.so+0x14896bf] ObjArrayKlass::oop_print_on(oop, outputStream*)+0x1af (objArrayKlass.cpp:390) V [libjvm.so+0x14b3f0e] oopDesc::print_on(outputStream*) const+0x6e (oop.cpp:47) V [libjvm.so+0xceea86] G1VerifyLiveAndRemSetClosure::Checker<narrowOop>::print_containing_obj(outputStream*, G1HeapRegion*)+0xc6 (g1HeapRegion.cpp:506) ...... V [libjvm.so+0xcef375] G1VerifyLiveAndRemSetClosure::LiveChecker<narrowOop>::report_error()+0x125 (g1HeapRegion.cpp:572) ..... ^^^^ The error is here V [libjvm.so+0xcefdb4] void G1VerifyLiveAndRemSetClosure::do_oop_work<narrowOop>(narrowOop*)+0x4b4 (g1HeapRegion.cpp:650) V [libjvm.so+0xcf00df] void OopOopIterateDispatch<G1VerifyLiveAndRemSetClosure>::Table::oop_oop_iterate<ObjArrayKlass, narrowOop>(G1VerifyLiveAndRemSetClosure*, oop, Klass*)+0x14f (g1HeapRegion.cpp:667) V [libjvm.so+0xced02a] G1HeapRegion::verify_liveness_and_remset(VerifyOption) const+0x2fa (iterator.inline.hpp:295) V [libjvm.so+0xced433] G1HeapRegion::verify(VerifyOption) const+0x33 (g1HeapRegion.cpp:708) V [libjvm.so+0xd04e25] VerifyRegionClosure::do_heap_region(G1HeapRegion*)+0x125 (g1HeapVerifier.cpp:268) V [libjvm.so+0xcf6b9f] G1HeapRegionManager::par_iterate(G1HeapRegionClosure*, G1HeapRegionClaimer*, unsigned int) const+0x6f (g1HeapRegionManager.cpp:577) V [libjvm.so+0xd030c8] G1VerifyTask::work(unsigned int)+0x38 (g1HeapVerifier.cpp:309) V [libjvm.so+0x1946950] WorkerThread::run()+0x80 (workerThread.cpp:70) V [libjvm.so+0x17fc8c6] Thread::call_run()+0xb6 (thread.cpp:225) V [libjvm.so+0x14e5da7] thread_native_entry(Thread*)+0x127 (os_linux.cpp:858)
28-09-2024

The failure might looks like: https://bugs.openjdk.org/browse/JDK-8337192 however it has been fixed in JDK24 b09 and the current issue is still reproduced with JDK24 b17. Reproduces with Xcomp only but quite often.
23-09-2024