JDK-8236743 : JFR: assert(klass != __null) failed: invariant in ObjectSampleCheckpoint::add_to_leakp_set
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 11.0.7,14
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2020-01-08
  • Updated: 2021-02-10
  • Resolved: 2020-01-31
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 14 JDK 15
14 b35Fixed 15Fixed
Related Reports
Duplicate :  
Relates :  
Description
Stress test crashed with 
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (open/src/hotspot/share/jfr/recorder/checkpoint/types/traceid/jfrTraceId.inline.hpp:123), pid=7787, tid=7943
#  assert(klass != __null) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (14.0) (fastdebug build 14-internal+0-2020-01-07-2124408.leonid.mesnik.jdk14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 14-internal+0-2020-01-07-2124408.leonid.mesnik.jdk14, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x128f23c]  ObjectSampleCheckpoint::add_to_leakp_set(Method const*, unsigned long)+0x34c
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/d65fa3d4-10de-46aa-a2db-52cc9981537a-S789/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/9d52e96f-a2c4-4092-93b4-9412534447ec/runs/94952704-dc41-4005-ab99-243d348f6e9b/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/core.7787)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

---------------  S U M M A R Y ------------

Command Line: -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -XX:MaxRAMPercentage=6 -XX:MaxRAMPercentage=50 -XX:MaxMetaspaceSize=256m -XX:+HeapDumpOnOutOfMemoryError -XX:+CrashOnOutOfMemoryError -Djava.net.preferIPv6Addresses=false -XX:+DisplayVMOutputToStderr -Xlog:gc*,gc+heap=debug:gc.log:uptime,timemillis,level,tags -XX:+DisableExplicitGC -XX:+StartAttachListener --add-exports=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-exports=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED --add-exports=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED -Djava.io.tmpdir=/opt/mach5/mesos/work_dir/slaves/d65fa3d4-10de-46aa-a2db-52cc9981537a-S789/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/9d52e96f-a2c4-4092-93b4-9412534447ec/runs/94952704-dc41-4005-ab99-243d348f6e9b/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/java.io.tmpdir -Duser.home=/opt/mach5/mesos/work_dir/slaves/d65fa3d4-10de-46aa-a2db-52cc9981537a-S789/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/9d52e96f-a2c4-4092-93b4-9412534447ec/runs/94952704-dc41-4005-ab99-243d348f6e9b/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/user.home -agentpath:/opt/mach5/mesos/work_dir/jib-master/install/2020-01-07-2124408.leonid.mesnik.jdk14/linux-x64-debug.test/hotspot/jtreg/native/libJvmtiStressModule.so -Xverify:all -javaagent:redefineagent.jar -XX:NativeMemoryTracking=detail applications.kitchensink.process.stress.Main /opt/mach5/mesos/work_dir/slaves/d65fa3d4-10de-46aa-a2db-52cc9981537a-S789/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/9d52e96f-a2c4-4092-93b4-9412534447ec/runs/94952704-dc41-4005-ab99-243d348f6e9b/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/kitchensink.final.properties

Host: ol7-build-test-5549, Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz, 8 cores, 58G, Oracle Linux Server release 7.7
Time: Tue Jan  7 22:26:34 2020 UTC elapsed time: 1749 seconds (0d 0h 29m 9s)

---------------  T H R E A D  ---------------

Current thread (0x00007fb92c0df800):  JavaThread "JFR Recorder Thread" daemon [_thread_in_vm, id=7943, stack(0x00007fb886ced000,0x00007fb886dee000)]

Stack: [0x00007fb886ced000,0x00007fb886dee000],  sp=0x00007fb886dec9d0,  free space=1022k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x128f23c]  ObjectSampleCheckpoint::add_to_leakp_set(Method const*, unsigned long)+0x34c
V  [libjvm.so+0x128f40e]  ObjectSampleCheckpoint::write_stacktrace(JfrStackTrace const*, JfrCheckpointWriter&)+0xfe
V  [libjvm.so+0x1290098]  StackTraceBlobInstaller::install(ObjectSample*)+0x1e8
V  [libjvm.so+0x1290604]  ObjectSampleCheckpoint::on_rotation(ObjectSampler const*, JfrStackTraceRepository&)+0xf4
V  [libjvm.so+0xd352a4]  JfrRecorderService::pre_safepoint_write()+0x114
V  [libjvm.so+0xd357a2]  JfrRecorderService::write()+0xd2
V  [libjvm.so+0xd35a9f]  JfrRecorderService::finalize_current_chunk()+0x3f
V  [libjvm.so+0xd396d1]  JfrRecorderService::rotate(int)+0x121
V  [libjvm.so+0xd3b7ba]  recorderthread_entry(JavaThread*, Thread*)+0x15a
V  [libjvm.so+0x15a14a6]  JavaThread::thread_main_inner()+0x226
V  [libjvm.so+0x15a66d6]  Thread::call_run()+0xf6
V  [libjvm.so+0x12d5db6]  thread_native_entry(Thread*)+0x116
Comments
[~sgehwolf] this is a good candidate for backporting to OpenJDK11u. JFR on C2 or CE crashes on JDK11 on the dacapo:jython benchmark and this issue seems to fix it.
10-02-2021

[~dnsimon] Thanks for the heads-up. I'll add it to our tracking.
10-02-2021

Changeset: 953fbd2e Author: Markus Gr��nlund <mgronlun@openjdk.org> Date: 2020-01-31 12:17:55 +0000 URL: https://git.openjdk.java.net/panama-foreign/commit/953fbd2e
07-02-2020

URL: https://hg.openjdk.java.net/jdk/jdk14/rev/f8bf9cb16b5e User: mgronlun Date: 2020-01-31 11:18:46 +0000
31-01-2020

Fix Request Without this fix, usages of the OldObjectSampleEvent (at the core of memory leak detection in JFR) involving allocation stack traces must be considered unsafe and risky. In scenarios involving JVMTI class redefinition / retransformation (pervasive these days), it can lead to crashes because of attempts to access Method*s in the allocation stack trace whose underlying metaspace memory could have potentially been freed. For a detailed description, please see previous comment. The scope of the problem is JFR only, more specifcally the memory leak detection part of JFR. The OldObjectSample stacktrace setting is disabled in default.jfc (out-of-the-box) and only enabled in profile.jfc. This reduces the area of impact. The fix is considered low risk. The fix has been tested using jdk_tier1-5 + jdk_jfr test sets. Long running stress tests have been performed (30 min, 8 Hour) in situations employing class retransform / redefinitions operations. Public review email with webrev link: https://mail.openjdk.java.net/pipermail/hotspot-jfr-dev/2020-January/001078.html Reviewed by: [~egahlin]
30-01-2020

Fix request approved.
29-01-2020

The ObjectAllocationInNewTLABEvent and ObjectSampleEvent share a stack trace taken at the site of interception, with the latter being responsible for materializing an JfrStackTrace object (array of JfrStackFrames) and registering it with the JFR stack trace table. Although the stack trace is shared, for the ObjectSampleEvent there are some special requirements: One is that the stack trace needs to be fully resolved into one (or many) blobs representing a serialized set of necessary constants. This requirement is fulfilled by setting special tag bits in the Method* objects for identification, in addition to setting the regular tag bits allowing serialization of the stack trace for the purpose of the ObjectAllocationInNewTLABEvent and others. A previous enhancement, JDK-8225797 - "OldObjectSample event creates unexpected amount of checkpoint data", was filed because of problems reported in connection with large amount of checkpoint data both retained and accumulating over time. When working on JDK-8225797, the analysis revealed that these special tag bits were set in the Method*'s at the time and location of capturing the initial stack trace. Albeit functionally correct, it would eventually lead to problems, since all stack traces taken at this site had their Method*s tagged with the special tag bits. The purpose of these bits is to identify which constants will outlive the chunk, to be serialized in a blob. A most sampled objects die young, there is only a very small number out of the total number of sampled objects for which this work needs to be done. So, for JDK-8225797, one aspect of the overall solution was to delay the setting of these special tag bits as long as possible; in effect waiting for the small survival set of object sample candidates to stabilize. As part of the chunk rotation mechanism, which is a boundary for the lifetime of constants, this set would then be inspected. The stack trace for an individual ObjectSample would then be serialized into a blob with the special bits only set for constants related to these surviors. Existing mechanism: A JFR stack trace table lookup for the ObjectSample is performed to retrieve its original stack trace. Once found, the stack frames and associated information (methodid, lineno, bci and frame type) are serialized into a blob. Part of this serialization, for each frame in the stack trace, the associated Method* is tagged with the special bits. This is a relatively unsafe and sensitive operation and special care is provided to have the Method* linger (raw as it were) during the entire chunk lifetime. The method holder (the InstanceKlass* carrying these Method*s) can unload at any time. The solution to this is based on the fact that JFR already intercepts Klasses that are unloading. By representing objects indirectly using a numeric identifier (traceid), instead of the raw pointers, we can determine if the method holder InstanceKlass* has unloaded before attempting to access the Method*. This is accomplished by having the methodid encode the identifier for the method holder klass. This leads us to JDK-8236743: void ObjectSampleCheckpoint::add_to_leakp_set(const Method* method, traceid method_id) { if (is_processed(method_id) || is_klass_unloaded(method_id)) { return; } JfrTraceId::set_leakp(method); } inline void JfrTraceId::set_leakp(const Method* method) { assert(method != NULL, "invariant"); const Klass* const klass = method->method_holder(); assert(klass != NULL, "invariant"); <<-------------------------- assertion does not hold assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant"); assert(METHOD_FLAG_USED_THIS_EPOCH(method), "invariant"); SET_LEAKP(klass); SET_METHOD_LEAKP(method); } is_klass_unloaded(methodid) performs the check to see if the method holder klass has unloaded. In this situation, it has not, so we proceed with accessing the Method*. The method holder Klass* is retrieved from the method using method->method_holder() but now there is no Klass* associated with this method. How can this be? It has to do with JVMTI redefinition / retransformation. If a Klass* is redefined / retransformed, the method holder InstanceKlass* will get an entire new set of Method*s. The previous, now old and replaced, set of Method*s (one or several of which could exist in the JFR stack trace table) is moved onto the temporary loaded scratch InstanceKlass*. In VM_RedefineClasses::doit(), a construct, MetadataOnStackMark, attempt to discover existing usages of Method*s (mechanism: Method::set_on_stack()), to be used during the redefinition / method replacement process to prevent premature unloading of existing Method*s. In InstanceKlass::add_previous_version(): ... // If the constant pool for this previous version of the class // is not marked as being on the stack, then none of the methods // in this previous version of the class are on the stack so // we don't need to add this as a previous version. ConstantPool* cp_ref = scratch_class->constants(); if (!cp_ref->on_stack()) { log_trace(redefine, class, iklass, add)("scratch class not added; no methods are running"); // For debugging purposes. scratch_class->set_is_scratch_class(); scratch_class->class_loader_data()->add_to_deallocate_list(scratch_class); return; } Since our Method* is not live on the stack, but instead located in a JfrStackFrame in the JFR stack trace table, cp_ref->on_stack() is false for the scratch class constant pool. This means the temporary scratch InstanceKlass*, and the previous set of Method*, is not kept alive (added as a previous version to the original method holder). Instead, the scratch InstanceKlass* is submitted for deallocation and destruction. This implies the access to the Method*, even though the original method holder klass has not unloaded, is incorrect an unsafe. Here we are attempting to read a Method*, ConstantPool* and InstanceKlass* whose corresponding Metadata could have been (and in this case has) released (and reused). This is the problem that needs to be fixed. A possible solution would be to extend MetadataOnStackMark to have it also inspect the JFR stack trace table to mark the Method*s as "being on stack". Setting semantic correctness concerns aside, this would add to the overall time spent in the JVMTI VM_RedefineClasses safepoint operation. In addition, such a solution would need to mark all Method*s found in the stack trace table as "alive", but we only need the small set of methods related to surviving ObjectSamples to be kept alive. A more effective solution can be accomplished by retaining most of the existing logic, but, instead of having a direct Method* in the JfrStackFrame, we change it into the InstanceKlass* of the method holder. An existing method holder InstanceKlass* will not be deallocated during JVMTI redefinition / retransformation. It will always hold the latest set of Method*s. Of course, the InstanceKlass* could unload at any time, but here the existing unload detection mechanism already in place (and described above) will prevent illegal access attempts. Giving up direct access to the Method*, a means to derive the correct Method* is needed for the purpose of setting the special tag. This can be accomplished by a reverse method lookup, because the original method id number identifying the method is encoded in the methodid identifier. The original method id number can be masked off and a reverse lookup can be performed using InstanceKlass::method_with_orig_idnum(int orig_method_id_num).
29-01-2020