Bug ID: JDK-8249245 assert(((((JfrTraceIdBits::load(klass)) & ((JfrTraceIdEpoch::this_epoch_method_and_class

JDK-8249245 : assert(((((JfrTraceIdBits::load(klass)) & ((JfrTraceIdEpoch::this_epoch_method_and_class_bits()))) != 0))) failed: invariant

Type: Bug
Component: hotspot
Sub-Component: jfr
Affected Version: 15,16,17

Priority: P3
Status: Resolved
Resolution: Fixed
OS: linux,windows
CPU: x86_64

Submitted: 2020-07-13
Updated: 2025-01-13
Resolved: 2021-01-18

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 17
17 b06Fixed

Related Reports

Relates :	JDK-8277919 - OldObjectSample event causing bloat in the class constant pool in JFR recording
Relates :	JDK-8260723 - assert(((((JfrTraceIdBits::load(klass)) & ((JfrTraceIdEpoch::this_epoch_method_and_class_bits()))) != 0))) failed: invariant

Description

The following test failed in the JDK16 CI:

applications/dacapo/Dacapo24H.java

Here's a snippet from the log file:

Stress process main method is started.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/4728e7c1-7e67-490e-be0f-6bbf2a2f33db-S176/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/aee6a73b-f084-4278-9f0c-ba9c4193ecaa/runs/38e347bf-7b52-44c9-9aa9-19ecd82f6dfb/workspace/open/src/hotspot/share/jfr/recorder/checkpoint/types/traceid/jfrTraceIdLoadBarrier.inline.hpp:112), pid=22205, tid=22295
#  assert(((((JfrTraceIdBits::load(klass)) & ((JfrTraceIdEpoch::this_epoch_method_and_class_bits()))) != 0))) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (16.0+6) (fastdebug build 16-ea+6-181)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 16-ea+6-181, mixed mode, sharing, tiered, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x139bd74]  JfrTraceIdLoadBarrier::load_leakp(Klass const*, Method const*)+0x124
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/4728e7c1-7e67-490e-be0f-6bbf2a2f33db-S196/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/84072d92-4fac-4276-bdcd-29ca6087b64f/runs/918083e9-1538-4cdb-8fb3-a952ae65f609/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_dacapo_Dacapo24H_java/scratch/0/core.22205)
#
# JFR recording file will be written. Location: /opt/mach5/mesos/work_dir/slaves/4728e7c1-7e67-490e-be0f-6bbf2a2f33db-S196/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/84072d92-4fac-4276-bdcd-29ca6087b64f/runs/918083e9-1538-4cdb-8fb3-a952ae65f609/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_dacapo_Dacapo24H_java/scratch/0/hs_err_pid22205.jfr
#
Unsupported internal testing APIs have been used.

# An error report file with more information is saved as:
# /opt/mach5/mesos/work_dir/slaves/4728e7c1-7e67-490e-be0f-6bbf2a2f33db-S196/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/84072d92-4fac-4276-bdcd-29ca6087b64f/runs/918083e9-1538-4cdb-8fb3-a952ae65f609/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_dacapo_Dacapo24H_java/scratch/0/hs_err_pid22205.log
[thread 22296 also had an error][thread 22290 also had an error]

#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
----------System.err:(150/15347)----------


Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x00007f5800e3efc0):  JavaThread "JFR Recorder Thread" daemon [_thread_in_vm, id=22295, stack(0x00007f58425e5000,0x00007f58426e6000)]

Stack: [0x00007f58425e5000,0x00007f58426e6000],  sp=0x00007f58426e4a20,  free space=1022k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x139bd74]  JfrTraceIdLoadBarrier::load_leakp(Klass const*, Method const*)+0x124
V  [libjvm.so+0x139a3de]  ObjectSampleCheckpoint::write_stacktrace(JfrStackTrace const*, JfrCheckpointWriter&)+0xfe
V  [libjvm.so+0x139ae2a]  StackTraceBlobInstaller::install(ObjectSample*)+0x1ea
V  [libjvm.so+0x139b557]  ObjectSampleCheckpoint::on_rotation(ObjectSampler const*, JfrStackTraceRepository&)+0x337
V  [libjvm.so+0xda2dc4]  JfrRecorderService::pre_safepoint_write()+0xc4
V  [libjvm.so+0xda3315]  JfrRecorderService::chunk_rotation()+0x75
V  [libjvm.so+0xda4da5]  JfrRecorderService::rotate(int)+0x45
V  [libjvm.so+0xda8e4a]  recorderthread_entry(JavaThread*, Thread*)+0x34a
V  [libjvm.so+0x16d3ebc]  JavaThread::thread_main_inner()+0x21c
V  [libjvm.so+0x16d9c40]  Thread::call_run()+0x100
V  [libjvm.so+0x13dda56]  thread_native_entry(Thread*)+0x116

Comments

Changeset: 061ffc47 Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2021-01-18 16:14:07 +0000 URL: https://git.openjdk.java.net/jdk/commit/061ffc47

18-01-2021

Tag algorithm: At a tag site, if the artifact is determined to not be already tagged in the current epoch, the corresponding tag bit (which is a function of the current epoch) is set, for example: 00000001 The flushpoint thread then proceeds to serialize tagged entries: 1. Sets the "serialized" meta bit: 1000 00000001 ^ 2. Clears the "cleared epoch" meta bit: 1000 00000001 ^ // is cleared if set; testers become more performant (no slow branches); note: the "cleared epoch" meta bit is central to the description to follow. After a while there is a rotation, and this involves an epoch shift. After rotation / epoch shift: 1. Serialization of not yet serialized artifacts tagged in the previous epoch. The epilogue will then clear the previous epoch bit for all tagged artifacts: 2. Clear "serialized" meta bit: 0000 00000001 ^ 3. Sets the "cleared epoch" meta bit: 0001 00000001 ^ 4. The tag bit for the previous epoch is cleared using CAS (not to loose information in the current epoch): 0001 00000000 ^ Why this extra "cleared epoch" meta bit? Because we allow threads to set tag bits in artifacts without synchronization. Therefore, in the next moment, since the CAS could have been interleaved with a thread setting bits in the new epoch, the artifact might look like this: 0001 00000011 User thread "tag" vs "rotation / epoch shift"-interleaving: 1. User thread loads existing value: 00000001 2. Rotation / epoch shift thread issues the clear CAS operation: 00000000 3. User thread OR:s "this epoch bit", now 0x2, [00000001 | 10] and stores it back: 00000011 // this has effectively undone the clear operation This situation is exactly why we have the "cleared epoch" meta bits. At tag sites, the predicate should_tag() (is_not_tagged()) is evaluated to determine if a tag operation should be issued: template <typename T> inline bool should_tag(const T* t) { assert(t != NULL, "invariant"); return is_not_tagged(TRACE_ID_RAW(t)); } inline bool is_not_tagged(traceid value) { const traceid this_epoch_bit = JfrTraceIdEpoch::this_epoch_bit(); return ((value & ((this_epoch_bit << META_SHIFT) | this_epoch_bit)) != this_epoch_bit); } Because of the race, considered benign, this is now the value: 0011 00000001 Is this value to be considered tagged in this epoch? If we only check the lsb bit, we would say yes, since the bit corresponding to this epoch is set; that is was set due to a race in the previous epoch eludes us. Fortunately, we have more information available in the metabits: inline bool is_not_tagged(traceid value) { const traceid this_epoch_bit = JfrTraceIdEpoch::this_epoch_bit(); return ((value & ((this_epoch_bit << META_SHIFT) | this_epoch_bit)) != this_epoch_bit); } value: 0011 00000001 expression is : (1 << 8 | 1) == 0001 00000001 is (0001 00000001 & 0001 00000001) != 0000 00000001 ? Yes - because the "cleared epoch" meta bit is set, hence we _should_ treat it as effectively untagged and proceed with the tag operation, even though the lsb bit for this epoch is set. All artifacts use should_tag() to support this, expect for Method because method tagging was historically a bit special. The fundamental problem described in this bug is because only the lsb bit check is used to determine the tag state for methods, i.e. the metabit check exposed in should_tag() is not used. This is strange because I have already added the should_tag(const Method* method) specialization. And at one time I think it was also updated to be used with methods, but maybe i forgot to check it in. // method inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass, const Method* method) { assert(klass != NULL, "invariant"); assert(method != NULL, "invariant"); if (METHOD_FLAG_NOT_USED_THIS_EPOCH(method)) { <<-- is not using should_tag(), but instead a method specific macro that is too weak SET_METHOD_AND_CLASS_USED_THIS_EPOCH(klass); SET_METHOD_FLAG_USED_THIS_EPOCH(method); assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant"); assert(METHOD_FLAG_USED_THIS_EPOCH(method), "invariant"); enqueue(klass); JfrTraceIdEpoch::set_changed_tag_state(); } return (METHOD_ID(klass, method)); } compare this with: // klass inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass) { assert(klass != NULL, "invariant"); if (should_tag(klass)) { <<----------------- is using should_tag() SET_USED_THIS_EPOCH(klass); enqueue(klass); JfrTraceIdEpoch::set_changed_tag_state(); } assert(USED_THIS_EPOCH(klass), "invariant"); return TRACE_ID(klass); } Macro expansion: if (METHOD_FLAG_NOT_USED_THIS_EPOCH(method)) { <<--- #define METHOD_FLAG_NOT_USED_THIS_EPOCH(method) (!(METHOD_FLAG_USED_THIS_EPOCH(method))) #define METHOD_FLAG_USED_THIS_EPOCH(method) (METHOD_FLAG_PREDICATE(method, (THIS_EPOCH_METHOD_FLAG_BIT))) #define THIS_EPOCH_METHOD_FLAG_BIT ((jbyte)THIS_EPOCH_BIT) #define THIS_EPOCH_BIT (JfrTraceIdEpoch::this_epoch_bit()) #define METHOD_FLAG_PREDICATE(method, bits) ((method)->is_trace_flag_set(bits)) <<--- class JfrTraceFlag { private: mutable jshort _flags; public: JfrTraceFlag() : _flags(0) {} bool is_set(jshort flag) const { return (_flags & flag) != 0; <<-- the compare is only against THIS_EPOCH_BIT, not a composite bit set including the "cleared epoch" metabit, and, even had it been a composite, it can't check against != 0 } inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass, const Method* method) { assert(klass != NULL, "invariant"); assert(method != NULL, "invariant"); if (METHOD_FLAG_NOT_USED_THIS_EPOCH(method)) { <<--- The result is that, for methods where interleaving occurred, we do _not_ take the path that would also tag the klass. Therefore, we later get the assertion failure for implication: is_tagged(method) -> is_tagged(klass) SET_METHOD_AND_CLASS_USED_THIS_EPOCH(klass); <<-- klass tagging here SET_METHOD_FLAG_USED_THIS_EPOCH(method); assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant"); assert(METHOD_FLAG_USED_THIS_EPOCH(method), "invariant"); enqueue(klass); JfrTraceIdEpoch::set_changed_tag_state(); } return (METHOD_ID(klass, method)); } Fix is to update (and this time also check-in) to use should_tag() also for Method*, like so: inline traceid JfrTraceIdLoadBarrier::load(const Klass* klass, const Method* method) { assert(klass != NULL, "invariant"); assert(method != NULL, "invariant"); if (should_tag(method)) { <<--- also evaluate metabits SET_METHOD_AND_CLASS_USED_THIS_EPOCH(klass); SET_METHOD_FLAG_USED_THIS_EPOCH(method); assert(METHOD_AND_CLASS_USED_THIS_EPOCH(klass), "invariant"); assert(METHOD_FLAG_USED_THIS_EPOCH(method), "invariant"); enqueue(klass); JfrTraceIdEpoch::set_changed_tag_state(); } return (METHOD_ID(klass, method)); }

17-01-2021

ILW = HLM = P3

14-07-2020