JDK-8288482 : JFR: Cannot resolve method
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jfr
  • Affected Version: 15, 16, 17, 18, 19, 20
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2022-06-15
  • Updated: 2022-11-23
  • Resolved: 2022-07-19
Fix Versions: JDK 19 (b32, Fixed), JDK 20 (Fixed)
Description
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.109s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.110s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object
[0.110s][info ][jfr,system,parser] Missing object id=99811329 in pool jdk.types.Method. All ids should reference an object

Reproducer:

$ java Repro.java
# May require hundreds of retries before the NPE happens
$ jfr -J-Xlog:jfr+system+parser=info print file.jfr

Happens with the first event in the second chunk. Possibly a regression from the removal of event handlers (JDK-8282420); the retry mechanism may not have survived that change intact.
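
The original Repro.java is not attached to this report. Based on the comment below describing the test (1,000,000 custom events emitted while a concurrent 'profile' recording runs, with the failure always in the second chunk), a hypothetical minimal reproducer might look like the following sketch; the event class, field, and file names are illustrative assumptions, not the original test's code.

import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Event;
import jdk.jfr.Name;
import jdk.jfr.Recording;

// Hypothetical reproducer (not the original test): emits a large number of
// custom events while a concurrent 'profile' recording forces chunk rotations.
public class Repro {

    @Name("Repro")
    static class ReproEvent extends Event {
        int id;
    }

    public static void main(String[] args) throws Exception {
        // Run a 'profile' recording at the same time, as described in the comments.
        try (Recording recording = new Recording(Configuration.getConfiguration("profile"))) {
            recording.start();
            // The comments state the test emits 1,000,000 events and that
            // lowering the count made the issue stop reproducing.
            for (int i = 0; i < 1_000_000; i++) {
                ReproEvent event = new ReproEvent();
                event.id = i;
                event.commit();
            }
            recording.dump(Path.of("file.jfr"));
        }
    }
}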

Comments
Changeset: a6bab2a3 Author: Markus Grönlund <mgronlun@openjdk.org> Date: 2022-07-19 09:34:49 +0000 URL: https://git.openjdk.org/jdk19/commit/a6bab2a37905a191e95464b47a2dd4c89ce7be61
19-07-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk19/pull/144 Date: 2022-07-17 13:14:10 +0000
17-07-2022

// Tag bits: "meta bits" == high 8 bits, "bits" == low 8 bits

// "Ideal" tagging

Tag bits (meta bits | bits): 00000000 00000000

Current epoch: 0

A thread tags in the current epoch 0 (no meta bits | epoch 0 tag):

  00000000 00000001

Now the epoch is shifted, current epoch: 1

There is a collection and serialization of artefacts tagged in the previous epoch, here epoch 0. It also involves clearing the epoch 0 tag bit (cas) and setting the epoch 0 cleared meta bit:

Tag bits (epoch 0 cleared meta bit | no bits): 00000001 00000000

A thread now tags in epoch 1 (epoch 0 cleared meta bit | epoch 1 tag):

  00000001 00000010

Now the epoch is again shifted, current epoch: 0

There is a collection and serialization of artefacts tagged in the previous epoch, here epoch 1. It also involves clearing the epoch 1 tag bit (cas) and setting the epoch 1 cleared meta bit:

Tag bits (epoch 1 cleared meta bit, epoch 0 cleared meta bit | no tags): 00000011 00000000

// Tag interleaving

T1: load  : 00000001      // current
T1: or    : 00000001 | 10 // setting the tag for current epoch 1
T2: CAS   : 00000000      // T2 clears the previous epoch 0
T1: store : 00000011      // clearing of the previous epoch 0 is lost

This situation is handled by setting the epoch 0 cleared meta bit.

Tag bits (epoch 0 cleared meta bit | epoch 1 tag, epoch 0 tag): 00000001 00000011

Even though the epoch 0 bit is "set" via the interleave, since the epoch 0 cleared meta bit is also set, the artefact is effectively NOT tagged in epoch 0.

// JDK-8288482

A bug incorrectly clears the wrong epoch cleared meta bit as part of serialization:

template <>
void set_serialized<Method>(MethodPtr method) {
  assert(method != NULL, "invariant");
  SET_METHOD_SERIALIZED(method);
  assert(IS_METHOD_SERIALIZED(method), "invariant");
  CLEAR_THIS_EPOCH_METHOD_CLEARED_BIT(method);
}

The current epoch meta bit is cleared as part of artefact serialization, which is only correct for flushpoints and class unloading. It is wrong for serializations related to chunk rotation, because these have already issued an epoch shift and collect for the previous epoch.

Details:

The above set_serialized() function invalidates the invariant "this situation is handled by setting the epoch cleared meta bit" as part of serializing artefacts related to rotations.

State: previous epoch 1, current epoch 0. Collection and serialization of artefacts tagged in epoch 1:

1) SET_METHOD_SERIALIZED(method);

   00010001 00000011 // serialized meta bit is set

2) CLEAR_THIS_EPOCH_METHOD_CLEARED_BIT(method); // INVALID CLEARING of the meta bit for the current epoch when we are serializing artefacts for the previous epoch

   00010000 00000011 // clearing the wrong meta bit

At this point, if a thread attempts a tag for epoch 0, it will see:

  00010000 00000011

That is, the tag bit for epoch 0 is set, and the cleared meta bit for epoch 0 is NOT set (due to the bug that cleared it). To the thread, this implies the artefact is already serialized; hence the thread will skip enqueuing the klass as part of the load barrier.

Summary:

Serializations related to rotations must not issue CLEAR_THIS_EPOCH_METHOD_CLEARED_BIT because, at this point, the epoch has already shifted. As such, we are collecting the previous epoch and must not clear the current one. CLEAR_THIS_EPOCH_METHOD_CLEARED_BIT is only valid for flushpoints and class unloading, which collect the current epoch.
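
To make the bit traces above concrete, here is a small, self-contained Java model of the 16-bit tag word (high 8 meta bits, low 8 tag bits) that replays the rotation scenario; the constant values and helper names are assumptions chosen to match the traces in this comment, not the actual HotSpot macro definitions.

// Illustrative model of the tag word from the analysis above. Bit positions
// are assumptions that reproduce the comment's traces, not HotSpot constants.
public class TagBitsModel {
    static final int EPOCH_0_TAG          = 0b00000000_00000001;
    static final int EPOCH_1_TAG          = 0b00000000_00000010;
    static final int EPOCH_0_CLEARED_META = 0b00000001_00000000;
    static final int SERIALIZED_META      = 0b00010000_00000000;

    // Render as "metabits bits", e.g. "00010001 00000011".
    static String dump(int bits) {
        String s = String.format("%16s", Integer.toBinaryString(bits)).replace(' ', '0');
        return s.substring(0, 8) + " " + s.substring(8);
    }

    public static void main(String[] args) {
        // State from the analysis: previous epoch 1, current epoch 0, method
        // tagged in both epochs, epoch 0 cleared meta bit set.
        int bits = EPOCH_0_CLEARED_META | EPOCH_1_TAG | EPOCH_0_TAG;
        System.out.println("before serialization : " + dump(bits)); // 00000001 00000011

        // 1) SET_METHOD_SERIALIZED
        bits |= SERIALIZED_META;
        System.out.println("serialized           : " + dump(bits)); // 00010001 00000011

        // 2) The bug: CLEAR_THIS_EPOCH_METHOD_CLEARED_BIT clears the meta bit
        //    for the CURRENT epoch (0) while serializing the PREVIOUS epoch (1).
        bits &= ~EPOCH_0_CLEARED_META;
        System.out.println("after buggy clear    : " + dump(bits)); // 00010000 00000011

        // A thread tagging for epoch 0 now sees the epoch 0 tag bit set and the
        // epoch 0 cleared meta bit NOT set, concludes the method is already
        // tagged, and skips enqueuing the klass in the load barrier.
        boolean looksTagged = (bits & EPOCH_0_TAG) != 0
                           && (bits & EPOCH_0_CLEARED_META) == 0;
        System.out.println("epoch 0 looks tagged : " + looksTagged); // true
    }
}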
17-07-2022

The issue can be reproduced with only ScrubEvents; it also happens on JDK 17.
07-07-2022

How do we best debug this issue? It usually reproduces within 3 minutes on Mac (ARM). There are several other tests that emit lots of events without issues; what is unique about this test is that it runs a 'profile' recording at the same time. The test emits 1,000,000 events; I have tried to lower the count, but then I can't reproduce the issue. It's likely related to chunk rotation, as it always happens in the second chunk.
21-06-2022

I have been unable to reproduce this issue on my Windows machines. Inspection and instrumentation of the retry logic have shown it to be correct.
20-06-2022