Bug ID: JDK-8234096 Mutex rank ordering problem JVMTI LoadedClassesClosure and G1 SATB queue

JDK-8234096 : Mutex rank ordering problem JVMTI LoadedClassesClosure and G1 SATB queue

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 8u232

Priority: P3
Status: Open
Resolution: Unresolved
OS: generic
CPU: generic

Submitted: 2019-11-13
Updated: 2021-09-21

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 8
8-poolUnresolved

Related Reports

Relates :

JDK-8187577 - JVM crash during gc doing concurrent marking

Description

A user reported a mutex rank ordering problem between JVMTI LoadedClassesClosure and G1's SATB queues. As far as I can tell, this is caused by JDK-8187577:

ClassLoaderData::loaded_classes_do() acquires the "Metaspace allocation lock/5" and LoadedClassesClosure tries to enqueue it into G1's SATB queue, thus attempts to acquire "SATB_Q_CBL_mon/16"

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/buildbot/worker/jdk8u-redhat-linux/build/hotspot/src/share/vm/runtime/mutex.cpp:1321), pid=52148, tid=0x00007f270fd93700
#  fatal error: acquiring lock SATB_Q_CBL_mon/16 out of order with lock Metaspace allocation lock/5 -- possible deadlock
#
# JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0-builds.shipilev.net-openjdk-jdk8-redhat-fastdebug-b113-20191111-aarch64-shenandoah-jdk8u232-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.71-b113-20191111-aarch64-shenandoah-jdk8u232-09-fastdebug mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00007f270805d800):  JavaThread "main" [_thread_in_vm, id=52149, stack(0x00007f270fc93000,0x00007f270fd94000)]

Stack: [0x00007f270fc93000,0x00007f270fd94000],  sp=0x00007f270fd8b770,  free space=993k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x1235370]  VMError::report_and_die()+0x160
V  [libjvm.so+0x79c14f]  report_fatal(char const*, int, char const*)+0x6f
V  [libjvm.so+0xe03c57]  Monitor::set_owner_implementation(Thread*)+0x227
V  [libjvm.so+0xf63f95]  PtrQueueSet::enqueue_complete_buffer(void**, unsigned long)+0x25
V  [libjvm.so+0xf64268]  PtrQueueSet::process_or_enqueue_complete_buffer(void**)+0x78
V  [libjvm.so+0xf646d5]  PtrQueue::enqueue_known_active(void*)+0x265
V  [libjvm.so+0x9036b6]  G1SATBCardTableModRefBS::enqueue(oop)+0x166
V  [libjvm.so+0xc353d2]  LoadedClassesClosure::do_klass(Klass*)+0x102
V  [libjvm.so+0x693955]  ClassLoaderData::loaded_classes_do(KlassClosure*)+0x55
V  [libjvm.so+0x69528b]  ClassLoaderDataGraph::loaded_classes_do(KlassClosure*)+0x2b
V  [libjvm.so+0xc337b9]  JvmtiGetLoadedClasses::getLoadedClasses(JvmtiEnv*, int*, _jclass***)+0xb9
V  [libjvm.so+0xb9e4d7]  jvmti_GetLoadedClasses+0x127
C  [libjdwp.so+0xc7a7]  classTrack_processUnloads+0x77
C  [libjdwp.so+0x14b6b]  event_callback+0x2eb
C  [libjdwp.so+0x158ff]  cbClassPrepare+0x7f
V  [libjvm.so+0xc2524a]  JvmtiExport::post_class_prepare(JavaThread*, Klass*)+0xea
V  [libjvm.so+0x9f336c]  InstanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*) [clone .part.235]+0x6ac
V  [libjvm.so+0x9f33ec]  InstanceKlass::link_class_impl(instanceKlassHandle, bool, Thread*)+0x4c
V  [libjvm.so+0x9f4727]  InstanceKlass::initialize(Thread*)+0x157
V  [libjvm.so+0xaac066]  InterpreterRuntime::_new(JavaThread*, ConstantPool*, int)+0x236

Comments

I ported the testcase to jdk14 and tested it, and we're good. In jdk14, there is no locking involved in G1 SATB enqueue(). I also tested the same in jdk11, and it's good there too, but for other reasons :-) You said, loaded_classes_do() holds the ClassLoaderDataGraph_lock, but I don't see that. Instead it seems to use a lock-free CAS-based protocol. Also, the Shared_SATB_Q_lock has a much lower rank (access+1) in jdk11, which would be below everything else afaict. The testcase is here (applies on 11 too): http://cr.openjdk.java.net/~rkennke/JDK-8234096-test/webrev-jdk14.00/
15-11-2019
This is 8u-specific. The code is very different in 14. I believe you replaced the Metaspace locking in loaded_classes_do() with lock-free iteration around jdk10 timeframe, and the whole explicit enqueuing has been replaced by the Access API stuff. I have a fix and G1-testcase (see above), and I will propose it shortly to jdk8u-dev. I would like your opinion on the fix anyway (maybe in review).
15-11-2019
loaded_classes_do() has the ClassLoaderDataGraph_lock now, but the rank of that is quite high. Then we also lowered the rank of the G1SATB lock, which I think you can still take in the iteration under the right G1 conditions in the barrier.
15-11-2019
Is this a bug in JDK 14 or only 8u?
14-11-2019
This testcase reproduces with G1: http://cr.openjdk.java.net/~rkennke/JDK-8234096-test/webrev.02/
14-11-2019
This is a slightly better testcase: http://cr.openjdk.java.net/~rkennke/JDK-8234096-test/webrev.01/ It still has the problem that, with a fix, it keeps pushing stuff on the SATB queues (because it doesn't filter them). This can make the test time-out, or even prevent reaching a safepoint. However, filtering the SATB queues means that the klass-mirrors will always all be marked already, and never even attempted to enqueue. A fix candidate is this: http://cr.openjdk.java.net/~rkennke/JDK-8234096/webrev.00/ It avoids enqueuing oops during loaded_classes_do() (and thus under the wrong lock) and instead enqueues them during extract().
14-11-2019
I made a reproducer for Shenandoah (applies and fails on top of https://hg.openjdk.java.net/shenandoah/jdk8/hotspot): http://cr.openjdk.java.net/~rkennke/JDK-8234096-test/webrev.00/ The bug is not Shenandoah-specific though, and it should be possible to come up with a testcase for G1 too. What I do in the test: - Call JVMTI GetLoadedClasses() in a loop - Use aggressive heuristics to continuously dive into GC cycles - Use tiny SATB buffers in order to take the locking path as often as possible - Bypass SATB filtering, otherwise no single oop would be enqueued This quickly fails similar to the original bug report.
14-11-2019
Coleen, can you check this?
13-11-2019
I believe this was kinda-fixed in later versions by JDK-8180325, by making loaded_classes_do() lock-free.
13-11-2019