JDK-8308745 : ObjArrayKlass::allocate_objArray_klass may call into java while holding a lock
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-05-24
  • Updated: 2024-03-14
  • Resolved: 2024-03-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 23
23 b14Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
Found while trying to stress metaspace. 
Seems like the issue is that the MultiArray_lock is taken to make the array klass creation atomic. However if ObjArrayKlass::allocate_objArray_klass fails to allocate due to metaspace exhaustion and JVMTI JVMTI_EVENT_RESOURCE_EXHAUSTED is enabled Metaspace::report_metadata_oome will try to call into java, with MultiArray_lock held.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/scratch/aboldtch/jdk/open/src/hotspot/share/runtime/interfaceSupport.inline.hpp:187), pid=732125, tid=1579048
#  assert(!thread->owns_locks()) failed: must release all locks when leaving VM
#
# JRE version: Java(TM) SE Runtime Environment (21.0) (fastdebug build 21-internal-LTS-2023-05-15-1851326.aboldtch...)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 21-internal-LTS-2023-05-15-1851326.aboldtch..., mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x11c39fc]  JvmtiJavaThreadEventTransition::JvmtiJavaThreadEventTransition(JavaThread*)+0x21c

Relevant stack trace:
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x11c39fc]  JvmtiJavaThreadEventTransition::JvmtiJavaThreadEventTransition(JavaThread*)+0x21c  (interfaceSupport.inline.hpp:187)
V  [libjvm.so+0x11bdd7b]  JvmtiExport::post_resource_exhausted(int, char const*)+0x14b  (jvmtiExport.cpp:1781)
V  [libjvm.so+0x13c49fd]  Metaspace::report_metadata_oome(ClassLoaderData*, unsigned long, MetaspaceObj::Type, Metaspace::MetadataType, JavaThread*)+0x41d  (metaspace.cpp:972)
V  [libjvm.so+0x13c4c35]  Metaspace::allocate(ClassLoaderData*, unsigned long, MetaspaceObj::Type, JavaThread*)+0xc5  (metaspace.cpp:922)
V  [libjvm.so+0x12053c2]  Klass::initialize_supers(Klass*, Array<InstanceKlass*>*, JavaThread*)+0x3b2  (array.inline.hpp:36)
V  [libjvm.so+0x639761]  ArrayKlass::complete_create_array_klass(ArrayKlass*, Klass*, ModuleEntry*, JavaThread*)+0x21  (arrayKlass.cpp:105)
V  [libjvm.so+0x146e046]  ObjArrayKlass::allocate_objArray_klass(ClassLoaderData*, int, Klass*, JavaThread*)+0x3f6  (objArrayKlass.cpp:127)
V  [libjvm.so+0xe5a423]  InstanceKlass::array_klass(int, JavaThread*)+0xe3  (instanceKlass.cpp:1481)
V  [libjvm.so+0xe5dbfb]  InstanceKlass::allocate_objArray(int, int, JavaThread*)+0x23b  (instanceKlass.cpp:1406)
V  [libjvm.so+0x14ac1e7]  oopFactory::new_objArray(Klass*, int, JavaThread*)+0xe7  (oopFactory.cpp:122)
V  [libjvm.so+0xea06d9]  InterpreterRuntime::anewarray(JavaThread*, ConstantPool*, int, int)+0xf9  (interpreterRuntime.cpp:256)
Comments
Changeset: 1877a487 Author: Coleen Phillimore <coleenp@openjdk.org> Date: 2024-03-08 13:27:54 +0000 URL: https://git.openjdk.org/jdk/commit/1877a4879598356b777742fe80bdd5fa77ca8e8d
08-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/17739 Date: 2024-02-06 22:59:04 +0000
05-03-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/17660 Date: 2024-01-31 18:57:23 +0000
31-01-2024

There's a metaspace allocation function that doesn't call GC or throw OOM, returning null. Calling this function would be good for ObjArrayKlass, except just afterwards we create the mirror and link the newly created ObjArrayKlass into the mirror and back to the component type. The mirror creation, and subsequent subtype link (allocates Array<Klass*> to link into the Klass), have to be done without a lock, so we'd have to give up the lock, ie here: // Initialize instance variables ObjArrayKlass* oak = ObjArrayKlass::allocate(loader_data, n, element_klass, name); if (oak == nullptr) { MutexUnlocker mul(MultiArray_lock); // throw oom metaspace return nullptr } ModuleEntry* module = oak->module(); assert(module != nullptr, "No module entry for array"); { MutexUnlocker mul(MultiArray_lock); // Call complete_create_array_klass after all instance variables has been initialized. ArrayKlass::complete_create_array_klass(oak, super_klass, module, CHECK_NULL); } Then we're left with the same problem as the lock-free solution: multiple threads can get here and create the ObjArrayKlass and some have to deallocate if they lose the race.
30-01-2024

Oh now I see I already had a comment about the claim bit idea. The latest code has the claim bit as a recursive lock so that if the event goes to Java and back to allocating an ObjArray, it won't hang.
30-01-2024

Yes, the MultiArray_lock is a flaw in the design because we want to create the ObjArray for the dimensions, and while creating these arrays, we also create the mirrors and other things that rely on the hierarchy of arrays to be completed. There is also a pointer from the component_mirror() to the newly created objArrayKlass, ie: java.lang.Class {0x000000011f012348} - klass: 'java/lang/Class' - ---- fields (total size 14 words): - private volatile transient 'classRedefinedCount' 'I' @12 0 (0x00000000) - injected 'klass' 'J' @16 140364816058896 (0x00007fa93b000a10) - injected 'array_klass' 'J' @24 0 (0x0000000000000000) ... - private final 'componentType' 'Ljava/lang/Class;' @64 a 'java/lang/Class'{0x000000011f011f88} = 'MyClass' (0x23e023f1) ... - signature: [LMyClass; ComponentType is MyClass component mirror java.lang.Class {0x000000011f011f88} - klass: 'java/lang/Class' - ---- fields (total size 14 words): - private volatile transient 'classRedefinedCount' 'I' @12 0 (0x00000000) - injected 'klass' 'J' @16 140364816058368 (0x00007fa93b000800) - injected 'array_klass' 'J' @24 140364816058896 (0x00007fa93b000a10) <--- points to the array of MyClass. ... - signature: LMyClass; All this to say one of the uses of MultiArray_lock to only have one thread to link these classes together. I have a prototype of a lock free solution that CAS for InstanceKlass::_array_klasses and ArrayKlass::_higher_dimension. If the CAS fails, we deallocate the newly creates ObjArrayKlass, and only link it into the componentType class only if the CAS succeeds. This works pretty well, except for one test on macosx/x64. The test is a stress test for a bug that the JIT encountered where the component mirror array_klass link was not updated after the mirror. The test runtime/CreateMirror/ArraysNewInstanceBug.java takes 100x longer to run on macos/x64 and times out. It's a stress test, and it could be excluded for the platform as it doesn't show this behavior on other machines. But the number of metaspace deallocations this test shows is worrying. Another approach that I implemented was to create a recursive claim field rather than taking the MultiArray_lock. It's effectively the same thing as an exclusive lock, but it's recursive and can be used if the thread allocating the metadata has to call into Java, and back into the VM to create the same objArrayKlass or another one. It's a bit simpler (not as simple as the lock). The MultiArray_lock becomes a Monitor and the thread that can't claim allocation, will wait on this monitor until it can. This passes tier1-7. The last approach that I didn't get far prototyping, is to allocate the metaspace without the code that throws OOM and calls into JNI to report the OOM metaspace event. I'm not sure why I gave up on this other than I had to create new APIs and check for and pass nullptrs around, and throw the OOM outside the lock. Hm.
30-01-2024

>> That array classes of super classes are super classes of array classes of subclasses is due to some compiler optimization, something about covariance that a C2 person has to explain. [~jrose] or [~vlivanov] ? > The type relationships for array types in Java is just part of the language definition. Right, it's part of JLS: https://docs.oracle.com/javase/specs/jls/se8/html/jls-4.html#jls-4.10.3 4.10.3. Subtyping among Array Types The following rules define the direct supertype relation among array types: If S and T are both reference types, then S[] >1 T[] iff S >1 T.
07-12-2023

The more I look at this code the more it seems that using the MultiArray_lock the way we do, around code that can potentially throw Java exceptions (plus whatever associated JVMTI events that can be posted) is just fundamentally flawed. We hold a mutex across code that can escape to Java - and we must not do that. Further, we can't require any kind of lock to create multi-dim arrays, and allow a jump to Java in the midst of that as the Java code could also need to create a multi-dim array! But this seems a long-standing flaw in the design.
05-12-2023

Thanks for clarifying [~coleenp]. The type relationships for array types in Java is just part of the language definition. What I was trying to see with this code was if there is a relatively easy way to get these low-level metaspace OOM failures to be reported back up to this code level so that we can respond to them after releasing the lock. Unfortunately the call chain seems far too deep and generic. So the only real solution here seems to be to release the lock before doing this step. But I'm unclear exactly what role the MultiArray_lock plays here.
05-12-2023

The code in ObjArrayKlass::allocate_objArray_klass, creates arrays for the super class and secondary supers of the class requested, say A extends B implements I. If you're creating a one dimensional array of A "[LA;", this creates an "[LB;" and an "[LI;". Then ArrayKlass::complete_create_array_klass calls Klass::initialize_supers to link classes "[LB;" and "[LI;" to super and secondary supers of the newly created "[LA;". ie. it's done in two steps, we create the array classes first, then link them in once the superclass is created. That array classes of super classes are super classes of array classes of subclasses is due to some compiler optimization, something about covariance that a C2 person has to explain. [~jrose] or [~vlivanov] ?
05-12-2023

Just looking at that stack: V [libjvm.so+0x13c4c35] Metaspace::allocate(ClassLoaderData*, unsigned long, MetaspaceObj::Type, JavaThread*)+0xc5 (metaspace.cpp:922) V [libjvm.so+0x12053c2] Klass::initialize_supers(Klass*, Array<InstanceKlass*>*, JavaThread*)+0x3b2 (array.inline.hpp:36) V [libjvm.so+0x639761] ArrayKlass::complete_create_array_klass(ArrayKlass*, Klass*, ModuleEntry*, JavaThread*)+0x21 (arrayKlass.cpp:105) Shouldn't we have handled supers before we get to complete_create_array_klass? The super handling explicitly unlocks the MultiArray_lock: if (!supers_exist) { // Oops. Not allocated yet. Back out, allocate it, and retry. Klass* ek = nullptr; { MutexUnlocker mu(MultiArray_lock); super_klass = element_super->array_klass(CHECK_NULL); for( int i = element_supers->length()-1; i >= 0; i-- ) { Klass* elem_super = element_supers->at(i); elem_super->array_klass(CHECK_NULL); } // Now retry from the beginning ek = element_klass->array_klass(n, CHECK_NULL); } // re-lock return ObjArrayKlass::cast(ek); ??
05-12-2023

That's an interesting idea: Current code has: Mutex-lock -> metaspace allocate -> metaspace OOM (allocate OutOfMemoryError object) -> JVMTI event -> native transition -> possible transition to Java There are three things wrong with the current code. One is that we have a Mutex-lock around allocating the OOM object, and the second is that we're transitioning to native, and the third is that we could transition to Java (the native transition assert prevents this now) with the Mutex-lock held. For InstanceKlass-> _array_klasses and Array-> _higher dimension, I removed the lock and had the CAS winner set the field, and deallocate allocated arrays for losers. That worked great for most things but I have a timeout on macosx x86 for one test that does this with 64 threads that all seem to run at the same time. So I want to add a claim bit to serialize the InstanceKlass case instead. But this could be problematic for the subsequent Java code like: New code: claim-bit -> metaspace allocate -> metaspace OOM (allocate OOM object) -> JVMTI event -> native transition -> Java code that tries to also allocate this InstanceKlass::_array_klasses will hang. Maybe restricting it so that the JVMTI event isn't fired if this is the case might be a reasonable compromise (?)
04-12-2023

We don't/can't control what the callback does. We could potentially choose not to report metaspace OOMs, but given we have been doing so, that loss may upset consumers of these events. Event callbacks have to be written carefully but nothing is specifically prohibited for them (something I have lamented in the past as there is no way we can tolerate arbitrary diversions into Java code from deep in the VM!). But generally, we have the responsibility of ensuring that JVMTI event reporting is done in a safe manner, knowing that the callback may indeed call back into Java. Though IIUC in this particular case it is a transition to native that triggers the assert, not a transition to Java. I assume the OOM event generation logic is deep in the allocation code so knows nothing about any higher-level locking protocols? Otherwise we could process the OOM locally and only trigger the event when safe to do so.
04-12-2023

This would be easier to fix if we disallowed upcalls to Java while allocating metadata for resource exhausted events. I can remove the lock (because we can't allocate object for OOM with the lock held) but want a bit to claim the objArray allocation for this InstanceKlass. Using CAS to allocate the objArray then reclaim it if the CAS fails, is really slow on my macosx machine for the one stress test that does this in multiple threads. If the JVMTI resource exhausted callback tried to allocate some metadata when Metaspace is exhausted, it would hang anyway. Although I suppose the same would happen if Java code tries to allocate heap when the Java heap is exhausted also. This seems like such a problematic interface. Do we really need to allow upcalls for Metaspace exhaustion? <event label="Resource Exhausted" id="ResourceExhausted" const="JVMTI_EVENT_RESOURCE_EXHAUSTED" num="80" since="1.1"> <description> Sent when a VM resource needed by a running application has been exhausted. Except as required by the optional capabilities, the set of resources which report exhaustion is implementation dependent.
30-11-2023

ILW = HLL = P4
30-05-2023

Oh, yes, Robbin pointed that out at one point to me a long time ago for Metaspace OOM (also might be a problem with Java Heap OOM). Thanks for reporting these - they fell off my radar. How did you trigger these? Can you post the command line?
30-05-2023

Maybe I confused things. I think the title is wrong. We are not calling Java but some jvmti native callback. The issue is that we are transitioning into native from VM with a lock. There seems however to be common for jvmti ResourceExhausted callback to call into java again. Note it is not the OOME that is the issue, but jvmti.
29-05-2023

You would have to guarantee to throw a pre-allocated stackless OOME instance, but they are the fallback instances - IIRC initially we try to create a stack and that requires executing Java code which is not permitted while holding the Mutex.
29-05-2023

For some reason, I think we thought you could throw OOM while holding a Mutex. But I guess not. There might be more of these around.
26-05-2023