JDK-8274753 : ZGC: SEGV in MetaspaceShared::link_shared_classes
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 18
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: os_x
  • CPU: aarch64
  • Submitted: 2021-10-05
  • Updated: 2022-01-27
  • Resolved: 2021-10-11
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 17 JDK 18
17.0.3-oracleFixed 18 b19Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Description
Test: runtime/cds/appcds/javaldr/GCDuringDump.java

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000109c92fd8, pid=49600, tid=6147
#
# JRE version: Java(TM) SE Runtime Environment (18.0+18) (fastdebug build 18-ea+18-1045)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 18-ea+18-1045, interpreted mode, tiered, compressed class ptrs, z gc, bsd-aarch64)
# Problematic frame:
# V  [libjvm.dylib+0xc92fd8]  MetaspaceShared::link_shared_classes(JavaThread*)+0x1a4

---------------  T H R E A D  ---------------

Current thread (0x0000000113008a20):  JavaThread "main" [_thread_in_vm, id=6147, stack(0x000000016b688000,0x000000016b88b000)]

Stack: [0x000000016b688000,0x000000016b88b000],  sp=0x000000016b88a930,  free space=2058k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0xc92fd8]  MetaspaceShared::link_shared_classes(JavaThread*)+0x1a4
V  [libjvm.dylib+0xc93630]  MetaspaceShared::preload_and_dump_impl(JavaThread*)+0x9c
V  [libjvm.dylib+0xc931d4]  MetaspaceShared::preload_and_dump()+0x50
V  [libjvm.dylib+0xf7bd44]  Threads::create_vm(JavaVMInitArgs*, bool*)+0xc08
V  [libjvm.dylib+0x9062f4]  JNI_CreateJavaVM+0xa8
C  [libjli.dylib+0x49a4]  JavaMain+0x104
C  [libjli.dylib+0x7570]  ThreadJavaMain+0xc
C  [libsystem_pthread.dylib+0x706c]  _pthread_start+0x140


siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0xbababababababac6

Looks very similar to the crash in JDK-8273505
Comments
Fix Request (17u): (second attempt, Goetz' didn't work). Fix needed a small adaptation to work correctly in jdk17. See PR for details: https://github.com/openjdk/jdk17u-dev/pull/118 (needs reviews, hint hint). We want to fix this to prevent the JVM from crashing when running with Zgc. The risk should be pretty low. The patch has been tested for two weeks in SAPs nightlies.
26-01-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/118 Date: 2022-01-25 09:20:30 +0000
25-01-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk17u-dev/pull/10 Date: 2021-12-20 13:11:41 +0000
20-12-2021

Fix request [17u] I backport this for parity with 17.0.3-oracle. Clean backport.
20-12-2021

Changeset: 110e38de Author: Coleen Phillimore <coleenp@openjdk.org> Date: 2021-10-11 12:11:11 +0000 URL: https://git.openjdk.java.net/jdk/commit/110e38ded8e09361f24c582c770d35f5cfdabf82
11-10-2021

The ZGC crash that Stefan found is that there's this class in the dumptime_table, which is not loaded: adding com/sun/net/httpserver/HttpExchange to the table 0x0000000800c01c20 The ZGC parameters cause class unloading but this class isn't cleaned up because it's not also in ClassLoaderData::_klasses array.
07-10-2021

Yes, I think the smallest fix to backport to JDK 17 would be to use OopHandles.
07-10-2021

The problem is as [~stefank] described. The HandleMark wipes out the contents of the handles stored in _loaded_cld_handles. This can be observed by setting a memory breakpoint on the contents of the Handle: class CollectCLDClosure : public CLDClosure { void do_cld(ClassLoaderData* cld) { if (!cld->is_unloading()) { _loaded_cld.append(cld); _loaded_cld_handles.append(Handle(_current_thread, cld->holder_phantom())); <<<< } is executed inside a HandleMark, so the Handles will be deleted, so we can't keep the loaders alive. 0 __memset_sse2 << memory break point hit 1 HandleMark::~HandleMark 2 ClassLoaderDataGraphIterator::~ClassLoaderDataGraphIterator 3 ClassLoaderDataGraph::loaded_cld_do 4 MetaspaceShared::link_shared_classes 5 MetaspaceShared::preload_and_dump_impl 6 MetaspaceShared::preload_and_dump ================ The problem can also be verified with this patch I can verify the problem with this patch: =============== diff --git a/src/hotspot/share/cds/metaspaceShared.cpp b/src/hotspot/share/cds/metaspaceShared.cpp index a884cbd0c00..67c59f77118 100644 --- a/src/hotspot/share/cds/metaspaceShared.cpp +++ b/src/hotspot/share/cds/metaspaceShared.cpp @@ -588,11 +588,16 @@ class CollectCLDClosure : public CLDClosure { Thread* _current_thread; public: CollectCLDClosure(Thread* thread) : _current_thread(thread) {} - ~CollectCLDClosure() { + void validate() { for (int i = 0; i < _loaded_cld.length(); i++) { ClassLoaderData* cld = _loaded_cld.at(i); + assert(!cld->is_unloading(), "must be still alive"); + assert(_loaded_cld_handles.at(i)() == cld->holder_phantom(), "must still be the same"); } } + ~CollectCLDClosure() { + validate(); + } void do_cld(ClassLoaderData* cld) { if (!cld->is_unloading()) { _loaded_cld.append(cld); @@ -655,6 +660,8 @@ void MetaspaceShared::link_shared_classes(TRAPS) { ClassLoaderDataGraph::loaded_cld_do(&collect_cld); } + collect_cld.validate(); + while (true) { bool has_linked = false; for (int i = 0; i < collect_cld.nof_cld(); i++) { =============== Instead, _loaded_cld_handles.at(i)() returns NULL. As [~stefank] suggested, The fix would be change the array to GrowableArray<OopHandle>, to avoid getting wiped out by the HandleMark.
06-10-2021

It's more like Low frequency, and workaround is to not use CDS, which is medium. This isn't a P1 at any rate.
06-10-2021

I know what the problem is. It's a misuse of Handles. Look at the following code: ``` void MetaspaceShared::link_shared_classes(TRAPS) { LambdaFormInvokers::regenerate_holder_classes(CHECK); // Collect all loaded ClassLoaderData. CollectCLDClosure collect_cld(THREAD); { // ClassLoaderDataGraph::loaded_cld_do requires ClassLoaderDataGraph_lock. // We cannot link the classes while holding this lock (or else we may run into deadlock). // Therefore, we need to first collect all the CLDs, and then link their classes after // releasing the lock. MutexLocker lock(ClassLoaderDataGraph_lock); ClassLoaderDataGraph::loaded_cld_do(&collect_cld); } ... <code assuming that the klass holders will be held alive + safepoints> ``` CollectCLDClosure visits all CLDs and creates a Handle for each holder object. These handles will be registered in the handle area _AND_ they are copied into an array of Handles. The last part is done to ensure that the holder objects are kept alive: GrowableArray<Handle> _loaded_cld_handles; // keep the CLDs alive ... void do_cld(ClassLoaderData* cld) { if (!cld->is_unloading()) { _loaded_cld.append(cld); _loaded_cld_handles.append(Handle(_current_thread, cld->holder_phantom())); } } However, this array doesn't help at all. The GC is not aware of the existence of this array. It is only handles in the handle area that the GC will find. And that's the problem. The code in ClassLoaderDataGraph::loaded_cld_do sets up a handle mark inside ClassLoaderDataGraphIterator. All holder objects gets registered in the handle area by the do_cld call above, then when we're done with the iterator, the handle mark is destructed, and all the added handles are invalidated (the GC will not find them). So, now we don't have anything that keeps the holder objects alive. Remember that the _loaded_cld_handles array isn't visted by the GC. In fact, that array now contains pointer into stale memory that could / will be reused by new Handles. I think the fix for this is to register OopHandles instead.
06-10-2021

I haven't been able to reproduce this bug with the test in the Bug. However, I can reproduce it with the test in JDK-8273505, which is the Bug that introduced the Handles array: make -C ../build/fastdebug/ test TEST="runtime/cds/appcds/loaderConstraints/DynamicLoaderConstraintsTest.java#default-cl" JTREG="JAVA_OPTIONS=-XX:+UseZGC -XX:ZCollectionInterval=0.01" Converting the array from using Handles to OopHandles seems to fix the problem. Unfortunately, that reproducer also crashes in another path: V [libjvm.so+0x76871e] ClassLoaderData::keep_alive() const+0xc V [libjvm.so+0x766f0e] ClassLoaderData::is_alive() const+0x18 V [libjvm.so+0x713ffc] Klass::is_loader_alive() const+0x20 V [libjvm.so+0x11624b6] IterateDumpTimeSharedClassTable::do_entry(InstanceKlass*, DumpTimeClassInfo&)+0x32 V [libjvm.so+0x1163bfa] void ResourceHashtableBase<FixedResourceHashtableStorage<15889u, InstanceKlass*, DumpTimeClassInfo>, InstanceKlass*, DumpTimeClassInfo, (ResourceObj::allocation_type)2, (MEMFLAGS)13, &(DumpTimeSharedClassTable_hash(InstanceKlass* const&)), &(bool primitive_equals<InstanceKlass*>(InstanceKlass* const&, InstanceKlass* const&))>::iterate<IterateDumpTimeSharedClassTable>(IterateDumpTimeSharedClassTable*) const+0x7c V [libjvm.so+0x115e714] SystemDictionaryShared::dumptime_classes_do(MetaspaceClosure*)+0x4a V [libjvm.so+0x8e4798] DynamicArchiveBuilder::iterate_roots(MetaspaceClosure*, bool)+0x32 V [libjvm.so+0x4f36af] ArchiveBuilder::gather_klasses_and_symbols()+0x83 V [libjvm.so+0x4f431d] ArchiveBuilder::gather_source_objs()+0x4f V [libjvm.so+0x8e449d] DynamicArchiveBuilder::doit()+0x71 V [libjvm.so+0x8e4937] VM_PopulateDynamicDumpSharedSpace::doit()+0x7f When crashing, we have the following address in the register: 0x0000000800c01a0 which sort-of matches one of the unloaded classes: Classes unloaded (2 events): Event: 0.649 Thread 0x00007f1d2c06de80 Unloading class 0x0000000800c01a00 'MyHttpHandler' Event: 0.649 Thread 0x00007f1d2c06de80 Unloading class 0x0000000800c01800 'LoaderConstraintsApp' I think we need a new Bug for this crash.
06-10-2021

If GC doesn't know about this Handle array, how does GC know about local variable Handles? Handles are pointers to the oop, which should be fixed by GC in Thread::handle_area, so the pointers should remain good.
06-10-2021

Both this and JDK-8274338 use the Handles to keep CLD->phantom_holder() alive so that GC won't unload these classes, but it still unloads the class. I don't see an intervening HandleMark in either of these cases. This only happens for ZGC, so I'm going to reassign to gc to have a look.
05-10-2021

ILW = HLM = P3
05-10-2021