JDK-8218851 : JVM crash in custom classloader stress test, JDK 12 & 13
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 12,13
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • CPU: x86
  • Submitted: 2019-02-12
  • Updated: 2020-09-03
  • Resolved: 2019-02-15
JDK 11: 11.0.10-oracle (Fixed)
JDK 13: 13 b09 (Fixed)
Description
JVM is crashing when running a classloading and unloading stress program on JDK 12-b30, and JDK 13-b7 on Linux configured to use ZGC. Have been able to reproduce the crash on a 2 socket Westmere EP, but not on a single socket Sandy Bridge. An hs*err.log from a fast debug JVM built against the JDK 13 repo is attached.

The stress program runs a random number of threads, between 1 and the number of hardware threads on the machine. Each thread randomly selects one of 15 jar files and loads all classes from that jar file using a custom classloader. There is one custom classloader class for each of the 15 jar files, so there are 15 custom classloader classes. Each thread creates an instance of the custom classloader that maps to the selected jar and uses it to load all classes in that jar. There can be multiple custom classloader instances loading classes from the same jar file.
After each class is loaded, the program asks for the list of methods in the class that has just been loaded from the jar file. The custom classloader instance, the names of the classes loaded, and the class method names are all saved for potential use later. If a class (or dependent class) has already been loaded, identified by a LinkageError, a count is incremented for the jar file indicating how many classes had already been loaded. This count is also preserved. It is not ideal that LinkageErrors can occur.
Once all classes in a jar file are loaded by a custom classloader, the thread passes back the class names, their method names and the number of classes already loaded to the main thread. The main thread randomly selects a set of loaded classes to be preserved for a time period between 30 seconds and 3 minutes. Those that are selected are given to an "aging thread". The aging thread takes the set of loaded classes and attempts to create object instances of them. The aging thread continues to do this until the "preserved time" has been exceeded. Once the "preserved time" has been exceeded, the custom classloader instance, the classes it loaded, and the newly allocated instances are released and become eligible for garbage collection.
All the above continues to run for up to 6 hours.
Comments
Fix request (11u) -- will label after testing completed. I would like to downport this for parity with 11.0.10-oracle. Applies clean.
31-08-2020

I had a second crash from this program:
  # Internal Error (/scratch/coleen/hg/13zgc-constraints/open/src/hotspot/share/classfile/classLoaderData.inline.hpp:37), pid=14164, tid=14234
  # assert(_holder.is_null() || holder_no_keepalive() != __null) failed: This class loader data holder must be alive
  #
From:
  #36 Klass::class_loader (this=this@entry=0x7fb5ac54ab18) at /scratch/coleen/hg/13zgc-constraints/open/src/hotspot/share/oops/klass.cpp:675
  #37 0x00007fb89f0f83ed in SystemDictionary::check_constraints (d_hash=d_hash@entry=1507211930, k=k@entry=0x7fb5ace18a18, class_loader=..., defining=defining@entry=true, __the_thread__=__the_thread__@entry=0x7fb8987e6540) at /scratch/coleen/hg/13zgc-constraints/open/src/hotspot/share/classfile/systemDictionary.cpp:2117
  #38 0x00007fb89f0f86c8 in SystemDictionary::define_instance_class (k=k@entry=0x7fb5ace18a18,
One of the threads from the unloaded klass crash:
  #4 0x00007fb89f0f1264 in MutexLockerEx::MutexLockerEx (no_safepoint_check=false, mutex=<optimized out>, this=<synthetic pointer>) at /scratch/coleen/hg/13zgc-constraints/open/src/hotspot/share/runtime/mutexLocker.hpp:231
  #5 SystemDictionary::do_unloading (gc_timer=0x7fb89fb846a0 <ZStatPhase::_timer>) at /scratch/coleen/hg/13zgc-constraints/open/src/hotspot/share/classfile/systemDictionary.cpp:1829
This shows that the GC was waiting for the SystemDictionary_lock to clean the loader constraint table.
15-02-2019

  [3968.301s][info][class,loader,constraints] constraint check failed for name org/apache/cassandra/cql3/statements/BatchStatement, loader com.oracle.gc.classloaderstress.ApacheCassandraCustomClassLoader @6fa1725f: the presented class object differs from that stored
  [3968.301s][info][class,loader,constraints] class is being loaded org/apache/cassandra/cql3/statements/BatchStatement
Added the last line from here:
  if (!p->klass()->is_loaded()) {
    ResourceMark rm;
    log_info(class, loader, constraints)("class is being loaded %s", p->klass()->name()->as_C_string());
    // Only return fully loaded classes. Classes found through the
    // constraints might still be in the process of loading.
    return NULL;
  }
in find_constrained_klass to confirm Stefan's diagnosis.
13-02-2019

Here is the ILW evaluation:
Impact: High - crash
Likelihood: Low
  - on a 2 socket Westmere EP, but not on a single socket Sandy Bridge
  - ZGC only
  - Only reproduces in stress mode, no deterministic reproducer
Workaround: Medium - fix or avoid the class loader constraint condition
ILW: HLM =====>> P3
12-02-2019

Yeah, the SystemDictionary_lock was the first thing I checked for.
12-02-2019

Could just be a missing null check from JDK-8199852 which was fixed in 11. I was concerned with stale information in the loader constraint table, if we'd unloaded the klass, but we don't set the state from loaded+ to unloaded when unloading (good!)
12-02-2019

We are trying to reproduce without ZGC. No other GC does concurrent class unloading. The failing code is protected by holding the SystemDictionary lock, so it's not apparent why concurrent class unloading would provoke this.
12-02-2019

Questions: can it be reproduced without ZGC? Are any of the other GC threads doing concurrent class unloading?
12-02-2019

As far as we know, this isn't ZGC specific.
12-02-2019

Log of how this was narrowed down:
Important pieces from the hs_err file:
We crash when dereferencing 0x0000000000000098
  siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000098
In Klass::class_loader:
  # V [libjvm.so+0xa29091] Klass::class_loader() const+0x1
Called from:
  SystemDictionary::define_instance_class(InstanceKlass*, Thread*)
Klass::class_loader() is implemented with:
  oop Klass::class_loader() const { return class_loader_data()->class_loader(); }
  ClassLoaderData* class_loader_data() const { return _class_loader_data; }
And gdb confirms that we crash because a Klass* is NULL and therefore the address of Klass::_class_loader_data is 0x0000000000000098:
  (gdb) p &((Klass*)0)->_class_loader_data
  $6 = (ClassLoaderData **) 0x98
There are at least three places inlinable in define_instance_class where we call Klass::class_loader(). There is a hint in the hs_err file showing us which call is the most likely:
  stack at sp + 4 slots: 0x00007fa03c18bba0 points into unknown readable memory: 6c 6f 61 64 65 72 20 63
Decoding this gives the beginning of a string that says "loader c":
  (gdb) p /c {0x6c, 0x6f, 0x61, 0x64, 0x65, 0x72, 0x20, 0x63}
  $3 = {108 'l', 111 'o', 97 'a', 100 'd', 101 'e', 114 'r', 32 ' ', 99 'c'}
The most likely call is then this (which has a string starting with "loader c"):
  if (throwException == false) {
    if (constraints()->check_or_update(k, class_loader, name) == false) {
      throwException = true;
      ss.print("loader constraint violation: loader %s", loader_data->loader_name_and_id());
      ss.print(" wants to load %s %s.", k->external_kind(), k->external_name());
      Klass *existing_klass = constraints()->find_constrained_klass(name, class_loader);
      if (existing_klass->class_loader() != class_loader()) {
and we crash because find_constrained_klass has returned NULL.
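The two gdb steps above (the fault address of a field read through a NULL object pointer, and the decoding of the stack bytes) can be reproduced outside the debugger. The following is a minimal sketch, not HotSpot code: FakeKlass, fault_address_for_null_this, decode_ascii and self_check are hypothetical names, and the 0x98 bytes of padding are an assumption chosen only so the field lands at the offset gdb reported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for Klass. The padding size is an assumption picked
// so that the field sits at offset 0x98, matching the gdb output above.
struct FakeKlass {
    unsigned char padding[0x98];
    void* class_loader_data;
};

// Address that a read of class_loader_data touches when the object pointer
// is NULL: base address 0 plus the field offset.
inline std::uintptr_t fault_address_for_null_this() {
    return static_cast<std::uintptr_t>(offsetof(FakeKlass, class_loader_data));
}

// Convert raw bytes from a stack dump into text; non-printable bytes are
// shown as '.' so the result is always safe to print.
inline std::string decode_ascii(const std::vector<unsigned char>& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        out.push_back((b >= 0x20 && b < 0x7f) ? static_cast<char>(b) : '.');
    }
    return out;
}

// Usage: checks both observations from the hs_err analysis.
inline void self_check() {
    assert(fault_address_for_null_this() == 0x98);
    const std::vector<unsigned char> stack_bytes =
        {0x6c, 0x6f, 0x61, 0x64, 0x65, 0x72, 0x20, 0x63};
    assert(decode_ascii(stack_bytes) == "loader c");
}
```

A si_addr that is a small constant such as 0x98 is therefore a strong hint that a member was read through a NULL pointer, and the offset identifies which member.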
We know that in order to reach that code, check_or_update must have returned false:
  bool LoaderConstraintTable::check_or_update(InstanceKlass* k, Handle loader, Symbol* name) {
    LogTarget(Info, class, loader, constraints) lt;
    LoaderConstraintEntry* p = *(find_loader_constraint(name, loader));
    if (p && p->klass() != NULL && p->klass() != k) {
      if (lt.is_enabled()) {
        ResourceMark rm;
        lt.print("constraint check failed for name %s, loader %s: "
                 "the presented class object differs from that stored",
                 name->as_C_string(),
                 ClassLoaderData::class_loader_data(loader())->loader_name_and_id());
      }
      return false;
    }
So, p != NULL and p->klass() != NULL, but p->klass() isn't the Klass* we were looking for. With that knowledge, if we look in find_constrained_klass we see:
  InstanceKlass* LoaderConstraintTable::find_constrained_klass(Symbol* name, Handle loader) {
    LoaderConstraintEntry *p = *(find_loader_constraint(name, loader));
    if (p != NULL && p->klass() != NULL) {
      assert(p->klass()->is_instance_klass(), "sanity");
      if (!p->klass()->is_loaded()) {
        // Only return fully loaded classes. Classes found through the
        // constraints might still be in the process of loading.
        return NULL;
      }
      return p->klass();
    }
    // No constraints, or else no klass loaded yet.
    return NULL;
  }
and can conclude that we received NULL because the Klass wasn't "is loaded":
  if (!p->klass()->is_loaded()) {
    // Only return fully loaded classes. Classes found through the
    // constraints might still be in the process of loading.
    return NULL;
  }
12-02-2019
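
The reasoning in the 12-02-2019 analysis can be modeled in isolation: check_or_update fails because the table holds a different klass, yet find_constrained_klass returns NULL because that klass is still loading, so the caller must not dereference the result. The sketch below uses hypothetical, heavily simplified stand-ins (MockKlass, MockConstraintEntry, describe_violation) and omits locking, hashing and the update half of check_or_update. The null guard is one possible way to avoid the crash and is not necessarily the change that was committed for this bug.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for InstanceKlass: only the defining loader's name
// and the "fully loaded" flag are modeled.
struct MockKlass {
    std::string loader_name;
    bool loaded;
    bool is_loaded() const { return loaded; }
    const std::string& class_loader() const { return loader_name; }
};

// Hypothetical stand-in for LoaderConstraintEntry.
struct MockConstraintEntry {
    MockKlass* klass;
};

// Mirrors the failing condition quoted above: the check fails when the
// entry already records a klass that differs from the presented one.
inline bool check_or_update(const MockConstraintEntry* p, const MockKlass* k) {
    if (p != nullptr && p->klass != nullptr && p->klass != k) {
        return false;
    }
    return true;
}

// Mirrors find_constrained_klass: a klass that is still being loaded is
// reported as nullptr.
inline MockKlass* find_constrained_klass(const MockConstraintEntry* p) {
    if (p != nullptr && p->klass != nullptr) {
        if (!p->klass->is_loaded()) {
            return nullptr;
        }
        return p->klass;
    }
    return nullptr;
}

// Builds a violation message. The nullptr test before class_loader() is the
// illustrative guard; the message wording is an assumption.
inline std::string describe_violation(const MockConstraintEntry* p,
                                      const std::string& loader) {
    std::string msg = "loader constraint violation: loader " + loader;
    MockKlass* existing = find_constrained_klass(p);
    if (existing == nullptr) {
        msg += " (existing class is still being loaded)";
    } else if (existing->class_loader() != loader) {
        msg += " (existing class defined by " + existing->class_loader() + ")";
    }
    return msg;
}
```

With an entry whose klass has loaded set to false and a different presented klass, check_or_update returns false while find_constrained_klass returns nullptr, which is the combination that led to the SIGSEGV in the unguarded code.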