JDK-8156137 : SIGSEGV in ReceiverTypeData::clean_weak_klass_links

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 9

Priority: P2
Status: Closed
Resolution: Fixed

Submitted: 2016-05-05
Updated: 2023-07-21
Resolved: 2016-08-31

Versions (Unresolved/Resolved/Fixed)

JDK 8	JDK 9	Other
8u162 (Fixed)	9 b138 (Fixed)	emb-8u181 (Fixed)

Related Reports

Duplicate :	JDK-8160469 - SIGSEGV in ReceiverTypeData::clean_weak_klass_links
Duplicate :	JDK-8187945 - frequent crashes ciObjectFactory::create_new_metadata
Duplicate :	JDK-8165950 - SIGSEGV in ReceiverTypeData::receiver(unsigned int)
Duplicate :	JDK-8169330 - SIGSEGV at PSParallelCompact::IsAliveClosure::do_object_b
Duplicate :	JDK-8143237 - EAV in compiled code
Relates :	JDK-8015837 - Nashorn crashes with tiered on x86 when running v8 benchmark
Relates :	JDK-8241653 - VM crashing regularly, libjvm.so Klass::is_loader_alive, G1ParallelCleaningTask::work
Relates :	JDK-8164692 - InstanceKlass::_previous_version_count goes negative
Relates :	JDK-8165950 - SIGSEGV in ReceiverTypeData::receiver(unsigned int)

Description

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5b19bda4f6, pid=22620, tid=22644
#
# JRE version: Java(TM) SE Runtime Environment (9.0) (fastdebug build 9-internal+0-2016-05-04-223150.vkozlov.8155162)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 9-internal+0-2016-05-04-223150.vkozlov.8155162, compiled mode, tiered, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x10ca4f6]  ReceiverTypeData::clean_weak_klass_links(BoolObjectClosure*)+0x226
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e" (or dumping to /scratch/home/aurora/sandbox/results/kitchensink/core.22620)
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

Current thread (0x00007f5b1408e000):  GCTaskThread "GC Thread#15" [stack: 0x00007f5af71f2000,0x00007f5af72f3000] [id=22644]

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x10ca4f6]  ReceiverTypeData::clean_weak_klass_links(BoolObjectClosure*)+0x226
V  [libjvm.so+0x10ca5ad]  VirtualCallData::clean_weak_klass_links(BoolObjectClosure*)+0x1d
V  [libjvm.so+0x10d5c35]  MethodData::clean_method_data(BoolObjectClosure*)+0x115
V  [libjvm.so+0xc5ea31]  InstanceKlass::clean_method_data(BoolObjectClosure*)+0x41
V  [libjvm.so+0xc5ea78]  InstanceKlass::clean_weak_instanceklass_links(BoolObjectClosure*)+0x28
V  [libjvm.so+0xad2439]  G1ParallelCleaningTask::work(unsigned int)+0x4c9
V  [libjvm.so+0x1510ed0]  GangWorker::loop()+0xe0
V  [libjvm.so+0x118e482]  java_start(Thread*)+0x112

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

Comments

verified by nightly testing
26-07-2017
URL: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/882e8cda60b3 User: lana Date: 2016-09-28 20:43:13 +0000
28-09-2016
URL: http://hg.openjdk.java.net/jdk9/hs-comp/hotspot/rev/882e8cda60b3 User: dlong Date: 2016-08-31 21:04:40 +0000
31-08-2016
I tested with a fix for _previous_version_count, but put in an artificial delay for classes to move from the previous versions list to the deallocate list (simulating on_stack metadata), and I was able to get the same crash, so this is evidence that JDK-8164692 is a separate (but contributing) bug. I'm testing my proposed fix, which is to process previous versions in Klass::clean_weak_klass_links().
27-08-2016
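The fix proposed in the comment above (processing previous versions while cleaning weak klass links) can be pictured with a small sketch. This is a toy model, not the HotSpot change: VersionedKlass, previous_version, profile_cleaned, and clean_with_previous_versions are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: every name below is hypothetical, not a HotSpot API.
struct VersionedKlass {
    bool profile_cleaned = false;               // stand-in for "MethodData was cleaned"
    VersionedKlass* previous_version = nullptr; // older redefined versions, newest first
};

// Cleans the given class and every previous version reachable from it,
// so that redefined versions are not skipped by the walk.
// Returns the number of versions visited.
inline std::size_t clean_with_previous_versions(VersionedKlass* k) {
    assert(k != nullptr);
    std::size_t visited = 0;
    for (VersionedKlass* v = k; v != nullptr; v = v->previous_version) {
        v->profile_cleaned = true;
        ++visited;
    }
    return visited;
}
```

In this model a walk that starts at the current class version also reaches versions that are still "on_stack", which is the case the comment is worried about.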
I suspect the problem with InstanceKlass::_previous_version_count may be the root cause. If the count is wrong, then ClassLoaderDataGraph::do_unloading() may skip calling InstanceKlass::purge_previous_versions(), allowing those scratch classes to be seen by ClassLoaderDataGraphKlassIteratorAtomic and G1ConcurrentMark, as explained above. While fixing _previous_version_count should get unused previous versions cleaned up in a timely manner, I suspect there is still a problem with any previous versions that are still "on_stack". If there is dead metadata in the MethodData for an on_stack previous version, then both G1MarkSweep and G1ConcurrentMark need to agree to clean it or ignore it.
25-08-2016
The MethodData doesn't contain a scratch class. The scratch class has is_old methods with MethodData pointing to metadata that has become unloaded. However, if the scratch class is removed from the deallocate list at G1MarkSweep time, then a later G1ConcurrentMark would never see it, so I guess my problem scenario must be wrong. Let me see if I can figure out the real problem path.
24-08-2016
The scratch class shouldn't be publicized to the rest of the system. How did the MethodData get a scratch class in it? There are checks in the compiler to make sure MethodData doesn't add methods that are old (is_old) to MethodData after the redefinition is finished, so the scratch class should certainly not be in the MethodData. The lifetime of a scratch class is only during the redefinition.
23-08-2016
After further investigation, I think I've found the cause. We have a scratch class S thanks to JVMTI RedefineClasses. Class S has MethodData that references a class U that is going to be unloaded. G1 handles this in two different ways. G1ConcurrentMark uses ClassLoaderDataGraphKlassIteratorAtomic to iterate over all classes and calls clean_weak_instanceklass_links. G1MarkSweep (and other GCs) use Klass::clean_weak_klass_links() to iterate over live, non-scratch classes, and call clean_weak_instanceklass_links. Now the problem scenario is:
1. G1MarkSweep skips class S in clean_weak_klass_links, because scratch classes are not added to the class hierarchy tree. The full GC then frees the metadata for class U. Now the MethodData for S contains stale metadata.
2. When a later G1ConcurrentMark calls clean_weak_instanceklass_links on S, it will crash on the stale metadata.
I couldn't find any recent changes to explain why this is happening now. I'm not a GC/runtime expert, so I don't know if the difference between G1MarkSweep and G1ConcurrentMark regarding scratch classes is by design or not. Assigning to gc.
23-08-2016
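The two-step scenario in the comment above can be replayed with a toy model. Everything here is hypothetical (ModelKlass, clean_rows, clean_hierarchy_only, clean_all, zap) and only mirrors the shape of the problem. "Freeing" is simulated by overwriting a header word with the zap pattern quoted in this report, so the sketch itself never touches freed memory.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model only: every name below is hypothetical, not a HotSpot API.
constexpr std::uint64_t kZapPattern = 0xbaadfadebaadfadeULL; // zap value quoted in this report

struct ModelKlass {
    std::uint64_t header = 1;            // becomes kZapPattern once the metadata is "freed"
    bool loader_alive = true;
    bool scratch = false;                // scratch classes are not in the hierarchy tree
    std::vector<ModelKlass*> receivers;  // stand-in for MethodData receiver rows
};

// Clears rows whose receiver has a dead loader. Returns false when a row
// already points at zapped metadata, which is where the real VM would crash.
inline bool clean_rows(ModelKlass& k) {
    for (ModelKlass*& row : k.receivers) {
        if (row == nullptr) continue;
        if (row->header == kZapPattern) return false;
        if (!row->loader_alive) row = nullptr;
    }
    return true;
}

// Walk in the style the comment attributes to G1MarkSweep: scratch classes are skipped.
inline bool clean_hierarchy_only(const std::vector<ModelKlass*>& all) {
    bool ok = true;
    for (ModelKlass* k : all) {
        assert(k != nullptr);
        if (!k->scratch) ok = clean_rows(*k) && ok;
    }
    return ok;
}

// Walk in the style the comment attributes to G1ConcurrentMark: every class is visited.
inline bool clean_all(const std::vector<ModelKlass*>& all) {
    bool ok = true;
    for (ModelKlass* k : all) {
        assert(k != nullptr);
        ok = clean_rows(*k) && ok;
    }
    return ok;
}

// Simulates freeing metadata in a fastdebug build by overwriting the header.
inline void zap(ModelKlass& k) { k.header = kZapPattern; }
```

With a scratch class S and a normal class N that both reference U, calling clean_hierarchy_only clears N's row but leaves S's row in place. After zap(U), clean_all reports the stale row in S.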
[~coleenp] In a recent crash, static InstanceKlass::_previous_version_count was -20. It looks like nothing fatal can happen if this count is wrong, but it appears that the accounting is wrong. 1) We never decrement it when we purge a previous version, and 2) we can do a decrement without an increment in add_previous_version().
19-08-2016
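The two accounting problems named in the comment above (no decrement on purge, and a decrement without a matching increment) both come from updating the counter away from the list operations. A minimal sketch of balanced accounting follows, assuming a hypothetical PreviousVersionList class that is not the HotSpot data structure.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical sketch, not the HotSpot data structure: the counter changes
// only inside the two functions that change the list.
class PreviousVersionList {
public:
    void add(int version_id, bool on_stack) {
        entries_.push_back(Entry{version_id, on_stack});
        ++count_;                       // one increment per real insertion
    }

    // Drops versions that are no longer on any stack; returns how many were dropped.
    std::size_t purge() {
        std::size_t purged = 0;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (!it->on_stack) {
                it = entries_.erase(it);
                --count_;               // one decrement per real removal
                ++purged;
            } else {
                ++it;
            }
        }
        // Invariant: the counter always equals the list length.
        assert(count_ == static_cast<int>(entries_.size()));
        return purged;
    }

    int count() const { return count_; }

private:
    struct Entry { int version_id; bool on_stack; };
    std::list<Entry> entries_;
    int count_ = 0;
};
```

Because every change to count_ sits next to the list mutation it describes, the counter in this sketch cannot go negative.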
This bug isn't a runtime bug, please reevaluate and see if it's a GC bug or a bug in MethodData?
18-08-2016
Question: What happens when we clean out MethodData? I think this should be changed to p->is_old(), where we clean out any methods that have been redefined.
    void VirtualCallData::clean_weak_method_links() {
      ReceiverTypeData::clean_weak_method_links();
      for (uint row = 0; row < method_row_limit(); row++) {
        Method* p = method(row);
        if (p != NULL && !p->on_stack()) {
          clear_method_row(row);
        }
      }
    }
17-08-2016
The deallocate_list only has klasses on it that were once "scratch_classes" during the redefinition. They should never be added to any MethodData. From this crash it looks like the klass pointer has been unloaded because it's 0xbaadfade (which is used to mangle the metadata). I was confused with clean_weak_method_links, which does clean methods that are not "on_stack" during redefinition. I might need this in class unloading under the walk_all_metadata conditional though, but this isn't related to this crash. You would see a stale Method pointer.
17-08-2016
From JDK-8143237:
RULE "bigapps/Kitchensink/modulePostprocess" Crash EXCEPTION_ACCESS_VIOLATION
RULE "bigapps/Kitchensink/stability" Crash EXCEPTION_ACCESS_VIOLATION
RULE "bigapps/Kitchensink/stressExitCode" Crash EXCEPTION_ACCESS_VIOLATION
12-08-2016
Yes, I'll take it. I think I moved this clean_weak_x_links because it was getting stale metadata or was inefficient some time ago.
12-08-2016
Yep, stale Klass pointer seems the most likely culprit.
11-08-2016
[~dnsimon] "Firstly, I think MethodProfileWidth should be moved from globals.hpp to jvmci_globals.hpp since it's only used by JVMCI code. I can create a separate bug for this if you want.": Sure, please do. "However, the crash here cannot be related to stale Method pointers in VirtualCallData..." I agree. I think it's a stale receiver Klass. But it looks like we should have the same problem without JVMCI.
11-08-2016
Upon further investigation, I think the `if (!this->is_VirtualCallData()) { ` block of code should simply be deleted. Otherwise, a class unload/redefinition event can turn a profile with non-complete coverage (i.e., `non_profiled_count != 0`) into a profile denoting complete coverage which in turn could cause Graal to generate overly-optimistic code. BTW, I just sent a mail to hotspot-compiler-dev describing other issues uncovered during our investigation: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-August/024030.html
11-08-2016
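The concern in the comment above is that resetting the non-profiled count during cleaning can make an incomplete profile look complete. A toy illustration, using hypothetical names (ToyTypeProfile, clean_row) and assuming that "complete coverage" simply means a non-profiled count of zero:

```cpp
#include <cassert>

// Toy model only: hypothetical names, not the MethodData layout.
struct ToyTypeProfile {
    int rows_in_use = 0;
    long nonprofiled_count = 0;  // receivers seen that did not fit in any row

    // Assumption for this sketch: a consumer treats a zero count as
    // "all receivers are recorded in the rows".
    bool claims_complete_coverage() const { return nonprofiled_count == 0; }
};

// Removes one row, optionally resetting the non-profiled counter as the
// quoted JVMCI block does.
inline void clean_row(ToyTypeProfile& p, bool reset_nonprofiled) {
    assert(p.rows_in_use >= 0);
    if (p.rows_in_use > 0) --p.rows_in_use;
    if (reset_nonprofiled) p.nonprofiled_count = 0;
}
```

Starting from a profile with a non-zero count, cleaning without the reset keeps the profile marked incomplete, while cleaning with the reset flips it to complete even though the unrecorded receivers were real.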
Firstly, I think MethodProfileWidth should be moved from globals.hpp to jvmci_globals.hpp since it's only used by JVMCI code. I can create a separate bug for this if you want. Next, I'm not sure about the `if (!this->is_VirtualCallData()) { ` logic. I'll ask the rest of the Graal team. However, the crash here cannot be related to stale Method pointers in VirtualCallData records since MethodProfileWidth is 0 by default and I don't see it being changed on the command line.
11-08-2016
If we redefine a class and there are methods on the stack, then we create a previous version of the class. If we redefine the class more than once, we can get more than one previous version. In purge_previous_versions() we try to clean up previous versions that are no longer on the stack. We put them on the class loader deallocate list, but we don't clean the weak klass links at the same time. Instead, GC will clean the weak klass links, if the mirror is no longer alive. There might be a disconnect here. The class loader deallocate list will get freed when GC next does class unloading. It's not clear that we are guaranteed to clean the MethodData before the Klass gets freed. ReceiverTypeData::clean_weak_klass_links() only cleans the row if the class loader is not alive, but we need to clean it if the Klass is redefined.
11-08-2016
[~dnsimon] We could guard the calls to update_mdp_by_constant() and profile_called_method() with EnableJVMCI like you suggest, but then wouldn't we still have a problem when EnableJVMCI is enabled?
11-08-2016
Re: SIGBUS, that is what we get on sparc, and is what I would expect if 0xbaadfadebaadfade is an unmapped address. The SEGV on address 0x0 with SI_KERNEL is the anomaly, but we can see it's the same problem because the register we are dereferencing has value 0xbaadfadebaadfade.
11-08-2016
The comment on 2016-06-08 02:05 lists rules for SIGBUS crashes. The analysis in JDK-8015837 seemed more substantial than the conjecture in JDK-8004124, but yes, there may be non-kernel-bug related reasons for getting SI_KERNEL. I just wanted to flag that. That said, the signal context does not show that the problem occurs trying to access baadfade, but:
    siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
11-08-2016
@David, I'm not sure which SIGBUS crashes you are referring to. According to 8004124, a non-canonical address can cause SI_KERNEL, which is consistent with trying to access 0xbaadfadebaadfade, which is what we store in freed metaspace.
11-08-2016
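The non-canonical address argument in the comment above can be checked directly. The sketch below assumes 48-bit virtual addresses (4-level paging on x86-64), where an address is canonical only if bits 63..47 are all zero or all one. The function name is_canonical_48 is made up for this example. As I understand it, a non-canonical access raises a general-protection fault rather than a page fault, which would fit a siginfo with SI_KERNEL and si_addr of 0.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 48-bit virtual addresses. Bits 63..47 (17 bits) must then be
// all zero (lower half) or all one (upper half) for a canonical address.
constexpr bool is_canonical_48(std::uint64_t addr) {
    return (addr >> 47) == 0 || (addr >> 47) == 0x1FFFF;
}

// The zap pattern from this report is not canonical under that assumption,
// while the stale Klass pointer from the gdb session is.
static_assert(!is_canonical_48(0xbaadfadebaadfadeULL), "zap pattern is non-canonical");
static_assert(is_canonical_48(0x00007f49bf0a2800ULL), "stale Klass* itself is canonical");
```

This matches the disassembly in the report: the load through the stale Klass pointer succeeds and yields the zap pattern, and the fault happens on the next load that uses that value as an address.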
The original crash shows SI_KERNEL so this may be JDK-8015837 rearing its head again. It is unclear whether later SIGBUS crashes are even related to the original issue.
10-08-2016
This also looks strange:
    #if INCLUDE_JVMCI
      if (!this->is_VirtualCallData()) {
        // if this is a ReceiverTypeData for JVMCI, the nonprofiled_count
        // must also be reset (see "Description of the different counters" above)
        set_nonprofiled_count(0);
      }
    #endif
If I read the comment right, then shouldn't that be "if (this->is_VirtualCallData())" instead?
10-08-2016
I looked at two recent crashes, and the problem InstanceKlass is java/io/RandomAccessFile, and the Method whose MethodData we are looking at has JVM_ACC_IS_OLD set. The _code field is 0. Because of this, I suspect that the problem is related to RedefineClasses. I see that when JVMCI is used, we do this in VirtualCallData::clean_weak_method_links():
    if (p != NULL && !p->on_stack()) {
      clear_method_row(row);
    }
So if the method is "on_stack", we skip the clear. Do we finish the clear at some other time? [~dnsimon] or [~never], can I assign this to one of you?
10-08-2016
[~dlong] well, it can. But it's a question to GC folks: I was told multiple times that it's safe to process dead InstanceKlasses w/o synchronization during GC.
27-07-2016
[~vlivanov] Could this be related to 8139595?
22-07-2016
    V  [libjvm.so+0xc5ea78]  InstanceKlass::clean_weak_instanceklass_links(BoolObjectClosure*)+0x28
    V  [libjvm.so+0xad2439]  G1ParallelCleaningTask::work(unsigned int)+0x4c9
Is this comment and code in InstanceKlass::clean_weak_instanceklass_links() a concern:
    // Since GC iterates InstanceKlasses sequentially, it is safe to remove stale entries here.
    DependencyContext dep_context(&_dep_context);
    dep_context.expunge_stale_entries();
Are we still processing InstanceKlasses sequentially from G1ParallelCleaningTask? Looks like the caller does this:
    // All workers will help cleaning the classes,
    InstanceKlass* klass;
    while ((klass = claim_next_klass()) != NULL) {
      clean_klass(klass);
    }
22-07-2016
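The claim-and-clean loop quoted in the comment above can be modelled with an atomic cursor. This is only a sketch with hypothetical names (ToyInstanceKlass, ToyClaimer, run_worker), not the G1ParallelCleaningTask code. It shows that each class is handed to exactly one worker, while different classes may be cleaned at the same time, so any state shared across classes would still need its own synchronization.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only: hypothetical names, not the HotSpot classes.
struct ToyInstanceKlass {
    int times_cleaned = 0;
};

class ToyClaimer {
public:
    explicit ToyClaimer(std::vector<ToyInstanceKlass*> klasses)
        : klasses_(std::move(klasses)) {}

    // Hands out each element at most once, even with several callers;
    // returns nullptr when nothing is left to claim.
    ToyInstanceKlass* claim_next() {
        const std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        return i < klasses_.size() ? klasses_[i] : nullptr;
    }

private:
    std::vector<ToyInstanceKlass*> klasses_;
    std::atomic<std::size_t> next_{0};
};

// Same shape as the quoted loop: claim, clean, repeat until exhausted.
inline void run_worker(ToyClaimer& claimer) {
    while (ToyInstanceKlass* k = claimer.claim_next()) {
        assert(k->times_cleaned == 0);  // a claimed class has not been cleaned before
        ++k->times_cleaned;
    }
}
```

Calling run_worker from several threads on one ToyClaimer would clean every class exactly once, but in no particular order.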
Correct. There is no jvmci tag in the hs_err output.
06-05-2016
[~never] pointed out that JVMCI is only in the path because VirtualCallData::clean_weak_klass_links overrides ReceiverTypeData::clean_weak_klass_links when INCLUDE_JVMCI is true. However, since EnableJVMCI is false, it cannot be related to JVMCI compilation.
06-05-2016
Verifying the Klass* in some way inside the profiling code might identify this problem. Maybe comparing the header against baadfade before putting it in the MDO?
06-05-2016
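The check suggested in the comment above (comparing the header against the zap pattern before recording a receiver) might look roughly like the following. All names are hypothetical (kZapWord, looks_unzapped, ToyReceiverRow, record_receiver), and the sketch assumes the first word of freed metadata holds the zap pattern, as in the gdb output in this report. It would only help in builds that zap freed metaspace.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy model only: hypothetical names, not HotSpot profiling code.
constexpr std::uint64_t kZapWord = 0xbaadfadebaadfadeULL; // zap value quoted in this report

// Returns true if the first word of the metadata block is not the zap pattern.
inline bool looks_unzapped(const void* metadata) {
    assert(metadata != nullptr);
    std::uint64_t first_word;
    std::memcpy(&first_word, metadata, sizeof first_word);
    return first_word != kZapWord;
}

struct ToyReceiverRow {
    const void* receiver = nullptr;
    std::uint64_t count = 0;
};

// Records the receiver only if it passes the zap check; returns whether it was recorded.
inline bool record_receiver(ToyReceiverRow& row, const void* klass) {
    if (klass == nullptr || !looks_unzapped(klass)) return false;
    row.receiver = klass;
    ++row.count;
    return true;
}
```

A rejected receiver in such a check would point at the code path that produced the stale Klass pointer, instead of the later GC cleaning pass where the crash shows up.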
I'm assuming the 1 in the next cell is the receiver count? It could be that a stale oop created a new entry after its Klass* was unloaded. If the oop didn't cause any other crashes along the way and was dead at the next GC, the crash would look like this.
06-05-2016
The problem which causes the crash is that a VirtualCallData entry contains a stale Klass pointer as a receiver type:
    (gdb) f 10
    #10 ReceiverTypeData::clean_weak_klass_links (this=this@entry=0x7f6dc405b090, is_alive_cl=is_alive_cl@entry=0x7f4c1df13630) at /opt/jprt/T/P1/153744.tohartma/s/hotspot/src/share/vm/oops/methodData.cpp:408
    408 Klass* p = receiver(row);
    (gdb) p p
    $20 = (Klass *) 0x7f49bf0a2800
    (gdb) x p
    0x7f49bf0a2800: 0xbaadfadebaadfade
    (gdb) p this
    $21 = (ReceiverTypeData * const) 0x7f6dc405b090
    (gdb) p *this
    $23 = {<CounterData> = {<BitData> = {<ProfileData> = {<ResourceObj> = {<AllocatedObj> = { _vptr.AllocatedObj = 0x7f6dcd99cf50 <vtable for VirtualCallData+16>}, _allocation_t = {18446603964292681582, 0}}, _data = 0x7f49bf0b2168}, <No data fields>}, <No data fields>}, <No data fields>}
    (gdb) x/16gx this->_data->_cells
    0x7f49bf0b2170: 0x0000000000000000 0x0000000000000000
    0x7f49bf0b2180: 0x00007f49bf0a2800 0x0000000000000001
    0x7f49bf0b2190: 0x0000000000000000 0x0000000000000000
    0x7f49bf0b21a0: 0x0000000000060005 0x0000000000000000
    0x7f49bf0b21b0: 0x0000000000000000 0x00007f49bf0a2800
    0x7f49bf0b21c0: 0x0000000000000001 0x0000000000000000
    0x7f49bf0b21d0: 0x0000000000000000 0x00000000000d0007
    0x7f49bf0b21e0: 0x0000000000000001 0x0000000000000030
06-05-2016
As far as I can tell, JVMCI is not enabled in the VM (I don't see -XX:+EnableJVMCI in http://aurora.ru.oracle.com/slot-gw/1444745.JAVASE.NIGHTLY.VM.Comp_Baseline-Tiered.2016-05-04-3/results/kitchensink/hs_err_pid22620.log). However, this code in interp_masm_<cpu>.cpp does not test EnableJVMCI:
    // The method data pointer needs to be updated to reflect the new target.
    #if INCLUDE_JVMCI
      if (MethodProfileWidth == 0) {
        update_mdp_by_constant(mdp, in_bytes(VirtualCallData::virtual_call_data_size()));
      }
    #else // INCLUDE_JVMCI
      update_mdp_by_constant(mdp, in_bytes(VirtualCallData::virtual_call_data_size()));
    #endif // INCLUDE_JVMCI
Maybe that should be something like this instead:
    #if INCLUDE_JVMCI
      if (EnableJVMCI && MethodProfileWidth == 0) {
        update_mdp_by_constant(mdp, in_bytes(VirtualCallData::virtual_call_data_size()));
      } else {
    #endif // INCLUDE_JVMCI
        update_mdp_by_constant(mdp, in_bytes(VirtualCallData::virtual_call_data_size()));
    #if INCLUDE_JVMCI
      }
    #endif // INCLUDE_JVMCI
Also, all calls to InterpreterMacroAssembler::profile_called_method probably need a similar guard.
06-05-2016
ILW = HLH = P2
I = H/M? = crash w/ fastdebug (due to metaspace zapping); possibly, a crash in product build as well
L = L = 2 crashes were observed
W = H? = no workaround is known
05-05-2016
Doug, can you look at it?
05-05-2016
Thanks, Stefan.
05-05-2016
This needs to be evaluated by the compiler team. We crash because one of the compiler data structures has a pointer to something that probably has been unloaded. This usually means that some part of the compiler is using Klass* without keeping a pointer to the mirror. There have been a few non-gc related crashes like this recently. One was caused by synchronized, parallel cleaning of the MethodData from compiler related threads. Another was interfaces that received a Klass*, but the mirror was not explicitly kept alive. Note also that the ReceiverTypeData, VirtualCallData, and MethodData are not used by the GCs; we only clean these data types.
05-05-2016
Giving to GC team for evaluation. The crash happens during parallel processing in one of the GC threads when it dereferences a dead Klass pointer. No signs of class unloading before the crash (the last 10 GCs didn't unload any classes). A problem with parallel processing?
05-05-2016
Disassembly around the crash site:
    <+538>: je 0x7ffff7180480 <_ZN16ReceiverTypeData22clean_weak_klass_linksEP17BoolObjectClosure+432>
    <+540>: mov (%r8),%rax
    <+543>: lea -0xb49916(%rip),%rcx # 0x7ffff6636be0 <_ZNVK5Klass8is_klassEv>
    <+550>: mov 0x8(%rax),%rax <===
    <+554>: cmp %rcx,%rax
    <+557>: je 0x7ffff7180300 <_ZN16ReceiverTypeData22clean_weak_klass_linksEP17BoolObjectClosure+48>
RAX=0xbaadfadebaadfade
05-05-2016
Filed against hotspot/compiler for initial evaluation: there's a JVMCI-specific method on the stack during the crash:
    #if INCLUDE_JVMCI
    void VirtualCallData::clean_weak_klass_links(BoolObjectClosure* is_alive_cl) {
    ...
    }
But it can be a GC problem as well; RAX is loaded from zapped metadata:
    RAX=0xbaadfadebaadfade is an unknown value
There are no signs of previous class unloading in the log:
    Event: 587.244 GC heap before
    {Heap before GC invocations=307 (full 2):
    ...
    Metaspace used 45438K, capacity 47058K, committed 57344K, reserved 57344K
    }
    Event: 587.362 GC heap after
    {Heap after GC invocations=308 (full 2):
    ...
    Metaspace used 45438K, capacity 47058K, committed 57344K, reserved 57344K
    }
    ...
    Event: 587.721 GC heap before
    {Heap before GC invocations=311 (full 2):
    ...
    Metaspace used 45438K, capacity 47058K, committed 57344K, reserved 57344K
    }
    Event: 587.822 GC heap after
    {Heap after GC invocations=312 (full 2):
    ...
    Metaspace used 45438K, capacity 47058K, committed 57344K, reserved 57344K
    }
    ...
    Event: 587.969 Executing VM operation: CGC_Operation
05-05-2016