JDK-8235305 : Corrupted oops embedded in nmethods due to parallel modification during optional evacuation
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 12,13,14
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • CPU: x86_64
  • Submitted: 2019-12-04
  • Updated: 2021-10-05
  • Resolved: 2020-01-22
Fix Versions:
  • JDK 14: Fixed in 14 b33
  • JDK 15: Fixed in 15
Description
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffdd35e8986, pid=13168, tid=16240
#
# JRE version: Java(TM) SE Runtime Environment (14.0+26) (build 14-ea+26-1199)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14-ea+26-1199, mixed mode, tiered, g1 gc, windows-amd64)
# Problematic frame:
# V  [jvm.dll+0x198986]  oopDesc::size_given_klass+0x6
#
# ...
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#

Comments
No Kitchensink (KS) failures caused by this bug in CI.
31-01-2020

URL: https://hg.openjdk.java.net/jdk/jdk14/rev/082f1d3eb164 User: tschatzl Date: 2020-01-22 09:00:49 +0000
22-01-2020

Fix request:

Risk evaluation: The problem is random crashes with abortable mixed GCs. The change removes a race in scanning GC roots from compiled code during the abortable mixed GC phase, a race that can result in mangled references embedded in that code; the GC may then crash on those at some later point. Given the complete understanding of the issue and thorough testing with and without a reproducer, I consider this change low risk. The reproducer that significantly increases the chance of the failure occurring involves significant modifications of the VM and is not suitable for inclusion.

Test coverage: Multiple hs-tier1-5 and 24h Kitchensink completions with and without the reproducer that makes the problem reproduce reasonably well (without the fix, the 24h Kitchensink run crashes after ~5 minutes with the reproducer); hs-tier1-8 with only known, unrelated issues.

Review thread: https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-January/028325.html ; [~stefank], [~sjohanss], [~eosterlund] either reviewed or contributed to the solution in various ways.

Webrev: http://cr.openjdk.java.net/~tschatzl/8235305/webrev.1/
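For illustration only, and not the actual patch linked in the webrev above: one general way to remove this kind of race is to have parallel GC workers atomically claim each nmethod before touching the oops embedded in it, so that exactly one worker performs the (non-atomic, possibly misaligned) updates. Below is a minimal, self-contained sketch of that claiming idea; FakeNmethod and process_code_roots are made-up names and the code has no relation to HotSpot's real nmethod processing.

  #include <atomic>
  #include <cstdio>
  #include <functional>
  #include <thread>
  #include <vector>

  // Stand-in for an nmethod with embedded oops; "oops_updated" just counts
  // how many workers touched it.
  struct FakeNmethod {
    std::atomic<bool> claimed{false};
    int oops_updated = 0;
  };

  // Every worker walks all code roots, but only the worker that wins the
  // claim actually updates the embedded oops, so the non-atomic (possibly
  // misaligned) stores are never performed concurrently.
  void process_code_roots(std::vector<FakeNmethod>& nmethods) {
    for (FakeNmethod& nm : nmethods) {
      bool expected = false;
      if (nm.claimed.compare_exchange_strong(expected, true)) {
        nm.oops_updated++;   // exclusive: no other worker can reach this line
      }
      // Losing workers skip the nmethod; the winner fixes its oops.
    }
  }

  int main() {
    std::vector<FakeNmethod> nmethods(100);
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++) {
      workers.emplace_back(process_code_roots, std::ref(nmethods));
    }
    for (std::thread& w : workers) {
      w.join();
    }
    for (const FakeNmethod& nm : nmethods) {
      if (nm.oops_updated != 1) {
        std::printf("race: an nmethod was processed %d times\n", nm.oops_updated);
        return 1;
      }
    }
    std::printf("each nmethod's embedded oops were updated by exactly one worker\n");
    return 0;
  }

The compare-and-swap on a per-nmethod flag guarantees a single writer regardless of how many root paths (optional remembered set entries, code root sets) lead to the same nmethod.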
20-01-2020

... and [~eosterlund]
13-01-2020

The problem is that during the optional evacuation it may happen that multiple threads try to change an embedded oop at the same time. Since the oop* may not be word aligned, the access is not atomic, so word tearing may occur, leaving garbage in that location of the code (e.g. for some follow-up GC to trip over).

I.e. one thread can execute scan_opt_rem_set_roots(r):

  G1OopStarChunkedList* opt_rem_set_list = _pss->oops_into_optional_region(r);
  ...
  _opt_refs_scanned += opt_rem_set_list->oops_do(&cl, _pss->closures()->strong_oops());

(the opt_rem_set may contain oop references into nmethods) while another thread executes:

  r->strong_code_roots_do(_pss->closures()->weak_codeblobs());

Thanks go to [~stefank] for finding this.
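Below is a minimal, self-contained sketch of the word tearing described above; it is not HotSpot code. It assumes a little-endian x86_64 layout; the byte buffer stands in for an nmethod's instruction stream, and the addresses and the misaligned offset are made up.

  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  int main() {
    // Pretend instruction bytes of an nmethod; the embedded oop sits at a
    // misaligned offset, so a pointer-sized access to it need not be atomic.
    alignas(8) unsigned char code[16] = {0};
    const int slot = 3;

    // Made-up addresses: the object's location before and after evacuation.
    uintptr_t old_oop = 0x00007f10aaaa0010ULL;
    uintptr_t new_oop = 0x00007f2abbbb4880ULL;

    std::memcpy(code + slot, &old_oop, sizeof(old_oop)); // slot holds the old oop

    // Worker A updates the slot, but the misaligned store is performed in
    // pieces; here only the low half (little-endian) has landed so far.
    std::memcpy(code + slot, &new_oop, 4);

    // Worker B reads the same slot at this instant (e.g. via the optional
    // remembered set list) and observes half old, half new pointer.
    uintptr_t torn;
    std::memcpy(&torn, code + slot, sizeof(torn));

    std::printf("old oop   0x%016llx\n", (unsigned long long)old_oop);
    std::printf("new oop   0x%016llx\n", (unsigned long long)new_oop);
    std::printf("torn read 0x%016llx  (neither value)\n", (unsigned long long)torn);
    return 0;
  }

The torn value points at neither the old nor the new copy of the object; treating it as an oop (or writing it back into the code for a later GC to read) is consistent with the crashes in oopDesc::size_given_klass seen below.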
13-01-2020

Reproduced with the latest JDK, and also with the change for JDK-8230706 backed out.
11-01-2020

I've been able to reproduce this with a build based on jdk-14 build 27 plus a small patch that makes more old regions optional during mixed GCs. I'm currently running builds with the same patch based on jdk-14 builds 28 & 29 to see if those also reproduce. With the latest JDK I haven't seen the failure yet, but the 24h run timed out, so I will have to restart it.
09-01-2020

Potentially an issue with JDK-8230706, as it was pushed before the observed failures. Stack trace from the second failure (and probably the first one, although incomplete):

V  [libjvm.so+0x94a243]  oopDesc::size_given_klass(Klass*)+0x2e3
V  [libjvm.so+0xb2afe4]  G1ParScanThreadState::copy_to_survivor_space(G1HeapRegionAttr, oop, markWord)+0x44
V  [libjvm.so+0xb5b6b0]  void G1ParCopyClosure<(G1Barrier)0, (G1Mark)0>::do_oop_work<oop>(oop*)+0x100
V  [libjvm.so+0xa992b5]  void G1CodeBlobClosure::HeapRegionGatheringOopClosure::do_oop_work<oop>(oop*)+0x25
V  [libjvm.so+0x125186e]  nmethod::oops_do(OopClosure*, bool)+0x1ae
V  [libjvm.so+0xa94de2]  G1NmethodProcessor::do_regular_processing(nmethod*)+0x22
V  [libjvm.so+0x125237c]  nmethod::oops_do_process_weak(nmethod::OopsDoProcessor*)+0x2c
V  [libjvm.so+0xa93a65]  G1CodeBlobClosure::do_code_blob(CodeBlob*)+0x55
V  [libjvm.so+0xaa0c7d]  G1CodeRootSet::nmethods_do(CodeBlobClosure*) const+0x4d
V  [libjvm.so+0xb4b5f7]  G1ScanCollectionSetRegionClosure::do_heap_region(HeapRegion*)+0xd7
V  [libjvm.so+0xabfe60]  G1CollectionSet::iterate_part_from(HeapRegionClosure*, HeapRegionClaimer*, unsigned long, unsigned long, unsigned int, unsigned int) const [clone .part.26]+0x100
V  [libjvm.so+0xac1447]  G1CollectionSet::iterate_incremental_part_from(HeapRegionClosure*, HeapRegionClaimer*, unsigned int, unsigned int) const+0x97
V  [libjvm.so+0xb39679]  G1RemSet::scan_collection_set_regions(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x89
V  [libjvm.so+0xab3c13]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x83
08-01-2020

This has also happened on Linux, and looking at that failure this could very well be a dup of https://bugs.openjdk.java.net/browse/JDK-8235119. In both occurrences the crash happens during a Mixed collection, and in the hs_err from the Linux crash we can see the stack trace better:

Stack: [0x00007f509f1f0000,0x00007f509f2f0000], sp=0x00007f509f2ee6a0, free space=1017k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x94a243]  oopDesc::size_given_klass(Klass*)+0x2e3
V  [libjvm.so+0xb2afe4]  G1ParScanThreadState::copy_to_survivor_space(G1HeapRegionAttr, oop, markWord)+0x44
V  [libjvm.so+0xb5b6b0]  void G1ParCopyClosure<(G1Barrier)0, (G1Mark)0>::do_oop_work<oop>(oop*)+0x100
V  [libjvm.so+0xa992b5]  void G1CodeBlobClosure::HeapRegionGatheringOopClosure::do_oop_work<oop>(oop*)+0x25
V  [libjvm.so+0x125186e]  nmethod::oops_do(OopClosure*, bool)+0x1ae
V  [libjvm.so+0xa94de2]  G1NmethodProcessor::do_regular_processing(nmethod*)+0x22
V  [libjvm.so+0x125237c]  nmethod::oops_do_process_weak(nmethod::OopsDoProcessor*)+0x2c
V  [libjvm.so+0xa93a65]  G1CodeBlobClosure::do_code_blob(CodeBlob*)+0x55
V  [libjvm.so+0xaa0c7d]  G1CodeRootSet::nmethods_do(CodeBlobClosure*) const+0x4d
V  [libjvm.so+0xb4b5f7]  G1ScanCollectionSetRegionClosure::do_heap_region(HeapRegion*)+0xd7
V  [libjvm.so+0xabfe60]  G1CollectionSet::iterate_part_from(HeapRegionClosure*, HeapRegionClaimer*, unsigned long, unsigned long, unsigned int, unsigned int) const [clone .part.26]+0x100
V  [libjvm.so+0xac1447]  G1CollectionSet::iterate_incremental_part_from(HeapRegionClosure*, HeapRegionClaimer*, unsigned int, unsigned int) const+0x97
V  [libjvm.so+0xb39679]  G1RemSet::scan_collection_set_regions(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x89
V  [libjvm.so+0xab3c13]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x83
V  [libjvm.so+0x16ce714]  GangWorker::run_task(WorkData)+0x84
V  [libjvm.so+0x16ce858]  GangWorker::loop()+0x48
V  [libjvm.so+0x15a5846]  Thread::call_run()+0xf6
V  [libjvm.so+0x12d4c46]  thread_native_entry(Thread*)+0x116

I will do some more digging to see if I can prove this is during the optional phase, in which case I think we can close it as a dup of the above-mentioned bug.
08-01-2020

From hs_err:

stack at sp + 0 slots: 0x00007ffdd373a0f9 jvm.dll::G1ParScanThreadState::copy_to_survivor_space + 0x49
stack at sp + 1 slots: 0x0 is NULL
stack at sp + 2 slots: 0x0 is NULL
stack at sp + 3 slots: 0x0 is NULL
stack at sp + 4 slots: 0x00007ffdd370b24e jvm.dll::G1CodeRootSetTable::add + 0x9e
stack at sp + 5 slots: 0x00007ffdd370b07e jvm.dll::G1CodeRootSet::add + 0x5e
stack at sp + 6 slots: 0x00007ffdd3785bbe jvm.dll::HeapRegionRemSet::add_strong_code_root + 0x3e
stack at sp + 7 slots: 0x00007ffdd370a794 jvm.dll::G1CodeBlobClosure::HeapRegionGatheringOopClosure::do_oop + 0x44

We're most likely in the G1 safepoint:

Event: 74288.868 Executing VM operation: GetStackTrace
Event: 74288.871 Executing VM operation: GetStackTrace done
Event: 74288.871 Executing VM operation: G1CollectForAllocation

Notice this symbol in the registers:

R14=0x00007ffdd3c9ed70 jvm.dll::oop_Relocation::`vftable' + 0x0
10-12-2019