JDK-8253081 : G1 fails on stale objects in archived module graph in Open Archive regions
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 16
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2020-09-14
  • Updated: 2021-07-21
  • Resolved: 2020-11-18
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 16
16 b25Fixed
Related Reports
Blocks :  
Blocks :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8253261 :  
JDK-8266331 :  
Description
Crashing tests:

vmTestbase/nsk/sysdict/vm/stress/btree/btree002/btree002.java
vmTestbase/nsk/sysdict/vm/stress/btree/btree004/btree004.java
vmTestbase/nsk/sysdict/vm/stress/btree/btree005/btree005.java


Appeared in the CI that included JDK-8244778

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/opt/mach5/mesos/work_dir/slaves/4728e7c1-7e67-490e-be0f-6bbf2a2f33db-S209/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/5138b034-dd14-4eb5-b0ea-3240c6d38cfb/runs/8181b3d1-5d6f-4a6f-8773-ffde6519a79a/workspace/open/src/hotspot/share/oops/compressedOops.inline.hpp:84), pid=7524, tid=7558
#  assert(!is_null(v)) failed: narrow klass value can never be zero
#
# JRE version: Java(TM) SE Runtime Environment (16.0+16) (fastdebug build 16-ea+16-662)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 16-ea+16-662, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xbc3c41]  G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x21d61
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/4076d11c-c6ed-4d07-84c1-4ab8d55cd975-S108743/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/197b2215-0257-4e6c-806b-8cc09e1af96c/runs/2ceb0eba-8381-4679-aa94-6f7f8246299a/testoutput/test-support/jtreg_open_test_hotspot_jtreg_vmTestbase_nsk_sysdict/scratch/2/core.7524)
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
---------------  T H R E A D  ---------------

Current thread (0x00007f092c00a4b0):  GCTaskThread "GC Thread#4" [stack: 0x00007f09231ef000,0x00007f09232ef000] [id=7558]

Stack: [0x00007f09231ef000,0x00007f09232ef000],  sp=0x00007f09232ed8a0,  free space=1018k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xbc3c41]  G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x21d61
V  [libjvm.so+0xbdca21]  G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MemoryType)5, 131072u>, (MemoryType)5>*)+0x12f1
V  [libjvm.so+0xb1724a]  G1ParEvacuateFollowersClosure::do_void()+0x1da
V  [libjvm.so+0xb20e28]  G1EvacuateRegionsBaseTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x98
V  [libjvm.so+0xb188b6]  G1EvacuateRegionsBaseTask::work(unsigned int)+0x66
V  [libjvm.so+0x18e1c74]  GangWorker::run_task(WorkData)+0x84
V  [libjvm.so+0x18e1db4]  GangWorker::loop()+0x44
V  [libjvm.so+0x179d040]  Thread::call_run()+0x100
V  [libjvm.so+0x14942c6]  thread_native_entry(Thread*)+0x116

Register to memory mapping:
Comments
Changeset: d3095605 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2020-11-18 08:21:03 +0000 URL: https://github.com/openjdk/jdk/commit/d3095605
18-11-2020

[~jiangli]: In the current implementation out for review we wanted to avoid the second pass and open up the possibility for other collectors to use CDS. So we came up with the more generic solution of the archive initialization creating a "root object array" that makes sure that the collector knows exactly of the liveness of all objects in the open archive. Ultimately for G1 this changed the OA regions to always-pinned Old regions.
13-11-2020

I only noticed this bug report yesterday. As Thomas pointed out in above comments, the current OARC region has the assumption that the dormant object state transition is one directional only: dormant -> live. For more general class pre-initialization, the limitation needs to be addressed. Please see more details in Enhance G1 Support for Open Archive Heap Region section in http://cr.openjdk.java.net/~jiangli/Leyden/Java%20Class%20Pre-resolution%20and%20Pre-initialization%20(OpenJDK).pdf, which is linked from JDK-8233887. Based on this bug report, the issues appears to be not just limited to tools requiring heap iterations. One possible solution of handling it (also called out in the linked doc): After marking live objects in the open archive heap regions (ORAC), do another iteration of the OARC objects and clear out-going references from objects that are not 'live'. This approach does not require tracking the OARC object state transition (dormant->live->dormant), and is guaranteed to clear all stale references.
12-11-2020

Disallowing dormant objects seems a good option -- this may also make it possible to support archived heap for non-G1 collectors: simply map all achive regions into the old space and fill any gaps with dummy arrays. I have implemented a prototype that stores the archived mirrors and archived static fields into an objArray, which is in turn stored in a global OopHandle: https://github.com/openjdk/jdk/compare/master...iklam:8253081-null-narrow-klass For testing purpose, you can select what gets stored into the objArray: (0) Don't store anything: $ java -Xshare:dump roots[0] = 2 words, klass = 0x800007878, obj = 0x7bf07cbc8 $ java -version java version "16-internal" 2021-03-16 Java(TM) SE Runtime Environment (slowdebug build 16-internal+0-adhoc.iklam.open) Java HotSpot(TM) 64-Bit Server VM (slowdebug build 16-internal+0-adhoc.iklam.open, mixed mode, sharing) (1) store archived mirrors (1265 roots for 1265 archived classes): $ java -Xshare:dump -XX:+UseNewCode roots[1265] = 635 words, klass = 0x800007878, obj = 0x7bf07cbc8 $ java -version (crash with -XX:+VerifyArchivedFields) (2) store archived fields (14 roots for 14 classes that each has an archived static field) $ java -Xshare:dump -XX:+UseNewCode2 roots[14] = 9 words, klass = 0x800007878, obj = 0x7bf07cbc8 $ java -version (crash with -XX:+VerifyArchivedFields) In the -XX:+UseNewCode case, it looks like G1 doesn't like to see classes whose _class_loader_data is NULL. This happens because the global objArray has a reference to a mirror of a class that has not be loaded yet. #8 <signal handler called> #9 0x00007ffff5acf7f8 in WeakHandle::is_null (this=0x8) at /jdk2/ken/open/src/hotspot/share/oops/weakHandle.hpp:56 #10 0x00007ffff5d350a8 in ClassLoaderData::holder_no_keepalive (this=0x0) at /jdk2/ken/open/src/hotspot/share/classfile/classLoaderData.cpp:616 #11 0x00007ffff5f43582 in G1FullGCMarker::follow_klass (this=0x7fff980123b0, k=0x8003a3508) at /jdk2/ken/open/src/hotspot/share/gc/g1/g1FullGCMarker.inline.hpp:168 #12 0x00007ffff5f433b3 in G1MarkAndPushClosure::do_klass (this=0x7fff980129e0, k=0x8003a3508) at /jdk2/ken/open/src/hotspot/share/gc/g1/g1FullGCOopClosures.inline.hpp:57 (gdb) frame 11 (gdb) p k $2 = (Klass *) 0x8003a3508 (gdb) p k->_name->print() Symbol: '[Ljdk/internal/math/FDBigInteger;' count 65535$3 = void (NOTE** it's an array class) (gdb) p k->_class_loader_data $4 = (ClassLoaderData *) 0x0
22-09-2020

Some more attempts on this shows that what is expected needs more work: In JDK-8179302 we added open archive regions that contain references into the regular heap; one special feature are "dormant" objects, i.e. objects not initially live but that may become live and stay that way. This functionality clashes with the expected behavior here, i.e. that live objects that become dead are handled as usual. G1 can't distinguish dormant/dead objects that become live from live objects that become dead, it's either or due to how dormant objects are implemented. For support of dormant objects they (actually all objects within the open archive regions) are treated as always live. If an actually live object becomes dead, G1 can't know that it should now really "ignore" it (see the limitations of this approach in JDK-8179302). The options I can see at this time: * separate what is now open-archive and these type of mutable archived object graphs; these still-to-be-named regions are basically handled the same as other old regions (i.e. apart from pinning them forever, mapped from file at startup). * disallow dormant objects, make them reachable from somewhere so that open archive regions can be managed similar to regular old regions. Objects only go from live -> dead, working like above. Half-baked and probably just non-working options: * exact remembered set/PRTs for these regions - then we won't need to scan these dead objects?
21-09-2020

A quick (obviously incomplete) prototype that enables full liveness analysis in open archive regions fails elsewhere, probably due to some bug/oversight. But this is one option fixing this issue.
17-09-2020

The problem looks like that parts of the module graph stored in the open archive region gets unreferenced/dead. In the case of the reproducer, an objArrayOop is affected, and still contains references to random locations that cause the crashes/assertion failures when code iterates over it. E.g. when a remembered set entry covering that dead object (and other ones that actually have valid references) is iterated over during GC. G1 archive support does not allow dead objects containing stale references in archive regions. [~iklam] thinks that the reason for such a change in an object's liveness referenced by the module graph could be some hash table (backing array) getting either full or being disposed after class loading/unloading. This could explain that the test runs passes without class unloading (with -XX:-ClassUnloading). Need to investigate more.
17-09-2020

This may be a bug in how G1 handles the archive regions. I added a patch to verify the heap before/after an archived static field is restored from CDS. It also reverts JDK-8253261. We can see that the heap contents are valid after the full module graph is restored https://github.com/iklam/jdk/commit/2559f14ca65061a88ba9ed21e9257723d5aaa7fe You can edit run.sh in bug-8253081.tar.gz and add -Xlog:cds+heap: [0.030s][info ][cds] optimized module handling: enabled ... [0.074s][info ][cds,heap] Verify heap before initializing static field(s) in java.lang.module.Configuration [0.141s][info ][cds,heap] initialize_from_archived_subgraph java.lang.module.Configuration 0x000000080017d910 [0.141s][info ][cds,heap] Verify heap after initializing static field(s) in java.lang.module.Configuration [0.162s][info ][cds,heap] Verify heap before initializing static field(s) in jdk.internal.loader.ArchivedClassLoaders [0.181s][info ][cds,heap] initialize_from_archived_subgraph jdk.internal.loader.ArchivedClassLoaders 0x00000008001cdb18 [0.181s][info ][cds,heap] Verify heap after initializing static field(s) in jdk.internal.loader.ArchivedClassLoaders [0.205s][info ][cds,heap] Verify heap before initializing static field(s) in jdk.internal.module.ArchivedBootLayer [0.228s][info ][cds,heap] initialize_from_archived_subgraph jdk.internal.module.ArchivedBootLayer 0x000000080024a318 [0.228s][info ][cds,heap] Verify heap after initializing static field(s) in jdk.internal.module.ArchivedBootLayer [0.251s][debug][module ] set_bootloader_unnamed_module(): recording unnamed module for boot loader ... The level = 10 the height = 11 Stress time: 30 seconds ... # assert(!is_null(v)) failed: narrow klass value can never be zero
16-09-2020

Reassigning to GC team for analysis.
16-09-2020

The above patch also prints out more info about the archived objects to aid debugging: $ -XX:MaxRAM=8g -Xlog:cds+heap=debug .... [2.293s][debug][cds,heap] Archived heap object 0x00000000800da6a0 ==> 0x00000000ffe7cbb0 : java.util.HashMap$Node [2.293s][debug][cds,heap] Archived heap object 0x00000000800da690 ==> 0x00000000ffe7cbd0 : java.util.Collections$UnmodifiableSet [2.294s][debug][cds,heap] Archived heap object 0x0000000080076d60 ==> 0x00000000ffe7cbe0 : java.lang.Module$ArchivedData ....
16-09-2020

Attached stand-alone reproducer: bug-8253081.tar.gz $ tar zxvf bug-8253081.tar.gz $ cd bug-8253081 see README edit JAVA_HOME in run.sh $ bash run.sh
16-09-2020

[~tschatzl] It's too complicated to backout. I created JDK-8253261 to disable full module graph temporarily.
16-09-2020

Would it be possible to consider backing out the responsible change (JDK-8244778) if there is no quick fix available? Almost all of my (local) specjbb2015 runs (simple compliant single-VM runs) crash with the change included, and I've heard from others that they get regular crashes in other perf testing (e.g. with JMH) too. Thanks, Thomas
16-09-2020

I can confirm that it's caused by JDK-8244778 Archive full module graph in CDS. I can reproduce 100% with $ jtrreg -vmoptions:'-Xlog:cds' ./vmTestbase/nsk/sysdict/vm/stress/btree/btree002/btree002.java --> FAIL But if I use a special -D property that would disable archived module graph, the test passes $ jtrreg -vmoptions:'-Xlog:cds -Djdk.module.showModuleResolution=true' ./vmTestbase/nsk/sysdict/vm/stress/btree/btree002/btree002.java --> PASS
14-09-2020

[~iklam] - These failures are showing up in Tier5 pretty reliably since the following was pushed: JDK-8244778 Archive full module graph in CDS
14-09-2020

Second failure mode: # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x0000000106c22c27, pid=3709, tid=26883 # # JRE version: Java(TM) SE Runtime Environment (16.0+16) (fastdebug build 16-ea+16-662) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 16-ea+16-662, mixed mode, sharing, tiered, compressed oops, g1 gc, bsd-amd64) # Problematic frame: # V [libjvm.dylib+0x422c27] oopDesc::size_given_klass(Klass*)+0x17 # # Core dump will be written. Default location: core.3709 # # If you would like to submit a bug report, please visit: # https://bugreport.java.com/bugreport/crash.jsp # --------------- T H R E A D --------------- Current thread (0x00007f91e4401e80): GCTaskThread "GC Thread#1" [stack: 0x0000700009a11000,0x0000700009b11000] [id=26883] Stack: [0x0000700009a11000,0x0000700009b11000], sp=0x0000700009b10a10, free space=1022k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.dylib+0x422c27] oopDesc::size_given_klass(Klass*)+0x17 V [libjvm.dylib+0x6eb347] G1ParScanThreadState::do_copy_to_survivor_space(G1HeapRegionAttr, oop, markWord)+0x127 V [libjvm.dylib+0x6edeef] void G1ParScanThreadState::do_oop_evac<unsigned int>(unsigned int*)+0x18f V [libjvm.dylib+0x6e9d5e] G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x2ae V [libjvm.dylib+0x68115d] G1ParEvacuateFollowersClosure::do_void()+0x6d V [libjvm.dylib+0x686aa3] G1EvacuateRegionsBaseTask::evacuate_live_objects(G1ParScanThreadState*, unsigned int, G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases)+0x93 V [libjvm.dylib+0x686897] G1EvacuateRegionsBaseTask::work(unsigned int)+0xa7 V [libjvm.dylib+0x10eb10d] GangWorker::run_task(WorkData)+0x5d V [libjvm.dylib+0x10eb1f3] GangWorker::loop()+0x53 V [libjvm.dylib+0xfe7b77] Thread::call_run()+0x1b7 V [libjvm.dylib+0xdaa9ef] thread_native_entry(Thread*)+0x15f C [libsystem_pthread.dylib+0x3661] _pthread_body+0x154 C [libsystem_pthread.dylib+0x350d] _pthread_body+0x0 C [libsystem_pthread.dylib+0x2bf9] thread_start+0xd
14-09-2020