JDK-8291725 : Leftover marks when VM shutdown aborts bitmap clearing make mixed gc fail
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: 20
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2022-08-02
  • Updated: 2022-09-12
  • Resolved: 2022-09-06
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 20
20 b14Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Description
While trying to reproduce JDK-8288970, I hit the following assert when executing runtime/cds/appcds/dynamicArchive/LotsUnloadTest3.java (attached to JDK-8288970):

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/workspace/open/src/hotspot/share/gc/g1/g1RemSet.cpp:1270), pid=310717, tid=310802
#  assert(bitmap->get_next_marked_addr(hr->bottom(), hr->end()) == hr->end()) failed: Bitmap should have no mark for region 2
#
# JRE version: Java(TM) SE Runtime Environment (20.0+9) (fastdebug build 20-ea+9-487)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-ea+9-487, compiled mode, tiered, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xe598e2]  G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(HeapRegion*)+0xa2

Stack: [0x00007fae651ec000,0x00007fae652ec000],  sp=0x00007fae652eab00,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xe598e2]  G1MergeHeapRootsTask::G1ClearBitmapClosure::do_heap_region(HeapRegion*)+0xa2
V  [libjvm.so+0xd87843]  G1CollectedHeap::par_iterate_regions_array(HeapRegionClosure*, HeapRegionClaimer*, unsigned int const*, unsigned long, unsigned int) const+0xe3
V  [libjvm.so+0xe71231]  G1MergeHeapRootsTask::work(unsigned int)+0x1e1
V  [libjvm.so+0x1be5a00]  WorkerThread::run()+0x80
V  [libjvm.so+0x1a8ebf0]  Thread::call_run()+0x100
V  [libjvm.so+0x174c663]  thread_native_entry(Thread*)+0x103

The test was executed with:
-Xcomp -XX:CompileCommand=quiet -XX:CompileCommand=compileonly,java.lang.invoke.MethodHandles::* -XX:CompileCommand=compileonly,java.lang.invoke.MethodType::*
Comments
Changeset: b2067e63 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2022-09-06 13:41:37 +0000 URL: https://git.openjdk.org/jdk/commit/b2067e63da116235740f5891f6218c9e1fd7b527
06-09-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9735 Date: 2022-08-03 20:36:44 +0000
04-08-2022

The issue cropped up because of an optimization for bitmap clearing during young gc added in JDK-8210708. So this is a JDK 20 issue.
03-08-2022

JDK-8274007 is basically the same issue for full gcs during VM shutdown.
03-08-2022

I think similar to full gc there needs to be special handling in G1MergeHeapRootsTask::G1ClearBitmapClosure() during shutdown.
03-08-2022

The concurrent cycle abort likely comes from the VM shutting down as the hs_err log contains the following lines that some threads already terminated: [...] 0x00007fae90058080 ConcurrentGCThread "G1 Main Marker" terminated [...] 0x00007fae900a2900 ConcurrentGCThread "G1 Refine#0" terminated 0x00007fae900a3af0 ConcurrentGCThread "G1 Service" terminated Then the problem makes sense - VM shuts down, aborts concurrent marking/clearing the bitmap. However an additional gc (in this case mixed) gets scheduled with a dirty bitmap.
03-08-2022

From the time stamps it looks a bit like that the concurrent cycle is somehow aborted by the Prepare Mixed gc...
03-08-2022

This seems to be an issue where we get the Concurrent Abort (that before-verification should be in Full GC?) after "Concurrent Cleanup for Next Mark" (i.e. bitmap clear) but before the cycle has actually been done (i.e. we reset the state). It is still fishy that there is a mark somewhere on the bitmaps after "Concurrent Cleanup for Next Mark". [15.406s][info ][gc,verify ] GC(246) Verifying During GC (Cleanup Before) 21.432ms [15.429s][info ][gc,verify ] GC(246) Verifying During GC (Cleanup After) 23.529ms [15.429s][info ][gc ] GC(246) Pause Cleanup 57M->57M(62M) 45.367ms [15.429s][info ][gc,marking] GC(246) Concurrent Cleanup for Next Mark [15.430s][info ][gc,marking] GC(246) Concurrent Cleanup for Next Mark 0.055ms [15.453s][info ][gc,verify ] GC(247) Verifying Before GC 22.825ms [15.484s][info ][gc,verify ] GC(247) Verifying After GC 28.856ms [15.484s][info ][gc ] GC(247) Pause Young (Prepare Mixed) (G1 Preventive Collection) 57M->56M(63M) 54.101ms [15.485s][info ][gc,marking] GC(246) Concurrent Mark Abort [15.485s][info ][gc ] GC(246) Concurrent Mark Cycle 174.276ms [15.510s][info ][gc,verify ] GC(248) Verifying Before GC 22.662ms
03-08-2022

(The dwarf2 debug info seems to be broken for that build, getting a few "<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>" messages in gdb So we are not in full gc, but in a mixed gc; we for some reason aborted marking just now, so the old gen regions do not have their bitmap cleared for eventual evacuation failure as in G1MergeHeapRootsTask::G1ClearBitmapClosure there is this optimization to only clear when during the "concurrent cleanup for next mark phase" (which has been aborted). Would be nice to understand why the concurrent cycle aborted in a mixed gc; note that the abort message also comes after having completed the " [15.430s][info ][gc,marking] GC(246) Concurrent Cleanup for Next Mark 0.055ms ", so the bitmap should be clear already... Concurrent cycle is pretty standard: GC(245) is the concurrent start gc GC(246) the concurrent cycle which is interrupted only by that GC(247) Prepare mixed gc. That mixed gc (GC(248)) seems to be in the merging of the initial evacuation too (if the stack trace can be trusted)
03-08-2022

Likely related to recent JDK 20 changes to the marking cycle (JDK-8210708)
03-08-2022