Bug ID: JDK-8253413 [REDO] [REDO] G1 incorrectly limiting young gen size when using the reserve can result in repeated full gcs

JDK 20
20 b12Fixed

Changeset: a3ec0bb0 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2022-08-22 08:48:52 +0000 URL: https://git.openjdk.org/jdk/commit/a3ec0bb03a5de805fc4b45477925aa18b875bc79
22-08-2022
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9573 Date: 2022-07-20 16:14:07 +0000
20-07-2022
See JDK-8254164.
07-10-2020
It looks like scan_top for the regions that fail is typically bad (ie. NULL), so no cards are not scanned and references not updated. E.g. [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) Missing rem set entry: [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) Field 0x00000005030ebd80 of obj 0x00000005030ebd38 in region 518:(O)[0x0000000503000000,0x0000000503800000,0x0000000503800000] [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) [Ljava.lang.Object; [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) {0x00000005030ebd38} - klass: 'java/lang/Object'[] [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) - length: 2048 [5072.804s][1601907910496ms][error][gc,verify ] GC(1985) points to obj 0x0000000650cfd7e8 in region 1185:(O)[0x0000000650800000,0x0000000651000000,0x0000000651000000] remset Complete \| 518\|0x0000000503000000, 0x0000000503800000, 0x0000000503800000\|100%\| O\| \|TAMS 0x0000000503000000, 0x0000000503000000\| Complete the scan_top is: [...] region-idx top scan_top 518 0x0000000503000000 0x0000000000000000 [...] Which looks wrong for a regular old gen region. Not sure how that can happen.
05-10-2020
The failing remembered sets are always (or at least most of the time) in a single old region apparently unrelated to the collection: - g1 has not allocated into that region in this collection (it's not a new region) - target can be either survivor or old region not in any collection (candidate) set - region itself is not in collection (candidate) set - it's always a single region that is broken
05-10-2020
Adding selective verification of mixed gcs only, in particular only ones with optional evacation, allows reproduction of the issue with after-gc verification failing with missing remembered set entries. Before-gc verification is fine, so something seems to be wrong with G1's optional evacuation. The remembered sets of seemingly completely unrelated (to the evacuation) regions are missing, i.e. always two Old regions not evacuated or allocated(?) in this pause. Need to investigate this further. This crash occurs with jdk-16+0, so is also in jdk 15.
02-10-2020
I have not been able to reproduce the crashes with broken oops in the handle area.
30-09-2020
copy_to_survivor_space crash example like JDK-8253295: These crashes are very rare, like out of 20 crashes, 1 is in copy_to_survivor, others are crashes like above. (lldb) frame select 11 frame #11: 0x000000010a5ef835 libjvm.dylib`G1ParScanThreadState::do_copy_to_survivor_space(this=0x00007fe896510ef0, region_attr=<unavailable>, old=0x0000000587444578, old_mark=<unavailable>) at g1ParScanThreadState.cpp:432 [opt] From the hs_err we can see: RSI=0x0000000587444578 is pointing into object: [B {0x0000000587443278} - klass: {type array byte} - length: 8192 which is in eden: \| 782\|0x0000000587000000, 0x0000000587800000, 0x0000000587800000\|100%\| E\|CS\|TAMS 0x0000000587000000, 0x0000000587800000\| Complete (lldb) print *old (oopDesc) $0 = { _mark = (_value = 0) _metadata = { _klass = 0x0000000000000000 _compressed_klass = 0 } } Now looking at where this "old" pointer came from: (lldb) frame select 12 frame #12: 0x000000010a5f08ec libjvm.dylib`void G1ParScanThreadState::do_oop_evac<unsigned int>(this=0x00007fe896510ef0, p=0x0000000590f2e7ec) at g1ParScanThreadState.cpp:200 [opt] p, i.e. 0x0000000590f2e7ec, is in an old region \| 801\|0x0000000590800000, 0x0000000591000000, 0x0000000591000000\|100%\| O\| \|TAMS 0x0000000590800000, 0x0000000591000000\| Complete Looking at the heap dump contents, the next mark word close to 0x0000000590f2e7ec is 0x590f2e790: 0x00000000 0x00000000 0x000000fc 0x00000000 0x590f2e7a0: 0x000000fc 0x00000000 0xa37006f6 0xc74d4137 0x590f2e7b0: 0x00000000 0x00000000 0x16938c2b 0x00000005 0x590f2e7c0: 0x0005f510 0xb21e5cfa 0xa82ab4f0 0x00007fe8 0x590f2e7d0: >0x1692a25b 0x00000005< 0x0005a9c1 0xb21e5cef 0x590f2e7e0: 0x811fac74 0x00000000 0x00000000 0xb0e888af <—— broken 0x590f2e7f0: 0xb21d617c 0x817c6718 0xb21e5cf7 0x00000000 0x590f2e800: 0x00000001 0x00000000 0x00000890 0x00000001 at 0x590f2e7d0 which is a jdk/internal/ref/CleanerImpl$PhantomCleanableRef In detail: 0x590f2e7d0: >0x1692a25b 0x00000005< mark 0x0005a9c1 compressed klass 0xb21e5cef jdk/internal/ref/CleanerImpl$PhantomCleanableRef 0x590f2e7e0: 0x811fac74 java/lang/ref/ReferenceQueue 0x00000000 0x00000000 0xb0e888af <—— this is our broken 0x587444578 oop reference 0x590f2e7f0: 0xb21d617c jdk/internal/ref/CleanerImpl$PhantomCleanableRef 0x817c6718 jdk/internal/ref/CleanerImpl$PhantomCleanableRef 0xb21e5cf7 $PhantomCleanableRef::action 0x00000000 padding? 0x590f2e800: 0x00000001 0x00000000 0x00000890 0x00000001 Can't find a reason why this reference in the middle of the object (and the only reference at that, surrounded by good references and good objects) not at a particular position would be bad except for a naked oop somewhere. There is no reason why any GC would exclude that kind of reference from processing. Note that the bad oop in jdk/internal/ref/CleanerImpl$PhantomCleanableRef seems to be a coincidence: other bad oop locations seem to be in random (large) object arrays and others.
30-09-2020
On that crash the markWord seems to be just random: #8 0x00007f7e98028618 in markWord::age (this=<optimized out>) at .../open/src/hotspot/share/utilities/globalDefinitions.hpp:942 #9 G1ParScanThreadState::next_region_attr (age=<synthetic pointer>: <optimized out>, m=..., region_attr=..., this=0x7f7db004f260) at .../open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:342 #10 G1ParScanThreadState::do_copy_to_survivor_space (old_mark=..., old=0x6cc1b4500, region_attr=..., this=<optimized out>) at .../open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:415 #11 G1ParScanThreadState::do_oop_evac<unsigned int> (p=0x4f2f803c0, this=<optimized out>) at ... /open/src/hotspot/share/gc/g1/g1ParScanThreadState.cpp:195 [...] old (0x6cc1b4500) is in Eden \|1378\|0x00000006cc000000, 0x00000006cc800000, 0x00000006cc800000\|100%\| E\|CS\|TAMS 0x00000006cc000000, 0x00000006cc800000\| Complete p (0x4f2f803c0) is in Old \| 431\|0x00000004f2800000, 0x00000004f3000000, 0x00000004f3000000\|100%\| O\| \|TAMS 0x00000004f3000000, 0x00000004f3000000\| Untracked 0x4f2f80390: 0x00000001 0x00000000 0x00022426 0x00000000 0x4f2f803a0: 0x9e5f00791 0x000001f3 0x000001f3 0x00000300 0x4f2f803b0: 0x3f400000 0x9e5f3b012 0x00000000 0x00000000 0x4f2f803c0: 0xd98368a0<-- 0x9e5fd2f93 0x00000001 0x00000000 0x4f2f803d0: 0x0017c3c5 0x9e5f0072 0x00000001 0x00000000 This seems to be a java/util/LinkedHashMap starting from 0x4f2f80390; "<--" marks the broken oop. All other references seem to be good, i.e. the 1,2,3 pointers are java/util/LinkedHashMap$LinkedValues, java/util/LinkedHashMap$Entry and [Ljava/util/HashMap$Node; respectively. The 0x00...1 at 0x4f2f803c8 starts another object (a "java/util/LinkedHashMap$LinkedValues"). Again, a random broken oop within an otherwise seemingly valid object at no special location (e.g. no card boundary).
30-09-2020
On a different copy_to_survivor crash (jhsdb errors with "failed to workaround classsharing" on the one above) on Linux jhsdb jstack shows the following java stack trace: "Finalizer" #3 daemon prio=8 tid=0x00007f7e9017cf30 nid=0x729 waiting for monitor entry [0x00007f7e54627000] java.lang.Thread.State: BLOCKED (on object monitor) JavaThread state: _thread_blocked - jdk.nio.zipfs.ZipFileSystem.close() @bci=146, line=484 (Compiled frame) Error occurred during stack walking: sun.jvm.hotspot.debugger.UnmappedAddressException: 2035332034332030 at jdk.hotspot.agent/sun.jvm.hotspot.debugger.PageCache.checkPage(PageCache.java:208) at jdk.hotspot.agent/sun.jvm.hotspot.debugger.PageCache.getData(PageCache.java:63) [...]
30-09-2020
JDK-8248438 crash example: [...] frame #15: 0x0000000107298954 libjvm.dylib`AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<594036ull, G1BarrierSet>, (AccessInterna frame #16: 0x0000000107947e36 libjvm.dylib`ObjectSynchronizer::quick_enter(obj=<unavailable>, self=0x00007fb7e182ec10, lock=0x0000700000b829c0) at frame #17: 0x00000001078c2510 libjvm.dylib`SharedRuntime::monitor_enter_helper(obj=0x0000000618a7f478, lock=0x0000700000b829c0, thread=0x00007fb7e1 frame #18: 0x00000001078c262e libjvm.dylib`SharedRuntime::complete_monitor_locking_C(obj=<unavailable>, lock=<unavailable>, thread=0x00007fb7e182ec [...] Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) v ~RuntimeStub::_complete_monitor_locking_Java J 24365 c2 sun.nio.ch.FileChannelImpl.implCloseChannel()V java.base@16-internal (169 bytes) @ 0x000000011c432988 [0x000000011c432240+0x0000000000000748] J 24362 c2 jdk.nio.zipfs.ZipFileSystem.close()V jdk.zipfs@16-internal (390 bytes) @ 0x000000011c434090 [0x000000011c433a80+0x0000000000000610] J 70311 c2 jdk.nio.zipfs.ZipFileSystem.finalize()V jdk.zipfs@16-internal (5 bytes) @ 0x000000011c598724 [0x000000011c598700+0x0000000000000024] J 61911 c2 java.lang.ref.Finalizer.runFinalizer(Ljdk/internal/access/JavaLangAccess;)V java.base@16-internal (117 bytes) @ 0x000000011bc61a5c [0x000000011bc614c0+0x000000000000059c] J 13066% c2 java.lang.ref.Finalizer$FinalizerThread.run()V java.base@16-internal (56 bytes) @ 0x000000011bf82b30 [0x000000011bf82a40+0x00000000000000f0] The Java stack is always very similar, somewhere from jdk.nio.zipfs.ZipFileSystem.close(); the Java stack frame between this one and v ~RuntimeStub::_complete_monitor_locking_Java typically differs. I.e. is - nothing - J 24365 c2 sun.nio.ch.FileChannelImpl.implCloseChannel()V java.base@16-internal (169 bytes) @ 0x000000011c432988 [0x000000011c432240+0x0000000000000748] (the one in the example above) - J 17121 c2 java.nio.channels.spi.AbstractInterruptibleChannel.close()V java.base@16-internal (37 bytes) @ 0x00007f96316763f4 [0x00007f9631675b40+0x00000000000008b4] Looking at the monitor, that is attempted to be locked, it is garbage. Sadly this means that the back-reference to the obj is also garbage. The referenced object obj 0x0000000618a7f478 is into old gen: \|1073\|0x0000000618800000, 0x0000000619000000, 0x0000000619000000\|100%\| O\| \|TAMS 0x0000000618800000, 0x0000000618800000\| Complete It looks like a valid object (i.e. the klass seems good) apart from the mark word which is apparently broken. It is a java.lang.Object. It is referenced by a sun.nio.ch.FileChannelImpl instance.
30-09-2020
There are basically two kinds of crashes: - in copy_to_survivor_space like JDK-8253295 - or some crash when trying to lock a monitor like JDK-8248438 Most (99%) of the crashes are in monitor locking, always with crashes in ObjectSynchronizer::enter trying to access something invalid in the heap (via AccessInternal::PostRuntimeDispatch) like the following: Stack: [0x00007fa124a1d000,0x00007fa124b1e000], sp=0x00007fa124b1c7d0, free space=1021k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xb208ab] ObjectMonitor::enter(Thread)+0x26b V [libjvm.so+0xc98c6c] ObjectSynchronizer::enter(Handle, BasicLock, Thread)+0xcc V [libjvm.so+0xbf96f0] SharedRuntime::complete_monitor_locking_C(oopDesc, BasicLock, JavaThread)+0x60
30-09-2020
Some testing showed that potentially there is some problem with optional evacuation as initial results with a run without optional evac passes. Will update as that testing completes.
25-09-2020

Blocks :	JDK-8253877 - gc/g1/TestGCLogMessages.java fails - missing "Evacuation failure" message
Blocks :	JDK-8253411 - [BACKOUT] [REDO] G1 incorrectly limiting young gen size when using the reserve can result in repeated full gcs
Blocks :	JDK-8254164 - G1 only removes self forwarding pointers for last collection set increment
Relates :	JDK-8253295 - bigapps fail due to bad oop in a JavaThread
Relates :	JDK-8248438 - SIGSEGV in ObjectMonitor::enter
Relates :	JDK-8297186 - G1 triggers unnecessary full GCs when heap utilization is low
Relates :	JDK-8352765 - G1CollectedHeap::expand_and_allocate() may fail to allocate even after heap expansion succeeds
Relates :	JDK-8295669 - Regression ~8% in DaCapo-pmd-large in 20-b12