Bug ID: JDK-8185133 Reference pending list root might not get marked

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 9

Priority: P1
Status: Closed
Resolution: Fixed

Submitted: 2017-07-24
Updated: 2017-10-09
Resolved: 2017-08-01

JDK 10	JDK 9
10 Fixed	9 b181 Fixed

We've seen the following crash in the JDK 9 nightly testing:
assert(Universe::heap()->is_in_or_null(r)) failed: bad receiver: 0xbaadbabe (-1163019586)

Java VM: Java HotSpot(TM) Server VM (fastdebug 9-internal+0-2017-07-10-212747.vkozlov.8184036, mixed mode, emulated-client, g1 gc, windows-x86)

Stack trace:
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0x8d0169]  VMError::report_and_die+0x409;;  ?report_and_die@VMError@@SAXHPBD0PADPAVThread@@PAEPAX40HI@Z+0x409
V  [jvm.dll+0x8d0667]  VMError::report_and_die+0x27;;  ?report_and_die@VMError@@SAXPAVThread@@PBDH11PAD@Z+0x27
V  [jvm.dll+0x3bb518]  report_vm_error+0x48;;  ?report_vm_error@@YAXPBDH00ZZ+0x48
V  [jvm.dll+0x455d7f]  frame::retrieve_receiver+0xaf;;  ?retrieve_receiver@frame@@QAEPAVoopDesc@@PAVRegisterMap@@@Z+0xaf
V  [jvm.dll+0x8002a2]  SharedRuntime::find_callee_info_helper+0x352;;  ?find_callee_info_helper@SharedRuntime@@CA?AVHandle@@PAVJavaThread@@AAVvframeStream@@AAW4Code@Bytecodes@@AAVCallInfo@@PAVThread@@@Z+0x352
V  [jvm.dll+0x805453]  SharedRuntime::resolve_sub_helper+0x133;;  ?resolve_sub_helper@SharedRuntime@@CA?AVmethodHandle@@PAVJavaThread@@_N1PAVThread@@@Z+0x133
V  [jvm.dll+0x804e15]  SharedRuntime::resolve_helper+0x35;;  ?resolve_helper@SharedRuntime@@SA?AVmethodHandle@@PAVJavaThread@@_N1PAVThread@@@Z+0x35
V  [jvm.dll+0x805c3e]  SharedRuntime::resolve_virtual_call_C+0xae;;  ?resolve_virtual_call_C@SharedRuntime@@SAPAEPAVJavaThread@@@Z+0xae
v  ~RuntimeStub::resolve_virtual_call
J 206 c1 java.lang.ref.Reference.processPendingReferences()V java.base@9-internal (132 bytes) @ 0x02cf2958 [0x02cf2780+0x000001d8]
j  java.lang.ref.Reference.access$000()V+0 java.base@9-internal
j  java.lang.ref.Reference$ReferenceHandler.run()V+0 java.base@9-internal

Approved for JDK 9.
01-08-2017
Fix Request: Fixes a (rare) crash introduced by an earlier change in JDK 9, i.e. it's a regression. See the comments from 7/28/2017 for a description of the situation in which the crash occurs, and for some discussion of the fix.
Webrev for the fix is here: http://cr.openjdk.java.net/~mgerdin/8185133/webrev.0
Review discussion is here: http://openjdk.5641.n7.nabble.com/RFR-9-8185133-Reference-pending-list-root-might-not-get-marked-td310763.html
31-07-2017
Label should be jdk9-fix-request. Also add a link to the latest webrev and a confidential comment with a link to the RBT testing results.
31-07-2017
For the jdk9 fix-request decision, here's my understanding of the encountered scenario:

(1) initial state
Given SoftReference SR, WeakReference WR, ordinary object O
SR => WR => O
WR and O are young
WR and O are unreachable except through the chain from SR
SR has not expired

(2) initial_mark
SR is not discovered, because it has not expired.
SR was young, and is promoted to oldgen (or alternatively, I think, was already in oldgen)
WR was young, and is promoted to oldgen.
WR is discovered and enqueued, because O is unreachable.
WR happens to end up at the head of the pending list.

(3) SR expires
We now have an oldgen WR in the pending list, and no certain path by which concurrent marking will reach it, even though it is accessible. (The Java reference processing thread might process and discard it before any damage is actually done, but that's far from certain.)

So it requires a fairly unlikely sequence of events. A direct SR => WR reference seems somewhat artificial, and unlikely in real applications. But an intervening sequence of oldgen objects (either prior to or also promoted by the initial-mark pause) can, I think, still reach the problem state.
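The initial state in (1) can be built with the public java.lang.ref API. The following is only a minimal sketch of the object graph, not a reproducer: the class and method names are made up for illustration, and actually hitting the bug would additionally depend on GC timing, promotion, and SR expiry, none of which this code controls.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

// Hypothetical illustration class; not part of the JDK or of the fix.
class ReferenceChainSketch {

    /**
     * Builds SR => WR => O. Once this method returns, WR and O are
     * reachable only through the chain from SR, as in step (1).
     */
    static SoftReference<WeakReference<Object>> buildChain(ReferenceQueue<Object> queue) {
        Object o = new Object();                                   // ordinary object O
        WeakReference<Object> wr = new WeakReference<>(o, queue);  // WR => O
        return new SoftReference<>(wr);                            // SR => WR
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        SoftReference<WeakReference<Object>> sr = buildChain(queue);
        // Whether WR is still reachable here depends on the collector.
        System.out.println("WR reachable through SR: " + (sr.get() != null));
    }
}
```

Because O has no strong referrer, a collection that discovers WR will put WR on the pending list, while WR itself stays alive only for as long as SR is treated as strong.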
28-07-2017
For the jdk9 fix-request decision, I think the proposed fix is low risk:

- The change to Universe::reference_pending_list() allows it to be called in contexts where Heap_lock is held, but not by the current thread. This means the lock no longer protects it from concurrent calls to set_reference_pending_list or swap_reference_pending_list. But by design, no such concurrent calls occur. The new context for getting the list is the same as the existing swap context, but calls to those never overlap. Swapping is only used within enqueue_discovered_references (to allow parallel processing of the discovered lists), while the new call to get the pending list occurs immediately after the enqueuing is complete. The new context for getting the list is in a safepoint, while the setter is called from a helper function invoked from Java, so not in a safepoint.

- The conditional marking of the pending list head is straightforward. The state is verified to be the one where the problem arises, and not some other state where the marking might be inappropriate. The use of grayRoot() is the normal mechanism for initial marking.
28-07-2017
Testing of the fix so far is looking good, no sightings of this bug or any other previously unknown issue.
28-07-2017
It appears that the bug occurs when a weak reference WR is promoted to old and discovered during an initial mark pause. The WR is the referent of a soft reference SR. The concurrent reference processor determines that SR should be treated as a weak reference due to a shortage of memory. WR is now reachable only from the reference pending list, but it is not explicitly marked in the bitmap. If the reference handler thread is stalled long enough for the concurrent cycle to complete, then WR can get garbage collected and we crash. I'm currently testing a fix where we explicitly inform concurrent marking of the reference pending list head at the end of initial mark's reference enqueue phase.
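The effect described above can be illustrated with a toy marking model. This is a sketch under heavy assumptions and is not HotSpot code: Obj, mark, and wrSurvives are hypothetical names, the "heap" is a handful of Java objects, and the referent field is simply never traced in order to model SR being treated as weak.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model only; all names here are invented for illustration.
class PendingListMarkingSketch {

    /** A toy heap object. */
    static final class Obj {
        final String name;
        Obj strongField; // ordinary strong reference, traced by the marker
        Obj referent;    // models a Reference referent, never traced
        Obj(String name) { this.name = name; }
    }

    /**
     * Marks everything strongly reachable from the roots seen at initial mark.
     * If markPendingHead is true, the pending list head is grayed as well,
     * which models the proposed fix.
     */
    static Set<Obj> mark(Obj[] rootsAtInitialMark, Obj pendingListHead, boolean markPendingHead) {
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> worklist = new ArrayDeque<>();
        for (Obj root : rootsAtInitialMark) {
            if (root != null && marked.add(root)) {
                worklist.push(root);
            }
        }
        if (markPendingHead && pendingListHead != null && marked.add(pendingListHead)) {
            worklist.push(pendingListHead);
        }
        while (!worklist.isEmpty()) {
            Obj current = worklist.pop();
            Obj next = current.strongField;
            if (next != null && marked.add(next)) {
                worklist.push(next);
            }
            // current.referent is deliberately not followed.
        }
        return marked;
    }

    /** Returns true if WR ends up marked in the toy scenario. */
    static boolean wrSurvives(boolean markPendingHead) {
        Obj holder = new Obj("holder");
        Obj sr = new Obj("SR");
        Obj wr = new Obj("WR");
        holder.strongField = sr; // SR is strongly reachable from a root
        sr.referent = wr;        // SR => WR, but only through the referent
        Obj pendingListHead = wr; // WR was enqueued after root processing
        return mark(new Obj[] { holder }, pendingListHead, markPendingHead).contains(wr);
    }

    public static void main(String[] args) {
        System.out.println("without fix, WR marked: " + wrSurvives(false));
        System.out.println("with fix, WR marked: " + wrSurvives(true));
    }
}
```

In this model WR is left unmarked when the pending list head is ignored and is marked when the head is grayed, which mirrors the reasoning behind the fix being tested.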
27-07-2017
So far I've been trying to instrument the reference processing and pending list code to figure out where this is going wrong and where the dead reference object is coming from. I've now seen a case where the STW reference processor of the initial mark pause promotes a WeakReference to old and discovers it, adding it to the pending list. At this point the WeakReference is not live according to the "next" marking information. When the remark pause occurs the WeakReference is still not live according to "next" marking, and at the end of remark we are caught by heap verification. The only object referring to the WeakReference at that point is an "active" (but unreachable) SoftReference which has the WeakReference as its referent. Two things occur to me about this state:

1) If the SoftReference became unreachable after the initial mark pause then it seems we are missing a SATB barrier on the write which detached it from the object graph.

2) There is a comment in HeapRegion::note_start_of_copying in the during_initial_mark case that states that old objects pointed to by roots are explicitly marked. This is not true for the case of Universe::_reference_pending_list, since that root is updated after root processing has occurred. Maybe the pending list root needs to be explicitly marked if it points to an object promoted by initial mark itself?
26-07-2017
Roots don't need barriers; that's kind of definitional. Somehow we're getting a bad value put there. A SATB barrier wouldn't help with that anyway.
25-07-2017
I have been able to reproduce the crash both on Windows and on Linux now. With -XX:+VerifyDuringGC I've seen that the problem is likely related to the Universe::_reference_pending_list oop being dead. I suspect that we need to issue SATB pre-barriers when updating the pending list root.
25-07-2017
The top lines in the stack printout are the result of the "pusha" in the resolve blob's register saver. %edx = 0x06474a68 is the Reference oop which is pointing to 0xbaadbabe. %edx is not present in the oop map for the resolution call, but it may be that it no longer needs to be alive after the queue has been loaded from the ref; I'm not sure.
24-07-2017
So far:
The Reference Handler thread is running in a C1 compiled java.lang.ref.Reference::processPendingReferences.
It looks like it's executing the line
if (q != ReferenceQueue.NULL) q.enqueue(ref);
and is calling into resolve_virtual_call to resolve the enqueue method.
For some reason ref.queue is 0xbaadbabe and we hit the assert.
The Reference instance appears to be pointing into memory containing 0xbaadbabe, so when the queue was loaded from the reference it was already broken.

I'm trying to reproduce the failure by rerunning the test in the same configuration (but on linux-x86).
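For orientation, the kind of loop the handler thread runs can be modelled roughly as follows. This is a simplified sketch and not the JDK source: ToyRef, ToyQueue, and drain are hypothetical stand-ins, and only the queue check mirrors the line quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Reference and ReferenceQueue; names are invented.
class PendingListDrainSketch {

    static final class ToyQueue {
        /** Sentinel meaning "no queue registered", like ReferenceQueue.NULL. */
        static final ToyQueue NULL = new ToyQueue();
        final List<ToyRef> enqueued = new ArrayList<>();
        void enqueue(ToyRef ref) { enqueued.add(ref); }
    }

    static final class ToyRef {
        final ToyQueue queue;
        ToyRef discovered; // link to the next entry on the pending list
        ToyRef(ToyQueue queue) { this.queue = queue; }
    }

    /** Walks the pending list and enqueues each entry that has a real queue. */
    static int drain(ToyRef pendingHead) {
        int dispatched = 0;
        ToyRef ref = pendingHead;
        while (ref != null) {
            ToyRef next = ref.discovered;
            ref.discovered = null;
            // The queue is loaded from the reference itself, so a reclaimed
            // reference would yield a garbage queue value at this point.
            ToyQueue q = ref.queue;
            if (q != ToyQueue.NULL) {
                q.enqueue(ref);
                dispatched++;
            }
            ref = next;
        }
        return dispatched;
    }

    public static void main(String[] args) {
        ToyQueue queue = new ToyQueue();
        ToyRef first = new ToyRef(queue);
        ToyRef second = new ToyRef(ToyQueue.NULL);
        first.discovered = second;
        System.out.println("dispatched: " + drain(first));
    }
}
```

The sketch shows why a dead Reference on the pending list is fatal: the loop trusts every field it loads from the list entry, including the queue it then makes a virtual call on.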
24-07-2017
Priority set to P1 during initial investigation
24-07-2017