JDK-8130261 : G1 SEGV in MarkSweep::mark_and_push()

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 8u60,9

Priority: P2
Status: Closed
Resolution: Duplicate
OS: solaris
CPU: x86,aarch64

Submitted: 2015-07-01
Updated: 2016-03-21
Resolved: 2016-03-18

Versions (Unresolved/Resolved/Fixed)

JDK 9: 9 (Resolved)

Related Reports

Duplicate :	JDK-8141420 - Compiler runtime entries don't hold Klass* from being GCed
Relates :	JDK-8151256 - JVM crash in CompactibleSpace::adjust_pointers(), intermittently
Relates :	JDK-8141420 - Compiler runtime entries don't hold Klass* from being GCed
Relates :	JDK-8130338 - ACCESS_VIOLATION in InstanceKlass::oop_ms_follow_contents()

Description

#  SIGSEGV (0xb) at pc=0xfffffd79d51b9202, pid=5063, tid=0x0000000000000039
#
# JRE version: Java(TM) SE Runtime Environment (9.0) (build 1.9.0-internal-fastdebug-20150630214846.iggy.8079775-and-8079062-b00)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (1.9.0-internal-fastdebug-20150630214846.iggy.8079775-and-8079062-b00 compiled mode solaris-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x19b9202]  void MarkSweep::mark_and_push<oop>(__type_0*)+0xd2

Stack: [0xfffffd7fe4eff000,0xfffffd7fe4fff000],  sp=0xfffffd7fe4ffc3c0,  free space=1012k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x19b9202]  void MarkSweep::mark_and_push<oop>(__type_0*)+0xd2;;  __1cJMarkSweepNmark_and_push4nDoop__6FpTA_v_+0xd2
V  [libjvm.so+0x19b09be]  void InstanceKlass::oop_ms_follow_contents(oop)+0x3e;;  __1cNInstanceKlassWoop_ms_follow_contents6MnDoop__v_+0x3e
V  [libjvm.so+0x19b8b64]  void MarkSweep::follow_object(oop)+0xb4;;  __1cJMarkSweepNfollow_object6FnDoop__v_+0xb4
V  [libjvm.so+0x19b9fcd]  void MarkSweep::follow_root<oop>(__type_0*)+0x24d;;  __1cJMarkSweepLfollow_root4nDoop__6FpTA_v_+0x24d
V  [libjvm.so+0x1ad2145]  void InterpreterOopMap::iterate_oop(OffsetClosure*)const+0xf5;;  __1cRInterpreterOopMapLiterate_oop6kMpnNOffsetClosure__v_+0xf5
V  [libjvm.so+0x1341307]  void frame::oops_interpreted_do(OopClosure*,CLDClosure*,const RegisterMap*,bool)+0xd37;;  __1cFframeToops_interpreted_do6MpnKOopClosure_pnKCLDClosure_pknLRegisterMap_b_v_+0xd37
V  [libjvm.so+0x1d9e54c]  void JavaThread::oops_do(OopClosure*,CLDClosure*,CodeBlobClosure*)+0x22c;;  __1cKJavaThreadHoops_do6MpnKOopClosure_pnKCLDClosure_pnPCodeBlobClosure__v_+0x22c
V  [libjvm.so+0x1da482b]  void Threads::possibly_parallel_oops_do(bool,OopClosure*,CLDClosure*,CodeBlobClosure*)+0xdb;;  __1cHThreadsZpossibly_parallel_oops_do6FbpnKOopClosure_pnKCLDClosure_pnPCodeBlobClosure__v_+0xdb
V  [libjvm.so+0x1416df2]  void G1RootProcessor::process_strong_roots(OopClosure*,CLDClosure*,CodeBlobClosure*)+0x92;;  __1cPG1RootProcessorUprocess_strong_roots6MpnKOopClosure_pnKCLDClosure_pnPCodeBlobClosure__v_+0x92
V  [libjvm.so+0x13ad8db]  void G1MarkSweep::mark_sweep_phase1(bool&,bool)+0xdb;;  __1cLG1MarkSweepRmark_sweep_phase16Frbb_v_+0xdb
V  [libjvm.so+0x13ad654]  void G1MarkSweep::invoke_at_safepoint(ReferenceProcessor*,bool)+0xe4;;  __1cLG1MarkSweepTinvoke_at_safepoint6FpnSReferenceProcessor_b_v_+0xe4
V  [libjvm.so+0x136795b]  bool G1CollectedHeap::do_collection(bool,bool,unsigned long)+0x6fb;;  __1cPG1CollectedHeapNdo_collection6MbbL_b_+0x6fb
V  [libjvm.so+0x1149c75]  void CollectedHeap::collect_as_vm_thread(GCCause::Cause)+0x105;;  __1cNCollectedHeapUcollect_as_vm_thread6MnHGCCauseFCause__v_+0x105
V  [libjvm.so+0x1e5f58a]  void VM_CollectForMetadataAllocation::doit()+0x1aa;;  __1cbFVM_CollectForMetadataAllocationEdoit6M_v_+0x1aa
V  [libjvm.so+0x1e950d2]  void VM_Operation::evaluate()+0x122;;  __1cMVM_OperationIevaluate6M_v_+0x122
V  [libjvm.so+0x1e9151b]  void VMThread::evaluate_operation(VM_Operation*)+0x20b;;  __1cIVMThreadSevaluate_operation6MpnMVM_Operation__v_+0x20b
V  [libjvm.so+0x1e921a1]  void VMThread::loop()+0x7d1;;  __1cIVMThreadEloop6M_v_+0x7d1
V  [libjvm.so+0x1e91074]  void VMThread::run()+0xb4;;  __1cIVMThreadDrun6M_v_+0xb4

Comments

I ran the failing test for almost a week with the fix for JDK-8141420: 2692 iterations with no failures. Without the fix, it failed 3 times within 100 iterations.
21-03-2016
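As a rough check on how strong that evidence is, the sketch below estimates the chance of 2692 clean iterations if the bug were still present. It assumes the runs are independent and that 3 failures in 100 iterations approximates the true per-iteration failure rate. Both are assumptions rather than facts from this report, and prob_no_failures is a hypothetical helper name.

```cpp
#include <cmath>

// Probability of observing zero failures in `runs` independent runs when
// each run fails with probability `p_fail` (assumed constant).
// Computed as exp(runs * log(1 - p_fail)) to stay numerically stable
// for large `runs`.
inline double prob_no_failures(double p_fail, int runs) {
  return std::exp(runs * std::log1p(-p_fail));
}
```

With p_fail = 0.03 and runs = 2692 the result is vanishingly small, which is consistent with treating the fix for JDK-8141420 as having resolved this failure.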
This is a duplicate of JDK-8141420. It has the same stack and I verified that it doesn't reproduce with the fix for that bug. Nice job [~vlivanov] narrowing that one down!
18-03-2016
[~coleenp] thanks for verifying!
18-03-2016
Well, trace_bytecode() is wrong since it pushes things on the expression stack before call_VM saves last_Java_sp. Looking for other calls that do the same or similar, newarray and anewarray may not push TOS correctly.
11-03-2016
It's fastdebug.
10-03-2016
This is with the product build, right? I can't tell from the version string which I think didn't specify for a while.
10-03-2016
8151256 appears to be the same problem, with a different test, on Linux. I just updated the attached patch file to work with current source/compilers.
07-03-2016
> The difference seems to be that with one fewer compilation, the 'last ditch' GC does not happen -
> suggesting the last ditch collection can somehow corrupt an interpreter frame exp stack.
Unfortunately, I've now seen the crash without the 'last ditch' gc, disproving this theory.
02-09-2015
> suggesting the last ditch collection can somehow corrupt an interpreter frame exp stack.
Well.... Not directly. I can call Threads::verify, which should verify the exp stack of interpreter frames, before/after the last ditch collection, and it doesn't find any problem.
31-08-2015
Update: I'm now able to get one of the additional asserts I added to trigger every time. It's in InterpreterFrameClosure::offset_do, and shows that an expression stack entry expected to be an oop points outside the heap. It occurs with either ParallelGC or G1, and regardless of TieredCompilation, C1 or C2, and PreferInterpreterNativeStubs, which was helping to trigger it earlier, isn't necessary. It also occurs without Xcomp and I've trimmed the test and excluded a lot of methods from compilation. One key thing in reproducing it predictably is running with TraceBytecodes enabled, presumably to slow down those methods running interpreted so you're more likely to be in the interpreter when a GC occurs. Now, if I exclude 1 additional compilation in .hotspot_compiler, the assertion doesn't trigger - and it doesn't seem to need to be a specific method. The difference seems to be that with one fewer compilation, the 'last ditch' GC does not happen - suggesting the last ditch collection can somehow corrupt an interpreter frame exp stack.
31-08-2015
I've seen another flavor of crash now, when running with TieredCompilation disabled. Since the tests had run for a couple of days with no crash with that option, I was close to thinking it didn't occur with Tiered disabled. In this case, the method() is found to be NULL in vtableEntry::verify. FWIW, I have not seen a crash (yet) when running with -client and no Tiered. All 4 types of crash I've seen are probably due to an underlying metaspace corruption problem that manifests in different ways depending on mem layout or timing, which the various options affect.
03-08-2015
The failures with +PreferInterpreterNativeStubs look, unfortunately for me, like a different problem. There, the interp expression stack contains a stack address where an oop is expected. Code added to check for that in advance of the failure doesn't trigger when the original test case is run.
30-07-2015
Adding +PreferInterpreterNativeStubs seems to result in more frequent crashes, but with a slightly different signature. It crashes in G1 when a bad oop causes the biased array index to be out of range. A few other observations: I haven't been able to reproduce it with VerifyBeforeGC/AfterGC enabled - neither the crash nor a verification error occurs. I've also tried adding additional checking of things I was suspicious about, such as checking the mark stack guard at every push, or additional checking to catch errors earlier, such as at interpreter store ops and field refs. Also code to look for the pad pattern in the class field in verify_oop, rather than simply checking for non-zero. None of that has panned out. I have seen a very similar crash running with UseParallelGC, rather than G1, so I don't think it's G1 specific, but may be slightly more reproducible with G1. In one case, a crash occurred during a C1 compilation, rather than during GC, due to a similarly corrupt pointer (with the pad pattern in it).
28-07-2015
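To illustrate why a corrupt oop can show up as an out-of-range biased array index, here is a minimal sketch of an address-to-region lookup. It is not G1's actual data structure: BiasedRegionTable, its fields, and the example heap layout are hypothetical, chosen only to show the arithmetic.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of a table indexed by heap region. The index is
// derived from an address by shifting and then subtracting a bias, so
// that the first covered region maps to index 0.
struct BiasedRegionTable {
  std::uint64_t heap_base;  // lowest covered address (region aligned)
  unsigned shift;           // log2 of the region size in bytes
  std::size_t length;       // number of regions covered

  // Region index for addr. For addresses below heap_base the unsigned
  // subtraction wraps around to a very large value.
  std::uint64_t biased_index(std::uint64_t addr) const {
    return (addr >> shift) - (heap_base >> shift);
  }

  // True only when addr falls inside the covered address range.
  bool in_range(std::uint64_t addr) const {
    return addr >= heap_base && biased_index(addr) < length;
  }
};
```

In this model, a pointer carrying a pad pattern such as 0xbabababababababa lies far outside any plausible heap range, so its computed index exceeds the table length.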
Update: It looks like a native method call (via Java_sun_reflect_NativeMethodAccessor) which is expected to return an oop is instead returning a pointer to a klass metaspace object. Perhaps it should instead be the java_mirror of the klass, or perhaps something worse than that is happening. Presumably this metaspace object is later freed, resulting in the freeBlockPad/uninitBlockPad found when an object pointer is subsequently read from this (bogus) object address.
27-07-2015
Also occurs with 1.8.0, but without the GuardedMemory uninitBlockPad in the word, since that came along later.
24-07-2015
Also reproducible with 8u40.
21-07-2015
Also reproducible with 8u51 and 8u52.
21-07-2015
> 0xf1f1f1f1nnnnnnnn (IE, the metaspace poison pattern in the upper 32 bits)
Correction: The 0xf1's are coming from Guarded Memory uninitBlockPad, not Metaspace poison. It would be good to make them unique.
21-07-2015
Added tracing showed that we are not getting a bad locals/stack count for interpreter frames by using a stale method pointer, as I thought we might be. There are two patterns of failure that all the crashes fall into: A bad object pointer which is 0xf1f1f1f1nnnnnnnn (IE, the metaspace poison pattern in the upper 32 bits), or a bad object pointer which is 0xbabababababababa, which looks like freeblockpad from GuardedMemory.
17-07-2015
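The two failure patterns can be told apart mechanically. The sketch below is a hypothetical triage helper, not HotSpot code: classify_pointer and the enum names are invented for illustration, and the byte values are taken from the bad pointers reported in this thread (0xf1 attributed to GuardedMemory uninitBlockPad by the later correction, 0xba to freeBlockPad).

```cpp
#include <cstdint>

// Pad bytes as reported in this bug's comments.
constexpr std::uint64_t kUninitPadUpper32 = 0xF1F1F1F1ULL;          // 0xf1 repeated
constexpr std::uint64_t kFreePadAll64     = 0xBABABABABABABABAULL;  // 0xba repeated

enum class BadPointerKind {
  kNone,                // matches neither reported pattern
  kUninitPadUpperHalf,  // 0xf1f1f1f1nnnnnnnn
  kFreePadAllBytes      // 0xbabababababababa
};

// Hypothetical helper: classify a suspect object pointer by pad pattern.
inline BadPointerKind classify_pointer(std::uint64_t p) {
  if ((p >> 32) == kUninitPadUpper32) {
    return BadPointerKind::kUninitPadUpperHalf;
  }
  if (p == kFreePadAll64) {
    return BadPointerKind::kFreePadAllBytes;
  }
  return BadPointerKind::kNone;
}
```

A check like this could be run over pointer values extracted from crash logs to sort them into the two groups described in the comment.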
A few more tidbits: I hit the problem a couple of times running 8u60, when running the test with G1 explicitly enabled. The segv was instead in InstanceKlass::oop_follow_contents. In the SEGV which happened when processing interp frames, the same methods were at the top of the stack being traversed - newConstructorForSerialization, etc. I removed the +PreserveFramePointer option that the test was using, and it still occurs, and also found that disabling ClassUnloadingWithConcurrentMark didn't change things. Often, but not always, there is a 'last ditch' Full GC immediately before the Full GC which triggers the SEGV. But there are always 'last ditch' full GCs somewhere before the failure.
16-07-2015
Also, enabling metaspace_slow_verify in metaspace.cpp did not trigger any verification errors before the segv.
16-07-2015
G1 is experimental for Embedded, hence it is not a showstopper for 8u60 from the Embedded point of view.
15-07-2015
After another crash, I now have 3 that are not processing interpreter frames. These all come out of:
V  [libjvm.so+0x19b9202]  void MarkSweep::mark_and_push<oop>(__type_0*)+0xd2
V  [libjvm.so+0x19b09be]  void InstanceKlass::oop_ms_follow_contents(oop)+0x3e
V  [libjvm.so+0x19b1507]  void MarkSweep::follow_stack()+0x2b7
V  [libjvm.so+0x19ba033]  void MarkSweep::follow_root<oop>(__type_0*)+0x2b3
V  [libjvm.so+0x14d4e70]  unsigned long chunk_oops_do(OopClosure*,Chunk*,char*)+0x8c0
V  [libjvm.so+0x14d4fef]  void HandleArea::oops_do(OopClosure*)+0xff
V  [libjvm.so+0x1d9e35f]  void JavaThread::oops_do(OopClosure*,CLDClosure*,CodeBlobClosure*)+0x3f
V  [libjvm.so+0x1da482b]  void Threads::possibly_parallel_oops_do(bool,OopClosure*,CLDClosure*,CodeBlobClosure*)+0xdb
V  [libjvm.so+0x1416df2]  void G1RootProcessor::process_strong_roots(OopClosure*,CLDClosure*,CodeBlobClosure*)+0x92
14-07-2015
I've now had the crash re-occur 3 times on Solaris (in 5 days of running tests). Only one of the 3 log files contains the call stack being processed by GC at the time of the crash, but it again has the same few interpreter frames on top as the two crashes reported here:
j  sun.reflect.ReflectionFactory.newConstructorForSerialization(Ljava/lang/Class;Ljava/lang/reflect/Constructor;)Ljava/lang/reflect/Constructor;+59
j  java.io.ObjectStreamClass.getSerializableConstructor(Ljava/lang/Class;)Ljava/lang/reflect/Constructor;+63
j  java.io.ObjectStreamClass.access$1500(Ljava/lang/Class;)Ljava/lang/reflect/Constructor;+1
j  java.io.ObjectStreamClass$2.run()Ljava/lang/Void;+176
I also notice that all three of the crashes I got had deopts of these in the last 10 (though the previous crashes don't):
method=java.lang.StringCoding.deref(Ljava/lang/ThreadLocal;)Ljava/lang/Object; @ 4
method=java.lang.StringCoding.encode(Ljava/lang/String;[CII)[B @ 6
Tom
14-07-2015
Seen on Solaris, hence not an Embedded-specific issue. However, it is still not clear whether it applies to 8u60.
09-07-2015
ILW=H (Crash) M(happened on 2 Aarch64 & Solaris) H (unknown) => P1
06-07-2015