Bug ID: JDK-8029679 SIGSEGV in InstanceKlass::oop_follow

Other
hs25
Resolved

I'm fairly certain that this is the same bug as JDK-8028764; see the attached generated code printout. Also, the bug has not reproduced overnight with my patched build on my workstation, whereas other runs have reproduced within 10 minutes. The code in incorrect-code-createTree is the C1 compiled version of createTree, which was executed and then patched out because we're running with -Xcomp -XX:+TieredCompilation.
03-01-2014
Cool! Thanks for verifying it.
03-01-2014
In order for the bug to occur (if it is the same bug) the compilation of the method in question should happen before the class of the object we're storing into is loaded. Otherwise the offset is known and no patching is required. That should be very rare for this failing test. In fact I was never able to even reproduce the problem.
03-01-2014
Fortunately, enabling assembly printouts also prints the oop maps; that's what I meant by looking at the generated code. Also, since C1 doesn't seem to handle narrow oops in oop maps at all, it's pretty easy to tell whether this is possibly the same bug: check whether the code compresses an oop in a register and then jumps to a patching stub.
03-01-2014
There will be no difference in the generated code. The only difference will be in the oop maps. The problem there is that we end up with two registers containing the same oop reference, one compressed, the other uncompressed. The compressed temporary value wasn't in the oop map. Since both oops refer to the same object, it will be kept alive, and VerifyBeforeGC/VerifyAfterGC are going to pass just fine. However, the compressed value is not updated when the GC moves the object, so the stale value will end up in memory and crash the VM on the next dereference.
02-01-2014
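The oop map explanation in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: the register array, the oop map set, the addresses, and the `relocate` method are all hypothetical and do not correspond to HotSpot code. It models a moving GC that rewrites only the register slots listed in the oop map, and shows that a compressed copy missing from the map keeps decoding to the old address.

```java
import java.util.HashSet;
import java.util.Set;

public class StaleNarrowOopDemo {
    // Assumption: zero-based compressed oops with 8-byte alignment (shift of 3).
    static final int SHIFT = 3;

    // Toy "registers": slot 0 holds the uncompressed oop, slot 1 the compressed copy.
    final long[] registers = new long[2];
    // Toy oop map: the register slots the GC is told contain an oop.
    final Set<Integer> oopMap = new HashSet<>();

    // Toy moving GC: relocates one object and rewrites only slots listed in the oop map.
    void relocate(long from, long to) {
        for (int slot : oopMap) {
            boolean narrow = (slot == 1);
            long current = narrow ? registers[slot] << SHIFT : registers[slot];
            if (current == from) {
                registers[slot] = narrow ? to >>> SHIFT : to;
            }
        }
    }

    // Returns the address the compressed register decodes to after the toy GC,
    // i.e. what a later field store would effectively write.
    static long storedAddressAfterGc(boolean narrowInOopMap) {
        StaleNarrowOopDemo frame = new StaleNarrowOopDemo();
        long oldAddr = 0x1000L; // hypothetical address before the move
        long newAddr = 0x2000L; // hypothetical address after the move
        frame.registers[0] = oldAddr;
        frame.registers[1] = oldAddr >>> SHIFT;
        frame.oopMap.add(0);
        if (narrowInOopMap) {
            frame.oopMap.add(1);
        }
        frame.relocate(oldAddr, newAddr);
        return frame.registers[1] << SHIFT;
    }

    public static void main(String[] args) {
        System.out.printf("narrow oop missing from map: 0x%x%n", storedAddressAfterGc(false));
        System.out.printf("narrow oop present in map:   0x%x%n", storedAddressAfterGc(true));
    }
}
```

In this model the object stays reachable through slot 0, which matches the observation that heap verification passes; only the value that is later stored is wrong.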
Erik is still on vacation, but I'm currently trying to reproduce with a build with my experimental patch for JDK-8028764. So far it hasn't reproduced (it seems to reproduce regularly otherwise on my workstation). I can't explain why it fixes the problem though. I haven't found any obvious culprit in the generated code yet.
02-01-2014
Erik, could you please check whether this is a dup of JDK-8028764 (I still have trouble reproducing it)? There is a tentative fix attached to it as a patch. The symptoms can be easily explained by missing oop map entries.
20-12-2013
Same here tonight. I was running the test in a loop on hsdev-7 overnight and no crash. Usually I get a crash on this machine after a couple of iterations. It has this weird feeling to it that when the machine is "warmed up" it doesn't happen so frequently. Do you know this feeling? ;)
18-12-2013
Could also be a coincidence. I've been running on 4 boxes (different cmd line options though) for more than 24 hours with zero failures.
18-12-2013
I'm running the test for a couple of hours now with -XX:-DoEscapeAnalysis and the bug does not happen. This is odd because there is no scalar replacing happening for the method in question (nsk.share.gc.NonbranchyTree::createTree).
18-12-2013
-XX:+VerifyRememberedSets didn't bring up anything but crashed in a similar way:
# Problematic frame:
# V [libjvm.so+0x23f9826] void ParCompactionManager::follow_marking_stacks()+0x912
Stack: [0xfffffd7ffbca2000,0xfffffd7ffbda2000], sp=0xfffffd7ffbda0850, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x23f9826] void ParCompactionManager::follow_marking_stacks()+0x912
V [libjvm.so+0x23aa1c7] void ThreadRootsMarkingTask::do_it(GCTaskManager*,unsigned)+0x1fb
V [libjvm.so+0x1265fc1] void GCTaskThread::run()+0x4f5
V [libjvm.so+0x22bb9ba] java_start+0x1ce
C [libc.so.1+0x12257d] _thrp_setup+0xa5
C [libc.so.1+0x122820] _lwp_start+0x0
17-12-2013
I could reproduce the crash with:
-XX:CompileCommand=dontinline,nsk.share.gc.NonbranchyTree::createTree -XX:+VerifyOops
which tells us that it's not a problem in the compiled code that does the field stores. Every encode_heap_oop does a verify_heapbase and verify_oop before the value is stored. This is the emitted field store (using zero-based compressed oops):
0x00007f266f9b2377: shr $0x3,%r8
0x00007f266f9b237b: mov %r8d,0xc(%r11)
17-12-2013
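For readers unfamiliar with zero-based compressed oops, the `shr $0x3` followed by a 32-bit `mov` in the comment above corresponds to the arithmetic below. This is a minimal sketch, assuming 8-byte object alignment and a heap base of zero; the class name, method names, and the sample address are made up for illustration and are not HotSpot APIs.

```java
public class ZeroBasedNarrowOop {
    // Assumption: 8-byte object alignment, hence a shift of 3 (as in "shr $0x3").
    static final int SHIFT = 3;

    // Encoding drops the three alignment bits and keeps the low 32 bits,
    // like "shr $0x3,%r8" followed by a store of %r8d.
    static int encode(long address) {
        return (int) (address >>> SHIFT);
    }

    // Decoding with a zero heap base: widen as unsigned, then shift back.
    static long decode(int narrow) {
        return (narrow & 0xFFFFFFFFL) << SHIFT;
    }

    public static void main(String[] args) {
        long address = 0x7C0000010L; // hypothetical 8-byte-aligned heap address
        int narrow = encode(address);
        System.out.printf("address=0x%x narrow=0x%x decoded=0x%x%n",
                address, narrow, decode(narrow));
    }
}
```

The round trip is lossless only for 8-byte-aligned addresses below 32 GB, which is why a stale or corrupted 32-bit value decodes to a plausible-looking but wrong address.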
Added the output from:
(gdb) thread apply all bt
The stack traces are very deep, most likely because the binary tree is created with recursive calls to createTree. I removed a lot of Java frames and added "... a lot more frames" to make the file easier to read.
16-12-2013
Moving to compiler team since we are crashing in VerifyBeforeGC. Does the deoptimization of the NonbranchyTree.createTree method have anything to do with the bug?
16-12-2013
This is most likely the same crash as reported in JDK-8030210, but we have not had time to look at JDK-8030210 yet, so we can't be 100% sure.
16-12-2013
Attached byte code for NonbranchyTree.createTree (got the byte code with hsdb).
16-12-2013
In the hs_err file, we can see the events:
Events (10 events):
Event: 42,359 Executing VM operation: ParallelGCFailedAllocation
Event: 42,359 Thread 0x0000000001a76000 DEOPT UNPACKING pc=0x00007f266f1ac1dc sp=0x0000000043d6cb28 mode 1
Event: 43,430 Executing VM operation: ParallelGCFailedAllocation done
Event: 43,430 Thread 0x0000000001a9d800 DEOPT PACKING pc=0x00007f266f9b2490 sp=0x0000000043464d00
Event: 43,430 Executing VM operation: ParallelGCFailedAllocation
Event: 43,430 Thread 0x0000000001a9d800 DEOPT UNPACKING pc=0x00007f266f1ac1dc sp=0x00000000434649b8 mode 1
Event: 44,519 Executing VM operation: ParallelGCFailedAllocation done
Event: 44,519 Thread 0x0000000001a82800 DEOPT PACKING pc=0x00007f266f9b2490 sp=0x0000000044272f40
Event: 44,519 Executing VM operation: ParallelGCFailedAllocation
Event: 44,519 Thread 0x0000000001a82800 DEOPT UNPACKING pc=0x00007f266f1ac1dc sp=0x0000000044272bf8 mode 1
We are crashing in VerifyBeforeGC in the last ParallelGCFailedAllocation. The event before that ParallelGCFailedAllocation is a DEOPT event. Using hsdb, we can find out which method is being deoptimized:
hsdb> findpc 0x00007f266f9b2490
Address 0x00007f266f9b2490: In code in NMethod for nsk/share/gc/NonbranchyTree.createTree(II)Lnsk/share/gc/Node; content: [0x00007f266f9b1d40, 0x00007f266f9b2630), code: [0x00007f266f9b1d40, 0x00007f266f9b2630), data: [0x00007f266f9b2630, 0x00007f266f9b2c28), oops: [0x00007f266f9b2630, 0x00007f266f9b2638), frame size: 80
We know that we are crashing in VerifyBeforeGC when reading a bad oop in the field "left" of a "Node" object. Looking at the source file (attached, see NonbranchyTree.java), there is only one relevant method that writes the "left" field of a "Node": NonbranchyTree.createTree. The field "left" is also set by the NonbranchyTree.bend method, but the only test that uses NonbranchyTree.bend is gc/gctests/JumbleGC002/JumbleGC002.java, which we aren't running.
16-12-2013
Attached createTree.s which contains the compiled version of NonbranchyTree.createTree. I got hold of the compiled code using HSDB on a core file.
16-12-2013
Attached NonbranchyTree.java which contains the Java source code for the class Node and the createTree method.
16-12-2013
Attached hs_err file from crash
16-12-2013
Reproducer:
while fastdebug/bin/java -cp classes/ -Xcomp -XX:+UseParallelGC -XX:-UseGCOverheadLimit gc.gctests.Steal.steal001.steal001 ; do date ; sleep 1 ; done
This reproducer has been successful on machines with 32-256GB RAM and 16 cores or more. The reproducer is extremely sensitive, and the slightest change in the setup causes the test to run longer and/or not reproduce.
Reproduced with builds:
JDK8-b118
JDK8-b100
Succeeded to reproduce with:
* -XX:+UseParallelGC
* -XX:+UseSerialGC
* -XX:+VerifyBeforeGC <= caught in the heap verification code done before the GC started, wasn't caught by the verification after the previous GC (-XX:+VerifyAfterGC).
Failed to reproduce with:
* -Xcomp removed
* -Xint instead of -Xcomp
* -XX:-TieredCompilation
* -XX:TieredStopAtLevel=3
* Different heap sizes, nursery sizes, tenuring thresholds
* -XX:+DeoptimizeALot
* -XX:+TraceDeoptimization
From hs_err/core file:
The crash is always caused by a broken field in a nsk.share.gc.Node. The hs_err files report:
nsk.share.gc.Node
 - klass: 'nsk/share/gc/Node'
 - ---- fields (total size 3 words):
 - 'left' 'Lnsk/share/gc/Node;' @12
[error occurred during error reporting (printing register info), id 0xb]
At the time of the crash, the broken Node object is pointed to from only one other Node object.
15-12-2013
I caught the bug in VerifyBeforeGC. This indicates that it might not be a GC bug.
12-12-2013
I can reproduce on my local Linux machine and also with -XX:+UseSerialGC instead of the parallel GC.
10-12-2013
ILW => HMH => P1
Impact: High, crashes.
Likelihood: Medium, easy to reproduce with given test.
Workaround: High, unknown workaround.
I've been able to reproduce the issue with both a product and a fastdebug build, and have also been able to trigger a guarantee with VerifyBeforeGC.
09-12-2013
While reproducing, I once got the following assertion:
# Internal Error (/HUDSON/workspace/8-2-build-linux-amd64/jdk8/872/hotspot/src/share/vm/oops/klass.inline.hpp:63), pid=31115, tid=140660658288384
# assert(!is_null(v)) failed: narrow klass value can never be zero
Another time I got an assertion three lines later in the code:
# Internal Error (/HUDSON/workspace/8-2-build-linux-amd64/jdk8/872/hotspot/src/share/vm/oops/klass.inline.hpp:66), pid=9725, tid=140353251411712
# assert(check_klass_alignment(result)) failed: address not aligned: 0x0000000800000015
Steps to reproduce: go to host slcak937.us.oracle.com, dir /export/local/repr1, and run the rerun.sh script. The core file is in the same folder.
09-12-2013
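The second assertion above reports the address 0x0000000800000015 as misaligned. The check can be reproduced with plain bit arithmetic. This is a sketch only: the 8-byte alignment constant and the `isAligned` helper are assumptions for illustration and are not taken from klass.inline.hpp.

```java
public class KlassAlignmentCheck {
    // Assumption: klass pointers are expected to be 8-byte aligned in this build.
    static final long ALIGNMENT = 8;

    // An address is aligned when none of the bits below the alignment are set.
    static boolean isAligned(long address) {
        return (address & (ALIGNMENT - 1)) == 0;
    }

    public static void main(String[] args) {
        long reported = 0x0000000800000015L; // value from the assertion message
        System.out.printf("0x%016x aligned: %b (low bits: 0x%x)%n",
                reported, isAligned(reported), reported & (ALIGNMENT - 1));
    }
}
```

Since the reported address is odd, it fails this check for any power-of-two alignment of 2 or more, which is consistent with a garbage value having been read where a narrow klass was expected.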

Duplicate :	JDK-8030210 - HS24 crashing duing GC while following stacks on hosts with a lot of CPU cores and memory
Duplicate :	JDK-8028764 - dtrace/hotspot_jni/ALL/ALL001 crashes the vm on Solaris-amd64, SIGSEGV in MarkSweep::follow_stack()+0x8a
Relates :	JDK-8153580 - Crash - InstanceKlass::oop_follow_contents(ParCompactionManager, oopDesc)
Relates :	JDK-8030210 - HS24 crashing duing GC while following stacks on hosts with a lot of CPU cores and memory