JDK-8028764 : dtrace/hotspot_jni/ALL/ALL001 crashes the vm on Solaris-amd64, SIGSEGV in MarkSweep::follow_stack()+0x8a
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: hs25
  • Priority: P1
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris
  • CPU: generic
  • Submitted: 2013-11-21
  • Updated: 2014-02-18
  • Resolved: 2014-01-14
Fix versions:
  • JDK 8: 8 (Fixed)
  • JDK 9: 9 (Fixed)
  • Other: hs25 (Fixed)
Related Reports
Duplicate :  
Description
The test crashed the vm in promotion testing of JDK8_b116 (2013-11-14).

<snip>
dtrace_out> pid_jni_return: jni_GetObjectField                            10824
dtrace_out> pid_jni_return: jni_GetStringCritical                         15960
dtrace_out> pid_jni_return: jni_ReleaseStringCritical                     15960
dtrace_out> pid_jni_return: jni_GetStringLength                           25807
dtrace_out> 
dtrace_out> 

java_out> javasoft.sqe.serial.StreamObjectClass@45d84a20
java_out> true
java_out> Test count: 323
java_out> Thread count: 323
java_out> #
java_out> # A fatal error has been detected by the Java Runtime Environment:
java_out> #
java_out> #  SIGSEGV (0xb) at pc=0xfffffd7ffe7c930a, pid=20640, tid=27
java_out> #
java_out> # JRE version: Java(TM) SE Runtime Environment (8.0-b116) (build 1.8.0-ea-b116)
java_out> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b58 compiled mode solaris-amd64 compressed oops)
java_out> # Problematic frame:
java_out> # V  [libjvm.so+0xfc930a]  void MarkSweep::follow_stack()+0x8a
java_out> #
java_out> # Core dump written. Default location: /bpool/local/aurora/sandbox/results/ResultDir/ALL001/core or core.20640
java_out> #
java_out> # An error report file with more information is saved as:
java_out> # /bpool/local/aurora/sandbox/results/ResultDir/ALL001/hs_err_pid20640.log
java_out> #
java_out> # If you would like to submit a bug report, please visit:
java_out> #   http://bugreport.sun.com/bugreport/crash.jsp
java_out> #
</snip>

The test failed on all other platforms due to JDK-6524097

Priority justification:
ILW = HMM => P2

Link to failure: http://vmsqe-app.russia.sun.com/surl/5S

Link to test history: http://vmsqe-app.russia.sun.com/surl/5T

Matching rule:
RULE dtrace/hotspot_jni/ALL/ALL001 Crash SIGSEGV

Comments
$ java -server -showversion -Xcomp -XX:+VerifyOops -XX:+UseSerialGC -XX:+ScavengeALot -Xmn10m -cp $(JCK_6b)/classes/ -Djava.library.path=$(JCK_6b)/lib/SunOS.amd64 -XX:+UseCompressedOops -XX:TieredStopAtLevel=1 javasoft.sqe.tests.vm.jni.call001.call00101m21.call00101m21 -platform.nativeCodeSupported true
java version "1.8.0-fastdebug"
Java(TM) SE Runtime Environment (build 1.8.0-fastdebug-b126)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b67-fastdebug, compiled mode)
+VerifyOops count: 553921

$ java -server -showversion -Xcomp -XX:+VerifyOops -XX:+UseSerialGC -XX:+ScavengeALot -Xmn10m -cp $(pwd)/classes/ -Djava.library.path=$(pwd)/lib/SunOS.amd64 -XX:+UseCompressedOops -XX:TieredStopAtLevel=1 javasoft.sqe.tests.vm.jni.call001.call00101m21.call00101m21 -platform.nativeCodeSupported true; echo $?
java version "1.8.0-ea-fastdebug"
Java(TM) SE Runtime Environment (build 1.8.0-ea-fastdebug-b125)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b66-fastdebug, compiled mode)
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc: SuppressErrorAt=/genOopClosures.inline.hpp:114
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/tmp/workspace/8-2-build-solaris-amd64/jdk8/1590/hotspot/src/share/vm/memory/genOopClosures.inline.hpp:114), pid=15159, tid=3
# assert(!_g->to()->is_in_reserved(obj)) failed: Scanning field twice?
30-01-2014

Regarding the lack of a test case: reproducing the bug requires a GC to trigger at a very specific point during method execution, a point that is hit only on the first method invocation. A reliable test case is hard to implement.
14-01-2014

Release team: Approved for fixing
07-01-2014

ILW=HMH=P1
Impact: Unavoidable crash
Likelihood: Not a 100% reproducible case but it occurs frequently
Workaround: None
Justification: The cause for this bug is that we can hit a safepoint in C1's PatchingStub when we are emitting code for a "putstatic". If we are running with UseCompressedOops enabled, we temporarily store the oop we are going to store in the static field in rscratch1 to compress it before we enter the patching stub. If we get a GC when executing in the patching stub, the object may have been moved and the compressed oop will point to the wrong location.
Risk assessment: Low risk to add a check for the patching code point and to record the scratch register as a narrow oop. Not fixing this means users will see JVM crashes in common configuration runs.
Testing done:
  jtreg: java/lang, java/util, hotspot/compiler, hotspot/gc, hotspot/runtime
  ute: nsk.stress, vm.compiler, vm.regression, nsk.regression, nsk.monitoring
  All of that with -Xcomp -XX:TieredStopAtLevel=1 -XX:+ScavengeALot
06-01-2014

Attached an example with incorrect oopmaps (%r10d should be marked as a narrowoop across the patch call). Attached a crude fix.
19-12-2013

The cause for this bug is that we can hit a safepoint in C1's PatchingStub when we are emitting code for a "putstatic". If we are running with UseCompressedOops enabled, we temporarily store the oop we are going to store in the static field in rscratch1 to compress it before we enter the patching stub. If we get a GC when executing in the patching stub, the object may have been moved and the compressed oop will point to the wrong location. We should either make sure to add rscratch1 to the oop map (as a narrow oop) or delay the encoding of the oop until we are finished with the patching.
19-12-2013
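
The failure mode described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HotSpot code: the heap base, the shift, the addresses, and the function `run_putstatic` are all hypothetical, and the model assumes the uncompressed oop stays in a register that the GC does know about.

```python
HEAP_BASE = 0x1000   # hypothetical heap base for the toy encoding
SHIFT = 3            # hypothetical shift (8-byte object alignment)


def encode(addr):
    """Compress a full address into a narrow oop (toy encoding)."""
    return (addr - HEAP_BASE) >> SHIFT


def decode(narrow):
    """Expand a narrow oop back into a full address (toy encoding)."""
    return HEAP_BASE + (narrow << SHIFT)


def run_putstatic(scratch_in_oop_map=False, delay_encoding=False):
    """Return True if the stored field still points at the live object."""
    old_addr, new_addr = 0x1040, 0x1080
    live = {old_addr}                 # addresses of live objects
    value_reg = old_addr              # full oop, assumed to be in the oop map
    # Buggy order: compress before entering the patching stub.
    scratch = None if delay_encoding else encode(value_reg)

    # Safepoint inside the patching stub: a moving GC relocates the object.
    live = {new_addr}
    value_reg = new_addr              # the GC updates the roots it knows about
    if scratch is not None and scratch_in_oop_map:
        scratch = encode(new_addr)    # a recorded narrow oop is updated as well

    if delay_encoding:
        scratch = encode(value_reg)   # compress only after patching is done

    static_field = scratch            # the store performed by putstatic
    return decode(static_field) in live


print(run_putstatic())                         # False: stale narrow oop
print(run_putstatic(scratch_in_oop_map=True))  # True: first proposed fix
print(run_putstatic(delay_encoding=True))      # True: second proposed fix
```

Both proposed fixes end with the field pointing at the relocated object; only the unrecorded, early-compressed value goes stale.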

It reproduces on 7u45 with -XX:+TieredCompilation on Linux as well.
19-12-2013

This reproduces with 7u45 if I enable +TieredCompilation
19-12-2013

This crash also occurs with +ScavengeALot, at an early point in the test. Someone has written an incorrect compressed pointer to an object in sun.security.provider.PolicyParser.parseGrantEntry; the field e.codeBase is broken. There have not been any allocations since the previous GC, and the object which should be pointed to by the broken reference has been scavenged 7 times.
18-12-2013

We almost always assert in VerifyBeforeGC, and in the case when we assert in VerifyAfterGC, then we should have asserted in VerifyBeforeGC (see my comment above).
18-12-2013

So, we've managed to reproduce the crash with the following GC combinations:
- ParNew + CMS
- ParNew + Serial
- DefNew + Serial
We have reproduced on the following operating systems:
- Linux OEL x86-64
- Solaris x86-64
18-12-2013

The crash with DefNew and Serial failed in VerifyAfterGC but _should_ have failed in VerifyBeforeGC. When looking at the object with the forward pointer in it, we can see that the field was wrong for the object in from space as well. However, the assertions in VerifyBeforeGC are too weak, so even though the field points into the middle of a char array, it passes all the checks.
18-12-2013
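
To illustrate why a weak verification pass can miss a reference into the middle of a char array, here is a minimal sketch. It is not the VerifyBeforeGC implementation; the object layout, addresses, and the functions `weak_check` and `strong_check` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapObject:
    start: int   # address of the object header
    size: int    # size in bytes
    kind: str    # type name, for readability only


def weak_check(addr, heap_start, heap_end):
    """Accept any address that lies inside the reserved heap range."""
    return heap_start <= addr < heap_end


def strong_check(addr, objects):
    """Accept only addresses that are exactly the start of a known object."""
    return any(obj.start == addr for obj in objects)


# Hypothetical heap: a char array followed by a String.
objects = [
    HeapObject(0x1000, 0x40, "char[]"),
    HeapObject(0x1040, 0x18, "java.lang.String"),
]
bad_ref = 0x1020  # points into the middle of the char array

print(weak_check(bad_ref, 0x1000, 0x2000))  # True: the bad reference passes
print(strong_check(bad_ref, objects))       # False: the bad reference is caught
```

A range check alone accepts the bad reference, while a check against object start addresses rejects it.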

This issue has now reproduced on a similar machine as well. I've been trying different instrumented builds to try to catch the issue earlier but without luck so far.
17-12-2013

One common pattern seen in the crashing runs is that we've performed more than 1 young collection before the test starts calling System.gc() repeatedly. I'm trying to rerun the test with a smaller young gen size to see if multiple young collections are somehow related to the problem.
13-12-2013

The second crash has the following situation:
Object1 : a call00101m21
Object2 : the java.lang.Class instance for call00101m21_2
Object1 and Object2 have been promoted into the CMS generation as part of the full gc we are running. The original versions of Object1 and Object2 in from-space both have forwarding pointers installed which point to their respective locations in the CMS gen.
Object2 has a field (which is in fact a static field on the class call00101m21_2) which points to Object1. This field, however, points to the location of Object1 in from-space instead of its new location in the CMS gen.
There is one more reference to Object2, from the stack of the function call00101m21.testChecks. This reference has been successfully updated to point to the new location.
The other static fields in the java.lang.Class instance seem to be correct; the field immediately prior to the broken one has been successfully updated by the compaction.
12-12-2013

I'm trying to reproduce this with a reduced set of VM options, just using ParNew+Serial since CMS had not actually done any work when we crashed. So far this crash has only been observed on the host with the original crash, but I have not spent a large effort on reproducing it on another host. Since this bug is reproducible and there is no known work-around I suggest a new ILW for this bug: I=H (VM crash) L=M (reproduces, but not every run) W=H (no work around known)
12-12-2013

I've been able to reproduce this crash overnight. My theory was that this may not be related to dtrace, so I ran the test program without dtrace overnight and caught the problem with +VerifyAfterGC. The symptom is the same as the previously analyzed crash but this time I've caught this in a live debugger just at the end of the "bad" GC.
12-12-2013

I've reverse-engineered the memory around the object reference that we crash on and it looks like we found the bad oop in a static field for the class "call00101m28_2". The mirror which contains this oop was copied into the old gen at the previous full GC (the forwarding pointer still remains at the old copy of the object in the survivor area). Somehow the field with the compressed value 0x04474e36 was not handled by the previous full gc, and therefore still points into the survivor area.
0x04474e36 points to the object just before the mirror's previous location (in the survivor area) but the klass pointer of that object is invalid: 0x16. I can, however, find the previous location for the broken object by looking for a forwarding pointer pointing to its location in the survivor area; that object has the klass pointer 0x16d0. This uncompresses to java/lang/String and is consistent with what the java sources for the test do: they assign the field with a "new String()".
I cannot explain why:
1) We somehow "forgot" to move the object and update its reference accordingly.
2) The bad object has a klass* with a value of (in platform native endian): 0x16 0x0 0x0 0x0 instead of (again, native endian) 0xd0 0x16 0x0 0x0
11-12-2013
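
The two byte sequences quoted in point 2 can be reproduced with a short sketch. It assumes a 4-byte compressed klass field and little-endian byte order, which is the native order on amd64; the variable names are illustrative only.

```python
import struct

# Pack both values as 4-byte little-endian unsigned integers.
expected = struct.pack("<I", 0x16d0)  # klass pointer of the intact String
observed = struct.pack("<I", 0x16)    # klass pointer read from the bad object

print(expected.hex(" "))  # d0 16 00 00
print(observed.hex(" "))  # 16 00 00 00
```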

I've started looking at the core-file and it seems like we crash after decompressing an invalid compressed class pointer. We read the compressed class pointer from R14, which according to the core-file is an unallocated location on the heap. The compressed class pointer we get has the value 0x16, which is invalid (must be 8 byte aligned); decompressing this leads to storing 0xfffffd77c0000016 in RDI. This is expected to be the correct class pointer, and we read its v-table into RCX and try to make a call, and that's when we crash.
Below are the registers from the hs_err and some comments on the disassembly near the crash.

Registers:
RAX=0x0000000000000000, RBX=0xfffffd7ffeec0f9b, RCX=0x000001188f000000, RDX=0x000000000000017d
RSP=0xfffffd7fc3813050, RBP=0xfffffd7fc3813160, RSI=0xfffffd78625a61b0, RDI=0xfffffd77c0000016
R8 =0x0000000002897480, R9 =0xfffffd7fc3813240, R10=0x0000000000000000, R11=0xfffffffffbc01d78
R12=0xfffffd7ffef072d0, R13=0xfffffd7ffef793d8, R14=0xfffffd78625a61b0, R15=0xfffffd7ffef071c0
RIP=0xfffffd7ffe7c930a, RFLAGS=0x0000000000010246

0xfffffd7ffe7c92e7: follow_stack+0x0067: cmpb $0x0000000000000000,(%rbx) # if (!UseCompressedClassPointers)
0xfffffd7ffe7c92ea: follow_stack+0x006a: jne follow_stack+0x72 [ 0xfffffd7ffe7c92f2, .+8 ]
0xfffffd7ffe7c92ec: follow_stack+0x006c: movq 0x0000000000000008(%r14),%rdi
0xfffffd7ffe7c92f0: follow_stack+0x0070: jmp follow_stack+0x81 [ 0xfffffd7ffe7c9301, .+0x11 ]
0xfffffd7ffe7c92f2: follow_stack+0x0072: movl 0x0000000000000008(%r14),%edi # edi = 0x16 (the compressed class pointer read from the trashed object)
0xfffffd7ffe7c92f6: follow_stack+0x0076: movl 0x0000000000000008(%r13),%ecx # ecx = 0 (narrow_klass shift)
0xfffffd7ffe7c92fa: follow_stack+0x007a: shlq %cl,%rdi
0xfffffd7ffe7c92fd: follow_stack+0x007d: addq 0x0000000000000000(%r13),%rdi # rdi+= 0xfffffd77c0000000 (narrow_klass base) => 0xfffffd77c0000016
0xfffffd7ffe7c9301: follow_stack+0x0081: movq (%rdi),%rcx # rcx = 0x000001188f000000 (expecting to read the v-table)
0xfffffd7ffe7c9304: follow_stack+0x0084: movq %r14,%rsi # rsi = 0xfffffd78625a61b0 (the trashed object)
0xfffffd7ffe7c9307: follow_stack+0x0087: xorq %rax,%rax
0xfffffd7ffe7c930a: follow_stack+0x008a: call *0x0000000000000108(%rcx) # SIGSEGV
26-11-2013
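
The decoding arithmetic annotated in the disassembly can be replayed as a small sketch. The base and shift are the values noted in the comments on the disassembly, and 0x16d0 is the klass value for java/lang/String mentioned in a later comment; the function names and the assumption that the same base and shift apply to both values are illustrative only.

```python
NARROW_KLASS_BASE = 0xfffffd77c0000000  # value added from 0(%r13)
NARROW_KLASS_SHIFT = 0                  # value loaded into %ecx from 8(%r13)
KLASS_ALIGNMENT = 8                     # alignment a klass pointer is said to need


def decode_narrow_klass(narrow, base=NARROW_KLASS_BASE, shift=NARROW_KLASS_SHIFT):
    """Mirror the shlq/addq pair: (narrow << shift) + base, kept to 64 bits."""
    return ((narrow << shift) + base) & 0xFFFFFFFFFFFFFFFF


def is_aligned(address, alignment=KLASS_ALIGNMENT):
    """Check the alignment requirement only; this is not a full validity test."""
    return address % alignment == 0


bad = decode_narrow_klass(0x16)
good = decode_narrow_klass(0x16d0)

print(hex(bad), is_aligned(bad))    # 0xfffffd77c0000016 False
print(hex(good), is_aligned(good))  # 0xfffffd77c00016d0 True
```

The misaligned result matches the RDI value in the register dump, which is why the subsequent v-table load reads garbage.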