JDK-8293648 : Kitchensink.java failed with SIGSEGV in compiled frame with ZGC
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 20
  • Priority: P2
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86_64
  • Submitted: 2022-09-12
  • Updated: 2022-10-01
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 20
20Unresolved
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Description
The following test failed in the JDK20 CI:

applications/kitchensink/Kitchensink.java

Here's a snippet from the log file:

[2022-09-12T00:42:08.014733969Z] Waiting for completion for process 27291
[2022-09-12T00:42:08.014879193Z] Waiting for completion finished for process 27291
Output and diagnostic info for process 27291 was saved into 'pid-27291-output.log'
[stress.process.out] #
[stress.process.out] # A fatal error has been detected by the Java Runtime Environment:
[stress.process.out] #
[stress.process.out] #  SIGSEGV (0xb) at pc=0x00007ffae9aff6d2, pid=20136, tid=29472
[stress.process.out] #
[stress.process.out] # JRE version: Java(TM) SE Runtime Environment (20.0+15) (build 20-ea+15-982)
[stress.process.out] # Java VM: Java HotSpot(TM) 64-Bit Server VM (20-ea+15-982, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-amd64)
[stress.process.out] # Problematic frame:
[stress.process.out] # J 71946 c2 java.net.URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V java.base@20-ea (543 bytes) @ 0x00007ffae9aff6d2 [0x00007ffae9afb920+0x0000000000003db2]
[stress.process.out] #
[stress.process.out] # Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8517/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/2b146f44-80bc-455b-a4d2-ff92e9196cc0/runs/3e63afe4-a31a-4f2b-891d-56aab567a64e/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/core.20136)
[stress.process.out] #
[stress.process.out] # JFR recording file will be written. Location: /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8517/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/2b146f44-80bc-455b-a4d2-ff92e9196cc0/runs/3e63afe4-a31a-4f2b-891d-56aab567a64e/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/hs_err_pid20136.jfr
[stress.process.out] #
[stress.process.out] Unsupported internal testing APIs have been used.
[stress.process.out] 
[stress.process.out] # An error report file with more information is saved as:
[stress.process.out] # /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S8517/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/2b146f44-80bc-455b-a4d2-ff92e9196cc0/runs/3e63afe4-a31a-4f2b-891d-56aab567a64e/testoutput/test-support/jtreg_closed_test_hotspot_jtreg_applications_kitchensink_Kitchensink_java/scratch/0/hs_err_pid20136.log
[stress.process.out] [481.935s][warning][os] Loading hsdis library failed
[stress.process.out] #
[stress.process.out] # If you would like to submit a bug report, please visit:
[stress.process.out] #   https://bugreport.java.com/bugreport/crash.jsp
[stress.process.out] #
[2022-09-12T00:43:08.019639314Z] Gathering output for process 29816
[2022-09-12T00:43:08.025734419Z] Waiting for completion for process 29816
[2022-09-12T00:43:08.025829980Z] Waiting for completion finished for process 29816
Output and diagnostic info for process 29816 was saved into 'pid-29816-output.log'

Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x00007ff8100225c0):  JavaThread "Program Runner for serial" [_thread_in_Java, id=29472, stack(0x00007ff85d452000,0x00007ff85d553000)]

Stack: [0x00007ff85d452000,0x00007ff85d553000],  sp=0x00007ff85d550628,  free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 71946 c2 java.net.URL.<init>(Ljava/net/URL;Ljava/lang/String;Ljava/net/URLStreamHandler;)V java.base@20-ea (543 bytes) @ 0x00007ffae9aff6d2 [0x00007ffae9afb920+0x0000000000003db2]
C  0x00007ff85d5507d8


siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000081
Comments
I filed JDK-8294538 for the missing is_unloading() check in SharedRuntime::fixup_callers_callsite().
28-09-2022

Let's do a separate bug for that issue. We can always close this one later as a duplicate if we are confident that it is the same issue.
22-09-2022

Or maybe we should use this bug for that. [~dlong] what do you think?
22-09-2022

Should I file a bug for the lack of is_unloading() check in SharedRuntime::fixup_callers_callsite?
20-09-2022

Ah, I see. Thanks for the context. I also noticed that SharedRuntime::fixup_callers_callsite() patches callsites if the c2i adapter's Method's code is is_in_use(). It seems to miss an "&& !is_unloading()" in there. In practice that has not made a big difference until I removed the sweeper. Because nmethods would typically be is_unloading, because the Method is also unloading, in which case the c2i adapter entry barrier will take care of it. But now that an nmethod can become is_unloading also because it's "cold" or such, its Method is way more likely to be is_alive. And then we can get past the c2i adapter entry barrier, and end up calling fixup_callers_callsite, and miss the is_unloading check on the code, and break unlinking monotonicity of the inline cache cleaning. That could have pretty disasterous consequences when using ZGC.
19-09-2022

0x2000 is the value in the RSI register. The caller thinks it is making a static call to ByteBuffer.allocate(size), with size = 0x2000, but the target nmethod is not static so 0x2000 is treated as the "this" object (receiver).
19-09-2022

Wow, it seems like the c2i adapter would call SharedRuntime::fixup_callers_callsite to patch the inline cache to call the code of the garbage Method. That could mess up an inline cache after it has been cleaned by the GC, to point into an is_unloading nmethod, meaning that unlinking isn't monotonic. I guess many interesting things can happen if we call through such a static call stub from C2, that is being concurrently but non-atomically cleared by concurrent code cache unloading code.
19-09-2022

Where do you see 0x2000? Anyway, I just spoke to Tobias about another thing and randomly remembered that static call stubs seem to be carefully aligned for C1 but not for C2. Filed a bug for that here: https://bugs.openjdk.org/browse/JDK-8294002 An interesting observation is that if C2 uses CHA to emit an optimized virtual call inline cache, and its static call stub is not appropriately aligned, then you can end up with an interesting problem with a GC with concurrent class unloading, such as ZGC, which is what seems to crash. Let's say the target method is unloading, and we wake up from the marking termination pause, but have not yet started concurrent code cache unloading. A new class is loaded that inherits from the abstract class we did CHA on. When trying to find CHA violations, the class hierarchy iterator only exposes the live classes, and hence no deopt is triggered for the C2 nmethod. This is typically fine. Racing calls in the C2 nmethod catch the staleness of the static call stub of the callsite with the BarrierSetAssembler::c2i_entry_barrier. It checks if rbx is NULL or an !is_alive method, and takes a slow path. When the GC starts unlinking nmethods, it will clear the Method* in the instruction stream that sets rbx in the static call stub. However, for this to work as expected, the non-zero parts of the Method* being concurrently cleared, need to not cross a cache line. And since it seems like we only align C1 static call stub call sites, and not C2 ones, a racing mutator calling through this static call stub, could observe out of thin air Method pointers, which the BarrierSetAssembler::c2i_entry_barrier may or may not send to the slow path. After that point, we have called into a dead method, and I suppose a lot of strange things could happen at that point. It strikes me that in the second crash. the caller frame that called the wrong method is C2 compiled, and we are running with ZGC. Not sure this is what is actually happening, but it beats me that theoretically, this is a way in which we could call the wrong method from C2 frames.
19-09-2022

[~eosterlund], do you have any theories on these crashes?
17-09-2022

For crash #2, we have this strange situation: J 8167823 c1 jdk.internal.org.objectweb.asm.ClassReader.readLong(I)J java.base@20-ea (29 bytes) @ 0x00007f08d505a01c [0x00007f08d5059fa0+0x000000000000007c] J 86342 c2 java.io.OutputStreamWriter.write([CII)V java.base@20-ea (11 bytes) @ 0x00007f08dce6b0b8 [0x00007f08dce6a900+0x00000000000007b8] But OutputStreamWriter.write() does NOT call ClassReader.readLong(). That call site is doing this: 0x00007f08dce6b0b0: ; {static_call} 0x00007f08dce6b0b0: 6666 90e8 0x00007f08dce6b0b4: ; ImmutableOopMap {[0]=Oop [8]=Oop [24]=Oop [32]=Oop [40]=Oop [48]=Oop } ;*invokestatic allocate {reexecute=0 rethrow=0 return_oop=1} ; - sun.nio.cs.StreamEncoder::growByteBufferIfNeeded@52 (line 402) ; - sun.nio.cs.StreamEncoder::implWrite@18 (line 371) ; - sun.nio.cs.StreamEncoder::implWrite@11 (line 361) ; - sun.nio.cs.StreamEncoder::lockedWrite@49 (line 162) ; - sun.nio.cs.StreamEncoder::write@30 (line 143) ; - java.io.OutputStreamWriter::write@7 (line 220) So we appear to have a call site calling the wrong method. This explains where the 0x2000 value came from.
17-09-2022

For crash #1, the slot at SP+0 is a return address for a call site in the nmethod for jdk.internal.loader.BuiltinClassLoader::defineClass, which is calling directly into the "scopes pcs" section of the nmethod for java.net.URL::<init>. Somehow the call site seems to have a stale target.
17-09-2022

Here's the crashing thread's stack for the jdk-20+15-1005-tier8 sighting: applications/kitchensink/Kitchensink24HStress.java --------------- T H R E A D --------------- Current thread (0x00007f060815c4f0): JavaThread "Program Runner for xml.transform" [_thread_in_Java, id=2181635, stack(0x00007f06710aa000,0x00007f06711ab000)] Stack: [0x00007f06710aa000,0x00007f06711ab000], sp=0x00007f06711a8c70, free space=1019k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) J 8167748 c2 jdk.internal.org.objectweb.asm.ClassReader.readInt(I)I java.base@20-ea (52 bytes) @ 0x00007f08dd163fbd [0x00007f08dd163f80+0x000000000000003d] J 8167823 c1 jdk.internal.org.objectweb.asm.ClassReader.readLong(I)J java.base@20-ea (29 bytes) @ 0x00007f08d505a01c [0x00007f08d5059fa0+0x000000000000007c] J 86342 c2 java.io.OutputStreamWriter.write([CII)V java.base@20-ea (11 bytes) @ 0x00007f08dce6b0b8 [0x00007f08dce6a900+0x00000000000007b8] J 8183901 c1 nu.xom.UnsynchronizedBufferedWriter.write(Ljava/lang/String;II)V (74 bytes) @ 0x00007f08d562bafc [0x00007f08d562b9c0+0x000000000000013c] j nu.xom.UnsynchronizedBufferedWriter.write(Ljava/lang/String;)V+7 J 8183894 c1 nu.xom.UnicodeWriter.writeMarkup(Ljava/lang/String;)V (93 bytes) @ 0x00007f08d55b90d4 [0x00007f08d55b8fe0+0x00000000000000f4] J 8183895 c1 nu.xom.Serializer.writeRaw(Ljava/lang/String;)V (9 bytes) @ 0x00007f08d562b634 [0x00007f08d562b5a0+0x0000000000000094] j nu.xom.canonical.Canonicalizer$CanonicalXMLSerializer.writeEndTag(Lnu/xom/Element;)V+29 j nu.xom.canonical.Canonicalizer$CanonicalXMLSerializer.write(Lnu/xom/Element;)V+102 j nu.xom.Serializer.writeChild(Lnu/xom/Node;)V+12 j nu.xom.canonical.Canonicalizer$CanonicalXMLSerializer.write(Lnu/xom/Document;)V+46 j nu.xom.canonical.Canonicalizer$CanonicalXMLSerializer.write(Lnu/xom/Node;)V+12 j nu.xom.canonical.Canonicalizer.write(Lnu/xom/Node;)V+122 j spec.benchmarks.xml.transform.ExtOutputStream.canonicalize(Ljava/lang/String;Ljava/lang/String;)V+43 j spec.benchmarks.xml.transform.ExtOutputStream.check(Ljava/lang/String;)V+18 j spec.benchmarks.xml.transform.ExtOutputStream.checkResult(I)V+34 j spec.benchmarks.xml.transform.Main.transform(Ljavax/xml/transform/Transformer;Ljavax/xml/transform/Source;Ljava/lang/String;I)V+27 j spec.benchmarks.xml.transform.Main.executeWorkload()V+99 j spec.benchmarks.xml.transform.Main.harnessMain()V+8 j spec.benchmarks.xml.transform.Main.setupBenchmark()V+182 j java.lang.invoke.DirectMethodHandle$Holder.invokeStatic(Ljava/lang/Object;)V+9 java.base@20-ea j java.lang.invoke.LambdaForm$MH+0x000000080152d000.invoke(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+12 java.base@20-ea J 719982 c2 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@20-ea (108 bytes) @ 0x00007f08dce32b4c [0x00007f08dce30f60+0x0000000000001bec] j spec.harness.ProgramRunner.invokeBmSetupBenchmark()V+9 j spec.harness.ProgramRunner.runBenchmark()Z+646 j spec.harness.ProgramRunner.run()V+16 v ~StubRoutines::call_stub 0x00007f08dc52bcbf V [libjvm.so+0x8a4435] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x315 (javaCalls.cpp:417) V [libjvm.so+0x8a5c62] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x1d2 (javaCalls.cpp:331) V [libjvm.so+0x97c3ce] thread_entry(JavaThread*, JavaThread*)+0x8e (jvm.cpp:2876) V [libjvm.so+0x8bdb8d] JavaThread::thread_main_inner()+0xed (javaThread.cpp:700) V [libjvm.so+0xe47240] Thread::call_run()+0xc0 (thread.cpp:224) V [libjvm.so+0xc7f778] thread_native_entry(Thread*)+0xd8 (os_linux.cpp:710) siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000002020
16-09-2022

ILW = crash; intermittent; no workaround = HLH = P2
12-09-2022

The crash is in the "scopes pcs" section of the nmethod, not the code section.
12-09-2022