JDK-8295969 : compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java fails with SIGILL with ZGC
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 20
  • Priority: P2
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: aarch64
  • Submitted: 2022-10-27
  • Updated: 2022-10-30
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 20
20Unresolved
Related Reports
Relates :  
Relates :  
Description
The following test failed in the JDK20 CI:

compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java

Here's a snippet from the log file:

#section:main
----------messages:(6/530)----------
command: main -XX:TieredCompileTaskTimeout=1000 -XX:CompileThresholdScaling=0.001 -XX:CICompilerCount=2 compiler.classUnloading.methodUnloading.TestOverloadCompileQueues
reason: User specified action: run main/othervm/timeout=300 -XX:TieredCompileTaskTimeout=1000 -XX:CompileThresholdScaling=0.001 -XX:CICompilerCount=2 compiler.classUnloading.methodUnloading.TestOverloadCompileQueues 
started: Thu Oct 27 00:47:45 UTC 2022
Mode: othervm [/othervm specified]
finished: Thu Oct 27 00:47:58 UTC 2022
elapsed time (seconds): 13.201
----------configuration:(0/0)----------
----------System.out:(19/1413)----------
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x0000fffcc4f14e00, pid=2403650, tid=2403681
#
# JRE version: Java(TM) SE Runtime Environment (20.0+21) (fastdebug build 20-ea+21-1539)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-ea+21-1539, mixed mode, sharing, tiered, compressed class ptrs, z gc, linux-aarch64)
# Problematic frame:
# C  0x0000fffcc4f14e00
#
# Core dump will be written. Default location: Core dumps may be processed with "/opt/core.sh %p" (or dumping to /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S107584/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b50f3271-dfbb-4586-8f1e-ddfbfdfbf7d7/runs/754ba970-a091-4840-ab6a-862728d54cbf/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_compiler_all_gcs/scratch/3/core.2403650)
#
# An error report file with more information is saved as:
# /opt/mach5/mesos/work_dir/slaves/0c72054a-24ab-4dbb-944f-97f9341a1b96-S107584/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/b50f3271-dfbb-4586-8f1e-ddfbfdfbf7d7/runs/754ba970-a091-4840-ab6a-862728d54cbf/testoutput/test-support/jtreg_open_test_hotspot_jtreg_hotspot_compiler_all_gcs/scratch/3/hs_err_pid2403650.log
[2.669s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
----------System.err:(0/0)----------
----------rerun:(39/6215)*----------

Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x0000fffcdc42b160):  JavaThread "MainThread" [_thread_in_Java, id=2403681, stack(0x0000fffa8a7c0000,0x0000fffa8a9c0000)]

Stack: [0x0000fffa8a7c0000,0x0000fffa8a9c0000],  sp=0x0000fffa8a9bdd50,  free space=2039k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x0000fffcc4f14e00
j  compiler.classUnloading.methodUnloading.TestOverloadCompileQueues.main([Ljava/lang/String;)V+0
j  java.lang.invoke.LambdaForm$DMH+0x0000000801002000.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;)V+10 java.base@20-ea
j  java.lang.invoke.LambdaForm$MH+0x0000000801003400.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+33 java.base@20-ea
j  java.lang.invoke.Invokers$Holder.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+20 java.base@20-ea
j  jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+55 java.base@20-ea
j  jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 java.base@20-ea
j  java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+102 java.base@20-ea
j  com.sun.javatest.regtest.agent.MainWrapper$MainThread.run()V+134
j  java.lang.Thread.run()V+13 java.base@20-ea
v  ~StubRoutines::call_stub 0x0000fffccc3c016c
V  [libjvm.so+0xea691c]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x5ac  (javaCalls.cpp:417)
V  [libjvm.so+0xea6f40]  JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x3b0  (javaCalls.cpp:331)
V  [libjvm.so+0xea73b8]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x64  (javaCalls.cpp:193)
V  [libjvm.so+0x1023008]  thread_entry(JavaThread*, JavaThread*)+0x118  (jvm.cpp:2876)
V  [libjvm.so+0xedcb40]  JavaThread::thread_main_inner()+0x21c  (javaThread.cpp:699)
V  [libjvm.so+0x17f2a98]  Thread::call_run()+0xf8  (thread.cpp:224)
V  [libjvm.so+0x14ffbd4]  thread_native_entry(Thread*)+0x100  (os_linux.cpp:710)
C  [libpthread.so.0+0x7908]  start_thread+0x188


siginfo: si_signo: 4 (SIGILL), si_code: 1 (ILL_ILLOPC), si_addr: 0x0000fffcc4f14e00
Comments
You are right Dean. The is_not_entrant ingredient of is_cold is flaky and makes the is_unloading check racy and unstable. I thought the is_unloading check when making nmethods not_entrant would stabilize that, but it stopped stabilizing when I added more things into is_cold. And I didn’t realize I made is_unloading unstable in the process. Great job figuring this out Dean! Sorry I caused the issue. :C I suppose a CAS when setting the is_unloading_state fixes this.
29-10-2022

I have been assuming that the result of is_unloading() is stable, but that might not be the case. Consider 3 threads, where one thread Patch is calling is_unloading() before patching a compiled call site, thread Clear is calling is_unloading() to see if it the call site should be cleared, and thread Compiler has just set the nmethod state to not_entrant. Now let's say the Patch thread sees the nmethod as is_use and not unloading and decides to patch the call site. The Clear thread sees the nmethod as unloading and decides to clear the call site. The Patch thread writes the cached is_unloading state first with "not unloading", then the Clear thread overwrites it with "is unloading". The call site is first cleared by thread Clear and then patched by thread Patch. The GC unlinks the nmethod, and soon it gets released and the memory recycled. Now we have a compiled call site calling into bad memory. That memory can get overwritten with a new nmethod, and then we have a call into the middle of an nmethod. If the above scenario is indeed possible, then I think the is_cold/make_not_entrant race can be fixed with memory barriers. There may be a similar race with the has_dead_oop path of is_unloading, because C1 patching can add new nmethod oops. It might be better to fix the race in is_unloading. When is_unloading writes the cached result, it could use a compare-and-swap so that the first value written always wins.
29-10-2022

So, the above race seems harmless, because is_cold() is only used by is_unloading(), and make_not_entrant() already locks in the is_unloading() state for the current epoch/unloading cycle.
28-10-2022

I don't have a theory for how exactly this SIGILL crash happened, but I do see a possible problem with memory ordering in nmethod::is_cold() and nmethod::make_not_entrant(). As far as I can tell, neither member function uses memory barriers when reading or writing the "maybe on stack" value or the "not entrant" value. So in is_cold(), we could return true here: if (!is_maybe_on_stack() && is_not_entrant()) { // Not entrant nmethods that are not on any stack can just // be removed return true; } even though the nmethod is on the stack, because even though make_not_entrant() sets "maybe on stack" first, it doesn't use a memory barrier, so another thread could read is_maybe_on_stack() as false but is_not_entrant() as true. [~eosterlund], do you agree?
28-10-2022

Digging a little deeper, it appears that the call to StringConcatHelper::prepend was through a compiled MethodHandle intrinsic. There are two different nmethods for StringConcatHelper::prepend(). The method->code() points to the new one, while we crashed calling from the old not_entrant one.
28-10-2022

I checked and the value of JavaThread::_requires_cross_modify_fence is false. The caller nmethod state is "not_entrant" while the "is_unloading" state has been set to false in the current unloading cycle. My guess is that we skipped over fixing up the call site because the nmethod was not_entrant, but there were still active frames on the stack.
27-10-2022

In this latest crash we have a call from: Compiled method (c1) 2666 3265 2 java.lang.StringConcatHelper::prepend (22 bytes) to: Compiled method (c1) 2669 2560 2 java.lang.invoke.MethodHandleNatives::refKindIsGetter (32 bytes) However, rather than calling into the code at the entry point, we are calling into the middle of the nmethod's code. The instruction at that location is valid in the core dump, but since we got a SIGILL, it's plausable that the thread saw something else. This sounds like it could be caused by a missing ISB barrier, so it's likely this is a duplicate of JDK-8293328.
27-10-2022

[~dlong] Since you analyzed a similar failure recently (JDK-8293328), could you have a look at this one?
27-10-2022

ILW = crash, regression; intermittent so far; no workaround (same as JDK-8293328) = HLH = P2
27-10-2022

This weird crash with TestOverloadCompileQueues.java made me think of this bug: JDK-8293328 compiler/classUnloading/methodUnloading/TestOverloadCompileQueues.java fails with SIGSEGV in JVM_handle_linux_signal with ZGC
27-10-2022