JDK-8271251 : JavaThread::java_suspend() fails with "fatal error: Illegal threadstate encountered: 6"
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Versions: 17, 18
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: windows
  • CPU: x86_64
  • Submitted: 2021-07-24
  • Updated: 2021-08-05
  • Resolved: 2021-07-28
Fix Versions:
  JDK 17: b34 (Fixed)
  JDK 18: Fixed
Description
The following test failed in the JDK17 CI:

vmTestbase/nsk/jvmti/scenarios/hotswap/HS202/hs202t002/hs202t002.java

Here's a snippet from the log file:

#section:main
----------messages:(4/436)----------
command: main -agentlib:hs202t002=pathToNewByteCode=./bin,-waittime=5,package=nsk,samples=100,mode=compiled nsk.jvmti.scenarios.hotswap.HS202.hs202t002.hs202t002
reason: User specified action: run main/othervm/native -agentlib:hs202t002=pathToNewByteCode=./bin,-waittime=5,package=nsk,samples=100,mode=compiled nsk.jvmti.scenarios.hotswap.HS202.hs202t002.hs202t002 
Mode: othervm [/othervm specified]
elapsed time (seconds): 15.111
----------configuration:(0/0)----------
----------System.out:(26/1480)*----------
Agent:: VM.. Started..
 Agent :: NOTIFICATIONS ARE ENABLED 
# info :: File = ./bin/newclass00/nsk/jvmti/scenarios/hotswap/HS202/hs202t002/MyThread.class 
#  info **Agent:: opening file ./bin/newclass00/nsk/jvmti/scenarios/hotswap/HS202/hs202t002/MyThread.class 
# info file size= 732
 File red completely 
 Agent:: redefine class success ..
Agent::SUSPENDING>> 
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=\\safepoint.cpp:745
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (t:\\workspace\\open\\src\\hotspot\\share\\runtime\\safepoint.cpp:745), pid=43336, tid=12820
#  fatal error: Illegal threadstate encountered: 6
#
# JRE version: Java(TM) SE Runtime Environment (17.0+33) (fastdebug build 17-ea+33-LTS-2690)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 17-ea+33-LTS-2690, compiled mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, windows-amd64)
# Core dump will be written. Default location: T:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_nsk_jvmti_quick\\scratch\\0\\hs_err_pid43336.mdmp
#
# An error report file with more information is saved as:
# T:\\testoutput\\test-support\\jtreg_open_test_hotspot_jtreg_vmTestbase_nsk_jvmti_quick\\scratch\\0\\hs_err_pid43336.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
#
----------System.err:(0/0)----------
----------rerun:(49/7264)*----------


Here's the crashing thread's stack:

---------------  T H R E A D  ---------------

Current thread (0x000002a2cead3490):  JavaThread "Thread-1" [_thread_in_vm, id=12820, stack(0x000000dba9700000,0x000000dba9800000)], _nested_threads_hazard_ptr_cnt=1, _nested_threads_hazard_ptr=0x000002a2cf0aa5a0

Stack: [0x000000dba9700000,0x000000dba9800000]
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0xadf271]  os::platform_print_native_stack+0xf1  (os_windows_x86.cpp:235)
V  [jvm.dll+0xd06e55]  VMError::report+0x1005  (vmError.cpp:739)
V  [jvm.dll+0xd087de]  VMError::report_and_die+0x7fe  (vmError.cpp:1549)
V  [jvm.dll+0x4c7e3e]  report_fatal+0xde  (debug.cpp:304)
V  [jvm.dll+0xb85255]  SafepointSynchronize::block+0xc5  (safepoint.cpp:745)
V  [jvm.dll+0xb88ad2]  SafepointMechanism::process+0x32  (safepointMechanism.cpp:134)
V  [jvm.dll+0x64dabd]  HandshakeState::suspend+0x9d  (handshake.cpp:693)
V  [jvm.dll+0xc8e2ec]  JavaThread::java_suspend+0x6c  (thread.cpp:1791)
V  [jvm.dll+0x8b1624]  JvmtiEnv::SuspendThread+0x34  (jvmtiEnv.cpp:946)
V  [jvm.dll+0x86d3c0]  jvmti_SuspendThread+0x160  (jvmtiEnter.cpp:536)
C  [hs202t002.dll+0x1f83]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  nsk.jvmti.scenarios.hotswap.HS202.hs202t002.MyThread.display()V+19
j  nsk.jvmti.scenarios.hotswap.HS202.hs202t002.MyThread.playWithThis()V+1
j  nsk.jvmti.scenarios.hotswap.HS202.hs202t002.MyThread.run()V+1
v  ~StubRoutines::call_stub

I *think* this failure is related to the recent integration of:

    JDK-8270085 Suspend during block transition may deadlock if lock held

So far this failure has only been seen once, on Windows, and
it did not show up in my stress testing of JDK-8270085 prior
to integration. Since this is a (possible) regression, I'm starting
this bug at P2.
Comments
Fix in jdk-18+8-359.
30-07-2021

Fixed in CI jdk-17+33-2708
30-07-2021

Changeset: 6878b05f Author: Patricio Chilano Mateo <pchilanomate@openjdk.org> Date: 2021-07-28 16:59:21 +0000 URL: https://git.openjdk.java.net/jdk17/commit/6878b05f8fbc7bd72e79ec29a868008dde2321c6
28-07-2021

Fix request for JDK 17 is approved.
28-07-2021

Fix Request

When self-suspending, a JavaThread might reach SafepointSynchronize::block() with a state of _thread_in_vm, which is not listed as a valid state for safepoint polling. This happened after 8270085, where we had to explicitly add a call to process_if_requested() in HandshakeState::suspend() so that self-suspension cases would actually suspend, since TBIVM was changed to not process suspend requests to avoid possible deadlocks. Since the failure is intermittent in tier7, dependent on whether a safepoint happened at the right time, it escaped the testing of 8270085.

The fix is straightforward and involves removing the call to process_if_requested() and replacing it with a direct call to the self-suspend logic. The patch only affects self-suspension cases so the risk should be minimal.

The fix was tested by taking the two failing tests from tier7, ThreadSuspendSelf.java and hs202t002.java, reproducing the failure locally and making sure the tests now pass with the patch. I also ran 2 full cycles of tiers 1-7 just to be conservative, plus am currently running another tier7 job. The fix has been reviewed by [~dholmes] and [~dcubed].

Webrev: https://openjdk.github.io/cr/?repo=jdk17&pr=283&range=02
PR: https://github.com/openjdk/jdk17/pull/283
27-07-2021

My fastdebug bits run with the following options: -Xcomp -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot -XX:+SafepointALot had a failure at run #75 that matches the original failure mode, so the original failure mode is very rare on my T7600.
27-07-2021

Added a small bit of debugging code to show if we call SafepointMechanism::process() with the problematic thread state (_thread_in_vm), but don't happen to be unlucky enough to have a pending safepoint at this time:

$ git diff
diff --git a/src/hotspot/share/runtime/safepointMechanism.cpp b/src/hotspot/share/runtime/safepointMechanism.cpp
index 20e163a7f6c..61fd6532382 100644
--- a/src/hotspot/share/runtime/safepointMechanism.cpp
+++ b/src/hotspot/share/runtime/safepointMechanism.cpp
@@ -118,6 +118,10 @@ void SafepointMechanism::process(JavaThread *thread, bool allow_suspend) {
   // local poll already checked, if used.
   bool need_rechecking;
   do {
+    if (UseNewCode) {
+      JavaThreadState state = thread->thread_state();
+      guarantee(state != _thread_in_vm, "XXX - bad news if we're _thread_in_vm and we go to a safepoint here");
+    }
     if (global_poll()) {
       // Any load in ::block() must not pass the global poll load.
       // Otherwise we might load an old safepoint counter (for example).
       OrderAccess::loadload();
       SafepointSynchronize::block(thread);
     }
26-07-2021

I'm running vmTestbase/nsk/jvmti/scenarios/hotswap/HS202/hs202t002/hs202t002.java with fastdebug bits and the following options on my Ubuntu 20.04 machine: -Xcomp -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot It takes almost 8 minutes per run. So far 20 runs without reproducing. Update: No failures in 25 runs.
26-07-2021

I kicked off a new run and added -XX:+SafepointALot. I'll let that run overnight.
26-07-2021

Try running it with -Xcomp -XX:+SafepointALot -XX:+DeoptimizeALot. I was able to reproduce it locally pretty consistently (1/5) with those flags.
26-07-2021

ILW = HLH = P2
26-07-2021

I think the issues of using TBIVM in do_self_suspend() were the recursive call to process_by_self() in ~TBIVM and possibly blocking for a safepoint with _lock held.

> That said, for 17 the more conservative the fix the better so the TBIVM addition to my suggestion may be the way to go.

Ok, I like this approach because we are also fixing all the overhead when self-suspending. A little less trivial than ThreadBlockInVM tbivm(self, true /* allow_suspend */); but still simple enough.
26-07-2021

That said, for 17 the more conservative the fix the better so the TBIVM addition to my suggestion may be the way to go.
26-07-2021

Sorry I missed that do_self_suspend() does a manual state transition to _thread_blocked rather than using TBIVM - so now I have to wonder why it does that rather than using TBIVM?
26-07-2021

Testing: https://github.com/pchilano/jdk17/commit/e040266f6a6889f8e518d075a515eee5876fee55
26-07-2021

So in do_self_suspend() we do a manual transition to thread_blocked. After being resumed and manually transitioning back to thread_in_vm there could be an ongoing safepoint, or a safepoint already finished, and we need to call StackWatermarkSet::on_safepoint() to possibly fix the stack. Now, I don't see much code being executed by the JT since returning from HandshakeState::suspend() and transitioning back to native, so it might be okay even if we don't stop for the safepoint, but I'm not sure.

It might work if we add a TBIVM above the MutexLocker. The manual transitions in do_self_suspend() will both be to thread_blocked so they are benign.

bool HandshakeState::suspend() {
  JavaThread* self = JavaThread::current();
  if (_handshakee == self) {
    // If target is the current thread we can bypass the handshake machinery
    // and just suspend directly
    log_trace(...)(...);
    ThreadBlockInVM(self);
    MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
    do_self_suspend();
    return true;
  } else {
    SuspendThreadHandshake st;
    Handshake::execute(&st, _handshakee);
    return st.did_suspend();
  }
}
26-07-2021

> the problem is that after the JT is resumed and returns from do_self_suspend() to process_by_self()

I don't follow. With what I suggested above process_by_self() is not in the stack. Can't we simply do:

bool HandshakeState::suspend() {
  JavaThread* self = JavaThread::current();
  if (_handshakee == self) {
    // If target is the current thread we can bypass the handshake machinery
    // and just suspend directly
    log_trace(...)(...);
    MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
    do_self_suspend();
    return true;
  } else {
    SuspendThreadHandshake st;
    Handshake::execute(&st, _handshakee);
    return st.did_suspend();
  }
}
26-07-2021

As to whether _thread_in_vm is a valid state or not when polling for safepoints, I think the question is what the criteria is for establishing whether a given state is valid when polling for safepoints. I think in SafepointSynchronize::block() we should only check that we are not coming from an already-safe state (thread_blocked, thread_in_native), since that might indicate we were/will be executing in the VM in a safe state while a safepoint operation was/is in progress. Other than that I don't see why we have to differentiate the other states, and less so _thread_in_vm. But if it looks risky to add _thread_in_vm to the list we can leave that for 18.

Another simple fix using TBIVM is:

--- a/src/hotspot/share/runtime/handshake.cpp
+++ b/src/hotspot/share/runtime/handshake.cpp
@@ -687,7 +687,7 @@ bool HandshakeState::suspend() {
     // If target is the current thread we need to call this to do the
     // actual suspend since Handshake::execute() above only installed
     // the asynchronous handshake.
-    SafepointMechanism::process_if_requested(self);
+    ThreadBlockInVM(self, true /* allow_suspend */);
   }
   return st.did_suspend();
 }
--- a/src/hotspot/share/runtime/interfaceSupport.inline.hpp
+++ b/src/hotspot/share/runtime/interfaceSupport.inline.hpp
@@ -290,6 +290,8 @@ class ThreadBlockInVM {
   InFlightMutexRelease _ifmr;
   ThreadBlockInVMPreprocess<InFlightMutexRelease> _tbivmpp;
  public:
+  ThreadBlockInVM(JavaThread* thread, bool allow_suspend)
+    : _ifmr(NULL), _tbivmpp(thread, _ifmr, allow_suspend) {}
   ThreadBlockInVM(JavaThread* thread, Mutex** in_flight_mutex_addr = NULL)
     : _ifmr(in_flight_mutex_addr), _tbivmpp(thread, _ifmr, /* allow_suspend= */ false) {}
 };
25-07-2021

I tried to think of something like that when I was working on 8270085. The problem is that after the JT is resumed and returns from do_self_suspend() to process_by_self(), we need to stop processing handshake operations and return true (as with the async branch) so that in SafepointMechanism::process() we do the re-checking (need_rechecking) due to a possible safepoint while the JT was blocked. I didn't find a straightforward way to do that without complicating the code so I left it as it is now.
25-07-2021

> The state _thread_in_vm should always be valid for safepoint polling. The only reason I think it's not there is because we didn't have a case for it.

Hmmmm ... we actually use the trans state for the majority of cases where we check for safepoints (now handshakes). I find it hard to understand how we can now say "oh we don't need any of that we can just do this when _thread_in_vm". But that is a different issue.

The current problem is much more specific IMO - we need a special case for self-suspension. The way self-suspension currently works is as a two step process:
1. Install an async handshake on ourselves
2. Hit a polling point where we see that async handshake and so actually suspend

This seems somewhat artificial - though general purpose. Can't we short-circuit that process by detecting self-suspension when we would install the async handshake and immediately suspend instead? Something like:

bool HandshakeState::suspend_with_handshake() {
  assert(_handshakee->threadObj() != NULL, "cannot suspend with a NULL threadObj");
+ if (_handshakee == JavaThread::current()) {
+   do_self_suspend();
+ } else {
  if (_handshakee->is_exiting()) {
    log_trace(thread, suspend)("JavaThread:" INTPTR_FORMAT " exiting", p2i(_handshakee));
    return false;
  }
  if (has_async_suspend_handshake()) {
    if (is_suspended()) {
      // Target is already suspended.
      log_trace(thread, suspend)("JavaThread:" INTPTR_FORMAT " already suspended", p2i(_handshakee));
      return false;
    } else {
      // Target is going to wake up and leave suspension.
      // Let's just stop the thread from doing that.
      log_trace(thread, suspend)("JavaThread:" INTPTR_FORMAT " re-suspended", p2i(_handshakee));
      set_suspended(true);
      return true;
    }
  }
  // no suspend request
  assert(!is_suspended(), "cannot be suspended without a suspend request");
  // Thread is safe, so it must execute the request, thus we can count it as suspended
  // from this point.
  set_suspended(true);
  set_async_suspend_handshake(true);
  log_trace(thread, suspend)("JavaThread:" INTPTR_FORMAT " suspended, arming ThreadSuspension", p2i(_handshakee));
  ThreadSelfSuspensionHandshake* ts = new ThreadSelfSuspensionHandshake();
  Handshake::execute(ts, _handshakee);
+ }
  return true;
}
25-07-2021

The state _thread_in_vm should always be valid for safepoint polling. The only reason I think it's not there is because we didn't have a case for it. We could also use ThreadBlockInVM to artificially cause a transition and a poll. We would have to provide another constructor for ThreadBlockInVM to set allow_suspend, which is always false now. As to why this wasn't detected: I see in my jobs I only ran tier 7 once (I thought I ran it more times; maybe I deleted some run). Since I knew Robbin also ran it at least once in tier 7, I thought 2-3 runs of tier 7 without issues was enough. Should have run tier 7 more times, my fault.
25-07-2021

I'm not quite following the details at this stage but I'd be happier ensuring we always have the expected state than simply adding a new state to the allowed set - the full implications of that are not at all clear to me. I'm also unclear why this wasn't detected during any of the stress testing.
24-07-2021

To fix this we can just add _thread_in_vm to the valid states in SafepointSynchronize::block(). As a future change, one of my goals with JDK-8260376 was to remove that switch statement and assert we only reach SafepointSynchronize::block() with a state of _thread_in_vm. None of the transition states are needed since _thread_in_vm is already unsafe. The only trans state that makes some sense is _thread_in_native_trans, because we are setting it from macro assembler code and we might not need to call the VM if the poll is not armed. The case of _thread_in_Java can also be removed if we make the proper transitions in handle_polling_page_exception() and fix the Zero interpreter SAFEPOINT macro.
24-07-2021

The issue is that SafepointSynchronize::block() doesn't consider _thread_in_vm to be a valid state when polling for the safepoint. Since _thread_in_vm is an unsafe state, it should also be added to the valid cases in the switch statement. This can happen now when self-suspending in HandshakeState::suspend(), when calling SafepointMechanism::process_if_requested(self): if there is a current safepoint we will call SafepointSynchronize::block() with a state of _thread_in_vm. I checked the JVMTI spec for SuspendThread and we cannot just remove that process_if_requested() call in HandshakeState::suspend(), relying on the fact that the JT will still suspend on the transition to Java. The spec says: "Suspend the specified thread. If the calling thread is specified, this function will not return until some other thread calls ResumeThread."
24-07-2021

Here's the crashing stack trace for the jdk-17+33-2691-tier7 sighting:

vmTestbase/runtime/threads/ThreadSuspendSelf/ThreadSuspendSelf.java

---------------  T H R E A D  ---------------

Current thread (0x00007f433c001670):  JavaThread "ThreadSuspendSelf" [_thread_in_vm, id=12372, stack(0x00007f43755f5000,0x00007f43756f6000)], _nested_threads_hazard_ptr_cnt=1, _nested_threads_hazard_ptr=0x00007f4330006390

Stack: [0x00007f43755f5000,0x00007f43756f6000], sp=0x00007f43756f4700, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x171306d]  SafepointSynchronize::block(JavaThread*)+0x23d
V  [libjvm.so+0x171ebc8]  SafepointMechanism::process(JavaThread*, bool)+0x58
V  [libjvm.so+0xd93889]  HandshakeState::suspend()+0xd9
V  [libjvm.so+0x18e825c]  JavaThread::java_suspend()+0x8c
V  [libjvm.so+0x101ce2c]  JVM_SuspendThread+0x16c
J 2684  java.lang.Thread.suspend0()V java.base@17-ea (0 bytes) @ 0x00007f43914b3db5 [0x00007f43914b3ce0+0x00000000000000d5]
J 2680 c2 runtime.threads.ThreadSuspendSelf.ThreadSuspendSelf.suspendSelf()V (28 bytes) @ 0x00007f43914b369c [0x00007f43914b3500+0x000000000000019c]
j  runtime.threads.ThreadSuspendSelf.ThreadSuspendSelf.run()V+20
v  ~StubRoutines::call_stub
V  [libjvm.so+0xe95164]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x504
V  [libjvm.so+0xe95a14]  JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0x4b4
V  [libjvm.so+0xe95e67]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, JavaThread*)+0x77
V  [libjvm.so+0x100e19b]  thread_entry(JavaThread*, JavaThread*)+0x12b
V  [libjvm.so+0x18dc131]  JavaThread::thread_main_inner()+0x271
V  [libjvm.so+0x18e3720]  Thread::call_run()+0x100
V  [libjvm.so+0x15a93c4]  thread_native_entry(Thread*)+0x104

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 2684 java.lang.Thread.suspend0()V java.base@17-ea (0 bytes) @ 0x00007f43914b3d44 [0x00007f43914b3ce0+0x0000000000000064]
J 2680 c2 runtime.threads.ThreadSuspendSelf.ThreadSuspendSelf.suspendSelf()V (28 bytes) @ 0x00007f43914b369c [0x00007f43914b3500+0x000000000000019c]
j  runtime.threads.ThreadSuspendSelf.ThreadSuspendSelf.run()V+20
v  ~StubRoutines::call_stub

The test task's JVM args are: -Xcomp -XX:+CreateCoredumpOnCrash -ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:+TieredCompilation -XX:+DeoptimizeALot
24-07-2021

[~pchilanomate] - Can you take a look at this failure? It's a possible regression due to: JDK-8270085 Suspend during block transition may deadlock if lock held
24-07-2021