JDK-8286830 : ~HandshakeState should not touch oops
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 19
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • Submitted: 2022-05-16
  • Updated: 2022-07-18
  • Resolved: 2022-06-02
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 19
19 b26Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
#
#  Internal Error (/home/zhengyu/ws/jdk/src/hotspot/share/gc/shared/ptrQueue.cpp:38), pid=3890859, tid=3893922
#  assert(_buf == __null) failed: queue must be flushed before delete

and

#  Internal Error (/home/zhengyu/ws/jdk/src/hotspot/share/gc/shared/satbMarkQueue.cpp:230), pid=319590, tid=686903
#  assert(queue.is_active()) failed: precondition
#


Comments
Great. Yes, I would have expected the crashes with any of those changes removed. Thanks for the extra testing Dan!
15-06-2022

[~pchilanomate] I continued to stress test this fix. It's been bothering me that we couldn't reproduce this failure mode without Shenandoah. While chasing this other barrier bug: JDK-8288139 JavaThread touches oop after GC barrier is detached I developed some code to help diagnose these types of issues and extracted that code into the following subtask: JDK-8288497 add support for JavaThread::is_gc_barrier_detached() I added the code from JDK-8288497 to my repo for stressing this bug (JDK-8286830) and I optionally disabled the two key fixes from JDK-8286830: diff --git a/src/hotspot/share/runtime/thread.cpp b/src/hotspot/share/runtime/thread.cpp index 0c24f595cc0..6dee7844d68 100644 --- a/src/hotspot/share/runtime/thread.cpp +++ b/src/hotspot/share/runtime/thread.cpp @@ -1431,11 +1431,13 @@ void JavaThread::exit(bool destroy_vm, ExitType exit_type) { // before_exit() has already posted JVMTI THREAD_END events } +if (UseNewCode2) { // Cleanup any pending async exception now since we cannot access oops after // BarrierSet::barrier_set()->on_thread_detach() has been executed. if (has_async_exception_condition()) { handshake_state()->clean_async_exception_operation(); } +} // The careful dance between thread suspension and exit is handled here. // Since we are in thread_in_vm state and suspension is done with handshakes, @@ -1708,7 +1710,7 @@ void JavaThread::handle_async_exception(oop java_throwable) { void JavaThread::install_async_exception(AsyncExceptionHandshake* aeh) { // Do not throw asynchronous exceptions against the compiler thread // or if the thread is already exiting. - if (!can_call_java() || is_exiting()) { + if (!can_call_java() || (UseNewCode3 && is_exiting())) { delete aeh; return; } -XX:+UseNewCode2 enables the call to clean_async_exception_operation() and -XX:+UseNewCode3 enables the detection of an is_exiting() thread and deletes the AsyncExceptionHandshake object. I also added code to the AsyncExceptionHandshake destructor: AsyncExceptionHandshake::~AsyncExceptionHandshake() { + Thread* current = Thread::current(); + if (current->is_Java_thread()) { + guarantee(!JavaThread::cast(current)->is_gc_barrier_detached(), + "JavaThread cannot touch oops after its GC barrier is detached."); + } assert(!_exception.is_empty(), "invariant"); _exception.release(Universe::vm_global()); } As I was hoping, disabling both main parts to your fix in this bug (JDK-8286830) was nice enough to trigger the new check. Here's the results of my experimenting with -XX:+UseNewCode2 and -XX:+UseNewCode3: Code2..Code3..build-cfg..Run Duration =====..=====..=========..============ false..false..release....crash 1.12m false..false..fastdebug..crash 1.00m false..false..slowdebug..crash 1.42m false..true...release....crash 1.05m false..true...fastdebug..crash 0.99m false..true...slowdebug..crash 1.45m true...false..release....crash 11.70m true...false..fastdebug..crash 8.23m true...false..slowdebug..crash 73.85m true...true...release....pass 240.22m true...true...fastdebug..pass 240.49m true...true...slowdebug..pass 241.30m For the three passing runs: release bits: About to execute for 7200 seconds. Executed 37973860 loops in 7200 seconds. About to execute for 7200 seconds. Executed 15700828 loops in 7200 seconds. vm_exit: dcubed_async_global_alloc_cnt=118815345 vm_exit: dcubed_async_global_release_cnt=118815345 fastdebug bits: About to execute for 7200 seconds. Executed 18276326 loops in 7200 seconds. About to execute for 7200 seconds. Executed 8869218 loops in 7200 seconds. vm_exit: dcubed_async_global_alloc_cnt=30788882 vm_exit: dcubed_async_global_release_cnt=30788882 slowdebug bits: About to execute for 7200 seconds. Executed 7402912 loops in 7200 seconds. About to execute for 7200 seconds. Executed 3149075 loops in 7200 seconds. vm_exit: dcubed_async_global_alloc_cnt=10395932 vm_exit: dcubed_async_global_release_cnt=10395932 No leaks detected and no crashes.
15-06-2022

Changeset: 5acac223 Author: Patricio Chilano Mateo <pchilanomate@openjdk.org> Date: 2022-06-02 13:32:25 +0000 URL: https://git.openjdk.java.net/jdk/commit/5acac2238fdc4ffe6ef290456e01cc559d811557
02-06-2022

A pull request was submitted for review. URL: https://git.openjdk.java.net/jdk/pull/8795 Date: 2022-05-19 19:17:09 +0000
19-05-2022

In HandshakeOperation::do_handshake() we check !is_terminated(), which basically means we execute the handshake if the state is _not_terminated or _thread_exiting. Only when the status is _thread_terminated or _vm_exited then we don't execute it. The HandshakeState::suspend_with_handshake() closure has its own check for _thread_exiting. I actually never understood why in do_handshake() we allow executing the closure for the _thread_exiting case. Maybe we should change that. A handshaker could already be in Handshake::execute() with a valid reference to the target while the target is in Thread::remove() removing itself from the list. Depending on timing it could be that the operation was not added to the queue yet, or the handshaker already added the operation and is just looping waiting to find the target in a safe state.
18-05-2022

[~pchilanomate] I thought we already had an is_exiting check to prevent any handshake operation from occurring with a target thread? That is the point where the initiator of the handshake should simply be told "sorry the target has terminated". I don't see how synchronous handshakes can be in the queue at that point. ??
18-05-2022

[~dholmes], regarding your question about when a thread stops being responsive to handshake requests: Once the exiting thread sets itself as _thread_terminated in Threads::remove() no more handshake closures will be executed against it, either by itself or the handshaker. Since the two async handshakes we have(suspend and async exceptions) are set inside a synchronous handshake that means no more of those two operations will be added to the queue anymore(for the suspend case is even before that, once the thread moves to _thread_exiting). We could do the cleanup of the async exceptions there before the detach() call, but we cannot guarantee anything about synchronous handshakes in the queue. There could still be synchronous handshakes in the queue, or even new ones added, it's just that the closures won't be executed. Even async handshakes could be added in theory, although that doesn't happen today. I think a straightforward way to fix this is clean up async exceptions before the on_thread_detach() call as Dan had it initially. Then add a is_exiting() check in install_async_exception() to avoid adding new async operations after that cleanup (same check that SuspendThreadHandshake has). We could still keep the ~HandshakeState() code as it is but at least today it won't find any operation. Note: Currently we don't set _thread_exiting when JavaThread::exit() is called with destroy_vm=true but we should. Another way to solve this could be to change ~AsyncExceptionHandshake to identify if it is the current thread exiting and handle the release to the ServiceThread as we do in ~JavaThread() with _threadObj for example. But I'm not convinced on that.
18-05-2022

BTW I thought we had logic to detect touching an oop/oop-storage after on_thread_detach() is called ??
18-05-2022

I think we need to carefully rethink this one. Dan's initial patch was specific to async-exceptions; the final patch was intended to be generic. There is a point in thread termination where it stops being responsive to handshake requests. I'm not sure exactly where that is in relation to on_thread_detach() but it may be we can explicitly call the HandshakeState destructor prior to on_thread_detach().
18-05-2022

Although I can only reproduce with Shenandoah, I believe it is an upstream bug introduced by JDK-8283044. BarrierSet::on_thread_detach() is called from Threads::remove(), before destruction of the actual JavaThread. After JDK-8283044, JavaThread's HandshakeState can hold AsyncExceptionHandshake, when JavaThread's destructor is called, it triggers destruction of AsyncExceptionHandshake via destruction of JavaThread's HandshakeState, but AsyncExceptionHandshake's destructor releases exception handle, which can triggers GC barrier. I don't think any oop operation should be allowed after a JavaThread is detached. @kbarrett @tschatzl @stefank and @rkennke could you please comment?
18-05-2022

Testing patch based on Dan's original version where cleanup is done before executing BarrierSet::on_thread_detach().
17-05-2022

[~zgu] is correct - a Thread must not touch oops before on_thread_attach or after on_thread_detach.
17-05-2022

ILW=HMM=P2
17-05-2022

The addition of ~HandshakeState to delete any remaining AsyncExceptionHandshake operations was added in the JDK-8284632 later fix, so changing the title. If somebody from GC team can confirm we cannot do _exception.release(Universe::vm_global()), where _exception is an OopHandle, after BarrierSet::on_thread_detach() then we will have to redo that fix. The actual initial version of that fix posted by Dan (plus adding also a check in install_async_exception() to avoid installing an async exception after the JT moved to _thread_exiting) should work.
17-05-2022

I am convinced that JDK-8283044 has fatal fault. It manipulates oop after JavaThread is removed from threads list, that means it no longer participants safepoints and its destructor can run through GC safepoints, where GCs do not anticipate any other threads to mutate heap.
17-05-2022