Bug ID: JDK-4510838 self-suspend race with GetCallTrace

Type: Bug
Component: vm-legacy
Sub-Component: jvmpi
Affected Version: 1.4.0

Priority: P3
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2001-10-04
Updated: 2002-09-06
Resolved: 2002-09-06

Other	Other
1.4.0 rc1Fixed	1.4.1Fixed


Name: dd4877			Date: 10/04/2001


daniel.daugherty@Sun 2001-10-04

A few of my profiling stress runs crashed on the following assertion
failures:

# assert(pd_post_Java_state_is_pc(), "must be a program counter")
#
# Error ID: src/os_cpu/solaris_sparc/vm/thread_solaris_sparc.hpp, 55 [ Patched ]

# assert(Thread::current() == this, "only current thread can flush its own regis
ter windows")
#
# Error ID: baseline-stress/src/os_cpu/solaris_sparc/vm/thread_solaris_sparc.hpp
, 150 [ Patched ]

I also ran into a few SIGSEGV crashes due to the same problem.

wait_for_ext_suspend_completion() appears to be returning to
GetCallTrace() before the thread is truly self-suspended. Accessing
the not yet suspended thread's stack frames then becomes risky.

======================================================================

CONVERTED DATA BugTraq+ Release Management Values COMMIT TO FIX: hopper merlin-rc1 FIXED IN: hopper merlin-rc1 INTEGRATED IN: merlin-rc1
14-06-2004
EVALUATION Name: dd4877 Date: 10/09/2001 daniel.daugherty@Sun 2001-10-09 Backing out the change for 4500875 isn't a good option because there are valid cases where a thread can be in state _thread_blocked when external suspension is requested. Because the thread state is not _thread_native, the JavaThread is marked _external_suspend and self suspend will happen later. The JVM/DI test frameloc001 is good example of this type of case. The thread being suspended is blocked on an ObjectMonitor::enter() call and the wait_for_ext_suspend_completion() can only return without timing out if the _thread_blocked check is present. ====================================================================== Name: dd4877 Date: 10/11/2001 daniel.daugherty@Sun 2001-10-11 Mandy suggested that we self suspend the thread when we use ThreadBlockInVM and realize that we have been suspended. This would be similar to the changes made for 4333847 and 4413752: ThreadBlockInVM tbiv(thread); bool threadIsSuspended; do { mutex_lock threadIsSuspended = check for external suspend if (threadIsSuspended) { mutex_unlock yield } while (threadIsSuspended); Instead of the yield, we do a self-suspend. Karen and I had discussed doing that late in the Ladybird release, but decided that the retry loop was risky enough at the time. I consulted with Karen again and she remembers this pretty much the same way I do. We also agreed that it is time to try the self-suspension so that _thread_blocked can be a legitimate reason to return from wait_for_ext_suspend_completion(). Here are the places that use ThreadBlockInVM in Solaris SPARC: Mutex::wait_for_lock_blocking_implementation() Will need to call unlock(), self-suspend and come around for another call to wait_for_lock_implementation() Monitor::wait() Will need to call _Lock_Event->unlock(), self-suspend and come around for another call to _Lock_Event->wait() or _Lock_Event->timedwait(timeout) ObjectMonitor::enter2() Replace yield with self-suspension. ObjectMonitor::raw_enter() Replace yield with self-suspension. Also needs a conditional StateSaver. check_pending_signals() Will need to call ::sema_post(), self-suspend and come around for another call to ::sema_wait(). os::sleep() Will need to self-suspend() and then return to the caller. Similar changes will need to be made for both Linux and Win32. ====================================================================== Name: dd4877 Date: 10/11/2001 daniel.daugherty@Sun 2001-10-11 The previous entry for Monitor::wait() is incorrect. Here is the right one: Monitor::wait() Will need to call _Lock_Event->unlock(), self-suspend and return to the caller SafepointSynchronize::block() sets the thread state to _thread_blocked before blocking on the Interrupt_lock. We can't self-suspend there, but we can limit the use of _thread_blocked to non-safepoint values in wait_for_ext_suspend_completion(). ====================================================================== Name: dd4877 Date: 10/15/2001 daniel.daugherty@Sun 2001-10-15 Further testing has revealed yet another race between self-suspension and internal suspension. The _is_suspended field assumes an orthogonality between initial suspension of a thread by an external suspend request and initial suspension of a thread by an internal suspend request. In reality, no such orthogonality exists and this assumption can lead to rare race conditions. Threads suspended for internal purposes are kept suspended for very short periods of time. For example, during safepointing a thread may be internally suspended, its state examined, and then the thread is resumed. Internal suspension is non-cooperative so the thread is suspended for as brief a time as possible to reduce the possibility of deadlock. External suspension of threads not in state _thread_in_native is done through a cooperative mechanism called self-suspension. The thread is marked for external suspension and when it reaches a good point, the thread suspends itself. External suspension of threads in state _thread_in_native is done non-cooperatively. Consider the following scenario: thread A thread B VM thread ============== ============= =============== no suspension : ext_suspend(A) : : Suspend-OP external_suspend=1 : : : : return : return : : get_call_trace : : wait_for_completion : : : int_suspend(A) internal_suspend=1 : : is_suspended=1 : : : // suspended return return : access frames int_resume(A) is_suspended=0 access frames : internal_suspend=0 access frames : : access frames return : // resumed access frames : CRASH self-suspend The problem is that first suspension of thread A was completed by an internal suspend request and not an external (cooperative) suspend request. Notice that this happened even though the external suspend started first. Because the thread was not yet externally suspended, there was nothing to nest when the internal suspend came along. Internal suspend correctly set is_suspend=1 for the short duration of the internal suspend. Meanwhile, thread B is waiting for the external suspend request to complete and it does. Well, the thread is suspended anyway. Thread B goes about its business when the internal resume happens and suddenly thread A is working again which thread B does not accept. Thread B typically crashes because it assumes the frame data is stable. If the first suspension of thread A was completed by the external suspend request, then when the internal suspend request comes along things nest as expected: external suspend request made thread self-suspends in response to request (depth++) internal suspend nests (depth++) internal resume unnests (depth--) // thread remains suspended external resume unnests (depth--) external suspend request made first suspend completed by internal request (depth++) // external suspend hasn't completed yet // wait_for_ext_suspend_completion is fooled internal resume unnests (depth--) // thread starts running again // thread that suspended this thread typically crashes thread self-suspends in response to request (depth++) ====================================================================== Name: dd4877 Date: 10/16/2001 daniel.daugherty@Sun 2001-10-16 Life just continues to get complicated. Last weekend's stress testing showed a rare hang due to a "dangling" _external_suspend value. By dangling, I mean that a thread self-suspended and there was no sign of any pending resume operation anywhere. This became particularly very apparent when I changed handle_special_runtime_exit_condition() to do direct checks for an _external_suspend value instead of relying on _async_suspend. I have come up with a scenario that explains how this could happen: thread A thread B VM thread ============== ============= =============== no suspension : ext_suspend(A) : : Suspend-OP external_suspend=1 : : is_suspended=1 depth++ : // suspended : return : return : : : int_suspend(A) internal_suspend=1 : : depth++ : : : // nested : return : ext_resume(A) : depth-- : : : // unnest return : : : int_resume(A) depth-- : : is_suspended=0 : : internal_suspend=0 : : : // resumed : return In the above transaction diagram, the external_suspend field is not cleared by the ext_resume() call. The current logic just assumes that with a nested resume all it has to do is decrement the depth. This can leave a "dangling" external_suspend value for the next handle_special_runtime_exit_condition() call to find. I've added a temporary assertion that will catch this case. BTW, it needs to be clarified that an external suspend request can only be nested on top of an internal suspend request. An external suspend request on top of another external suspend request is an error. ======================================================================
11-06-2004
PUBLIC COMMENTS Name: dd4877 Date: 10/04/2001 . ======================================================================
10-06-2004
SUGGESTED FIX Name: dd4877 Date: 10/19/2001 daniel.daugherty@Sun 2001-10-19 See attached webrev.tar.Z for the changes to fix this bug. ======================================================================
19-10-2001

Relates :	JDK-4753168 - self-suspend still has race with wait_for_ext_suspend_completion
Relates :	JDK-6876794 - sp07t002 hangs very intermittently
Relates :	JDK-4500875 - Very few CPU sampling due to failure in wait_for_suspend_completion