JDK-6876794 : sp07t002 hangs very intermittently
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: hs16
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2009-08-28
  • Updated: 2013-07-25
  • Resolved: 2009-09-30
Fix Versions:
JDK 6: 6u21 (Fixed)
JDK 7: 7 (Fixed)
Other: hs17 (Fixed)
Description
While testing my fix for 6419370, the following VM/NSK
test hung:

    nsk/jvmti/scenarios/sampling/SP07/sp07t002

on my Solaris X86 machine running the following config:

java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b70)
Java HotSpot(TM) Client VM (build 17.0-b01-internal, mixed mode)

I created a doit.sh script from the .tlog file and used my
doit_loop script to reproduce the hang once:

Results: 318 runs (1 FAILed, 317 PASSed)
Failure details: hangs: 1
Run duration:   3 hour(s)  51 second(s)

I've attached the thread dump in threads.log.295 and a
30-second truss snapshot in truss.log.295.

Only two threads were active in the 30-second snapshot:

% sed 's/[        ].*//' truss.log.295 | sort | uniq -c
   1 /1:
   1 /10:
   1 /11:
  43 /2:
   1 /3:
   1 /4:
   1 /5:
   1 /6:
   1 /7:
   1 /8:
 499 /9:

Thread t@2 has the following stack snippet:

t@2 (l@2) stopped in ___nanosleep at 0xfeec8095
0xfeec8095: ___nanosleep+0x0015:        jae      ___nanosleep+0x23      [ 0xfeec80a3, .+0xe ]
current thread: t@2
  [1] ___nanosleep(0x1), at 0xfeec8095
  [2] THREAD_sleep(0x1), at 0xfe413424
  [3] nsk_jvmti_sleep(0x3e8, 0x0), at 0xfe4166cd
  [4] syncDebugeeStatus(0x8066510, 0x8064b98, 0x0), at 0xfe4167fb
  [5] Java_nsk_share_jvmti_DebugeeClass_checkStatus(0x8066510, 0xfe55ebf0, 0x0,
0xfe55ebc0, 0x1, 0xf81c10c0, 0xfe55ebc0, 0xf81c0e10, 0xfe55ebf8, 0xf81c17a0, 0x0, 0xf81c0e10, 0x0, 0xfe55ebf8, 0xfe55ec20, 0xfc202ea9, 0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38), at 0xfe4169b3
  [6] 0xfc20a23e(0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38, 0xfe55ec00, 0xf81c0a94), at 0xfc20a23e


Thread t@9 has the following stack snippet:

t@9 (l@9) stopped in __pollsys at 0xfeec8da5
0xfeec8da5: __pollsys+0x0015:   jb       __cerror       [ 0xfee405a0, .-0x88805
]
current thread: t@9
=>[1] __pollsys(0x0, 0x0, 0xf3e5ad60, 0x0), at 0xfeec8da5
  [2] _poll(0x0, 0x0, 0x32, 0xfe9ce7d6), at 0xfee6f502
  [3] os_sleep(0x32, 0x0, 0x0, 0xfe9ceaa2), at 0xfe9ce922
  [4] os::sleep(0x8116400, 0x32, 0x0, 0x0), at 0xfe9ceb40
  [5] WatcherThread::run(0x8116400, 0xfebfa000, 0xf3e5afd8, 0xfe9cb38e), at 0xfe6d820e


It is interesting to see that the VMThread is blocked:

t@3 (l@3) stopped in ___lwp_cond_wait at 0xfeec9455
0xfeec9455: ___lwp_cond_wait+0x0015:    jae      ___lwp_cond_wait+0x23  [ 0xfeec9463, .+0xe ]
current thread: t@3
=>[1] ___lwp_cond_wait(0x80f5448, 0x80f5430, 0x0, 0x1), at 0xfeec9455
  [2] os::PlatformEvent::park(0x80f5400, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
  [3] ParkCommon(0x80f5400, 0x0, 0x0, 0x1), at 0xfe9ba4cd
  [4] Monitor::ILock(0x81495a8, 0x80f4800, 0xfec13758, 0xfe9bad24), at 0xfe9ba657
  [5] Monitor::lock_without_safepoint_check(0x81495a8, 0x80f4800, 0x1, 0xfe9bad58), at 0xfe9bad39
  [6] Monitor::lock_without_safepoint_check(0x81495a8, 0x1, 0x3, 0xfea0186a), at 0xfe9bada3
  [7] ThreadSafepointState::examine_state_of_thread(0x8111fb0, 0xfebfa000, 0xf3ffeca8, 0xfe702eab), at 0xfea01895
  [8] SafepointSynchronize::begin(0x8067648, 0x8067620), at 0xfe702f2a
  [9] VMThread::loop(0x80f4800, 0x7f, 0x0, 0xfe6b471a), at 0xfe6b4b38
  [10] VMThread::run(0x80f4800, 0xfebfa000, 0xf3ffefd8, 0xfe9cb38e), at 0xfe6b47a1


Not sure if it means anything, but the first address passed
to the Monitor::ILock() call above appears in the stack traces
for the following threads:

t@10 (l@10) stopped in ___lwp_cond_wait at 0xfeec9455
0xfeec9455: ___lwp_cond_wait+0x0015:    jae      ___lwp_cond_wait+0x23  [ 0xfeec9463, .+0xe ]
current thread: t@10
=>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455
  [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22
  [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd
  [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21
  [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce
  [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at
0xfeaaeebc
  [7] JavaThread::check_safepoint_and_suspend_for_native_trans(0x815a800, 0x815a910, 0xf3dd7538, 0xfeaaef3c), at 0xfe715b25
  [8] JavaThread::check_special_condition_for_native_trans(0x815a800), at 0xfeaaef4e
  [9] 0xfc27b4ff(0xfe6dbf47, 0xf3dd74f8, 0xfebfa000, 0xf3dd759c, 0xf4093398, 0xfc27a588), at 0xfc27b4ff


t@11 (l@11) stopped in ___lwp_cond_wait at 0xfeec9455
0xfeec9455: ___lwp_cond_wait+0x0015:    jae      ___lwp_cond_wait+0x23  [ 0xfeec9463, .+0xe ]
current thread: t@11
=>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455
  [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
  [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd
  [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657
  [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39
  [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3
  [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7
  [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4
  [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd
  [10] JvmtiEnvBase::is_thread_fully_suspended(0x815a800, 0x1, 0xf3d88b08, 0xfe9365e8), at 0xfe93dc44
  [11] JvmtiEnv::GetStackTrace(0x8064b98, 0x815a800, 0x0, 0x400, 0xfe438500), at 0xfe936607
  [12] jvmti_GetStackTrace(0x8064b98, 0x815ce24, 0x0, 0x400, 0xfe438500, 0xfe435b38), at 0xfe9114ea
  [13] checkStackTrace(0x8064b98, 0x815bd10), at 0xfe41a932
  [14] agentProc(0x8064b98, 0x815bd10, 0x0), at 0xfe41acf9
  [15] agentThreadWrapper(0x8064b98, 0x815bd10, 0x0, 0xfe9532b2), at 0xfe4162a3
  [16] JvmtiAgentThread::call_start_function(0x815bc00, 0xfee646de, 0xfef3d5d0,
0xfe953288), at 0xfe953374
  [17] JvmtiAgentThread::start_function_wrapper(0x815bc00), at 0xfe95329a
  [18] JavaThread::thread_main_inner(0x815bc00, 0x0, 0x0, 0xfe6c2dbe), at 0xfe6c2fa9


It looks like the VMThread (t@3) is trying to begin a
safepoint and is blocked trying to examine the state
of another thread in a Monitor::ILock(0x81495a8) call.

Thread t@10 is trying to suspend itself and is calling
Monitor::IWait(0x81495a8).

Thread t@11 is waiting for a suspend to complete; it calls
Monitor::wait(0x81495a8) and blocks on the pending safepoint.
Uh oh....

I suspect that t@11 is still holding the lock it is trying
to wait() on. Look at the t@10 waiting sequence:

=>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455
  [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22
  [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd
  [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21
  [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce
  [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at
0xfeaaeebc

In t@10 we've gotten all the way down to lwp_cond_wait().

Look again at the t@11 waiting sequence:

=>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455
  [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
  [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd
  [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657
  [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39
  [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3
  [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7
  [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4
  [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd

Our Monitor::wait(0x81495a8) didn't call Monitor::IWait(...)
and get all the way down to lwp_cond_wait(). However, it
did call SafepointSynchronize::block()...

I'll have to check the code to see if these traces make sense...
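
In outline, the suspected cycle can be modeled with two threads
and two locks. A minimal standalone C++ sketch follows (this is
not HotSpot code: sr_lock stands in for t@10's SR_lock, and the
safepoint_done condition stands in for safepoint completion):

    // Standalone model of the suspected cycle; compile with -pthread.
    // The program hangs by design, mirroring the sp07t002 hang.
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::mutex sr_lock;                  // models t@10's SR_lock
    std::mutex safepoint_lock;
    std::condition_variable safepoint_cv;
    bool safepoint_done = false;         // only the "VMThread" sets this

    // t@11: takes SR_lock, then blocks "for the safepoint" while
    // still holding it, as in wait_for_ext_suspend_completion above.
    void agent_thread() {
      std::lock_guard<std::mutex> hold_sr(sr_lock);
      std::unique_lock<std::mutex> sp(safepoint_lock);
      safepoint_cv.wait(sp, [] { return safepoint_done; });  // never signaled
    }

    // t@3 (VMThread): must take SR_lock to examine the suspended
    // thread before it can finish the safepoint.
    void vm_thread() {
      // let the agent thread take SR_lock first, as in the hang
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> need_sr(sr_lock);  // blocks forever
      {
        std::lock_guard<std::mutex> sp(safepoint_lock);
        safepoint_done = true;                       // never reached
      }
      safepoint_cv.notify_all();
    }

    int main() {
      std::thread t11(agent_thread);
      std::thread t3(vm_thread);
      t11.join();   // hangs, like sp07t002
      t3.join();
      return 0;
    }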
I did an overnight doit_loop run with baseline JDK7-B70:

    java version "1.7.0-ea"
    Java(TM) SE Runtime Environment (build 1.7.0-ea-b70)
    Java HotSpot(TM) Client VM (build 16.0-b07, mixed mode)

    Java HotSpot(TM) Client VM (16.0-b07) for solaris-x86 JRE (1.7.0-ea-b70), built on Aug 20 2009 03:38:34 by "" with Workshop 5.9

Here's the summary:

Results: 1083 runs (4 FAILed, 1079 PASSed)
Failure details: hangs: 4
Run duration:   9 hour(s)  37 minute(s)  16 second(s)

The truss logs are similar in that only threads t@2 and t@9
show activity. The thread dumps have the same function names,
but different addresses (as expected). I've attached all the
logs for the record.
I did a search of nightly testing results. Here are some
hangs of this test observed in RT_Baseline:

http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-19/RT_Baseline/vm/solaris-amd64/server/mixed/solaris-amd64_server_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

So the config is Solaris AMD64 Server VM -Xmixed.

http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-10/RT_Baseline/vm/solaris-sparcv9/server/comp/solaris-sparcv9_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

So the config is Solaris SPARC-64 Server VM -Xcomp.

http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-06-16/RT_Baseline/vm/linux-amd64/server/comp/linux-amd64_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

So the config is Linux AMD64 Server VM -Xcomp.

http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-04-21/RT_Baseline/vm/windows-i586/client/mixed/windows-i586_client_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

So the config is Windows X86 Client VM -Xmixed.

I'm changing my call record to generic/generic.
Data from service_hs nightly analysis 2008.12.24 (final)

Unexpected Timeouts
-------------------

nsk.jvmti
    nsk/jvmti/scenarios/sampling/SP07/sp07t002
        2008.11.25 Win32 Client VM (machine colfax005)
        2008.11.05 Win32 Server VM (machine dantooine)
        2008.10.29 Solaris AMD64 Server VM (machine vm-x86-1)
        2008.06.18 Solaris SPARC-64 Server VM (machine jtg-s111)
        2008.05.15 Solaris X86 Client VM (machine vm-v20z-30)
        2008.02.04 Win32 Client VM (machine vm-v20z-19)
Data from service_hs nightly analysis 2006.11.09 (partial)

URL: http://gtee.sfbay/gtee/results/MUSTANG/NIGHTLY/VM-MAIN/2006-11-09/Serv_Baseline/
VM Build: 20061109061001.dcubed.service_hs_b03_merge.1 (new VM)
Last main/baseline putback: 20061109050230.ik199011.rt_merge
JDK: Dolphin-B02 (new JDK)
Testbase: v160r08 (same testbase as previous version)

Known nsk.jvmti UFOYOYOs:
    nsk/jvmti/scenarios/sampling/SP07/sp07t002
        Last failure on 2006.11.06 with Solaris SPARC-64 Server VM.
        Previous failure on 2006.10.31 with Solaris SPARC Server VM.

The fix for 6470570 was putback to rt_baseline on 2006.09.26.
rt_baseline was pushed to main/baseline on 2006.09.27.
main/baseline was pushed to serv_hs_baseline on 2006.09.27.

In the 2006.09.27 nightly report, the sp07t002 entry looked like

    nsk/jvmti/scenarios/sampling/SP07/sp07t002
        Last failure on 2005.10.04 with Linux IA32 Client VM.
        Previous failure on 2005.10.01 with Linux IA32 Client VM.

A new hang was first noticed on 2006.10.31:

*   nsk/jvmti/scenarios/sampling/SP07/sp07t002
        Last failure on 2006.10.31 with Solaris SPARC Server VM.
        Previous failure on 2005.10.04 with Linux IA32 Client VM.

I don't know whether there were earlier hangs that I simply missed.

Comments
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-rt/hotspot/rev/87770dcf831b
23-09-2009

SUGGESTED FIX Please see the attached 6876794-webrev-cr0.tgz for the proposed fix.
16-09-2009

EVALUATION I think I've figured out why this deadlock is bothering me. SafepointSynchronize::begin() is the only caller of examine_state_of_thread(), and only the VMThread calls SafepointSynchronize::begin(). On top of that, examine_state_of_thread() is called from the safepoint polling loop. So why do we need such a precise answer as to whether the target thread is suspended or not? If the safepoint polling loop doesn't see the thread as suspended the first time around, then it will come around again. In other words, why can't we use is_any_suspended() instead of is_any_suspended_with_lock()? Or better yet:

- replace the single is_any_suspended_with_lock() call that remains in the system with an is_ext_suspended() call
- replace the single is_any_suspended() call that remains in the system with an is_ext_suspended() call
- delete is_any_suspended_with_lock() and is_any_suspended()

The Thread suspend flags are declared:

    volatile uint32_t _suspend_flags;

and the external suspend flag is only set, via an Atomic::cmpxchg() call, while the SR_lock is held. It seems to me that this is a case of over-locking...
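
A minimal standalone sketch of that argument follows (hypothetical
names modeled on the declaration quoted above; std::atomic stands in
for HotSpot's volatile field plus Atomic::cmpxchg(), and none of this
is the actual Thread code):

    #include <atomic>
    #include <cstdint>

    enum SuspendFlag : uint32_t { EXT_SUSPENDED = 0x01 };

    struct ThreadSketch {
      std::atomic<uint32_t> _suspend_flags{0};

      // Writer side: publish the flag with a compare-and-swap.
      // (HotSpot does this while holding SR_lock, but the point is
      // that readers do not depend on that lock.)
      void set_ext_suspended() {
        uint32_t old_val = _suspend_flags.load();
        while (!_suspend_flags.compare_exchange_weak(
                   old_val, old_val | EXT_SUSPENDED)) {
          // old_val was refreshed by the failed CAS; retry
        }
      }

      // Reader side: a plain atomic load is enough for the safepoint
      // polling loop -- if it misses a just-set flag, the loop simply
      // comes around again.
      bool is_ext_suspended() const {
        return (_suspend_flags.load() & EXT_SUSPENDED) != 0;
      }
    };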
02-09-2009

EVALUATION The VMThread has initiated a safepoint, holds the Threads_lock, and is examining the state of the JavaThreads. In particular, it is trying to acquire the SR_lock of a suspended thread (t@10). Meanwhile another thread (t@11) is doing JavaThread::wait_for_ext_suspend_completion; it holds the SR_lock of t@10 and is doing a Monitor::wait on that SR_lock with a safepoint check. However, the safepoint check happens while the SR_lock is still owned, so t@11 blocks for the safepoint while the VMThread blocks on the SR_lock that t@11 owns - deadlock.

The fix/changes put in for 6470570 made a mistake in assuming that the wait operation would act as if it did:

- release lock
- safepoint check
- actual wait

where what it actually does is:

- set lock owner to null
- safepoint check
- release low-level lock
- actual wait

This issue is skirting the edges of the lock-sneaking mechanism, but the sneak-check in the VMThread's attempted acquisition of the SR_lock fails:

    bool can_sneak = Self->is_VM_thread() && SafepointSynchronize::is_at_safepoint()

because the system is not yet at a safepoint. It would be safe to sneak in this case, but there is no obvious check that this is the case.

It appears that the changes for 6470570 need to be partly regressed:

- restore the explicit safepoint check by acquiring the Threads_lock (even though it causes that lock to be acquired and released twice!)
- remove the safepoint check from the lock() and wait() calls
- ensure the SR_lock is not held when doing the safepoint check

I don't, at this point, see an alternative. Either the SR_lock must be fully dropped before the safepoint check, or else the VMThread must be able to sneak the SR_lock. Restructuring the Wait/IWait internal logic to release the internal lock prior to the safepoint check is not a feasible option.
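
The ordering difference is easiest to see side by side. The sketch
below is illustrative pseudo-HotSpot C++, not the real Monitor code;
safepoint_check() and do_actual_wait() are hypothetical stand-ins:

    #include <mutex>

    void safepoint_check() { /* may block in SafepointSynchronize::block() */ }
    void do_actual_wait()  { /* park on the platform event */ }

    struct MonitorSketch {
      std::mutex low_level_lock;   // the Monitor's internal lock
      void*      owner = nullptr;  // the "logical" owner field

      // Ordering the 6470570 changes assumed.
      // Precondition: caller holds low_level_lock.
      void wait_as_assumed() {
        owner = nullptr;
        low_level_lock.unlock();   // lock fully released first...
        safepoint_check();         // ...so blocking here is harmless
        do_actual_wait();
      }

      // Ordering Monitor::wait() actually performs.
      // Precondition: caller holds low_level_lock.
      void wait_as_implemented() {
        owner = nullptr;           // looks released to an owner check...
        safepoint_check();         // ...but low_level_lock is still held,
                                   // so if this blocks for a safepoint,
                                   // the VMThread (which needs this lock)
                                   // deadlocks with us
        low_level_lock.unlock();
        do_actual_wait();
      }
    };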
29-08-2009