Duplicate :
|
|
Relates :
|
|
Relates :
|
|
Relates :
|
During my testing of my fix for 6419370, the following VM/NSK test hung: nsk/jvmti/scenarios/sampling/SP07/sp07t002 on my Solaris X86 machine running the following config: java version "1.7.0-ea" Java(TM) SE Runtime Environment (build 1.7.0-ea-b70) Java HotSpot(TM) Client VM (build 17.0-b01-internal, mixed mode) I created a doit.sh script from the .tlog file and used my doit_loop script to reproduce the hang once: Results: 318 runs (1 FAILed, 317 PASSed) Failure details: hangs: 1 Run duration: 3 hour(s) 51 second(s) I've attached the thread dump in threads.log.295 and a 30 second truss snapshot in truss.log.295. Only two threads were active in the 30 second snapshot: % sed 's/[ ].*//' truss.log.295 | sort | uniq -c 1 /1: 1 /10: 1 /11: 43 /2: 1 /3: 1 /4: 1 /5: 1 /6: 1 /7: 1 /8: 499 /9: Thread t@2 has the following stack snippet: t@2 (l@2) stopped in ___nanosleep at 0xfeec8095 0xfeec8095: ___nanosleep+0x0015: jae ___nanosleep+0x23 [ 0xfeec80a3, .+0xe ] current thread: t@2 [1] ___nanosleep(0x1), at 0xfeec8095 [2] THREAD_sleep(0x1), at 0xfe413424 [3] nsk_jvmti_sleep(0x3e8, 0x0), at 0xfe4166cd [4] syncDebugeeStatus(0x8066510, 0x8064b98, 0x0), at 0xfe4167fb [5] Java_nsk_share_jvmti_DebugeeClass_checkStatus(0x8066510, 0xfe55ebf0, 0x0, 0xfe55ebc0, 0x1, 0xf81c10c0, 0xfe55ebc0, 0xf81c0e10, 0xfe55ebf8, 0xf81c17a0, 0x0, 0xf81c0e10, 0x0, 0xfe55ebf8, 0xfe55ec20, 0xfc202ea9, 0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38), at 0xfe4169b3 [6] 0xfc20a23e(0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38, 0xfe55ec00, 0xf81c0a94), at 0xfc20a23e Thread t@9 has the following stack snippet: t@9 (l@9) stopped in __pollsys at 0xfeec8da5 0xfeec8da5: __pollsys+0x0015: jb __cerror [ 0xfee405a0, .-0x88805 ] current thread: t@9 =>[1] __pollsys(0x0, 0x0, 0xf3e5ad60, 0x0), at 0xfeec8da5 [2] _poll(0x0, 0x0, 0x32, 0xfe9ce7d6), at 0xfee6f502 [3] os_sleep(0x32, 0x0, 0x0, 0xfe9ceaa2), at 0xfe9ce922 [4] os::sleep(0x8116400, 0x32, 0x0, 0x0), at 0xfe9ceb40 [5] WatcherThread::run(0x8116400, 0xfebfa000, 0xf3e5afd8, 0xfe9cb38e), at 0xfe6d820e It is interesting to see that the VMThread is blocked: t@3 (l@3) stopped in ___lwp_cond_wait at 0xfeec9455 0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ] current thread: t@3 =>[1] ___lwp_cond_wait(0x80f5448, 0x80f5430, 0x0, 0x1), at 0xfeec9455 [2] os::PlatformEvent::park(0x80f5400, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22 [3] ParkCommon(0x80f5400, 0x0, 0x0, 0x1), at 0xfe9ba4cd [4] Monitor::ILock(0x81495a8, 0x80f4800, 0xfec13758, 0xfe9bad24), at 0xfe9ba657 [5] Monitor::lock_without_safepoint_check(0x81495a8, 0x80f4800, 0x1, 0xfe9bad58), at 0xfe9bad39 [6] Monitor::lock_without_safepoint_check(0x81495a8, 0x1, 0x3, 0xfea0186a), at 0xfe9bada3 [7] ThreadSafepointState::examine_state_of_thread(0x8111fb0, 0xfebfa000, 0xf3ffeca8, 0xfe702eab), at 0xfea01895 [8] SafepointSynchronize::begin(0x8067648, 0x8067620), at 0xfe702f2a [9] VMThread::loop(0x80f4800, 0x7f, 0x0, 0xfe6b471a), at 0xfe6b4b38 [10] VMThread::run(0x80f4800, 0xfebfa000, 0xf3ffefd8, 0xfe9cb38e), at 0xfe6b47a1 Not sure if it means anything, but the first address passed to the Monitor::ILock() call above appears in the stack traces for the following threads: t@10 (l@10) stopped in ___lwp_cond_wait at 0xfeec9455 0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ] current thread: t@10 =>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455 [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22 [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21 [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at 0xfeaaeebc [7] JavaThread::check_safepoint_and_suspend_for_native_trans(0x815a800, 0x815a910, 0xf3dd7538, 0xfeaaef3c), at 0xfe715b25 [8] JavaThread::check_special_condition_for_native_trans(0x815a800), at 0xfeaaef4e [9] 0xfc27b4ff(0xfe6dbf47, 0xf3dd74f8, 0xfebfa000, 0xf3dd759c, 0xf4093398, 0xfc27a588), at 0xfc27b4ff t@11 (l@11) stopped in ___lwp_cond_wait at 0xfeec9455 0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ] current thread: t@11 =>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455 [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22 [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657 [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39 [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3 [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7 [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4 [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd [10] JvmtiEnvBase::is_thread_fully_suspended(0x815a800, 0x1, 0xf3d88b08, 0xfe9365e8), at 0xfe93dc44 [11] JvmtiEnv::GetStackTrace(0x8064b98, 0x815a800, 0x0, 0x400, 0xfe438500), at 0xfe936607 [12] jvmti_GetStackTrace(0x8064b98, 0x815ce24, 0x0, 0x400, 0xfe438500, 0xfe435b38), at 0xfe9114ea [13] checkStackTrace(0x8064b98, 0x815bd10), at 0xfe41a932 [14] agentProc(0x8064b98, 0x815bd10, 0x0), at 0xfe41acf9 [15] agentThreadWrapper(0x8064b98, 0x815bd10, 0x0, 0xfe9532b2), at 0xfe4162a3 [16] JvmtiAgentThread::call_start_function(0x815bc00, 0xfee646de, 0xfef3d5d0, 0xfe953288), at 0xfe953374 [17] JvmtiAgentThread::start_function_wrapper(0x815bc00), at 0xfe95329a [18] JavaThread::thread_main_inner(0x815bc00, 0x0, 0x0, 0xfe6c2dbe), at 0xfe6c2fa9 It looks like the VMThread (t@3) is trying to begin a safepoint and is blocked trying to examine the state of another thread in a Monitor::ILock(0x81495a8) call. Thread t@10 is trying to suspend itself and is calling Monitor::IWait(0x81495a8). thread t@11 is waiting for a suspend to complete, calls Monitor::wait(0x81495a8) and blocks on the pending safepoint. Uh oh.... I suspect that t@11 is still holding the lock it is trying to wait() on. Look at the t@10 waiting sequence: =>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455 [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22 [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21 [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at 0xfeaaeebc In t@10 we've gotten all the way down to lwp_cond_wait(). Look again at the t@11 waiting sequence: =>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455 [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22 [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657 [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39 [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3 [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7 [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4 [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd Our Monitor::wait(0x81495a8) didn't call Monitor:IWait(...) and get all the way down to lwp_cond_wait(). However, it did call SafepointSynchronize::block()... I'll have to check the code to see if these traces make sense... I did an overnight doit_loop run with baseline JDK7-B70: java version "1.7.0-ea" Java(TM) SE Runtime Environment (build 1.7.0-ea-b70) Java HotSpot(TM) Client VM (build 16.0-b07, mixed mode) Java HotSpot(TM) Client VM (16.0-b07) for solaris-x86 JRE (1.7.0-ea-b70), built on Aug 20 2009 03:38:34 by "" with Workshop 5.9 Here's the summary: Results: 1083 runs (4 FAILed, 1079 PASSed) Failure details: hangs: 4 Run duration: 9 hour(s) 37 minute(s) 16 second(s) The truss logs are similar in that only threads t@2 and t@9 show activity. The thread dumps have the same function names, but different address (as expected). I've attached all the logs for the record. I did a search of nightly testing results. Here are some hangs of this test observed in RT_Baseline: http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-19/RT_Baseline/vm/solaris-amd64/server/mixed/solaris-amd64_server_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log So the config is Solaris AMD64 Server VM -Xmixed. http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-10/RT_Baseline/vm/solaris-sparcv9/server/comp/solaris-sparcv9_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log So the config is Solaris SPARC-64 Server VM -Xcomp. http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-06-16/RT_Baseline/vm/linux-amd64/server/comp/linux-amd64_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log So the config is Linux AMD64 Server VM -Xcomp. http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-04-21/RT_Baseline/vm/windows-i586/client/mixed/windows-i586_client_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log So the config is Windows X86 Client VM -Xmixed. I'm changing my call record to generic/generic. Data from service_hs nightly analysis 2008.12.24 (final) Unexpected Timeouts ------------------- nsk.jvmti nsk/jvmti/scenarios/sampling/SP07/sp07t002 2008.11.25 Win32 Client VM (machine colfax005) 2008.11.05 Win32 Server VM (machine dantooine) 2008.10.29 Solaris AMD64 Server VM (machine vm-x86-1) 2008.06.18 Solaris SPARC-64 Server VM (machine jtg-s111) 2008.05.15 Solaris X86 Client VM (machine vm-v20z-30) 2008.02.04 Win32 Client VM (machine vm-v20z-19) Data from service_hs nightly analysis 2006.11.09 (partial) URL: http://gtee.sfbay/gtee/results/MUSTANG/NIGHTLY/VM-MAIN/2006-11-09/Serv_Baseline/ VM Build: 20061109061001.dcubed.service_hs_b03_merge.1 (new VM) Last main/baseline putback: 20061109050230.ik199011.rt_merge JDK: Dolphin-B02 (new JDK) Testbase: v160r08 (same testbase as previous version) Known nsk.jvmti UFOYOYOs: nsk/jvmti/scenarios/sampling/SP07/sp07t002 Last failure on 2006.11.06 with Solaris SPARC-64 Server VM. Previous failure on 2006.10.31 with Solaris SPARC Server VM. The fix for 6470570 was putback to rt_baseline on 2006.09.26. rt_baseline was pushed to main/baseline on 2006.09.27. main/baseline was pushed to serv_hs_baseline on 2006.09.27. In the 2006.09.27 nightly report, the sp07t002 entry looked like nsk/jvmti/scenarios/sampling/SP07/sp07t002 Last failure on 2005.10.04 with Linux IA32 Client VM. Previous failure on 2005.10.01 with Linux IA32 Client VM. A new hang was first noticed on 2006.10.31: * nsk/jvmti/scenarios/sampling/SP07/sp07t002 Last failure on 2006.10.31 with Solaris SPARC Server VM. Previous failure on 2005.10.04 with Linux IA32 Client VM. I don't know whether I just missed any earlier hangs or not.
|