JDK-4040218 : thread.suspend() hangs on native threads
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: unknown
  • Priority: P2
  • Status: Closed
  • Resolution: Won't Fix
  • OS: solaris_9
  • CPU: other
  • Submitted: 1997-03-20
  • Updated: 1997-07-02
  • Resolved: 1997-07-02
Description
Tests such as ThreadGroupTests62 fail on the native threads JVM.

The following mail from Sheng and Devang explains the problem. Java and dbx traces follow.

From Sheng:

You are right that suspend() is a problem on all platforms. (On 
green threads this problem occurs much less often.)

Thread.suspend is inherently unsafe. The suspended thread could
be holding Java locks or system locks. I don't see any way
around this other than relaxing the JCK tests.

-Sheng

> From devang@jurassic Wed Mar 19 17:59:06 1997
> Date: Wed, 19 Mar 1997 17:57:23 -0800
> From: devang@jurassic (Devang K. Shah)
> To: sl@jurassic, lindholm@jurassic
> Subject: suspend()/resume()
> Cc: never@jurassic, david.a.brown@jurassic, dbowen@jurassic, jvm-eng@carolina
> 
> 
> Hi folks,
> 
> The JCK runs with native threads, on a dual processor Ultra, show 5 new
> failures, i.e. they do not occur when run with green threads.
> 
> On analyzing these failures, a majority of them (at least 3 of them: 
> ThreadGroupTests{4,35,62}.java) occur due to the illegal use of suspend().
> 
> As we know, the JVM is not safe with respect to suspend(), on all platforms
> (including Win95, NT, green threads, native threads).
> 
> See bugid 4030677, which was closed as not a bug, because the application
> hung due to an illegal use of suspend().
> 
> The problem is that several JCK tests make use of suspend() in this manner
> causing hangs.
> 
> We should:
> 
> (1) deprecate these interfaces from the JDK (1.1.1 ??)
> (2) in the meantime, fix these tests or exclude them from the JCK
> 
> The suspend() problem is much more visible on Solaris/native threads since
> we run the tests on a multi-processor, where hard stops are issued from
> a suspend() method call, and the target thread(s) could be stopped inside
> the JVM while holding an internal JVM lock. When the suspending thread tries
> to do anything which might require the same lock, there is a deadlock.
> 
> In all the failures, the "Name and type hash table lock" was held by a stopped
> thread. The suspending thread tried to acquire this, before the owner was
> resumed.
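
The hang described in the mail can be shown in miniature without calling suspend() itself. What follows is a minimal sketch, not the JVM's code: the class SuspendDeadlockSketch, the lock named tableLock (standing in for the "Name and type hash table lock"), and the latches are all hypothetical. The target thread is parked on a latch while holding the lock, as if it had been stopped there, and the suspender uses a timed tryLock() so the sketch terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration; names are assumptions, not JVM internals. */
final class SuspendDeadlockSketch {

    /** Returns { acquiredWhileOwnerStopped, acquiredAfterOwnerResumed }. */
    static boolean[] run() throws InterruptedException {
        final ReentrantLock tableLock = new ReentrantLock(); // stand-in for an internal lock
        final CountDownLatch holding = new CountDownLatch(1);
        final CountDownLatch resumed = new CountDownLatch(1);

        Thread target = new Thread(() -> {
            tableLock.lock();
            try {
                holding.countDown();
                resumed.await();          // "stopped" while still holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                tableLock.unlock();
            }
        }, "t2");
        target.start();
        holding.await();                  // the owner now holds the lock and is stopped

        // The suspender needs the same lock before it resumes the owner.
        // A plain lock() here would block forever; the timeout only bounds the demo.
        boolean whileStopped = tableLock.tryLock(200, TimeUnit.MILLISECONDS);
        if (whileStopped) tableLock.unlock();

        resumed.countDown();              // resume the owner, which releases the lock
        target.join();
        boolean afterResume = tableLock.tryLock(1, TimeUnit.SECONDS);
        if (afterResume) tableLock.unlock();

        return new boolean[] { whileStopped, afterResume };
    }
}
```

In the real failure there is no timeout and nothing resumes the owner first, so the wait never ends.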
>>  java -Djava.compiler=none javasoft.sqe.tests.api.java.lang.ThreadGroup.ThreadGroupTests62
>>  done
1

Full thread dump:
    "t2" (TID:0xed7049a8, sys_thread_t:0xa6ad8, state:S, thread_t: t@13, sp:0xef173520 pc:0xef4392c8 threadID:0xef173df8, stack_base:0xef173d8c, stack_size:0x22000) prio=5
    "t2" (TID:0xed7047e0, sys_thread_t:0xa6a68, state:S, thread_t: t@11, sp:0xef1d3468 pc:0xef786fd4 threadID:0xef1d3df8, stack_base:0xef1d3d8c, stack_size:0x22000) prio=5
        javasoft.sqe.tests.api.java.lang.ThreadGroup.LoopingThread.run(LoopingThread.java:64)
    "t2" (TID:0xed704408, sys_thread_t:0xa6988, state:S, thread_t: t@7, sp:0xef2933a8 pc:0xef786fd4 threadID:0xef293df8, stack_base:0xef293d8c, stack_size:0x22000) prio=5
        javasoft.sqe.tests.api.java.lang.ThreadGroup.LoopingThread.run(LoopingThread.java:70)
    "SIGQUIT handler" (TID:0xed700190, sys_thread_t:0x79230, state:R, thread_t: t@5, sp:0xef2f3b10 pc:0xef786fd4 threadID:0xef2f3df8, stack_base:0xef2f3d8c, stack_size:0x22000) prio=0 *current thread*
    "Finalizer thread" (TID:0xed7000d0, sys_thread_t:0x79cc8, state:CW, thread_t: t@4, sp:0xef3d3a58 pc:0xef785854 threadID:0xef3d3df8, stack_base:0xef3d3d8c, stack_size:0x22000) prio=1
    "main" (TID:0xed7000a8, sys_thread_t:0x4cd18, state:MW, thread_t: t@1, sp:0xeffff178 pc:0xef785854 threadID:0x20a30, stack_base:0xeffff9d0, stack_size:0x4000) prio=5
        javasoft.sqe.tests.api.java.lang.ThreadGroup.ThreadGroupTests64.provide_OUT(ThreadGroupTests62.java:186)
        javasoft.sqe.tests.api.java.lang.ThreadGroup.ThreadGroupTests64.provider(ThreadGroupTests62.java:113)
        suntest.quicktest.runtime.QT_GridTest.testAllSingle(QT_GridTest.java:215)
        suntest.quicktest.runtime.QT_GridTest.testAll(QT_GridTest.java:159)
        suntest.quicktest.runtime.JT_TestFactory.run(JT_TestFactory.java:40)
        javasoft.sqe.tests.api.java.lang.ThreadGroup.ThreadGroupTests62.main(ThreadGroupTests62.java:254)
Monitor Cache Dump:
    java.lang.Class@ED703C38/ED74E5F0: owner "main" (0x4cd18, 1 entry)
Registered Monitor Dump:
    Thread queue lock: owner "SIGQUIT handler" (0x79230, 1 entry)
    Name and type hash table lock: owner "t2" (0xa6ad8, 1 entry)
        Waiting to acquire:
            "main"
    String intern lock: <unowned>
    JNI pinning lock: <unowned>
    JNI global reference lock: <unowned>
    BinClass lock: <unowned>
    Class loading lock: <unowned>
    Java stack lock: <unowned>
    Code rewrite lock: <unowned>
    Heap lock: <unowned>
    Has finalization queue lock: <unowned>
    Finalize me queue lock: <unowned>
        Waiting to be notified:
            "Finalizer thread"
    Monitor cache expansion lock: <unowned>
    Monitor registry: owner "SIGQUIT handler" (0x79230, 1 entry)
    
    Attached to process 27850 with 4 LWPs
t@5 (l@1) stopped in __lwp_sema_wait at 0xef4392c8
0xef4392c8: __lwp_sema_wait+0x0008:     ta      %icc,%g0 + 8
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) lwps                                
 >l@1 running          in __lwp_sema_wait()
  l@2 running          in __signotifywait()
  l@3 running          in __lwp_sema_wait()
  l@5 running          in __door_return()
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) threads
      t@1         ?()   sleep on 0x25c70        in _swtch()
      t@2  b l@2  ?()   running                 in __signotifywait()
      t@3  a l@3  ?()   sleep on 0xef5c65c8     in __lwp_sema_wait()
      t@4         _start()      sleep on 0x23728        in _swtch()
 >    t@5  a l@1  _start()      sleep on 0xef5c6290     in __lwp_sema_wait()
      t@7         _start()      suspended       in _swtch()
     t@11         _start()      suspended       in _swtch()
     t@13         _start()      suspended       in _swtch()
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) thread t@13
t@13 (l@X) stopped in _swtch at 0xef5a7360
0xef5a7360: _swtch+0x0308:      call    0xef5c50d4 [PLT 35: _resume]
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) where      
current thread: t@13
=>[1] _swtch(0xeef0dd98, 0xef5c4aec, 0xef173e6c, 0xef173e68, 0xef173e64, 0xef173e60), at 0xef5a7360
  [2] _dopreempt(0xef5c4aec, 0x19b80, 0x0, 0x0, 0x0, 0x0), at 0xef5a891c
  [3] _siglwp(0xef5c4aec, 0xef173a80, 0xef1737c8, 0xef173708, 0xef173e60, 0xef173e40), at 0xef5aafec
  ---- called from signal handler with signal 33 (SIGLWP) ------
  [4] 0xc0(), at 0xbf
  [5] do_execute_java_method_vararg(0xef173cbc, 0xef7abec0, 0xef7abc00, 0x0, 0x0, 0x0), at 0xef75d348
  [6] execute_java_dynamic_method(0xef173cbc, 0xed7049a8, 0xef7a5a18, 0xef7a5a1c, 0xef7a8000, 0xef5ae488), at 0xef75cf2c
  [7] ThreadRT0(0xed7049a8, 0xef7a5a1c, 0x22000, 0xef5c4aec, 0x4, 0xa6ad8), at 0xef77798c
  [8] _start(0x0, 0xeef0dd98, 0x1, 0xef5cd704, 0x0, 0xeed09df8), at 0xef783e88
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) thread t@1
t@1 (l@X) stopped in _swtch at 0xef5a7360
0xef5a7360: _swtch+0x0308:      call    0xef5c50d4 [PLT 35: _resume]
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) where     
current thread: t@1
=>[1] _swtch(0xef5c83b0, 0xef5c4aec, 0x20aa4, 0x20aa0, 0x20a9c, 0x20a98), at 0xef5a7360
  [2] cond_wait(0x25c70, 0x25c50, 0x4356, 0xef5c4aec, 0x8, 0x20a78), at 0xef5a6118
  [3] condvarWait(0x4cd18, 0x25c50, 0x3, 0x0, 0xa97f0, 0x23b3c), at 0xef78589c
  [4] sysMonitorEnter(0x25c50, 0x25c70, 0x4cd18, 0x25c90, 0x0, 0x0), at 0xef7833a8
  [5] NameAndTypeToHash(0xaa1a8, 0xa9b08, 0xffff0000, 0xef7abc00, 0x0, 0x0), at 0xef74d23c
  [6] Locked_ResolveClassConstant(0x118, 0xef7a0658, 0x46, 0xef7a0658, 0xa9a50, 0xa97f0), at 0xef745a78
  [7] Locked_ResolveClassConstantField(0xa, 0xed703c38, 0xa97f0, 0x20, 0xef7a0658, 0xa98cc), at 0xef745b24
  [8] Locked_ResolveClassConstant(0xed703c38, 0xef7a0658, 0x37, 0xef7a0658, 0xa9a50, 0xa97f0), at 0xef745a98
  [9] ResolveClassConstant(0xa97f0, 0xed703c38, 0x37, 0x400, 0x400, 0xef7a0658), at 0xef74572c
  [10] ResolveClassConstantFromPC(0x37, 0xb6, 0xa97f0, 0xef7a0658, 0x400, 0xb6), at 0xef75ec1c
  [11] invokevirtual_0(0xaa7f1, 0xef7a0658, 0x23b68, 0x23b08, 0xa97f0, 0x23b3c), at 0xef78b090
  [12] do_execute_java_method_vararg(0xef7a0658, 0x9c644, 0x0, 0x0, 0x0, 0x23990), at 0xef75d888
  [13] do_execute_java_method(0xef7a0658, 0xed703ab0, 0x0, 0x0, 0x9c998, 0x1), at 0xef75d238
  [14] java_main(0x0, 0x9c998, 0xeffffb4c, 0x0, 0xed703ad0, 0xed703ab0), at 0xef77f808
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) thread t@13
t@13 (l@X) stopped in _swtch at 0xef5a7360
0xef5a7360: _swtch+0x0308:      call    0xef5c50d4 [PLT 35: _resume]
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) where     
current thread: t@13
=>[1] _swtch(0xeef0dd98, 0xef5c4aec, 0xef173e6c, 0xef173e68, 0xef173e64, 0xef173e60), at 0xef5a7360
  [2] _dopreempt(0xef5c4aec, 0x19b80, 0x0, 0x0, 0x0, 0x0), at 0xef5a891c
  [3] _siglwp(0xef5c4aec, 0xef173a80, 0xef1737c8, 0xef173708, 0xef173e60, 0xef173e40), at 0xef5aafec
  ---- called from signal handler with signal 33 (SIGLWP) ------
  [4] 0xc0(), at 0xbf
  [5] do_execute_java_method_vararg(0xef173cbc, 0xef7abec0, 0xef7abc00, 0x0, 0x0, 0x0), at 0xef75d348
  [6] execute_java_dynamic_method(0xef173cbc, 0xed7049a8, 0xef7a5a18, 0xef7a5a1c, 0xef7a8000, 0xef5ae488), at 0xef75cf2c
  [7] ThreadRT0(0xed7049a8, 0xef7a5a1c, 0x22000, 0xef5c4aec, 0x4, 0xa6ad8), at 0xef77798c
  [8] _start(0x0, 0xeef0dd98, 0x1, 0xef5cd704, 0x0, 0xeed09df8), at 0xef783e88
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) x 0xef1737c8/40X
dbx: warning: unknown language, 'ansic' assumed
0xef1737c8:      0x0000000f 0xef1d36b0 0x00000000 0x00000000
0xef1737d8:      0x00000000 0x00000000 0xefffc000 0x00004000
0xef1737e8:      0x00000000 0x609e09c0 0xfe401007 0xef4a7130
0xef1737f8:      0xef4a7134 0x00000000 0x00000002 0x000272f9
0xef173808:      0x001397c8 0x0000001f 0xef7fa3a4 0x00000000
0xef173818:      0xef173df8 0x000272f9 0x000007d3 0xef173b60
0xef173828:      0xef778928 0x28295600 0x28295600 0xef173b00
0xef173838:      0xef778a38 0x00000000 0xffffffff 0xffffffff
0xef173848:      0xffffffff 0xffffffff 0xffffffff 0xffffffff
0xef173858:      0xffffffff 0xffffffff 0xffffffff 0xffffffff
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) x 0xef4a7130/i  
0xef4a7130: ..urem+0x0010:      udiv    %o0, %o1, %o2
(/ws/on297-tools/SUNWspro/SC4.2/bin/dbx) 


Comments
WORK AROUND This will be documented for the native threads release in Solaris 2.6.
11-06-2004

SUGGESTED FIX A couple of alternatives:
1. Deprecate the suspend()/resume() methods in the Thread class by inserting the following line in the comment before the suspend()/resume() methods in the src/share/classes/java/lang/Thread.java file (1.2 location) or the src/share/java/java/lang/Thread.java file (pre-1.2 location):
     * @deprecated
2. Prevent ordinary Java application threads from calling suspend()/resume() by using the checkAccess method. A SecurityException would be raised in general use. Only special threads, such as the debugger thread, would have the access privileges to call suspend()/resume() on threads.
11-06-2004
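
Both alternatives can be sketched on a stand-in class. This is only an illustration under stated assumptions: GuardedThreadControl, grantAccess(), isSuspendRequested(), and the privileged-thread set are hypothetical, and the class is not java.lang.Thread. The private checkAccess() here is a local helper that mimics the proposed behavior; it is not the checkAccess method of the Thread class. The javadoc tag is the one proposed above, and the @Deprecated annotation is a later Java language feature added here so the marking is visible at run time.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a thread-control API; NOT the real java.lang.Thread. */
final class GuardedThreadControl {
    // Threads allowed to suspend others, e.g. a debugger thread (assumption).
    private final Set<Thread> privileged = ConcurrentHashMap.newKeySet();
    // Targets with an outstanding request; this sketch only records them.
    private final Set<Thread> suspendRequested = ConcurrentHashMap.newKeySet();

    void grantAccess(Thread t) {
        privileged.add(t);
    }

    /**
     * Requests suspension of the target thread.
     *
     * @deprecated Deadlock-prone: the target may be stopped while holding
     *             a lock that the thread meant to resume it needs.
     */
    @Deprecated
    void suspend(Thread target) {
        checkAccess();                 // alternative 2: reject ordinary callers
        suspendRequested.add(target);
    }

    boolean isSuspendRequested(Thread target) {
        return suspendRequested.contains(target);
    }

    // Local helper: throws unless the calling thread was granted access.
    private void checkAccess() {
        if (!privileged.contains(Thread.currentThread())) {
            throw new SecurityException("suspend() is restricted to privileged threads");
        }
    }
}
```

An ordinary caller gets a SecurityException, while a thread registered through grantAccess() can still issue the request.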

PUBLIC COMMENTS The Java Thread.suspend() method may cause Java application programs to hang. The suspend() and resume() methods should not be used. A hang can occur when a thread is suspended while holding a lock and the thread responsible for resuming it needs that lock. This is a general problem in threaded programming: improper use of these primitives leads to application deadlocks. These methods are under review by JavaSoft. Workaround: Use more appropriate synchronization methods, such as wait() and notify(), instead.
10-06-2004
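
The wait()/notify() workaround can be sketched as a worker that pauses itself at a point of its own choosing, so it is never stopped while holding a lock that another thread needs. This is a minimal sketch, not code from the JDK or the JCK: the class PausableWorker and its methods pause(), resumeWork(), shutdown(), and iterations() are hypothetical names, and the 1 ms sleep merely stands in for real work.

```java
/** Hypothetical worker that pauses cooperatively instead of being suspended. */
final class PausableWorker implements Runnable {
    private final Object lock = new Object();
    private boolean paused = false;    // guarded by lock
    private boolean stopped = false;   // guarded by lock
    private long iterations = 0;       // guarded by lock

    void pause() {
        synchronized (lock) { paused = true; }
    }

    void resumeWork() {
        synchronized (lock) { paused = false; lock.notifyAll(); }
    }

    void shutdown() {
        synchronized (lock) { stopped = true; lock.notifyAll(); }
    }

    long iterations() {
        synchronized (lock) { return iterations; }
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            synchronized (lock) {
                // The worker blocks here, at a point where it holds no other lock.
                while (paused && !stopped) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                if (stopped) return;
                iterations++;
            }
            doUnitOfWork();            // runs outside the lock
        }
    }

    // Placeholder for real work (assumption).
    private static void doUnitOfWork() {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Once pause() returns, the iteration count stops advancing until resumeWork() is called, because the flag is set and checked under the same lock that guards the counter.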

EVALUATION This bug represents a fundamental problem: the JVM is not safe with respect to thread stopping, as described in the Description. The question is: should this bug be fixed, i.e. should the JVM be made safe with respect to thread stopping?

It is possible to fix it by demarcating stop-safe and stop-unsafe regions in the JVM, and ensuring that internal critical sections are stop-unsafe regions. This could be done by implementing a mechanism similar to the cancellation mechanism in POSIX threads, as implemented on Solaris: on entry to a monitor, disable thread suspension, and on exit, check for pending suspend requests. This description of a fix is somewhat simplistic. The real mechanism would be more complex because of the race between the suspender detecting the target's position and the target entering an unsafe region. The suspender also needs to issue a hard suspend if the target is in a safe region, to prevent run-away threads which never enter an unsafe region and so never check for pending suspends on exit from one.

The answer to the above question should take into account decisions made by other threads interfaces/standards, which represent years of experience with multi-threaded programming. Quoting from section 16.1.8.6 ("Omitted and Rejected Functions"), and sub-section 16.1.8.6.1 ("Thread Suspend/Resume") of the POSIX threads standard:

"Suspend/resume is considered error-prone when generally used as a thread synchronization mechanism by applications, as there are race conditions which are difficult to avoid. In addition, there is a large amount of existing practice which does not provide these interfaces in application thread packages and so leads you to believe that they are not vital in an application interface. Therefore, it was thought that these functions should not be made available at the application interface and to let experience tell if they are really necessary."

The above rationale in the POSIX threads standard, for excluding these interfaces, is supported by our experience with the usage of thr_suspend(3t)/thr_resume(3t), which are present in the Solaris threads interface. Note that on Solaris, thr_suspend() is implemented to be safe with respect to internal libthread locks. Even so, most programmers run into problems when they try to program with thr_suspend(3t)/thr_resume(3t) because of deadlocks in the application. So, even if we do the hard work of fixing this bug in the JVM, it would probably not be worth the trouble, since the deadlock would just be moved one level higher: to the application level.

It is better, instead, to do the following:
1. Deprecate these interfaces: suspend() and resume() in the Thread class.
2. Introduce an alternative interface to suspend/resume Java threads, available only to lower-level systems, such as Java debuggers. Since these would be used by a lower-level system (a debugger), the deadlock issue for which this bug is filed should not be relevant. For example, libthread_db in Solaris uses /proc to suspend LWPs in a Solaris threads program to stop the world synchronously when a thread hits a breakpoint. Since it is at a lower level, it does not care if a thread is stopped while holding an internal libthread lock. Similarly, a Java debugger should not care if a Java thread is stopped holding a JVM-internal monitor, and so can use a suspend/resume interface effectively.
devang.shah@Eng 1997-03-25

The decision was made to deprecate Thread.suspend and Thread.resume in 1.2, for the reasons mentioned in this bug report. We will not try to make thread suspension safe in the meanwhile, as that would be difficult and platform-specific without being too useful. If equivalent functionality is needed by the debugger, that will be supplied in other ways.
timothy.lindholm@Eng 1997-07-02
02-07-1997
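
The deferred-suspension idea from the Evaluation (disable suspension on monitor entry, check for pending requests on exit) can be sketched as a small gate object. This is only an illustration under stated assumptions, not how any JVM implements it: the class SuspendGate and all of its method names are hypothetical, it serves a single target thread, and it deliberately leaves out the hard suspend for targets in stop-safe regions and the suspender/target race that the Evaluation mentions.

```java
/** Hypothetical gate for one target thread; a sketch, not JVM code. */
final class SuspendGate {
    private int unsafeDepth = 0;            // nesting depth of stop-unsafe regions
    private boolean suspendPending = false; // set by the suspender
    private boolean parked = false;         // true while the target is stopped

    /** Target: entering an internal critical section; suspension is deferred. */
    synchronized void enterUnsafe() {
        unsafeDepth++;
    }

    /** Target: leaving a critical section; honor a pending request once fully outside. */
    synchronized void exitUnsafe() {
        unsafeDepth--;
        if (unsafeDepth == 0) {
            parkWhilePending();
        }
    }

    /** Suspender: ask the target to stop at its next stop-safe point. */
    synchronized void requestSuspend() {
        suspendPending = true;
    }

    /** Suspender: let the target continue. */
    synchronized void resume() {
        suspendPending = false;
        notifyAll();
    }

    synchronized boolean isParked() {
        return parked;
    }

    // Called with this object's monitor held.
    private void parkWhilePending() {
        boolean interrupted = false;
        while (suspendPending) {
            parked = true;
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;         // remember it and keep waiting
            }
        }
        parked = false;
        if (interrupted) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A request made while the target is inside enterUnsafe()/exitUnsafe() takes effect only after the outermost exitUnsafe(), so in this sketch the target is never stopped while holding the critical section.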