Bug ID: JDK-4692906 Hotspot JVM's hang if thread suspend/resume executed by non-Java code

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 1.3.1
Priority: P4
Status: Closed
Resolution: Fixed
OS: windows_2000
CPU: x86
Submitted: 2002-05-29
Updated: 2012-10-08
Resolved: 2002-07-15
Other	Other	Other
1.3.1_05 05 Fixed	1.4.0_03 Fixed	1.4.1 Fixed
FULL PRODUCT VERSION :
java version "1.3.1"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1-b24)
Java HotSpot(TM) Client VM (build 1.3.1-internal, mixed mode)

FULL OPERATING SYSTEM VERSION :

Microsoft Windows 2000 [Version 5.00.2195]
Service pack 2

ADDITIONAL OPERATING SYSTEMS :

Likely any win32 NT-derived system.


EXTRA RELEVANT SYSTEM CONFIGURATION :
Problem occurs only on multiple processor configurations.

A DESCRIPTION OF THE PROBLEM :
HotSpot JVMs on win32 platforms mistakenly assume that the
suspend count of a thread will only reach a depth of 1.

If any third-party code (such as a JNI-reachable DLL)
invokes win32 APIs to suspend and resume Java threads,
the JVM falsely interprets the situation as an error
condition and, over time, leaves threads in a hung
state, eventually hanging the entire JVM process.

This situation occurs only as a race condition on
multiple-CPU Windows machines.  It doesn't arise in the
-classic JVM implementation, but that JVM has been
deprecated, so we must have a fix for the HotSpot JVM to
avoid process hangs.

We have traced the problem to the win32 implementation of
the JVM's Thread::resume_thread_impl and
Thread::suspend_thread_impl bodies, in conjunction with the
os::pd_resume_thread and os::pd_suspend_thread counterparts.

We patched a JVM using the diffs below.  The patched JVM
appears to run all our apps without problems, though with
race conditions there is always the chance of false positives.

Here is a description of the changes:

Changes are in Windows-specific code and have to do with
how the JVM handles the return value from the Windows
system calls SuspendThread and ResumeThread.

os::pd_suspend_thread() was changed so that it returns 0 if
the call to SuspendThread was successful and 1 if it was
not. This is the documented behavior of the method and it
is the behavior that the single caller of this method
expects. Prior to this change, the method treated any
non-zero return value from SuspendThread() as an error. But
SuspendThread is documented to return the thread's previous
suspend count on success, which is non-zero whenever the
thread was already suspended; only 0xffffffff indicates
failure.
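To make the difference concrete, here is a minimal, portable sketch of the two interpretations. It does not call the Win32 API: SuspendModel and model_suspend are hypothetical stand-ins that only mimic the documented "return the previous suspend count" contract, and old_pd_suspend / new_pd_suspend are illustrative names, not the HotSpot functions themselves.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Win32 thread; only the suspend count is modeled.
struct SuspendModel {
    std::uint32_t suspend_count = 0;
};

// Value the real SuspendThread uses to signal failure.
const std::uint32_t kSuspendFailed = 0xffffffffu;

// Mimics the documented contract: increment the count, return the previous one.
std::uint32_t model_suspend(SuspendModel* t) {
    if (t == nullptr) return kSuspendFailed;  // modeled failure case
    return t->suspend_count++;
}

// Pre-patch interpretation (asserts compiled out): the raw previous count is
// returned, so a nested suspend looks like an error to the caller.
long old_pd_suspend(SuspendModel* t) {
    return static_cast<long>(model_suspend(t));
}

// Post-patch interpretation: 0 on success, 1 only when the call itself failed.
long new_pd_suspend(SuspendModel* t) {
    return (model_suspend(t) == kSuspendFailed) ? 1 : 0;
}
```

In this model, a thread that some other component has already suspended makes old_pd_suspend return a non-zero value, while new_pd_suspend still reports success.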

os::pd_resume_thread() was changed so that it returns 0 if
the call to ResumeThread was successful and 1 if it was
not. This is the documented behavior of the method and it
is the behavior that the single caller of this method
expects. The change also sets the thread state to RUNNABLE
if the call to ResumeThread was successful. Strictly
speaking, a thread is not runnable if the suspend count is
greater than zero, but for the JVM's purposes, the thread
is runnable. When the other entity (in our case, a database
client DLL) updates the suspend count so that the thread
can run, the JVM will already be in the correct state.
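The resume side can be sketched the same way. This is a portable model under stated assumptions, not the HotSpot source: ResumeModel, model_resume, old_pd_resume and new_pd_resume are hypothetical names, and jvm_runnable stands in for the OSThread state that the real code sets to RUNNABLE.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a thread: OS suspend count plus the JVM's view.
struct ResumeModel {
    std::uint32_t suspend_count = 0;
    bool jvm_runnable = false;  // stands in for osthread->set_state(RUNNABLE)
};

// Value the real ResumeThread uses to signal failure.
const std::uint32_t kResumeFailed = 0xffffffffu;

// Mimics the documented contract: decrement if suspended, return previous count.
std::uint32_t model_resume(ResumeModel* t) {
    if (t == nullptr) return kResumeFailed;  // modeled failure case
    std::uint32_t previous = t->suspend_count;
    if (previous > 0) t->suspend_count--;
    return previous;
}

// Pre-patch interpretation: success only when the previous count was exactly 1.
long old_pd_resume(ResumeModel* t) {
    std::uint32_t ret = model_resume(t);
    if (ret == 1) t->jvm_runnable = true;
    return static_cast<long>(ret - 1);
}

// Post-patch interpretation: any non-failing call is a success for the JVM.
long new_pd_resume(ResumeModel* t) {
    std::uint32_t ret = model_resume(t);
    if (ret == kResumeFailed) return 1;  // error return value
    t->jvm_runnable = true;
    return 0;                            // success return value
}
```

With a suspend count of 2 (the JVM plus one external suspender), the old logic returns 1 and leaves the thread marked as not runnable, while the new logic returns 0 and marks it runnable even though the OS count is still 1.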

*** hotspot1.3.1\src\os\win32\vm\os_win32.cpp.orig Sun May 6 03:04:54 2001
--- hotspot1.3.1\src\os\win32\vm\os_win32.cpp Fri Apr 19 13:19:42 2002
***************
*** 1536,1543 ****
        ret = SuspendThread(handle);
      }
      assert(ret != 0xffffffffUL, "SuspendThread failed"); // should propagate back
!     assert(ret == 0, "Win32 nested suspend");
!     return ret;
  }

  // Resume a thread by one level.  This method assumes that consecutive
--- 1536,1542 ----
        ret = SuspendThread(handle);
      }
      assert(ret != 0xffffffffUL, "SuspendThread failed"); // should propagate back
!     return (ret == 0xffffffffUL);
  }

  // Resume a thread by one level.  This method assumes that consecutive
***************
*** 1554,1564 ****
  long os::pd_resume_thread(Thread* thread) {
    OSThread* osthread = thread->osthread();
    DWORD ret = ResumeThread(osthread->thread_handle());
!   assert(ret != 0xffffffffUL, "ResumeThread failed"); // should propagate back
!   if (ret == 1) {
!     osthread->set_state(RUNNABLE);
    }
!   return ret - 1;
  }


--- 1553,1564 ----
  long os::pd_resume_thread(Thread* thread) {
    OSThread* osthread = thread->osthread();
    DWORD ret = ResumeThread(osthread->thread_handle());
!   if (ret == 0xffffffffUL) {
!     assert(false, "ResumeThread failed");
!     return 1; // error return value
    }
!   osthread->set_state(RUNNABLE);
!   return 0; // success return value
  }




REGRESSION.  Last worked in version 1.3.1

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
There is no simple test case.  My company will be happy to
provide the test framework we used for isolating and fixing
the fault, but it involves use of licensed native code.

Building a test case involves designing a system where the
right pattern of externally applied or JNI-invoked thread
suspend and resume operations is performed on Java threads.

It also requires a bit of luck and a multi-processor
configuration, since it is a race condition generally
triggered during HotSpot safepoint processing.

EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected results: process doesn't hang.
Actual results: process eventually hangs, with all evidence
of the cause obscured, since the event that led to the
all-threads-waiting symptoms has long since passed by the
time of the hang.

ERROR MESSAGES/STACK TRACES THAT OCCUR :
# HotSpot Virtual Machine Error, assertion failure
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# assert(ret == 0, "Win32 nested suspend")
#
# Error happened during: scavenge
#
# Error ID: E:\jdk131src\hotspot1.3.1\src\os\win32\vm\os_win32.cpp, 1539
#
# Problematic Thread: prio=5 tid=0x9cee98 nid=0x102c runnable

or

# HotSpot Virtual Machine Error, assertion failure
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# assert(v_false, "resume thread failed")
#
# Error happened during: scavenge
#
# Error ID: E:\jdk131src\hotspot1.3.1\src\share\vm\runtime\thread.cpp, 503
#
# Problematic Thread: prio=5 tid=0x9cee98 nid=0x1250 runnable

Windbg C++ and Java stack traces are available on request.  They break
the web-based bug submission if included here.

This bug can be reproduced occasionally.

---------- BEGIN SOURCE ----------
Simple test source code is unavailable.  However, diffs to the JVM that
fix the bug are available, and are included in the description if they
didn't cause the bug submission to break.  If they are not in the
description, please contact the submitter for the fix diffs.

We're also happy to make available the test bed to reproduce the
problem, but it isn't a simple test case.
---------- END SOURCE ----------

CUSTOMER WORKAROUND :
This won't work for our customers who need high-performance,
scalable deployments, but this bug can be worked around
in four ways:
1) use the -classic JVM
2) use a single-processor machine
3) bind the Java process affinity to one CPU on Windows
4) use a non-Windows platform.
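For workaround 3, one possible approach is to set the affinity from native code running inside the Java process (for example, from a JNI-loaded DLL). The sketch below is an assumption about how that might look, not something the submitter described: single_cpu_mask and pin_current_process_to_cpu are hypothetical helper names, and only the guarded SetProcessAffinityMask and GetCurrentProcess calls are real Win32 APIs. On non-Windows builds the pinning helper does nothing and returns false.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: an affinity mask with exactly one logical CPU selected.
std::uint64_t single_cpu_mask(unsigned cpu_index) {
    assert(cpu_index < 64);
    return std::uint64_t{1} << cpu_index;
}

// Hypothetical helper: restrict the calling process to a single CPU.
// Returns true on success; returns false on failure or on non-Windows builds.
bool pin_current_process_to_cpu(unsigned cpu_index) {
#ifdef _WIN32
    DWORD_PTR mask = static_cast<DWORD_PTR>(single_cpu_mask(cpu_index));
    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
#else
    (void)cpu_index;  // nothing to do in this sketch outside Windows
    return false;
#endif
}
```

Pinning to one CPU removes the multi-processor race at the cost of the scalability noted above.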
comments: (company - eXcelon Corporation , email - ###@###.###)
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: 1.3.1_05 1.4.0_03 hopper-rc
FIXED IN: 1.3.1_05 1.4.0_03 hopper-rc
INTEGRATED IN: 1.3.1_05 1.4.0_03 hopper-rc
14-06-2004
EVALUATION
commit to Hopper ###@###.### 2002-06-24
Removed mantis from integrated in release per p-team ###@###.### 2002-10-18
24-06-2002