Bug ID: JDK-4650839 RAS: Vtest hang after 38 hrs 19 mins in hopper

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 1.4.1

Priority: P1
Status: Closed
Resolution: Won't Fix
OS: linux
CPU: generic,x86

Submitted: 2002-03-11
Updated: 2002-03-28
Resolved: 2002-03-27

RAS: Vtest hang after 38 hrs 19 mins in hopper_04 c1 on linux redhat 7.1

JDK version
=============
java version "1.4.1-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-beta-b04)
Java HotSpot(TM) Client VM (build 1.4.1-beta-b04, mixed mode)

Platform
=============
Linux 2.4.2-2smp #1 SMP i686 unknown

Error message
=============
Vtest hang after 38 hrs 19 mins in hopper_04 c1 on jtg-linux13

Notes
=============
Please check http://jtgb4u4c.sfbay/bigapps for the test results tables.

How to reproduce bug:
Telnet to one of the Linux test machines listed on the web page, using root/[host name] as id/passwd.
Go to /bt to execute the script and /bs to get the running results.
for example:
  telnet to jtg-i114 with root/jtg-i114 as id/passwd
  execute /bs/runatg.ksh -server
  cd to /bt/atgxxx.xxx-server to get the results
###@###.### 2002-03-11

EVALUATION

One thread is in _thread_new; however, the creating thread has already called os::start_thread() on it. It looks like the start-thread event got lost.
###@###.### 2002-03-12

There is another type of hang in vmark. The VM thread couldn't grab the Threads_lock in SafepointSynchronize::begin(), although the _owner of Threads_lock is 0x0. Looking into the pthread_mutex_lock frames, it appears the underlying _mutex of Threads_lock is indeed locked by some thread. This will probably need a different bug id.

To reproduce the hang:
  > java COM.volano.Main
  > repeat 1000 java COM.volano.Mark -count 1
###@###.### 2002-03-14

I am tracking the second type of hang with bug id 4654490.
###@###.### 2002-03-18

Both this hang and 4654490 are caused by a bug in the 2.4 SMP kernel. It appears that the 2.4 SMP kernel may sometimes hand out duplicate PIDs if two processes are creating threads at the same time. Indeed, I can reproduce the duplicate-PID problem with a C testcase that uses only "fork".

Note that each thread on Linux is essentially a process and must have a unique PID. If two threads are created with the same PID, signals that are meant to start a newly created thread, or to wake up a thread blocked in pthread_mutex_lock() or pthread_cond_wait(), may get delivered to the wrong thread (LinuxThreads uses "kill(PID, )" to implement pthread_kill() and to restart a sleeping thread). If that happens, we may end up with a hanging VM because some of its threads never wake up.

In the Java testcase, when VMark hangs, I can see duplicate PIDs with this command:

  [root@jtg-linux1 /root]# ps -A|sort|uniq -D
  26829 ?        00:00:00 java
  26829 ?        00:00:00 java

It looks like this kernel race has been fixed in kernel 2.4.18. The changelog of 2.4.18 contains:

  - Fix SMP race on PID allocation (Erik A. Hendriks)

This hang and 4654490 are not reproducible when vmark is run on kernel 2.4.18. Note that kernel 2.4.18 is included in RedHat 7.3 beta. If you want to change to RedHat 7.3, please also see bug 4654443.
###@###.### 2002-03-26
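
The duplicate-PID check done above with "ps -A|sort|uniq -D" can also be sketched in C. This is only an illustration and is not the C testcase mentioned in the evaluation: the function name find_duplicate_pids and its parameters are hypothetical, and it assumes the caller has already collected the PIDs of interest (for example from ps output) into an array. It sorts the array and reports every PID value that appears more than once.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* qsort comparator for pid_t values (avoids overflow from subtraction). */
static int compare_pids(const void *a, const void *b)
{
    pid_t x = *(const pid_t *)a;
    pid_t y = *(const pid_t *)b;
    return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sorts pids[0..n) in place and stores each PID value
 * that occurs more than once into dups (at most dups_cap entries are
 * written). Returns the number of distinct duplicated PID values found,
 * which may be larger than dups_cap.
 */
size_t find_duplicate_pids(pid_t *pids, size_t n, pid_t *dups, size_t dups_cap)
{
    size_t found = 0;
    size_t i;

    if (n == 0)
        return 0;
    assert(pids != NULL);

    qsort(pids, n, sizeof(pids[0]), compare_pids);

    for (i = 1; i < n; i++) {
        /* Count a run of equal values once, at its second element. */
        if (pids[i] == pids[i - 1] && (i == 1 || pids[i - 2] != pids[i])) {
            if (found < dups_cap)
                dups[found] = pids[i];
            found++;
        }
    }
    return found;
}
```

On a healthy kernel the function should return 0 for any snapshot of live PIDs; a nonzero result corresponds to the repeated lines that uniq -D prints in the ps output above.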

26-03-2002