Bug ID: JDK-4650839 RAS: Vtest hang after 38 hrs 19 mins in hopper_04 c1 on linux redhat 7.1

Details
Type: Bug
Submit Date: 2002-03-11
Status: Closed
Updated Date: 2002-03-28
Project Name: JDK
Resolved Date: 2002-03-27
Component: hotspot
OS: linux
Sub-Component: runtime
CPU: x86, generic
Priority: P1
Resolution: Won't Fix
Affected Versions: 1.4.1
Fixed Versions:

Related Reports
Duplicate:

Sub Tasks

Description
RAS: Vtest hang after 38 hrs 19 mins in hopper_04 c1 on linux redhat 7.1

JDK version
=============
java version "1.4.1-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-beta-b04)
Java HotSpot(TM) Client VM (build 1.4.1-beta-b04, mixed mode)

Platform
=============
Linux 2.4.2-2smp #1 SMP i686 unknown

Error message
=============
Vtest hang after 38 hrs 19 mins in hopper_04 c1 on jtg-linux13

Notes
=============
Please check http://jtgb4u4c.sfbay/bigapps for the test results tables.

How to reproduce bug:
telnet to one of the Linux test hosts listed on the results web page, using root/[host name] as id/passwd
go to /bs to execute the script and to /bt to get the run results
for example:
telnet to jtg-i114 with root/jtg-i114 as id/passwd
execute /bs/runatg.ksh -server
cd to /bt/atgxxx.xxx-server to get the results
###@###.### 2002-03-11

Comments
EVALUATION

One thread is in _thread_new; however, the creating thread has already called
os::start_thread() on it. It looks like the start-thread event got lost.

###@###.### 2002-03-12

There is another type of hang in vmark. The VM thread couldn't grab the
Threads_lock in SafepointSynchronize::begin(), although the _owner
of Threads_lock is 0x0. Looking at the pthread_mutex_lock frames,
it appears that the underlying _mutex of Threads_lock is indeed locked
by some thread. This will probably need a different bug ID.
To reproduce the hang:

> java COM.volano.Main
> repeat 1000 java COM.volano.Mark -count 1
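
A hang in a loop like the one above is easier to spot if each run has a time limit. The following is a minimal Python sketch of such a driver, not part of the original test setup; the function name first_hung_run and the timeout values are assumptions made for illustration.

```python
import subprocess
import sys
from typing import List, Optional


def first_hung_run(cmd: List[str], count: int, timeout_s: float) -> Optional[int]:
    """Run cmd up to count times.

    Return the index of the first run that exceeds timeout_s seconds,
    or None if every run finishes in time.
    """
    for i in range(count):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the time limit is reached.
            subprocess.run(cmd, timeout=timeout_s,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            return i
    return None


if __name__ == "__main__":
    # Stand-in command so the sketch runs without the benchmark installed.
    print(first_hung_run([sys.executable, "-c", "pass"], 3, 30.0))
```

For the client loop above, the call would be first_hung_run(["java", "COM.volano.Mark", "-count", "1"], 1000, timeout_s) with a limit well above a normal run time. Note that the timed-out process is killed, so this only detects the hang; it does not keep the hung VM around for inspection.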

###@###.### 2002-03-14

I am tracking the second type of hang under bug ID 4654490.

###@###.### 2002-03-18

Both this hang and 4654490 are caused by a bug in the 2.4 SMP kernel. It appears
that the 2.4 SMP kernel may sometimes hand out duplicate PIDs if two processes are
creating threads at the same time. Indeed, I can reproduce the duplicate-PID
problem with a C testcase that just uses "fork".
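
The C testcase itself is not attached to this report. As a rough illustration of the kind of check involved, here is a hypothetical Python sketch (not the original testcase): it forks several children, keeps them all alive, and verifies that no PID was handed out twice. The names fork_and_collect and repeated_pids are made up for this example, and the sketch does not recreate the two-process race described above.

```python
import os
from collections import Counter
from typing import List


def fork_and_collect(n: int) -> List[int]:
    """Fork n children that stay alive until released; return their PIDs."""
    read_fd, write_fd = os.pipe()
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            # Child: block until every write end of the pipe is closed.
            os.close(write_fd)
            os.read(read_fd, 1)
            os._exit(0)
        pids.append(pid)
    # All children exist at this point, so a correct kernel must have
    # given each of them a distinct PID.
    os.close(read_fd)
    os.close(write_fd)  # EOF on the pipe releases the children
    for pid in pids:
        os.waitpid(pid, 0)
    return pids


def repeated_pids(pids: List[int]) -> List[int]:
    """Return the PIDs that occur more than once in pids."""
    counts = Counter(pids)
    return sorted(pid for pid, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    # An empty list means no duplicate PID was observed.
    print(repeated_pids(fork_and_collect(20)))
```

Running two copies of such a program at the same time on an SMP machine would be closer to the condition described above.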

Note that each thread on Linux is essentially a process and must have a
unique PID. If two threads are created with the same PID, signals that 
are meant to start a newly created thread or to wake up a thread blocked in 
pthread_mutex_lock() or pthread_cond_wait() may get delivered to the
wrong thread (LinuxThreads uses "kill(PID, )" to implement pthread_kill() 
and to restart a sleeping thread). If that happens, we may end up with a 
hanging VM because some of its threads never wake up.
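
The effect can be shown with a toy model. This is only an illustration of the reasoning above, not LinuxThreads or kernel code; in particular, the rule that a PID lookup resolves to the first task registered under that PID is an assumption made for the sketch, and wake_by_pid is a made-up name.

```python
from typing import Dict, List, Tuple


def wake_by_pid(threads: List[Tuple[str, int]], targets: List[str]) -> List[str]:
    """Return the names of threads still asleep after signalling by PID.

    threads: (name, pid) for every sleeping thread.
    targets: names of the threads the sender intends to restart.
    """
    pid_of = dict(threads)
    # Toy assumption: a PID resolves to the first task that holds it.
    first_owner: Dict[int, str] = {}
    for name, pid in threads:
        first_owner.setdefault(pid, name)

    asleep = {name for name, _ in threads}
    for name in targets:
        # The sender only knows the PID, so the signal goes to whichever
        # task the PID resolves to, not necessarily the intended one.
        receiver = first_owner[pid_of[name]]
        asleep.discard(receiver)
    return sorted(asleep)


if __name__ == "__main__":
    # Threads A and B were given the same PID; every thread gets a restart.
    print(wake_by_pid([("A", 100), ("B", 100), ("C", 101)], ["A", "B", "C"]))
    # prints ['B']
```

With unique PIDs the function returns an empty list; with a shared PID, one of the two threads never receives its restart.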

In the Java testcase, when VMark hangs, I can see duplicate PIDs with 
this command:

[root@jtg-linux1 /root]# ps -A|sort|uniq -D
26829 ?        00:00:00 java
26829 ?        00:00:00 java
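
Note that "uniq -D" only reports lines that are identical in full, so two entries with the same PID but a different TIME or CMD column would be missed. A check keyed on the PID column alone could look like the following Python sketch; duplicate_pids is a name chosen for this example, and the sample text is modelled on the output above.

```python
from collections import Counter
from typing import List


def duplicate_pids(ps_output: str) -> List[int]:
    """Return PIDs that appear on more than one line of `ps -A` style output."""
    counts: Counter = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "PID TTY TIME CMD" header.
        if not fields or not fields[0].isdigit():
            continue
        counts[int(fields[0])] += 1
    return sorted(pid for pid, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    sample = """\
  PID TTY          TIME CMD
26828 ?        00:00:00 java
26829 ?        00:00:00 java
26829 ?        00:00:00 java
"""
    print(duplicate_pids(sample))  # prints [26829]
```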

It looks like this kernel race has been fixed in kernel 2.4.18. The changelog
of 2.4.18 contains:

   - Fix SMP race on PID allocation                (Erik A. Hendriks)

This hang and 4654490 are not reproducible when vmark is run on kernel 2.4.18.

Note that kernel 2.4.18 is included in Red Hat 7.3 beta. If you want to move
to Red Hat 7.3, please also see bug 4654443.
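
To check whether a test machine runs a kernel at or above 2.4.18, the release string from "uname -r" can be compared numerically. The Python sketch below shows one way to do that; the function names are made up for this example. Treat the result as a heuristic, since a vendor kernel with a lower version number may carry a backported fix.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor.patch' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def includes_pid_race_fix(release: str) -> bool:
    """True if the release is at or above 2.4.18, whose changelog lists the fix."""
    return kernel_version(release) >= (2, 4, 18)


if __name__ == "__main__":
    # Release string taken from the Platform section of this report.
    print(includes_pid_race_fix("2.4.2-2smp"))  # prints False
    print(includes_pid_race_fix("2.4.18"))      # prints True
```

Comparing tuples of integers avoids the string-comparison trap where "2.4.2" would sort after "2.4.18".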
 
###@###.### 2002-03-26
                                     


