JDK-5061769 : bigapps crashes on amd64 RHEL AS3.0 SP2
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 5.0
  • Priority: P5
  • Status: Closed
  • Resolution: Duplicate
  • OS: generic
  • CPU: generic
  • Submitted: 2004-06-11
  • Updated: 2005-09-02
  • Resolved: 2005-09-02
Related Reports
Duplicate :  
Description
###@###.### 2004-06-10

Bigapps crashes on amd64 RHEL AS3.0 SP2 with the 64-bit JVM.

Test machine: jtg-amd5.sfbay
It's a Colfax tower (which is supposed to be a stable system), 2 CPUs x 1992 MHz, 2 GB memory.
The OS is RHEL AS3.0 SP2.

With build b54, tomcat crashed after 62 hours.
vtest crashed after anywhere from 6 minutes to a couple of hours.
vmark crashed after 4 hours.

With b55, vtest crashed after 8 hours.

All the crashes were observed when the CMS collector was used. The output
shows no GC messages preceding the crash, so the failures are not likely
specific to CMS.

All crashes are SEGVs at pc=0x0000000000000000, and only the primordial thread is left,
which does not leave much to look at. I added the trace here anyway.
#0  0x0000002a9592cbc2 in __nanosleep_nocancel () from  /lib64/tls/libc.so.6
#1  0x0000002a9592ca5d in sleep () from /lib64/tls/libc.so.6
#2  0x0000002a95fc094e in os::message_box ()
    from /usr/j2se/jre/lib/amd64/server/libjvm.so
#3  0x0000002a9609c956 in VMError::show_message_box ()
    from /usr/j2se/jre/lib/amd64/server/libjvm.so
#4  0x0000002a9609c626 in VMError::report_and_die ()
    from /usr/j2se/jre/lib/amd64/server/libjvm.so
#5  0x0000002a95fc1aac in JVM_handle_linux_signal ()
    from /usr/j2se/jre/lib/amd64/server/libjvm.so
#6  0x0000002a95fbf9fe in signalHandler ()
    from /usr/j2se/jre/lib/amd64/server/libjvm.so
#7  <signal handler called>
#8  0x0000000000000000 in ?? ()


I am running vmark using the parallel collector on jtg-amd5.sfbay right now.
I may reinstall RHEL AS3.0 and rerun the test to see whether bigapps behaves
differently on RHEL AS3.0 and RHEL AS3.0 SP2 on amd64.

At this time it's not clear whether the crashes are due to faults in the Linux kernel
or to a real bug in the JVM. I am using "hotspot/runtime_systems" as a template.
I temporarily set the priority to P1; feel free to downgrade the bug if
you believe it is more likely a linux-isa issue.

Only one amd64 RHEL AS3.0 SP2 machine is available for now. It may take a while for me to do more experiments.
Experiments to do:
-- How does the 32-bit JVM behave on the same system? (I recently installed SP2 on jtg-amd5.sfbay; before the upgrade, the 32-bit JVM was stable on the system.
I have not had a chance to run the 32-bit JVM on the system since the upgrade.)
-- Reinstall RHEL AS 3.0 without SP2, and rerun the 64-bit JVM.

###@###.### 2004-06-14
Update:
-- The crash is reproducible using the Parallel collector. This confirms that the failure is not specific to the CMS collector.
-- It took longer to reproduce the failure with the b55 64-bit VM. I tried b51, b54 and b55. With b51 and b54, the crash was easily reproduced within a couple of
hours. With b55, the crash was reproducible within 2 days.
-- The crash showed up in tomcat, volanomark and volanotest runs. It's easier to
reproduce the problem using volanotest.



###@###.### 2004-06-15
Occasionally a test iteration failed because of "memory fault" or "java.lang.StackOverflowError".
All three types of failure (SEGV at 0x0000000000000000, "memory fault" and
"java.lang.StackOverflowError") look like memory corruption problems.


###@###.### 2004-06-15

The crash can be reproduced faster using the attached script.
Because the crash appears to occur at process start-up, the script kills the client process once it has started successfully (indicated by the text pattern "Creating users" in the output file "vtest.temp") and moves on to the next test iteration.

To use the attached script:
0. "tar xf vtest.tar" will create a directory named "vtest"
1. cd vtest
2. export JAVA_HOME=<your java home>
3. start volanomark server
   ./run.server &
4. run volanotest client in a loop
   ./run.client &
5. "run.client" logs the current iteration in the
   file "vtest.temp". Logs from previous iterations are saved in
   "run.vtest.out".
   If the file "vtest.temp" is not up to date, the client process
   has probably crashed.

###@###.### 2004-06-16
See comment section. The crash is reproducible with a small test case within
an hour.


###@###.### 2004-06-17
updates:
1.  Bigapps ran well on opteron003.sfbay, a 2-way RHEL AS3.0 FCS system. (2.4.21-4.ELsmp #1 SMP)
2.  The small test case ThreadTest ran successfully for over 24 hours on opteron003.sfbay. On jtg-amd5.sfbay (SP2) the test case failed within an hour.
3.  The bigapps test using the 32-bit JVM has been running well on jtg-amd5.sfbay for over 24 hours since yesterday.

From 1) and 2) it appears that the bug is in the RHEL3 SP2 kernel.
Hui Huang sent us the bug ID in Red Hat's Bugzilla. (Thanks, Hui!)
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=104688

Comments
EVALUATION There is no reliable way to detect the primordial thread's stack location on Linux. Instead, I'll fix the problem by changing the Java launcher to create the JVM from a non-primordial thread. It's tracked as 6316197.
02-09-2005
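
The approach described in the evaluation above (creating the JVM from a non-primordial thread) can be sketched roughly as follows. This is not the launcher's actual code: `run_in_new_thread`, `example_entry`, and the 1 MB stack size used in the usage note are hypothetical names and values chosen for illustration. The point is only that a thread created with pthread_create runs on a stack whose size the process requested itself, rather than on the stack the kernel set up for the primordial thread.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical signature for the work to run off the primordial thread
 * (in a real launcher this would be the code that creates the JVM). */
typedef int (*entry_fn)(void *arg);

struct thread_args {
    entry_fn fn;
    void *arg;
    int result;
};

/* Adapter from the pthread start-routine signature to entry_fn. */
static void *trampoline(void *p) {
    struct thread_args *ta = p;
    ta->result = ta->fn(ta->arg);
    return NULL;
}

/* Runs fn(arg) on a newly created thread and waits for it to finish.
 * stack_size == 0 keeps the platform default stack size.
 * Returns fn's result, or -1 if the thread could not be created. */
int run_in_new_thread(entry_fn fn, void *arg, size_t stack_size) {
    pthread_attr_t attr;
    pthread_t tid;
    struct thread_args ta = { fn, arg, -1 };

    if (pthread_attr_init(&attr) != 0) {
        return -1;
    }
    if (stack_size > 0) {
        /* If the size is rejected, the default stack size stays in effect. */
        (void)pthread_attr_setstacksize(&attr, stack_size);
    }
    int rc = pthread_create(&tid, &attr, trampoline, &ta);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        return -1;
    }
    pthread_join(tid, NULL);
    return ta.result;
}

/* Hypothetical entry point used only to exercise the helper. */
int example_entry(void *arg) {
    return *(int *)arg + 1;
}
```

For example, with `int x = 41;`, the call `run_in_new_thread(example_entry, &x, 1 << 20)` returns 42 after the worker thread has finished.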

EVALUATION I would assume we could remove the keyword red from this bug.
11-08-2005

CONVERTED DATA BugTraq+ Release Management Values COMMIT TO FIX: dragon
13-09-2004

EVALUATION This is a bug in the RedHat 2.4.21-15 kernel. The bug does not appear to be in 2.4.21-9. The problem is that, at random, our Java executable is started with an incorrectly set stack pointer. The SP is actually in our yellow or red zones, and we fault or stack overflow on program startup. According to Hui Huang, this bug was in RHEL3-beta on x86. Please see: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=104688 I guess somehow the buggy code got merged into the amd64 tree. ###@###.### 2004-06-18

We should detect an invalid stack and give a proper warning when this situation occurs. Lowering the priority and deferring this fix until the next update release. ###@###.### 2004-06-21

Should this be done in an update release? Please let us know. I've updated it for mustang for now. ###@###.### 2005-1-28 13:40:19 GMT
21-06-2004
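
A check along the lines suggested above (detecting a stack pointer that starts inside the guard zones) might look like the following sketch. It assumes the stack grows downward and that the yellow and red zones sit together at the low end of the stack. `classify_sp`, `approx_current_sp`, and the enum names are hypothetical. Obtaining reliable stack bounds for the primordial thread is the hard part, and this sketch does not attempt it; the bounds are simply passed in.

```c
#include <stddef.h>
#include <stdint.h>

enum sp_status {
    SP_OK,             /* inside the usable part of the stack */
    SP_IN_GUARD_ZONE,  /* inside the (assumed) yellow/red zones */
    SP_OUTSIDE_STACK   /* not within the stack mapping at all */
};

/* Classifies a stack pointer against caller-supplied stack bounds.
 * stack_low  : lowest address of the stack region
 * stack_size : size of the stack region in bytes
 * guard_size : combined size of the guard zones at the low end
 * Assumes stack_low + stack_size does not overflow. */
enum sp_status classify_sp(uintptr_t sp, uintptr_t stack_low,
                           size_t stack_size, size_t guard_size) {
    uintptr_t stack_high = stack_low + stack_size;

    /* sp == stack_high is allowed: that is an empty downward-growing stack. */
    if (sp < stack_low || sp > stack_high) {
        return SP_OUTSIDE_STACK;
    }
    if (sp < stack_low + guard_size) {
        return SP_IN_GUARD_ZONE;
    }
    return SP_OK;
}

/* Rough approximation of the current stack pointer: the address of a
 * local variable in this frame. Good enough for a startup sanity check. */
uintptr_t approx_current_sp(void) {
    volatile char marker = 0;
    return (uintptr_t)&marker;
}
```

At startup, a caller that somehow knows the stack bounds could evaluate `classify_sp(approx_current_sp(), low, size, guard)` and print a warning on anything other than `SP_OK` instead of faulting later.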