###@###.### 2004-06-10
Bigapps crash on amd64 RHEL AS3.0 SP2 with the 64-bit JVM.
Test machine: jtg-amd5.sfbay
It's a Colfax tower (which is supposed to be a stable system), 2 CPUs x 1992 MHz, 2 GB memory.
The OS is RHEL AS3.0 SP2.
With build b54, tomcat crashed after 62 hours.
vtest crashed after anywhere from 6 minutes to a couple of hours.
vmark crashed after 4 hours.
With b55, vtest crashed after 8 hours.
All the crashes were observed when the CMS collector was used. However, the
stack trace shows no preceding GC messages, so the failures are probably not
specific to CMS.
All crashes are a SEGV at pc=0x0000000000000000, and only the primordial thread
is left, which does not leave much to look at. I added the trace here anyway.
#0 0x0000002a9592cbc2 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1 0x0000002a9592ca5d in sleep () from /lib64/tls/libc.so.6
#2 0x0000002a95fc094e in os::message_box ()
from /usr/j2se/jre/lib/amd64/server/libjvm.so
#3 0x0000002a9609c956 in VMError::show_message_box ()
from /usr/j2se/jre/lib/amd64/server/libjvm.so
#4 0x0000002a9609c626 in VMError::report_and_die ()
from /usr/j2se/jre/lib/amd64/server/libjvm.so
#5 0x0000002a95fc1aac in JVM_handle_linux_signal ()
from /usr/j2se/jre/lib/amd64/server/libjvm.so
#6 0x0000002a95fbf9fe in signalHandler ()
from /usr/j2se/jre/lib/amd64/server/libjvm.so
#7 <signal handler called>
#8 0x0000000000000000 in ?? ()
I am running vmark using parallel collector on jtg-amd5.sfbay right now.
I may reinstall RHEL AS3.0 and rerun the test to compare if bigapps behave
differently on RHEL AS3.0 and RHEL AS3.0 SP2 on amd64.
At this time it's not clear whether the crashes are due to faults in the Linux
kernel or to a real bug in the JVM. I am using "hotspot/runtime_systems" as a template.
I temporarily set the priority to P1; feel free to downgrade the bug if
you believe it is more likely a linux-isa issue.
Only one amd64 RHEL AS3.0 SP2 machine is available for now, so it may take a while for me to do more experiments.
Experiments to do:
-- How does the 32-bit JVM behave on the same system? (I recently installed SP2 on jtg-amd5.sfbay. Before the upgrade, the 32-bit JVM was stable on the system.
I have not had a chance to run the 32-bit JVM on the system since the upgrade.)
-- Reinstall RHEL AS3.0 without SP2, and rerun the 64-bit JVM.
###@###.### 2004-06-14
Update:
-- The crash is reproducible using the Parallel collector. This confirms that the failure is not specific to the CMS collector.
-- It took longer to reproduce the failure with the b55 64-bit VM. I tried b51, b54 and b55. With b51 and b54, the crash was easily reproduced within a couple of
hours. With b55, the crash was reproducible within 2 days.
-- The crash showed up in tomcat, volanomark and volanotest runs. It's easiest to
reproduce the problem using volanotest.
###@###.### 2004-06-15
Occasionally a test iteration failed with "memory fault" or "java.lang.StackOverFlow".
All three types of failure (SEGV at 0x0000000000000000, "memory fault" and
"java.lang.StackOverFlow") look like memory corruption problems.
###@###.### 2004-06-15
The crash can be reproduced faster using the attached script.
Because the crash appears to occur at start-up of the process, the script kills the client process as soon as it has started successfully (indicated by the text pattern "Creating users" in the output file "vtest.temp") and moves on to the next test iteration.
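The kill-after-start-up loop described above can be sketched roughly as follows. This is not the attached script, only a minimal illustration of the idea. The function names `wait_for_pattern` and `run_iteration` are hypothetical. The "Creating users" pattern is the one named in this report.

```shell
#!/bin/sh
# Hypothetical helpers, NOT taken from the attached vtest.tar.

# Poll FILE once per second until PATTERN appears or TIMEOUT seconds pass.
# Returns 0 if the pattern was seen, 1 on timeout.
wait_for_pattern() {
    file=$1
    pattern=$2
    timeout=$3
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ -f "$file" ] && grep -q -- "$pattern" "$file"; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Run one iteration: start the given command in the background with its
# output in LOG, wait for the start-up marker, then kill the process.
# Returns 0 if start-up was observed, 1 otherwise (possible crash or hang).
run_iteration() {
    log=$1
    timeout=$2
    shift 2
    : > "$log"                      # truncate first so stale output is never matched
    "$@" >> "$log" 2>&1 &
    pid=$!
    if wait_for_pattern "$log" "Creating users" "$timeout"; then
        rc=0
    else
        rc=1
    fi
    kill "$pid" 2>/dev/null || true  # the process may already be gone
    wait "$pid" 2>/dev/null || true
    return "$rc"
}

# Example use (the client command line here is an assumption):
#   while run_iteration vtest.temp 120 ./client.sh; do
#       cat vtest.temp >> run.vtest.out
#   done
```

Truncating the log in the parent before launching the client avoids matching the marker left over from the previous iteration.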
To use the attached script:
0. "tar xf vtest.tar" will create a directory named "vtest"
1. cd vtest
2. export JAVA_HOME=<your java home>
3. start volanomark server
./run.server &
4. run volanotest client in a loop
./run.client &
5. "run.client" logs the current iteration in the
file "vtest.temp". Logs from previous iterations are saved in
"run.vtest.out".
If the file "vtest.temp" is not up-to-date, the client process has
probably crashed.
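One way to automate the "is vtest.temp up-to-date" check from step 5 is a small staleness test on the file's modification time. The sketch below is an assumption-laden illustration, not part of the attached script. The function name `is_stale` and the 10-minute threshold are hypothetical. It relies on `find -mmin`, which GNU and BSD find support but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the attached vtest.tar.
# Returns 0 (stale) if FILE is missing or was last modified more than
# MAX_MINUTES minutes ago; returns 1 if it was updated recently.
is_stale() {
    file=$1
    max_minutes=$2
    [ -e "$file" ] || return 0
    # find prints the path only when the file is older than the threshold.
    [ -n "$(find "$file" -mmin +"$max_minutes" 2>/dev/null)" ]
}

# Example use (the threshold is an assumption, tune it to the iteration length):
#   if is_stale vtest.temp 10; then
#       echo "vtest.temp has not been updated; the client may have crashed"
#   fi
```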
###@###.### 2004-06-16
See comment section. The crash is reproducible with a small test case within
an hour.
###@###.### 2004-06-17
updates:
1. Bigapps ran well on opteron003.sfbay, a 2 way RHEL AS3.0 FCS. (2.4.21-4.ELsmp #1 SMP)
2. The small test case ThreadTest ran successfully for over 24 hours on opteron003.sfbay. On jtg-amd5.sfbay (SP2) the test case failed within an hour.
3. bigapps test using 32 bits jvm has been running well on jtg-amd5.sfbay for over 24 hours since yesterday.
From 1) and 2), it appears that the bug is in the RHEL3 SP2 kernel.
Hui Huang sent us the bug ID in Red Hat's Bugzilla. (Thanks, Hui!)
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=104688