JDK-4629175 : JVM crash.... Error : 11
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 1.4.0,1.4.1,1.4.2
  • Priority: P1
  • Status: Resolved
  • Resolution: Fixed
  • OS:
    linux,linux_2.4,linux_redhat_7.1,solaris_8,solaris_9
  • CPU: generic,x86,sparc
  • Submitted: 2002-01-28
  • Updated: 2009-11-16
  • Resolved: 2002-06-20
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

Other
1.4.1 rc: Fixed
Related Reports
Duplicate :  
Duplicate :  
Duplicate :  
Relates :  
Relates :  
Description
I run NetBeans IDE 3.3.1 RC3 (Build 200201280331) on
Java VM: Java HotSpot(TM) Client VM (1.4.0-rc-b91 mixed mode)
using my RH7.1 linux 2.4.10 SMP (2CPU).
--------------------------------------------------------------

Sometimes the JVM crashes without any apparent reason. It happens more often than is healthy (which is why I decided to file a bug), it happened with previous NB 3.3.1 builds, and I think it also happened with jdk1.4.0 b90 (/89).

It usually left output on the screen ending with this:

#
# HotSpot Virtual Machine Error : 11
# Error ID : 4F530E43505002D3
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Client VM (1.4.0-rc-b91 mixed mode)

plus a core dump (more than 200MB, not attached) and an hs error log (attached)

VTest failed with the -server flag after 26 hours 54 minutes with hopper b06.
The stack trace shows it's a crash in GC.
  ---- called from signal handler with signal 10 (SIGBUS) ------
  [8] MarkSweep::follow_root(0xfe614fa8, 0xfe614fa8, 0xff2ba008, 0xff241a54, 0x310ec0, 0x0), at 0xfe0e0e30
  [9] Universe::oops_do(0xfe601fa8, 0x0, 0xee270000, 0x0, 0x1, 0xffbeeb00), at 0xfe22523c
  [10] GenCollectedHeap::process_strong_roots(0x8c3a8, 0x1, 0x0, 0x1, 0x2, 0xfe601fa8), at 0xfe2246e4
  [11] MarkSweep::mark_sweep_phase1(0x1, 0xfa38183c, 0x0, 0x4e61d8, 0xfe49e330, 0xfe4c344c), at 0xfe265ca8
  [12] MarkSweep::invoke_at_safepoint(0x5c00, 0x5f48, 0x4c00, 0x5400, 0x54c8, 0x4f88), at 0xfe49e438
  [13] OneContigSpaceCardGeneration::collect(0x8c5e8, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfe26a898
  [14] GenCollectedHeap::do_collection(0x0, 0x1, 0x0, 0xfe62a268, 0xfe5aa000, 0x1), at 0xfe22ebac
  [15] TwoGenerationCollectorPolicy::satisfy_failed_allocation(0x8c3a8, 0x4, 0x0, 0x0, 0xe6b815e0, 0xfa381ad0), at 0xfe234930
  [16] VM_GenCollectForAllocation::doit(0xe6b815c0, 0x5000, 0x381bbc, 0xfe605638, 0xfe5aa000, 0x0), at 0xfe234b0c
  [17] VM_Operation::evaluate(0xe6b815c0, 0x0, 0x381690, 0xfe628e08, 0xfe6202f0, 0x0), at 0xfe2284c0
  [18] VMThread::evaluate_operation(0xd5790, 0xe6b815c0, 0x0, 0x28de8, 0xfe2c2138, 0x0), at 0xfe2289e4
  [19] VMThread::loop(0xfe61bc50, 0xfe605870, 0xfe60586c, 0x0, 0x0, 0x0), at 0xfe2c21a4
  [20] VMThread::run(0xd5790, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfe2c11dc
  [21] _start(0xd5790, 0xff37f690, 0x1, 0x1, 0xff37e000, 0x0), at 0xfe243684

To reproduce the bug:
Execute from command line
1. telnet to ultraowl
2. export JAVA_HOME=<your jdk>
    export JAVA_ARGS="-server"
3. cp -r /net/mooncake/export/bigapps/bigapps_commandline/vtest /tmp/vtest
4. cd /tmp/vtest
5. start the server: run.server
6. run the client in an endless loop:
while true; do
run.vtest.client
done

Alternatively, if you are familiar with the bigapps scripts, you can execute the test script:
1. telnet to ultraowl
2. export JAVA_HOME=<jdk>
3. /bs/runvtest.ksh -server

###@###.### 2002-03-22

I also got the now-familiar GC-looking crash in swingmark for 64-bit in the Apr 9th main/baseline with runThese.

###@###.### 2002-04-09

Comments
CONVERTED DATA
BugTraq+ Release Management Values:
COMMIT TO FIX: hopper-rc
FIXED IN: hopper-rc
INTEGRATED IN: hopper-rc
14-06-2004

SUGGESTED FIX This bug was fixed in hopper by cliff's 04-05-02 c2_baseline putback, promoted to main_baseline on 04-12-02 by azeem. It should be integrated in build b10. See runtime_sparc.cpp, revision 1.91. ###@###.### 2002-04-18

In hopper, add:

virtual bool depends_only_on_test() const { return false; }

to the classes CastP2INode and CastP2LNode in connode.hpp. In merlin, add the same line to the class CastIPNode in machnode.hpp. ###@###.### 2002-06-04
04-06-2002
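For illustration, a minimal C++ sketch of the shape of that change; the Node base class and the class skeletons below are stand-ins, not the real connode.hpp contents, and only the depends_only_on_test() override comes from the fix description above:

// Stub standing in for C2's ideal-graph node type (assumption).
struct Node {
  virtual ~Node() {}
  // Returning true tells the optimizer the node's value depends only on a
  // dominating test, so the node may be scheduled freely past safepoints.
  virtual bool depends_only_on_test() const { return true; }
};

// The fix: a pointer-to-integer cast's value depends on the actual pointer
// bits, which a moving GC can change, so it must not float past safepoints.
struct CastP2INode : Node {
  virtual bool depends_only_on_test() const { return false; }
};

struct CastP2LNode : Node {
  virtual bool depends_only_on_test() const { return false; }
};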

EVALUATION I wrote a special verifier that checks for oops with NULL classes. I narrowed this down to having an oop in the old generation that points into the middle of an (another?) object in the young generation. Then I switched to optimized builds (I should have done that sooner, but Hui said it was timing dependent) and turned on just the "barrier" verifier from -XX:+Verify{Before,After}GC. Then I usually die with:

[Full GC {barrier} 3571K->1757K(4852K) {barrier} , 0.3532380 secs]
[GC {klasses} {barrier
VerifyCleanCardClosure::do_oop(0x44d977b0)
p in old points to *p in young
p: 0x44d977b0 boundary: 0x44c00000 *p: 0x444f0b98
#
# HotSpot Virtual Machine Error, Internal Error
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (20020315.1335-4629175-compiler2-optimized-release mixed mode)
#
# Fatal: pointer on clean card crosses boundary
#
# Error happened during: generation collection for allocation

That is, we survive a full collection (which cleans cards that should no longer be dirty), and then at the next GC we fail the barrier verify before. (The "{klasses}" is my klass verifier, and the detailed output from VerifyCleanCardClosure::do_oop is some extra output I added to that method to describe the failure in more detail.)

This feels like it might be an oop store and a card mark getting scheduled on opposite sides of a safepoint. Then if we collect at that safepoint and nothing else has marked the card, we won't scan it, adjust it, etc., and it will be left pointing into the middle of the young generation, so the next time we notice it, it probably won't be pointing at a valid object. That's one theory, anyway.

I've been trying to track this down by binary searching with -XX:CIStop=. This is volano, and it only fails on multiprocessors, so it's not as deterministic as I'd like. I've put a lot of logs in /net/jano.SFBay/export/disk20/GammaBase/Bugs/4629175. In particular, there's clean-server-optimized-with-checks.20020315.0940.text, which is a clean run with all the command lines, etc. That shows that the last compilation is #162. I binary searched with -XX:CIStop=, but since it's not always reproducible (witness the clean run above) and not deterministic, there's some jitter in the numbers, even though I ran with -Xbatch. When I had sort of narrowed it down, I ran 5 runs each with -XX:CIStop= 64, 67, 68, 69, 70, 75, 79, 84. The results are in the log files, e.g., 68.1, 68.2, 68.3, 68.4, 68.5. Here's the summary grep:

grep '^Fatal: pointer on clean card crosses boundary' 64* 67* 68* 69* 70* 75* 79* 84* 94*
68.2:Fatal: pointer on clean card crosses boundary
69.1:Fatal: pointer on clean card crosses boundary
70.2:Fatal: pointer on clean card crosses boundary
70.3:Fatal: pointer on clean card crosses boundary
70.5:Fatal: pointer on clean card crosses boundary
75.1:Fatal: pointer on clean card crosses boundary
75.3:Fatal: pointer on clean card crosses boundary
79.2:Fatal: pointer on clean card crosses boundary
79.3:Fatal: pointer on clean card crosses boundary
79.4:Fatal: pointer on clean card crosses boundary
84.2:Fatal: pointer on clean card crosses boundary
84.5:Fatal: pointer on clean card crosses boundary

So we had no failures in 5 runs at 64 or 67, one at 68, one at 69, 3 at 70, 2 at 75, 3 at 79, and 2 at 84. Not as solid evidence as I'd like. I haven't analyzed the logs to see which methods are compiled in each run to see if I can spot a particular culprit, nor have I looked at which methods got compiled in the runs that worked versus the ones that failed. But it seems like there is something near there that's causing trouble.

P.S. I also note that I got some COMPILATION KILLED BY CONCURRENT CLASS LOADING errors, only in the runs that failed, but not in all the runs that failed. I also got some warnings about suspending 1 compiler thread out of 2 (I could have that one wrong; it didn't get saved in the logs. Why not?). ###@###.### 2002-03-18
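The card-mark theory above is easiest to see in code. Below is a minimal sketch of a generational card-table write barrier; the 512-byte card size, the table layout, and all names are illustrative assumptions, not HotSpot's actual implementation:

#include <cstddef>
#include <cstdint>

// Toy heap and card table; HotSpot's real layout differs (assumption).
const int      kCardShift = 9;                    // 512-byte cards
const size_t   kHeapBytes = 1u << 20;
static char    heap_base[kHeapBytes];
static uint8_t card_table[kHeapBytes >> kCardShift];

// Storing an oop into a field compiles down to two separable steps:
// (1) the reference store itself, and (2) dirtying the card that covers
// the stored-into address so the next GC rescans it.
void oop_store(void** field, void* new_value) {
  *field = new_value;                                      // step 1: the store
  size_t card = ((char*)field - heap_base) >> kCardShift;  // the "shift"
  card_table[card] = 1;                                    // step 2: dirty card
}

If a safepoint falls between the two steps and a collection runs there, the heap holds the new reference but the card is still clean, which is exactly the inconsistency the verifier output above reports.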
Am seeing bad oopMaps on -d64, but only apparent in optimized builds. The corresponding assembly code is curious. Will try to narrow down the offending method. ###@###.### 2002-04-12

The bug is (fairly) easily reproduced by running the URLHammer client against some sort of server such as tomcat. Running the server with -d64 is not good for reproducing the bug, since that only confuses the issue. Previously thought to be seen only with optimized/product builds, reproduction is made easier by using +SafePointALot in a debug/fastdebug build.

The bug occurs when the compiler generates code for a non-call safepoint that requires a g register in the oopMap. This happens fairly infrequently, as most back-branch safepoints are elided and an oop in a g register is likely to occur at higher register pressures. The method observed to generate such conditions was java/io/DataInputStream.readLine, and only with inlining on. When such a safepoint is reached, control of the thread is transferred to the illegal exception handler blob created by C2. This blob is responsible for (among other things) saving registers for the RegisterMap used by oops_code_blob_do(), called by garbage collection. Counting problems due to recent long handling changes caused the g and f registers to be stored at the wrong offset in the RegisterMap. (At least the register restoration was consistent; we were restoring from the same location as we saved.) Many bad things can happen when GC doesn't find the right oop where it is supposed to be...

This bug was fixed in hopper by cliff's 04-05-02 c2_baseline putback, promoted to main_baseline on 04-12-02 by azeem. It should be integrated in build b10. ###@###.### 2002-04-18

----------------------------------------------

I am reopening the bug because the original crash logged in this report is still easily reproducible with both hopper-b10 and the Apr-26 main/baseline. Please see my Apr-19 comments on how to reproduce the crash. ###@###.### 2002-04-26

----------------------------------------------

This is definitely a C2 bug. Under "normal" Linux/Solaris x86 conditions, it occurs when compiling the method COM.volano.mbn::����. At bci 148, the method COM.volano.mau::x is inlined, which requires a card mark. The object for which the card mark is required is spilled due to other inlining/compilation factors. Somehow, C2 goes awry and separates the "shift" instruction that calculates the card for a card mark from the actual dirty-card-mark store. This is bad since 4 safepoint locations intervene between the two instructions. If a GC occurs while we are stopped at any of these 4 safepoints AND the object is moved, then the wrong card may get marked. This problem appears to be intermittent because (1) usually the first compilation of the method deopts due to an UncommonNullCast deoptimization, requiring a second compilation that does not always happen, and (2) timing idiosyncrasies require a GC at one of the 4 safepoints in this comparatively infrequent method.
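A compressed illustration of that failure mode, reusing the toy card-table layout from the sketch above; the function name is hypothetical and the comments mark where the four intervening safepoints would fall:

#include <cstddef>
#include <cstdint>

const int      kCardShift = 9;
static char    heap_base[1u << 20];
static uint8_t card_table[(1u << 20) >> kCardShift];

// Illustrative only: the card-index computation has been separated from
// the dirty-card store, as described for the miscompiled Volano method.
void broken_card_mark(void** field) {
  // C2 scheduled the "shift" (card-index computation) here...
  size_t card = ((char*)field - heap_base) >> kCardShift;

  // ...but 4 safepoint locations intervene before the store below. If a
  // GC runs at one of them and moves the object, 'field' and 'card' are
  // stale: the object now lives under a different card.

  card_table[card] = 1;  // dirties the card for the OLD address: wrong card
}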
Will attempt to debug the problem with a debug build by using -XX:-UncommonNullCast, -XX:CompileOnly, and inspecting the generated assembly code instead of waiting for a SEGV. ###@###.### 2002-05-31
----------------------------------------------
31-05-2002