JDK-4746263 : JDK 1.4.1 dumps core during ECperf; fails debug assert top <= end
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: unknown
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_8,solaris_9
  • CPU: sparc
  • Submitted: 2002-09-12
  • Updated: 2002-09-25
  • Resolved: 2002-09-25
The Version table provides details related to the release in which this issue/RFE will be addressed.

Unresolved: Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed: Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

Release        Status
1.4.1_01 (01)  Fixed
1.4.2          Fixed
Description
Using various builds of JDK 1.4.1 (builds 15-21) on an 8x900MHz V880,
one or more JVM instances (we use 4 instances of BEA WebLogic 7.0)
will terminate intermittently with either SIGSEGV or SIGBUS at
elevated loads. The problem has been reproduced on 2 different
V880 machines running the benchmark.

JDK 1.4.0 does not exhibit this behavior.

The options used are:

-server -XX:+AggressiveHeap -Xmx2500m -Xms2500m -Xmn300m -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=20

Currently testing with "-XX:-InlineArrayCopy" option.


Sample reports:

=======================================================================
# more hs_err_pid14484.log

Unexpected Signal : 10 occurred at PC=0xFECC5284
Function=[Unknown. Nearest: SUNWprivate_1.1+0xC5284]
Library=/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/server/libjvm.so


Dynamic libraries:
0x10000 	/export/home/VMs/j2sdk1.4.1/bin/java
0xff360000 	/usr/lib/lwp/libthread.so.1
0xff390000 	/usr/lib/libdl.so.1
0xff280000 	/usr/lib/libc.so.1
0xff350000 	/usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1
0xfec00000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/server/libjvm.so
0xff230000 	/usr/lib/libCrun.so.1
0xff200000 	/usr/lib/libsocket.so.1
0xfeb00000 	/usr/lib/libnsl.so.1
0xfebd0000 	/usr/lib/libm.so.1
0xff260000 	/usr/lib/libw.so.1
0xfeae0000 	/usr/lib/libmp.so.2
0xfeac0000 	/usr/lib/librt.so.1
0xfeaa0000 	/usr/lib/libaio.so.1
0xfea70000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/native_threads/libhpi.so
0xfea30000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/libverify.so
0xfe9f0000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/libjava.so
0xfe9d0000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/libzip.so
0xfab90000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/libnet.so
0xfabe0000 	/export/home/VMs/j2sdk1.4.1_MPSS/jre/lib/sparc/libioser12.so
0x5a3b0000 	/export/home/ecperf/wls7_0.fcs/weblogic700/server/lib/solaris/libmuxer.so

Local Time = Wed Sep 11 16:30:11 2002
Elapsed Time = 2080
#
# HotSpot Virtual Machine Error : 10
# Error ID : 4F530E43505002E4 01
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.1-internal mixed mode)
#
=======================================================================
more hs_err_pid684.log

Unexpected Signal : 10 occurred at PC=0xFB001E78
Function=registerSynchronization (compiled Java code)
Library=(N/A)

Current Java thread:

Dynamic libraries:
0x10000 	/export/home/VMs/j2sdk1.4.1/bin/java
0xff350000 	/usr/lib/lwp/libthread.so.1
0xff380000 	/usr/lib/libdl.so.1
0xff200000 	/usr/lib/libc.so.1
0xff340000 	/usr/platform/SUNW,Sun-Fire-880/lib/libc_psr.so.1
0xfec00000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/server/libjvm.so
0xff2e0000 	/usr/lib/libCrun.so.1
0xfebe0000 	/usr/lib/libsocket.so.1
0xfeb00000 	/usr/lib/libnsl.so.1
0xfead0000 	/usr/lib/libm.so.1
0xff310000 	/usr/lib/libw.so.1
0xfebc0000 	/usr/lib/libmp.so.2
0xfeaa0000 	/usr/lib/librt.so.1
0xfea80000 	/usr/lib/libaio.so.1
0xfea60000 	/usr/lib/libmd5.so.1
0xfea40000 	/usr/platform/SUNW,Sun-Fire-880/lib/libmd5_psr.so.1
0xfea10000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/native_threads/libhpi.so
0xfe9d0000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/libverify.so
0xfe990000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/libjava.so
0xfe970000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/libzip.so
0xfa9d0000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/libnet.so
0x5a3d0000 	/export2/VMs/j2sdk1.4.1-rc-b19/jre/lib/sparc/libioser12.so
0xfa880000 	er.so

Local Time = Wed Sep  4 12:34:29 2002
Elapsed Time = 2102
#
# HotSpot Virtual Machine Error : 10
# Error ID : 4F530E43505002E6 01
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.1-rc-b19 mixed mode)
#




Comments
CONVERTED DATA
BugTraq+ Release Management Values
COMMIT TO FIX: 1.4.1_01, mantis
FIXED IN: 1.4.1_01, mantis
INTEGRATED IN: 1.4.1_01, mantis, mantis-b03
14-06-2004

EVALUATION
I've looked at one of the core files ( /export/home/jvm_cores/j2sdk1.4.1.b21/core.1 ) with SA, and the problem that caused the seg fault is a bad oop residing in a java/util/Hashtable. There wasn't any clue about where the bad oop came from. However, while I was doing this, a run was being made with a debug VM and it failed with the following assert:

#
# HotSpot Virtual Machine Error, assertion failure
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Server VM (1.4.1-b21-debug mixed mode)
#
# assert(top <= end, "pointers out of order")
#

I captured a core file from this run in /export/home/jvm_cores/j2sdk1.4.1.b21_debug/ecperf.24614 and looked at it with a combination of dbx and SA. This looks to me like compiled code has cached a TLAB top value across a GC point. At the GC point the TLAB was taken from the thread and reset. The value of the TLAB for the thread in question is:

  thread      top         end         size
  0x00000000  0x7bf5cd90  0x00000000  0x00000800

The size (and everything but top) is what you'd expect if ThreadLocalAllocBuffer::reset() was called for this TLAB. The value of top is a legal heap value (in eden). The stack trace for this thread shows that it was executing in weblogic/jdbc/jts/Connection.prepareStatement(Ljava/lang/String;). The thread in question was t@41. I've attached a disassembly of that nmethod and a dump of its registers at the point it did the call (bad.Z). A snippet of the stack trace is here:

=>[1] _libc_nanosleep(0x53d7dbe8, 0x53d7dbe0, 0x0, 0x0, 0x4, 0x0), at 0xff2995ac
  [2] _sleep(0x64, 0x400, 0xff2be164, 0x53d7e0dc, 0x2f, 0x53d7e8dc), at 0xff35e230
  [3] os::message_box(0xfec3ba15, 0x53d7e0dc, 0x53d7e12b, 0x53d7e0dc, 0xfeda5356, 0x8eae9d10), at 0xfe8fbd70
  [4] report_error(0x1, 0xfec6f496, 0x5e, 0xfec3ba15, 0xfec3ba27, 0xfec6f48b), at 0xfe4a7e04
  [5] report_assertion_failure(0xfec6f48b, 0xfec6f496, 0x5e, 0xfec6f4e6, 0x1, 0xf738db98), at 0xfe4a7170
  [6] CollectedHeap::allocate_from_tlab(0x3983488, 0x4, 0x0, 0x0, 0x0, 0x0), at 0xfe5e995c
  [7] CollectedHeap::common_mem_allocate_noinit(0x4, 0x0, 0x3983488, 0x0, 0x0, 0x0), at 0xfe5e8a50
  [8] CollectedHeap::common_mem_allocate_init(0x4, 0x0, 0x3983488, 0x0, 0x1, 0x0), at 0xfe5e47ac
  [9] CollectedHeap::obj_allocate(0x53d7ec7c, 0x4, 0x3983488, 0x0, 0x0, 0x0), at 0xfe5e0c84
  [10] instanceKlass::allocate_instance(0xf79ab458, 0x3983488, 0x0, 0x0, 0x0, 0x0), at 0xfe5d0a78
  [11] OptoRuntime::new_C(0xf79ab450, 0x3983488, 0x6f, 0x8eae9258, 0xb8e95a6e, 0x8eae9d10), at 0xfe9fcbc0
  [12] 0xfa4b0958(0xf79ab450, 0x7f, 0x8c777216, 0x1, 0x7f, 0x8c777212), at 0xfa4b0957
  [13] 0xfa952f34(0xf602a198, 0xf6039e40, 0x3983488, 0x8eaea3b8, 0x1, 0xf738db98), at 0xfa952f33
  [14] 0xfa959aac(0x8c777268, 0x7bf5cd58, 0xf79bdbb0, 0x8c777280, 0x4, 0x0), at 0xfa959aab
  [15] 0xfa8f82b4(0xb4d06500, 0xb4d06500, 0xf77f21f8, 0x7bf5cd58, 0xf7b4b400, 0xf79e1b38), at 0xfa8f82b3

This seems to me to be a compiler bug (or GC) and not a likely runtime bug.
###@###.### 2002-09-13

-----

###@###.### 2002-09-13
This week Cliff has been investigating the assert failure:

# assert(top <= end, "pointers out of order")

with end == 0 and top == pre-GC value of top plus 24 bytes. This reproduces in under 10 minutes with the JIntegra app from Intrinsyc, 4726812. The source of the problem has not been identified, but Cliff is instrumenting to record in each thread object its state when GC starts, and each safepoint's PC. One theory is that the thread is executing native code at the start of GC and escapes being stopped.
Another is that part of the generated code's allocation idiom, the update of 'top', has drifted across a safepoint. The JIntegra problem has been confirmed as a problem in generated code: the allocation idiom's load of 'top' precedes a safepoint and the store of (top + size) occurs after the safepoint.

-----

###@###.### 2002-09-20
Testing of the fix for split-if and depends_only_on_test() indicates that this initial problem has been fixed. The original 70% failure rate has been reduced to 15%, and investigation indicates that the remaining failures are a different bug. I am closing this BugId and opening one for the crash that was hidden by this problem.

-----
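[Editor's note] The following is a minimal, self-contained C++ sketch of the bump-pointer allocation idiom and the assert(top <= end) invariant discussed above. It is not HotSpot source: the Tlab struct, its members, and the allocate/reset helpers are invented for illustration of the failure mode in which a stale 'top' is written back after a safepoint has reset the TLAB.

// Hedged sketch, assuming a simplified TLAB with start/top/end pointers.
#include <cassert>
#include <cstddef>

struct Tlab {
  char*  start;
  char*  top;    // next free byte
  char*  end;    // one past the last usable byte
  size_t size;

  // Fast-path allocation: bump 'top' by 'bytes' if it fits.
  void* allocate(size_t bytes) {
    assert(top <= end && "pointers out of order");   // the failing assert
    if (top + bytes <= end) {
      void* obj = top;
      top += bytes;        // store of (top + size) -- must stay on this side of any safepoint
      return obj;
    }
    return nullptr;        // slow path: refill TLAB / allocate from shared eden
  }

  // What a GC safepoint does to a retired TLAB (loosely modeled on
  // ThreadLocalAllocBuffer::reset() as described in the evaluation).
  void reset() {
    start = top = end = nullptr;
  }
};

int main() {
  char buffer[2048];
  Tlab t{buffer, buffer, buffer + sizeof(buffer), sizeof(buffer)};

  // Normal operation: load top, check against end, store top + size.
  void* a = t.allocate(24);
  (void)a;

  // The bug scenario from the evaluation: compiled code loads 'top' before a
  // safepoint, the safepoint resets the TLAB (end becomes null), and the stale
  // 'top + 24' is stored back afterwards.  The next fast-path allocation then
  // trips assert(top <= end), because end is null while top is a stale eden address.
  char* stale_top = t.top;   // load of 'top' hoisted above the safepoint
  t.reset();                 // GC safepoint retires the TLAB
  t.top = stale_top + 24;    // delayed store: now top > end
  // t.allocate(8);          // would fail: assert(top <= end, "pointers out of order")
  return 0;
}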
11-06-2004

WORK AROUND
Use JDK 1.4.0_01.
11-06-2004

SUGGESTED FIX
###@###.### 2002-09-20
The following fix was provided by cliffc after investigation of the same failure running JIntegra, 4726812.

src/share/vm/opto/loopopts.cpp [1.174 vs. 1.178]
@@ -635,9 +635,17 @@ void PhaseIdealLoop::split_if_with_block
   // If trying to do a 'Split-If' at the loop head, it is only
   // profitable if the cmp folds up on BOTH paths.  Otherwise we
   // risk peeling a loop forever.
-  // CNC - Disabled for now.
-  if( n_ctrl->is_Loop() )
-    policy = 999;             // Policy requires BOTH paths to win
+
+  // CNC - Disabled for now.  Requires careful handling of loop
+  // body selection for the cloned code.  Also, make sure we check
+  // for any input path not being in the same loop as n_ctrl.  For
+  // irreducible loops we cannot check for 'n_ctrl->is_Loop()'
+  // because the alternative loop entry points won't be converted
+  // into LoopNodes.
+  IdealLoopTree *n_loop = get_loop(n_ctrl);
+  for( uint j = 1; j < n_ctrl->req(); j++ )
+    if( get_loop(n_ctrl->in(j)) != n_loop )
+      return;

   // Split compare 'n' through the merge point if it is profitable
   Node *phi = split_thru_phi( n, n_ctrl, policy );

src/share/vm/opto/memnode.hpp [1.93 vs. 1.95]
@@ -223,6 +223,16 @@ public:
   virtual int Opcode() const;
   virtual uint ideal_reg() const { return Op_RegP; }
   virtual int store_Opcode() const { return Op_StoreP; }
+  // depends_only_on_test is almost always true, and needs to be almost always
+  // true to enable key hoisting & commoning optimizations.  However, for the
+  // special case of RawPtr loads from TLS top & end, the control edge carries
+  // the dependence preventing hoisting past a Safepoint instead of the memory
+  // edge.  (An unfortunate consequence of having Safepoints not set Raw
+  // Memory; itself an unfortunate consequence of having Nodes which produce
+  // results (new raw memory state) inside of loops preventing all manner of
+  // other optimizations).  Basically, it's ugly but so is the alternative.
+  // See comment in graphkit.cpp, around line 1923 GraphKit::allocate_heap.
+  virtual bool depends_only_on_test() const { return adr_type() != TypeRawPtr::BOTTOM; }
 };

 //------------------------------LoadKlassNode----------------------------------
@@ -233,6 +243,7 @@ public:
     : LoadPNode(c,mem,adr,at,tk) {}
   virtual int Opcode() const;
   virtual const Type *Value( PhaseTransform *phase ) const;
+  virtual bool depends_only_on_test() const { return true; }
 };

 //------------------------------LoadSNode--------------------------------------
@@ -367,6 +378,7 @@ public:
     : LoadPNode(c,mem,adr,TypeRawPtr::BOTTOM, TypeRawPtr::BOTTOM) {}
   virtual int Opcode() const;
   virtual int store_Opcode() const { return Op_StorePConditional; }
+  virtual bool depends_only_on_test() const { return true; }
 };

 //------------------------------LoadLLockedNode---------------------------------
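[Editor's note] As a rough illustration of the intent of the memnode.hpp change, here is a small, self-contained C++ sketch. It is not HotSpot code: the Node/LoadP/LoadRawTls classes and the can_hoist_above_safepoint helper are invented. It shows how an optimizer could consult a depends_only_on_test() hook so that ordinary pointer loads may float above a safepoint while raw TLS top/end loads report false and stay pinned by their control edge.

// Hedged sketch, assuming a toy node hierarchy with a hoisting predicate.
#include <iostream>

struct Node {
  // True when the node's only dependence on its control input is the test
  // that guards it, so it may be hoisted/commoned past other control
  // (e.g. a safepoint).
  virtual bool depends_only_on_test() const { return true; }
  virtual const char* name() const = 0;
  virtual ~Node() = default;
};

// A load of an ordinary (oop) pointer: safe to hoist or common.
struct LoadP : Node {
  const char* name() const override { return "LoadP (oop field)"; }
};

// A raw-pointer load of TLS 'top'/'end': the control edge is what keeps it
// on the correct side of a safepoint, so it must not be hoisted.
struct LoadRawTls : Node {
  bool depends_only_on_test() const override { return false; }
  const char* name() const override { return "LoadP (raw TLS top/end)"; }
};

// Simplified hoisting decision used by a hypothetical loop optimizer.
bool can_hoist_above_safepoint(const Node& n) {
  return n.depends_only_on_test();
}

int main() {
  LoadP oop_load;
  LoadRawTls tls_load;
  std::cout << oop_load.name() << " hoistable: "
            << can_hoist_above_safepoint(oop_load) << "\n";   // prints 1
  std::cout << tls_load.name() << " hoistable: "
            << can_hoist_above_safepoint(tls_load) << "\n";   // prints 0
  return 0;
}

In the actual patch the distinction is made on the load's address type (adr_type() != TypeRawPtr::BOTTOM) rather than through a separate class, but the effect is the same: raw-pointer loads of TLS top and end are not treated as depending only on their guarding test, so they cannot drift past a safepoint.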
11-06-2004