Bug ID: JDK-4692989 Crash in OopFlow::build_oop

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 1.4.2

Priority: P3
Status: Closed
Resolution: Duplicate
OS: solaris_8
CPU: generic

Submitted: 2002-05-29
Updated: 2002-06-13
Resolved: 2002-06-13

With the CMS collector, a VM crash occurred early in the run (possibly
during startup) with this stack trace.  This VM is from a workspace that is
up to date with gc_baseline, which has post-1.4.1 changes.

=>[1] __sigprocmask(0x0, 0xa837f540, 0x0, 0x0, 0x0, 0x0), at 0xff369b80
  [2] _resetsig(0xff36c4d0, 0x0, 0x0, 0xa8381d78, 0xff37e000, 0x0), at 0xff35e548
  [3] _sigon(0xa8381d78, 0xff3859a8, 0x6, 0xa837f614, 0xa8381d78, 0xa837f658), at 0xff35dc38
  [4] _thrp_kill(0x0, 0x11, 0x6, 0xff37e000, 0x11, 0xff2bc498), at 0xff360dac
  [5] raise(0x6, 0x0, 0x0, 0xffffffff, 0xff2bc404, 0xa837f768), at 0xff24af98
  [6] abort(0xff2b800c, 0xa837f768, 0x0, 0xfffffff8, 0x4, 0xa837f789), at 0xff235774
  [7] os::abort(0x1, 0xfe5d18a0, 0xa837fff0, 0xfe578963, 0xfe5788f2, 0xff0000), at 0xfe46df20
  [8] report_error(0xa838000e, 0xee4, 0xfe5e9ea0, 0xe2, 0xfe50f54c, 0xfe4f7640), at 0xfe352fd0
  [9] report_fatal(0xfe4f75f0, 0x107, 0xfe4f7641, 0x0, 0x4b994c, 0xfe30c784), at 0xfe35263c
  [10] OopFlow::build_oop_map(0x18, 0x7ce16c, 0x8, 0x501e18, 0x10, 0x6da168), at 0xfe30c93c
  [11] OopFlow::compute_reach(0xa6, 0xffffffff, 0x48, 0xfffffffe, 0x6da168, 0x7e1658), at 0xfe30c270
  [12] Compile::BuildOopMaps(0x1133d0, 0xe7908, 0x15c, 0x2b8, 0x47a560, 0x4a2ea0), at 0xfe30d950
  [13] Compile::Output(0xa83812b8, 0x803eec, 0x14, 0x0, 0x0, 0x0), at 0xfe1d7204
  [14] Compile::Code_Gen(0xa83812b8, 0xa838108c, 0xa83811cc, 0x703508, 0xa83811cc, 0x0), at 0xfe33ccb8
  [15] Compile::Compile(0x237b4c, 0x1f5e58, 0x0, 0x25a910, 0xffffffff, 0x1), at 0xfe33c19c
  [16] C2Compiler::compile_method(0x2b8f8, 0xa8381ab0, 0x0, 0x25a910, 0xffffffff, 0x0), at 0xfe30e2b8
  [17] CompileBroker::invoke_compiler_on_method(0x290, 0x0, 0xffffffff, 0x116b88, 0xfe5d33c0, 0x116af8), at 0xfe1edfc8
  [18] CompileBroker::compiler_thread_loop(0x116af8, 0x116af8, 0x111b78, 0x117090, 0x354440, 0xfe240210), at 0xfe2852d8
  [19] JavaThread::run(0x116af8, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfe240238
  [20] _start(0x116af8, 0xff37f6b8, 0x1, 0x1, 0xff37e000, 0x0), at 0xfe23a1fc


See comments.

PUBLIC COMMENTS no comment

10-06-2004

EVALUATION

Since this was found in a post Hopper workspace of gc_baseline, I will commit it to Mantis for tracking purposes, and we will look at the core ASAP.
###@###.### 2002-05-29

The point at which the failure occurs is this line, in buildOopMap.cpp:

    guarantee( 0, "must find derived/base pair" );

This can happen when the code scheduler disagrees with the register allocator as to the live range of a derived pointer, so that the scheduler introduces a new safepoint within the live range of a derived pointer which the allocator was not expecting. The result is that the assembler cannot find the information it needs in the safepoint graph structure to generate an oop map for the safepoint that will preserve the derived pointer.

If this model is true, a workaround is to run with -XX:-OptoScheduling. It is possible that the CMS memory barriers involve forms of derived pointers which the scheduler is not expecting, so it allows them to slide around past unprepared safepoints.
###@###.### 2002-05-31

I started a -OptoScheduling run. The failure has been happening reliably within 24 hours. I initially thought that it was happening during startup but it happens other times too.
###@###.### 2002-05-31

If the fault is not with the scheduler, then there may be an earlier disagreement between the allocator's idea of which derived pointers are live, and the oop map generator's. After trawling through the core file, here is the inlined JVM state at which the offending call is made:

    java/lang/String substring (II)Ljava/lang/String; bci:62
    java/lang/String substring (I)Ljava/lang/String; bci:6
    com/nortel/ipc/ss/cs/sip/call/base/fw/SiptApplHeaders getXNortelHeader (Lcom/nortel/ipc/ss/cs/sip/xact/api/events/InviteEvent;)Lcom/nortel/foundation/collections/SList; bci:186

###@###.### 2002-05-31

The attached file smallTest.java, which is a boiled down version of the failing method, exhibits the root cause of the bug. There is a call followed by an array store.
The card mark for the array store looks like this:

    String[] routes = ...
    int elementaddr = (int)(&routes[0])
    int cardindex = elementaddr >> 9
    int cardindex = (int)(&routes[0]) >> 9

Both the cast and the shift are ALU operations which in principle could be scheduled either before or after the call which precedes the array store. To prevent a GC during the call from invalidating those temporary values, the cast operation (a CastP2I) is given a control input which forces it to be scheduled after the call. At least in theory.

In practice, the control input to the CastP2I is twiddled by the optimizer, which "proves" that it is equivalent to some other dominating control input. Note that the array store includes a null check (of the array itself). The CastP2I takes as its control input the null check, but this null check is replaced by a dominating null check (route.length() in the example). This allows the scheduler to slip the CastP2I above the call, causing an insidious GC bug. Currently, java_g does not self-diagnose this serious error; it should.

In the present bug, a late tweak in the schedule pops the CastP2I back after the call, making the call "see" an unexpected live derived pointer (the address of routes[0]). In fact, that code is more correct than the previous (non-asserting) state, when the cast address was live across the call, as an undifferentiated int in a register.

The solution is to make CastP2I nodes not participate in control edge optimizations. The fix to bug 4629175 fixes this root problem also. We will test the Nortel app with that fix in place, and if it passes, we can close this bug as a dup of 4629175.
###@###.### 2002-06-07
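
The card-mark arithmetic described in the evaluation can be illustrated with a small Java sketch. It is not HotSpot code and it is not the attached smallTest.java: the class name, the cardIndex helper, and the example addresses are all hypothetical, and addresses are modeled as plain long values. The only detail taken from the evaluation is the shift by 9 (512-byte cards). The sketch shows why a card index computed from an address taken before a call may be stale if a GC moves the array during that call.

```java
public class CardMarkSketch {
    // Assumption: 512-byte cards, matching the ">> 9" shift quoted in the evaluation.
    static final int CARD_SHIFT = 9;

    // Hypothetical helper: maps a raw element address to a card index.
    static long cardIndex(long elementAddr) {
        return elementAddr >>> CARD_SHIFT;
    }

    public static void main(String[] args) {
        // Made-up address of routes[0] before the call.
        long addrBeforeCall = 0x10000L;
        // Card index computed too early, i.e. the cast scheduled above the call.
        long staleCard = cardIndex(addrBeforeCall);

        // Made-up address of routes[0] after a GC during the call moved the array.
        long addrAfterGc = 0x24000L;
        // Card index computed after the call, which is the intended ordering.
        long freshCard = cardIndex(addrAfterGc);

        System.out.println("stale card = " + staleCard);
        System.out.println("fresh card = " + freshCard);
        System.out.println("stale card still valid: " + (staleCard == freshCard));
    }
}
```

In this model the raw address is just an integer, so nothing ties it to the object it came from. That mirrors the evaluation's point that the cast address was "an undifferentiated int in a register" while live across the call.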

07-06-2002