Bug ID: JDK-4947125 volanomark failed intermittently in C2 mode with tiger b25

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 5.0

Priority: P2
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2003-10-31
Updated: 2016-08-17
Resolved: 2004-02-18

Other
5.0 b39 Fixed

With tiger b25 on Solaris SPARC, volanomark failed intermittently.
Flags used: -server -XX:+UseConcMarkSweepGC

Test machine: j2se-db.west ( 12 cpu * 1050MHZ )
SunOS j2se-db 5.8 Generic_108528-23 sun4u sparc SUNW,Sun-Fire


Failure frequency: 4 times in 9 days ( 58000 iterations ), about 0.007%

Current thread (0x0026c188):  JavaThread "CompilerThread0" daemon [_thread_in_vm, id=21]

  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [12] methodOopDesc::set_code(0xf394fda8, 0xf9063008, 0x56e870, 0x393e08, 0xea481d98, 0x0), at 0xfe22d544
  [13] ciEnv::register_method(0x23, 0x0, 0xffffffff, 0xf9063008, 0x0, 0x0), at 0xfe37f11c
  [14] Compile::Compile(0xea481150, 0x0, 0x237f18, 0x56e870, 0xfe625e25, 0x0), at 0xfe3a8c58
  [15] C2Compiler::compile_method(0xea4819d8, 0xc00, 0x4e4680, 0xffffffff, 0x237f18, 0x1), at 0xfe37554c
  [16] CompileBroker::invoke_compiler_on_method(0x393e08, 0x26c198, 0x26c76c, 0x23e980, 0x26c198, 0x0), at 0xfe22b66c
  [17] CompileBroker::compiler_thread_loop(0xfe6def90, 0x23e950, 0x26c198, 0xf394fda8, 0x393e08, 0x1), at 0xfe2c9ed0
  [18] JavaThread::run(0x26c198, 0x1, 0x2, 0xfe6d4a7c, 0xfe6d4a78, 0xea402000), at 0xfe27fd4c
  [19] _start(0x26c198, 0x5800, 0x21db, 0x437c, 0xfe692000, 0x26cd08), at 0xfe27ce3c

core files are under /export/archive/tiger_b25/VolanoMarkrun.24606.-server

###@###.### 2003-10-31

CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX: tiger-beta2
FIXED IN: tiger-beta2
INTEGRATED IN: tiger-b39
VERIFIED IN: tiger-beta2

14-06-2004

EVALUATION

###@###.### 2003-11-04

SEGV at this pc due to l7 == 0:
-------------------------------
pc 0xfe22d544:set_code+0x14     ld      [%l7 + 0x68], %l6

0xfe22d530: set_code       :    save    %sp, -0x70, %sp
0xfe22d534: set_code+0x0004:    orcc    %g0, %i1, %i1
0xfe22d538: set_code+0x0008:    be,pn   %icc,set_code+0x24
0xfe22d53c: set_code+0x000c:    st      %i1, [%i0 + 0x38]
0xfe22d540: set_code+0x0010:    ld      [%i0 + 0x38], %l7
0xfe22d544: set_code+0x0014:    ld      [%l7 + 0x68], %l6
0xfe22d548: set_code+0x0018:    st      %l6, [%i0 + 0x40]
0xfe22d54c: set_code+0x001c:    ret
0xfe22d550: set_code+0x0020:    restore
0xfe22d554: set_code+0x0024:    ld      [%i0 + 0x18], %i3

l4-l7 0x00000011 0x000028d0 0x00006754 0x00000000
i0-i3 0xf394fda8 0xf9063008 0x0056e870 0x00393e08

According to the assembler code, the method update_compiled_code_entry_point() is inlined into set_code() (previously it was not, which is why the test passed before):

    // Install compiled code for this nmethod
    void methodOopDesc::set_code(nmethod* code) {
    #ifdef COMPILER2
      NOT_CORE(assert(code == NULL || !code->is_osr_method(), "osr code should not be used here");)
    #endif
      _code = code;
      // Update compiler entrypoint
      update_compiled_code_entry_point(true);
    }

    void methodOopDesc::update_compiled_code_entry_point(bool lazy) {
    #ifndef CORE
      if (_code != NULL) {
        _from_compiled_code_entry_point = _code->verified_entry_point();
        return;
      }
      ...

###@###.### 2003-11-13

The instruction "st %i1, [%i0 + 0x38]" is always executed before "ld [%i0 + 0x38], %l7". The only way the load can then return NULL is if another thread calls set_code(NULL) for the same methodOop (for example, due to deoptimization or class unloading) and stores NULL before the 'ld' in our thread executes. This is a data race. It appears to be a runtime bug.

-----------

This code and its callers need to be rewritten to make the data race benign. Otherwise the callers need to take out a lock (locking the methodOop?) to remove the race. I think it is possible to rework the code to make the race benign.

###@###.### 2003-11-17

Reworking the code to make the race benign by making sure that the methodOop _from_compiled_code_entry_point doesn't point to a verified entry point if the nmethod is swept.

###@###.### 2004-02-09

4947125 volanomark failed intermittently in C2 mode with tiger b25

The volano failure showed a race between setting two related fields in the methodOop for compiled code. One was the nmethod _code and the other was a field, _from_compiled_code_entry_point, which depends on the value of _code. The compiler sets the latter field to the verified_entry_point of the nmethod, and uses this field to generate calls to compiled code from other compiled code. The field is also used in generated code to call code through vtable stubs.

If a method is being deoptimized through an uncommon trap (which can happen on any thread and does not need a safepoint), the nmethod is marked non-entrant, the _code field is set to NULL, and the _from_compiled_code_entry_point is set to the c2i adapter.

The volano crash occurred because the compiler thread set the code to an nmethod while deoptimization was occurring.

    compiler: _code = code           deopt: ....
    if (_code != NULL)               ....
    ....                             _code = NULL;
    _from_cep = _code->vep();        ....
    sig 11                           ....
                                     _from_cep = lazy-c2i-adapter;

The window between checking _code and getting the verified entry point is very small, and we suspect that the crash started occurring more frequently because the C++ compiler stopped generating code which cached the value of the _code field in a register. This crash only started happening on fast/big machines. I was able to provoke this crash quickly by putting a stall in between the test for _code and the fetch for vep() and running with DeoptimizeALot.

So the race between the values of these fields yields these states:

    _code = nmethod;  _from_cep = nmethod->vep();
        /// okay - set by the compiler.

    _code = NULL      _from_cep = c2i adapter
        /// okay - set by deopt

    _code = nmethod;  _from_cep = c2i adapter
        // result: compiled code goes to interpreter entry point then to
        // compiled code unverified entry point. It's slow but correct.
        // <does this ever get fixed...???> only if the code is either
        // deoptimized again or compiled again.

    _code = NULL;     _from_cep = nmethod->vep();
        // result: The verified entry point has already been patched to
        // handle_wrong_method which will call
        // update_compiled_code_entry_point(), this will fix up the
        // _from_cep field for the null _code case.

The worry is if the pointer into the verified entry_point persists after the nmethod memory is flushed. Can this field result in a branch into space? Yes. Before a nonentrant method is flushed it will be made into a zombie method. Extra code is added to fix up the compiled code entry in this case. I added an assert to make sure the flushed method doesn't point back into the nmethod.

In summary, the fix is to cache the value of _code in update_compiled_entry_point so it doesn't SEGV, fix up the value of _from_compiled_code_entry_point in make_zombie(), assert that methods don't point into verified entry points of flushed methods, and add lots of comments that insist the race is now benign.

Thanks to ###@###.### for talking me through this and for providing the fix. Thanks to ###@###.### for diagnosing it.

###@###.### 2004-02-13
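
The three parts of the fix summarized above can be sketched roughly as follows. This is an illustration only, not the actual HotSpot change: NMethod, Method, fix_up_for_zombie and points_into are hypothetical stand-ins for nmethod, methodOopDesc and the make_zombie() fix-up, and the use of std::atomic is an assumption made so the sketch is well defined in standard C++. The key point is that _code is read exactly once into a local, so a concurrent store of NULL can no longer fall between the NULL test and the dereference.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for HotSpot's nmethod; only the entry point matters here.
struct NMethod {
    const void* vep;  // verified entry point
    const void* verified_entry_point() const { return vep; }
};

// Hypothetical stand-in for methodOopDesc, holding the two related fields.
struct Method {
    std::atomic<NMethod*> code{nullptr};                   // plays the role of _code
    std::atomic<const void*> from_compiled_entry{nullptr}; // plays the role of _from_compiled_code_entry_point
    const void* c2i_adapter;

    explicit Method(const void* adapter) : c2i_adapter(adapter) {
        from_compiled_entry.store(adapter);
    }

    void update_compiled_code_entry_point() {
        // Read the field once. The crash came from testing the field and then
        // re-reading it for the dereference while another thread stored NULL.
        NMethod* cached = code.load(std::memory_order_acquire);
        if (cached != nullptr) {
            from_compiled_entry.store(cached->verified_entry_point(),
                                      std::memory_order_release);
        } else {
            from_compiled_entry.store(c2i_adapter, std::memory_order_release);
        }
    }

    void set_code(NMethod* nm) {
        code.store(nm, std::memory_order_release);
        update_compiled_code_entry_point();
    }

    // True if the compiled entry still points at nm's verified entry point.
    bool points_into(const NMethod* nm) const {
        return from_compiled_entry.load(std::memory_order_acquire) ==
               nm->verified_entry_point();
    }

    // Assumed analogue of the make_zombie() fix-up: before the nmethod can be
    // flushed, redirect a stale entry back to the c2i adapter, then assert
    // that nothing points into the dying nmethod any more.
    void fix_up_for_zombie(const NMethod* dying) {
        if (points_into(dying)) {
            from_compiled_entry.store(c2i_adapter, std::memory_order_release);
        }
        assert(!points_into(dying));
    }
};
```

With this shape, the stale state (_code = NULL, _from_cep = nmethod->vep()) can still appear briefly, but it no longer causes a NULL dereference, and the zombie fix-up removes the stale pointer before the nmethod memory could be reused.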

11-06-2004

SUGGESTED FIX See comments.

11-06-2004