JDK-8225681 : vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine fails due to a) MT-unsafe modification of inline cache
  • Type: Bug
  • Component: hotspot
  • Sub-Component: jvmti
  • Affected Version: 13,14
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux,windows
  • CPU: x86_64
  • Submitted: 2019-06-12
  • Updated: 2024-10-10
  • Resolved: 2019-10-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 14
14 b18Fixed
Related Reports
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
The following test failure was spotted in the JDK13 CI:

vmTestbase/nsk/jvmti/RedefineClasses/StressRedefine/TestDescription.java

Here's a snippet of the log file:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (open/src/hotspot/cpu/x86/compiledIC_x86.cpp:166), pid=25415, tid=25557
#  assert(old_method == __null || old_method == callee() || !old_method->method_holder()->is_loader_alive()) failed: a) MT-unsafe modification of inline cache
#
# JRE version: Java(TM) SE Runtime Environment (13.0) (fastdebug build 13-ea+0-1311)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 13-ea+0-1311, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x9e16a0][thread 25565 also had an error]
  CompiledDirectStaticCall::set_to_interpreted(methodHandle const&, unsigned char*)+0xd0
#
# Core dump will be written. Default location: /scratch/opt/mach5/mesos/work_dir/slaves/df27b84d-b5c1-4760-9e48-df95fd33274c-S285/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/52db6808-5cf9-4822-96a9-e1bebdb5dd13/runs/02b4b70d-3621-4c81-9560-4c8c96ce62e7/testoutput/test-support/jtreg_open_test_hotspot_jtreg_vmTestbase_nsk_jvmti/scratch/3/core.25415
#
# An error report file with more information is saved as:
# /scratch/opt/mach5/mesos/work_dir/slaves/df27b84d-b5c1-4760-9e48-df95fd33274c-S285/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/52db6808-5cf9-4822-96a9-e1bebdb5dd13/runs/02b4b70d-3621-4c81-9560-4c8c96ce62e7/testoutput/test-support/jtreg_open_test_hotspot_jtreg_vmTestbase_nsk_jvmti/scratch/3/hs_err_pid25415.log
Compiled method (c1)   18713 1966   !   2       jdk.internal.reflect.GeneratedMethodAccessor1068::invoke (270 bytes)
 total in heap  [0x00007f4c89232590,0x00007f4c892354c8] = 12088
 relocation     [0x00007f4c89232700,0x00007f4c89232a18] = 792
 main code      [0x00007f4c89232a20,0x00007f4c89234340] = 6432
 stub code      [0x00007f4c89234340,0x00007f4c89234470] = 304
 oops           [0x00007f4c89234470,0x00007f4c89234478] = 8
 metadata       [0x00007f4c89234478,0x00007f4c892344b8] = 64
 scopes data    [0x00007f4c892344b8,0x00007f4c892348a0] = 1000
 scopes pcs     [0x00007f4c892348a0,0x00007f4c892350a0] = 2048
 dependencies   [0x00007f4c892350a0,0x00007f4c892350b8] = 24
 handler table  [0x00007f4c892350b8,0x00007f4c89235460] = 936
 nul chk table  [0x00007f4c89235460,0x00007f4c892354c8] = 104
AHE@0x00007f4ca0a7d180: 0x7eab0000 i2c: 0x00007f4c88cd4960 c2i: 0x00007f4c88cd4ab2 c2iUV: 0x00007f4c88cd4a75
Compiled method (c1)   18714 1966   !   2       jdk.internal.reflect.GeneratedMethodAccessor1068::invoke (270 bytes)
 total in heap  [0x00007f4c89232590,0x00007f4c892354c8] = 12088
 relocation     [0x00007f4c89232700,0x00007f4c89232a18] = 792
 main code      [0x00007f4c89232a20,0x00007f4c89234340] = 6432
 stub code      [0x00007f4c89234340,0x00007f4c89234470] = 304
 oops           [0x00007f4c89234470,0x00007f4c89234478] = 8
 metadata       [0x00007f4c89234478,0x00007f4c892344b8] = 64
 scopes data    [0x00007f4c892344b8,0x00007f4c892348a0] = 1000
 scopes pcs     [0x00007f4c892348a0,0x00007f4c892350a0] = 2048
 dependencies   [0x00007f4c892350a0,0x00007f4c892350b8] = 24
 handler table  [0x00007f4c892350b8,0x00007f4c89235460] = 936
 nul chk table  [0x00007f4c89235460,0x00007f4c892354c8] = 104
Compiled method (c1)   18720 1966   !   2       jdk.internal.reflect.GeneratedMethodAccessor1068::invoke (270 bytes)
 total in heap  [0x00007f4c89232590,0x00007f4c892354c8] = 12088
 relocation     [0x00007f4c89232700,0x00007f4c89232a18] = 792
 main code      [0x00007f4c89232a20,0x00007f4c89234340] = 6432
 stub code      [0x00007f4c89234340,0x00007f4c89234470] = 304
 oops           [0x00007f4c89234470,0x00007f4c89234478] = 8
 metadata       [0x00007f4c89234478,0x00007f4c892344b8] = 64
 scopes data    [0x00007f4c892344b8,0x00007f4c892348a0] = 1000
 scopes pcs     [0x00007f4c892348a0,0x00007f4c892350a0] = 2048
 dependencies   [0x00007f4c892350a0,0x00007f4c892350b8] = 24
 handler table  [0x00007f4c892350b8,0x00007f4c89235460] = 936
 nul chk table  [0x00007f4c89235460,0x00007f4c892354c8] = 104
Comments
URL: https://hg.openjdk.java.net/jdk/jdk/rev/c16f3a24a6fc User: coleenp Date: 2019-10-08 17:19:09 +0000
08-10-2019

Metadata* Dependencies::DepStream::argument(int i) { Metadata* result = recorded_metadata_at(argument_index(i)); if (result == NULL) { // Explicit context argument can be compressed int ctxkj = dep_context_arg(type()); // -1 if no explicit context arg if (ctxkj >= 0 && i == ctxkj && ctxkj+1 < argument_count()) { result = ctxk_encoded_as_null(type(), argument(ctxkj+1)); } } assert(result == NULL || result->is_klass() || result->is_method(), "must be"); <==== Here return result; } This crashes here which shows non-zero junk in the metadata section. The original assert in the description is a different problem, as Erik says. This crash is a duplicate of JDK-8226690.
24-09-2019

I think we are dealing with different bugs here. I think the assertion failure is separate from the crash. I think the assertion failure happens because there is a static call taking a slow path, but has a safepoint (with redefinition) hit before the call and the static stub patching. The caller nmethod is deoptimized, but the patching logic still looks at the caller static call stub and does not expect to find an is_old() method in there. But that is totally fine, and kept alive by the stack walker. The assert just needs to accept that there could be an is_old() method there. Anything that crashes is very likely to be a separate bug, possibly due to verification of newly compiled nmethods not looking through the metadata section, and hence allowing new compilations to contain is_old() methods, which will eventually lead to weird crashes as they are not necessarily kept alive.
16-07-2019

I've linked some related bugs to this. I'm wondering if JDK-8221183 caused the problem, although that was pushed April and this didn't start showing up until June. But it could be because it's not very reproduceable. I think for 13 maybe we should back out JDK-8221183, which was purely a performance enhancement to MetadataOnStackMark. I think that might fix this and the other linked bug that I have.
16-07-2019

I was able to reproduce this issue just once and then no luck anymore. It was not reproduced in 200 runs. Not much time left before RDP2. Re-targetting to 14.
16-07-2019

Was not able to reproduce it locally so far on my Linux-x64 server sca00jqu with the fastdebug build. I used the VM options: "-server -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+UseNUMA". Do I miss anything, or maybe the reproducibility level is very low?
12-07-2019

Why are core files disabled?!!? > # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
08-07-2019

The first deoptimization we mark all nmethods for deoptimization and then deoptimize them. I don't see the "first" bug: void VM_RedefineClasses::flush_dependent_code() { assert(SafepointSynchronize::is_at_safepoint(), "sanity check"); bool deopt_needed; // This is the first redefinition, mark all the nmethods for deoptimization if (!JvmtiExport::all_dependencies_are_recorded()) { log_debug(redefine, class, nmethod)("Marked all nmethods for deopt"); CodeCache::mark_all_nmethods_for_evol_deoptimization(); <== mark all deopt_needed = true; <== set deopt_needed } else { int deopt = CodeCache::mark_dependents_for_evol_deoptimization(); log_debug(redefine, class, nmethod)("Marked %d dependent nmethods for deopt", deopt); deopt_needed = (deopt != 0); } if (deopt_needed) { CodeCache::flush_evol_dependents(); <== deoptimize all } The bug reported here is that the inline cache points to an "old" method in the inline cache, but we should have deoptimized this nmethod. I thought the nmethod should be moved off the stack but it looks like it keeps running and finds an "old" method in the inline cache. Maybe we should still clear_inline_caches.
24-06-2019

[~eosterlund] yes, I agree with your reasoning. The assert looks way too strong and filtering out old methods should fix it.
19-06-2019

Thanks for confirming this [~vlivanov]. It looks like we might have two bugs in this area then. What are the odds?! The bug I mentioned would occur the very first class redefinition; it seems to forget actually deoptimizing (as opposed to just marking for deopt) the nmethods. But since the old version was 414 and the new 415. So while this does indeed look like a real bug that we should fix, it's not what we see here. Another observation is that class redefinition no longer clears or even cleans inline caches. I think this is more likely to explain this assertion. Instead of clearing/cleaning inline caches in the redefinition safepoint, we rely on bad callees being not_entrant, causing the call to be re-resolved, and hence inline caches to be fixed up lazily. But during re-resolution of such a static call, the MT-safe assert will yield false positives. The scenario then is that since the class redefinition safepoint is released, caller nmethod static call stubs may point at a callee not_entrant deopted nmethod (which is perfectly fine). The mutators will compete for calling the re-resolution handler which under mutual exclusion might want to CompiledDirectStaticCall::set_to_interpreted. In there, the assertion will be sad then since the callee() != old_method. But really, there is nothing MT-unsafe about this. It's just a false positive assertion error of a perfectly sane and MT-safe situation. The solution would be to add yet another line to this assertion that explicitly allows old_method->is_old() situations. I think we should fix that other bug too for the very first redefinition. I heard [~coleenp] loves fixing these class redefinition problems. Can't get enough of them. That's what I heard...
18-06-2019

Thanks, [~eosterlund]. Yes, that's the case here: call site stub contains old method (version 414 vs 415). Additional info from debugging: Call site: 0x00007f4c8923333f: callq 0x7f4c88b9f4a0 relocation info: +062: 0x4417: type=4 format=1 offset=23 addr=0x00007f4c8923333f +348: 0xf801: type=15 datalen=1 format=2 immediate=1 +349: 0xf6e1 +350: 0x5005: type=5 format=0 offset=5 addr=0x00007f4c89234345 static_stub_Relocation::static_call() == 0x00007f4c8923333f (= 0x00007f4c89232a20 - 0xf6e1) static_call_Relocation::static_stub() == 0x00007f4c89234345 0x00007f4c89234345: movabs $0x7f4c6a7e77d0,%rbx 0x00007f4c8923434f: jmpq 0x7f4c88cd4ab2 old_method = 0x7f4c6a7e77d0 old_method->_constMethod->_constants._saved._version 414 Event: 18.289 Thread 0x00007f4ca054f800 redefined class name=MyClass, count=414 Event: 18.439 Thread 0x00007f4ca054f800 redefined class name=MyClass, count=415
18-06-2019

[~vlivanov] Yes I do. In VM_RedefineClasses::flush_dependent_code(), the first time we redefine classes, we mark all nmethods for deoptimization, but we never "flush" (CodeCache::flush_evol_dependents()) the deoptimization, so that the nmethods are made not entrant. Think that is the bug. The consequence is that nmethods with static call stubs containing an old Method* will be exposed to mutators, which can run in to this assert.
17-06-2019

[~eosterlund] do you see anything suspicious?
17-06-2019

Putting back to jvmti subcategory for further investigations.
17-06-2019

There were no class unloading before the crash. Nmethod where the crash happens has been installed right before the crash. There are numerous threads (60) exercising the code being redefined. According to the log, all of them are blocked except 2 - one is in running state and another is where the crash happened: =>0x00007f4ca0e73000 JavaThread "Thread-0" [_thread_in_vm, id=25557, stack(0x00007f4c6a641000,0x00007f4c6a742000)] 0x00007f4ca0e96800 JavaThread "Thread-8" [_thread_in_Java, id=25565, stack(0x00007f4c69e39000,0x00007f4c69f3a000)] So, it seems Thread-8 just resolved the call site and patched the code, but concurrent resolution attempt in Thread-0 doesn't agree on the result and the assert failed: assert(old_method == NULL || old_method == callee() || !old_method->method_holder()->is_loader_alive(), "a) MT-unsafe modification of inline cache"); (old_method->method_holder()->is_loader_alive() is true because no class unloading happened.)
17-06-2019

Relevant info from the hs_err: J 1966 c1 jdk.internal.reflect.GeneratedMethodAccessor1068.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (270 bytes) @ 0x00007f4c89233344 [0x00007f4c89232a20+0x0000000000000924] Event: 18.405 Thread 0x00007f4ca056a000 1966 ! 2 jdk.internal.reflect.GeneratedMethodAccessor1068::invoke (270 bytes) Event: 18.409 Thread 0x00007f4ca056a000 nmethod 1966 0x00007f4c89232590 code [0x00007f4c89232a20, 0x00007f4c89234470] Event: 17.962 GC heap before {Heap before GC invocations=1 (full 0): garbage-first heap total 974848K, used 230671K [0x0000000768400000, 0x0000000800000000) region size 1024K, 227 young (232448K), 3 survivors (3072K) Metaspace used 16684K, capacity 21172K, committed 21296K, reserved 1067008K class space used 1929K, capacity 2866K, committed 2944K, reserved 1048576K } Event: 18.045 GC heap after {Heap after GC invocations=2 (full 0): garbage-first heap total 974848K, used 4781K [0x0000000768400000, 0x0000000800000000) region size 1024K, 4 young (4096K), 4 survivors (4096K) Metaspace used 16684K, capacity 21172K, committed 21296K, reserved 1067008K class space used 1929K, capacity 2866K, committed 2944K, reserved 1048576K } Event: 18.247 Executing VM operation: RedefineClasses Event: 18.328 Executing VM operation: RedefineClasses done Event: 18.430 Executing VM operation: RedefineClasses Event: 18.582 Executing VM operation: RedefineClasses done Classes unloaded (0 events): No events Classes redefined (20 events): ... Event: 18.289 Thread 0x00007f4ca054f800 redefined class name=MyClass, count=414 Event: 18.439 Thread 0x00007f4ca054f800 redefined class name=MyClass, count=415 Latest events: Event: 18.586 Thread 0x00007f4ca0e96800 DEOPT UNPACKING pc=0x00007f4c88b9e33c sp=0x00007f4c69f37b38 mode 0 Event: 18.587 Protecting memory [0x00007f4ca873f000,0x00007f4ca8740000] with protection modes 7
17-06-2019

I haven't found any signs it is related to static initialization guards. Moreover, I wasn't able to reproduce it during a continuous rerun attempts over the weekend. As Rahul mentions, the assert was touched (relaxed) recently by JDK-8222841. Looks like there's another bug lurking there.
17-06-2019

There is no JVMTI code on the stack: Stack: [0x00007f4c6a641000,0x00007f4c6a742000], sp=0x00007f4c6a73d0c0, free space=1008k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x9e16a0] CompiledDirectStaticCall::set_to_interpreted(methodHandle const&, unsigned char*)+0xd0 V [libjvm.so+0x9dfc24] CompiledStaticCall::set(StaticCallInfo const&)+0x164 V [libjvm.so+0x1563629] SharedRuntime::resolve_sub_helper_internal(methodHandle, frame const&, CompiledMethod*, bool, bool, Handle, CallInfo&, Bytecodes::Code, Thread*)+0x559 V [libjvm.so+0x156ecd5] SharedRuntime::resolve_sub_helper(JavaThread*, bool, bool, Thread*)+0x3f5 V [libjvm.so+0x156f079] SharedRuntime::resolve_helper(JavaThread*, bool, bool, Thread*)+0xa9 V [libjvm.so+0x156f2dc] SharedRuntime::resolve_static_call_C(JavaThread*)+0xec v ~RuntimeStub::resolve_static_call J 1966 c1 jdk.internal.reflect.GeneratedMethodAccessor1068.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (270 bytes) @ 0x00007f4c89233344 [0x00007f4c89232a20+0x0000000000000924] J 1810 c2 jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@13-ea (10 bytes) @ 0x00007f4c90677fac [0x00007f4c90677ec0+0x00000000000000ec] J 1334 c1 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@13-ea (65 bytes) @ 0x00007f4c8904881c [0x00007f4c890485a0+0x000000000000027c] j nsk.jvmti.RedefineClasses.StressRedefine.runMethod(Ljava/util/Random;Ljava/lang/String;)V+79 j nsk.jvmti.RedefineClasses.StressRedefine$StaticMethodCaller.run()V+6 j java.lang.Thread.run()V+11 java.base@13-ea v ~StubRoutines::call_stub V [libjvm.so+0xdf3e7a] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x6ea V [libjvm.so+0xdf0eef] JavaCalls::call_virtual(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x33f V [libjvm.so+0xdf110a] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0xca V [libjvm.so+0xf3a861] thread_entry(JavaThread*, Thread*)+0x91 V [libjvm.so+0x16cafba] JavaThread::thread_main_inner()+0x26a V [libjvm.so+0x16d3627] JavaThread::run()+0x227 V [libjvm.so+0x16d0696] Thread::call_run()+0xf6 V [libjvm.so+0x13eea8e] thread_native_entry(Thread*)+0x10e I suspect this may be related to recent changes in relation to static initialization guards. Moving to compiler team for analysis. [~vlivanov] can you take an initial look please?
12-06-2019