Bug ID: JDK-8200366 SIGSEGV in CodeHeapState::print

JDK 11
11 b10Fixed

Fix is pushed. Waiting "robot" to update bug's status. Changeset: c42db4d81e33 Author: lucy Date: 2018-03-29 14:53 -0700 URL: http://hg.openjdk.java.net/jdk/hs/rev/c42db4d81e33
29-03-2018
Tests passed except known issues.
29-03-2018
http://cr.openjdk.java.net/~lucy/webrevs/8200366.02/
29-03-2018
I have implemented the following modifications * only granules for which CodeBlobs were found during "aggregate" are scanned. As a side effect, that brings the MethodNames output closer to the state that was recorded during "aggregate". * allocated, but uninitialized blobs detection is tried with (this + header_size) == relocation_begin && (this + header_size + relocation_size) == content_begin * CodeBlob->name() returning NULL is handled by assigning a default name To my understanding, CodeHeap::find_blob_unsafe() does not detect an allocated, but uninitialised blob. It used find_start() and then verifies that the returned CodeBlob actually contains the address that was used to find the CodeBlob. I would say, CodeHeap::find_blob_unsafe() verified correctness of find_start(). Tobias is currently testing a preliminary patch. If it works well, I will send it out for public review today.
29-03-2018
May be use one of CodeBlob's contains() methods. This is what CodeHeap does: http://hg.openjdk.java.net/jdk/hs/file/dd76fcbda95f/src/hotspot/share/memory/heap.cpp#l305
29-03-2018
Note, it does not work for AOT code but it is located in separate space.
29-03-2018
Looks like this + _header_size = _relocation_begin http://hg.openjdk.java.net/jdk/hs/file/dd76fcbda95f/src/hotspot/share/code/codeBlob.hpp#l296 (gdb) p (CodeBlob)0x7fffe3794590 $11 = {_vptr.CodeBlob = 0x7ffff70d25d0 <vtable for nmethod+16>, _type = compiler_c2, _size = 1368, _header_size = 400, _frame_complete_offset = 12, _data_offset = 840, _frame_size = 6, _code_begin = 0x7fffe3794740 "\211\204$", _code_end = 0x7fffe37948d8 "p��\210\377\177", _content_begin = 0x7fffe3794740 "\211\204$", _data_end = 0x7fffe3794ae8 '\335' <repeats 24 times>, "@", _relocation_begin = 0x7fffe3794720 "\t\241:\260\034d\fd\025", _relocation_end = 0x7fffe3794730 '\314' <repeats 15 times>, "��\204$", _oop_maps = 0x7fff5c0a9f30, _caller_must_gc_arguments = false, _strings = { _strings = 0x7fff5c0a9e70, _defunct = false, static _prefix = 0x7ffff6a076d6 " ;; "}, _name = 0x7ffff6b28eb9 "nmethod"} You can still hit a problem but its probability will be much less than current case.
29-03-2018
So the question is can we determine if allocated space is initialized without virtual call? Something like checking that _content_begin field has valid value.
29-03-2018
Looks like machine (MacMini) on which it ran has only 4 threads (2 cores * 2 hyperthreads). It is possible that scheduler messed up threads execution.
29-03-2018
So the problem is next code which cast allocated space to CodeBlob* when it is still no initialized (no virtual pointer): CodeBlob* this_blob = (CodeBlob )(heap->find_start(low_bound+ixgranule_size+is));
29-03-2018
So the only my explanation is that CodeHeapState accessed this space after it is allocated (nmethod::new()) but before constructir is executed.
29-03-2018
And test run ouput have this: > Compiled method (c1) 2630 673 1 com.sun.management.internal.DiagnosticCommandArgumentInfo::getName (5 bytes) > total in heap [0x000000011b891d90,0x000000011b892178] = 1000 > relocation [0x000000011b891f20,0x000000011b891f60] = 64 > main code [0x000000011b891f60,0x000000011b892060] = 256 > stub code [0x000000011b892060,0x000000011b892118] = 184 > metadata [0x000000011b892118,0x000000011b892120] = 8 > scopes data [0x000000011b892120,0x000000011b892130] = 16 > scopes pcs [0x000000011b892130,0x000000011b892170] = 64 > dependencies [0x000000011b892170,0x000000011b892178] = 8
29-03-2018
Looks like mov 0x70(%r13),%r12 instruction loads _name field (at least on Linux-x64): (gdb) p &((struct nmethod)0)->_name $9 = (const char ) 0x70 Anyway, it looks like CodeCache space which nmethod points to is not initialized - it contains 0xcccccccccccccccc
29-03-2018
I tried to decode assembler from hs_err file and it looks like this (ignore addresses): 0x7fff40000008: mov $0x21,%esi 0x7fff4000000d: mov %rbx,%rdi 0x7fff40000010: callq 0x7fff40722e7b 0x7fff40000015: mov 0x0(%r13),%rax 0x7fff40000019: mov 0x70(%r13),%r12 0x7fff4000001d: mov %r13,%rdi 0x7fff40000020: callq 0x10(%rax) <<<< crashed here (virtual call for nmethod?) 0x7fff40000023: test %al,%al 0x7fff40000025: mov $0x0,%eax 0x7fff4000002a: cmove %rax,%r14 0x7fff4000002e: cmpb $0x0,0xda5ef4(%rip) # 0x7fff40da5f29 0x7fff40000035: je 0x7fff4000005b 0x7fff40000037: mov %r12,-0x130(%rbp) RAX=0xcccccccccccccccc is an unknown value RBX=0x00000001287b4808 is pointing into the stack for thread: 0x00007fa52519c000 RCX=0x0000000000000020 is an unknown value RDX=0x0 is NULL RSP=0x00000001287b4750 is pointing into the stack for thread: 0x00007fa52519c000 RBP=0x00000001287b48e0 is pointing into the stack for thread: 0x00007fa52519c000 RSI=0x0000000108de6237: __cxx_global_var_init52+0x27387 in /scratch/mesos/jib-master/install/2018-03-28-0456395.hotspot_gk_ww_grp.source/macosx-x64-debug.jdk/jdk-11/fastdebug/lib/server/libjvm.dylib at 0x0000000107fb5000 RDI=0x000000011b891d90 is at entry_point+-464 in (nmethod)0x000000011b891d90 R8 =0x0 is NULL R9 =0x00007fff8a571d60: zeroes+0 in /usr/lib/system/libsystem_c.dylib at 0x00007fff8a4ea000 R10=0x0000000108e2e4fd: __cxx_global_var_init52+0x6f64d in /scratch/mesos/jib-master/install/2018-03-28-0456395.hotspot_gk_ww_grp.source/macosx-x64-debug.jdk/jdk-11/fastdebug/lib/server/libjvm.dylib at 0x0000000107fb5000 R11=0x00007fa41b3c6c0a is an unknown value R12=0xcccccccccccccccc is an unknown value R13=0x000000011b891d90 is at entry_point+-464 in (nmethod)0x000000011b891d90 R14=0x000000011b891d90 is at entry_point+-464 in (nmethod)0x000000011b891d90 R15=0x0000000000000d80 is an unknown value As you see it R13 and RDI points to CodeCache where non-profiled methods (C2 or C1's tier1 generated code) are located: CodeHeap 'non-profiled nmethods': size=120032Kb used=406Kb max_used=406Kb free=119625Kb bounds [0x000000011b831000, 0x000000011baa1000, 0x0000000122d69000] And it is listed in nmethods last installation events (C1 non-profiling compilation - Tier1): Event: 2.611 Thread 0x00007fa52380e800 673 1 com.sun.management.internal.DiagnosticCommandArgumentInfo::getName (5 bytes) Event: 2.611 Thread 0x00007fa52380e800 nmethod 673 0x000000011b891d90 code [0x000000011b891f60, 0x000000011b892118]
29-03-2018
Here is my analysis so far: * The failure happens very early, after just 2.6 seconds elapsed time. * The CodeCache is almost empty. * From the above I conclude that nmethod sweeping is not an issue. There are no really old nmethods and there is no memory pressure. * In print_names(), the entire CodeHeap is scanned for valid CodeBlobs. The scan should be limited to areas (granules) where, during the "aggregate" phase, at least one CodeBlob was found. * If a CodeBlob, for whatever reason and if at all possible, does not have a name (_name == NULL), the print function will fail. * Accessing the CodeHeap without holding the CodeCache_lock bears a non-zero risk to fail. I believe this risk is very close to zero, in particular with the above issues fixed. * There is a multi-dimensional trade-off to decide upon: - do we need the list of all CodeBlobs at all? If no, we remove the "MethodNames" function and we are done. - can we tolerate the CodeCache_lock to be held during the print loop (may take quite some time)? - can we live with the minimal risk of failure? - should we collect all necessary data during the "aggregate" phase (will require a lot of add'l memory)? Here is the good news: As long as nobody specifies -Xlog:codecache={Trace\|Debug}, or issues the equivalent jcmd, the impact is zero. Normal operation is therefore possible. May I please request your opinions? Based on your comments, I will come up with a solution by Thursday, EOB, if at all possible. Thanks
28-03-2018
It is by intention that CodeCache_lock is not held during printing. Printing can take a while, depending on many, uncontrollable factors. Thinking theoretically, I see nmethod installation as extremely low to zero risk. What I had seen, in the very early days of the code, was interference with nmethod sweeping. But the code was redesigned since then. I'm currently looking at the code to get an idea what theoretically could have gone wrong.
28-03-2018
There are compilation events at that time with nmethods installations: Event: 2.607 Thread 0x00007fa52380e800 647 3 java.lang.Character::toUpperCase (9 bytes) Event: 2.610 Thread 0x00007fa52400f800 nmethod 682 0x000000011b890f10 code [0x000000011b891140, 0x000000011b891858] Event: 2.611 Thread 0x00007fa52380e800 nmethod 647 0x0000000114419c90 code [0x0000000114419e80, 0x000000011441a160] Event: 2.611 Thread 0x00007fa52380e800 673 1 com.sun.management.internal.DiagnosticCommandArgumentInfo::getName (5 bytes) Event: 2.611 Thread 0x00007fa52380e800 nmethod 673 0x000000011b891d90 code [0x000000011b891f60, 0x000000011b892118] Event: 2.611 Thread 0x00007fa52380e800 651 3 java.lang.String::checkBoundsOffCount (61 bytes) Could it be that some nmethods are in a middle of installation but not complete when we try to print CodeCache? CodeCache_lock is not held (only CodeHeapStateAnalytics_lock) and it is not in SafePoint: VM state:not at safepoint (normal execution) VM Mutex/Monitor currently owned by a thread: ([mutex/lock_event]) [0x00007fa523518880] CodeHeapStateAnalytics_lock - owner thread: 0x00007fa52519c000 But print_names() scan codeCache: ast->print_cr(" Method names are dynamically retrieved from the code cache at print time.\n" CodeBlob* this_blob = (CodeBlob )(heap->find_start(low_bound+ixgranule_size+is));
28-03-2018
One interesting thing in hs_err is freee memory is < 256Mb: Memory: 4k page, physical 16777216k(241636k free)
28-03-2018
Running jtreg with additional vmoptions flag -XX:+PrintCodeCache show that CodeCache is mostly free: TEST PASSED CodeHeap 'non-profiled nmethods': size=118600Kb used=622Kb max_used=622Kb free=117977Kb bounds [0x00007f9c36a2d000, 0x00007f9c36c9d000, 0x00007f9c3ddff000] CodeHeap 'profiled nmethods': size=118596Kb used=2765Kb max_used=2765Kb free=115831Kb bounds [0x00007f9c2f65c000, 0x00007f9c2f91c000, 0x00007f9c36a2d000] CodeHeap 'non-nmethods': size=8564Kb used=4802Kb max_used=4832Kb free=3761Kb bounds [0x00007f9c2edff000, 0x00007f9c2f2bf000, 0x00007f9c2f65c000] total_blobs=2218 nmethods=1030 adapters=860 compilation: enabled From attached hs_err file: CodeHeap 'non-profiled nmethods': size=120032Kb used=406Kb max_used=406Kb free=119625Kb bounds [0x000000011b831000, 0x000000011baa1000, 0x0000000122d69000] CodeHeap 'profiled nmethods': size=120032Kb used=1177Kb max_used=1177Kb free=118854Kb bounds [0x00000001142f9000, 0x0000000114569000, 0x000000011b831000] CodeHeap 'non-nmethods': size=5696Kb used=1969Kb max_used=1986Kb free=3726Kb bounds [0x0000000113d69000, 0x0000000113fd9000, 0x00000001142f9000] total_blobs=1773 nmethods=649 adapters=860 compilation: enabled
28-03-2018
I will have a look asap (this evening my time (CET)). Is this a test creating heavy load on the compiler/CodeCache? Is the output of the jcmd commands available?
28-03-2018
I ran on Linux-x64 and was not able to reproduce it: $ cd test/jdk $ jtreg -testjdk:$JAVA_HOME -va -vmoptions:'-Xmx512m -server -ea -esa -XX:MaxRAMPercentage=12' com/sun/management/DiagnosticCommandMBean/DcmdMBeanPermissionsTest.java
28-03-2018
Lutz, could you please have a look?
28-03-2018
ILW = Crash while printing code heap statistics, single test in hs-tier5, no workaround = HLH = P2
28-03-2018
The output from the test before crash: "Testing compilerCodeHeapAnalytics"
28-03-2018
Stack: [0x00000001286b6000,0x00000001287b6000], sp=0x00000001287b4750, free space=1017k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.dylib+0x4472cf] CodeHeapState::print_names(outputStream, CodeHeap)+0x431 V [libjvm.dylib+0x43dd1e] CodeCache::print_names(outputStream)+0x72 V [libjvm.dylib+0x485add] CompileBroker::print_heapinfo(outputStream, char const, char const)+0x3e1 V [libjvm.dylib+0x54bb1f] DCmd::parse_and_execute(DCmdSource, outputStream, char const, char, Thread*)+0x159 V [libjvm.dylib+0xa865f2] jmm_ExecuteDiagnosticCommand+0x1af j com.sun.management.internal.DiagnosticCommandImpl.executeDiagnosticCommand(Ljava/lang/String;)Ljava/lang/String;+0 jdk.management@11-internal
28-03-2018
This is new code added by JDK-8198691: CodeHeap State Analytics
28-03-2018

Relates :	JDK-8209950 - SIGBUS in CodeHeapState::print_names()
Relates :	JDK-8217465 - [REDO] - Optimize CodeHeap Analytics
Relates :	JDK-8200450 - Root cause analysis for JDK-8200366
Relates :	JDK-8217250 - Optimize CodeHeap Analytics
Relates :	JDK-8216314 - SIGILL in CodeHeapState::print_names()
Relates :	JDK-8198691 - CodeHeap State Analytics