Bug ID: JDK-8244383 jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"

JDK-8244383 : jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"

Type: Bug
Component: hotspot
Sub-Component: svc-agent
Affected Version: 15

Priority: P4
Status: Resolved
Resolution: Fixed
OS: linux
CPU: x86_64

Submitted: 2020-05-04
Updated: 2024-11-13
Resolved: 2020-07-03

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 16
16 b05Fixed

Related Reports

Blocks :	JDK-8247533 - SA stack walking sometimes fails with sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp
Relates :	JDK-8247641 - HeapDumpTestWithActiveProcess.java has multiple undetected failure modes
Relates :	JDK-8231634 - SA stack walking fails with "illegal bci"
Relates :	JDK-8219405 - serviceability/sa/TestJmapCore.java failed with java.io.EOFException

Description

The following test failed in the JDK15 CI:

sun/tools/jhsdb/HeapDumpTestWithActiveProcess.java	

Here's a snippet from the log file:

----------System.out:(40/2378)----------
Starting Jshell
[JShell]:/opt/mach5/mesos/work_dir/jib-master/install/jdk-15+22-1011/linux-x64-debug.jdk/jdk-15/fastdebug/bin/jshell
[JShell] |  Welcome to JShell -- Version 15-ea
[JShell] |  For an introduction type: /help intro
Jshell Started in 3432ms
Starting jmap against 12029
[2020-05-04T21:40:21.671711Z] Gathering output for process 12098
[2020-05-04T21:40:35.291144Z] Waiting for completion for process 12098
[2020-05-04T21:40:35.291378Z] Waiting for completion finished for process 12098
Output and diagnostic info for process 12098 was saved into 'pid-12098-output.log'
jhsdb jmap stdout:
Attaching to process ID 12029, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 15-ea+22-1011

jhsdb jmap stderr:
sun.jvm.hotspot.utilities.AssertionFailure: illegal bci(-1368758984) codeSize(66)
	at jdk.hotspot.agent/sun.jvm.hotspot.utilities.Assert.that(Assert.java:32)
	at jdk.hotspot.agent/sun.jvm.hotspot.oops.ConstMethod.getLineNumberFromBCI(ConstMethod.java:310)
	at jdk.hotspot.agent/sun.jvm.hotspot.oops.Method.getLineNumberFromBCI(Method.java:282)
	at jdk.hotspot.agent/sun.jvm.hotspot.utilities.HeapHprofBinWriter.dumpStackFrame(HeapHprofBinWriter.java:751)
	at jdk.hotspot.agent/sun.jvm.hotspot.utilities.HeapHprofBinWriter.dumpStackTraces(HeapHprofBinWriter.java:730)
	at jdk.hotspot.agent/sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(HeapHprofBinWriter.java:439)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.JMap.writeHeapHprofBin(JMap.java:182)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.JMap.run(JMap.java:97)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:262)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.Tool.start(Tool.java:225)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.JMap.main(JMap.java:176)
	at jdk.hotspot.agent/sun.jvm.hotspot.SALauncher.runJMAP(SALauncher.java:331)
	at jdk.hotspot.agent/sun.jvm.hotspot.SALauncher.main(SALauncher.java:483)

###### End of all output:
[2020-05-04T21:40:35.315070Z] Waiting for completion for process 12098
[2020-05-04T21:40:35.315242Z] Waiting for completion finished for process 12098
[2020-05-04T21:40:35.315431Z] Waiting for completion for process 12098
[2020-05-04T21:40:35.315548Z] Waiting for completion finished for process 12098
Destroying jshell
Jshell destroyed
----------System.err:(46/2815)----------

Comments

Still seeing this, but it has now been relegated to extremely rare. Before pushing the fix I had done 1000's of runs with the above fix in place (probably over 10,000), with no sightings. Previously I was seeing a failure about every 1000 runs. But just now I saw it while running all tests in test/jdk/sun/tools about 3000 times. No new info in the logs, so I won't bother posting one here. Will need to keep an eye on it and see if it happens again. ...and I just realized that this happened during a -Xcomp run. The changes I made for this this CR only impact code executing in the interpreter. It could be due to a similar issue, with the frame being in an inconsistent state while pushing a new compiled frame. In ThreadStackTrace.dumpStack() we have: for (JavaVFrame vf = thread.getLastJavaVFrameDbg(); vf != null; vf = vf.javaSender()) { StackFrameInfo frame = new StackFrameInfo(vf); frames.add(frame); And the Method and BCI are then fetched from the Java vframe and used later during the hprof stack walking: public StackFrameInfo(JavaVFrame vf) { this.method = vf.getMethod(); this.bci = vf.getBCI(); } So likely vf is not actually a valid vframe. We should do some sanity checking on the vframe and abort if not valid. Also, I think this is the same issue as JDK-8247641, except in this case it managed not to get an exception during ThreadStackTrace.dumpStack() for walking into garbage frames, so the "illegal BCI" was hit later.
06-07-2020
URL: https://hg.openjdk.java.net/jdk/jdk/rev/144c75948136 User: cjplummer Date: 2020-07-03 00:22:08 +0000
03-07-2020
I think I figured it out. There's nothing special about the methods in question, other than they tend to bring about the specific situation that causes this bug. The root cause is that although we are in code generated by TemplateInterpreterGenerator::generate_normal_entry(), since no frame has been pushed we are technically still in the frame of the callee, which in this case is a compiled method, so frame->bcp points to nmethod code, not bytecodes in the Method. When creating the current Frame, we mistake it for an interpreted frame since PC points into the interpreter. This problem starts in AMD64CurrentFrameGuess.run(), which does the following: if (vm.getInterpreter().contains(pc)) { setValues(sp, fp, pc); return true; } "pc" is the PC register, which currently points to code generated by TemplateInterpreterGenerator::generate_normal_entry(), so vm.getInterpreter().contains(pc) returns true. But fp and sp are still for the callee, and in the case of this "illegal bci" failure, the callee is a compiled method. Thus frame->bcp points into the compiled method, not into the methods java bytecodes.. LinuxAMD64JavaThreadPDAccess.getCurrentFrameGuess() uses this info to create a Frame: } else if (VM.getVM().getInterpreter().contains(guesser.getPC())) { // pass the value of R13 which contains the bcp for the top level frame Address bcp = context.getRegisterAsAddress(AMD64ThreadContext.R13); return new X86Frame(guesser.getSP(), guesser.getFP(), guesser.getPC(), null, bcp); } else { return new X86Frame(guesser.getSP(), guesser.getFP(), guesser.getPC()); } So we create an X86Frame where the live_bcp field gets set to R13, although live_bcp does not play a roll in this bug. This Frame is returned to JavaThread.getLastJavaVFrameDbg(): public JavaVFrame getLastJavaVFrameDbg() { ... sun.jvm.hotspot.runtime.Frame f = getCurrentFrameGuess(); ... VFrame vf = VFrame.newVFrame(f, regMap, this, true, imprecise); return vf.isJavaFrame() ? (JavaVFrame)vf : vf.javaSender(); And newVFrame(): if (f.isInterpretedFrame()) { return new InterpretedVFrame(f, regMap, thread); } ... if (cb.isNMethod()) { ... return new CompiledVFrame(f, regMap, thread, scope, mayBeImprecise); } And Frame.isInterpretedFrame(): public boolean isInterpretedFrame() { return VM.getVM().getInterpreter().contains(getPC()); } So we end up with an InterpretedVFrame instead of a CompiledVFrame. It's returned up to getLastJavaVFrameDbg, which returns it to ThreadStackTrace.dumpStack, which saves it away as a StackFrameInfo record. public StackFrameInfo(JavaVFrame vf) { this.method = vf.getMethod(); this.bci = vf.getBCI(); this.frame = vf.getFrame(); } The issue here is with vf.getBCI(). We end up in the InterpretedVFrame version of getBCI() instead of the CompiledVFrame version. The InterpretedVFrame version simply does: return getFrame().getInterpreterFrameBCI(); Where getFrame() returns the original X86Frame, and X86Frame.getInterpreterFrameBCI() is going to try to convert frame->bcp to a BCI. But frame->bcp isn't valid since this is actually a compiled frame (and in this case frame->bcp points to another stack location), so getInterpreterFrameBCI() returns an invalid BCI which later will generate the "illegal BCI" failure. Now if a short way back we had created a CompiledVFrame instead of an InterpretedVFrame, vf.getBCI would() have fetched the bci from the ScopeDesc of the frame, which typically would produce the proper bci to use. However, this relies on frame->pc pointing to the nmethod, which is does not (it points into the interpreter method entry code). In order to fix this we would need to create an X86Frame where pc points to the invoke site in the compiled callee frame. We actually have this, but it's a bit hard to track. Early on in the interpreter method entry code it is on top of the stack. Eventually it is popped into RAX. And finally as the new interpreter frame gets pushed it gets stored in the pc_return_offset slot of the frame. Since we can't really know exactly how far along the interpreter method entry code has gotten, it would take quite a bit of trial and error to hone where the return address is stored. I think given the rarity of this bug turning up, perhaps it is best to just have getInterpreterFrameBCI() detect that bcp is not pointing into the method's bytecodes, and return 0 as the BCI. The result will be a stack trace where the source line number of the topmost frame points to the first line of the method rather than somewhere in body of the message where the "in progress" invoke was actually being made from.
27-05-2020
This is the code I added to TemplateInterpreterGenerator::generate_normal_entry() to make this issue much easier to reproduce: diff --git a/src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp b/src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp --- a/src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp +++ b/src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp @@ -1376,6 +1376,14 @@ __ jcc(Assembler::greater, loop); __ bind(exit); } + { + Label exit, loop; + __ movl(rdx, 1000); + __ bind(loop); + __ decrementl(rdx); + __ jcc(Assembler::greater, loop); + __ bind(exit); + } // initialize fixed part of activation frame generate_fixed_frame(false);
27-05-2020
So far I've managed to hit two "invalid bci" failures with some extra debugging code included in SA to help diagnose the reason. I'm still a bit unsure of what is going on, but there are a few things I can confirm. The first is that we are in code generated by TemplateInterpreterGenerator::generate_normal_entry(), but unlike what was mentioned in the previous comment, we are not yet at the point of having pushed the new frame. So I think this means we are still looking at the callee frame. In the two cases I have, the callee method is somewhat suspicious. In other words, not your typical method. The two callee's I have seen are: MethodHandleNatives.linkMethodHandleConstant() MethodHandleNatives.linkCallSite() Both of these have specialized code to handle them in the JVM. I suspect their frames might not be normal and possibly don't initialize frame->bcp. More investigation is needed. Update: Hit this a second time with MethodHandleNatives.linkMethodHandleConstant. Also, the address stored in frame->bcp seems to be a stack address pointing into a previous frame.
10-05-2020
There is definitely a window during frame pushing/initialization where where you can have an uninitialize frame->bcp. However, it's somewhat small and I haven't actually hit it yet in my testing. It's in the code generated by TemplateInterpreterGenerator::generate_normal_entry(). Once the code generated by its call to enter() is executed, the new frame is considered pushed, but is not yet initialized. If we then get passed the point of the push(rbx) to push the Method* to the frame, then it is considered a viable frame, but the bcp is not pushed until 20 or so instructions later. If we try to walk the stack at this time, we would get an Illegal BCI 0x00007f25abf54e7c: push %rbp ;; 663: __ enter(); // save old & set new rbp 0x00007f25abf54e7d: mov %rsp,%rbp <--- new frame becomes active after executed 0x00007f25abf54e80: push %r13 ;; 664: __ push(rbcp); // set sender sp 0x00007f25abf54e82: pushq $0x0 ;; 665: __ push((int)NULL_WORD); // leave last_sp as null 0x00007f25abf54e87: mov 0x10(%rbx),%r13 ;; 666: __ movptr(rbcp, Address(rbx, Method::const_offset())); // get ConstMethod* 0x00007f25abf54e8b: lea 0x38(%r13),%r13 ;; 667: __ lea(rbcp, Address(rbcp, ConstMethod::codes_offset())); // get codebase 0x00007f25abf54e8f: push %rbx ;; 668: __ push(rbx); // save Method* <--frame is now viable with a valid frame->method 0x00007f25abf54e90: mov 0x10(%rbx),%rdx ;; 670: __ load_mirror(rdx, rbx); 0x00007f25abf54e94: mov 0x8(%rdx),%rdx 0x00007f25abf54e98: mov 0x20(%rdx),%rdx 0x00007f25abf54e9c: mov 0x78(%rdx),%rdx 0x00007f25abf54ea0: mov (%rdx),%rdx 0x00007f25abf54ea3: push %rdx ;; 671: __ push(rdx); 0x00007f25abf54ea4: mov 0x18(%rbx),%rdx ;; 674: __ movptr(rdx, Address(rbx, in_bytes(Method::method_data_offset()))); 0x00007f25abf54ea8: test %rdx,%rdx ;; 675: __ testptr(rdx, rdx); 0x00007f25abf54eab: je 0x00007f25abf54eb8 ;; 676: __ jcc(Assembler::zero, method_data_continue); 0x00007f25abf54eb1: add $0x1a0,%rdx ;; 677: __ addptr(rdx, in_bytes(MethodData::data_offset())); 0x00007f25abf54eb8: push %rdx ;; 678: __ bind(method_data_continue); ;; 679: __ push(rdx); // set the mdp (method data pointer) 0x00007f25abf54eb9: mov 0x10(%rbx),%rdx ;; 684: __ movptr(rdx, Address(rbx, Method::const_offset())); 0x00007f25abf54ebd: mov 0x8(%rdx),%rdx ;; 685: __ movptr(rdx, Address(rdx, ConstMethod::constants_offset())); 0x00007f25abf54ec1: mov 0x18(%rdx),%rdx ;; 686: __ movptr(rdx, Address(rdx, ConstantPool::cache_offset_in_bytes())); 0x00007f25abf54ec5: push %rdx ;; 687: __ push(rdx); // set constant pool cache 0x00007f25abf54ec6: push %r14 ;; 688: __ push(rlocals); // set locals pointer 0x00007f25abf54ec8: push %r13 ;; 692: __ push(rbcp); // set bcp <--- at this point frame->bcp finally becomes valid The fix here is pretty simple. In X86Frame.getInterpreterFrameBCI(), if neither R13 nor frame->bcp look valid, return 0 as the bci, which actually map to the correct line number in this case since we are about the execute the first line of the method.
08-05-2020
Looking at my mach5 history, looks like after fixing JDK-8231634 in my repo I did about 1000 successful runs on linux-x64 alone without any failures, 1700 runs including OSX and Windows. Before the fix I noted "illegal bci" reproduced with this test about 3% of the time. I wonder if there is some edge case were the frame has been created but frame->bcp has not been properly setup yet, or perhaps addressOfStackSlot(), which addressOfInterpreterFrameBCX() uses, is not returning a valid stack address. We might need to validate the bcp we pull out of frame->bcp just like we validate the BCP register (R13): public int getInterpreterFrameBCI() { Address methodHandle = addressOfInterpreterFrameMethod().getAddressAt(0); Method method = (Method)Metadata.instantiateWrapperFor(methodHandle); Address bcp = addressOfInterpreterFrameBCX().getAddressAt(0); // If we are in the top level frame then the bcp may have been set for us. If so then let it // take priority. If we are in a top level interpreter frame, the bcp is live in R13 (on x86_64) // and not saved in the BCX stack slot. if (live_bcp != null) { // Only use live_bcp if it points within the Method's bytecodes. Sometimes R13 is used // for scratch purposes and is not a valid BCP. If it is not valid, then we stick with // the bcp stored in the frame, which R13 should have been flushed to. if (method.getConstMethod().isAddressInMethod(live_bcp)) { bcp = live_bcp; } } return bcpToBci(bcp, method); } JDK-8231634 added the isAddressInMethod() call to validate that R13 points within the method. If it doesn't then we use frame->bcp. Perhaps we should also call isAddressInMethod() on frame->bcp, and return 0 if it is not valid.
04-05-2020
Yeah I can see that JDK-8231634 is in place because of the following: sun.jvm.hotspot.utilities.AssertionFailure: illegal bci(-1368758984) codeSize(66) JDK-8231634 updated the exception message to include the bci and codeSize values.
04-05-2020
JDK-8231634 was recently fixed and this Mach5 jobset includes that fix.
04-05-2020