JDK-8269881 : SA stack dump fails to include stack trace for SteadyStateThread
  • Type: Bug
  • Component: hotspot
  • Sub-Component: svc-agent
  • Affected Version: 11,17,17.0.10,18,19,20
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: linux,os_x,windows
  • CPU: x86_64,aarch64
  • Submitted: 2021-07-05
  • Updated: 2024-11-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Duplicate :  
Duplicate :  
Duplicate :  
Duplicate :  
Duplicate :  
Relates :  
Relates :  
Description
SA sometimes can't get the stack trace for a thread. This is known to be an issue when the thread is active because sometimes ToS and "last java frame" cannot be determined. For this reason it is not considered a bug, but just a shortcoming since SA can't be expected to get a thread's stack trace no matter what state it is in.

A number of tests used to fail because of this. See JDK-8243455, which was fixed by introducing the SteadStateThread, a thread which blocks so jstack should always be able to get its stack trace. This fix has been in place since 2021-03-12, but we recently just saw for the first time a failure to get the stack of the SteadyStateThread. This is all jstack produces for the thread:

 "SteadyStateThread" #10 prio=5 tid=0x000001afc3e12af0 nid=0x85f4 waiting for monitor entry [0x000000922a6ff000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   JavaThread state: _thread_blocked 

It's unclear how this could happen. SA does need to do a fair amount of guessing and trial-and-error in its attempt to walk the stack. Possibly it thought it found a valid ToS, but that led to no frames, and because it thought it was valid it didn't continue searching for another valid ToS.

Note this is happening not only when using jstack, but also with dumpheap, which includes the stack trace in the heap dump.
Comments
We don't have many tests that look for specific stack dump results. The ones that do always use SteadyStateThread or a special version of LingeredApp that blocks on something. We do have a stack dumping stress test that just tries to make sure SA handles errors gracefully. The reality is that if you throw jstack at a running java app enough times, you will see failures to produce a stack trace.
12-07-2024

> I'm not sure what you mean by this. What I meant was, why do we not see this problem with a missing stack much more often in different tests? Is there something special about this SteadyStateThread in how it blocks? Otherwise if this is a general issue with using SP I would expect to see it far more often.
12-07-2024

> But I should note that it seems very odd that only the steadyStateThread ever seems to encounter this kind of problem ... I'm not sure what you mean by this. The SteadyStateThread is the only thread that we wait on until it is blocked and then run jstack on it. It was created because tests used to instead look at the stack trace of the the main LingeredApp thread, which is usually in Thread.sleep(). But the sleep is short (and repeated in a loop), so there are periods of time where you are in the middle of calling or returning from sleep, and that resulted in this similar issue of not being able to get the stack trace. It was a very frequent problem until SteadyStateThread was added.
12-07-2024

> Hmmm if a thread is safe to stackwalk then _last_Java_fp should always be set. > Conversely, an arbitrary reading of SP without knowing what the target was executing seems highly unreliable for stack walking. This is SA. Everything it does is unsafe, thus it does it also does a lot of safe guarding. Various "address error" related exceptions are expected and SA does it's best to deal with them or abort in a meaningful way. The code responsible for finding the first frame is done AARCH64CurrentFrameGuess.run(). Actually it now looks to me that if the topmost frame is not a java frame, then indeed _last_Java_fp is used. I missed it in my search for _last_Java_fp users. However, for java frames it tries to derive SP and FP from current registers. For C2 frames it walks the stack starting with SP looking for what appears to be a valid first frame by first assuming it is valid and then trying to use it to walk the stack to first entry frame. I'm even more unsure now how sometimes we can't walk the SteadyStackStack since we should be using _last_Java_fp from the moment we enter the OS call. Maybe we haven't entered the OS call yet, and PC still shows that we are in java code, but it seems SP and FP should be valid in that case.
12-07-2024

Hmmm if a thread is safe to stackwalk then _last_Java_fp should always be set. Conversely, an arbitrary reading of SP without knowing what the target was executing seems highly unreliable for stack walking. But I should note that it seems very odd that only the steadyStateThread ever seems to encounter this kind of problem ...
11-07-2024

It gets SP by using ptrace to fetch the thread's registers. I think _last_Java_fp is not used because it often is not up to date. Keep in mind that SA can attach at any point in a thread's execution.
11-07-2024

> It turns out that SA uses SP, not _last_Java_fp I'm not really sure what that means in terms of trying to walk the stack - how does SA read the actual SP of another thread? In what way is SP invalid - is it that it no longer matches the target thread's actual state? And why doesn't the SA use _last_Java_FP?
11-07-2024

[~dholmes] Been meeting to get back to this. Sorry about the delay. It turns out that SA uses SP, not _last_Java_fp, to produce the stack trace. If SP is bad, it will fail. So even though you say _last_Java_fp is flushed before marking the thread as blocked, it doesn't really help the stack trace effort. Once the JVM makes the blocking OS call, for a period of time (until the OS call truly blocks), SP could be invalid. However, I'm not so sure my proposed fix here will be of much help. For the test in this CR, once the LingeredApp SteadyStateThread is marked as blocked, the debugger needs to launch the SA jstack tool against LingeredApp and dump all threads. It's hard to believe the tool gets launched and starts dumping the SteadyStateThread before the SteadyStateThread is truly blocked. Perhaps the OS occasionally wakes up the SteadyStateThread while it is blocked, and this is when SP is invalid.
11-07-2024

A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/19951 Date: 2024-06-28 20:34:48 +0000
09-07-2024

As you pointed out yourself the thread does not stop executing when the thread state changes, it still has to perform other actions before actually doing the OS blocking operation - but that is not all it does, particularly on the way out of a blocking operation. Any general states changes, e.g. to _last_Java_fp, should be "flushed" before the thread state is changed e.g. to _thread_blocked
03-07-2024

[~dholmes] Can you elaborate? Are you saying that when the JVM makes some sort of blocking calling, like trying to enter a mutex, the state of the stack might still change while it remains blocked? Regarding the java stack being stable, are you referring to _last_Java_fp. I'm pretty sure that is what the SA jstack code is using in this case, and it should be stable. My thinking is that the thread has made the transition to the BLOCKED state, but hasn't made the underlying blocking OS call yet, nor had the chance to flush _last_Java_fp before making the call. If that is the case, another solution might be to do something like Thread.sleep(1) just to get _last_Java_fp flushed.
02-07-2024

The native stack can always be in a state of flux when a thread is logically blocked or otherwise safepoint-safe. Only the Java stack is guaranteed to be stable.
02-07-2024

[~kevinw] comments in JDK-8335124 gave me a thought as to why we sometimes can't get the stack trace of the SteadyStateThread. Kevin noted in JDK-8335124 that the test in question was waiting for a thread to be sleeping by waiting for it to be in the Thread.State.WAITING state. However, when this state is first set, the thread still has a little bit of work to do before it is actually sleeping (it needs to call into some OS api to sleep). In the case of JDK-8335124, this resulted in some unexpected CPU time being attributed to a thread. When creating the SteadyStateThread, we sit in a spin loop until the SteadyStateThread is in the Thread.State.BLOCKED state (which it gets to by doing a synchronized). Likely once the thread moves into the Thread.State.BLOCKED state, it still has a bit of execution to do before truly going idle (it needs to call into some OS api to block). This could mean the stack is in an inconsistent state for a short while after being moved to the Thread.State.BLOCKED state. We should do a short sleep after detecting it is in the Thread.State.BLOCKED state just to make sure the thread really has blocked and is no longer executing. This is the code in question: private static volatile boolean steadyStateReached = false; private static void steadyState(Object steadyStateObj) { steadyStateReached = true; synchronized(steadyStateObj) { } } private static void startSteadyStateThread(Object steadyStateObj) { Thread steadyStateThread = new Thread() { public void run() { steadyState(steadyStateObj); } }; steadyStateThread.setName("SteadyStateThread"); steadyStateThread.start(); // Wait until the thread has started running. while (!steadyStateReached) { Thread.onSpinWait(); } // Now wait until we get into the synchronized block. while (steadyStateThread.getState() != Thread.State.BLOCKED) { Thread.onSpinWait(); } --------> We should do a short sleep here. }
28-06-2024

From JDK-8313313 ClhsdbFindPC.java#id3 failed with ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1: "SteadyStateThread" #10 prio=5 tid=0x0000027fc0775030 nid=0x769c waiting for monitor entry [0x00000097efefe000] java.lang.Thread.State: BLOCKED (on object monitor) JavaThread state: _thread_blocked So the following code in the test fails because the LingeredApp.steadyState() frame is not found. // Extract pc address from the following line: // - LingeredApp.steadyState(java.lang.Object) @bci=1, line=33, pc=0x00007ff18ff519f0, ... String pcAddress = null; String[] parts = jStackOutput.split("LingeredApp.steadyState");
28-07-2023

Note the serviceability/sa/ClhsdbThreadContext.java failures are not the same issue. They are covered by JDK-8292201.
30-08-2022

I added a case that matched what Dan still reports above for jdk-19+6-271-tier7. Seems there are far too many different, but similar, failure modes for these tests. :(
27-01-2022

I think that's a different issue. In the other failures the SteadyStateThread is found, but no stack trace can be generated for it. In this ClhsdbThreadContext failure (which is a new test), it seems it find thread 23, which should be the SteadyStateThread thread as seen in the "threadcontext -a" and "threadcontext -a -v" output. Created JDK-8280770.
27-01-2022

test=serviceability/sa/ClhsdbThreadContext.java Test ERROR java.lang.RuntimeException: 'Thread "SteadyStateThread"' missing from stdout/stderr Based on the log hsdb initially finds the SteadyStateThread as thread 23, but when it executes the second time to get the context of thread 23, it gets an error: hsdb> + threadcontext -v 23 Couldn't find thread 23
27-01-2022

Here's a log file snippet from the jdk-19+5-158-tier3 sighting: serviceability/sa/ClhsdbFindPC.java#no-xcomp-core "SteadyStateThread" #9 prio=5 tid=0x0000fffca02c3910 nid=2806627 waiting for monitor entry [0x0000fffca464e000] java.lang.Thread.State: BLOCKED (on object monitor) JavaThread state: _thread_blocked Locked ownable synchronizers: - None hsdb> + quit ----------System.err:(15/1017)---------- java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:306) at ClhsdbFindPC.main(ClhsdbFindPC.java:328) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:140) ... 5 more JavaTest Message: Test threw exception: java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 JavaTest Message: shutting down test STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 ----------rerun:(34/6966)*----------
07-01-2022

The following test failed in the JDK18 CI: serviceability/sa/ClhsdbFindPC.java#no-xcomp-process Here's a snippet from the log file: ----------System.err:(15/1032)---------- java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:284) at ClhsdbFindPC.main(ClhsdbFindPC.java:306) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:577) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:140) ... 5 more JavaTest Message: Test threw exception: java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 JavaTest Message: shutting down test STATUS:Failed.`main' threw exception: java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 ----------rerun:(37/5615)*---------- SteadStateThread is missing its stacktrace: "SteadyStateThread" #10 prio=5 tid=0x000001ebffa99030 nid=8416 waiting for monitor entry [0x000000a1832ff000] java.lang.Thread.State: BLOCKED (on object monitor) JavaThread state: _thread_blocked Locked ownable synchronizers: - None
13-11-2021

Test: serviceability/sa/ClhsdbDumpheap.java: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'LingeredApp.steadyState' missing from stdout/stderr at ClhsdbDumpheap.main(ClhsdbDumpheap.java:177) This is the fist macosx sighting. Previous had all been on windows. "SerialNo 8" is suppose to show the stack trace of the SteadyStateThread, but it is empty: HprofReader.getStack() output: Dump file created Fri Oct 22 15:59:27 GMT 2021 SerialNo 8 SerialNo 7 java.lang.Object.wait(J)V (Object.java:(native method)) java.lang.ref.ReferenceQueue.remove(J)Ljava/lang/ref/Reference; (ReferenceQueue.java:155) jdk.internal.ref.CleanerImpl.run()V (CleanerImpl.java:140) java.lang.Thread.run()V (Thread.java:833) jdk.internal.misc.InnocuousThread.run()V (InnocuousThread.java:162) SerialNo 6 SerialNo 5 SerialNo 4 java.lang.Object.wait(J)V (Object.java:(native method)) java.lang.ref.ReferenceQueue.remove(J)Ljava/lang/ref/Reference; (ReferenceQueue.java:155) java.lang.ref.ReferenceQueue.remove()Ljava/lang/ref/Reference; (ReferenceQueue.java:176) java.lang.ref.Finalizer$FinalizerThread.run()V (Finalizer.java:175) SerialNo 3 java.lang.ref.Reference.waitForReferencePendingList()V (Reference.java:(native method)) java.lang.ref.Reference.processPendingReferences()V (Reference.java:253) java.lang.ref.Reference$ReferenceHandler.run()V (Reference.java:215) SerialNo 2 java.lang.Thread.sleep(J)V (Thread.java:(native method)) jdk.test.lib.apps.LingeredApp.main([Ljava/lang/String;)V (LingeredApp.java:599) SerialNo 1
25-10-2021

serviceability/sa/ClhsdbCDSJstackPrintAll.java failed with: java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'LingeredApp.steadyState' missing from stdout/stderr at ClhsdbCDSJstackPrintAll.main(ClhsdbCDSJstackPrintAll.java:117) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:76) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:51) at java.base/java.lang.reflect.Method.invoke(Method.java:569) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:833)
21-10-2021

serviceability/sa/ClhsdbWhere.java failed with: execStatus=Failed. Execution failed\: `main' threw exception\: java.lang.RuntimeException\: Test ERROR java.lang.RuntimeException\: 'private static void steadyState' missing from stdout/stderr The statck trace for the SteadyStateThread is missing: Java Stack Trace for SteadyStateThread Thread state = BLOCKED
21-09-2021

From what I can tell of the SA code, if the thread is not currently executing in a java frame (by looking at the PC register), then JavaThread::__last_Java_sp is fetched and used as the stack pointer. This is done by X86CurrentFrameGuess.run(), which is called indirectly by JavaThread::getLastJavaVFrameDbg(), which I think for some reason is returning NULL.
06-07-2021

The manner in which the test fails as a result of this bug will depend on the test. The one failure so far was with ClhsdbFindPC#id3, which failed as follows (.jtr files attached): java.lang.RuntimeException: Test ERROR java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:284) at ClhsdbFindPC.main(ClhsdbFindPC.java:306) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 at ClhsdbFindPC.testFindPC(ClhsdbFindPC.java:140) ... 7 more
05-07-2021