JDK-8246206 : Thread.getStackTrace has performance linear in the number of threads
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.lang
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2020-05-31
  • Updated: 2020-06-08
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Description
There is an expectation that
    Thread.getAllStackTraces()
and
    for (Thread thread : allThreads)
                    thread.getStackTrace();
have similar performance characteristics.
But the former is O(N) and the latter is O(N^2)

Demo:

---
import java.util.Collection;
import java.util.concurrent.locks.StampedLock;

/**
 * Compares the performance of {@code Thread.getAllStackTraces();}
 * with {@code for (Thread thread : allThreads) thread.getStackTrace();}
 */
public class GetStacktraceBenchmark {
    public static void main(String[] args) throws Throwable {
        int nThreads = Integer.getInteger("threads", 1000);
        int reps = Integer.getInteger("reps", 10);
        boolean getAllStackTraces = Boolean.getBoolean("getAllStackTraces");

        StampedLock lock = new StampedLock();
        long writeStamp = lock.writeLock();
        for (int i = nThreads; i--> 0; )
            new Thread(() -> lock.readLock()).start();

        Collection<Thread> allThreads = Thread.getAllStackTraces().keySet();
        for (int i = reps; i--> 0; )
            if (getAllStackTraces)
                Thread.getAllStackTraces();
            else
                for (Thread thread : allThreads)
                    thread.getStackTrace();
        lock.unlockWrite(writeStamp);
    }
}
---
 $ for java_version in 8 11 15; do for getAllStackTraces in true false; do echo --- java_version=$java_version getAllStackTraces=$getAllStackTraces ---; for ((threads = 100; threads <= 6400; threads *= 4)); do echo threads=$threads $(/usr/bin/time -f 'user %U sys %S real %e' ~/jdk/jdk$java_version/bin/java -Dthreads=$threads -DgetAllStackTraces=$getAllStackTraces GetStacktraceBenchmark 2>&1); done; done; done
--- java_version=8 getAllStackTraces=true ---
threads=100 user 0.16 sys 0.05 real 0.18
threads=400 user 0.24 sys 0.14 real 0.21
threads=1600 user 0.59 sys 0.49 real 0.55
threads=6400 user 2.56 sys 2.03 real 2.40
--- java_version=8 getAllStackTraces=false ---
threads=100 user 0.23 sys 0.09 real 0.32
threads=400 user 0.89 sys 0.20 real 0.88
threads=1600 user 11.52 sys 0.76 real 11.70
threads=6400 user 351.59 sys 4.93 real 353.76
--- java_version=11 getAllStackTraces=true ---
threads=100 user 0.19 sys 0.05 real 0.26
threads=400 user 0.35 sys 0.13 real 0.25
threads=1600 user 0.87 sys 0.62 real 0.77
threads=6400 user 4.88 sys 2.56 real 4.40
--- java_version=11 getAllStackTraces=false ---
threads=100 user 0.48 sys 0.40 real 0.38
threads=400 user 3.98 sys 2.86 real 1.18
threads=1600 user 60.18 sys 35.95 real 15.56
threads=6400 user 1187.57 sys 485.03 real 307.89
--- java_version=15 getAllStackTraces=true ---
threads=100 user 0.10 sys 0.09 real 0.27
threads=400 user 0.26 sys 0.13 real 0.20
threads=1600 user 0.62 sys 0.62 real 0.64
threads=6400 user 2.99 sys 2.48 real 3.04
--- java_version=15 getAllStackTraces=false ---
threads=100 user 0.32 sys 0.20 real 0.28
threads=400 user 3.52 sys 0.41 real 0.70
threads=1600 user 45.55 sys 1.57 real 7.38
threads=6400 user 863.45 sys 9.06 real 144.55
Comments
I agree that getting a set of stacktraces that together constitute a single atomic snapshot is a nice property. I agree that getting Thread.getStackTrace to not be a global operation would be very nice. Ideally you would want a thread dump operation on N threads to use some mechanism that affects only those threads, is as atomic as possible, and has O(N) performance. But it seems the VM doesn't offer any such mechanism. Thread coordination is a hard problem.
08-06-2020

I agree that the current performance characteristics are determined by the safepoint logic. The time to achieve and then release a global safepoint is a function of the number of active threads in the system. So the more threads you want to get the stacktrace for the more you amortise the cost of that part of the operation. The worst-case as I've said is requesting one stacktrace at a time in a loop. With handshakes there is a much lower per-thread cost but we have not yet introduced handshakes for the stack dump operation because in general they cannot be used due to the (implicit?) expectation that all stacktraces are taken at the same time. So without introducing an unexpected behavioural change the most we could do is change to use a handshake for the single case where only a single thread's stack is requested. Such as request already exists for JVM TI - JDK-8201641.
08-06-2020

I think David and I agree that the performance problem here is not the collection of the stacktrace itself, but the safepoint/handshakes required as a prerequisite. We could change the summary to """Thread safepoint/handshake has performance linear in the number of live threads""" but safepoint/handshake is a VM implementation detail not exposed to users. I'm not sure what other public APIs are affected, beyond those already identified.
04-06-2020

We seem to be going in circles. You are citing results from the code that loops on each thread calling a method that has a lot of overhead - the safepoint/handshake. The bulk collection methods amortize that cost across N threads.
03-06-2020

David: My demo program gathers 10*threads stacktraces, so total work done should be linear in threads. But: threads=1600 user 45.66 sys 1.42 real 7.27 threads=6400 user 835.17 sys 9.26 real 142.55 Converting from total to milliseconds per stacktrace collected, I get threads=1600 user 2.85 sys 0.088 real 0.454 threads=6400 user 13.04 sys 0.14 real 2.227 so the cost __per stacktrace collected__ scales approximately linearly with the number of threads. (I'm curious why user time is so much higher than real time. What is java doing??)
03-06-2020

> It seems that Threads and thread ids are maintained in internal data structures that are not scalable, requiring linear search. I'm not at all sure what you mean by that. Where is "linear search" apparent in your benchmarking? As far as I can see the primary overhead here is the safepoint/handshake needed to get the target thread's information. The more threads you process in that one safepoint/handshake, the more that overhead is amortized. At one per thread you have the worst performance and the cost is primarily that of the safepoint/handshake.
03-06-2020

getThreadInfo(long ...) and getStackTrace have similar performance characteristics - O(live threads) getThreadInfo(long[]...) and getAllStackTraces are bulk operations that avoid the per-thread cost, and also have similar performance characteristics. getThreadInfo(long[]....) is obviously a more flexible bulk method than getAllStackTraces because the caller can select any set of threads, as David pointed out. It seems that Threads and thread ids are maintained in internal data structures that are not scalable, requiring linear search.
02-06-2020

Here's a version of the benchmark that includes getThreadInfo: --- import java.lang.management.ManagementFactory; import java.lang.management.ThreadMXBean; import java.util.Collection; import java.util.concurrent.locks.StampedLock; /** * Benchmarks the performance of various ways of collecting stack traces. * Usage: * java GetStackTraceBenchmark [-Dreps=N] [-Dthreads=N] * [Dmethod=getStackTrace|getAllStackTraces|getThreadInfo|bulk-getThreadInfo] */ public class GetStackTraceBenchmark { public static void main(String[] args) throws Throwable { final int nThreads = Integer.getInteger("threads", 1000); final int reps = Integer.getInteger("reps", 10); final String method = System.getProperty("method", "getStackTrace"); ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean(); StampedLock lock = new StampedLock(); long writeStamp = lock.writeLock(); for (int i = nThreads; i--> 0; ) new Thread(() -> lock.readLock()).start(); Collection<Thread> allThreads = Thread.getAllStackTraces().keySet(); long[] threadIds = new long[allThreads.size()]; { int i = 0; for (Thread thread : allThreads) threadIds[i++] = thread.getId(); } for (int i = reps; i--> 0; ) switch (method) { case "getStackTrace": for (Thread thread : allThreads) thread.getStackTrace(); break; case "getAllStackTraces": Thread.getAllStackTraces(); break; case "getThreadInfo": for (long threadId : threadIds) threadMXBean.getThreadInfo(threadId, Integer.MAX_VALUE); break; case "bulk-getThreadInfo": threadMXBean.getThreadInfo(threadIds, false, false); break; default: throw new Error("bad method: " + method); } lock.unlockWrite(writeStamp); } } --- $ for java_version in 8 11 15; do for method in getStackTrace getAllStackTraces getThreadInfo bulk-getThreadInfo; do echo --- jdk$java_version method=$method ---; for ((threads = 100; threads <= 6400; threads *= 4)); do echo threads=$threads $(/usr/bin/time -f 'user %U sys %S real %e' ~/jdk/jdk$java_version/bin/java -Dthreads=$threads -Dmethod=$method GetStackTraceBenchmark 2>&1); done; done; done --- jdk8 method=getStackTrace --- threads=100 user 0.23 sys 0.09 real 0.41 threads=400 user 0.98 sys 0.16 real 0.93 threads=1600 user 11.85 sys 0.83 real 12.04 threads=6400 user 356.87 sys 4.93 real 359.09 --- jdk8 method=getAllStackTraces --- threads=100 user 0.19 sys 0.02 real 0.21 threads=400 user 0.24 sys 0.16 real 0.21 threads=1600 user 0.57 sys 0.52 real 0.52 threads=6400 user 2.51 sys 2.04 real 2.42 --- jdk8 method=getThreadInfo --- threads=100 user 0.24 sys 0.08 real 0.30 threads=400 user 1.06 sys 0.19 real 1.01 threads=1600 user 12.55 sys 0.85 real 12.73 threads=6400 user 375.42 sys 5.39 real 377.86 --- jdk8 method=bulk-getThreadInfo --- threads=100 user 0.17 sys 0.05 real 0.38 threads=400 user 0.32 sys 0.13 real 0.24 threads=1600 user 1.06 sys 0.46 real 0.92 threads=6400 user 11.64 sys 1.89 real 11.34 --- jdk11 method=getStackTrace --- threads=100 user 0.53 sys 0.27 real 0.36 threads=400 user 4.21 sys 2.71 real 1.21 threads=1600 user 61.33 sys 34.93 real 16.15 threads=6400 user 1229.21 sys 471.63 real 309.35 --- jdk11 method=getAllStackTraces --- threads=100 user 0.23 sys 0.04 real 0.25 threads=400 user 0.33 sys 0.16 real 0.25 threads=1600 user 0.86 sys 0.67 real 0.82 threads=6400 user 5.54 sys 2.41 real 4.42 --- jdk11 method=getThreadInfo --- threads=100 user 0.59 sys 0.28 real 0.37 threads=400 user 4.01 sys 2.90 real 1.23 threads=1600 user 67.80 sys 35.07 real 16.10 threads=6400 user 1201.41 sys 485.52 real 310.20 --- jdk11 method=bulk-getThreadInfo --- threads=100 user 0.28 sys 0.06 real 0.28 threads=400 user 0.38 sys 0.19 real 0.38 threads=1600 user 1.07 sys 0.61 real 0.86 threads=6400 user 5.28 sys 2.51 real 4.83 --- jdk15 method=getStackTrace --- threads=100 user 0.41 sys 0.14 real 0.28 threads=400 user 3.51 sys 0.32 real 0.71 threads=1600 user 45.66 sys 1.42 real 7.27 threads=6400 user 835.17 sys 9.26 real 142.55 --- jdk15 method=getAllStackTraces --- threads=100 user 0.15 sys 0.06 real 0.21 threads=400 user 0.25 sys 0.14 real 0.21 threads=1600 user 0.65 sys 0.62 real 0.67 threads=6400 user 3.30 sys 2.31 real 3.08 --- jdk15 method=getThreadInfo --- threads=100 user 0.43 sys 0.18 real 0.42 threads=400 user 3.49 sys 0.44 real 0.73 threads=1600 user 45.71 sys 1.49 real 7.55 threads=6400 user 843.79 sys 9.58 real 142.73 --- jdk15 method=bulk-getThreadInfo --- threads=100 user 0.14 sys 0.11 real 0.23 threads=400 user 0.30 sys 0.15 real 0.24 threads=1600 user 0.74 sys 0.57 real 0.66 threads=6400 user 3.64 sys 2.39 real 3.41
02-06-2020

Mandy: The stacktraces are part of an occasional sampling monitoring of the state of a set of threads. It is expected that collecting a thread's stack will be expensive, but not that it scales linearly with the number of live threads, getStackTrace() should never take > 1ms. StackWalker can't be used because it only works with the __current__ thread.
02-06-2020

how do you use the returned StackTraceElement[]? If you are interested in only top N frames, you may consider using StackWalker (its performance is not fully optimized and so if the performance number is not satisfactory, there should have room for improvement).
02-06-2020

java.lang.management.ThreadMXbean::getThreadInfo���(long[] ids, boolean lockedMonitors, boolean lockedSynchronizers) provides the requested API. I would suggest a server subsystem managing a set of threads use that instead. If you wanted to add a new API to Thread that provided the same functionality that would also be an option. But I don't see any way to improve the performance of Thread.getStackTrace() when used in a loop, if it is the per-thread safepoint that is the primary bottleneck. Though as I said handshakes may make a difference here.
01-06-2020

David, can we agree that improving the performance of Thread.getStacktrace is a worthwhile goal? If a server subsystem that manages a set of threads wants to collect stacktraces for those threads, it faces a dilemma: They could call Thread.getStacktrace for each thread, risking O(N) behavior, or They could call Thread.getAllStackTraces, risking requesting much more data than they need. There's no Thread.getStackTraces(List<Thread> threads) bulk operation that is Just Right for this use case.
01-06-2020

I reject the hypothesis. Why should a bulk operation have the same performance characteristics as looping over all threads one by one? There is a general expectation that bulk operations are more efficient that manual iteration - else why bother to provide them. getStacktrace() of an arbitrary thread requires a safepoint (or handshake) per thread. getAllStackTraces() uses a single safepoint (depending on JDK version as handshakes may be changing this) to get all the stacks.
01-06-2020