JDK-8180450 : secondary_super_cache does not scale well
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10,11,12,13,17,20,21,22
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2017-05-16
  • Updated: 2024-09-16
  • Resolved: 2024-04-16
JDK 23: Fixed in b19
Description
On some workloads, updates to the Klass::secondary_super_cache field
cause excessive cache line invalidation traffic, with noticeable slowdowns.

Specifically, the cache itself may become unstable (which is a normal corner case for one-element caches) and at that point a multi-threaded application may begin "hammering" on the cache line from multiple threads, causing an explosion of coherence traffic.

One customer reported this as happening when multiple threads were traversing heterogeneous sequences of objects, testing the same classes against more than one interface, with rapid variation between the interfaces.  

In such a case, two interfaces could compete to use the single SSC slot on each class that occurs in the object sequence.  The competition would turn into frequent updating of the SSC slots by multiple threads, causing cache lines to ping-pong between processors.

To fix this, the SSC has to have some sort of limit on its update rate, or be replaced by a mechanism that scales better.

The simplest fix is probably to put an "update count" profile counter somewhere, and consult that counter just before updating the SSC.  If the counter is too high (evidence of a high contention rate), don't update the SSC.  The trade-off is between linear searches of the Klass::secondary_supers array (which is stable and therefore replicated across caches) versus time spent waiting to acquire write access to the SSC (which may be hundreds of cycles).  Linear search will easily win in those cases, except of course for very dense dynamic query mixes over very complex interface graphs, which is a corner case we can leave for the future.

The obvious place to put the update count is next to the SSC, on the same cache line.  When the miss count overflows past some selected threshold, the SSC is left unchanged.  On balance the extra footprint of a 32-bit field per Klass seems acceptable.

Such a counter should be allowed to decay, so that temporary bursts in type test complexity do not shut down the SSC forever.
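
A minimal sketch of this per-Klass counter scheme follows. The field and constant names are invented for illustration, and the real subtype check is emitted by the macro assembler, so this C++ shape is purely illustrative:

    #include <cstdint>

    // Hypothetical sketch only: names and thresholds are invented here.
    class Klass {
      Klass*   _secondary_super_cache;  // existing one-element cache (SSC)
      uint32_t _ssc_miss_count;         // proposed counter, same cache line

      static const uint32_t SSC_UPDATE_THRESHOLD = 16;  // illustrative value

      bool search_secondary_supers(Klass* k) {
        // Stand-in for the real linear scan of the stable
        // Klass::secondary_supers array (elided here).
        (void)k; return true;
      }

    public:
      bool is_subtype_of_slow(Klass* k) {
        bool found = search_secondary_supers(k);
        if (found) {
          if (_ssc_miss_count < SSC_UPDATE_THRESHOLD) {
            _ssc_miss_count++;           // saturates, so once the threshold is
                                         // reached the hot line is never written
            _secondary_super_cache = k;  // still cheap while contention is low
          }
          // else: leave the SSC unchanged and rely on the linear scan
        }
        return found;
      }

      // Called occasionally (e.g. from a periodic task) so that a temporary
      // burst of type-test churn does not shut the cache down forever.
      void decay_ssc_counter() { _ssc_miss_count >>= 1; }
    };

Note that the increment itself must stop once the counter saturates; otherwise the counter writes would keep the cache line just as hot as the SSC updates they replace.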

Another possible fix would be a thread-local update counter for the SSC, under JavaThread::current.  In that case, only Java code could use the extra fix to avoid cache contention, but that is probably acceptable also.  This fix would be significantly more complex, but would have the benefit that only "offending" threads would throttle themselves.
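
A minimal sketch of the thread-local variant, again with invented names: each thread carries a small update budget that is consulted before any SSC write and replenished periodically, so only the "offending" threads throttle themselves.

    #include <cstdint>

    // Hypothetical sketch only: names and constants are invented here.
    class JavaThread {
      uint32_t _ssc_update_budget;

      static const uint32_t SSC_BUDGET = 32;  // illustrative value

    public:
      JavaThread() : _ssc_update_budget(SSC_BUDGET) {}

      // Consulted just before an SSC write; a thread that keeps missing
      // exhausts its budget and stops touching the contended cache line.
      bool may_update_ssc() {
        if (_ssc_update_budget == 0) return false;
        _ssc_update_budget--;
        return true;
      }

      // Replenished periodically (e.g. at safepoints) so throttling decays.
      void replenish_ssc_budget() { _ssc_update_budget = SSC_BUDGET; }
    };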

Similarly, the counter could be placed in the MethodData object which carries the profile of the instruction which is causing the SSC contention.  (This instruction could be instanceof, checkcast, aastore, or a call to an intrinsic method that emulates one of those.)  This fix would be even more complex than the thread-based fix, and would probably be overkill given the relatively small importance of the problem.

If the secondary_supers lists ever grow in length to more than a few tens of elements, additional mechanisms may be needed for quickly testing the subtype relation.  Probably a tree walk would be sufficient.  Sometimes unified caches (global or thread-local) are proposed, or perhaps unified numbering schemes, but those, also, seem overkill for this problem.
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk22u/pull/166 Date: 2024-04-25 13:38:56 +0000
11-07-2024

Changeset: f11a496d Author: Andrew Haley <aph@openjdk.org> Date: 2024-04-16 14:21:48 +0000 URL: https://git.openjdk.org/jdk/commit/f11a496de61d800a680517457eb43b078a633953
16-04-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/18309 Date: 2024-03-14 18:24:11 +0000
14-03-2024

Setting this to tbd in favor of JDK-8316180.
09-11-2023

Out of curiosity, does there exist (or could there exist) a write mode that is _weaker_ than plain on Intel or other CPUs? Something that might (for example) write the value to the local L1 cache but without directly causing an eventual subsequent write to other cache levels or main memory (or perhaps doing so only if it can be determined that the cost would be low or none)? Such a mode could hypothetically have many of the advantages of all of these options put together, with few of the drawbacks.
14-09-2023

That sounds reasonable, Aleksey. We're converting the benchmark to JMH form, but will post it soon (then a PR).
14-09-2023

Hi Derek! Excellent news, thanks! Would you mind sharing the benchmark somewhere?

I think we are still talking about stop-gap mitigation strategies, not the Real Fix (tm), right? If so, simplicity is a virtue. For a backportable fix for production use, I would be concerned about never updating the cache after the first install (#3), since we can get unlucky and pollute the cache with a bad value during the startup phase. Never updating the cache (#2) also looks risky for prod use, as we penalize even the uncontended case, and we would not know the impact until we tried it in prod. That feels unsatisfactory for a mitigating fix. The better strategy looks like something that would eventually converge to the most frequently cached value, which leaves us with #4.

#4A is basically JDK-8316180. I don't think the backoff counter per Java thread is a big deal footprint-wise. That counter can be made more generic, to cover other contended caches, if any. (I actually had the JDK-8316180 code handy from the compilation-counters contention work I did in the background.) The statistical drawback I see with #4A is that we might have sampling problems if we see the "bad" class exactly every N lookups, and only ever install that one.

#4B mitigates the sampling problem of #4A, but it becomes architecture-specific, relying on the platform providing fast unprivileged access to TSC-like counters. Unless we want to accept a VM call to `os::nanotime()`, which I think we don't. I am not sure how much additional latency the TSC read would bring; it feels odd to add more latency on the can-update-cache checking path than we have for the linear scans. The experiments with #4A seem to show that the cost of the scan on trivial contending code is a handful of nanoseconds.

Nerd-snipy idea: I wonder if we can combine #4A and #4B by fuzzing the counter in #4A with a "random" counter from #4B (a sketch of this appears below). This saves latency on the fast-slow-path, and accepts the TSC latency on the slow-slow-path. But maybe #4A is already good on its own, especially given its simplicity.

So, my proposal would be to go with JDK-8316180: implement more architectures, fold Derek's benchmarks in there, figure out a sensible backoff using those benchmarks, and deploy that fix as the stop-gap mitigation. Once work on the Real Fix (tm) restarts here, we would have the benchmarks ready to go, and there would be a more aggressive baseline that the fix would need to beat to justify its (probably higher) complexity. What do you think, Derek?
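
A minimal sketch of what #4A, and the fuzzed #4A/#4B combination, could look like. The names, the backoff constant of 64, and the x86-only __rdtsc() read are illustrative assumptions, not code from JDK-8316180:

    #include <cstdint>
    #include <x86intrin.h>  // __rdtsc(); x86-only, the arch-dependence noted above

    // Hypothetical sketch only; names and constants are invented here.
    static thread_local uint32_t ssc_miss_count = 0;

    // Option #4A: plain per-thread counter, permitting one SSC update per
    // 64 misses. Sampling hazard: a "bad" class arriving exactly every
    // 64th lookup would be the only one ever installed.
    static bool may_update_ssc_4a() {
      return (ssc_miss_count++ & 63) == 0;
    }

    // #4A fuzzed with #4B: the TSC is read only for the rare candidate
    // updates (the slow-slow-path), so its latency stays off the common path.
    static bool may_update_ssc_4a_fuzzed() {
      if ((ssc_miss_count++ & 63) != 0) return false;  // fast-slow-path: no TSC read
      return (__rdtsc() & 3) == 0;                     // ~25% of candidates pass
    }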
14-09-2023

Thanks for looking at this, Aleksey! There are a few solutions to this problem:

1: Keep the current secondary supercache code.
2: Never write to the secondary supercache.
3: Only write to the secondary supercache once.
4: Only write to the secondary supercache sometimes:
   A) by keeping a thread-local counter as suggested by Aleksey, at the cost of a thread-local; or
   B) by sampling some bits of the core's cycle counter (AKA a pseudo-random number) and writing only if "< X", maybe at the cost of a little more code.

We've also been working on an "instanceof" benchmark that scales:
- the number of threads,
- the ratio of reads/writes of the secondary supercache,
- the length of the interface list.

We have compared running with the secondary supercache enabled and disabled (#1 vs #2). We don't have everything rolled up, but some interesting observations are:
- With only 1 thread, the secondary supercache wins in all scenarios.
- If the secondary supercache is updated even 5% of the time, disabling it wins (with more than a few threads and "normal" interface list lengths). E.g., if the secondary supercache is never updated beyond the first time, it wins; otherwise it quickly starts losing.
- The interface list has to be really long (100+ interfaces) to make the secondary supercache win if there is even a moderate amount of contention.

So this is pointing me to a simpler fix, #3: only update the secondary supercache the first time (when NULL), with a backup choice of #4B if we don't want to use up a thread-local for every software cache that doesn't scale. (A sketch of #3 follows below.)

Note that in this benchmark the interface array is being hit heavily, so it tends to be cached. In a "real application" it might fall out of cache, but on the other hand, if the array is long, it is easily HW-prefetchable. Also, in a "real app" there would be other code running between calls to instanceof, but there would still be cache-line contention between cores that impacts memory latencies. So it would be a car crash in slow motion, but still a (performance) crash.
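
A minimal sketch of option #3, with invented names (as with the other sketches, the real check lives in assembler stubs, not C++):

    // Hypothetical sketch of "write once, when NULL" only.
    class Klass;

    static bool search_secondary_supers(Klass* self, Klass* k) {
      (void)self; (void)k;
      return true;  // stand-in for the real linear scan over secondary_supers
    }

    bool check_secondary_super(Klass** ssc_slot, Klass* self, Klass* k) {
      if (*ssc_slot == k) return true;                 // cache hit
      if (!search_secondary_supers(self, k)) return false;
      if (*ssc_slot == nullptr) {
        *ssc_slot = k;  // written at most once per Klass: whichever super is
                        // seen first claims the slot, so the cache line never
                        // ping-pongs afterwards
      }
      return true;
    }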
13-09-2023

The fix proposed in JDK-8316180 does look promising as a stop-gap solution for the scalability issue.
13-09-2023

Derek, have you had any success with this? I think as a stop-gap we can do TLR counters, like JDK-8316180. It seems to considerably alleviate the problems in very contended cases with just a little bit of backoff.
13-09-2023

Deferral Request (JDK 21): We are running out of time for JDK 21 and the complexity / risk of a full fix is high. [~drwhite] if you have a simple stop-gap solution for JDK 21, feel free to propose it for review/discussion.
29-06-2023

We wanted to reverify that the secondary supercache provides a benefit on modern HW before trying to patch it. We went back to the original "Fast subtype checking in the HotSpot JVM" paper by Cliff Click and John Rose [1] and are trying to reproduce the results that motivated the optimization. BTW, SPECjbb 2000 and SPECjvm98 are showing their age :-) We're still making sure that we are scaling the benchmark properly, but some early results point to the optimization ceasing to scale much earlier than I expected, compared to not updating the cache at all. If that holds true, there may be an even simpler patch :-) [1] https://www.researchgate.net/publication/221552851_Fast_subtype_checking_in_the_HotSpot_JVM
25-05-2023

Thank you for your efforts on this issue. I'm wondering, did the simple scheme ever bear fruit, or is it still too early to tell?
24-05-2023

Thanks Vladimir, we'll take a stab at a fix and see where it's appropriate to use.
20-04-2023

Hi Derek [~drwhite], the current plan for addressing the issue in 21 is to replace the linear search over the secondary_supers array with a faster lookup implementation, which would make secondary_super_cache obsolete. If you have a simple stop-gap solution, please share the details (or post it for review). Speaking of backports, I would be much more comfortable with a straightforward change that mitigates the scalability issue rather than bringing in a complete rewrite.
18-04-2023

Hi John, Vladimir, how likely is it that a fix for this will get into JDK 21? I'm quite anxious to get something in soon, even if it's an interim fix. Our kernel folks suggested a really simple scheme that we'd be happy to put out for review soon, but if you're already working on a more global fix for JDK 21, that's fine.
14-04-2023

Many thanks [~thartmann], glad to hear! We can help with performance tests.
15-11-2022

Raising priority due to recent reports.
15-11-2022

I'll add some references to our work at Red Hat, since there's a revival of interest; it was surprising for us to see the Netflix blog being published, apparently independently, while we were tinkering with the same problem.

This past summer we noticed some suspicious performance results in one of our Quarkus performance benchmarks; we had been plagued with poor reproducibility of results: strong variability, especially on machines with many CPUs. These were dedicated "performance lab" servers, configured by experts for the very purpose of producing reliable, reproducible performance figures. We had also noticed a very high impact on total performance depending on the warmup patterns being used, and overall we were concerned that we were far from the theoretical performance we expected from this simple code; this was confirmed by achieving significantly higher figures with code written in different languages not running on the JVM.

After a brilliant deep-dive investigation by Francesco Nigro (Red Hat middleware performance team), and with the help of Andrew Haley and Andrew Dinn (Red Hat OpenJDK team), they figured out that we were affected by this particular issue, to the extent of it having a very significant impact. Francesco created an agent to help people like me, maintainers of Java OSS libraries, identify code patterns which trigger the issue. The agent is available here:
- https://github.com/RedHatPerf/type-pollution-agent

We're now using this agent in the community to identify and prioritize patches on a number of popular open source libraries; we're well aware that fixing this JDK issue is preferable, but we're doing this to get a better grasp of the impact, the intention being to provide more data here as soon as we can converge all those patches to the different projects. A sample of these patches might be useful:

Hibernate ORM:
- https://github.com/hibernate/hibernate-orm/pull/5444
- https://github.com/hibernate/hibernate-orm/pull/5483
- https://github.com/hibernate/hibernate-orm/pull/5502
- https://github.com/hibernate/hibernate-orm/pull/5527

Hibernate Reactive:
- https://github.com/hibernate/hibernate-reactive/pull/1400

Smallrye Mutiny:
- https://github.com/smallrye/smallrye-mutiny/pull/1081
- https://github.com/smallrye/smallrye-mutiny/pull/1113

Smallrye Common:
- https://github.com/smallrye/smallrye-common/pull/190

Vert.x:
- https://github.com/eclipse-vertx/vert.x/pull/4520
- https://github.com/eclipse-vertx/vert.x/pull/4525

Vert.x Web:
- https://github.com/vert-x3/vertx-web/pull/2289

Netty:
- https://github.com/netty/netty/pull/12709
- https://github.com/netty/netty/pull/12806
- https://github.com/netty/netty/pull/12980

Quarkus:
- https://github.com/quarkusio/quarkus/pull/29044
- https://github.com/quarkusio/quarkus/pull/28985
- https://github.com/quarkusio/quarkus/pull/29109
- https://github.com/quarkusio/quarkus/pull/29136

These patches to our projects are not something to be proud of: they don't look like clean Java code and are certainly not easy to maintain. Hopefully they won't be needed for long, and while we don't have final figures yet, the preliminary tests are looking extremely promising.

N.B. the agent is prone to reporting false positives; we're still learning how to best distinguish a real problem from an apparent one, so it's possible that some of these patches are actually unnecessary. Our plan is to verify things via hard performance numbers, but with so many cases identified it has become impractical to test each patch individually (although we did so for some of them).
14-11-2022

I should mention that for my new invoke-bindings code, I have a prototype where each type gets a "selector": a randomly assigned yet unique number. Each class has a cuckoo-style hash table for its supertypes. For a type check operation, I check first in the primary bucket, and then in the secondary bucket. The tables are immutable after creation. Very quick lookup times, and no contention. Might be interesting to try out. (A sketch of the idea follows below.)
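
A minimal sketch of what such a two-probe lookup could look like; the structure, the hash constants, and the choice to store selectors directly are assumptions for illustration, not the prototype's actual code:

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch of the two-probe "cuckoo-style" supertype table.
    // Built once at class load time and never mutated, so concurrent
    // readers never contend on it.
    struct SuperTypeTable {
      std::vector<uint32_t> buckets;  // power-of-two sized; holds selectors, 0 = empty

      static uint32_t hash1(uint32_t selector) { return selector * 0x9E3779B9u; }
      static uint32_t hash2(uint32_t selector) { return selector * 0x85EBCA6Bu; }

      // A type check probes at most two buckets for the supertype's selector.
      bool contains(uint32_t selector) const {
        const uint32_t mask = (uint32_t)buckets.size() - 1;
        return buckets[hash1(selector) & mask] == selector
            || buckets[hash2(selector) & mask] == selector;
      }
    };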
11-11-2022

Yet another incarnation of this problem: https://netflixtechblog.com/seeing-through-hardware-counters-a-journey-to-threefold-performance-increase-2721924a2822
10-11-2022

I'd like to add three thoughts:

1) This is way more common than the initial priority rating suggests. We (Intel) have seen several customers hit this issue, Red Hat/Quarkus put out a nice video describing the problem (https://youtu.be/G40VfIsnCdo), and there are multiple cases of patches to major libraries essentially fleeing from instanceof and checkcast usage (it seems like daily!).

2) There is also a false-sharing issue between _secondary_super_cache and _secondary_supers (and possibly _primary_supers). At the least, a quick fix to pad _secondary_super_cache to avoid false sharing would help.

3) One potential fix occurred to me as I wrote this, so no code or perf data. Basic idea: don't update the _secondary_super_cache after a search if we detect that the _secondary_super_cache changed while we searched. This will start throttling updates at extreme contention.

    Klass* orig_value = _secondary_super_cache;
    if (search_secondary_supers(k)) {
      // Only install k if no other thread changed the slot while we scanned.
      SemiAtomic::cmpxchg(&_secondary_super_cache, orig_value, k);
      return true;
    }

"SemiAtomic", because we need the effect of cmpxchg but don't need the atomicity. Testing is needed to see whether Atomic::cmpxchg is OK, or whether a non-atomic "load, cmp, bne, store" sequence performs better.

A variation is to poison the _secondary_super_cache if we EVER see a failure in the "cmpxchg", because the "cmpxchg" may not fail often enough if the first load of _secondary_super_cache is slower than search_secondary_supers():

    Klass* orig_value = _secondary_super_cache;
    if (search_secondary_supers(k)) {
      if (orig_value != POISON) {
        // Any lost race is taken as evidence of contention: poison the slot
        // so nobody ever writes to this cache line again.
        if (!SemiAtomic::cmpxchg(&_secondary_super_cache, orig_value, k)) {
          _secondary_super_cache = POISON;
        }
      }
      return true;
    }

Or maybe the "cmpxchg" never fails and we need something else. I'm happy to hear any thoughts on this or other approaches...
02-11-2022

The PR for JDK-8151481 (https://github.com/openjdk/jdk/pull/6434) explains how the secondary_super_cache issue negatively affected the performance of j.u.regex.Pattern.
02-11-2022

Added link to https://github.com/franz1981/type-pollution-agent
12-10-2022

Background information:
- http://cr.openjdk.java.net/~jrose/jvm/checktype-2001.txt
- https://bugs.openjdk.java.net/browse/JDK-8251318
- https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-August/041056.html
- https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-August/041051.html
07-08-2020

ILW = Performance issue with secondary_super_cache (not a regression), with rare workloads, no workaround (other than disabling writes to the cache in code and rebuilding HotSpot) = MMH = P3
20-02-2019

Mulugeta Mammo of Intel came across this during a genomics study, and the same problem was later discovered in Cassandra. Here is more information from Mulugeta:

Original JIRA post that includes the example: https://issues.scala-lang.org/browse/SI-9823
The PR that uses virtual calls: https://github.com/scala/scala/pull/5364
JVM workaround, disabling writes to the secondary super cache: http://hg.openjdk.java.net/jdk8u/jdk8u60/hotspot/file/37240c1019fd/src/cpu/x86/vm/macroAssembler_x86.cpp (lines 5053, 5055, 5114)
Scala download (we used 2.10.4, but the 2.11 versions have the same issue): https://www.scala-lang.org/download/all.html

The code below demonstrates the problem. After disabling the secondary super cache, I measured a performance gain as high as 40x on a Haswell-EP with 2 sockets and 36 cores.

    // javac Main.java
    // java -XX:-Inline -XX:-TieredCompilation Main 36 8000000
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    interface I1 {}
    interface I2 {}
    interface I3 extends I1, I2 {}

    class B implements I3 {}
    class C implements I3 {}

    public class Main {
        public static void main(String[] args) {
            int numThreads = Integer.parseInt(args[0]);
            int loopCount = Integer.parseInt(args[1]);
            ExecutorService es = Executors.newFixedThreadPool(numThreads);
            I3 b = new B();
            I3 c = new C();
            for (int i = 0; i != numThreads; i++) {
                es.submit(() -> {
                    // Each thread alternates between testing I1 and I2 on the
                    // same classes, so the two interfaces compete for the SSC slot.
                    for (int j = 0; j != loopCount; j++) {
                        foo(b); foo(c);
                        goo(b); goo(c);
                    }
                });
            }
            es.shutdown();
        }

        public static boolean foo(I3 i) { return i instanceof I1; }
        public static boolean goo(I3 i) { return i instanceof I2; }
    }
16-05-2017