JDK-8315559 : Delay TempSymbol cleanup to avoid symbol table churn
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 17.0.9, 21.0.1, 22
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-09-01
  • Updated: 2024-06-03
  • Resolved: 2023-12-04
Fix Versions:
JDK 17: 17.0.13 (Fixed)
JDK 21: 21.0.4-oracle (Fixed)
JDK 22: 22 b27 (Fixed)
Description
Dacapo pmd benchmark (`java -jar dacapo-9.12-MR1.jar --size large --iterations 20 pmd`) regressed 5-20% on all platforms.

This is caused by JDK-8313678, which fixed a symbol leak. The leak was actually beneficial to performance in some cases, because it allowed cheap re-use of temp symbols instead of churning symbols through the associated cleanup and recreation costs.


Comments
[jdk17u-fix-request] Approval Request from olivergillespie Fixes a non-trivial performance regression in class-loading (dacapo pmd). The backport is not clean due to file movement, but is simple to merge. Confirmed that the performance regression is present before the change and fixed after it, using the original benchmark. Medium risk as it affects Symbol creation, but it has been working in tip and 21 for a few months with no associated issues.
23-04-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/2424 Date: 2024-04-22 15:35:14 +0000
22-04-2024

[jdk21u-fix-request] Approval Request from olivergillespie Re-requesting now that it has baked in mainline for a few months, and for parity with 21.0.4-oracle. The backport fixes a performance regression in class-loading which showed up as a 5-20% regression in the Dacapo pmd benchmark. It is not trivial code, so there is some risk, but with the baking I think it's reasonable. No special testing performed.
05-03-2024

This should have more time in mainline before backporting to 21.
15-01-2024

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/154 Date: 2024-01-09 16:46:27 +0000
09-01-2024

Changeset: d23f4f12 Author: Oli Gillespie <ogillespie@openjdk.org> Committer: Coleen Phillimore <coleenp@openjdk.org> Date: 2023-12-04 12:25:51 +0000 URL: https://git.openjdk.org/jdk/commit/d23f4f12adf1ea26b8c340efe2c3854e50b68301
04-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16398 Date: 2023-10-27 10:55:16 +0000
27-10-2023

Thanks. I agree it would be better just for temp symbols, but if that doesn't work easily I don't think it's fatal - constant pool entries will be quickly pushed out of the queue under high-volume temp symbol creation, due to the push-pop behaviour when the queue fills up. The main downside I see with handling *all* symbols is the unnecessary overhead of the extra queue operations.
27-10-2023

I like the simplicity of keeping the queue bounded at 100 and not triggering cleanup, but draining the queue if we do have a concurrent cleanup. I like the prototype so far; hopefully it moves cleanly to TempNewSymbol, because constant pool loads would use up all the entries in the queue, and those are never unloaded until class unloading time.
26-10-2023

Thanks for looking at it :) Since you're okay with it in principle, I'll convert it to a PR and we can discuss the code details there. I'll look at restricting it to TempNewSymbol. For the cleanup, what I meant was: the queue insert logic keeps the queue size bounded (I'm using 100 for testing), so at most we'll have 100 half-leaked entries. If we don't want possibly indefinite retention of those 100 entries, we could trigger has_work whenever the queue is not empty, but I tried that and it triggers too often, degrading performance. We could put the threshold for concurrent cleanup somewhere between 0 and 100, but then is it really much of a benefit over the already-bounded queue?
26-10-2023
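
A minimal sketch of the bounded insert described above, assuming a global `_delay_queue` with push/pop/length operations. The function name add_to_cleanup_delay_queue appears in the patch quoted in the next comment; everything else here is illustrative, not the attached patch:

```
// Sketch only: keep each newly inserted symbol alive for a while by
// holding one extra reference in a bounded queue.
static const int MAX_DELAYED_SYMBOLS = 100;   // bound used for testing above

void add_to_cleanup_delay_queue(Symbol* sym) {
  sym->increment_refcount();                  // the "half-leaked" extra reference
  _delay_queue.push(sym);
  if (_delay_queue.length() > MAX_DELAYED_SYMBOLS) {
    Symbol* oldest = _delay_queue.pop();      // push-pop once the queue is full
    oldest->decrement_refcount();             // oldest entry can die again
  }
}
```

The pop-on-overflow is why high-volume temp symbol creation quickly pushes constant pool entries back out of the queue, and why at most 100 entries are ever half-leaked.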

> + add_to_cleanup_delay_queue(sym); // TODO: only temp symbols?

This might make sense. Move the code to be used by TempNewSymbol.
26-10-2023

Yes, this patch is quite simple. It increments the refcount of recently added symbols and decrements it again some time later, so you never return a symbol with a zero refcount (the refcount would always be at least 2). I like it. The queue belongs in the .cpp file, since nothing else refers to it. Not sure I understand your last sentence. Cleanup won't be triggered as often because lookup won't find as many zero-refcount symbols, but you could trigger has_work if the queue gets to some threshold of the symbol table size. Having a NonblockingQueue is good in that we don't need another CHT.
26-10-2023

I have prototyped Aleksey's initial idea of a queue which delays symbol death; see the attached symbol-delay-queue.patch. It brings performance back in line with the original code before the leak fix in my two tests (dacapo pmd and my `findLoadedClass("does-not-exist" + Integer.valueOf(1))` benchmark above). I liked the idea because it feels relatively simple and non-invasive - I'm not sure I understand enough to try the second-CHT idea. Two key things are missing: operating only on temp symbols, and considering async cleanup. Triggering async cleanup based on the queue being non-empty seems to be too frequent, and I'm not sure it's needed - the queue is tightly bounded in number of entries, though the entries could theoretically be large.
26-10-2023

I think allowing Symbol revival would cause some hard-to-track-down bugs, since entries are asynchronously deleted. We had bugs like that when implementing this table: symbols going away unexpectedly.
23-10-2023

Aleksey's idea is interesting. Here's my interpretation: instead of the TempNewSymbol destructor decrementing the refcount, it could add a SymbolCleanup { Symbol*, refcount } node to a second CHT. At class unloading time, or when the table hits a threshold, run through this table, decrement each Symbol's refcount node.refcount times, and clear out the table. Trigger this from the ServiceThread_lock. Another asynchronous moving part, but it might help a lot with transient symbols that are eventually used. [~ogillespie] Would you like to prototype this?
19-10-2023
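
As a sketch, the SymbolCleanup node described above might look like this. This second-CHT design was not implemented (the delay queue discussed earlier was chosen instead), so the fields and their use are assumptions:

```
// Hypothetical node for the second-CHT idea; batches pending refcount
// decrements for one Symbol instead of applying them immediately.
struct SymbolCleanup {
  Symbol* _sym;       // symbol whose release is being deferred
  int     _refcount;  // number of decrements owed to _sym
};
// At class unloading time, or at a size threshold, walk the table,
// decrement each _sym's refcount _refcount times, then clear the table.
```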

I think I understand now how we have 'collisions' with just one symbol, and why removing the sync cleanup path hurts performance (you probably already know this, but I needed to learn a bit more to understand fully).

Back-to-back TempNewSymbol calls to SymbolTable::new_symbol("foo") always create multiple entries, even if sync cleanup is removed. That's because once a symbol is dead, it can never be revived (by choice, to avoid concurrency issues - https://github.com/openjdk/jdk/blob/master/src/hotspot/share/oops/symbol.cpp#L299). So without sync cleanup, we create a bad situation where the bucket for that symbol grows very long with 'equal but dead' symbols, all of which we have to trawl through before concluding that we do in fact need to create a new symbol - plus the async cleanup has far more work to do with all the duplicate dead symbols.

After the first call, the bucket has an existing (but dead) entry for the given string, which reliably triggers the original leak on the next call, conveniently stopping the delete/create churn by creating one canonical entry that never dies and prevents further entries being created. So my original leak test can actually use the same string twice to demonstrate the leak, not just two different strings with the same hashcode.

Removing the 'no revival allowed' safety net *and* sync cleanups actually gives me good performance, so I think that's good evidence for this explanation. But I still don't know the best option - is it worth considering allowing symbol revival, or do you think Aleksey's keepalive queue would be interesting to see?
19-10-2023
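
The 'no revival' rule referenced above comes down to refusing to re-increment a refcount that has already hit zero. A simplified sketch of that check (the real code in symbol.cpp is a lock-free CAS loop; the function name here is made up):

```
// Simplified: once a Symbol's refcount reaches zero it is dead for good,
// so a racing lookup cannot resurrect it and must create a new Symbol,
// growing the bucket with 'equal but dead' entries as described above.
bool try_keep_alive(Symbol* s) {   // hypothetical name
  if (s->refcount() == 0) {
    return false;                  // dead: no revival allowed
  }
  s->increment_refcount();         // the real code does check-and-increment
  return true;                     // as one atomic CAS, not two steps
}
```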

Not cleaning up during add (allowing the service thread to do the work) is 10% worse than reverting all the changes from JDK-8313670.
16-10-2023

I am sending a build through performance testing that doesn't clean on the insert path, and lets the service thread do all the work. I don't know if that change would help.
12-10-2023

Oli and I discussed this today, and one thing that comes to mind is somehow delaying the removal of temporary symbols from the symbol table. The wrinkle is that there are both sync (delete_in_bucket called from the get path) and async (service thread) parts. So maybe delaying the synchronous cleanup and letting the async cleanup act would be enough? Without introducing Symbol age and other intrusive rewrites, we can _maybe_ go to SymbolTable::new_symbol and, on the path that inserts a new symbol, _also_ increment the refcount and put that symbol on a global NonblockingQueue. Then the service thread comes in for async cleanup, pops symbols from the queue, and decreases their refcounts, making "temp/garbage" symbols eligible for deletion. So the temp symbols would have a window between two async cleanups in which to be reused. Not sure if Oli wants to prototype it.
12-10-2023
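
Under the same assumptions as the insert sketch earlier (a global `_delay_queue` holding one extra reference per queued symbol), the async half described here could be a simple drain on the service thread - again a sketch, not the committed change:

```
// Sketch: the service thread drops the keep-alive references during its
// regular cleanup pass, re-exposing the symbols to normal reclamation.
void drain_cleanup_delay_queue() {
  Symbol* sym;
  while ((sym = _delay_queue.pop()) != nullptr) {
    sym->decrement_refcount();  // if this was the last reference, the
                                // symbol is dead and can be deleted
  }
}
```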

My test that does essentially the same thing gets similar results with -XX:+PrintStringTableStatistics:

```
With leak fixed:
Number of literals : 16881 = 5255432 bytes, avg 311.000
Total footprint    :       = 5787672 bytes

With leak:
Number of literals : 26043 = 8113976 bytes, avg 311.000
Total footprint    :       = 8792808 bytes
```

I think fixing this leak is a Good Thing. The leak for the same symbol arises because the lookup for class loading uses a TempNewSymbol: the first instance is reclaimed during the lookup for the AppClassLoader (is_dead=true), then the second call, for the PlatformLoader, adds the same name, which gets an additional refcount while delete_in_bucket is called for the first name. I think that without an extensive rewrite, this performance problem for this dacapo benchmark may be the cost of correctness, but I'm still trying to think of some way to mitigate it.
12-10-2023

I am trying to understand the overhead a bit more deeply. I created this benchmark which I think reproduces the issue:

```
MyClassLoader cl = new MyClassLoader();

@Benchmark
public Class findLoadedClass() {
    return cl.find();
}

static class MyClassLoader extends ClassLoader {
    public Class find() {
        return findLoadedClass("does-not-exist" + Integer.valueOf(1));
    }
}
```

I've attached a flame graph of this running with and without the leak fix. We can see that SymbolTable::do_add_if_needed is the big difference, with most of the cost being delete_in_bucket. I don't fully understand it: the original leak requires a collision between two symbols, but this repro works with just one symbol.
12-10-2023

Yes, I had an application running Groovy 4 which was leaking memory at a significant overall rate due to this, combined with a bug/issue in Groovy which can cause churn of LambdaForms classes (https://issues.apache.org/jira/browse/GROOVY-11152). However, in well-behaved applications I don't expect this to be a bad leak, so I wouldn't say the leak fix is high priority. My application has avoided the issue by using Groovy 3 anyway, so it won't affect me personally if we decide to revert.

That being said, I wonder how realistic/important the pmd benchmark is in this case - it's only repeated failed classloading with the same name which is improved by the accidental cache, right? Successful loading shouldn't be affected as I understand it, and that seems the more important case to optimize.

As an aside, I've often thought caching of negative results from classloaders would be useful for performance. I've seen a lot of applications which repeatedly look for classes that will never be there, and I would guess an unchanging classpath is the most common situation.
12-10-2023

[~ogillespie] Was there an observed problem running an application that made you discover this leak? I see you backported the fix.
12-10-2023

Making JVM_FindLoadedClass leak symbols reduces the dead symbols to 54171 - still a lot. It seems that this leak for transient symbols was good for performance. When we moved Symbol out of PermGen over a decade ago, we settled on an implementation that allowed us to use Symbol* directly, with refcounting so that unused Symbols could be reclaimed from the SymbolTable. This design and implementation have been buggy for years. Most Symbols are never reclaimed; the percentage reclaimed was something like 5% when we were reclaiming symbols at a safepoint. Now, with the CHT, we're more eager about reclaiming symbols, which is a good thing - but fixing this leak makes the reclamation work a bit too well. Maybe it's time for an alternate design? Bring back SymbolHandle for on-stack use in the VM, and make it work like methodHandles for reclamation, like MetadataOnStackMark? That requires work during a safepoint (a pause), but we'd probably almost never do it.
11-10-2023

It looks like class loading is affected by patching up this leak. When looking up a class, we call JVM_FindLoadedClass for the hierarchy of class loaders. That function uses a TempNewSymbol, which deallocates the Symbol after searching for whether the class is loaded. With the leak fixed, each time we look up a symbol it's deallocated rather than reused. With the patch I see 1102485 total dead symbols; without the patch, 2663. (I added logging to the is_dead function.)
10-10-2023
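
The churn pattern described above can be pictured with a short sketch (TempNewSymbol and SymbolTable::new_symbol are the real HotSpot types; the class name is made up):

```
// One failed findLoadedClass-style lookup, repeated per class loader:
{
  TempNewSymbol name = SymbolTable::new_symbol("does/not/Exist");
  // ... search the dictionaries for `name`; it is never found ...
}  // TempNewSymbol's destructor decrements the refcount here. With the
   // leak fixed nothing else references the Symbol, so it dies, and the
   // next lookup must skip the dead entry and create a fresh one.
```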

Very good point - my patch reintroduces the leak, so your idea that we're detecting more dead symbols with the leak fix seems reasonable. I added logging to is_dead and I observe around 200,000 `true` returns per Dacapo iteration, compared to 0 (after a few warmup rounds) with the additional equals call. I also observe around 70,000 SymbolTable::do_add_if_needed calls, again compared to 0 before. Here are the top matches:

```
net/sourceforge/pmd/ast/RuntimeException
java/util/RuntimeException
net/sourceforge/pmd/ast/JJTFORINIT
net/sourceforge/pmd/SimpleJavaNode
net/sourceforge/pmd/ast/LookaheadSuccess
...
```

So could this mean we're seeing high symbol churn in the benchmark when the leak is fixed, and that before, with the leak, we were accidentally caching the symbols, saving the cost of deleting and re-creating them? I'm not sure how that idea fits with your CLEAN_DEAD_HIGH_WATER_MARK patch though; I would expect that to help in this case.
10-10-2023

In my testing (x86), it seems we can 'fix' the regression with this patch. I have no idea why, but it seems to indicate that the issue is not the new is_dead function but the removal of the equals call.

```
diff --git a/src/hotspot/share/utilities/concurrentHashTable.inline.hpp b/src/hotspot/share/utilities/concurrentHashTable.inline.hpp
index b222d379b72..ae256a5ef20 100644
--- a/src/hotspot/share/utilities/concurrentHashTable.inline.hpp
+++ b/src/hotspot/share/utilities/concurrentHashTable.inline.hpp
@@ -545,6 +545,7 @@ inline void ConcurrentHashTable<CONFIG, F>::
   Node* const volatile * rem_n_prev = bucket->first_ptr();
   Node* rem_n = bucket->first();
   while (rem_n != nullptr) {
+    lookup_f.equals(rem_n->value()); // ???
     if (lookup_f.is_dead(rem_n->value())) {
       ndel[dels++] = rem_n;
       Node* next_node = rem_n->next();
```
10-10-2023

[~ogillespie] This is really interesting, and is consistent with my findings as well. On linux-x64:

1. Reverting all the changes for JDK-8313670 vs. master (sources with changes): -5.55%
2. Reverting all the changes vs. your first fix to add another parameter to equals: -5.65%
3. Reverting all the changes vs. a fix to keep the is_dead() function but use the original is_dead parameter for equals: -6.40%
4. Reverting all the changes vs. the above, with bucket_delete called only on the slow lookup (with some threshold for triggering the service thread less frequently): -9.17%

I really thought the performance problem was the lookup calling both equals and then is_dead(), but the results don't improve with that change. I also thought there were more dead Symbols detected because the leak is fixed. This is a diff for #3 above, which doesn't help: https://github.com/openjdk/jdk/pull/16118. I'll try your patch. I don't have any explanation for why this is happening. I was going to suggest that correctness (no leaks) is costing us 5% performance.

edit: The patch to add 'equals' in delete_in_bucket reintroduces the leak (and fails the gtest you added).
10-10-2023

This one is indeed puzzling. From the profiles, I notice that: a) even after 20 iterations we have plenty of C2 compiles; b) there are lots of exceptions thrown from pmd code; c) there is considerable contention on internal pmd locks. I am beginning to suspect that one of these three things got worse somehow.
06-10-2023

So far, I've found that the regression seems to be caused by delete_in_bucket() calling the new is_dead() function rather than calling equals(&is_dead). The is_dead() function does very little, but possibly because of the number of calls during table insertions there are cache misses that show up as a performance regression. I don't really understand it. If I revert the change, the performance is measurably better; if I revert most of the change but keep the is_dead() function, the performance is not better.
29-09-2023