JDK-8208142 : Fix perf regression for simple startup benchmarks in new SymbolTable
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 12
  • Priority: P4
  • Status: Resolved
  • Resolution: Not an Issue
  • Submitted: 2018-07-24
  • Updated: 2019-08-15
  • Resolved: 2019-02-21
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 13
13Resolved
Related Reports
Blocks :  
Blocks :  
Relates :  
Relates :  
Relates :  
Relates :  
Description
Switching SymbolTable implementation to use concurrent hash table shows minor (~5%) regression for trivial startup benchmarks.

Need to figure out what exactly is going on (old table was practically sync free when JVM starts up single threaded, but new one has some bookkeeping to do always) and get the performance back up.
Comments
You are absolutely right that my analysis does not apply to big apps, but this particular issue is only about the perf regression as noted in the startup perf benchmarks noted after changing the implementation of symbol table, but before CDS was turned ON by default. For this reason I believe that this particular issue should be closed "as not an issue" and you can and should open another issue raising the concern you noted here (please link them).
21-02-2019

I disagree with #1: default CDS being on by default just shifts what applications are affected. A no-op Hello World will load most of everything from CDS, sure, but the gain from default CDS might be in the noise on a large application. One inefficiency I've spotted is that in SymbolTable::add we do a lookup_common, even though we've already done a lookup_common in the code bulks up a number of symbols. This means we can be absolutely sure the symbol doesn't exist in the shared table and improve a bit by skipping this back-to-back lookup_common: diff -r cc9359f8c563 src/hotspot/share/classfile/symbolTable.cpp --- a/src/hotspot/share/classfile/symbolTable.cpp Thu Feb 21 14:16:44 2019 +0100 +++ b/src/hotspot/share/classfile/symbolTable.cpp Thu Feb 21 16:42:45 2019 +0100 @@ -444,10 +444,7 @@ const char *name = names[i]; int len = lengths[i]; unsigned int hash = hashValues[i]; - Symbol* sym = SymbolTable::the_table()->lookup_common(name, len, hash); - if (sym == NULL) { - sym = SymbolTable::the_table()->do_add_if_needed(name, len, hash, c_heap, CHECK); - } + Symbol* sym = SymbolTable::the_table()->do_add_if_needed(name, len, hash, c_heap, CHECK); assert(sym->refcount() != 0, "lookup should have incremented the count"); cp->symbol_at_put(cp_indices[i], sym); } Also: seeing how do_add_if_needed is always preceeded by a lookup of some sort, so we might speculatively assume we need to allocate a symbol and do insert and then get rather than get and then insert.
21-02-2019

Resuming work. Now that CDS is ON by default and we removed get_insert API from the concurrent hash tables, the assumptions have changed dramatically: #1 Because CDS is ON by default we do not use the concurrent hash table as much, so any performance improvements to it will not show much benefit overall #2 Without get_insert API we might have to lookup twice (?) and any improvements will be diminished because of #1 Need to investigate more, whether this issue is still worth it...
07-02-2019

Some stats about Symbols with "single threaded" optimization. In a trivial "hello world" test case, we create : - 840 normal symbols - 16,334 permanent symbols (includes the 8,557 symbols below) - 8,557 permanent symbols during initializing VM (single thread)
10-09-2018

Results of using memory pool (same Arena as the one used by the permanent Symbols) for all table item needs (just a prototype at the moment - we leak memory on item deletion): NEW1: reduced lookups NEW2: NEW1 + Symbol:equals optimizations NEW3: NEW2 + pre allocated memory ............................................................... OLD... vs NEW1 vs NEW2 vs NEW3 #1 JFR-Startup3-ModNoop ................... 4.38% vs 4.17% vs 3.34% vs 0.00% #2 Perfstartup-JFAAS_Hello ................. 4.65% vs 2.39% vs 2.15% vs 0.64% #3 Perfstartup-JFAAS_Hello-Tuned ...... 3.89% vs 1.51% vs 1.87% vs FAIL #4 Perfstartup-JFAAS_Hello-Tuned-G1. 3.56% vs 0.96% vs 1.51% vs FAIL #5 Perfstartup-Netty .............................. 3.15% vs 1.36% vs 1.44% vs 2.37% #6 Startup3-Jetty-G1 ............................. 3.08% vs 1.40% vs 1.34% vs 0.50% #7 Startup3-Jetty-ParGC ....................... 2.75% vs 1.92% vs 0.23% vs 0.34% #8 Startup3-ModNoop-G1 ..................... 4.17% vs 3.33% vs 3.13% vs N/A #9 Startup3-Noop-G1 ............................ 4.60% vs 2.88% vs 3.76% vs 0.00% #10 Startup3-Noop-ParGC .................... 5.03% vs 3.70% vs 0.00% vs 0.00% There are 2 failed benchmarks (my prototype is incompatible with CDS?) The results show that preallocating memory for the concurrent hash table gives us back the performance before. Now, turning the prototype into working code...
22-08-2018

Next, we try to do reduce malloc calls (using Arenas?)
16-08-2018

Results: NEW1: reduced lookups NEW2: NEW1 + Symbol:equals optimizations ............................................................... OLD... vs NEW1 vs NEW2 #1 JFR-Startup3-ModNoop ................... 4.38% vs 4.17% vs 3.34% #2 Perfstartup-JFAAS_Hello ................. 4.65% vs 2.39% vs 2.15% #3 Perfstartup-JFAAS_Hello-Tuned ...... 3.89% vs 1.51% vs 1.87% #4 Perfstartup-JFAAS_Hello-Tuned-G1. 3.56% vs 0.96% vs 1.51% #5 Perfstartup-Netty .............................. 3.15% vs 1.36% vs 1.44% #6 Startup3-Jetty-G1 ............................. 3.08% vs 1.40% vs 1.34% #7 Startup3-Jetty-ParGC ....................... 2.75% vs 1.92% vs 0.23% #8 Startup3-ModNoop-G1 ..................... 4.17% vs 3.33% vs 3.13% #9 Startup3-Noop-G1 ............................ 4.60% vs 2.88% vs 3.76% #10 Startup3-Noop-ParGC .................... 5.03% vs 3.70% vs 0.00% so we improved bit more (noise?).
16-08-2018

Results: ............................................................... OLD... vs NEW #1 JFR-Startup3-ModNoop ................... 4.38% vs 4.17% #2 Perfstartup-JFAAS_Hello ................. 4.65% vs 2.39% #3 Perfstartup-JFAAS_Hello-Tuned ...... 3.89% vs 1.51% #4 Perfstartup-JFAAS_Hello-Tuned-G1. 3.56% vs 0.96% #5 Perfstartup-Netty .............................. 3.15% vs 1.36% #6 Startup3-Jetty-G1 ............................. 3.08% vs 1.40% #7 Startup3-Jetty-ParGC ....................... 2.75% vs 1.92% #8 Startup3-ModNoop-G1 ..................... 4.17% vs 3.33% #9 Startup3-Noop-G1 ............................ 4.60% vs 2.88% #10 Startup3-Noop-ParGC .................... 5.03% vs 3.70% so we improved and are on the right track.
14-08-2018

With a simple optimization that eliminates the lookups when we're adding (lookup gets implicitly called by the concurrent hash table anyhow, so we end up looking up item twice), we get this profile: 1 ms for SymbolTable::the_table() 1 ms for szone_malloc_should_clear 1 ms for Symbol::extract_length(unsigned int) 1 ms for ConcurrentHashTable<>::Bucket::is_state() 1 ms for ConcurrentHashTable<>::BaseConfig::allocate_node() 1 ms for ConcurrentHashTable<>::Node* ConcurrentHashTable<>::get_node<>() const 6 ms total. Time to run benchmark...
14-08-2018

Totals for the regression case involving SymbolTable: 2 ms for ConcurrentHashTable<>::ScopedCS::ScopedCS() 2 ms for ConcurrentHashTable<>::Bucket::is_state() 2 ms for SymbolTableLookup::equals(Symbol**, bool*) 1 ms for DYLD-STUB$$malloc 1 ms for szone_malloc_should_clear 1 ms for ConcurrentHashTable<>::internal_get<>() 1 ms for Symbol::Symbol() 1 ms for SymbolTableLookup::get_hash() vs the baseline: 2 ms for java_lang_String::hash_code() 2 ms for ScopedFence<()0>::~ScopedFence() 1 ms for SymbolTable::lookup_only() 1 ms for SymbolTable::lookup_dynamic() 1 ms for Symbol::Symbol()
13-08-2018

Here is Xcode time profile generated using Instruments for new SymbolTable (focused on SymbolTable): 122.00 ms 100.0% 0 s__java (824) 11.00 ms 9.0% 0 s____pthread_body 0x430e 2.00 ms 1.6% 2.00 ms_ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::ScopedCS:... 1.00 ms 0.8% 0 s_____bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::ge... 1.00 ms 0.8% 0 s______SymbolTable::do_lookup(char const*, int, unsigned long) 1.00 ms 0.8% 0 s______ SymbolTable::lookup_dynamic(char const*, int, unsigned int) 1.00 ms 0.8% 0 s_____bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::in... 1.00 ms 0.8% 0 s______bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::ge... 1.00 ms 0.8% 0 s______ SymbolTable::do_add_if_needed(char const*, int, unsigned long, bool, Th... 2.00 ms 1.6% 2.00 ms_ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::Bucket::is... 2.00 ms 1.6% 0 s_____ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::Bucket::h... 2.00 ms 1.6% 0 s______ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get_buck... 2.00 ms 1.6% 0 s______ Symbol** ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>... 2.00 ms 1.6% 0 s________bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get... 2.00 ms 1.6% 0 s_________SymbolTable::do_lookup(char const*, int, unsigned long) 2.00 ms 1.6% 0 s__________SymbolTable::lookup_dynamic(char const*, int, unsigned int) 1.00 ms 0.8% 1.00 ms_SymbolTableLookup::get_hash() const 1.00 ms 0.8% 0 s_____bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::intern... 1.00 ms 0.8% 0 s______bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get_i... 1.00 ms 0.8% 0 s______ SymbolTable::do_add_if_needed(char const*, int, unsigned long, bool, Thre... 1.00 ms 0.8% 1.00 ms Symbol** ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::inte... 1.00 ms 0.8% 0 s_____bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get<Sym... 1.00 ms 0.8% 0 s______SymbolTable::do_lookup(char const*, int, unsigned long) 1.00 ms 0.8% 0 s______ SymbolTable::lookup_dynamic(char const*, int, unsigned int) 1.00 ms 0.8% 1.00 ms_DYLD-STUB$$malloc 1.00 ms 0.8% 0 s_____AllocateHeap(unsigned long, MemoryType, NativeCallStack const&, AllocFailStr... 1.00 ms 0.8% 0 s______AllocateHeap(unsigned long, MemoryType, AllocFailStrategy::AllocFailEnum) 1.00 ms 0.8% 0 s______ ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::BaseConfig... 1.00 ms 0.8% 1.00 ms_SymbolTableLookup::equals(Symbol**, bool*) 1.00 ms 0.8% 0 s_____ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::Node* Concur... 1.00 ms 0.8% 0 s______Symbol** ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::in... 1.00 ms 0.8% 0 s______ bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get<S... 1.00 ms 0.8% 0 s________SymbolTable::do_lookup(char const*, int, unsigned long) 1.00 ms 0.8% 0 s_________SymbolTable::lookup_dynamic(char const*, int, unsigned int) 1.00 ms 0.8% 1.00 ms_szone_malloc_should_clear 1.00 ms 0.8% 0 s_____malloc_zone_malloc 1.00 ms 0.8% 0 s______malloc 1.00 ms 0.8% 0 s______ os::malloc(unsigned long, MemoryType, NativeCallStack const&) 1.00 ms 0.8% 0 s________AllocateHeap(unsigned long, MemoryType, NativeCallStack const&, AllocFailS... 1.00 ms 0.8% 0 s_________AllocateHeap(unsigned long, MemoryType, AllocFailStrategy::AllocFailEnum) 1.00 ms 0.8% 0 s__________ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::BaseConf... 1.00 ms 0.8% 1.00 ms_Symbol::Symbol(unsigned char const*, int, int) 1.00 ms 0.8% 0 s_____SymbolTable::allocate_symbol(char const*, int, bool, Thread*) 1.00 ms 0.8% 0 s______SymbolTableCreateEntry::operator()() 1.00 ms 0.8% 0 s______ bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::intern... 1.00 ms 0.8% 0 s________bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get_i... 1.00 ms 0.8% 0 s_________SymbolTable::do_add_if_needed(char const*, int, unsigned long, bool, Thre... 1.00 ms 0.8% 1.00 ms_Symbol::equals(char const*, int) const 1.00 ms 0.8% 0 s_____SymbolTableLookup::equals(Symbol**, bool*) 1.00 ms 0.8% 0 s______ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::Node* Concur... 1.00 ms 0.8% 0 s______ Symbol** ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::in... 1.00 ms 0.8% 0 s________bool ConcurrentHashTable<Symbol*, SymbolTableConfig, (MemoryType)9>::get<S... 1.00 ms 0.8% 0 s_________SymbolTable::do_lookup(char const*, int, unsigned long)
13-08-2018

Here is Xcode time profile generated using Instruments for baseline (focused on SymbolTable): 122.00 ms 100.0% 0 s__java (834) 7.00 ms 5.7% 0 s____pthread_body 0x43ff 2.00 ms 1.6% 2.00 ms_java_lang_String::hash_code(signed char const*, int) 2.00 ms 1.6% 0 s______SymbolTable::hash_symbol(char const*, int) 2.00 ms 1.6% 0 s______ SymbolTable::lookup_only(char const*, int, unsigned int&) 2.00 ms 1.6% 2.00 ms_ScopedFence<(ScopedFenceType)0>::~ScopedFence() 2.00 ms 1.6% 0 s______BasicHashtableEntry<(MemoryType)9>* OrderAccess::ordered_load<BasicHa... 2.00 ms 1.6% 0 s______ BasicHashtableEntry<(MemoryType)9>* OrderAccess::PlatformOrderedLoad... 2.00 ms 1.6% 0 s________Atomic::LoadImpl<BasicHashtableEntry<(MemoryType)9>*, OrderAccess::... 2.00 ms 1.6% 0 s_________BasicHashtableEntry<(MemoryType)9>* OrderAccess::load_acquire<Basic... 2.00 ms 1.6% 0 s__________HashtableBucket<(MemoryType)9>::get_entry() const 2.00 ms 1.6% 0 s___________BasicHashtable<(MemoryType)9>::bucket(int) const 2.00 ms 1.6% 0 s____________Hashtable<Symbol*, (MemoryType)9>::bucket(int) const 2.00 ms 1.6% 0 s_____________SymbolTable::lookup_dynamic(int, char const*, int, unsigned int) 1.00 ms 0.8% 1.00 ms_SymbolTable::lookup_only(char const*, int, unsigned int&) 1.00 ms 0.8% 1.00 ms_SymbolTable::lookup_dynamic(int, char const*, int, unsigned int) 1.00 ms 0.8% 1.00 ms_Symbol::Symbol(unsigned char const*, int, int) 1.00 ms 0.8% 0 s______SymbolTable::allocate_symbol(unsigned char const*, int, bool, Thread*)
13-08-2018

One thing to immediately notice is how much memory related operations show in the regression case.
13-08-2018

We can see immediately that SymbolTable increased from 7.00 ms to 11.00 ms, which is a regression of 4ms. Out of 122 ms total runtime, 4ms is 3.2% regression, which is pretty close to 4%,5% we see in official startup perf benchmarks. These runs were generated using "hello world" test case.
13-08-2018

Things to try: #1 Eliminate unnecessary lookups #2 Start table in single-threaded mode (no critical sections, locks) #3 (from Coleen) Burst allocate the nodes
31-07-2018

Specific regressions: #1 JFR-Startup3-ModNoop 4.38% #2 Perfstartup-JFAAS_Hello 4.65% #3 Perfstartup-JFAAS_Hello-Tuned 3.89% #4 Perfstartup-JFAAS_Hello-Tuned-G1 3.56% #5 Perfstartup-Netty 3.15% #6 Startup3-Jetty-G1 3.08% #7 Startup3-Jetty-ParGC 2.75% #8 Startup3-ModNoop-G1 4.17% #9 Startup3-Noop-G1 4.60% #10 Startup3-Noop-ParGC 5.03%
25-07-2018