Bug ID: JDK-8213574 Deadlock in string table expansion when dumping lots of CDS classes

JDK-8213574 : Deadlock in string table expansion when dumping lots of CDS classes

Type: Bug
Component: hotspot
Sub-Component: runtime
Affected Version: 12

Priority: P2
Status: Closed
Resolution: Fixed

Submitted: 2018-11-09
Updated: 2020-06-15
Resolved: 2018-11-14

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 11	JDK 12
11.0.9-oracleFixed	12 b20Fixed

Related Reports

Blocks :	JDK-8213587 - Speed up CDS dump time by using resizable hashtables
Relates :	JDK-8245264 - Test runtime/cds/appcds/SignedJar.java fails

Description

Test case:

i=0; while java \
    -XX:MaxRAM=8g -Xshare:dump \
    -Xlog:cds,cds+hashtables -Xlog:hashtables \
    -XX:SharedArchiveFile=LotsOfClasses.jsa \
    -XX:ExtraSharedClassListFile=LotsOfClasses.classlist \
    --add-modules ALL-SYSTEM; do
    i=$(expr $i + 1);
    echo ========================================$i;
done

(See LotsOfClasses.classlist in attachment of this bug)

A deadlock happens between the service thread, which tries to expand the string table, and the VM thread, which tries to iterate over all entries in the string table.

(Note --  VM_PopulateDumpSharedSpace::doit runs inside the VM thread because it wants work on various data structures without worrying about concurrent modifications.)

SERVICE THREAD

#1  os::PlatformEvent::park (this=0x7fcb3c814800)
#2  ParkCommon (timo=0, ev=0x7fcb3c814800)
#3  Monitor::ILock (this=0x7fcb3c016a00, Self=0x7fcb3c8132b0)
#4  Monitor::lock_without_safepoint_check (Self=0x7fcb3c8132b0, this=0x7fcb3c016a00)   
#5  Monitor::lock_without_safepoint_check (this=0x7fcb3c016a00)
#6  SafepointSynchronize::block (thread=0x7fcb3c8132b0)
#7  SafepointMechanism::block_if_requested_local_poll (thread=0x7fcb3c8132b0)
#8  SafepointMechanism::block_if_requested (thread=0x7fcb3c8132b0)
#9  ThreadStateTransition::transition_and_fence (to=_thread_blocked, from=_thread_in_vm, thread=0x7fcb3c8132b0)
#10 ThreadStateTransition::trans_and_fence (this=<synthetic pointer>, to=_thread_blocked, from=_thread_in_vm)
#11 ThreadBlockInVM::ThreadBlockInVM (thread=0x7fcb3c8132b0, this=<synthetic pointer>)
#12 StringTable::grow (this=0x7fcb3c161920, jt=jt@entry=0x7fcb3c8132b0)
#13 StringTable::concurrent_work (jt=0x7fcb3c8132b0, this=<optimized out>)
#14 StringTable::do_concurrent_work (jt=jt@entry=0x7fcb3c8132b0)
#15 ServiceThread::service_thread_entry (jt=0x7fcb3c8132b0, __the_thread__=0x7fcb3c8132b0)
#16 JavaThread::thread_main_inner (this=this@entry=0x7fcb3c8132b0)
#17 JavaThread::run (this=0x7fcb3c8132b0)
#18 Thread::call_run (this=0x7fcb3c8132b0)
#19 thread_native_entry (thread=0x7fcb3c8132b0)
#20 start_thread (arg=0x7fcb18b9d700)
#21 clone ()

VM THREAD

#0  ConcurrentHashTable<WeakHandle<(WeakHandleType)2>, StringTableConfig, (MemoryType)9>::lock_resize_lock (locker=0x7fcb3c38db80, this=0x7fcb3c161c30)
#1  ConcurrentHashTable<WeakHandle<(WeakHandleType)2>, StringTableConfig, (MemoryType)9>::do_scan<CopyToArchive> (scan_f=..., thread=0x7fcb3c38db80, 
    this=0x7fcb3c161c30)
#2  StringTable::copy_shared_string_table (writer=writer@entry=0x7fcb19379820)
#3  StringTable::write_to_archive ()
#4  HeapShared::copy_closed_archive_heap_objects (closed_archive=closed_archive@entry=0x7fcac0f314f0)
#5  HeapShared::archive_java_heap_objects (closed=0x7fcac0f314f0, open=open@entry=0x7fcac0f31550)
#6  VM_PopulateDumpSharedSpace::dump_java_heap_objects (this=this@entry=0x7fcb436c7b90)
#7  VM_PopulateDumpSharedSpace::doit (this=<optimized out>)
#8  VM_Operation::evaluate (this=this@entry=0x7fcb436c7b90)
#9  VMThread::evaluate_operation (this=this@entry=0x7fcb3c38db80, op=0x7fcb436c7b90)
#10 VMThread::loop (this=this@entry=0x7fcb3c38db80)
#11 VMThread::run (this=0x7fcb3c38db80)
#12 Thread::call_run (this=0x7fcb3c38db80)
#13 thread_native_entry (thread=0x7fcb3c38db80)
#14 start_thread (arg=0x7fcb1937a700)

Comments

Christoph, I just pinged for the backport review, and CC'ed you and Martin (https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-June/003240.html). Thanks for the notes!
05-06-2020
[~jiangli], unfortunately there's no review for your backport yet. Probably the RFR fell victim to the mailing list problems. I for myself didn't receive it in my inbox but could just see it in the mailing list archives. I suggest to issue a ping for your review, maybe also cc'ing people directly that could review this patch.
05-06-2020
JDK 11u fix request: https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2020-May/003107.html CDS dumping may run into a deadlock situation during copying the interned j.l.String objects in the VM thread, while the service thread is concurrently expanding the string table. I ran into the deadlock recently when creating a CDS archive with an input classlist of ~60K classes on JDK 11. The backport addresses the deadlock issue. The diffs in symbolTable.cpp are not included in the backport as they only apply to ConcurrentHashTable. In JDK 11 the SymbolTable is not yet a ConcurrentHashTable. All other diffs in the patch applied cleanly. Tested with tier1, jtreg/runtime/SharedArchiveFile
18-05-2020
The test for this is LotsOfClasses.java, which was pushed in a separate bug (JDK-8213587) http://hg.openjdk.java.net/jdk/jdk/rev/3009ca99de32 http://hg.openjdk.java.net/jdk/jdk/file/3009ca99de32/test/hotspot/jtreg/runtime/appcds/LotsOfClasses.java 8213587: Speed up CDS dump time by using resizable hashtables Reviewed-by: jiangli, coleenp, gziemski That test would deadlock without this fix. Such deadlocks no longer appear after this bug is fixed. Hence I am marking this bug as verified.
05-01-2019
During safepoint only threads used in the VM op are 'allowed' to use VM data structures. So it's no different from e.g.: assert_locked_or_safepoint(SystemDictionary_lock); Which means you are supposedly the only thread accessing it. I think many of our data structures lacks such guarantees. Note that the rehash of symboltable/stringtable already take advantage of this. We have some upcoming improvement to the hashtable, we should add a, "Better interaction with safepoint" as another improvement. Note that I totally agree with your concern. [~iklam] You could also dump the symboltable/stringtable outside a safepoint, which is ironically safer :). But my thought was that you want do it all in a safepoint as snap-shot.
12-11-2018
It's not at all clear to me how this lock-free at-safepoint scan is guaranteed safe unless the only mutators are JavaThreads - is that the case? If so what ensures that?
10-11-2018
The latest patch works great. The number of strings iterated now stay constant at 312595. Thanks a lot [~rehn] for the quick and elegant fix! I think the new function should work for jcmds as well. $ (i=0; while java \ -XX:MaxRAM=8g -Xshare:dump \ -XX:SharedArchiveFile=LotsOfClasses.jsa \ -XX:ExtraSharedClassListFile=LotsOfClasses.classlist \ -Xlog:cds+hashtables \ --add-modules ALL-SYSTEM; do i=$(expr $i + 1); echo ========================================$i; done) \| tee log.txt & $ tail -f log3.txt \| grep 'Number of entries' [24.503s][info][cds,hashtables] Number of entries : 312595 [26.572s][info][cds,hashtables] Number of entries : 6 [22.580s][info][cds,hashtables] Number of entries : 599003 [23.808s][info][cds,hashtables] Number of entries : 312595 [25.921s][info][cds,hashtables] Number of entries : 6 [23.500s][info][cds,hashtables] Number of entries : 599003 [24.773s][info][cds,hashtables] Number of entries : 312595 [26.759s][info][cds,hashtables] Number of entries : 6 [22.984s][info][cds,hashtables] Number of entries : 599003 [24.217s][info][cds,hashtables] Number of entries : 312595 [26.222s][info][cds,hashtables] Number of entries : 6 [22.403s][info][cds,hashtables] Number of entries : 599003 [23.594s][info][cds,hashtables] Number of entries : 312595 [25.535s][info][cds,hashtables] Number of entries : 6 [22.866s][info][cds,hashtables] Number of entries : 599003 [24.087s][info][cds,hashtables] Number of entries : 312595
09-11-2018
Yes, it is, because a concurrent insert would end up in the new table. I forgot to filter out those which should be visited in the new table. I have updated the webrev in-place on cr with the filtering. (late Friday here, I cross my fingers :) ) Thanks for the testing [~iklam]!
09-11-2018
The patch worked! The deadlock is gone (more than 100 runs). However, do_safepoint_scan() seems to pass some elements twice if the new table is not empty. Is this loop needed? table = get_new_table(); if (table == NULL) { return; } for (size_t bucket_it = 0; bucket_it < table->_size; bucket_it++) { if (!visit_nodes(table->get_bucket(bucket_it), scan_f)) { break; } }
09-11-2018
Please test this out, [~iklam] : http://cr.openjdk.java.net/~rehn/8213574/webrev/ Added [~kbarrett] and [~gziemski] as watchers.
09-11-2018
How to reproduce. (step 2 is optional but makes VM run faster when executing the script) 1. Get a recent jdk/jdk repo 2. Apply diff.txt (from attachment) 3. Build a product VM (the deadlock is much less reproducible in fastdebug or slowdebug) 4. Run the following script i=0; while java \ -XX:MaxRAM=8g -Xshare:dump \ -XX:SharedArchiveFile=LotsOfClasses.jsa \ -XX:ExtraSharedClassListFile=LotsOfClasses.classlist \ --add-modules ALL-SYSTEM; do i=$(expr $i + 1); echo ========================================$i; done The deadlock happens once in several runs on my machine. It gets stuck here: Allocated shared space: 3221225472 bytes at 0x0000000800000000 Loading classes to share ... Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 javax/management/remote/rmi/RMIConnectionImpl_Stub Preload Error: Failed to load javax/management/remote/rmi/RMIConnectionImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 javax/management/remote/rmi/RMIConnectionImpl_Stub Preload Error: Failed to load javax/management/remote/rmi/RMIConnectionImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 javax/management/remote/rmi/RMIServerImpl_Stub Preload Error: Failed to load javax/management/remote/rmi/RMIServerImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 javax/management/remote/rmi/RMIServerImpl_Stub Preload Error: Failed to load javax/management/remote/rmi/RMIServerImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 java/rmi/activation/ActivationGroup_Stub Preload Error: Failed to load java/rmi/activation/ActivationGroup_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 java/rmi/activation/ActivationGroup_Stub Preload Error: Failed to load java/rmi/activation/ActivationGroup_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 sun/rmi/server/Activation$ActivationSystemImpl_Stub Preload Error: Failed to load sun/rmi/server/Activation$ActivationSystemImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 sun/rmi/server/Activation$ActivationSystemImpl_Stub Preload Error: Failed to load sun/rmi/server/Activation$ActivationSystemImpl_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 com/sun/jndi/rmi/registry/ReferenceWrapper_Stub Preload Error: Failed to load com/sun/jndi/rmi/registry/ReferenceWrapper_Stub Java HotSpot(TM) 64-Bit Server VM warning: Pre JDK 1.5 class not supported by CDS: 45.3 com/sun/jndi/rmi/registry/ReferenceWrapper_Stub Preload Error: Failed to load com/sun/jndi/rmi/registry/ReferenceWrapper_Stub Loading classes to share: done. Rewriting and linking classes ... Rewriting and linking classes: done Number of classes 30323 instance classes = 30231 obj array classes = 84 type array classes = 8 Updating ConstMethods ... done. Removing unshareable information ... done. Scanning all metaspace objects ... Allocating RW objects ... Allocating RO objects ... Relocating embedded pointers ... Relocating external roots ... Dumping symbol table ... Dumping objects to closed archive heap region ... ^C^C^C^Z
09-11-2018
The "grow" drops the resize_lock but not the state-lock, when it calls ThreadBlockInVM. The VM op code leads to lock_resize_lock #0 ConcurrentHashTable<WeakHandle<(WeakHandleType)2>, StringTableConfig, (MemoryType)9>::lock_resize_lock (locker=0x7fcb3c38db80, this=0x7fcb3c161c30) #1 ConcurrentHashTable<WeakHandle<(WeakHandleType)2>, StringTableConfig, (MemoryType)9>::do_scan<CopyToArchive> (scan_f=..., thread=0x7fcb3c38db80, this=0x7fcb3c161c30) #2 StringTable::copy_shared_string_table (writer=writer@entry=0x7fcb19379820) which requires both (despite the name): // Returns when both mutex and state are proper locked. void lock_resize_lock(Thread* locker); // Unlocks mutex and state. void unlock_resize_lock(Thread* locker); So the VMThread is blocked by the ServiceThread which is blocked by the safepoint and thus the VMThread.
09-11-2018
A classic safepoint deadlock: you can't go to a safepoint holding a lock that may be needed by a safepoint VM operation. In this case apparently caused by grow explicitly trying to honor the safepoint protocol by explicit use of ThreadBlockInVM, while it has the table "locked". But looking briefly at the StringTable code and the ConcurrentHashTable code, it looks like this kind of interaction is supposed to be accommodated by the "grow" unlocking the table while it blocks for the safepoint. There must be some specific aspect of the current interaction which isn't fitting in that model.
09-11-2018
Any chance you can pre-size the StringTable in the VM-op prolog so that you block there waiting for the grow to finish, prior to initiating the actual safepoint. But then it depends on what triggered the serviceThread to do the grow in the first place? There seems an inevitable race here if a third-party can request the grow via the ServiceThread.
09-11-2018