The JavaThread::oops_do code path currently contains at least three places where more or less global mutexes are acquired. This can lead to lock contention during parallel stack walking and thus to long root scan times.
Three locks are known to be taken in this code path. One of them is still a potential problem:
DerivedPointerTableGC_lock
- guards DerivedPointerTable::add, which is called for every C2-compiled stack frame that contains derived pointers. It is currently unknown how common derived pointers are in real workloads.
The other two are not (any longer) a problem:
OopMapCache::_mut
- guards all retrieval of InterpreterOopMap instances, which are used to scan a specific (Method*, bci) executing in the interpreter. The per-klass OopMapCaches are lazily allocated, as described below. The mutex protects the hash lookup, generation of a new oop map on a cache miss, hash table insertion after generation, and eviction of less recently used oop maps. See OopMapCache::lookup. This was dealt with by JDK-8186042, so it is no longer a problem.
OopMapCacheAlloc_lock
- guards the lazy initialization of InstanceKlass::_oop_map_cache. It is only taken if a thread observes _oop_map_cache == NULL, so unless new classes are added all the time, contention on it should disappear after warmup. Because of the double-checked locking pattern (DCLP), this lock is rarely hit and should not be a performance bottleneck.
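
For illustration, the double-checked locking used for the lazy _oop_map_cache initialization can be sketched roughly as below. This is a minimal stand-alone sketch in standard C++, not the HotSpot source: KlassSketch, OopMapCacheSketch, the counters and all_threads_agree are hypothetical names, and std::mutex and std::atomic stand in for HotSpot's own locking and atomic primitives.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-klass oop map cache.
struct OopMapCacheSketch {};

// Hypothetical stand-in for InstanceKlass, reduced to the lazy-init logic.
class KlassSketch {
  std::atomic<OopMapCacheSketch*> _cache{nullptr};
  std::mutex _alloc_lock;                  // plays the role of OopMapCacheAlloc_lock
  std::atomic<int> _lock_acquisitions{0};  // instrumentation for this sketch only
  std::atomic<int> _allocations{0};        // instrumentation for this sketch only

 public:
  ~KlassSketch() { delete _cache.load(); }

  OopMapCacheSketch* cache() {
    // First check: a single acquire load, no lock once the cache exists.
    OopMapCacheSketch* c = _cache.load(std::memory_order_acquire);
    if (c == nullptr) {
      std::lock_guard<std::mutex> guard(_alloc_lock);
      _lock_acquisitions.fetch_add(1);
      // Second check: another thread may have installed the cache
      // while this thread was waiting for the lock.
      c = _cache.load(std::memory_order_relaxed);
      if (c == nullptr) {
        c = new OopMapCacheSketch();
        _allocations.fetch_add(1);
        // Publish the fully constructed cache to lock-free readers.
        _cache.store(c, std::memory_order_release);
      }
    }
    assert(c != nullptr);
    return c;
  }

  int lock_acquisitions() const { return _lock_acquisitions.load(); }
  int allocations() const { return _allocations.load(); }
};

// Calls cache() from nthreads threads and reports whether every thread
// observed the same non-null cache pointer.
bool all_threads_agree(KlassSketch& k, int nthreads) {
  std::vector<OopMapCacheSketch*> seen(nthreads, nullptr);
  std::vector<std::thread> workers;
  for (int i = 0; i < nthreads; i++) {
    workers.emplace_back([&k, &seen, i] { seen[i] = k.cache(); });
  }
  for (auto& t : workers) {
    t.join();
  }
  for (OopMapCacheSketch* p : seen) {
    if (p != seen[0]) return false;
  }
  return seen[0] != nullptr;
}
```

In this sketch the lock is only reachable while _cache is still null, which matches the expectation above that contention on the allocation lock disappears after warmup.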