SPECjbb2005 (and any other application run with a large heap and many threads) with 200 worker threads, a 50G heap and a large young gen size (20G) shows that around 50% of the pause time outside of actual evacuation is not accounted for in the "Other" times.
Investigation showed that the problem is initializing the G1ParScanThreadState instances for the worker threads, and then later merging the per-thread collected values back into the global variables. The latter is about four times more expensive than the former.
The reason this is a problem is that it is an O(#threads * size of collection set) operation that is done in a single thread.
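To make the cost concrete, here is a minimal sketch of a serial merge, not the actual HotSpot code. It assumes each worker keeps one counter per collection set region and that a single thread folds all of them into global totals. `PerThreadState`, `surviving_words` and `merge_serial` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-thread state; this is NOT the real
// G1ParScanThreadState, only an illustration of its merge cost.
struct PerThreadState {
    // Assumption: one counter per collection set region, filled during evacuation.
    std::vector<std::size_t> surviving_words;
};

// Folds every thread's counters into the global totals on the calling thread.
// Returns the number of additions performed, which is always
// states.size() * global_totals.size(), i.e. #threads * collection set size.
std::size_t merge_serial(const std::vector<PerThreadState>& states,
                         std::vector<std::size_t>& global_totals) {
    std::size_t additions = 0;
    for (const PerThreadState& state : states) {
        // Every thread is expected to track the same set of regions.
        assert(state.surviving_words.size() == global_totals.size());
        for (std::size_t region = 0; region < global_totals.size(); ++region) {
            global_totals[region] += state.surviving_words[region];
            ++additions;
        }
    }
    return additions;
}
```

With 200 worker threads, the inner loop in this sketch runs 200 times over the whole collection set, all on one thread while the other workers sit idle.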
This CR should deal with improving the PSS information merge phase at the end of the GC. JDK-8150629 is about the initialization phase.