In some cases (possibly high load machines) starting up the worker threads to process a gang task takes a lot of time.
Eg. Stress BPM or CRM Fuse
[Parallel Time: 420.8 ms, GC Workers: 8]
[GC Worker Start (ms): 1178165.2 1178377.4 1178457.8 1178572.3 1178572.6 1178572.9 1178573.2 1178573.4
Min: 1178165.2, Avg: 1178483.1, Max: 1178573.4, Diff: 408.1]
[Ext Root Scanning (ms): 44.4 0.1 0.1 0.1 0.1 0.1 0.0 0.0
Min: 0.0, Avg: 5.6, Max: 44.4, Diff: 44.3, Sum: 44.9]
[SH_PS_Universe_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[SH_PS_JNIHandles_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[SH_PS_ObjectSynchronizer_oops_do: 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.3, Diff: 0.3, Sum: 0.3]
[SH_PS_FlatProfiler_oops_do: 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.1]
[SH_PS_Management_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[SH_PS_SystemDictionary_oops_do: 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
[SH_PS_ClassLoaderDataGraph_oops_do: 8.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 1.1, Max: 8.9, Diff: 8.9, Sum: 8.9]
[SH_PS_jvmti_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[SH_PS_CodeCache_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[SH_PS_Threads_oops_do: 12.0 0.1 0.1 0.1 0.1 0.0 0.0 0.0
Min: 0.0, Avg: 1.5, Max: 12.0, Diff: 12.0, Sum: 12.3]
[SH_PS_StringTable_oops_do: 34.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 4.3, Max: 34.4, Diff: 34.4, Sum: 34.4]
[G1H_PS_filter_satb_buffers: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[G1H_PS_refProcessor_oops_do: 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Update RS (ms): 284.5 26.7 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 38.9, Max: 284.5, Diff: 284.5, Sum: 311.2]
[Processed Buffers: 260 188 0 0 0 0 0 0
Min: 0, Avg: 56.0, Max: 260, Diff: 260, Sum: 448]
[Scan RS (ms): 0.1 4.9 0.1 0.1 0.1 0.0 0.0 0.0
Min: 0.0, Avg: 0.7, Max: 4.9, Diff: 4.9, Sum: 5.5]
[Code Root Scanning (ms): 0.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.1, Max: 0.7, Diff: 0.7, Sum: 0.7]
[Object Copy (ms): 77.9 161.7 113.3 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 44.1, Max: 161.7, Diff: 161.7, Sum: 353.0]
[Termination (ms): 3.7 7.4 14.2 8.6 6.2 5.9 5.7 0.0
Min: 0.0, Avg: 6.5, Max: 14.2, Diff: 14.2, Sum: 51.7]
[Termination Attempts: 1 456 485 1 1 1 1 1
Min: 1, Avg: 118.4, Max: 485, Diff: 484, Sum: 947]
[GC Worker Other (ms): 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0
Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.3]
[GC Worker Total (ms): 410.6 201.5 127.8 8.9 6.4 6.1 5.9 0.1
Min: 0.1, Avg: 95.9, Max: 410.6, Diff: 410.5, Sum: 767.3]
[GC Worker End (ms): 1178575.8 1178579.0 1178585.5 1178581.1 1178579.0 1178579.0 1178579.0 1178573.5
Min: 1178573.5, Avg: 1178579.0, Max: 1178585.5, Diff: 12.0]
[Code Root Fixup: 1.8 ms]
[Code Root Migration: 0.1 ms]
[Clear CT: 0.8 ms]
[Other: 283.0 ms]
[Choose CSet: 0.1 ms]
[Ref Proc: 278.8 ms]
[Ref Enq: 0.2 ms]
[Card Redirty: 0.6 ms]
[Humongous Reclaim: 1.4 ms]
[Free CSet: 0.8 ms]
[Young Free CSet: 0.8 ms]
[Nonyoung Free CSet: 0.0 ms]
[Eden: 547.0M(547.0M)->0.0B(549.0M) Survivors: 53.0M->51.0M Heap: 2433.2M(3072.0M)->1898.7M(3072.0M)]
[Times: user=1.10 sys=0.01, real=0.71 secs]
As shown in above log snippet, this delay of the worker threads causes many of the threads contribute absolutely nothing to the pause time.
Also, in this case (stress bpm), parallel reference processing is enabled - without it, reference processing does not take any time at all while it uses ~300ms here. Apparently because we start up and shut down worker threads eight times.
Investigate and fix this.