Recent scalability testing with G1 showed that some phases do not scale well with the number of threads.
Running Bigramtester@20gb on a large machine showed that with ~30 threads the Merge Log Buffers phase takes about 1% of GC pause time (~3ms) on average; with >100 threads it already takes around 13% (~14ms). Note that this is the same application, generating roughly the same number of cards.
This seems to be related to dequeuing buffers from the DCQS (DirtyCardQueueSet). Some testing showed that quadrupling the buffer size reduces the share of this phase to ~6% of pause time.
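
As a rough illustration of why larger buffers might help: for a fixed number of cards, the number of buffers that have to be dequeued shrinks in proportion to the buffer size. The sketch below is a back-of-the-envelope model only, not G1 code. The card count and the base buffer size of 256 entries are assumptions chosen for illustration, and the model assumes every buffer is completely filled, which is not necessarily the case in practice. It also does not model contention between threads; it only counts dequeue operations.

```python
def dequeue_count(total_cards: int, buffer_entries: int) -> int:
    """Buffers (and so dequeue operations) needed to hold total_cards,
    assuming every buffer except possibly the last is completely full."""
    return (total_cards + buffer_entries - 1) // buffer_entries  # ceiling division


# Hypothetical values, not measurements from the test run.
TOTAL_CARDS = 1_000_000      # assumed number of cards logged before the pause
BASE_BUFFER_ENTRIES = 256    # assumed base buffer size in entries

base_dequeues = dequeue_count(TOTAL_CARDS, BASE_BUFFER_ENTRIES)
quad_dequeues = dequeue_count(TOTAL_CARDS, 4 * BASE_BUFFER_ENTRIES)

print(f"base buffers:      {base_dequeues} dequeues")
print(f"quadrupled buffers: {quad_dequeues} dequeues")
```

Under these assumptions the quadrupled buffer size needs roughly a quarter of the dequeue operations. The measured improvement (13% to ~6%) is smaller than that, which suggests that dequeuing is not the only cost in this phase.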