Recent scalability testing with G1 showed that some phases do not scale well with the number of threads.
Running Bigramtester@20gb on a large machine showed that with ~30 threads the Redirty Logged Cards phase takes about 1% of GC pause time (~3ms) on average, while with >100 threads it already takes around 5% (~6ms). Note that this is the same application, generating roughly the same number of cards.
This seems to be related to iterating over the buffers in the RDCQS. Some testing showed that quadrupling the buffer size decreases this time to ~1.5% of the pause.
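
One plausible reading of the buffer-size result is that, for the same number of cards, larger buffers mean fewer buffers to iterate over. The following Python sketch only illustrates that arithmetic. The card count (1,000,000) and base buffer size (256 entries) are assumed values for illustration, not measurements from the runs above, and buffer_count is a hypothetical helper, not a HotSpot function.

```python
def buffer_count(cards: int, buffer_size: int) -> int:
    """Return how many fixed-size buffers are needed to hold `cards` entries."""
    # Integer ceiling division: a partially filled buffer still counts as one.
    return (cards + buffer_size - 1) // buffer_size


if __name__ == "__main__":
    cards = 1_000_000   # assumed card count, for illustration only
    base_size = 256     # assumed base buffer size in entries

    for factor in (1, 4):
        size = base_size * factor
        print(f"buffer size {size}: {buffer_count(cards, size)} buffers")
```

Under these assumptions, quadrupling the buffer size cuts the number of buffers to roughly a quarter, which would reduce any per-buffer overhead in the iteration accordingly.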