With JDK-8213108, consecutive dirty cards are scanned as a single block. This may result in pushing a lot of references to scan onto the task queue (64 KB chunk size, i.e. 16k references max).
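For reference, the 16k figure follows from the chunk size if each reference occupies 4 bytes. That size is an assumption inferred from the numbers above (it matches compressed references), not something stated in this issue. The helper name below is hypothetical:

```cpp
#include <cstddef>

// Upper bound on the number of references that fit into one chunk,
// assuming the chunk is densely packed with references.
// Illustrative helper, not part of HotSpot.
constexpr std::size_t max_refs_per_chunk(std::size_t chunk_bytes,
                                         std::size_t ref_bytes) {
  return chunk_bytes / ref_bytes;
}

// Assumed 4-byte references: 64 KB / 4 B = 16384 (16k) references.
// With 8-byte references the bound would halve to 8192.
```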
During development of JDK-6672778 we found that having fewer references on the task queue is advantageous for memory prefetching/caching reasons.
Investigate the impact of splitting blocks at a smaller granularity than the chunk size.
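A minimal sketch of what such splitting could look like, assuming a block is represented as a half-open range of card indices. `CardRange` and `split_block` are hypothetical names for illustration only and do not correspond to actual G1 code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: a half-open range [start, end) of card indices.
struct CardRange {
  std::size_t start;
  std::size_t end;
};

// Split one block of consecutive dirty cards into sub-blocks of at most
// max_cards cards each, so that each sub-block can be scanned (and its
// references pushed) separately. The last sub-block may be shorter.
std::vector<CardRange> split_block(CardRange block, std::size_t max_cards) {
  assert(max_cards > 0 && "sub-block size must be positive");
  std::vector<CardRange> result;
  for (std::size_t cur = block.start; cur < block.end; ) {
    std::size_t len = block.end - cur;
    if (len > max_cards) {
      len = max_cards;
    }
    result.push_back({cur, cur + len});
    cur += len;
  }
  return result;
}
```

With a limit of 4 cards, a block covering cards [0, 10) would be split into [0, 4), [4, 8) and [8, 10). Whether scanning these sub-blocks one at a time actually reduces the number of references sitting on the task queue depends on how the scan interleaves pushing and draining, which this sketch does not model.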
BigRAMTester and SPECjbb2015 have a few GCs (at the start iirc) where the block/chunk size is very small, i.e. G1 can find many large blocks.
Very brief experiments, e.g. limiting the block size to 4 cards (an arbitrarily chosen number), showed no conclusive advantage or disadvantage in pause times.