During a young collection, ParallelGC partitions the old generation into 64kB stripes when scanning it for references into the young generation. These stripes are assigned to worker threads that do the scanning in parallel as work units.
Before this change, Parallel GC always scanned these stripes completely even if only a small part had been known to contain interesting references. Additionally, every worker thread processed the objects that start in that stripe by itself, including parts of objects that extend into other stripes. This behavior limited parallelism when processing large objects. A single large object, potentially containing thousands of references, had been scanned by a single thread only and in full. This would cause bad scaling due to memory sharing and cache misses in the subsequent long, work stealing phase.
With this change, Parallel GC workers limit work to their stripe and only process interesting parts of large object arrays. This reduces the work done by a single thread for a stripe, improves parallelism, and reduces the amount of work stealing. Parallel GC pauses are now on par with G1 in presence of large object arrays, reducing pause times by 4-5 times in some cases.