When pushing items into the object copy task queue, ParallelGC has a "fast path" that checks for already-forwarded objects. It does not push them into the task queue, but instead fixes them up and does any remembered set processing inline. G1 takes a different approach: it gets the referenced object for the item, prefetches from that object, and then pushes the item onto the queue.
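To make the two strategies concrete, here is a minimal, self-contained C++ sketch. It is not the HotSpot code: the types and names (Obj, Slot, TaskQueue, push_with_forwarded_check, push_with_prefetch) are invented for illustration, and the real implementations also handle remembered sets, compressed oops, and mark-word encodings that are omitted here.

    // Illustrative sketch only; not HotSpot code.
    #include <cstdint>
    #include <deque>

    struct Obj {
      Obj*     forwardee = nullptr;   // non-null once the object has been copied
      uint64_t payload[8];
    };

    using Slot      = Obj**;          // a reference slot in some holder object
    using TaskQueue = std::deque<Slot>;

    // "Check for already forwarded" style (roughly the ParallelGC fast path):
    // forwarded objects are fixed up inline and never enter the queue.
    void push_with_forwarded_check(TaskQueue& q, Slot p) {
      Obj* obj = *p;                  // reading the header is likely a cache miss
      if (obj->forwardee != nullptr) {
        *p = obj->forwardee;          // fix up the slot inline
        // ... any remembered set processing would also happen here ...
      } else {
        q.push_back(p);               // not yet copied: defer to the queue consumer
      }
    }

    // "Prefetch and push" style (roughly the G1 approach): no header check,
    // just warm the cache for the eventual consumer and always push.
    void push_with_prefetch(TaskQueue& q, Slot p) {
      Obj* obj = *p;
      __builtin_prefetch(obj);        // GCC/Clang builtin; a hint, not a load
      q.push_back(p);
    }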
Recent measurements show that, for ParallelGC, using the approach taken by G1 provides better performance. Either approach appears to be at least as good as doing neither. Comparing the already-forwarded check against prefetch-and-push without any check, the latter ranges from performance-neutral to significantly better than the former, depending on the hardware configuration.
The reason for this difference seems to be (1) the cost of the check is relatively high because it is likely to take a cache miss, and (2) already forwarded objects are uncommon, so the fast path isn't often taken, failing to recover the cost of the check.
Focusing on SPECjbb2015:

* average fast path rate < 4%
* critical-jOPS improvement for prefetch vs. check-for-forwarded:
  (1) non-NUMA x64 - no significant difference
  (2) x64, 2 sockets x 8 cores - 5% improvement
  (3) x64, 2 sockets x 8 cores (hyperthreading off) - 9.5% improvement
  (4) aarch64 - 2.25% improvement
Configuration (3) might not be used in production, but it is interesting because hyperthreading should help mitigate the cost of cache misses.