Merging heap roots (remembered sets and log buffers) into the card table for later scanning to a large degree constitutes random accesses to the card table.
Putting a small prefetch queue between calculating the card table address and inspecting the given card table value decreases merge remembered set time by 20-30% and merge log buffers time by 40-50% (on x64. AArch64 shows similar if not better improvements).
Applications not having a significant amount of either remembered sets or log buffers do not show significant difference.