Investigate how RedirtyLoggedCardsTask scales with the number of worker threads. In the current implementation, the threads contend for access to the log buffers through a single BufferNode, which becomes a bottleneck as the thread count grows. The cost per card is low, so the work distribution overhead dominates the task. Adding threads only increases the contention overhead, which leads to a regression in the task execution time.
Please see the attached images, where the TX_ prefix denotes a run with X worker threads, on a 20G heap with the BigRamTester microbenchmark.
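The contention pattern described above can be illustrated with a small standalone sketch. This is not HotSpot code: `Buffer`, `CardTable`, `run_shared_head` and `run_chunked` are hypothetical names invented for illustration. The first function has every worker claim one buffer at a time by CAS on a single shared head, which resembles the bottleneck in this report. The second shows one possible alternative, where workers claim several buffers per atomic operation. It is not necessarily the approach a fix would take.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a log buffer of card indices (not a HotSpot type).
struct Buffer {
  std::vector<std::size_t> cards;
  Buffer* next = nullptr;
};

// Hypothetical card table: one counter per card, so tests can check
// that each card was redirtied exactly once.
using CardTable = std::vector<std::atomic<int>>;

// Builds num_buffers buffers whose card indices are all distinct.
std::vector<Buffer> make_buffers(std::size_t num_buffers, std::size_t cards_per_buffer) {
  std::vector<Buffer> buffers(num_buffers);
  for (std::size_t b = 0; b < num_buffers; ++b)
    for (std::size_t c = 0; c < cards_per_buffer; ++c)
      buffers[b].cards.push_back(b * cards_per_buffer + c);
  return buffers;
}

// The per-card work is deliberately tiny, as in the report.
void redirty(const Buffer& buffer, CardTable& table) {
  for (std::size_t card : buffer.cards)
    table[card].fetch_add(1, std::memory_order_relaxed);
}

// Starts num_workers threads running the same work function and joins them.
template <typename Work>
void run_workers(unsigned num_workers, Work work) {
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < num_workers; ++i) threads.emplace_back(work);
  for (auto& t : threads) t.join();
}

// Pattern 1: one shared list head; each claim is a CAS on the same location,
// so every worker contends on it once per buffer.
void run_shared_head(std::vector<Buffer>& buffers, unsigned num_workers, CardTable& table) {
  for (std::size_t i = 0; i + 1 < buffers.size(); ++i) buffers[i].next = &buffers[i + 1];
  std::atomic<Buffer*> head{buffers.empty() ? nullptr : &buffers[0]};
  run_workers(num_workers, [&] {
    Buffer* cur = head.load();
    while (cur != nullptr) {
      // On failure, compare_exchange_weak reloads cur with the current head.
      if (head.compare_exchange_weak(cur, cur->next)) {
        redirty(*cur, table);
        cur = head.load();
      }
    }
  });
}

// Pattern 2 (assumed alternative): claim `chunk` buffers per atomic operation,
// which reduces the number of contended operations by roughly that factor.
void run_chunked(std::vector<Buffer>& buffers, unsigned num_workers,
                 std::size_t chunk, CardTable& table) {
  assert(chunk > 0);
  std::atomic<std::size_t> next{0};
  run_workers(num_workers, [&] {
    for (;;) {
      std::size_t begin = next.fetch_add(chunk);
      if (begin >= buffers.size()) break;
      std::size_t end = std::min(begin + chunk, buffers.size());
      for (std::size_t i = begin; i < end; ++i) redirty(buffers[i], table);
    }
  });
}
```

Timing both functions for the same buffer set while varying `num_workers` would be one way to reproduce the trend in miniature. Absolute numbers from such a toy will not match G1 behavior.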