Scanning card tables for dirty or non-clean cards is a time-consuming operation. Presently there are 3 core routines used for such scans, all in memory/cardTableModRefBS.cpp:
CardTableModRefBS::dirty_card_iterate()
CardTableModRefBS::dirty_card_range_after_reset()
CardTableModRefBS::non_clean_card_iterate_serial()
All three of these routines presently operate by iterating over the individual bytes in a range, comparing each byte to a value of interest (dirty_card for the first two, clean_card for the third).
For dirty_card scans we expect long runs of non-dirty cards; similarly, for non-clean card scans we expect long runs of clean cards. (Running various benchmarks with instrumented versions of these routines confirms this expectation.) The loop overhead of processing each byte individually is a substantial fraction of the cost of scanning such runs.
One method for improving performance would be to reduce loop overhead via loop unrolling. However, an even better approach is to test a whole word at once, rather than testing each byte separately.
For the case of scanning for non-clean cards this is relatively straightforward: a word-sized constant containing the clean_card value in all bytes can be constructed, and each word compared to that constant, falling back to a more precise byte-wise scan once a non-matching word is found.
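For illustration only (not the actual HotSpot code; the typedefs, the first_non_clean name, and the clean_card value used below are stand-ins assumed here), a word-at-a-time scan for the first non-clean card in a word-aligned range might look like this:

#include <stddef.h>  // size_t

typedef unsigned long uintx;  // stand-in for HotSpot's uintx
typedef signed char   jbyte;  // stand-in for HotSpot's jbyte

const jbyte clean_card = -1;  // assumed clean_card value (all bits set)

// Return a pointer to the first non-clean card in [cur, limit), or
// limit if every card in the range is clean.  Both pointers are
// assumed to be word-aligned.
jbyte* first_non_clean(jbyte* cur, jbyte* limit) {
  // A word with the clean_card value replicated into every byte.
  const uintx clean_word =
    ((~(uintx)0)/0xFF) * (uintx)(unsigned char)clean_card;
  for (; cur < limit; cur += sizeof(uintx)) {
    if (*(uintx*)cur != clean_word) {
      // Some byte in this word is non-clean; locate it precisely.
      for (size_t i = 0; i < sizeof(uintx); ++i) {
        if (cur[i] != clean_card) return &cur[i];
      }
    }
  }
  return limit;
}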
For scanning for dirty_card bytes that approach doesn't work: a whole-word equality test can only detect words in which every byte is dirty, whereas we need to find words containing any dirty byte. However, there is a fast method for determining whether any byte in a word is zero, and the dirty_card value is zero. (This fast method generalizes to other values, at additional cost.) The method is described here:
http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
Specifically (using the HotSpot uintx type for words):
bool haszero(uintx word) {
  const uintx mask1 = (~(uintx)0)/0xFF;          // each byte == 0x01
  const uintx mask2 = ((~(uintx)0)/0xFF) * 0x80; // each byte == 0x80
  return (((word - mask1) & ~word) & mask2) != 0;
}
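The additional cost mentioned above for values other than zero is one XOR against a word with the sought value replicated into every byte (the multiply folds into a constant when the value is known at compile time); the same bithacks page gives this generalization. A sketch, with the hasvalue name introduced here for illustration:

// True iff some byte of word equals value.  The XOR zeroes exactly
// those bytes of word that equal value, reducing the problem to
// haszero() above.
bool hasvalue(uintx word, unsigned char value) {
  const uintx mask1 = (~(uintx)0)/0xFF;  // each byte == 0x01
  return haszero(word ^ (mask1 * value));
}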
This compares favorably to an unrolled byte-at-a-time scan on a 32-bit platform, and is probably more than a factor-of-two improvement on a 64-bit platform. The improvement is even more substantial compared to the present (probably) non-unrolled byte-at-a-time scan.
There is some cost in handling non-word-aligned prefix and suffix bytes, but that cost is small and fixed for a given run of words, so the improved processing time for the run will dominate even for relatively short runs.
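Putting the pieces together, a sketch of a dirty-card scan over an arbitrary, possibly unaligned range (first_dirty and the dirty_card constant are illustrative stand-ins, with dirty_card taken to be zero as stated above, so haszero() applies directly; the typedefs are those from the earlier sketch):

#include <stdint.h>  // uintptr_t

const jbyte dirty_card = 0;  // assumed dirty_card value (zero)

// Return a pointer to the first dirty card in [start, limit), or
// limit if there is none.  Neither pointer need be word-aligned.
jbyte* first_dirty(jbyte* start, jbyte* limit) {
  const uintptr_t align = sizeof(uintx);
  jbyte* cur = start;
  // Byte-wise scan of the unaligned prefix, up to the first word
  // boundary (or limit, whichever comes first).
  jbyte* word_start = (jbyte*)(((uintptr_t)cur + align - 1) & ~(align - 1));
  if (word_start > limit) word_start = limit;
  for (; cur < word_start; ++cur) {
    if (*cur == dirty_card) return cur;
  }
  // Word-at-a-time scan of the aligned middle of the range.
  jbyte* word_limit = (jbyte*)((uintptr_t)limit & ~(align - 1));
  for (; cur < word_limit; cur += align) {
    if (haszero(*(uintx*)cur)) {
      // This word contains a zero (dirty) byte; locate it precisely.
      for (size_t i = 0; i < align; ++i) {
        if (cur[i] == dirty_card) return &cur[i];
      }
    }
  }
  // Byte-wise scan of the unaligned suffix.
  for (; cur < limit; ++cur) {
    if (*cur == dirty_card) return cur;
  }
  return limit;
}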