Scanning card tables for dirty or non-clean cards is a time-consuming operation. Presently there are 3 core routines used for such scans, all in memory/cardTableModRefBS.cpp:
CardTableModRefBS::dirty_card_iterate()
CardTableModRefBS::dirty_card_range_after_reset()
CardTableModRefBS::non_clean_card_iterate_serial()
All three of these routines presently operate by iterating over the individual bytes in a range, comparing each byte to a value of interest (dirty_card for the first two, clean_card for the third).
For dirty_card scans we expect long runs of non-dirty cards; similarly, for non-clean card scans we expect long runs of clean cards. (Running various benchmarks with instrumented versions of these routines confirms this expectation.) The loop overhead of processing each byte individually is a substantial fraction of the cost of scanning such runs.
One method for improving performance would be to reduce loop overhead via loop unrolling. However, an even better approach is to test a whole word at once, rather than testing each byte separately.
For the case of scanning for non-clean cards this is relatively straightforward: a word-sized constant containing the clean_card value in all bytes can be constructed, and each word compared to that constant, falling back to a more precise byte-wise scan once a non-matching word is found.
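For illustration only (not the actual HotSpot code; the typedefs, the first_non_clean name, and the clean_card value used below are stand-ins assumed here), a word-at-a-time scan for the first non-clean card in a word-aligned range might look like this:

#include <stddef.h>  // size_t

typedef unsigned long uintx;  // stand-in for HotSpot's uintx
typedef signed char   jbyte;  // stand-in for HotSpot's jbyte

const jbyte clean_card = -1;  // assumed clean_card value (all bits set)

// Return a pointer to the first non-clean card in [cur, limit), or
// limit if every card in the range is clean.  Both pointers are
// assumed to be word-aligned.
jbyte* first_non_clean(jbyte* cur, jbyte* limit) {
  // A word with the clean_card value replicated into every byte.
  const uintx clean_word =
    ((~(uintx)0)/0xFF) * (uintx)(unsigned char)clean_card;
  for (; cur < limit; cur += sizeof(uintx)) {
    if (*(uintx*)cur != clean_word) {
      // Some byte in this word is non-clean; locate it precisely.
      for (size_t i = 0; i < sizeof(uintx); ++i) {
        if (cur[i] != clean_card) return &cur[i];
      }
    }
  }
  return limit;
}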
For scanning for dirty_card bytes that approach doesn't work: a whole-word equality test can only detect words in which every byte is dirty, whereas we need to find words containing any dirty byte. However, there is a fast method for determining whether any byte in a word is zero, and the dirty_card value is zero. (This fast method generalizes to other values, at additional cost.) The method is described here:
http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
Specifically (using the HotSpot uintx type for words):
bool haszero(uintx word) {
  const uintx mask1 = (~(uintx)0)/0xFF;          // each byte == 0x01
  const uintx mask2 = ((~(uintx)0)/0xFF) * 0x80; // each byte == 0x80
  return (((word - mask1) & ~word) & mask2) != 0;
}
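The additional cost mentioned above for values other than zero is one XOR against a word with the sought value replicated into every byte (the multiply folds into a constant when the value is known at compile time); the same bithacks page gives this generalization. A sketch, with the hasvalue name introduced here for illustration:

// True iff some byte of word equals value.  The XOR zeroes exactly
// those bytes of word that equal value, reducing the problem to
// haszero() above.
bool hasvalue(uintx word, unsigned char value) {
  const uintx mask1 = (~(uintx)0)/0xFF;  // each byte == 0x01
  return haszero(word ^ (mask1 * value));
}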
This compares favorably to an unrolled byte-at-a-time scan on a 32-bit platform, and is probably more than a factor-of-two improvement on a 64-bit platform. The improvement is even more substantial compared to the present (probably) non-unrolled byte-at-a-time scan.
There is some cost in handling non-word-aligned prefix and suffix bytes, but that cost is small and fixed for a given run of words, so the improved processing time for the run will dominate even for relatively short runs.
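Putting the pieces together, a sketch of a dirty-card scan over an arbitrary, possibly unaligned range (first_dirty and the dirty_card constant are illustrative stand-ins, with dirty_card taken to be zero as stated above, so haszero() applies directly; the typedefs are those from the earlier sketch):

#include <stdint.h>  // uintptr_t

const jbyte dirty_card = 0;  // assumed dirty_card value (zero)

// Return a pointer to the first dirty card in [start, limit), or
// limit if there is none.  Neither pointer need be word-aligned.
jbyte* first_dirty(jbyte* start, jbyte* limit) {
  const uintptr_t align = sizeof(uintx);
  jbyte* cur = start;
  // Byte-wise scan of the unaligned prefix, up to the first word
  // boundary (or limit, whichever comes first).
  jbyte* word_start = (jbyte*)(((uintptr_t)cur + align - 1) & ~(align - 1));
  if (word_start > limit) word_start = limit;
  for (; cur < word_start; ++cur) {
    if (*cur == dirty_card) return cur;
  }
  // Word-at-a-time scan of the aligned middle of the range.
  jbyte* word_limit = (jbyte*)((uintptr_t)limit & ~(align - 1));
  for (; cur < word_limit; cur += align) {
    if (haszero(*(uintx*)cur)) {
      // This word contains a zero (dirty) byte; locate it precisely.
      for (size_t i = 0; i < align; ++i) {
        if (cur[i] == dirty_card) return &cur[i];
      }
    }
  }
  // Byte-wise scan of the unaligned suffix.
  for (; cur < limit; ++cur) {
    if (*cur == dirty_card) return cur;
  }
  return limit;
}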