JDK-8019342 : G1: High "Other" time most likely due to card redirtying
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Affected Version: hs24,hs25
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2013-06-27
  • Updated: 2015-01-21
  • Resolved: 2014-04-17
  • Fix Versions: JDK 8u40 (Fixed), JDK 9 b12 (Fixed)
Description
We have seen a few instances in recent G1 logs where there is a fair amount of unattributed "other" time. An example:

170.552: [GC pause (G1 Evacuation Pause) (young) 170.552: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 15552, predicted base time: 18.87 ms, remaining time: 581.13 ms, target pause time: 600.00 ms]
 170.552: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 25 regions, survivors: 128 regions, predicted young region time: 730.95 ms]
 170.552: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 25 regions, survivors: 128 regions, old: 0 regions, predicted pause time: 749.82 ms, target pause time: 600.00 ms]
, 0.2650650 secs]
   [Parallel Time: 251.5 ms, GC Workers: 23]
      [GC Worker Start (ms): Min: 170552.6, Avg: 170553.1, Max: 170553.6, Diff: 1.0]
      [Ext Root Scanning (ms): Min: 0.1, Avg: 0.8, Max: 1.7, Diff: 1.5, Sum: 18.8]
      [Update RS (ms): Min: 0.9, Avg: 1.6, Max: 1.9, Diff: 1.0, Sum: 36.2]
         [Processed Buffers: Min: 2, Avg: 4.4, Max: 8, Diff: 6, Sum: 101]
      [Scan RS (ms): Min: 32.6, Avg: 32.9, Max: 33.1, Diff: 0.4, Sum: 755.8]
      [Object Copy (ms): Min: 213.8, Avg: 214.2, Max: 215.3, Diff: 1.5, Sum: 4925.9]
      [Termination (ms): Min: 0.0, Avg: 1.1, Max: 1.3, Diff: 1.3, Sum: 26.2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 1.3]
      [GC Worker Total (ms): Min: 250.1, Avg: 250.6, Max: 251.2, Diff: 1.0, Sum: 5764.2]
      [GC Worker End (ms): Min: 170803.7, Avg: 170803.7, Max: 170803.8, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 13.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.7 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 2.5 ms]
   [Eden: 100.0M(100.0M)->0.0B(2332.0M) Survivors: 512.0M->48.0M Heap: 1617.6M(12.0G)->1581.4M(12.0G)]
 [Times: user=5.77 sys=0.01, real=0.27 secs] 

In the log snippet above the "other" time was 13.0ms. The attributable part (Choose CSet 0.0 + Ref Proc 0.7 + Ref Enq 0.0 + Free CSet 2.5) comes to 3.2ms, leaving roughly 10ms unattributed. The entire pause was ~265ms, so 10ms is around 4% of the entire pause.

Another (more extreme) example (from the ECE application):

2013-05-17T19:27:14.480+0000: 10014.025: [GC pause (young)
Desired survivor size 117440512 bytes, new threshold 1 (max 1)
- age   1:   22673072 bytes,   22673072 total
 10014.025: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7189, predicted base time: 11.81 ms, remaining time: 48.19 ms, target pause time: 60.00 ms]
 10014.025: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 435 regions, survivors: 7 regions, predicted young region time: 17.55 ms]
 10014.025: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 435 regions, survivors: 7 regions, old: 0 regions, predicted pause time: 29.36 ms, target pause time: 60.00 ms]
, 0.2567180 secs]
   [Parallel Time: 26.5 ms, GC Workers: 7]
      [GC Worker Start (ms): Min: 10014025.9, Avg: 10014026.0, Max: 10014026.0, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 4.2, Avg: 4.9, Max: 6.2, Diff: 2.0, Sum: 34.0]
      [Update RS (ms): Min: 3.5, Avg: 4.7, Max: 5.5, Diff: 2.0, Sum: 33.2]
         [Processed Buffers: Min: 3, Avg: 10.6, Max: 17, Diff: 14, Sum: 74]
      [Scan RS (ms): Min: 6.5, Avg: 7.1, Max: 7.3, Diff: 0.8, Sum: 49.4]
      [Object Copy (ms): Min: 9.4, Avg: 9.7, Max: 9.9, Diff: 0.5, Sum: 67.6]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.4]
      [GC Worker Total (ms): Min: 26.3, Avg: 26.4, Max: 26.4, Diff: 0.1, Sum: 184.7]
      [GC Worker End (ms): Min: 10014052.3, Avg: 10014052.4, Max: 10014052.4, Diff: 0.1]
   [Code Root Fixup: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 229.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.1 ms]
      [Ref Enq: 0.0 ms]
      [Free CSet: 2.0 ms]
   [Eden: 1740.0M(1740.0M)->0.0B(1716.0M) Survivors: 28.0M->28.0M Heap: 12.5G(14.0G)->10.8G(14.0G)]
 [Times: user=0.19 sys=0.00, real=0.26 secs]

In this the "other" time was ~230ms (or almost 90% of the pause).

Mapping out the phases of the pause executed by the VMThread suggests that the likely culprit is the card redirtying that takes place near the end of G1CollectedHeap::evacuate_collection_set().
Comments
Note that card redirtying is still sometimes significant; e.g. BigRamTester from http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2012-July/004748.html gives the following output (from an internal build) - look at the time the "Redirty Cards" phase takes.

[GC pause (G1 Evacuation Pause) (young), 1.1682209 secs]
   [Parallel Time: 863.7 ms, GC Workers: 23]
      [GC Worker Start (ms): Min: 738211.4, Avg: 738211.6, Max: 738211.8, Diff: 0.5]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 4.1]
      [Update RS (ms): Min: 12.0, Avg: 12.3, Max: 12.8, Diff: 0.8, Sum: 283.2]
         [Processed Buffers: Min: 9, Avg: 11.1, Max: 13, Diff: 4, Sum: 256]
      [Scan RS (ms): Min: 377.1, Avg: 377.5, Max: 377.9, Diff: 0.8, Sum: 8683.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 471.6, Avg: 471.8, Max: 471.9, Diff: 0.3, Sum: 10850.4]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 1.4]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 2.3]
      [GC Worker Total (ms): Min: 861.6, Avg: 862.0, Max: 862.3, Diff: 0.6, Sum: 19825.2]
      [GC Worker End (ms): Min: 739073.5, Avg: 739073.6, Max: 739073.9, Diff: 0.4]
   [Code Root Fixup: 0.0 ms]
   [Code Root Migration: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 7.2 ms]
   [Other: 297.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.8 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 292.7 ms]
      [Free CSet: 2.2 ms, GC Workers: 23]
         [Free CSet Serial: 0.1 ms]
         [Young Free CSet: Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.4, Sum: 3.2]
         [Non-Young Free CSet: Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
   [Eden: 2688.0M(2688.0M)->0.0B(2688.0M) Survivors: 384.0M->384.0M Heap: 56.6G(60.0G)->55.7G(60.0G)]
 [Times: user=20.13 sys=0.01, real=1.16 secs]

The obvious workaround for this is to parallelize this phase.
10-03-2014
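As a rough illustration of the parallelization suggested above, here is a minimal standalone sketch of the claiming scheme such a phase could use. The CompletedBuffers and redirty_worker names, the card values and the std::thread set-up are simplified stand-ins, not the real DirtyCardQueueSet/AbstractGangTask code; the point is only that each worker claims whole completed buffers, so the synchronization cost is one lock acquisition per buffer rather than per card.

  // Standalone sketch only: stand-in types, not the real HotSpot classes.
  #include <cstddef>
  #include <cstdint>
  #include <mutex>
  #include <thread>
  #include <vector>

  typedef std::vector<uint8_t*> CardBuffer;        // one completed buffer of card pointers

  struct CompletedBuffers {                        // stand-in for the DirtyCardQueueSet
    std::mutex lock;
    std::vector<CardBuffer> buffers;

    bool claim(CardBuffer& out) {                  // pop one completed buffer, false if none left
      std::lock_guard<std::mutex> guard(lock);
      if (buffers.empty()) return false;
      out = buffers.back();
      buffers.pop_back();
      return true;
    }
  };

  // Worker body: repeatedly claim a buffer and redirty every card recorded in it.
  static void redirty_worker(CompletedBuffers* set) {
    CardBuffer buf;
    while (set->claim(buf)) {
      for (size_t i = 0; i < buf.size(); i++) {
        *buf[i] = 0;                               // 0 stands in for dirty_card_val()
      }
    }
  }

  int main() {
    std::vector<uint8_t> card_table(1 << 20, 0xff);  // "clean" cards
    CompletedBuffers set;
    for (size_t b = 0; b < 64; b++) {                // fill some completed buffers
      CardBuffer buf;
      for (size_t i = 0; i < 256; i++) {
        buf.push_back(&card_table[b * 16384 + i * 64]);
      }
      set.buffers.push_back(buf);
    }
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < (n != 0 ? n : 4u); i++) {
      workers.push_back(std::thread(redirty_worker, &set));
    }
    for (size_t i = 0; i < workers.size(); i++) {
      workers[i].join();
    }
    return 0;
  }

In the real code the worker loop would instead call one of the DirtyCardQueueSet::apply_closure_to_completed_buffer() variants from an AbstractGangTask::work() method, as described in the 28-06-2013 comment below.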

The most likely reason for the high "Other" time reported in the tests I have is evacuation failure handling: all occurrences of this phenomenon coincide with an allocation failure, i.e. to-space exhaustion in G1. The time spent on card redirtying is still too high, particularly in these cases, and should be reduced if possible; e.g. in some benchmarks we have 300k+ cards to redirty.
19-12-2013

Some initial testing showed that this is the case; in particular, removing the forwarding pointers accounts for almost all of the evacuation failure handling time that ends up in the "Other" time.
18-11-2013

While the following measurement is very likely from a different application, card redirtying does not seem to be the issue. A measurement probe put inside the "if (G1DeferredRSUpdate) { redirty }" block sometimes gives very low times together with a higher total "Other" time, e.g.:

   [Other: 37.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.1 ms]
      [Ref Enq: 0.0 ms]
      [Card Redirty: 0.2 ms]
      [Free CSet: 0.5 ms]

This unaccounted time is spent elsewhere.
15-11-2013

Further, skipping already dirtied cards does not help as there are typically almost no already dirtied cards (~0.1%) throughout the application. Also, there are almost no consecutive runs of cards, so the suggested fix to dirty cards on a range basis will probably not help. (All that assumes correct statistics gathering code of course :)
15-11-2013

One further comment about the idea of not dirtying already-dirty cards: maybe such a card could be marked in the buffer so that it is not reprocessed again. There is already a check in the refinement method that filters out clean (i.e. non-dirty) cards, but it still seems like a lot of effort until the code gets to that check. Need some statistics about the number of duplicates though.
09-11-2013

As a possible cleanup, the _closure member of the DirtyCardQueueSet should probably be removed and the closure passed explicitly every time. The code pattern for using it mostly looks like:

   SomeClosure cl;
   dcqs.set_closure(&cl);
   dcqs.apply_closure();

The code simply becomes clearer if it does not rely on a global closure set somewhere else. This could also be done in a separate cleanup step.
09-11-2013
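For contrast, the same call site with the closure passed explicitly might look like the following; the closure-taking apply_closure() overload shown here is hypothetical, sketched only to illustrate the cleanup, not an existing DirtyCardQueueSet method.

  // Today: the closure is stashed in the queue set as hidden state.
  SomeClosure cl;
  dcqs.set_closure(&cl);
  dcqs.apply_closure();

  // With the cleanup: the closure is an explicit argument, so the call site is
  // self-contained and does not depend on what was set elsewhere.
  // (Hypothetical overload, for illustration only.)
  SomeClosure cl;
  dcqs.apply_closure(&cl);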

Another possible problem with the existing code is that do_card_ptr() is called for every card - the call itself is probably much slower than the actual work performed for each card. Maybe it would be useful to allow per-buffer closures that by default forward to the single-card method, but that in cases like this can be implemented directly.
09-11-2013
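A minimal standalone sketch of that idea; the class names and the buffer representation below are stand-ins, not the real CardTableEntryClosure and buffer types. The point is only that the default do_buffer() forwards to the per-card method, while a hot closure like the redirtying one can override it with one tight loop and avoid a virtual call per card.

  #include <cstddef>
  #include <cstdint>

  class CardEntryClosure {                            // stand-in interface
  public:
    virtual ~CardEntryClosure() {}
    // Per-card method, as today.
    virtual bool do_card_ptr(uint8_t* card_ptr) = 0;
    // Per-buffer method; the default simply forwards to the per-card method.
    virtual void do_buffer(uint8_t** buf, size_t size) {
      for (size_t i = 0; i < size; i++) {
        do_card_ptr(buf[i]);
      }
    }
  };

  class RedirtyClosure : public CardEntryClosure {
  public:
    virtual bool do_card_ptr(uint8_t* card_ptr) {
      *card_ptr = 0;                                  // 0 stands in for dirty_card_val()
      return true;
    }
    // Specialized whole-buffer version: one tight loop, no virtual call per card.
    virtual void do_buffer(uint8_t** buf, size_t size) {
      for (size_t i = 0; i < size; i++) {
        *buf[i] = 0;
      }
    }
  };

  int main() {
    uint8_t cards[8] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };  // "clean" cards
    uint8_t* buf[3] = { &cards[1], &cards[5], &cards[6] };
    RedirtyClosure cl;
    cl.do_buffer(buf, 3);   // dirties cards 1, 5 and 6 without per-card dispatch
    return 0;
  }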

Please add an additional measurement probe for this phase: the log should contain explicit measurements of all major phases in the GC, and currently we can only deduce the time this phase takes by subtracting the timings of all existing sub-phases from the "Other" time. This is somewhat inconvenient.
07-11-2013
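A minimal sketch of what such a probe could look like around the redirtying block quoted in the last comment below. os::elapsedTime() is the timer the surrounding pause code already uses; the phase_times() recording call shown here is hypothetical, standing in for whatever hook ends up printing a "[Redirty Cards: ...]" line under "Other".

  if (G1DeferredRSUpdate) {
    double redirty_start = os::elapsedTime();

    RedirtyLoggedCardTableEntryFastClosure redirty;
    dirty_card_queue_set().set_closure(&redirty);
    dirty_card_queue_set().apply_closure_to_all_completed_buffers();

    // Hypothetical recording hook so the phase shows up explicitly in the log
    // instead of having to be deduced from the unattributed "Other" time.
    g1_policy()->phase_times()->record_redirty_logged_cards_time_ms(
        (os::elapsedTime() - redirty_start) * 1000.0);

    DirtyCardQueueSet& dcq = JavaThread::dirty_card_queue_set();
    dcq.merge_bufferlists(&dirty_card_queue_set());
    assert(dirty_card_queue_set().completed_buffers_num() == 0, "All should be consumed");
  }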

Fairly simple solutions exist for these issues. The code could be parallelized using an AbstractGangTask. The work method would use one of the

   bool DirtyCardQueueSet::apply_closure_to_completed_buffer(int worker_i, int stop_at, bool during_pause)

variants, or a parallel version of DirtyCardQueueSet::apply_closure_to_all_completed_buffers() (using get_completed_buffer).

The effect of the second issue could be mitigated if we try to build up an interval of contiguous cards before dirtying. For example:

   class RedirtyLoggedCardTableEntryFastClosure : public CardTableEntryClosure {
     jbyte* _lo;
     jbyte* _hi;

     // Dirties the cards in the exclusive range [lo, hi)
     void dirty_range(jbyte* lo, jbyte* hi) {
       assert(lo < hi, "sanity");
       Copy::fill_to_bytes(lo, (hi - lo), CardTableModRefBS::dirty_card_val());
     }

   public:
     RedirtyLoggedCardTableEntryFastClosure() : _lo(NULL), _hi(NULL) {}

     bool do_card_ptr(jbyte* card_ptr, int worker_i) {
       if (_lo == NULL) {
         assert(_hi == NULL, "none or both");
         _lo = card_ptr;
         _hi = card_ptr + 1;
       } else if (_lo <= card_ptr && card_ptr < _hi) {
         // Card is already in the range to be dirtied. Do nothing
       } else if (card_ptr == _lo - 1) {
         // Adjacent to lower bound of range => extend range
         _lo -= 1;
       } else if (card_ptr == _hi) {
         // Adjacent to upper bound of range => extend range
         _hi += 1;
       } else {
         // Card pointer is disjoint from the range. Dirty current range
         dirty_range(_lo, _hi);
         // Start new range
         _lo = card_ptr;
         _hi = card_ptr + 1;
       }
       assert(_lo <= card_ptr && card_ptr < _hi, "invariant");
       return true;
     }
   };

The above tries to build up a contiguous interval of cards. As we process cards, if they are immediately adjacent to the current interval we just extend the interval to cover the card. We only dirty the interval when we need to start a new interval, i.e. when we process a card that is neither adjacent to nor contained within the current interval. (Note that the last accumulated interval still needs to be flushed with dirty_range() once all buffers have been processed.) This should reduce the amount of store traffic.

To solve the third issue we can add a check to the code above to see whether the card is already dirty, and skip it if it is, for example:

   bool do_card_ptr(jbyte* card_ptr, int worker_i) {
     if (*card_ptr == CardTableModRefBS::dirty_card_val()) {
       // Skip cards that are already dirty
       return true;
     }
     if (_lo == NULL) {
       assert(_hi == NULL, "none or both");
       _lo = card_ptr;
       _hi = card_ptr + 1;
     } else if (_lo <= card_ptr && card_ptr < _hi) {
       // Card is already in the range to be dirtied. Do nothing
     } else if (card_ptr == _lo - 1) {
       // Adjacent to lower bound of range => extend range
       _lo -= 1;
     } else if (card_ptr == _hi) {
       // Adjacent to upper bound of range => extend range
       _hi += 1;
     } else {
       // Card pointer is disjoint from the range. Dirty current range
       dirty_range(_lo, _hi);
       // Start new range
       _lo = card_ptr;
       _hi = card_ptr + 1;
     }
     assert(_lo <= card_ptr && card_ptr < _hi, "invariant");
     return true;
   }
28-06-2013

The card redirtying code:

   if (G1DeferredRSUpdate) {
     RedirtyLoggedCardTableEntryFastClosure redirty;
     dirty_card_queue_set().set_closure(&redirty);
     dirty_card_queue_set().apply_closure_to_all_completed_buffers();
     DirtyCardQueueSet& dcq = JavaThread::dirty_card_queue_set();
     dcq.merge_bufferlists(&dirty_card_queue_set());
     assert(dirty_card_queue_set().completed_buffers_num() == 0, "All should be consumed");
   }

is part of the deferred RS updates mechanism. While evacuating objects during RSet scanning (i.e. using an instance of

   typedef G1ParCopyClosure<false, G1BarrierEvac, false> G1ParScanHeapEvacFailureClosure;

), we record the card spanned by the reference (p) that points to the object that has just been copied, into update buffers associated with special dirty card queues (and a dirty card queue set). These update buffers are then passed to the refinement threads (via the merge_bufferlists call in the code above) so that the refinement threads can update the remembered sets of the survivor and promotion regions. Prior to passing these update buffers on, the cards that are stored within them are redirtied. If the cards are not redirtied, the refinement code will see that the cards are clean and assume that they have already been refined. This can result in missing remembered set entries.

The code that redirties these cards is currently serial and applies the following closure to each card:

   class RedirtyLoggedCardTableEntryFastClosure : public CardTableEntryClosure {
   public:
     bool do_card_ptr(jbyte* card_ptr, int worker_i) {
       *card_ptr = CardTableModRefBS::dirty_card_val();
       return true;
     }
   };

There are a couple of issues here that could result in the bottleneck evidenced by the high "other" time. The first is that the code is serial. The second is that we dirty each card individually; since the cards can come in a fairly random order, I would bet that each store experiences a cache miss. Thirdly, we don't check whether a card is already dirty before dirtying it.
28-06-2013