Bug ID: JDK-8297796 GenShen: instrument the remembered set scan

Type: Task
Component: hotspot
Sub-Component: gc
Affected Version: repo-shenandoah

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2022-11-29
Updated: 2023-01-04
Resolved: 2022-12-28

Instrumentation of the remembered set scan loop used in GenShen to inform related performance work.

A dependent PR for the change associated with this ticket is now available at: https://github.com/ysramakrishna/shenandoah/pull/1. ( This PR depends on https://github.com/openjdk/shenandoah/pull/184 )
16-12-2022
Opened JDK-8298597 for `HdrSeq::merge()` needed for the changes stemming from preliminary review feedback.
12-12-2022
A pull request was submitted for review. URL: https://git.openjdk.org/shenandoah/pull/176 Date: 2022-12-01 19:55:45 +0000
08-12-2022
Preliminary/draft pull request can be found at: https://github.com/openjdk/shenandoah/pull/176/ Feedback is welcome. More work is in progress to get this to publishable/reviewable state.
01-12-2022
Ran some numbers on a cloud machine to measure the impact, on the concurrent remembered set scanning phase durations with a specific setting of Extremem, of the instrumentation disabled vs enabled in its current form via a runtime switch: Reference: 8.7 ms Specimen/Default (Disabled): 9.1 ms (+4.5% wrt reference) Specimen/Enabled: 13.9 ms (+52.7% wrt disabled) I'll run some numbers with SpecJBB as well to see how those look. Based on the %ge changes with the Extremem workload above, I'm inclined to hold off on checking this in in its current state until we either get more data to make us more comfortable with the impact with the instrumentation disabled, or have used this data to improve the shape of the instrumented loop itself which might reduce its impact when disabled. I'll update.
01-12-2022
Format of stats produced and how to interpret them: (sample) --------------------------------------------------- [1211.040s][info][gc,remset ] GC(9447) Worker 0 Card Stats Histo: [1211.040s][info][gc,remset ] GC(9447) dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.040s][info][gc,remset ] GC(9447) clean_run: [ 0.00 0.00 0.00 99.61 100.00 ] [1211.040s][info][gc,remset ] GC(9447) dirty_cards: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.040s][info][gc,remset ] GC(9447) clean_cards: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.040s][info][gc,remset ] GC(9447) max_dirty_run: [ 0.00 0.00 0.00 0.00 100.00 ] [1211.040s][info][gc,remset ] GC(9447) max_clean_run: [ 0.00 99.61 99.61 99.61 100.00 ] [1211.040s][info][gc,remset ] GC(9447) dirty_objs: [ 0.00 0.00 0.00 0.00 1392.00 ] [1211.040s][info][gc,remset ] GC(9447) clean_objs: [ 0.00 0.00 0.00 0.00 64.00 ] [1211.040s][info][gc,remset ] GC(9447) dirty_scans: [ 0.00 0.00 0.00 0.00 1297.00 ] [1211.040s][info][gc,remset ] GC(9447) clean_scans: [ 0.00 0.00 0.00 0.00 14.00 ] [1211.040s][info][gc,remset ] GC(9447) alternations: [ 0.00 0.00 0.00 0.00 32.00 ] ... <similarly for each worker> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are: dirty_run: the length of an uninterrupted run of dirty cards, interpreted as a percentage of a chunk of work assignment (cluster) processed by a thread clean_run: as above, but the length of an uninterrupted run of clean cards dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk max_dirty_run & max_clean_run: Similarly for the maximum of each. dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned dirty_scans, clean_scans: numbers of objects scanned by the closure alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
29-11-2022
The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is no performance impact (measured with extremem & specjbb). With the flag enabled there is a small performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this.
29-11-2022

Relates :	JDK-8298902 - AbsSeq::merge() should not dodge decaying stats
Relates :	JDK-8298597 - HdrSeq: support for a merge() method
Relates :	JDK-8298138 - Shenandoah: HdrSeq asserts "sub-bucket index (512) overflow for value ( 1.00)"