JDK-8309639 : GenShen: Regression in LRU cache benchmark
  • Type: Bug
  • Component: hotspot
  • Sub-Component: gc
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2023-06-08
  • Updated: 2023-08-17
  • Resolved: 2023-08-17
Description
From Christine Flood, see https://github.com/openjdk/jdk/pull/14185#issuecomment-1579278537 :

I wrote an LRU program back in 2017 which allocates trees and stores them in an array in a round-robin fashion, freeing the last allocated. At the time it was written, its purpose was to show how generational GCs can hit the wall and start performing very badly. I ran this on a clean openjdk build, a genshen build in generational mode, and a genshen build in non-generational mode. These results are repeatable for me.

I would like to understand where the degradation is coming from before moving forward with this patch, since it appears to penalize those who just want to run traditional Shenandoah.

Clean
[cflood@fedora java_programs]$ ~/genshen/cleanjdk/build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC LRU 1000 1000
Took 341892ms to allocate 1000000 trees in a cache of 1000

Genshen generational (we expect this to be bad)
[cflood@fedora java_programs]$ ~/genshen/jdk/build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational LRU 1000 1000
Took 442012ms to allocate 1000000 trees in a cache of 1000

Genshen non-generational (shows what I feel is a significant degradation from the clean build)
[cflood@fedora java_programs]$ ~/genshen/jdk/build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC LRU 1000 1000
Took 395679ms to allocate 1000000 trees in a cache of 1000

I think that generational Shenandoah can be a big win for some applications, but I want to fully understand the cost for all applications.

I can't attach a .java file, so here it is inline in the post.

class TreeNode {
    public TreeNode left, right;
    public int val;
}

public class LRU {
    static int cache_size;
    static int reps;
    static int tree_height = 16;

    private static TreeNode[] trees;

    private static int getIndex(int i) { return i % cache_size; }

    // Builds a complete binary tree of height h (2^h - 1 nodes).
    private static TreeNode makeTree(int h) {
        if (h == 0) {
            return null;
        } else {
            TreeNode res = new TreeNode();
            res.left = makeTree(h - 1);
            res.right = makeTree(h - 1);
            res.val = h;
            return res;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("LRU requires args: cache_size reps");
            return;
        }
        cache_size = Integer.parseInt(args[0]);
        reps = Integer.parseInt(args[1]) * cache_size;
        trees = new TreeNode[cache_size];

        long start = System.currentTimeMillis();
        // Round-robin: each store overwrites the oldest tree in the cache,
        // making it unreachable.
        for (int i = 0; i < reps; i++)
            trees[getIndex(i)] = makeTree(tree_height);
        long end = System.currentTimeMillis();
        long ms = end - start;

        System.out.println("Took " + ms + "ms to allocate " + reps + " trees in a cache of " + cache_size);
    }
}
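
For a sense of scale, here is a back-of-the-envelope sketch of what the runs above allocate and keep live. None of this appears in the original report: the LRUSizing class name and the ~32 bytes-per-TreeNode figure are assumptions (a rough guess for a two-reference, one-int object with compressed oops), used only to estimate orders of magnitude.

```
// Back-of-the-envelope sizing for the LRU runs above (illustrative only;
// BYTES_PER_NODE is an assumed figure, not something stated in the report).
public class LRUSizing {
    static final long BYTES_PER_NODE = 32;   // rough guess for a TreeNode with compressed oops

    public static void main(String[] args) {
        long cacheSize = 1000, repsPerSlot = 1000, treeHeight = 16;

        long nodesPerTree = (1L << treeHeight) - 1;      // 65,535 nodes in a height-16 tree
        long liveNodes    = nodesPerTree * cacheSize;    // ~65.5 million nodes kept live
        long totalTrees   = cacheSize * repsPerSlot;     // 1,000,000 trees allocated overall
        long totalNodes   = nodesPerTree * totalTrees;   // ~65.5 billion node allocations

        System.out.printf("nodes/tree: %,d%n", nodesPerTree);
        System.out.printf("live set:   ~%,d nodes (~%,d MB)%n",
                liveNodes, liveNodes * BYTES_PER_NODE / (1024 * 1024));
        System.out.printf("allocated:  ~%,d nodes (~%,d GB)%n",
                totalNodes, totalNodes * BYTES_PER_NODE / (1024L * 1024 * 1024));
    }
}
```

Under those assumptions the live set is on the order of 2 GB, and each tree survives roughly cache_size iterations of allocation before its slot is overwritten, so most garbage dies "old" rather than young, which is the pattern the report describes as hard on generational collectors.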
Comments
A pull request was submitted for review. URL: https://git.openjdk.org/shenandoah/pull/307 Date: 2023-08-16 23:31:33 +0000
16-08-2023

We have confirmed that single-generation Shenandoah in the Generational Shenandoah build performs 623 collections, whereas single-generation Shenandoah in JDK 20 performs only 151. We have further confirmed that all of the collections in the Generational Shenandoah build are abbreviated, and that every one of them is labeled as a learning cycle. We believe we understand the origin of this regression:

1. We discovered during testing of Generational Shenandoah that the presence of abbreviated cycles in the recent GC history would sometimes confuse the triggering heuristic. The triggering heuristic uses recent GC times as a predictor of how much time the next GC will take, so if it believes that all GC cycles will be as fast as a recently observed abbreviated cycle, the result is late triggers and degenerated cycles.

2. For that reason, in the Generational Shenandoah version of this code we disqualify abbreviated GC cycle times from the calculation of "average GC" time.

3. Apparently, for the same reason, we also do not count an abbreviated cycle as a learning cycle, because an abbreviated cycle contributes nothing to our knowledge of how long a normal GC cycle takes.

Here's a possible way to address this. At the end of concurrent marking, if we choose to abbreviate this cycle, add the following:

```
if we are in learning mode and we are abbreviating this cycle:
    add the approximation that normal-gc-time = concurrent-mark-time * 2.5,
    and claim credit for one learning cycle
```

Note: For a run of specjbb, which performed 1068 GC cycles, the total time spent in concurrent marking was 1,426 seconds, concurrent evacuation 387 seconds, and concurrent update refs 1,161 seconds. The ratio of total concurrent GC time to concurrent marking time is therefore approximately 2.1 (about 2,974 s / 1,426 s). Choosing to multiply by 2.5 allows triggers to be conservatively early.
09-06-2023
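
The fix itself lands in HotSpot's heuristics code (see the pull request above); purely as an illustration of the idea in the comment, the following is a minimal Java model. Everything here is hypothetical: the class and method names (LearningHeuristicSketch, recordCycle, addSample), the REQUIRED_LEARNING_CYCLES value, and the running-mean predictor. Only the 2.5 scale factor and the "claim credit for one learning cycle" step come from the proposal above.

```
// Minimal model of the proposed triggering-heuristic change (illustrative only;
// this is not the HotSpot implementation, and all names here are hypothetical).
public class LearningHeuristicSketch {
    static final int REQUIRED_LEARNING_CYCLES = 5;   // assumed count, not from the report
    static final double ABBREVIATED_SCALE = 2.5;     // from the proposal: normal ~ 2.5 * mark time

    private int learningCycles = 0;
    private double avgCycleSeconds = 0.0;
    private int samples = 0;

    boolean isLearning() { return learningCycles < REQUIRED_LEARNING_CYCLES; }

    // Called at the end of a GC cycle.
    void recordCycle(double concurrentMarkSeconds, double fullCycleSeconds, boolean abbreviated) {
        if (abbreviated) {
            // Existing behavior: abbreviated cycles are ignored entirely, so a workload whose
            // cycles are all abbreviated never leaves the learning phase.
            // Proposed behavior: approximate a normal cycle from the mark time and still
            // claim credit for one learning cycle.
            if (isLearning()) {
                addSample(concurrentMarkSeconds * ABBREVIATED_SCALE);
                learningCycles++;
            }
            return;  // abbreviated times stay excluded from the average after learning
        }
        addSample(fullCycleSeconds);
        if (isLearning()) learningCycles++;
    }

    private void addSample(double seconds) {
        samples++;
        avgCycleSeconds += (seconds - avgCycleSeconds) / samples;  // running mean
    }

    double predictedCycleSeconds() { return avgCycleSeconds; }
}
```

Using 2.5 rather than the measured ratio of about 2.1 deliberately over-estimates the next cycle, so the trigger errs on the side of starting GC early rather than late, as noted in the comment.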