Bug ID: JDK-8320165 Parallel: Full GC code is very slow due to quadratic calc_new

JDK-8320165 : Parallel: Full GC code is very slow due to quadratic calc_new_address

Type: Enhancement
Component: hotspot
Sub-Component: gc
Affected Version: 11,17,21,22

Priority: P4
Status: Resolved
Resolution: Duplicate

Submitted: 2023-11-15
Updated: 2024-05-27
Resolved: 2024-05-27

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

Other
tbdResolved

Related Reports

Duplicate :	JDK-8329203 - Parallel: Investigate Mark-Compact for Full GC to decrease memory usage
Relates :	JDK-8332969 - Improve Parallel Full GC to break even with Serial Full GC sooner
Relates :	JDK-8323791 - Parallel: Using non-atomic for live bytes update during Full GC
Relates :	JDK-8325553 - Parallel: Use per-marker cache for marking stats during Full GC
Relates :	JDK-8310031 - Parallel: Implement better work distribution for large object arrays in old gen

Description

Was working on some GC tests, and notices the major oddity in Parallel GC behavior. The workload like this:

```
import java.util.concurrent.ThreadLocalRandom;

public class Fragger {
	static final int COUNT = 15_000_000;
	static final Object[] ARR = new Object[COUNT];

	public static void main(String... args) {
		for (int c = 0; c < COUNT; c++) {
			ARR[c] = new byte[16];
		}

		long end = System.nanoTime() + 30_000_000_000L;
		while (System.nanoTime() < end) {
			int idxFull = ThreadLocalRandom.current().nextInt(COUNT);
			ARR[idxFull] = new byte[16];
		}
	}
}
```

With Parallel, Full GC times are out of whack:

```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseParallelGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.333s][info][gc] GC(2) Pause Full (Ergonomics) 530M->471M(981M) 652.579ms
[3.315s][info][gc] GC(3) Pause Full (Ergonomics) 727M->517M(981M) 1721.082ms
[5.904s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 2283.419ms
[9.001s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 2673.602ms
[12.439s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 3013.188ms
[16.079s][info][gc] GC(7) Pause Full (Ergonomics) 773M->517M(981M) 3219.932ms
[19.654s][info][gc] GC(8) Pause Full (Ergonomics) 773M->517M(981M) 3247.373ms
[23.692s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3705.432ms
[27.382s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3346.816ms
[31.079s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3366.989ms
```

For comparison, Shenandoah's Full GC is 7x (!) faster:

```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseShenandoahGC -XX:+UnlockDiagnosticVMOptions -XX:ShenandoahGCMode=passive -XX:-ShenandoahDegeneratedGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.290s][info][gc] GC(0) Pause Full 1023M->517M(1024M) 389.849ms
[2.328s][info][gc] GC(1) Pause Full 1023M->517M(1024M) 422.842ms
[3.353s][info][gc] GC(2) Pause Full 1023M->517M(1024M) 432.260ms
[4.378s][info][gc] GC(3) Pause Full 1023M->517M(1024M) 433.945ms
[5.501s][info][gc] GC(4) Pause Full 1023M->517M(1024M) 439.280ms
[6.527s][info][gc] GC(5) Pause Full 1023M->517M(1024M) 433.924ms
[7.601s][info][gc] GC(6) Pause Full 1023M->517M(1024M) 430.845ms
[8.629s][info][gc] GC(7) Pause Full 1023M->517M(1024M) 426.524ms
[9.675s][info][gc] GC(8) Pause Full 1023M->517M(1024M) 439.208ms
```

Even Serial is faster!

```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseSerialGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[25.993s][info][gc] GC(35) Pause Full (Allocation Failure) 989M->552M(989M) 978.552ms
[27.330s][info][gc] GC(37) Pause Full (Allocation Failure) 825M->552M(989M) 972.567ms
[28.683s][info][gc] GC(39) Pause Full (Allocation Failure) 825M->552M(989M) 986.712ms
[30.050s][info][gc] GC(41) Pause Full (Allocation Failure) 825M->517M(989M) 1003.309ms
[31.409s][info][gc] GC(43) Pause Full (Allocation Failure) 790M->552M(989M) 992.085ms
```

I think part of it is scalability problem. Running with -XX:ParallelGCThreads=1 shortens the gap against Shenandoah to "only" 2x, and Parallel is 3x slower than Serial. Hacking PSParallelCompact::enqueue_dense_prefix_tasks to hand out a region per thread and adding region distribution logging shows that one region takes the exorbitant amount of time to process.

```
[6.518s][info][gc] GC(4) Worker 0 finished regions [0; 1) in 93 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [1; 2) in 3035901 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [2; 3) in 1 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [3; 4) in 0 us
... all other regions take 0 us too...
[9.731s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 4508.276ms
```

The sample profile points to live_words_in_range as the bottleneck, see "asprof-parallelgc-1thread.html`. The profile gets worse with the default number of parallel GC threads, as all those workers are spinning waiting for a loner worker to complete.

I think that one loner thread is busy iterating the large reference array and figuring out the new locations for objects referenced from it. It does not matter if that array spans multiple regions or not: the walk would still be single-threaded. Plus, since the referenced objects are not guaranteed to be laid out linearly in memory, the cache (JDK-8145318) is not effective.

So we enter the slow path, and to compute the new address, we need compute the number of live words in the block that preceed the object (since those objects would preceed the given object after the sliding). Which means, if we randomly iterate over many blocks, we end up doing one bitmap search for every object, which costs O(N) of number of live objects in the block. If most objects in the block are live, then it amounts to O(N^2) total time to deal with entire thing, where N is the average number of objects per block. Even if the block is small (currently 128 words), N could be large for the ensemble of small objects: about 8 words long object gives 32 objects per block.

This is corroborated by simple performance experiment: twiddling `ParallelCompactData::Log2BlockSize`.

```
# ParallelCompactData::Log2BlockSize = 16 (maximum, fills entire region)
[8.118s][info][gc] GC(3) Pause Full (Ergonomics) 727M->518M(981M) 6134.907ms
[17.144s][info][gc] GC(4) Pause Full (Ergonomics) 774M->517M(981M) 8663.333ms
[28.121s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 10606.085ms
[39.806s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 11390.484ms

# ParallelCompactData::Log2BlockSize = 7 (default)
[22.000s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3052.236ms
[25.388s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3094.713ms
[29.115s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3388.381ms
[33.196s][info][gc] GC(12) Pause Full (Ergonomics) 773M->517M(981M) 3735.875ms

# ParallelCompactData::Log2BlockSize = 4
[26.898s][info][gc] GC(19) Pause Full (Ergonomics) 774M->518M(981M) 1166.031ms
[28.191s][info][gc] GC(20) Pause Full (Ergonomics) 774M->518M(981M) 991.396ms
[29.643s][info][gc] GC(21) Pause Full (Ergonomics) 774M->518M(981M) 1152.030ms
[31.055s][info][gc] GC(22) Pause Full (Ergonomics) 774M->518M(981M) 979.091ms

# ParallelCompactData::Log2BlockSize = 1 (minimum)
[28.284s][info][gc] GC(23) Pause Full (Ergonomics) 774M->518M(981M) 880.228ms
[29.501s][info][gc] GC(24) Pause Full (Ergonomics) 774M->518M(981M) 882.738ms
[30.662s][info][gc] GC(25) Pause Full (Ergonomics) 774M->517M(981M) 844.468ms
[31.871s][info][gc] GC(26) Pause Full (Ergonomics) 773M->517M(981M) 873.624ms
```

It probably a design/implementation question: can we do the pointer calculation like Serial, G1 and Shenandoah Full GCs do, go linearly over the whole objects in the block/region/heap and compute all addresses in one swing, instead of going to possibly random objects and trying to figure out new addresses on the spot? I think current scheme is the part of Scissor, but I need to brush up on that. It would probably require storing the fwdptr information somewhere.

If not, can we do smaller block size to limit the impact of live bitmap scans?

If not, can we chunk the array walks in compact to avoid the single-threaded execution like in these examples?

Comments

It is a time-based test, the number of GCs would be different if GC times are different. I suggest the rest to be handled as part of JDK-8332969.
27-05-2024
The different number of GCs for serial and parallel also indicate a significant difference in the object heap.
27-05-2024
And as [~shade] mentions in JDK-8332969, Serial uses the bits within the object header which is just that much faster wrt to marking (>10 years ago I did some thorough experiments on marking, and there has always been a significant advantage of in-object-header marking even when executing marking one after another in the same pause).
27-05-2024
$ java -Xms1g -Xmx1g -Xlog:gc -XX:MarkSweepDeadRatio=0 -XX:-G1UseAdaptiveIHOP -XX:InitiatingHeapOccupancyPercent=100 -XX:ParallelGCThreads=1 -XX:+UseG1GC Fragger 2>&1 \| grep "Pause Full" [13,725s][info][gc] GC(22) Pause Full (G1 Compaction Pause) 1017M->516M(1024M) 1410,134ms [23,488s][info][gc] GC(32) Pause Full (G1 Compaction Pause) 1017M->516M(1024M) 1462,753ms $ java -Xms1g -Xmx1g -Xlog:gc -XX:MarkSweepDeadRatio=0 -XX:-G1UseAdaptiveIHOP -XX:InitiatingHeapOccupancyPercent=100 -XX:ParallelGCThreads=1 -XX:+UseSerialGC Fragger 2>&1 \| grep "Pause Full" [2,172s][info][gc] GC(2) Pause Full (Allocation Failure) 796M->516M(989M) 1131,518ms [3,677s][info][gc] GC(3) Pause Full (Allocation Failure) 789M->516M(989M) 1251,434ms [5,257s][info][gc] GC(4) Pause Full (Allocation Failure) 789M->516M(989M) 1303,510ms [6,811s][info][gc] GC(5) Pause Full (Allocation Failure) 789M->516M(989M) 1296,091ms [8,369s][info][gc] GC(6) Pause Full (Allocation Failure) 789M->516M(989M) 1310,513ms [9,971s][info][gc] GC(7) Pause Full (Allocation Failure) 789M->516M(989M) 1348,175ms [11,550s][info][gc] GC(8) Pause Full (Allocation Failure) 789M->516M(989M) 1329,150ms [13,136s][info][gc] GC(9) Pause Full (Allocation Failure) 789M->516M(989M) 1334,472ms [14,717s][info][gc] GC(10) Pause Full (Allocation Failure) 789M->516M(989M) 1332,726ms [16,306s][info][gc] GC(11) Pause Full (Allocation Failure) 789M->516M(989M) 1338,752ms [17,909s][info][gc] GC(12) Pause Full (Allocation Failure) 789M->516M(989M) 1351,787ms [19,499s][info][gc] GC(13) Pause Full (Allocation Failure) 789M->516M(989M) 1340,460ms [21,135s][info][gc] GC(14) Pause Full (Allocation Failure) 789M->516M(989M) 1388,541ms [22,722s][info][gc] GC(15) Pause Full (Allocation Failure) 789M->516M(989M) 1327,531ms [24,319s][info][gc] GC(16) Pause Full (Allocation Failure) 789M->516M(989M) 1331,035ms [25,901s][info][gc] GC(17) Pause Full (Allocation Failure) 789M->516M(989M) 1332,279ms [27,484s][info][gc] GC(18) Pause Full (Allocation Failure) 789M->516M(989M) 1333,392ms [29,073s][info][gc] GC(19) Pause Full (Allocation Failure) 789M->516M(989M) 1335,271ms [30,658s][info][gc] GC(20) Pause Full (Allocation Failure) 789M->516M(989M) 1334,799ms $ java -Xms1g -Xmx1g -Xlog:gc -XX:MarkSweepDeadRatio=0 -XX:-G1UseAdaptiveIHOP -XX:InitiatingHeapOccupancyPercent=100 -XX:ParallelGCThreads=1 -XX:+UseParallelGC Fragger 2>&1 \| grep "Pause Full" [2,549s][info][gc] GC(2) Pause Full (Ergonomics) 760M->516M(981M) 1585,332ms [4,567s][info][gc] GC(3) Pause Full (Ergonomics) 772M->516M(981M) 1802,661ms [6,682s][info][gc] GC(4) Pause Full (Ergonomics) 772M->516M(981M) 1900,758ms [8,841s][info][gc] GC(5) Pause Full (Ergonomics) 772M->516M(981M) 1945,947ms [11,031s][info][gc] GC(6) Pause Full (Ergonomics) 772M->516M(981M) 1974,400ms [13,245s][info][gc] GC(7) Pause Full (Ergonomics) 772M->516M(981M) 2000,014ms [15,490s][info][gc] GC(8) Pause Full (Ergonomics) 772M->516M(981M) 2028,764ms [17,731s][info][gc] GC(9) Pause Full (Ergonomics) 772M->516M(981M) 2021,731ms [19,979s][info][gc] GC(10) Pause Full (Ergonomics) 772M->516M(981M) 2024,897ms [22,268s][info][gc] GC(11) Pause Full (Ergonomics) 772M->516M(981M) 2072,345ms [24,572s][info][gc] GC(12) Pause Full (Ergonomics) 772M->516M(981M) 2070,735ms [26,800s][info][gc] GC(13) Pause Full (Ergonomics) 772M->516M(981M) 2013,425ms [29,029s][info][gc] GC(14) Pause Full (Ergonomics) 772M->516M(981M) 2014,165ms [31,248s][info][gc] GC(15) Pause Full (Ergonomics) 772M->516M(981M) 2004,321ms Initially this looks like G1 and serial are roughly the same, while Parallel is worse. The problem is that the algorithms are applied to completely different heaps at least for G1, so I would not count on that... Not sure about the exact reasons, but the marking phase in Parallel takes a lot longer than on Serial (which makes sense because of synchronization), not sure about the G1 results as it probably works on a completely different heap. Just to give an idea on how much that can impact marking time, with fiddling of min/max young gen size I can make G1 full gc pause times go below 1000ms.
27-05-2024
Submitted: JDK-8332969 for Parallel GC. I think G1 Full GC is a less important issue, since G1 Full GC is basically a failure mode, whereas Parallel Full GC is a normal mode.
27-05-2024
> I think it is unfortunate that Serial is still quite a bit faster than Parallel with 1 thread. Do you have a relevant bug to research that? If not, I can submit one. No. I believe Serial full-gc is also faster than G1 full-gc if ParallelGCThreads=1 and MarkSweepDeadRatio=0
27-05-2024
Also attached asprof-parallelgc-1thread-after-8329203.html, which shows the profile on the test after JDK-8329203. I think it shows expected improvement. I think it is unfortunate that Serial is still quite a bit faster than Parallel with 1 thread. Do you have a relevant bug to research that? If not, I can submit one.
27-05-2024
Confirmed, JDK-8329203 makes the whole thing an order of magnitude better. Changeset right at JDK-8329203: % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseParallelGC -Xlog:gc Fragger.java 2>&1 \| grep "Pause Full" [4.550s][info][gc] GC(8) Pause Full (Ergonomics) 808M->551M(981M) 253.212ms [5.090s][info][gc] GC(9) Pause Full (Ergonomics) 807M->551M(981M) 257.742ms [5.734s][info][gc] GC(10) Pause Full (Ergonomics) 807M->552M(981M) 259.942ms [6.395s][info][gc] GC(11) Pause Full (Ergonomics) 808M->551M(981M) 253.697ms Changeset right before JDK-8329203: % build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseParallelGC -Xlog:gc Fragger.java 2>&1 \| grep "Pause Full" [21.611s][info][gc] GC(8) Pause Full (Ergonomics) 808M->551M(981M) 3172.968ms [25.062s][info][gc] GC(9) Pause Full (Ergonomics) 807M->552M(981M) 3140.458ms [28.492s][info][gc] GC(10) Pause Full (Ergonomics) 808M->552M(981M) 3124.207ms [31.552s][info][gc] GC(11) Pause Full (Ergonomics) 808M->551M(981M) 2756.006ms
27-05-2024
I need to verify JDK-8329203 fixes it. On the surface, I think it does. I'll close as duplicate when I confirm.
27-05-2024
[~shade] Now that JDK-8329203 is merged, do you think if this ticket can be closed? For potential further improvement, maybe it's better to create a new ticket capturing precisely what is intended.
27-05-2024
JDK-8329203 fixes (part of) the scalability issue. On my box, single-threaded Parallel is slower than Serial, but catches up using 3 gc-threads, and the trend of shorter pause with more gc-threads is clear.
14-05-2024
> I think part of it is scalability problem. Probably some overlapping with JDK-8323791
25-01-2024