On some workloads, a Parallel Full GC running on a single thread is slower than a (naively equivalent) Serial GC. After JDK-8329203, Parallel breaks even with Serial when given 2..3 threads, which already pushed the break-even point quite low. It would be good to understand whether the remaining gap comes from intrinsic tradeoffs in the Parallel GC implementation, and/or whether the break-even point can be pushed lower still.
This would simplify deployments, as users would not be forced to choose Serial over Parallel when only 1 CPU is available for the GC to run on.
Example workload:
```
% cat Fragger.java
import java.util.concurrent.ThreadLocalRandom;

public class Fragger {
    static final int COUNT = 15_000_000;
    static final Object[] ARR = new Object[COUNT];

    public static void main(String... args) {
        // Fill the heap with many small objects.
        for (int c = 0; c < COUNT; c++) {
            ARR[c] = new byte[16];
        }

        // For 30 seconds: churn random slots to fragment the heap,
        // and trigger a Full GC every 500 ms.
        long end = System.nanoTime() + 30_000_000_000L;
        while (System.nanoTime() < end) {
            long nextGC = System.nanoTime() + 500_000_000L;
            while (System.nanoTime() < nextGC) {
                int idxFull = ThreadLocalRandom.current().nextInt(COUNT);
                ARR[idxFull] = new byte[16];
            }
            System.gc();
        }
    }
}
% build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseSerialGC -Xlog:gc Fragger.java
[24.901s][info][gc] GC(16) Pause Full (System.gc()) 790M->654M(3959M) 901.744ms
[26.374s][info][gc] GC(17) Pause Full (System.gc()) 1010M->654M(3959M) 972.383ms
[27.800s][info][gc] GC(18) Pause Full (System.gc()) 947M->654M(3959M) 925.915ms
[29.214s][info][gc] GC(19) Pause Full (System.gc()) 905M->518M(3959M) 914.314ms
[30.659s][info][gc] GC(20) Pause Full (System.gc()) 873M->654M(3959M) 943.951ms
% build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx4g -Xms4g -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Xlog:gc Fragger.java
[23.677s][info][gc] GC(11) Pause Full (System.gc()) 793M->518M(3925M) 1441.179ms
[25.616s][info][gc] GC(12) Pause Full (System.gc()) 793M->518M(3925M) 1438.795ms
[27.556s][info][gc] GC(13) Pause Full (System.gc()) 832M->518M(3925M) 1439.856ms
[29.512s][info][gc] GC(14) Pause Full (System.gc()) 891M->518M(3925M) 1455.564ms
[31.468s][info][gc] GC(15) Pause Full (System.gc()) 812M->518M(3925M) 1455.337ms
```
The collectors behave a bit differently, even with periodic Full GCs, so part of the difference might hide there as well. But the difference is still present in cycles that do a similar amount of work, e.g. "905M->518M(3959M) 914.314ms" vs "891M->518M(3925M) 1455.564ms".
Initial profiles are attached. The difference is likely at a lower level, e.g. Parallel's use of marking bitmaps vs. Serial's marking in the object headers, etc. parallel-1.html shows that we spend quite a bit of time in `ParCompactionManager::mark_and_push`; I think it would be interesting to dive in there.
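To make that marking-path difference concrete, here is a minimal, self-contained C++ sketch. This is not HotSpot code; the types, the 8-byte bit granularity, and the `mark_in_header`/`par_mark` names are illustrative assumptions. It only shows the structural contrast: a header mark is a plain bit flip on memory the collector touches anyway, while a side-bitmap mark needs an address-to-bit computation plus an atomic RMW once several GC threads can race on the same bitmap word.
```
// Hypothetical sketch, not HotSpot code.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Obj {
  uintptr_t header;  // single collector thread: plain load/store of a mark bit
};

// Serial-style: the mark bit lives in the object header the collector
// reads and rewrites anyway.
bool mark_in_header(Obj* o) {
  if (o->header & 1) return false;  // already marked
  o->header |= 1;
  return true;
}

// Parallel-style: the mark bit lives in a side bitmap, so each mark needs an
// address-to-bit computation plus (with several GC threads) an atomic RMW.
struct MarkBitmap {
  std::atomic<uint64_t>* words;  // one bit per 8-byte granule (assumption)
  uintptr_t base;

  bool par_mark(uintptr_t addr) {
    size_t bit_idx = (addr - base) >> 3;
    uint64_t mask = uint64_t(1) << (bit_idx & 63);
    uint64_t old = words[bit_idx >> 6].fetch_or(mask, std::memory_order_relaxed);
    return (old & mask) == 0;  // true if this thread marked the object first
  }
};

int main() {
  Obj o{};
  bool first = mark_in_header(&o);
  bool second = mark_in_header(&o);
  std::printf("header mark: first=%d second=%d\n", first, second);

  std::atomic<uint64_t> storage[4] = {};
  MarkBitmap bm{storage, uintptr_t(0x1000)};
  first = bm.par_mark(0x1040);
  second = bm.par_mark(0x1040);
  std::printf("bitmap mark: first=%d second=%d\n", first, second);
  return 0;
}
```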
Some initial questions to ponder:
a) Can/should Parallel use object headers for marking? (This likely has implications for the complexity of Compact Object Headers).
b) Can/should Parallel use memory_order_relaxed for bitmap accesses? (A minimal sketch follows this list.)
c) Maybe there is some minor code polishing in `ParCompactionManager::MarkingStatsCache::push` and friends that could give us a bit more of an edge?
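For (b), a small sketch of what relaxed ordering on the bitmap RMW would mean, again as a hedged illustration rather than HotSpot code (the bitmap layout and helper names are assumptions). The relaxed `fetch_or` keeps the mark atomic, so concurrent marks are never lost, but it carries no ordering with surrounding accesses; whether that is sufficient depends on the synchronization the marking phase already has, e.g. joining/terminating the GC workers before marked state is consumed.
```
// Hypothetical sketch, not HotSpot code.
#include <atomic>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <thread>

constexpr size_t kWords = 1 << 16;   // 64K bitmap words = 4M mark bits
std::atomic<uint64_t> bitmap[kWords];

// Default ordering: every mark is a sequentially consistent RMW.
bool mark_seq_cst(size_t bit) {
  uint64_t mask = uint64_t(1) << (bit & 63);
  return (bitmap[bit >> 6].fetch_or(mask) & mask) == 0;
}

// Relaxed ordering: still an atomic RMW, so concurrent marks are never lost,
// but it imposes no ordering on surrounding loads/stores. Other threads must
// learn that marking finished through some other synchronization point.
bool mark_relaxed(size_t bit) {
  uint64_t mask = uint64_t(1) << (bit & 63);
  return (bitmap[bit >> 6].fetch_or(mask, std::memory_order_relaxed) & mask) == 0;
}

int main() {
  // Two "GC workers" marking overlapping ranges with relaxed RMWs.
  auto worker = [](size_t lo, size_t hi) {
    for (size_t b = lo; b < hi; b++) mark_relaxed(b);
  };
  std::thread t1(worker, size_t{0}, kWords * 32);
  std::thread t2(worker, kWords * 16, kWords * 64);
  t1.join();
  t2.join();  // joining the workers publishes all their relaxed bitmap updates

  size_t set = 0;
  for (const auto& w : bitmap)
    set += std::bitset<64>(w.load(std::memory_order_relaxed)).count();
  std::printf("bits set: %zu (expected %zu)\n", set, kWords * 64);
  return 0;
}
```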