Seems to be related to JDK-8298048. Seen on Linux and macOS.
Comments
The analysis indicates everything is working as expected after the change to the CDS. Thanks everyone for pitching in.
14-06-2023
The problem is that the heap now takes much less space (the CDS data/interned strings occupy only a fraction of a single region, so they are compacted together with other data at the first System.gc()), leading to a much smaller overall Java heap (at these heap sizes, heap scaling according to live data is very aggressive), which in turn leads to more garbage collections.
E.g. running this on an internal machine, the total heap chosen by the heap sizing policy is completely different and smaller. This leads to 25% more GCs with the change, and they are slightly slower (because with the change the heap sizes are so small that thread scaling based on heap size kicks in aggressively).
This is expected behavior from a GC POV.
If the heap is fixed, including young gen, there is no performance difference (in fop at least).
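For reference, a run with the heap and young gen pinned could look something like this (the flag values here are illustrative, not the exact ones used for the measurements above):
$ ./1597/jdk-21/bin/java -Xlog:gc -Xms256m -Xmx256m -XX:NewSize=64m -XX:MaxNewSize=64m -jar dacapo-9.12-MR1.jar --size default --iterations 200 fop
With -Xms equal to -Xmx and NewSize equal to MaxNewSize, the heap sizing policy cannot shrink the heap or the young gen after the System.gc() calls, so the difference described above goes away.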
14-06-2023
The regression in "Dacapo fop" from JDK 21 b20 build 1596 to 1597 (before and after JDK-8298048) may be caused by
- More "Concurrent Mark Cycle" events, and/or
- The "Concurrent Mark Cycle" events take longer to execute
I'm reassigning to [~tschatzl] to take a look from the GC perspective.
06-06-2023
Some analysis of this benchmark and the changes in JDK-8298048:
Before JDK-8298048 (Combine CDS archive heap into a single block), all the archived Java strings were stored in the "closed" G1 archive region. These strings do not move and never get collected. I suspect that makes GC faster.
After JDK-8298048, there's no more "closed" region. All archived Java objects (including strings) are mapped in an "old" region.
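A quick cross-check (a sketch, not something taken from the measurements in this issue; the ./1596 path is assumed to mirror the ./1597 layout used below) is to take the archived heap out of the picture entirely and see whether the two builds still differ:
$ ./1596/jdk-21/bin/java -Xshare:off -Xlog:gc -Xmx256m -jar dacapo-9.12-MR1.jar --size default --iterations 200 fop
$ ./1597/jdk-21/bin/java -Xshare:off -Xlog:gc -Xmx256m -jar dacapo-9.12-MR1.jar --size default --iterations 200 fop
With -Xshare:off there is no archived heap to map at all, so any remaining difference between the builds would not be due to the archived strings.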
The "Dacapo fop" bechmark is very sensitive to GC speed. When running with this:
$ ./1597/jdk-21/bin/java -Xlog:gc -Xmx256m -jar dacapo-9.12-MR1.jar --size default --iterations 200 fop
===== DaCapo 9.12-MR1 fop completed warmup 198 in 139 msec =====
[29.238s][info][gc] GC(1025) Pause Full (System.gc()) 73M->6M(27M) 10.304ms
===== DaCapo 9.12-MR1 fop starting warmup 199 =====
[29.254s][info][gc] GC(1026) Pause Young (Normal) (G1 Evacuation Pause) 17M->7M(27M) 1.398ms
[29.268s][info][gc] GC(1027) Pause Young (Normal) (G1 Evacuation Pause) 17M->9M(27M) 1.913ms
[29.281s][info][gc] GC(1028) Pause Young (Normal) (G1 Evacuation Pause) 18M->11M(27M) 2.333ms
[29.295s][info][gc] GC(1029) Pause Young (Normal) (G1 Evacuation Pause) 18M->12M(142M) 3.123ms
===== DaCapo 9.12-MR1 fop completed warmup 199 in 124 msec =====
[29.373s][info][gc] GC(1030) Pause Full (System.gc()) 71M->6M(27M) 9.842ms
===== DaCapo 9.12-MR1 fop starting =====
[29.391s][info][gc] GC(1031) Pause Young (Normal) (G1 Evacuation Pause) 18M->7M(27M) 1.569ms
[29.409s][info][gc] GC(1032) Pause Young (Normal) (G1 Evacuation Pause) 18M->9M(27M) 2.923ms
[29.424s][info][gc] GC(1033) Pause Young (Normal) (G1 Evacuation Pause) 19M->12M(27M) 3.336ms
[29.437s][info][gc] GC(1034) Pause Young (Concurrent Start) (G1 Evacuation Pause) 19M->14M(142M) 2.233ms
[29.437s][info][gc] GC(1035) Concurrent Mark Cycle
[29.444s][info][gc] GC(1035) Pause Remark 16M->16M(67M) 2.502ms
[29.445s][info][gc] GC(1035) Pause Cleanup 18M->18M(67M) 0.023ms
[29.445s][info][gc] GC(1035) Concurrent Mark Cycle 8.485ms
[29.507s][info][gc] GC(1036) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 64M->26M(81M) 9.574ms
===== DaCapo 9.12-MR1 fop PASSED in 142 msec =====
You can see that iteration 199 (124 ms) is significantly faster than iteration 200 (142 ms) because the latter performs more GC work (it enters a Concurrent Mark Cycle).
The benchmark calls System.gc() before every iteration, but that doesn't seem to reduce the variation across iterations.
In our harness, we run "fop" for 200 iterations, but take only the score of the last iteration. So the score is dominated by whether the last run happens to enter the "Concurrent Mark Cycle".
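For what it's worth, the per-iteration variation can be eyeballed by pulling the warmup times out of the benchmark output (a sketch; fop.log is a hypothetical file holding output like the excerpt above):
$ grep -oE "warmup [0-9]+ in [0-9]+ msec" fop.log | awk '{print $2, $4}'
This prints an "iteration time" pair per warmup, which makes the slow iterations easy to spot.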