Bug ID: JDK-8322484 22-b26 Regression in J2dBench-bimg_misc-G1 (and more) on Windows-x64 and macOS-x64

Type: Bug
Component: hotspot
Sub-Component: gc
Affected Version: 22

Priority: P2
Status: Closed
Resolution: Fixed
OS: linux,os_x,windows
CPU: x86_64

Submitted: 2023-12-19
Updated: 2024-07-03
Resolved: 2024-01-29

JDK 22	JDK 23
22.0.2 Fixed	23 b08 Fixed

Integration of JDK-8318706 into 22-b26 has regressed J2dBench-bimg_misc-G1 on Windows-x64 and macOS-x64 by about 1%. Linux-aarch64 also shows the regression, but to a smaller degree.

Regression was isolated by measuring CI builds.

Additional benchmarks affected:
1% J2dBench-bimg_imageio-G1 on Linux-x64
2% J2dBench-vimg_images_opq-G1 on Windows-x64
2% J2dBench-vimg_shapes_gradient-G1 on Windows-x64
2% J2dBench-vimg_shapes_solid-G1 on Windows-x64
9% J2dBench-vimg_copyarea-G1 on Windows-x64

A pull request was submitted for review. URL: https://git.openjdk.org/jdk22u/pull/36 Date: 2024-01-30 09:13:24 +0000
08-02-2024
jdk22u fix request. Reason: performance regression for any application that uses Get/ReleasePrimitiveArrayCritical. The reason why it only shows up in these tests on Windows is that j2dbench is the only application in our perf testing that does that. Change: Implements a per-thread cache that reduces the overhead of Get/ReleasePrimitiveArrayCritical methods. Risk estimate: low, due to fairly high test coverage of the affected code. Test coverage: tier1-7 in jdk-jdk repo, no issues in jdk-jdk since initial push. (Also ran tier1-7 in jdk22 repo with no issues.)
06-02-2024
jdk22 defer request Reason: Asking to defer this issue out of jdk22 after getting a rejection on the jdk22 integration request as the risk/reward ratio is too high.
30-01-2024
Fix request for JDK 22 GA rejected. These are not simple changes, and they fix only a small performance regression on Windows in some corner cases. I would suggest giving the fix more time for testing in mainline rather than tying it to the JDK 22 GA timeline. The JDK 22u repo is already open, and you can push it there when ready.
29-01-2024
Fix request. Reason: causes up to 9% performance regression after implementation of JEP 423 (https://bugs.openjdk.org/browse/JDK-8318706) for applications that perform lots of native access via Get/ReleasePrimitiveArrayCritical (read: at least a few 100k/s, as with some graphics APIs). The issue affects all platforms; the regression only shows up on Windows because on other platforms the OS API calls do not do any Get/ReleasePrimitiveArrayCritical accesses. Change: Implements a per-thread cache that reduces the overhead of Get/ReleasePrimitiveArrayCritical methods. Risk estimate: low, due to fairly high test coverage of the affected code. Test coverage: tier1-7 in jdk-jdk repo, same in jdk22 repo (tier5-7 still running at time of writing). Given that there is some time left in the rdp2 process, may want to wait for a few more days of baking in mainline.
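The per-thread cache idea described in this fix request can be illustrated with a minimal C++ sketch. This is not the HotSpot implementation: `Region`, `PinCache`, and all member names are hypothetical, and the sketch assumes that the collector only reads the shared pin counts after every thread has called `flush()` (for example at a safepoint). Each thread would own one cache (for instance declared `thread_local`), so that repeated pin/unpin pairs on the same region touch only thread-private state.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a heap region that carries a shared pin count.
struct Region {
    std::atomic<long> pin_count{0};
};

// Hypothetical per-thread cache: remembers the last region touched and the
// net pin-count change not yet published to that region's shared counter.
class PinCache {
public:
    void pin(Region& r)   { adjust(r, +1); }
    void unpin(Region& r) { adjust(r, -1); }

    // Publishes the pending change (if any) with a single atomic operation.
    // Assumption: called before the shared counts are inspected by a GC.
    void flush() {
        if (region_ != nullptr && pending_ != 0) {
            long previous =
                region_->pin_count.fetch_add(pending_, std::memory_order_relaxed);
            assert(previous + pending_ >= 0 && "pin count underflow");
            (void)previous;  // unused when assertions are compiled out
            ++shared_updates_;
        }
        region_ = nullptr;
        pending_ = 0;
    }

    // Number of atomic updates actually issued on shared counters.
    std::size_t shared_updates() const { return shared_updates_; }

private:
    void adjust(Region& r, long delta) {
        if (region_ != &r) {  // switching regions: publish the old one first
            flush();
            region_ = &r;
        }
        pending_ += delta;    // thread-private, no atomic needed
    }

    Region* region_ = nullptr;
    long pending_ = 0;
    std::size_t shared_updates_ = 0;
};
```

In a loop that pins and unpins arrays in the same region, the pending count returns to zero each iteration, so `flush()` has nothing to publish and no atomic operation is issued at all.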
29-01-2024
A pull request was submitted for review. URL: https://git.openjdk.org/jdk22/pull/99 Date: 2024-01-29 09:26:18 +0000
29-01-2024
Changeset: 0d5f5e15 Author: Thomas Schatzl <tschatzl@openjdk.org> Date: 2024-01-29 08:36:51 +0000 URL: https://git.openjdk.org/jdk/commit/0d5f5e15d43f94a79c6133baecd5af217365d176
29-01-2024
The reason why only Windows is affected is that Atomic::add implementations for Windows ignore the memory order argument, always doing a full barrier. That matters in this case where the call has memory_order_relaxed.
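To illustrate the distinction this comment relies on, here is a small sketch using standard C++ `std::atomic` rather than HotSpot's own `Atomic::add`. The function names are hypothetical. A relaxed add only guarantees atomicity of the counter update, while a sequentially consistent add also orders surrounding memory accesses, which is analogous to an implementation that ignores the requested order and always issues a full barrier. Whether the two variants compile to different instructions depends on the compiler and CPU, so the sketch shows the API-level difference only and makes no performance claim.

```cpp
#include <atomic>
#include <cassert>

// Relaxed add: atomic update of the counter, no ordering of other accesses.
inline long add_relaxed(std::atomic<long>& counter, long value) {
    return counter.fetch_add(value, std::memory_order_relaxed) + value;
}

// Sequentially consistent add: the same update plus full ordering, analogous
// to an implementation that ignores the memory order argument.
inline long add_full_barrier(std::atomic<long>& counter, long value) {
    return counter.fetch_add(value, std::memory_order_seq_cst) + value;
}

// Applies both variants the same number of times; the resulting counts are
// identical, only the ordering guarantees (and possibly the cost) differ.
long count_both(long iterations) {
    std::atomic<long> relaxed_counter{0};
    std::atomic<long> ordered_counter{0};
    for (long i = 0; i < iterations; ++i) {
        add_relaxed(relaxed_counter, 1);
        add_full_barrier(ordered_counter, 1);
    }
    assert(relaxed_counter.load() == ordered_counter.load());
    return relaxed_counter.load();
}
```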
25-01-2024
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/17552 Date: 2024-01-24 12:38:09 +0000
25-01-2024
In particular, the copyarea benchmark does nothing but lock, do a short graphics op, and unlock, hence the large regression (see the numbers above for that benchmark). A special build that skips the actual atomic lock/unlock (because there are no garbage collections between complete iterations) brings performance back to the previous level.
11-01-2024
The problem seems to stem from the increased length of the pin/unpin operations, most of it likely caused by the additional atomic operations, two per locked object. Within the 18s the benchmark runs, around 142M objects are locked and unlocked (in total 184M additional atomic operations). Investigating why only Windows is affected, and options for mitigating the impact.
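The cost pattern described here, two atomic operations per locked object, can be sketched as follows. This is a simplified model with hypothetical names (`Region`, `pin`, `unpin`, `run_critical_sections`), not the G1 code: it assumes a single shared counter per region that is incremented on lock and decremented on unlock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a heap region that carries a shared pin count.
struct Region {
    std::atomic<std::size_t> pin_count{0};
};

// Pin: one atomic increment on the shared counter.
inline void pin(Region& r) {
    r.pin_count.fetch_add(1, std::memory_order_relaxed);
}

// Unpin: one atomic decrement on the shared counter.
inline void unpin(Region& r) {
    std::size_t previous = r.pin_count.fetch_sub(1, std::memory_order_relaxed);
    assert(previous > 0 && "pin count underflow");
    (void)previous;  // unused when assertions are compiled out
}

// Runs lock / short op / unlock cycles and returns how many atomic
// operations were issued on the shared counter.
std::size_t run_critical_sections(Region& r, std::size_t iterations) {
    std::size_t atomic_ops = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        pin(r);
        ++atomic_ops;
        // ... a short operation on the pinned array would go here ...
        unpin(r);
        ++atomic_ops;
    }
    return atomic_ops;
}
```

When the work between lock and unlock is very short, as in the copyarea case, these two atomic operations make up a large share of each iteration.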
11-01-2024

Relates :	JDK-8322985 - [BACKOUT] 8318562: Computational test more than 2x slower when AVX instructions are used
Relates :	JDK-8318706 - Implement JEP 423: Region Pinning for G1
Relates :	JDK-8324823 - G1: Provide automated way to check for pin count underflows