JDK-8312188 : Performance regression in SharedRuntime::frem/drem() on non-Windows x86 after JDK-8302191
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 21, 22
  • Priority: P2
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: x86
  • Submitted: 2023-07-17
  • Updated: 2023-08-18
  • Fix Version: JDK 22 (22 b16, Unresolved)
Description
There are two performance regressions around float/double modulo:
1) The first one (this bug) is observed after JDK-8302191, which changed SharedRuntime::frem/drem() on non-Windows x64 systems to no longer use the C library fmod() implementation but instead use direct x86 assembly.
2) The second regression only affects JDK 22 and is observed after the intrinsification of float/double modulo with JDK-8308966. The details for that regression are tracked separately in JDK-8312233.


The regression introduced by JDK-8302191 can be observed when running Blender2.java with a product/release VM (the machine used for the numbers below has AVX-512 support, but the regression is also observed on AVX2-only machines):

Commit just before JDK-8302191 which is JDK-8304683 (https://github.com/openjdk/jdk/commit/760c0128a4ef787c8c8addb26894c072ba8b2eb1):

$ java Blender2.java

Output:
164 ms
161 ms
166 ms
164 ms
162 ms
160 ms
168 ms
163 ms
161 ms
167 ms
Average: 163 ms


Commit of JDK-8302191 (https://github.com/openjdk/jdk/commit/37774556da8a5aacf55884133ae936ed5a28eab2):

$ java Blender2.java

Output:
255 ms
260 ms
256 ms
307 ms
257 ms
258 ms
265 ms
255 ms
260 ms
255 ms
Average: 262 ms


This suggests that the direct x86 assembly of SharedRuntime::frem/drem() is slower than the code executed by fmod(). We should take a closer look at the assembly produced by fmod() and improve our x86 assembly to fix the observed regressions.
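
Blender2.java itself is an attachment to this issue; purely as a hypothetical illustration of the kind of loop that keeps SharedRuntime::drem on the hot path (the class name, constants, and iteration counts below are made up, not taken from the attachment), a stand-alone timing sketch could look like this:

public class DremBench {
    public static void main(String[] args) {
        for (int round = 0; round < 10; round++) {
            long start = System.nanoTime();
            double acc = 0.0;
            for (int i = 1; i < 20_000_000; i++) {
                // double '%' here ends up in SharedRuntime::drem (or, in JDK 22, the intrinsic from JDK-8308966)
                acc += (i * 0.7182818) % 3.141592653589793;
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            // use acc so the loop cannot be eliminated as dead code
            System.out.println(ms + " ms (checksum " + acc + ")");
        }
    }
}

A sketch like this makes it easy to compare the two commits above (or two JDK builds) directly; the absolute numbers will of course differ from the Blender2.java figures.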


---- Original Report ----
There is potentially a significant (~25%) performance regression on a micro-ish benchmark in JDK 21 (see attachment).

The regression appears at least on Linux x86, but did not appear on macOS x86. No other platforms were tried.

With Blender.java, the second Java sample from the blog post [1], there is a significant drop in performance between JDK 20.0.1 and JDK 21 using the latest binaries [2].

On the Ubuntu workstation,
    JDK 20.0.1 runs Blender in 822 ms
    JDK 21 runs Blender in 1125 ms

(see attachments for full source code and sample output)
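
For reference, a side-by-side comparison of this kind can be invoked roughly as follows (the install paths are illustrative, not the exact setup used for the attachments):

$ ~/jdk-20.0.1/bin/java Blender.java
$ ~/jdk-21/bin/java Blender.java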

[1] https://www.graalvm.org/22.1/examples/java-performance-examples/
[2] https://jdk.java.net/


Comments
[~mtrudeau] Can someone test the issue reported here using the proposed fix for JDK-8314056 so we can see if this regression is addressed sufficiently by that issue? If so, I will close this as a duplicate. Thanks.
18-08-2023

[~dholmes] Please see my latest comment in JDK-8314056. I believe the issues reported in all of its linked JBSs will be resolved with the latest PR in JDK-8314056. There is still the "issue" of the magnitude difference of the numerator and denominator showing varying performance values for the non-AVX versions of the code, but I believe that's an un-fixable problem.
16-08-2023

[~sgibbons] so can we close this issue as a duplicate of something else, or is there further specific work that may be needed to alleviate this issue compared to that in JDK-8312233 or JDK-8314056?
15-08-2023

I believe so. Check https://bugs.openjdk.org/browse/JDK-8312233.
14-08-2023

[~sgibbons] Is this issue mitigated by the proposed change in JDK-8314056?
14-08-2023

I do not believe backing out the changes to the fmod / dmod is the correct approach in general. My reasoning follows.

I have been looking into this and have discovered that the performance of fmod / dmod is highly dependent on the relative magnitude of the numerator and denominator of the mod operation. I have attached Blender4.java to this report, where the only difference is increasing the magnitude of the numerator of the mod operation, which shows a very large improvement in performance between JDK 20 and JDK 21/22. Experimentation has shown that the libc implementation's performance (JDK 20) decreases almost linearly with the ratio of the numerator to the denominator. The x87 and AVX2 versions' performance (JDK 21 and JDK 22, respectively) is virtually flat. libc appears to be faster when the ratio is low, but is slower when the ratio is above ~50000 (i.e., 50 % 42 is faster for libc, but 2e6 % 42 is faster using the other algorithms).

I have also stripped each of these algorithms out of the JDK and into pure C/assembly. In this case, AVX2 always outperforms the other two algorithms regardless of the parameter values. I'm still investigating why their inclusion in the JDK changes their relative performance. I suspect it has something to do with either path length or parameter assembly / casting, but that's yet to be determined.

I'm not sure what, if anything, we can say about the "typical" usage of floating-point modulus. It could be that 90% of all fmods are of the type in Blender.java, in which case the libc implementation would probably give better performance in the JDK overall. However, there may be a large percentage where the AVX2 algorithm is better. I'd like to hear from others with more knowledge of typical usage before making a judgement.

Output from Blender4:

(fmod)$ ~/jdk-20.0.1/bin/java -XX:UseAVX=2 Blender4.java
1181 ms 1182 ms 1182 ms 1179 ms 1179 ms 1179 ms 1182 ms 1180 ms 1181 ms 1183 ms
Average: 1180 ms

(fmod)$ ~/jdk-21/bin/java -XX:UseAVX=2 Blender4.java
334 ms 335 ms 335 ms 335 ms 335 ms 334 ms 335 ms 335 ms 334 ms 335 ms
Average: 334 ms

(fmod)$ ~/jdk-22/bin/java -XX:UseAVX=2 Blender4.java
262 ms 262 ms 262 ms 262 ms 262 ms 262 ms 262 ms 262 ms 262 ms 262 ms
Average: 262 ms
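
Purely as an illustration of such a magnitude sweep (the divisor, dividend range, and iteration counts below are made up, not taken from Blender4.java), a stand-alone sketch could look like:

public class FmodRatioSweep {
    public static void main(String[] args) {
        final double divisor = 42.0;
        // sweep the dividend so the dividend/divisor ratio crosses the ~50000 region
        for (double dividend = 50.0; dividend <= 1.0e9; dividend *= 10.0) {
            double acc = 0.0;
            long start = System.nanoTime();
            for (int i = 0; i < 10_000_000; i++) {
                // the tiny perturbation keeps the ratio essentially fixed while
                // preventing the '%' from being constant-folded
                acc += (dividend * (1.0 + i * 1.0e-12)) % divisor;
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("ratio ~" + (long) (dividend / divisor)
                    + ": " + ms + " ms (checksum " + acc + ")");
        }
    }
}

Running this with the same -XX:UseAVX settings on JDK 20, 21, and 22 should show whether the ~50000 crossover described above reproduces on a given machine.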
25-07-2023

If that helps, I used GCC 11.2.0.
21-07-2023

More data. The JDK versions are the same as above. Blender2 always shows an improvement from JDK 20 to JDK 22. Blender is always worse (but was better on the i7-11700). I'm investigating the cause for Blender.

Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz (Skylake), gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-15):
JDK 20: Blender - 97 ms, Blender2 - 159 ms
JDK 21: Blender - 96 ms, Blender2 - 160 ms
JDK 22: Blender - 103 ms, Blender2 - 132 ms

Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz (Comet Lake), gcc (Ubuntu 12.2.0-3ubuntu1) 12.2.0:
JDK 20: Blender - 79 ms, Blender2 - 136 ms
JDK 21: Blender - 79 ms, Blender2 - 136 ms
JDK 22: Blender - 86 ms, Blender2 - 118 ms
20-07-2023

On the machine I used, I could see the regression with UseAVX=0, 1, 2, or 3.
20-07-2023

I've started my investigation and found the following results. I'm still trying to figure out what this means, and will be trying other machines as well. I'll keep this JBS updated as I progress.

(It seems my formatting was lost. The first average is for Blender.java and the second is for Blender2.java. All tests were done with -XX:UseAVX=2.)

CPU: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz, gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

openjdk 20.0.2 2023-07-18, OpenJDK Runtime Environment (build 20.0.2+9-78), OpenJDK 64-Bit Server VM (build 20.0.2+9-78, mixed mode, sharing):
Blender.java average: 90 ms, Blender2.java average: 128 ms

openjdk 21-ea 2023-09-19, OpenJDK Runtime Environment (build 21-ea+31-2444), OpenJDK 64-Bit Server VM (build 21-ea+31-2444, mixed mode, sharing):
Blender.java average: 91 ms, Blender2.java average: 128 ms

openjdk 22-ea 2024-03-19, OpenJDK Runtime Environment (build 22-ea+6-393), OpenJDK 64-Bit Server VM (build 22-ea+6-393, mixed mode, sharing):
Blender.java average: 74 ms, Blender2.java average: 93 ms
19-07-2023

And I could only observe it with a product/release build; with fastdebug, the numbers seemed to be more or less equal.
19-07-2023

[~sgibbons] and here is the machine I used:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz
stepping        : 4
microcode       : 0x1
cpu MHz         : 1995.312
cache size      : 16384 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 3990.62
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
19-07-2023

[~sgibbons] one example is a Dell Precision 3440 desktop, Intel Core i7-10700 CPU @ 2.90GHz x16 running Ubuntu 22.04 LTS. Hopefully [~chagedorn] has further details of the machines he used.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 165
model name      : Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
stepping        : 5
microcode       : 0xf4
cpu MHz         : 2900.000
cache size      : 16384 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit srbds mmio_stale_data retbleed eibrs_pbrsb
bogomips        : 5799.77
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
19-07-2023

I have been unable to reproduce the regression on any of my machines, so I'm guessing it is somehow platform-dependent. I see a consistent 5-15% performance improvement with 21 over 20. Can you please tell me the platform configuration on which the bad numbers are produced?
19-07-2023

[~sgibbons] can you also take a look at this one please. There is serious consideration that we should back out JDK-8302191 from JDK 21 due to this impact on some systems.
19-07-2023

ILW = HLH = P2
I: H (there is potentially a significant (~25%) performance regression on a micro-ish benchmark)
L: L (a specific micro-benchmark, not across the board)
W: H (no workaround known yet)
18-07-2023

It looks like this is a regression from JDK-8302191, which changed SharedRuntime::frem/drem(). Moving to runtime for further analysis.
18-07-2023

Emanuel is on vacation, Christian will have a look.
18-07-2023

I can reproduce the regression on my iMac x64 (AVX2) with the specified JDKs. But on one of the Labs Intel linux-x64 machines (also AVX2 but older) it is the opposite: JDK 21 is 5% faster than JDK 20.0.1.
18-07-2023

As far as I can see, the hot part of the generated inner-loop code is identical. There are no vectors here. There is an allocation, field assignments, and a call to the C library drem() function. We need to profile the code to see which instructions/calls become more expensive.
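
One common way to do that on Linux, assuming perf is available (-XX:+PreserveFramePointer just makes the compiled Java frames easier for perf to walk), would be along these lines:

$ perf record -g -- java -XX:+PreserveFramePointer Blender2.java
$ perf report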
17-07-2023