JDK-8312233 : Performance regression in SharedRuntime::frem/drem() on x86 with AVX2 after JDK-8308966
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 22
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • CPU: x86
  • Submitted: 2023-07-18
  • Updated: 2023-10-19
  • Resolved: 2023-10-19
JDK 22: 22 (Resolved)
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
There is a performance regression on x86 systems without AVX512 after the integration of JDK-8308966, which intrinsifies float/double modulo. It can be observed and isolated by running Blender.java with flags that disable most C2 optimizations and compile only test(). On AVX512 there is a small regression of 2-3%, which might also be worth looking into. The regression can also be observed in interpreter-only mode with -Xint. The measurements are listed in the sections below.
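
Blender.java itself is not reproduced in this report. As a rough illustration only, a reproducer of the same shape might look like the sketch below: a hot test() method dominated by float and double `%` (the frem/drem bytecodes), timed ten times with a truncated average. The class name FremBench, the constants, and the iteration count are assumptions made for this sketch and are not taken from the original test.

```java
public class FremBench {
    // Hypothetical workload: accumulates float and double remainders
    // so that the result depends on every iteration of the loop.
    public static double test(int n) {
        double acc = 0.0;
        float f = 1.5f;
        double d = 2.5;
        for (int i = 1; i <= n; i++) {
            acc += (i * f) % 7.25f;  // float remainder (frem bytecode)
            acc += (i * d) % 3.125;  // double remainder (drem bytecode)
        }
        return acc;
    }

    public static void main(String[] args) {
        final int runs = 10;
        long totalMs = 0;
        double sink = 0.0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            sink += test(5_000_000);  // iteration count is an arbitrary choice
            long ms = (System.nanoTime() - start) / 1_000_000;
            totalMs += ms;
            System.out.println(ms + " ms");
        }
        // Integer division, so the average is truncated rather than rounded.
        System.out.println("Average: " + (totalMs / runs) + " ms");
        // Consume the result so the work cannot be treated as dead code.
        if (Double.isNaN(sink)) {
            System.out.println(sink);
        }
    }
}
```

With such a file, the command lines in this report would apply after substituting the class name, for example -XX:CompileCommand=compileonly,FremBench::test. Absolute timings from this sketch are not comparable to the Blender.java numbers.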


Test on AVX2
========
Setup:
- AVX512 not available
- AVX2 available
- FMA instructions available
- fastdebug build

--- JDK 22+7/mainline ---

Default:
$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java

Output:
2847 ms
2860 ms
2876 ms
2861 ms
2868 ms
2867 ms
2877 ms
2875 ms
2880 ms
2880 ms
Average: 2869 ms


Disabling FMA instructions with -XX:-UseFMA:
$ java -XX:-UseFMA -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java

Output:
329 ms
329 ms
330 ms
330 ms
331 ms
331 ms
332 ms
330 ms
332 ms
332 ms
Average: 330 ms


--- JDK 21+31 ---

$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java

Output:
341 ms
340 ms
341 ms
341 ms
340 ms
341 ms
340 ms
341 ms
341 ms
341 ms
Average: 340 ms


-----> SUMMARY: ~8.4x regression in JDK 22 for AVX2 without AVX512 (2869 ms vs. 340 ms)


=== Interpreter only ===

--- JDK 22+7/mainline ---

Default:
$ java -Xint Blender.java

Output:
3311 ms
3310 ms
3314 ms
3324 ms
3320 ms
3333 ms
3343 ms
3350 ms
3343 ms
3336 ms
Average: 3328 ms

Disabling FMA instructions with -XX:-UseFMA:
$ java -XX:-UseFMA -Xint Blender.java

Output:
956 ms
877 ms
865 ms
886 ms
897 ms
917 ms
886 ms
876 ms
863 ms
903 ms
Average: 892 ms

--- JDK 21+31 ---

$ java -Xint Blender.java

Output:
917 ms
930 ms
951 ms
973 ms
941 ms
926 ms
948 ms
963 ms
971 ms
975 ms
Average: 949 ms


-----> SUMMARY: ~3.5x regression in JDK 22 for AVX2 without AVX512 with interpreter only (3328 ms vs. 949 ms)



Test on AVX512
=========
Setup:
- AVX512 available where VM_Version::supports_avx512vlbwdq() is true
- fastdebug build

--- JDK 22+7/mainline ---

Default:
$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java

Output:
907 ms
907 ms
908 ms
907 ms
917 ms
908 ms
908 ms
908 ms
910 ms
907 ms
Average: 908 ms

--- JDK 21+31 ---

888 ms
884 ms
884 ms
884 ms
885 ms
884 ms
888 ms
890 ms
884 ms
884 ms
Average: 885 ms


-----> SUMMARY: ~2-3% regression in JDK 22 for AVX512
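
The regression factors can be rechecked from the averages reported in this description. The sketch below does only that arithmetic; the class and method names are made up for illustration, and the inputs are the "Average:" values listed in the three test sections.

```java
import java.util.Locale;

public class RegressionRatios {
    // Slowdown factor of the regressed build relative to the baseline.
    public static double slowdown(double regressedMs, double baselineMs) {
        return regressedMs / baselineMs;
    }

    // The same comparison expressed as a percentage increase in run time.
    public static double percentSlower(double regressedMs, double baselineMs) {
        return (regressedMs / baselineMs - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Averages taken from this report: JDK 22+7 vs. JDK 21+31.
        System.out.println(String.format(Locale.ROOT,
                "AVX2, C2 only test(): %.2fx", slowdown(2869, 340)));
        System.out.println(String.format(Locale.ROOT,
                "AVX2, -Xint:          %.2fx", slowdown(3328, 949)));
        System.out.println(String.format(Locale.ROOT,
                "AVX512, C2:           %.1f%%", percentSlower(908, 885)));
    }
}
```

By the same arithmetic, the -XX:-UseFMA runs on JDK 22 (330 ms with C2, 892 ms with -Xint) are slightly faster than the corresponding JDK 21 averages.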


Comments
Thanks. Closing as duplicate.
19-10-2023

Yes, this should be closed as the fix in JDK-8314056 resolves this issue.
19-10-2023

[~sgibbons], any update on this?
19-10-2023

Should this now be closed as a dup of JDK-8314056, or is there more to be done?
20-09-2023

I have created a PR (https://github.com/openjdk/jdk/pull/15210) that has some fixes for fmod/dmod performance (JBS: https://bugs.openjdk.org/browse/JDK-8314056). My current stats are below.

=========================================
Using fastdebug

openjdk version "22-internal" 2024-03-19
OpenJDK Runtime Environment (fastdebug build 22-internal-adhoc.scottgi.jdk)
OpenJDK 64-Bit Server VM (fastdebug build 22-internal-adhoc.scottgi.jdk, mixed mode, sharing)

$ ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
311 ms
310 ms
310 ms
Average: 310 ms

$ ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:-UseFMA -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
310 ms
Average: 310 ms

=========================================
Using the release (not fastdebug) version of java:

openjdk version "22-internal" 2024-03-19
OpenJDK Runtime Environment (build 22-internal-adhoc.scottgi.jdk)
OpenJDK 64-Bit Server VM (build 22-internal-adhoc.scottgi.jdk, mixed mode, sharing)

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
Average: 45 ms

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:-UseFMA -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
45 ms
Average: 45 ms

========================================
The current released JDK22:

openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+6-393)
OpenJDK 64-Bit Server VM (build 22-ea+6-393, mixed mode, sharing)

$ ~/jdk-22/bin/java -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
74 ms
74 ms
74 ms
74 ms
74 ms
74 ms
74 ms
74 ms
74 ms
74 ms
Average: 74 ms

$ ~/jdk-22/bin/java -XX:-UseFMA -XX:UseAVX=2 -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
CompileCommand: compileonly Blender.test bool compileonly = true
94 ms
95 ms
95 ms
95 ms
95 ms
95 ms
95 ms
94 ms
95 ms
95 ms
Average: 94 ms

===================================
Interpreter only:

$ ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -Xint Blender.java
828 ms
850 ms
851 ms
821 ms
851 ms
821 ms
871 ms
821 ms
857 ms
821 ms
Average: 839 ms

$ ./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:-UseFMA -Xint Blender.java
826 ms
849 ms
819 ms
849 ms
848 ms
819 ms
868 ms
819 ms
854 ms
819 ms
Average: 837 ms

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -Xint Blender.java
482 ms
483 ms
482 ms
482 ms
482 ms
482 ms
481 ms
482 ms
481 ms
482 ms
Average: 481 ms

$ ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:-UseFMA -Xint Blender.java
482 ms
482 ms
482 ms
481 ms
482 ms
482 ms
482 ms
482 ms
481 ms
482 ms
Average: 481 ms

$ ~/jdk-22/bin/java -Xint Blender.java
480 ms
481 ms
481 ms
481 ms
480 ms
481 ms
481 ms
481 ms
481 ms
484 ms
Average: 481 ms

$ ~/jdk-22/bin/java -XX:-UseFMA -Xint Blender.java
512 ms
514 ms
513 ms
513 ms
514 ms
515 ms
514 ms
512 ms
513 ms
512 ms
Average: 513 ms
10-08-2023

[~sgibbons] sure, here is the CPU info which showed the regression:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 154
model name : 12th Gen Intel(R) Core(TM) i7-12800H
stepping : 3
microcode : 0x429
cpu MHz : 1782.168
cache size : 24576 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 14
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 5606.40
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
19-07-2023
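
The CPU listed above reports avx2 and fma but no avx512 flags, which matches the "Test on AVX2" setup in the description. For anyone trying to reproduce on Linux, a quick way to check a machine against that setup might be the sketch below. The class and method names are hypothetical, and treating the absence of avx512f as "AVX512 not available" is a simplification made here; per the description, HotSpot's own check is VM_Version::supports_avx512vlbwdq().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class CpuFlagCheck {
    // Returns the entries of the first "flags" line in /proc/cpuinfo-style text.
    // Lines such as "vmx flags" are skipped because the key must equal "flags".
    public static Set<String> parseFlags(String cpuinfo) {
        for (String line : cpuinfo.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            if (line.substring(0, colon).trim().equals("flags")) {
                String[] flags = line.substring(colon + 1).trim().split("\\s+");
                return new HashSet<>(Arrays.asList(flags));
            }
        }
        return Collections.emptySet();
    }

    // Assumption for this sketch: AVX2 and FMA present, avx512f absent.
    public static boolean matchesReportedSetup(Set<String> flags) {
        return flags.contains("avx2")
                && flags.contains("fma")
                && !flags.contains("avx512f");
    }

    public static void main(String[] args) throws IOException {
        // Linux only: /proc/cpuinfo does not exist on other platforms.
        Set<String> flags = parseFlags(Files.readString(Path.of("/proc/cpuinfo")));
        System.out.println("avx2=" + flags.contains("avx2")
                + " fma=" + flags.contains("fma")
                + " avx512f=" + flags.contains("avx512f"));
        System.out.println("Matches the AVX2 setup: " + matchesReportedSetup(flags));
    }
}
```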

Thanks [~chagedorn] for the heads up. I've tried reproducing this on my current machine and I see a constant ~341ms using all of the command lines above. Can you please tell me the specifics of your platform?

$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java

Output:
341 ms
340 ms
343 ms
341 ms
341 ms
338 ms
338 ms
341 ms
343 ms
341 ms
Average: 341 ms
18-07-2023

Hi [~sgibbons], can you have a look?
18-07-2023

ILW = Performance regression in fmod stub, medium, use -XX:-UseFMA = MMM = P3
18-07-2023