JDK-8302524 : Performance regression for float/double modulo operation
  • Type: Bug
  • Component: performance
  • Sub-Component: hotspot
  • Affected Version: 11,19,20,21
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2023-02-08
  • Updated: 2023-02-15
Description
ADDITIONAL SYSTEM INFORMATION :
Linux, any vendor, any OpenJDK (tested 8 and 21=trunk)

A DESCRIPTION OF THE PROBLEM :
The performance regression depends on the GCC version used. It regressed between gcc-4.8 and gcc-4.9:
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
 = https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098
Java modulo (%) is compiled into the Java bytecode drem, which is defined like C fmod(), not C drem() (also known as remainder()). So for the C/C++ function fmod():
 * gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite
 * gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.
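
To illustrate the fmod() versus remainder() distinction described above, here is a minimal Java sketch. The class and method names are hypothetical and are not part of the reproducer or the patch; it only contrasts the % operator (truncating quotient, result takes the sign of the dividend) with Math.IEEEremainder (quotient rounded to nearest).

```java
public class RemainderDemo {
    // The % operator on doubles compiles to the drem bytecode,
    // which truncates the quotient toward zero, like C fmod().
    static double truncatingRem(double a, double b) {
        return a % b;
    }

    // Math.IEEEremainder rounds the quotient to the nearest integer,
    // like C remainder() (also called drem() in C).
    static double ieeeRem(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(truncatingRem(5.5, 2.0));   // 1.5  (5.5 - 2 * 2.0)
        System.out.println(ieeeRem(5.5, 2.0));         // -0.5 (5.5 - 3 * 2.0)
        System.out.println(truncatingRem(-5.5, 2.0));  // -1.5 (sign follows the dividend)
    }
}
```

Only the first form is affected by this report, because only % is routed through the drem bytecode.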
According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
 * https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
 * https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483
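
As a rough check of the special-value rules that the JVM specification gives for drem, and that any replacement implementation would have to preserve, the following sketch can be run on a candidate build. The class and method names are hypothetical; the expected values come from the JVM specification, not from the proposed patch.

```java
public class DremSpecialCases {
    // Thin wrapper so each case goes through the drem bytecode.
    static double rem(double a, double b) {
        return a % b;
    }

    // True only for negative zero, which == cannot distinguish from +0.0.
    static boolean isNegativeZero(double d) {
        return Double.doubleToRawLongBits(d) == Double.doubleToRawLongBits(-0.0);
    }

    public static void main(String[] args) {
        // NaN operand, infinite dividend, or zero divisor all give NaN.
        System.out.println(Double.isNaN(rem(Double.NaN, 1.0)));
        System.out.println(Double.isNaN(rem(Double.POSITIVE_INFINITY, 2.0)));
        System.out.println(Double.isNaN(rem(1.0, 0.0)));
        // Finite dividend with infinite divisor gives the dividend back.
        System.out.println(rem(3.0, Double.POSITIVE_INFINITY) == 3.0);
        // Zero dividend with finite divisor gives the dividend, sign included.
        System.out.println(isNegativeZero(rem(-0.0, 2.0)));
    }
}
```

Every line should print true on a conforming JVM, both before and after any change to how drem is implemented.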
The following 3 issues are useful for upstream Linux components but they are not required for OpenJDK/Zulu:
 * The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
 * gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
 * clang does not have any fprem instruction optimization, it only calls glibc fmod().
The patch fixes the performance and applies to both OpenJDK-8 and OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.
I did not test whether Oracle Java 8 was faster or not; that depends on which compiler Oracle was using.
It has regressed for example from:
  CentOS-7.1
  java-1.8.0-openjdk-1.8.0.31-2.b13.el7.x86_64
  GNU C 4.8.3 20140911 (Red Hat 4.8.3-9) -mtune=generic -march=x86-64 -g -O3 -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fPIC
  gcc-4.8.3-9.el7.src.rpm does not yet contain the problematic patch
to:
  CentOS-7.9
  java-1.8.0-openjdk-1.8.0.362.b08-1.el7_9.x86_64
  GNU C++ 4.8.5 20150623 (Red Hat 4.8.5-44) -m64 -mtune=generic -march=x86-64 -g -g -O3 -std=gnu++98 -fPIC -fno-rtti -fno-exceptions -fcheck-new -fvisibility=hidden -fno-strict-aliasing -fno-omit-frame-pointer -fstack-protector -fstack-protector-strong -fpch-deps --param ssp-buffer-size=4
  gcc-4.8.5-44.el7 already contains the problematic patch


REGRESSION : Last worked in version 8

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
wget https://jankratochvil.net/t/DivisionDemo.java https://jankratochvil.net/t/benchmark.sh
# edit old= and new= in benchmark.sh
bash benchmark.sh
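
If the linked files are unavailable, a much smaller timing loop can give a rough indication of the same effect. The sketch below is not the original DivisionDemo.java; the class name, iteration count, and divisor are arbitrary assumptions, and it is not a substitute for a proper benchmark harness.

```java
public class ModuloBench {
    // Sums i % divisor for i in [1, n]. Returning the sum keeps the JIT
    // from discarding the loop as dead code.
    static double sumOfRemainders(int n, double divisor) {
        double sum = 0.0;
        for (int i = 1; i <= n; i++) {
            sum += i % divisor;  // int is widened to double, so this is a drem
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("JVM version: " + System.getProperty("java.version"));
        for (int iter = 0; iter < 10; iter++) {
            long start = System.nanoTime();
            double sum = sumOfRemainders(50_000_000, 3.7);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Iteration " + iter + " took: " + ms + " ms (sum=" + sum + ")");
        }
    }
}
```

Running the same class file on the old and the new JVM and comparing the later iterations, once the JIT has warmed up, should show whether the double modulo path differs between them.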


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
JVM version: 1.8.0_31
Iteration 0 regression case Took : 92 noMod case took: 63 noPower case took: 70
Iteration 1 regression case Took : 89 noMod case took: 63 noPower case took: 69
Iteration 2 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 3 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 4 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 5 regression case Took : 65 noMod case took: 63 noPower case took: 70
Iteration 6 regression case Took : 63 noMod case took: 63 noPower case took: 69
Iteration 7 regression case Took : 63 noMod case took: 63 noPower case took: 69
Iteration 8 regression case Took : 62 noMod case took: 64 noPower case took: 69
Iteration 9 regression case Took : 62 noMod case took: 64 noPower case took: 69
 - each line contains about the same 3 numbers

ACTUAL -
JVM version: 1.8.0_362
Iteration 0 regression case Took : 472 noMod case took: 63 noPower case took: 98
Iteration 1 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 2 regression case Took : 462 noMod case took: 42 noPower case took: 95
Iteration 3 regression case Took : 458 noMod case took: 38 noPower case took: 106
Iteration 4 regression case Took : 470 noMod case took: 63 noPower case took: 96
Iteration 5 regression case Took : 465 noMod case took: 63 noPower case took: 102
Iteration 6 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 7 regression case Took : 465 noMod case took: 63 noPower case took: 97
Iteration 8 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 9 regression case Took : 457 noMod case took: 39 noPower case took: 85
 - the first test of modulo is up to 7x slower


---------- BEGIN SOURCE ----------
https://jankratochvil.net/t/DivisionDemo.java
https://jankratochvil.net/t/benchmark.sh
This reproducer was not written by me.

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
https://jankratochvil.net/t/openjdk-asm.patch
It could also be fixed either in GCC or in glibc (or both).


FREQUENCY : always



Comments
Issue is reproduced. Performance regression is observed in JDK 11 and above.
The performance in JDK 8u361 is similar to 8u40 but there is a spike in the time taken to test float modulo and double modulo operations in JDK 11 and above.

OS: Windows 10

JDK 8u40 and JDK 8u361 : Pass

Output:
===============================================
Benchmark ouptut with old: 1.8.0.40 and new : 1.8.0.361
8.0.40 version
Java HotSpot(TM) 64-Bit Server VM (25.40-b25) for windows-amd64 JRE (1.8.0_40-b27), built on Mar 13 2015 04:42:43 by "java_re" with MS VC++ 10.0 (VS2010)
8.0.361 version
Java HotSpot(TM) 64-Bit Server VM (25.361-b09) for windows-amd64 JRE (1.8.0_361-b09), built on Jan 9 2023 08:38:53 by "java_re" with MS VC++ 15.9 (VS2017)

testing with 8.0.40 with long mod
JVM version: 1.8.0_40
Iteration 0 regression case Took : 44 noMod case took: 42 noPower case took: 50
Iteration 1 regression case Took : 35 noMod case took: 42 noPower case took: 47
Iteration 2 regression case Took : 35 noMod case took: 35 noPower case took: 40
.......

testing with 8.0.361 with long mod
JVM version: 1.8.0_361
Iteration 0 regression case Took : 39 noMod case took: 43 noPower case took: 48
Iteration 1 regression case Took : 36 noMod case took: 34 noPower case took: 40
Iteration 2 regression case Took : 36 noMod case took: 35 noPower case took: 37
........
========================================

JDK 11.0.18: Fail
JDK 19.0.2: Fail
JDK 20ea : Fail
JDK 21ea: Fail

Output:
==========================================
Benchmark ouptut with old : 1.8.0.361 and new : JDK21ea7
8.0.361 version
Java HotSpot(TM) 64-Bit Server VM (25.361-b09) for windows-amd64 JRE (1.8.0_361-b09), built on Jan 9 2023 08:38:53 by "java_re" with MS VC++ 15.9 (VS2017)
21 version
OpenJDK 64-Bit Server VM (21-ea+7-472) for windows-amd64 JRE (21-ea+7-472), built on 2023-01-25T18:31:08Z by "mach5one" with MS VC++ 17.1 (VS2022)

testing with 8.0.361 with long mod
JVM version: 1.8.0_361
Iteration 0 regression case Took : 47 noMod case took: 43 noPower case took: 50
Iteration 1 regression case Took : 37 noMod case took: 37 noPower case took: 39
Iteration 2 regression case Took : 39 noMod case took: 40 noPower case took: 48
........

testing with 21 with long mod
JVM version: 21-ea
Iteration 0 regression case Took : 45 noMod case took: 44 noPower case took: 52
Iteration 1 regression case Took : 39 noMod case took: 44 noPower case took: 52
Iteration 2 regression case Took : 34 noMod case took: 38 noPower case took: 38
.......

testing with 8.0.361 with double modulo
JVM version: 1.8.0_361
Iteration 0 regression case Took : 96 noMod case took: 36 noPower case took: 92
Iteration 1 regression case Took : 60 noMod case took: 23 noPower case took: 83
Iteration 2 regression case Took : 94 noMod case took: 35 noPower case took: 86
.......

testing with 21 with double modulo
JVM version: 21-ea
Iteration 0 regression case Took : 434 noMod case took: 22 noPower case took: 97
Iteration 1 regression case Took : 431 noMod case took: 22 noPower case took: 97
Iteration 2 regression case Took : 429 noMod case took: 23 noPower case took: 96
.......

testing with 8.0.361 with float modulo
JVM version: 1.8.0_361
Iteration 0 regression case Took : 98 noMod case took: 16 noPower case took: 84
Iteration 1 regression case Took : 97 noMod case took: 15 noPower case took: 87
Iteration 2 regression case Took : 59 noMod case took: 29 noPower case took: 83
Iteration 3 regression case Took : 58 noMod case took: 7 noPower case took: 85
.........

testing with 21 with float modulo
JVM version: 21-ea
Iteration 0 regression case Took : 423 noMod case took: 18 noPower case took: 104
Iteration 1 regression case Took : 423 noMod case took: 17 noPower case took: 103
Iteration 2 regression case Took : 429 noMod case took: 12 noPower case took: 102
........

ILW = Regression, reproducible on GA build , workaround available = HLM = P3
Moving it to dev team for further evaluation
15-02-2023