ADDITIONAL SYSTEM INFORMATION :
Linux, any vendor, any OpenJDK (tested 8 and 21, which is trunk)

A DESCRIPTION OF THE PROBLEM :
The performance regression depends on the GCC version used to build the JVM. It regressed between gcc-4.8 and gcc-4.9 with this change:
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098

Java modulo (%) is compiled into the Java bytecode drem, which is defined as C fmod(), not as C drem() (which is also named remainder()). So for the C/C++ function fmod:
* gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite.
* gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.

According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
* https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
* https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483

The following 3 issues are relevant for upstream Linux components, but fixing them is not required for OpenJDK/Zulu:
* The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
* gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
* clang does not have any fprem instruction optimization; it only calls glibc fmod().

The patch does fix the performance, and it is applicable to both OpenJDK-8 and OpenJDK trunk (and I expect anything in between).
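To illustrate the distinction described above between drem-as-fmod() and the IEEE 754 remainder, here is a minimal sketch. It is not part of the reproducer or of the patch, and the class and method names are hypothetical. It only shows that Java's % on doubles truncates the quotient toward zero (like C fmod()), while Math.IEEEremainder rounds the quotient to nearest (like C remainder()).

```java
// Illustrative sketch only; DremDemo and its method names are hypothetical.
class DremDemo {
    // Java's % on doubles (bytecode drem): the quotient is truncated toward
    // zero, so the result has the sign of the dividend, like C fmod().
    static double truncatingRemainder(double a, double b) {
        return a % b;
    }

    // IEEE 754 remainder: the quotient is rounded to nearest (ties to even),
    // like C remainder() / drem().
    static double ieeeRemainder(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        double[][] cases = { {5.5, 2.0}, {-5.5, 2.0}, {7.0, 2.0} };
        for (double[] c : cases) {
            System.out.println(c[0] + " % " + c[1] + " = "
                    + truncatingRemainder(c[0], c[1])
                    + ", IEEEremainder = " + ieeeRemainder(c[0], c[1]));
        }
        // Special cases of %: an infinite dividend gives NaN,
        // an infinite divisor returns the (finite) dividend unchanged.
        System.out.println(truncatingRemainder(Double.POSITIVE_INFINITY, 2.0));
        System.out.println(truncatingRemainder(5.5, Double.POSITIVE_INFINITY));
    }
}
```

For 5.5 and 2.0, % yields 1.5 while Math.IEEEremainder yields -0.5, which is why the two operations cannot be substituted for each other.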
I see no regression on OpenJDK-8 Linux x86_64. I did not test whether Oracle Java 8 was faster or not; that depends on which compiler Oracle was using.

It has regressed, for example, from:
CentOS-7.1 java-1.8.0-openjdk-1.8.0.31-2.b13.el7.x86_64
GNU C 4.8.3 20140911 (Red Hat 4.8.3-9) -mtune=generic -march=x86-64 -g -O3 -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fPIC
gcc-4.8.3-9.el7.src.rpm does not yet contain the problematic patch
to:
CentOS-7.9 java-1.8.0-openjdk-1.8.0.362.b08-1.el7_9.x86_64
GNU C++ 4.8.5 20150623 (Red Hat 4.8.5-44) -m64 -mtune=generic -march=x86-64 -g -g -O3 -std=gnu++98 -fPIC -fno-rtti -fno-exceptions -fcheck-new -fvisibility=hidden -fno-strict-aliasing -fno-omit-frame-pointer -fstack-protector -fstack-protector-strong -fpch-deps --param ssp-buffer-size=4
gcc-4.8.5-44.el7 already contains the problematic patch

REGRESSION : Last worked in version 8

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
wget https://jankratochvil.net/t/DivisionDemo.java https://jankratochvil.net/t/benchmark.sh
# edit old= and new= in benchmark.sh
bash benchmark.sh

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
JVM version: 1.8.0_31
Iteration 0 regression case Took : 92 noMod case took: 63 noPower case took: 70
Iteration 1 regression case Took : 89 noMod case took: 63 noPower case took: 69
Iteration 2 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 3 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 4 regression case Took : 62 noMod case took: 63 noPower case took: 70
Iteration 5 regression case Took : 65 noMod case took: 63 noPower case took: 70
Iteration 6 regression case Took : 63 noMod case took: 63 noPower case took: 69
Iteration 7 regression case Took : 63 noMod case took: 63 noPower case took: 69
Iteration 8 regression case Took : 62 noMod case took: 64 noPower case took: 69
Iteration 9 regression case Took : 62 noMod case took: 64 noPower case took: 69
- each line contains about the same 3 numbers

ACTUAL -
JVM version: 1.8.0_362
Iteration 0 regression case Took : 472 noMod case took: 63 noPower case took: 98
Iteration 1 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 2 regression case Took : 462 noMod case took: 42 noPower case took: 95
Iteration 3 regression case Took : 458 noMod case took: 38 noPower case took: 106
Iteration 4 regression case Took : 470 noMod case took: 63 noPower case took: 96
Iteration 5 regression case Took : 465 noMod case took: 63 noPower case took: 102
Iteration 6 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 7 regression case Took : 465 noMod case took: 63 noPower case took: 97
Iteration 8 regression case Took : 465 noMod case took: 63 noPower case took: 96
Iteration 9 regression case Took : 457 noMod case took: 39 noPower case took: 85
- the first test of modulo is up to 7x slower

---------- BEGIN SOURCE ----------
https://jankratochvil.net/t/DivisionDemo.java
https://jankratochvil.net/t/benchmark.sh
This reproducer was not written by me.
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
https://jankratochvil.net/t/openjdk-asm.patch
It could also be fixed either in GCC or in glibc (or both).

FREQUENCY : always
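
For readers who cannot fetch the linked reproducer, the following is a rough sketch of the kind of timing loop that exercises double modulo. It is not DivisionDemo.java; the class name, method name, loop size, and operands are all assumptions made for illustration, and whether it shows the slowdown depends on how the JVM under test was built and on which execution path (interpreter or JIT) handles the loop.

```java
// Hypothetical timing sketch; NOT the original DivisionDemo.java.
class ModuloTimingSketch {
    // Sums (i * 1.25) % divisor for i = 1..n so that every iteration
    // performs one double modulo (bytecode drem).
    static double sumOfRemainders(int n, double divisor) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++) {
            acc += (i * 1.25) % divisor;
        }
        return acc;
    }

    public static void main(String[] args) {
        final int n = 10_000_000; // arbitrary loop size, chosen for illustration
        System.out.println("JVM version: " + System.getProperty("java.version"));
        for (int iteration = 0; iteration < 10; iteration++) {
            long start = System.nanoTime();
            double result = sumOfRemainders(n, 3.0);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Printing the result keeps the loop from being optimized away.
            System.out.println("Iteration " + iteration + " modulo loop took: "
                    + elapsedMs + " ms (checksum " + result + ")");
        }
    }
}
```

Running the same class under the old and the new JVM builds and comparing the per-iteration times gives a comparison in the spirit of benchmark.sh.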