JDK 20 | JDK 21 |
---|---|
20.0.2 Fixed | 21 b16 Fixed |
As reported by Jan Kratochvil:

Float/double modulo operations in Java on Linux suffered a performance degradation (about a 6x slowdown). It was introduced, and went unnoticed, after a change in GCC between gcc-4.8 and gcc-4.9, so it is easy to compare the performance of two separate jdk8 builds produced by the different GCC versions. The affected native hotspot code is the same even today: applying the same fix as in jdk8 to the trunk (jdk 21) shows the problem (and the solution) with all recent versions of gcc.

gcc has been slow since this commit (the performance regression):
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
* https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
* https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4f2611b6e872c40e0bf4da38ff05df8c8fe0ee64
* https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 (backport)

The performance regression was fixed/reverted by this commit:
[PATCH] i386: Do not constrain fmod and remainder patterns with flag_finite_math_only [PR108922]
* https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612918.html
* https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=8020c9c42349f51f75239b9d35a2be41848a97bd

Attached are:
* the reproducer org.apache.spark.DivisionDemo.java;
* jdk8 timings with gcc-4.8 builds before/after the remainder change in gcc;
* jdk8 timings with gcc-4.8 after the change in gcc, with a fix in hotspot;
* before/after timings of the hotspot fix for jdk 21 with gcc-12; the fix is applicable to all versions of jdk (with a path adjustment for jdk8).

The reproducer should be run as

java -cp . -Xmx1024m -Xms1024m -XX:+AlwaysPreTouch org.apache.spark.DivisionDemo 10 f

with the last parameter f for float or d for double.

Analysis:

Java modulo (%) is compiled into the Java bytecode drem, which is defined as C fmod(), not C drem() (also known as remainder()).
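The difference between the fmod()-style remainder used by drem and the IEEE 754 remainder() can be seen directly from Java. The following is a minimal sketch, not part of the attached reproducer; the class and method names are made up for illustration. It relies only on the % operator and on Math.IEEEremainder from the standard library.

```java
public class RemainderSemantics {
    // Java's % on doubles (bytecode drem): the quotient is truncated toward
    // zero, so the result takes the sign of the dividend, like C fmod().
    static double truncatingRemainder(double a, double b) {
        return a % b;
    }

    // IEEE 754 remainder: the quotient is rounded to the nearest integer,
    // like C remainder() / drem().
    static double ieeeRemainder(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(truncatingRemainder(5.0, 3.0));   // 2.0
        System.out.println(ieeeRemainder(5.0, 3.0));         // -1.0
        System.out.println(truncatingRemainder(-5.0, 3.0));  // -2.0
        // Special cases from the JVM specification for drem:
        System.out.println(truncatingRemainder(5.0, 0.0));   // NaN
        System.out.println(truncatingRemainder(5.0, Double.POSITIVE_INFINITY)); // 5.0
        System.out.println(truncatingRemainder(Double.POSITIVE_INFINITY, 3.0)); // NaN
    }
}
```

The two operations disagree as soon as the exact quotient is closer to the next integer than to the truncated one, which is why an implementation of drem has to follow fmod() rather than remainder().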
For the C/C++ function fmod():
* gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite;
* gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.

According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
* https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
* https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483

The following 3 issues are relevant to upstream Linux components but are not required for OpenJDK:
* The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
* gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
* clang has no fprem instruction optimization; it only calls glibc fmod().

The patch fixes the performance problem and is applicable to both OpenJDK-8 and OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.

It is hard to detect a regression with a performance fix, so noreg-perf.
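For readers without access to the attachments, the kind of workload that exercises this code path can be sketched as a tight loop over double remainders. This is a hypothetical stand-in, not the attached org.apache.spark.DivisionDemo: the class name, array size, and value ranges are assumptions. Whether a given JVM build routes this loop through the affected native code depends on the platform and on the JIT.

```java
import java.util.Random;

public class FloatModLoop {
    // Hypothetical helper: sums a[i] % b[i] over all elements so the JIT
    // cannot discard the remainder computations.
    static double sumOfRemainders(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] % b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;                 // assumed size, chosen arbitrarily
        double[] a = new double[n];
        double[] b = new double[n];
        Random rnd = new Random(42);       // fixed seed for repeatable inputs
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextDouble() * 1e6;
            b[i] = rnd.nextDouble() * 1e3 + 1.0;  // keep divisors away from zero
        }
        // Repeat so later rounds run in compiled code rather than the interpreter.
        for (int round = 0; round < 10; round++) {
            long start = System.nanoTime();
            double sum = sumOfRemainders(a, b);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("round " + round + ": " + elapsedMs
                    + " ms, checksum " + sum);
        }
    }
}
```

Comparing the per-round times of such a loop across two JDK builds gives only a rough indication; the attached timings remain the reference measurements.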