JDK-8302191 : Performance degradation for float/double modulo on Linux
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 8,11,17,20,21
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: linux
  • CPU: x86_64
  • Submitted: 2023-02-10
  • Updated: 2023-07-18
  • Resolved: 2023-03-22
  • Fixed in JDK 20: 20.0.2
  • Fixed in JDK 21: 21 b16
Description
As reported by Jan Kratochvil:

Float/double modulo operations in Java on Linux became about 6x slower.
The slowdown was introduced by a change in GCC between gcc-4.8 and gcc-4.9 and went unnoticed.
It is therefore easy to demonstrate by comparing the performance of two jdk8 builds made with different versions of the GCC compiler.

The affected native hotspot code is unchanged to this day. Applying the jdk8 fix to the trunk (jdk 21) demonstrates the problem (and its solution) with all recent versions of gcc.

GCC has been slow since this commit (the performance regression):
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
 = https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4f2611b6e872c40e0bf4da38ff05df8c8fe0ee64
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 (backport)

The performance regression was fixed (reverted) by this commit:
[PATCH] i386: Do not constrain fmod and remainder patterns with flag_finite_math_only [PR108922]
 = https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612918.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=8020c9c42349f51f75239b9d35a2be41848a97bd

Attached are:
 * reproducer org.apache.spark.DivisionDemo.java;
 * jdk8 timings with gcc-4.8 builds before/after the remainder change in gcc;
 * jdk8 timings with gcc-4.8 after the change in gcc, with the fix in hotspot;
 * jdk 21 timings with gcc-12 before/after the fix in hotspot;
 * the fix, applicable to all versions of jdk (with a path adjustment for jdk8).

The reproducer should be run as
java -cp . -Xmx1024m -Xms1024m -XX:+AlwaysPreTouch org.apache.spark.DivisionDemo 10 f
where the last parameter is f for float or d for double.
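
For readers without access to the attachment, a timing loop of the same general kind might look like the sketch below. This is NOT the attached DivisionDemo reproducer: the class name ModuloTiming, its methods, the iteration count and the operand values are all hypothetical, and only the f/d argument convention is borrowed from the command line above. Whether a particular loop reaches the affected runtime code depends on how the JVM executes it, so treat any numbers from it as indicative only.

```java
// Hypothetical timing sketch; not the attached org.apache.spark.DivisionDemo.
public class ModuloTiming {
    // Sums (i * 1.25) % 7.0 for i = 1..n; the sum is returned so the results are used.
    static double sumDoubleRemainders(int n) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++) {
            acc += (i * 1.25) % 7.0;
        }
        return acc;
    }

    // Same loop with float operands.
    static float sumFloatRemainders(int n) {
        float acc = 0.0f;
        for (int i = 1; i <= n; i++) {
            acc += (i * 1.25f) % 7.0f;
        }
        return acc;
    }

    public static void main(String[] args) {
        // "f" selects float, anything else selects double (assumed convention).
        boolean useFloat = args.length > 0 && args[0].equals("f");
        int n = 20_000_000; // arbitrary iteration count chosen for this sketch
        for (int round = 0; round < 5; round++) {
            long start = System.nanoTime();
            double checksum = useFloat ? sumFloatRemainders(n) : sumDoubleRemainders(n);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println((useFloat ? "float" : "double") + " round " + round
                    + ": " + elapsedMs + " ms (checksum " + checksum + ")");
        }
    }
}
```

Running the same class on two JDK builds (for example one with and one without the hotspot fix) and comparing the per-round times is the intended use.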

Analysis:
  
Java modulo (%) is compiled into the Java bytecode drem (frem for float), which is defined as C fmod(), not C drem() (also known as remainder()). For the C/C++ function fmod:
 * gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only when the result was non-finite;
 * gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.
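
The difference between the fmod-style remainder used by % and the IEEE 754 remainder can be seen from Java itself. The following is a minimal sketch; the class and method names are made up for illustration, while Math.IEEEremainder is the standard library method that implements the IEEE 754 (rounding-division) remainder.

```java
// Illustrative only: contrasts Java's % with the IEEE 754 remainder.
public class RemainderKinds {
    // Java's % on doubles: truncating division, comparable to C fmod().
    static double javaRemainder(double a, double b) {
        return a % b;
    }

    // IEEE 754 remainder: quotient rounded to nearest, comparable to C remainder().
    static double ieeeRemainder(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        // 5.5 / 2.0 = 2.75: truncation gives q = 2, rounding to nearest gives q = 3.
        System.out.println(javaRemainder(5.5, 2.0));   // 1.5
        System.out.println(ieeeRemainder(5.5, 2.0));   // -0.5
        System.out.println(javaRemainder(-5.5, 2.0));  // -1.5
        System.out.println(ieeeRemainder(-5.5, 2.0));  // 0.5
    }
}
```

With %, the result always takes the sign of the dividend; with the IEEE 754 remainder it can have the opposite sign, which is why the two C functions are not interchangeable here.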
  
According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
 * https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
 * https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483
  
The following 3 issues are worth addressing in upstream Linux components, but fixing them is not required for OpenJDK:
 * The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
 * gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
 * clang has no fprem instruction optimization; it only calls glibc fmod().

The patch fixes the performance and applies to both OpenJDK-8 and OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.

It is hard to detect a regression with a performance fix, so noreg-perf. 

Comments
[17u] In head, this is a brand new change. Please flag it again for 17u once it has been live for a while, e.g. in 20.0.2. Then I will reconsider.
24-04-2023

Ok, I saw the text "I see no regression on OpenJDK-8 Linux x86_64." and assumed that was about the performance regression, but I guess it may actually be about the patch testing (which is unusual to see commented on in a bug description). CentOS is not one person, but maintainers of different packages, primarily for RHEL. As maintainer for the RHEL/CentOS/Fedora OpenJDK packages, my reason is as above. I don't intend to include it locally in the OpenJDK RPM if I'm against doing so upstream. As it sounds like the RHEL/CentOS gcc team have backported the change which created the regression to an older gcc, it is probably worth filing a bug in the Red Hat Bugzilla asking them to include the reversion (which may just mean dropping a local patch to gcc 4.8). I would warn, however, that the barrier for getting changes into 7.9 is also pretty high by this point. These are all pretty old releases. But fixing the mistake at its source would also fix other software too, without changing any code.
20-04-2023

Andrew Hughes: "My understanding of this fix is that it is needed on certain GCC versions on x86_32 only,"
That problem does affect x86_64 (primarily).
David Holmes: "This is a performance enhancement, not a bug."
The goal of the backports is to fix a performance regression for OpenJDK builds built by gcc <=4.8. For example:
fast: CentOS-7.1 java-1.8.0-openjdk-1.8.0.31-2.b13.el7.x86_64
  GNU C 4.8.3 20140911 (Red Hat 4.8.3-9) -mtune=generic -march=x86-64 -g -O3 -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fPIC
  gcc-4.8.3-9.el7.src.rpm does not yet contain the problematic patch
slow: CentOS-7.9 java-1.8.0-openjdk-1.8.0.362.b08-1.el7_9.x86_64
  GNU C++ 4.8.5 20150623 (Red Hat 4.8.5-44) -m64 -mtune=generic -march=x86-64 -g -g -O3 -std=gnu++98 -fPIC -fno-rtti -fno-exceptions -fcheck-new -fvisibility=hidden -fno-strict-aliasing -fno-omit-frame-pointer -fstack-protector -fstack-protector-strong -fpch-deps --param ssp-buffer-size=4
  gcc-4.8.5-44.el7 already contains the problematic patch
(I did not check which exact CentOS version did regress.)
Sure, CentOS can also either backport this patch themselves to OpenJDK, or they can backport the GCC fix, or they can use some of the GCC compilation options. Still, the OpenJDK fix would fix the regression for any OpenJDK vendor.
19-04-2023

[11u-no] See 8u reasoning.
19-04-2023

As I understand it, the change that caused the regression has been reverted in gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922 Also a NACK for 8u from me too. Honestly, it seems risky to add this to 8u, 11u or 17u, which are stable releases that should be seeing bug fixes only. To quote David above, "This is a performance enhancement, not a bug." My understanding of this fix is that it is needed on certain GCC versions on x86_32 only, but alters code shared by x86_32 & x86_64 on all operating systems. So we have a risk for more widely used platforms with no gain. There is also a workaround, which is to build on GCC < 4.9 or one with https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108922 fixed. Given the age of GCC 4.9 by this point, I think this would have been flagged much earlier if it was a critical performance issue. GCC 4.9.0 was released on 2014-04-22 (and so celebrates its 9th birthday this week).
18-04-2023

Definitely not 8u. I'm wondering about 11u. Is this a bug fix, or an enhancement? From the form of the change it certainly looks like an enhancement, but I guess it'd be possible to argue that it's a fix for a regression elsewhere. In which case I'd wonder why the regression hasn't been fixed elsewhere.
18-04-2023

I'd like to port JDK-8302191 to 8u because it has the same problem as trunk. The patch had to be adapted because jdk11 has sharedRuntime_x86.cpp while jdk8 has sharedRuntime_x86_32.cpp and sharedRuntime_x86_64.cpp.
05-04-2023

I'd like to port JDK-8302191 to 11u because it has the same problem as trunk. The fix applies cleanly.
05-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk8u-dev/pull/298 Date: 2023-04-04 15:11:00 +0000
04-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1824 Date: 2023-04-04 13:00:13 +0000
04-04-2023

I'd like to port JDK-8302191 to 17u because it has the same problem as trunk. The fix applies cleanly.
04-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk17u-dev/pull/1234 Date: 2023-04-04 12:00:51 +0000
04-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk19u/pull/108 Date: 2023-04-03 14:44:18 +0000
03-04-2023

I'd like to port JDK-8302191 to 20u because it has the same problem as trunk. The fix applies cleanly.
03-04-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk20u/pull/46 Date: 2023-04-03 14:30:07 +0000
03-04-2023

Changeset: 37774556 Author: Jan Kratochvil <jkratochvil@azul.com> Committer: Sandhya Viswanathan <sviswanathan@openjdk.org> Date: 2023-03-22 15:55:57 +0000 URL: https://git.openjdk.org/jdk/commit/37774556da8a5aacf55884133ae936ed5a28eab2
22-03-2023

More current reference for the semantics of drem: https://docs.oracle.com/javase/specs/jvms/se19/html/jvms-6.html#jvms-6.5.drem
"The result of a drem instruction is not the same as the result of the remainder operation defined by IEEE 754, due to the choice of rounding policy in the Java Virtual Machine (§2.8). The IEEE 754 remainder operation computes the remainder from a rounding division, not a truncating division, and so its behavior is not analogous to that of the usual integer remainder operator. Instead, the Java Virtual Machine defines drem to behave in a manner analogous to that of the integer remainder instructions irem and lrem, with an implied division using the round toward zero rounding policy; this may be compared with the C library function fmod.
The result of a drem instruction is governed by the following rules, which match IEEE 754 arithmetic except for how the implied division is computed:
 * If either value1 or value2 is NaN, the result is NaN.
 * If neither value1 nor value2 is NaN, the sign of the result equals the sign of the dividend.
 * If the dividend is an infinity or the divisor is a zero or both, the result is NaN.
 * If the dividend is finite and the divisor is an infinity, the result equals the dividend.
 * If the dividend is a zero and the divisor is finite, the result equals the dividend.
 * In the remaining cases, where neither operand is an infinity, a zero, or NaN, the floating-point remainder result from a dividend value1 and a divisor value2 is defined by the mathematical relation result = value1 - (value2 * q), where q is an integer that is negative only if value1 / value2 is negative, and positive only if value1 / value2 is positive, and whose magnitude is as large as possible without exceeding the magnitude of the true mathematical quotient of value1 and value2.
Despite the fact that division by zero may occur, evaluation of a drem instruction never throws a run-time exception. Overflow, underflow, or loss of precision cannot occur."
03-03-2023
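
The drem rules quoted above can be checked directly with the Java % operator. The sketch below is illustrative only: the class DremRules and its helper methods are hypothetical names, and viaTruncatedQuotient is a naive restatement of result = value1 - (value2 * q) that is only trustworthy for small, exactly representable operands such as the ones used here.

```java
// Illustrative check of the drem rules using Java's % on doubles.
public class DremRules {
    static double rem(double value1, double value2) {
        return value1 % value2;
    }

    // Naive form of result = value1 - (value2 * q) with q truncated toward zero.
    // Assumption: operands are small enough that no intermediate rounding occurs.
    static double viaTruncatedQuotient(double value1, double value2) {
        double q = value1 / value2;
        q = (q < 0) ? Math.ceil(q) : Math.floor(q); // round toward zero
        return value1 - value2 * q;
    }

    public static void main(String[] args) {
        System.out.println(rem(Double.NaN, 1.0));               // NaN: an operand is NaN
        System.out.println(rem(Double.POSITIVE_INFINITY, 2.0)); // NaN: dividend is an infinity
        System.out.println(rem(1.0, 0.0));                      // NaN: divisor is a zero, no exception
        System.out.println(rem(3.0, Double.POSITIVE_INFINITY)); // 3.0: result equals the dividend
        System.out.println(rem(-0.0, 5.0));                     // -0.0: result equals the dividend
        System.out.println(rem(-7.5, 2.0));                     // -1.5: sign follows the dividend
        System.out.println(rem(7.5, -2.0));                     // 1.5: sign follows the dividend
        System.out.println(viaTruncatedQuotient(-7.5, 2.0));    // -1.5: same as % for this input
    }
}
```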

This is a performance enhancement, not a bug.
13-02-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/12508 Date: 2023-02-10 09:06:56 +0000
10-02-2023