JDK-8029302 : Performance regression in Math.pow intrinsic
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 7u40,8
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: linux
  • Submitted: 2013-11-15
  • Updated: 2017-11-07
  • Resolved: 2014-05-15
JDK 7: 7u80 (Fixed)
JDK 8: 8u20 b15 (Fixed)
JDK 9: 9 (Fixed)
Description
FULL PRODUCT VERSION :
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)


FULL OS VERSION :
Linux spica 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
(CentOS 6)

EXTRA RELEVANT SYSTEM CONFIGURATION :
/proc/cpuinfo:
processor: 0
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 0
cpu cores: 6
apicid: 0
initial apicid: 0
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.75
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 1
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 1
cpu cores: 6
apicid: 2
initial apicid: 2
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.24
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 2
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 2
cpu cores: 6
apicid: 4
initial apicid: 4
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.23
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 3
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 3
cpu cores: 6
apicid: 6
initial apicid: 6
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.24
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 4
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 4
cpu cores: 6
apicid: 8
initial apicid: 8
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.24
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 5
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 0
siblings: 6
core id: 5
cpu cores: 6
apicid: 10
initial apicid: 10
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.24
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 6
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 0
cpu cores: 6
apicid: 32
initial apicid: 32
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.28
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 7
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 1
cpu cores: 6
apicid: 34
initial apicid: 34
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.30
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 8
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 2
cpu cores: 6
apicid: 36
initial apicid: 36
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.29
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 9
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 3
cpu cores: 6
apicid: 38
initial apicid: 38
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.29
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 10
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 4
cpu cores: 6
apicid: 40
initial apicid: 40
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.27
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:

processor: 11
vendor_id: GenuineIntel
cpu family: 6
model: 45
model name: Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
stepping: 7
cpu MHz: 2501.000
cache size: 15360 KB
physical id: 1
siblings: 6
core id: 5
cpu cores: 6
apicid: 42
initial apicid: 42
fpu: yes
fpu_exception: yes
cpuid level: 13
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips: 4999.27
clflush size: 64
cache_alignment: 64
address sizes: 46 bits physical, 48 bits virtual
power management:


A DESCRIPTION OF THE PROBLEM :
It seems the Math.pow() implementation has changed between 7u25 and 7u40, with a strong performance regression.

Attached test case shows on my machine:
 - 7u25: ~1700ms
 - 7u40: ~8500ms

Using "-XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics" shows the intrinsic implementation is used in both cases.


THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Yes

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

REGRESSION.  Last worked in version 7u25

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
- Compile attached code
- Run with JDK 7u25 and 7u40

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.Random;

public class Main {

    public static void main(String[] args) throws Exception {

        while (true) {

            final Random random = new Random();
            final double[] values = new double[100_000_000];
            for (int i = 0; i < values.length; i++)
                values[i] = random.nextDouble();

            System.gc();

            final long start = System.currentTimeMillis();

            double blackhole = 0;
            for (int i = 0; i < values.length; i++)
                blackhole += Math.pow(values[i], 2);

            final long elapsed = System.currentTimeMillis() - start;

            System.out.println(elapsed + "ms (" + blackhole + ")");
        }
    }
}
---------- END SOURCE ----------
Comments
Hi Azeem,

On 04/08/2014 11:53 AM, Azeem Jiva wrote:
> Joe,
> There is a Math.pow performance regression where power of 2 inputs run slower than other values. One recommendation was to fix this in the libraries rather than a special case in the JVM. Can we fix this in 8u20 in the libraries? What are your thoughts on this?

Let's give credit / blame where it is due: the work done under JDK-7133857 introduced a performance regression in some cases along with some correctness bugs (JDK-7174532).

Way back when Cliff was still around, I worked with him (and an intern IIRC) to intrinsify pow. (This led to the development of some fruitful pow tests [1]). I don't know how the ultimate x87 instruction sequences used in JDK-7133857 differ from what was done previously, but given the bug tail of JDK-7133857, the results of the effort seem a bit suspect.

The fdlibm code used for StrictMath.pow does have an explicit up-front check for an exponent of 2.

I do *not* support adding another check for an exponent of 2 in the JDK libraries. Any bug here is with the HotSpot intrinsification of pow and IIRC that is where the fix / work-around should go.

I also recommend reexamining the work done under JDK-7133857 to make sure it meets other correctness properties that we might not have tests for. (I was not asked to review that work before it went back.)

Cheers,

-Joe

[1] https://blogs.oracle.com/darcy/entry/finding_a_bug_in_fdlibm
18-04-2014
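
Any special case for an exponent of 2 has to agree with the general pow path, which is the kind of correctness property the comment above asks to have re-examined. A minimal spot check might look like the sketch below. It assumes the documented accuracy bound of Math.pow (within 1 ulp of the exact result) as its tolerance; the class and method names are hypothetical and not part of the JDK or of the tests referenced in [1].

```java
// Hypothetical spot check; PowSquareCheck and its methods are illustrative names.
final class PowSquareCheck {

    private PowSquareCheck() {}

    /** True when a and b are equal, both NaN, or finite and at most one ulp of b apart. */
    static boolean withinOneUlp(double a, double b) {
        if (a == b || (Double.isNaN(a) && Double.isNaN(b))) {
            return true;
        }
        if (Double.isInfinite(a) || Double.isInfinite(b)) {
            return false; // unequal values where one side overflowed
        }
        return Math.abs(a - b) <= Math.ulp(b);
    }

    /** Counts inputs for which Math.pow(x, 2) or StrictMath.pow(x, 2) strays from x * x. */
    static int countMismatches(double[] inputs) {
        int mismatches = 0;
        for (double x : inputs) {
            double square = x * x;
            if (!withinOneUlp(Math.pow(x, 2.0), square)
                    || !withinOneUlp(StrictMath.pow(x, 2.0), square)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // A handful of ordinary and edge-case inputs (zeros, overflow, NaN).
        double[] inputs = {0.0, -0.0, 1.0, 3.0, 0.1, -2.5, 1e200, Double.NaN};
        System.out.println("mismatches: " + countMismatches(inputs));
    }
}
```

A check like this only samples a few inputs, so it can reveal a disagreement but cannot establish that the two paths always agree.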

On second thought, I doubt we can intrinsify StrictMath.pow, but we could add another private method to Math which we can intrinsify. Something like:

    public static double pow(double a, double b) {
        if (b == 2) {
            return a * a;
        }
        return powImpl(a, b);
    }

    private static double powImpl(double a, double b) {
        return StrictMath.pow(a, b); // default impl. is either intrinsified or delegates to StrictMath
    }
06-12-2013

A rather "easy" fix for this and other special cases would be to add these special cases to Math.pow, like:

    public static double pow(double a, double b) {
        if (b == 2) {
            return a * a;
        }
        return StrictMath.pow(a, b); // default impl. delegates to StrictMath
    }

and intrinsify StrictMath.pow in the compiler. Here is the speedup:

    cthaling@macbook:~/ws$ java -Xbootclasspath/p:$HOME/ws/jdk8-tl/jdk/classes Main
    103ms (3.3331021752770673E7)
    92ms (3.332789835208015E7)
    102ms (3.3333771275472376E7)
06-12-2013

ILW=MMM=P3

Impact: Medium, the performance regression occurs only with power of 2 values
Likelihood: Medium, again only with power of 2 values
Workaround: Medium, the developer can write code similar to:

    if (y == 2) {
        return x * x;
    }
    return Math.pow(x, y);

The regression is limited to certain values, and in general Math.pow has significant improvements from 7u25 to 7u40 (see Christian's comment of 04-12-2013).
05-12-2013
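
The workaround described above can be packaged as a small application-side helper. The following is a minimal sketch, not JDK code: the class name PowWorkaround and its method are hypothetical, and it assumes callers are happy to get x * x whenever the exponent is exactly 2.

```java
// Hypothetical application-side helper; the names are illustrative, not JDK API.
final class PowWorkaround {

    private PowWorkaround() {}

    /** Returns x raised to y, squaring directly when the exponent is exactly 2. */
    static double pow(double x, double y) {
        if (y == 2.0) {
            return x * x; // bypasses Math.pow for the case that regressed
        }
        return Math.pow(x, y);
    }

    public static void main(String[] args) {
        System.out.println(pow(3.0, 2.0));  // takes the x * x path
        System.out.println(pow(2.0, 10.0)); // falls through to Math.pow
    }
}
```

In the attached test case, the call in the inner loop would become PowWorkaround.pow(values[i], 2).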

Vladimir reminded me of a very important point: the C++ implementation has a special case for power-of-2 values. Here are the same numbers for 3 instead of 2:

    blackhole += Math.pow(values[i], 3);

    cthaling@intelsdv03.us.oracle.com:~/ws$ /java/re/jdk/7u25/latest/binaries/linux-x64/bin/java Main
    24041ms (2.4998553978965953E7)
    24084ms (2.50029473511136E7)
    24088ms (2.5001841408912268E7)

    cthaling@intelsdv03.us.oracle.com:~/ws$ /java/re/jdk/7u40/latest/binaries/linux-x64/bin/java Main
    8853ms (2.5002001402242936E7)
    8987ms (2.500130893984287E7)
    9024ms (2.499793506756205E7)

We might have to enhance the intrinsic to special-case power-of-2 values.
04-12-2013
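
To repeat the exponent-2 versus exponent-3 comparison without editing the source each time, the test case could be parameterized. The sketch below is one hypothetical way to do that; the class name, the default array size, and the fixed seed are assumptions rather than part of the attached test. Note that an exponent read from the command line is a runtime value, so the JIT may not compile it the same way as the literal 2 or 3 in the original loop, and timings may not be directly comparable with the numbers above.

```java
import java.util.Random;

// Hypothetical variant of the attached test case; names and sizes are illustrative.
final class PowExponentBench {

    private PowExponentBench() {}

    /** Sums Math.pow(v, exponent) over all values; the sum is returned so the loop stays live. */
    static double sumOfPowers(double[] values, double exponent) {
        double sum = 0;
        for (double v : values) {
            sum += Math.pow(v, exponent);
        }
        return sum;
    }

    public static void main(String[] args) {
        double exponent = args.length > 0 ? Double.parseDouble(args[0]) : 2.0;
        int size = args.length > 1 ? Integer.parseInt(args[1]) : 10_000_000;

        double[] values = new double[size];
        Random random = new Random(42); // fixed seed so runs see the same inputs
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextDouble();
        }

        // Several rounds: the first ones include interpreter and JIT warm-up.
        for (int round = 0; round < 5; round++) {
            long start = System.nanoTime();
            double sum = sumOfPowers(values, exponent);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("exponent=" + exponent + ": " + elapsedMs + "ms (" + sum + ")");
        }
    }
}
```

Running it once with the argument 2 and once with 3 on each JDK build gives the same four-way comparison as the runs quoted above.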

I can reproduce the regression on one of our machines:

    cthaling@intelsdv03.us.oracle.com:~/ws$ /java/re/jdk/7u25/latest/binaries/linux-x64/bin/java Main
    1752ms (3.333572549821999E7)
    1775ms (3.333630157987734E7)
    1775ms (3.3339959311967324E7)

    cthaling@intelsdv03.us.oracle.com:~/ws$ /java/re/jdk/7u40/latest/binaries/linux-x64/bin/java Main
    8860ms (3.3332716430023532E7)
    8982ms (3.333500694103926E7)
    8872ms (3.3330793038973276E7)
03-12-2013

To the best of my understanding, the compiler team owns the vm intrinsics library.
01-12-2013