JDK-5105765 : REGRESSION: JNI Performance Regression in 1.5.0 -server
  • Type: Bug
  • Component: hotspot
  • Sub-Component: runtime
  • Affected Version: 5.0,5.0u1
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic,windows_2000
  • CPU: x86
  • Submitted: 2004-09-22
  • Updated: 2013-06-21
  • Resolved: 2004-11-24
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other JDK 6
5.0u3Fixed 6 b14Fixed
Related Reports
Duplicate :  
Duplicate :  
Relates :  
Relates :  
Description
Name: gm110360			Date: 09/22/2004


FULL PRODUCT VERSION :
java version "1.5.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-rc-b63)
Java HotSpot(TM) Server VM (build 1.5.0-rc-b63, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows 2000 [Version 5.00.2195]

EXTRA RELEVANT SYSTEM CONFIGURATION :
1.7GHz Xeon processor (dual processor but test is single threaded)

A DESCRIPTION OF THE PROBLEM :
This is refinement of the bug I filed earlier with an internal review ID of: 311233

Using double math just after a JNI call results in a dramatic performance slowdown (factor of 5 in the test case below) under 1.5.0rc-server.  This
slowdown does not occur under 1.4.2-server or any client version.  I've included
a simple program below that demonstrates the problem.
VTune reports an extreme level of  Streaming SIMD Input Assists in this case,
though I don't know why this should occur

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Create a simple JNI function that simply always returns the value 1.0f
and compile it into a dll.  Then run the program below to time various
operations.  Notice that mixing double math operations and JNI calls results
in a dramatic and unexplained performance loss by roughly a factor of 5!

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
These are the results I get when running under 1.4.2-server

avg    2.4 ns   total 4.70E-2 s  for assign             (~ 4.0 cycles)
avg    2.3 ns   total 4.60E-2 s  for mult               (~ 3.9 cycles)
avg  162.5 ns   total 3.25E0 s   for JNI                (~ 276.2 cycles)
avg  175.0 ns   total 3.50E0 s   for JNI and mult        (~ 297.5 cycles)
avg    2.4 ns   total 4.70E-2 s  for assign             (~ 4.0 cycles)
avg    2.4 ns   total 4.70E-2 s  for mult               (~ 4.0 cycles)
avg  162.5 ns   total 3.25E0 s   for JNI                (~ 276.2 cycles)
avg  168.0 ns   total 3.36E0 s   for JNI and mult        (~ 285.6 cycles)

ACTUAL -
These are the results I get under 1.5.0rc-server

java version "1.5.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-rc-b63)
Java HotSpot(TM) Server VM (build 1.5.0-rc-b63, mixed mode)

avg    2.4 ns   total 4.70E-2 s  for assign             (~ 4.0 cycles)
avg    2.4 ns   total 4.70E-2 s  for mult               (~ 4.0 cycles)
avg  194.5 ns   total 3.89E0 s   for JNI                (~ 330.6 cycles)
avg  892.2 ns   total 1.78E1 s   for JNI and mult       (~ 1516.7 cycles)
avg    2.4 ns   total 4.70E-2 s  for assign             (~ 4.0 cycles)
avg    2.4 ns   total 4.70E-2 s  for mult               (~ 4.0 cycles)
avg  193.0 ns   total 3.86E0 s   for JNI                (~ 328.0 cycles)
avg  890.6 ns   total 1.78E1 s   for JNI and mult        (~ 1514.0 cycles)

Note that JNI calls have gotten 1.2x more expensive and mixing JNI calls and double multiplies has gotten 5.3x more expensive under 1.5.0rc-server as compared to 1.4.2-server

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------

import java.text.DecimalFormat;
import java.util.Random;

/** Test that show a dramatic slowdown for programs that attempt to use
 * floating point math after a JNI call.
 *
 * @author  bjw  Bruce Walter, Cornell Program of Computer Graphics 2004
 */
public class JNIOpsTest {
  //target total number of repetitions of the operation
  public static final int opTargetRepetitions = 20000000;
  //size of arrays that are operated on
  public static final int arraySize = 10000;
  //number of times we need to process each array to reach total target count
  public static final int reps = opTargetRepetitions/arraySize;
  //pretty print the output numbers to make them easier to read
  public static final DecimalFormat decForm = new DecimalFormat("###0.0");
  public static final DecimalFormat sciForm = new DecimalFormat("0.00E0");
  //my processor is a 1.7GHz Xeon (actually it is a dual processor, but this test is single threaded)
  public static final double ghzProcSpeed = 1.7; //my processor is 1.7GHz
  
  //native function that simply always returns the value 1.0
  public static native float funcStaticVoid();
  
  static {
    System.load("C:/IntelCode/JNITest/JNITest/Release/JNITest.dll");
  }
  
  public static void runTimingTest(TestOp op, double result[], double src[], boolean print) {
    long time = System.currentTimeMillis();
    for(int i=0;i<reps;i++) {
      op.performOp(result,src);
    }
    time = System.currentTimeMillis() - time;
    double denom = 1000000.0/(reps*src.length);
    if (print) {
      String ps = decForm.format(time*denom);
      while (ps.length()<6) ps = " "+ps;
      ps = "avg "+ps+" ns   total "+sciForm.format(time/1000.0)+" s";
      while (ps.length()<32) ps += " ";
      ps = ps+" for "+op.toString();
      while (ps.length()<50) ps += " ";
      System.out.println(ps+"\t(~ "+decForm.format(time*denom*ghzProcSpeed)+" cycles)");
    }
  }
    
  public static void main(String[] args) throws InterruptedException {
    double src[] = new double[arraySize];
    double result[] = new double[arraySize];
    Random ran = new Random(5232482349538L);
    //set the src array to be random values between zero and one exclusive
    for(int i=0;i<src.length;i++) {
      do {
        src[i] = ran.nextDouble();
      } while (src[i] == 0);
    }
    TestOp tests[] = { new AssignOp(), new MultOp(), new JNIOp(), new JNIMulOp()};
    //warm up hotspot
    for(int i=0;i<tests.length;i++) {
      runTimingTest(tests[i],result,src,false);
    }
    //now run the real tests and print the timings
    for(int i=0;i<tests.length;i++) {
      runTimingTest(tests[i],result,src,true);
    }
    //do it again to show the timings are reasonably consistent
    for(int i=0;i<tests.length;i++) {
      runTimingTest(tests[i],result,src,true);
    }
  }
  
  public abstract static class TestOp {
    public abstract void performOp(double result[], double src[]);
  }
  
  public static class AssignOp extends TestOp {
    public String toString() { return "assign"; }
    public void performOp(double result[], double src[]) {
      for(int i=0;i<src.length;i++) {
        result[i] = src[i];
      }
    }
  }
  
  public static class MultOp extends TestOp {
    public String toString() { return "mult"; }
    public void performOp(double result[], double src[]) {
      for(int i=0;i<src.length;i++) {
        result[i] = 0.143*src[i];
      }
    }
  }
  
  public static class JNIOp extends TestOp {
    public String toString() { return "JNI"; }
    public void performOp(double result[], double src[]) {
      for(int i=0;i<src.length;i++) {
        JNIOpsTest.funcStaticVoid();
      }
    }
  }
    
  public static class JNIMulOp extends TestOp {
    public String toString() { return "JNI and mult"; }
    public void performOp(double result[], double src[]) {
      for(int i=0;i<src.length;i++) {
        JNIOpsTest.funcStaticVoid();
        result[i] = 0.143*src[i];
      }
    }
  }
 
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
None that I know of other than not upgrading to 1.5.0rc and staying with 1.4.2

Release Regression From : 1.4.2
The above release value was the last known release where this 
bug was known to work. Since then there has been a regression.

(Incident Review ID: 311434) 
======================================================================
###@###.### 10/12/04 19:40 GMT

Comments
WORK AROUND There are a few worksarounds for this issue: 1. Use the client compiler The client compiler does not use the XMM registers therefore it doesn't have to restore this MXCSR register on native calls. 2. Set a Java flag that will disable the use of XMM for the server compiler. The flag is -XX:UseSSE=0 . This flag is not documented and there is not guarantee that it will be in any future releases. ###@###.### 10/12/04 19:40 GMT
12-10-2004

EVALUATION This performance regression is caused by a change made in the 1.5 sources to fix bugid 5003738. On the return from every native call, the MXCSR is being restored to the value that the Java VM expects. The x86 ABI defines this register as volatile on every x86 platform, except Windows AMD64. This is a performance versus correctness issue. There are a few worksarounds for this issue: 1. Use the client compiler The client compiler does not use the XMM registers therefore it doesn't have to restore this MXCSR register on native calls. 2. Set a Java flag that will disable the use of XMM for the server compiler. The flag is -XX:UseSSE=0 . This flag is not documented and there is not guarantee that it will be in any future releases. ###@###.### 10/12/04 19:40 GMT The fix that will be implemented, backs out most of the original fix that was done to defensively restore the contents of the MXCSR register. We preserved the code which saves and initializes the state of this register when the VM enters Java code, but we no longer set the VMs expected value when we return from calling native JNI code. With this fix, the VM has now reverted to the original functionaliy where it expects any native code to preserve the state of the MSCSR registers if it needs to change it. In addition to the changes described above, we have added a -XX flag which will allow the VM to revert to it's current functionality. The default state of this flag is false. -XX:+RestoreMXCSROnJNICalls In addition, if -Xcheck:jni is used, the VM will verify that the MXCSR register contains the VM's expected value on return from native (JNI) code. This bug fix is targeted for J2SE 1.5 update 3 and all future releases. ###@###.### 2004-11-29 16:41:04 GMT
12-10-2004