JDK-5070061 : Performance of server VM about 30% worse than client VM when using an AthlonXP
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 1.4.2
  • Priority: P3
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2004-06-30
  • Updated: 2004-10-14
  • Resolved: 2004-10-14
Description
Name: jl125535			Date: 06/30/2004


FULL PRODUCT VERSION :
1.5.0 beta 2

Also occurs on 1.4.2_03, 1.4.2_04

FULL OS VERSION :
WindowsXP, SP1

EXTRA RELEVANT SYSTEM CONFIGURATION :
AthlonXP 2200+ processor, 512 MB memory

A DESCRIPTION OF THE PROBLEM :
On many occasions I see the server VM perform worse than the client VM when running on an AthlonXP CPU.

The source provided here demonstrates this. It is a Mandelbrot generator that performs about 30% worse on the server VM than on the client VM when running on an Athlon, whereas the same program performs 3.5 times better on the server VM than on the client VM when running on a P4 CPU.

In this particular program I would expect a P4 to run faster than an Athlon, given the SSE2 optimizations done on a P4, but on an Athlon the server VM should still perform better than the client VM.

If you replace all doubles with floats, the problem vanishes: the server VM then performs faster than the client VM, as expected.
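
For reference, the float variant described above might look like the following sketch. The class and method names (FloatKernel, iterate) are hypothetical and not part of the submitted test case; only the arithmetic mirrors the inner loop of calcPixelRow.

```java
// Hypothetical float version of the inner loop of calcPixelRow.
// Class and method names are illustrative only.
class FloatKernel {

    // Returns the number of iterations before the point (cx, cy) escapes,
    // capped at maxi.
    static int iterate(float cx, float cy, int maxi) {
        float zx = cx;
        float zy = cy;
        int i;
        for (i = 0; i < maxi; i++) {
            float zx2 = zx * zx;
            float zy2 = zy * zy;
            if (zx2 + zy2 > 4.0f) {
                break;
            }
            // Same update as the double version: zy uses the old zx.
            zy = 2.0f * zx * zy + cy;
            zx = zx2 - zy2 + cx;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(iterate(0.0f, 0.0f, 9999)); // point inside the set
        System.out.println(iterate(2.0f, 2.0f, 9999)); // escapes immediately
    }
}
```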


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile the provided source, then run it with 'java JMandel >logclient.txt' and then with 'java -server JMandel >logserver.txt'.
Redirecting the output to files is necessary to cope with the large amount of output generated, so that you can still see the timing results.

If you let the for-loop in the run() method loop 5 times instead of 100, the test will complete a lot faster while still giving the VM some warm-up time.
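
As an alternative to redirecting output, a compute-only variant could be timed directly. The following is a minimal sketch, assuming the same constants and double-precision arithmetic as the provided source; the class and method names (ComputeOnlyMandel, renderOnce) are hypothetical, and it uses System.nanoTime, which requires 1.5 or later.

```java
// Hypothetical compute-only harness: same double-precision arithmetic as
// calcPixelRow in the provided source, but with no printing, so the timing
// covers only the floating-point work.
class ComputeOnlyMandel {

    // Renders a w x h image once and returns the total iteration count.
    static long renderOnce(int w, int h, int maxi) {
        double ax = -2.0, ay = -1.5, ex = 1.0, ey = 1.5;
        double sx = (ex - ax) / w;
        double sy = (ey - ay) / h;
        long iters = 0;
        for (int y = 0; y < h; y++) {
            double cy = ay + sy * y;
            double cx = ax;
            for (int x = 0; x < w; x++) {
                double zx = cx;
                double zy = cy;
                int i;
                for (i = 0; i < maxi; i++) {
                    double zx2 = zx * zx;
                    double zy2 = zy * zy;
                    if (zx2 + zy2 > 4.0) {
                        break;
                    }
                    zy = 2.0 * zx * zy + cy;
                    zx = zx2 - zy2 + cx;
                }
                cx += sx;
                iters += i;
            }
        }
        return iters;
    }

    public static void main(String[] args) {
        // Five passes, as suggested above, to give the VM some warm-up time.
        for (int pass = 0; pass < 5; pass++) {
            long start = System.nanoTime();
            long iters = renderOnce(500, 500, 9999);
            long ms = (System.nanoTime() - start) / 1000000L;
            System.out.println("Runtime ms=" + ms + " iters=" + iters);
        }
    }
}
```

Run it once with 'java ComputeOnlyMandel' and once with 'java -server ComputeOnlyMandel' and compare the later passes.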

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The server VM should perform better than the client VM even when running on an Athlon, especially because this test does a lot of array access.

The txt file should look something like:

0000000000000000000000000000000000000...
Runtime ms=4775 88.2469329842932 MegaIters per second
0000000000000000000000000000000000000...
Runtime ms=5077 82.99765708095332 MegaIters per second
0000000000000000000000000000000000000...
Runtime ms=4634 90.93204682779457 MegaIters per second
0000000000000000000000000000000000000...
Runtime ms=4855 86.79281256436663 MegaIters per second
0000000000000000000000000000000000000...
Runtime ms=4731 89.06766117099978 MegaIters per second


ACTUAL -
The server VM takes 30% more time to complete the test (see the output of the program for the numbers), when the test is run on an AthlonXP.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
public class JMandel {
    
    int[] rowBuffer_ = new int[500];
    int w_ = 500;
    int h_ = 500;
    int maxi_ = 9999;
    double ax_ = -2.0;
    double ay_ = -1.5;
    double ex_ = 1.0;
    double ey_ = 1.5;
    double sx_ = (ex_ - ax_) / ((double) w_);
    double sy_ = (ey_ - ay_) / ((double) h_);
    
    long msStart_;
    long msEnd_;
    long itersTotal_;
    long rendertime;
    
    private void run() {
        for (int i = 0; i < 100; i++) {
            itersTotal_ = 0;
            rendertime = 0;
            msStart_ = System.currentTimeMillis();
            for (int y = 0; y < h_; y++) {
                calcPixelRow(y, maxi_);
                rendertime += printResults();
            }
            msEnd_ = System.currentTimeMillis() - rendertime;
            long msTotal = msEnd_ - msStart_;
            double its = ((double) itersTotal_) / (double) msTotal;
            System.out.println(
                "\n\nRuntime ms=" + msTotal + " " + its / 1000.0 
                + " MegaIters per second");
        }
    }
    
    private long printResults() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < rowBuffer_.length; i++) {
            System.out.print(rowBuffer_[i]);
        }
        return System.currentTimeMillis() - start;
    }
    
    private boolean calcPixelRow(int row, int maxi) {
        double cx = ax_;
        double cy = ay_ + sy_ * ((double) row);
        double zx, zy;
        double zx2, zy2;
        
        for (int x = 0; x < w_; x++) {
            // Calc Pixel
            zx = cx;
            zy = cy;
            int i;
            for (i = 0; i < maxi; i++) {
                zx2 = zx * zx;
                zy2 = zy * zy;
                if ((zx2 + zy2) > 4)
                    break;
                zy = 2 * zx * zy;
                zx = zx2 - zy2;
                zx += cx;
                zy += cy;
            }
            cx += sx_;
            itersTotal_ += i;
            rowBuffer_[x] = i;
        }
        
        return true;
    }
    
    public static void main(String[] args) {
        JMandel jMandel = new JMandel();
        jMandel.run();
    }
}
---------- END SOURCE ----------
(Incident Review ID: 280969) 
======================================================================
###@###.### 10/14/04 17:59 GMT

Comments
EVALUATION
Since the Athlon does not support SSE2 (double precision), C2 uses the x87 FPU for double calculations. Floats are fast on the Athlon because it does support SSE1 (single precision). I can simulate this on a P4 with the flag -XX:UseSSE=1 (or 0 for no SSE). There really isn't much we can do to speed this up. ###@###.### 2004-07-01