JDK-6329104 : Occasional performance degradation by factor of 3
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 6
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2005-09-27
  • Updated: 2013-11-01
  • Resolved: 2005-10-13
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other JDK 6
5.0u8Fixed 6 b56Fixed
Related Reports
Relates :  
Description
While benchmarking some library code, we noticed that,
on multi-processor runs with -server, performance of the benchmarked code
was occasionally a factor of 3 slower, as measured by wall clock or "user" CPU time.

True of releases from 1.5.0 through 6.0-b53.
Seen on solaris-sparc, solaris-i586 opteron, and linux-amd64 MP
To see the effect, run the code below using a loop like

while true; do time jver 6.0 java -server ListBash 900 1000; done

and observe the occasional (10% ?) factor of 3 increase in cpu time
-----------------------------------------------------------
import java.util.*;

public class ListBash {
    static final Random rnd = new Random();
    static int numItr;
    static int listSize;

    public static void main(String[] args) throws Throwable {
        numItr = Integer.parseInt(args[0]);
        listSize = Integer.parseInt(args[1]);

        oneRun();
        oneRun();
        oneRun();
        oneRun();
    }

    static void oneRun() {
        long startTime = System.nanoTime();
	for (int i=0; i<numItr; i++)
            elementLoop();
        long elapsed = System.nanoTime() - startTime;
        System.out.println("Time: " + (elapsed/1000000000.0) + "s");
    }

    static void elementLoop() {
        List<Integer> s1 = new ArrayList();
        AddRandoms(s1, listSize);

        List<Integer> s2 = new ArrayList();
        AddRandoms(s2, listSize);

	clone(s1);
    }

    // Done inefficiently so as to exercise toArray
    static List<Integer> clone(List<Integer> s) {
        List a = Arrays.asList(s.toArray());

	List<Integer> clone = new ArrayList();
	clone.addAll(a);
	if (!s.equals(clone))
	    fail("List not equal to copy.");
	if (!s.containsAll(clone))
	    fail("List does not contain copy.");

	return clone;
    }

    static void AddRandoms(List<Integer> s, int n) {
	for (int i=0; i<n; i++) {
	    int r = rnd.nextInt() % n;
	    Integer e = new Integer(r < 0 ? -r : r);

	    int preSize = s.size();
	    if (!s.add(e))
		fail ("Add failed.");
	    int postSize = s.size();
	    if (postSize-preSize != 1)
		fail ("Add didn't increase size by 1.");
	}
    }

    static void fail(String s) {
	System.out.println(s);
	System.exit(1);
    }
}
-----------------------------------------------------------

Here is an excerpt from output:
Time: 1.211352022s
Time: 1.121192678s
Time: 0.995122533s
Time: 1.009511974s
jver 6.0 java -server ListBash 900 1000  4.14s user 0.24s system 94% cpu 4.627 total
Time: 3.664018005s
Time: 3.466817495s
Time: 3.606883598s
Time: 3.362159027s
jver 6.0 java -server ListBash 900 1000  13.47s user 0.10s system 94% cpu 14.409 total
Time: 1.144527022s
Time: 1.075664236s
Time: 0.987576517s
Time: 0.979692477s
jver 6.0 java -server ListBash 900 1000  4.11s user 0.22s system 97% cpu 4.439 total
Here's a backport of the test source, which can be used to demonstrate that
the variable performance effect does *NOT* appear to occur with 1.4.2.

--------------------------------------------------------------------
import java.util.*;

public class ListBash {
    static final Random rnd = new Random();
    static int numItr;
    static int listSize;

    public static void main(String[] args) throws Throwable {
        numItr = Integer.parseInt(args[0]);
        listSize = Integer.parseInt(args[1]);

        oneRun();
        oneRun();
        oneRun();
        oneRun();
    }

    static void oneRun() {
        long startTime = System.currentTimeMillis();
	for (int i=0; i<numItr; i++)
            elementLoop();
        long elapsed = System.currentTimeMillis() - startTime;
        System.out.println("Time: " + (elapsed/1000.0) + "s");
    }

    static void elementLoop() {
        List s1 = new ArrayList();
        AddRandoms(s1, listSize);

	clone(s1);
    }

    // Done inefficiently so as to exercise toArray
    static List clone(List s) {
        List a = Arrays.asList(s.toArray());

	List clone = new ArrayList();
	clone.addAll(a);
	if (!s.equals(clone))
	    fail("List not equal to copy.");
	if (!s.containsAll(clone))
	    fail("List does not contain copy.");

	return clone;
    }

    static void AddRandoms(List s, int n) {
	for (int i=0; i<n; i++) {
	    int r = rnd.nextInt() % n;
	    Integer e = new Integer(r < 0 ? -r : r);

	    int preSize = s.size();
	    if (!s.add(e))
		fail ("Add failed.");
	    int postSize = s.size();
	    if (postSize-preSize != 1)
		fail ("Add didn't increase size by 1.");
	}
    }

    static void fail(String s) {
	System.out.println(s);
	System.exit(1);
    }
}
Doug Lea writes:

---------------------------------------------------------
I do think that whatever is causing this has been getting
worse in recent builds. If more data would be more helpful,
unpack the attached zip. About half the time I run things
using b54 on solaris-amd I get about factor of 2 slowdown.

% java CollectionLoops RWCollection
[results omitted]

Possibilities that occur to me but I don't know how to check include
   * Have inlining thresholds changed lately?
   * Are there new races with or stalls versus class resolution or
     loading that cause devirtualization to give up?
---------------------------------------------------------
(Doug's test is available as an attachment)

I can reproduce the performance loss Doug reports.

I tried running benchmarks as follows, (varying the build number):

(PATH=/java/re/jdk/6.0/promoted/all/b12/binaries/solaris-amd64/bin:$PATH; java -version; repeat 30 time java CollectionLoops RWCollection > /dev/null)

and get results dramatically worse in b54 than in b12.  Not only is the incidence
of "slow" results greater, but the "fast" results are slower in b54 than in b12,
indicating two different causes of slowdown.  This is observed on Solaris 10 and 11.

Here are the results from one particular solaris10-amd64 machine:

Bottom line:
b54: 895.99s user 6.89s system 127% cpu 11:48.20 total
b12: 682.12s user 39.92s system 108% cpu 11:02.88 total

This is a serious mustang performance regression, esp. for user cpu time.

(mb29450@mho) ~/src/toy/6329104/3oct05 $ (PATH=/java/re/jdk/6.0/promoted/all/b12/binaries/solaris-amd64/bin:$PATH; java -version; repeat 30 time java CollectionLoops RWCollection > /dev/null)
java version "1.6.0-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-ea-b12)
Java HotSpot(TM) Server VM (build 1.6.0-ea-b12, mixed mode)
java CollectionLoops RWCollection > /dev/null  18.40s user 1.24s system 110% cpu 17.826 total
java CollectionLoops RWCollection > /dev/null  18.32s user 1.30s system 111% cpu 17.518 total
java CollectionLoops RWCollection > /dev/null  18.22s user 1.26s system 111% cpu 17.519 total
java CollectionLoops RWCollection > /dev/null  18.08s user 1.16s system 111% cpu 17.310 total
java CollectionLoops RWCollection > /dev/null  18.07s user 1.31s system 111% cpu 17.408 total
java CollectionLoops RWCollection > /dev/null  45.78s user 1.59s system 104% cpu 45.130 total
java CollectionLoops RWCollection > /dev/null  18.06s user 1.16s system 109% cpu 17.501 total
java CollectionLoops RWCollection > /dev/null  30.19s user 1.52s system 107% cpu 29.429 total
java CollectionLoops RWCollection > /dev/null  29.99s user 1.55s system 108% cpu 29.171 total
java CollectionLoops RWCollection > /dev/null  18.20s user 1.26s system 111% cpu 17.480 total
java CollectionLoops RWCollection > /dev/null  18.15s user 1.24s system 111% cpu 17.338 total
java CollectionLoops RWCollection > /dev/null  18.08s user 1.28s system 111% cpu 17.429 total
java CollectionLoops RWCollection > /dev/null  18.01s user 1.23s system 110% cpu 17.360 total
java CollectionLoops RWCollection > /dev/null  19.02s user 1.27s system 111% cpu 18.279 total
java CollectionLoops RWCollection > /dev/null  17.96s user 1.16s system 110% cpu 17.348 total
java CollectionLoops RWCollection > /dev/null  17.89s user 1.31s system 109% cpu 17.489 total
java CollectionLoops RWCollection > /dev/null  18.10s user 1.25s system 111% cpu 17.330 total
java CollectionLoops RWCollection > /dev/null  18.16s user 1.32s system 111% cpu 17.448 total
java CollectionLoops RWCollection > /dev/null  30.02s user 1.45s system 108% cpu 29.129 total
java CollectionLoops RWCollection > /dev/null  51.06s user 1.51s system 104% cpu 50.511 total
java CollectionLoops RWCollection > /dev/null  18.06s user 1.28s system 111% cpu 17.318 total
java CollectionLoops RWCollection > /dev/null  30.24s user 1.49s system 109% cpu 28.919 total
java CollectionLoops RWCollection > /dev/null  18.02s user 1.25s system 110% cpu 17.369 total
java CollectionLoops RWCollection > /dev/null  18.18s user 1.28s system 111% cpu 17.399 total
java CollectionLoops RWCollection > /dev/null  34.30s user 1.52s system 106% cpu 33.568 total
java CollectionLoops RWCollection > /dev/null  18.21s user 1.24s system 111% cpu 17.508 total
java CollectionLoops RWCollection > /dev/null  18.20s user 1.28s system 111% cpu 17.469 total
java CollectionLoops RWCollection > /dev/null  18.83s user 1.32s system 111% cpu 18.048 total
java CollectionLoops RWCollection > /dev/null  30.08s user 1.58s system 106% cpu 29.608 total
java CollectionLoops RWCollection > /dev/null  18.17s user 1.26s system 111% cpu 17.449 total
(; PATH=/java/re/jdk/6.0/promoted/all/b12/binaries/solaris-amd64/bin:$PATH ; )  682.12s user 39.92s system 108% cpu 11:02.88 total
(mb29450@mho) ~/src/toy/6329104/3oct05 $ (PATH=/java/re/jdk/6.0/promoted/all/b54/binaries/solaris-amd64/bin:$PATH; java -version; repeat 30 time java CollectionLoops RWCollection > /dev/null)
java version "1.6.0-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-ea-b54)
Java HotSpot(TM) Server VM (build 1.6.0-ea-b54, mixed mode)
java CollectionLoops RWCollection > /dev/null  23.06s user 0.24s system 125% cpu 18.592 total
java CollectionLoops RWCollection > /dev/null  23.28s user 0.23s system 125% cpu 18.689 total
java CollectionLoops RWCollection > /dev/null  23.35s user 0.22s system 124% cpu 18.879 total
java CollectionLoops RWCollection > /dev/null  29.55s user 0.23s system 129% cpu 23.007 total
java CollectionLoops RWCollection > /dev/null  23.00s user 0.22s system 124% cpu 18.659 total
java CollectionLoops RWCollection > /dev/null  43.36s user 0.23s system 129% cpu 33.588 total
java CollectionLoops RWCollection > /dev/null  22.99s user 0.23s system 124% cpu 18.668 total
java CollectionLoops RWCollection > /dev/null  23.08s user 0.23s system 125% cpu 18.548 total
java CollectionLoops RWCollection > /dev/null  22.21s user 0.23s system 126% cpu 17.700 total
java CollectionLoops RWCollection > /dev/null  29.65s user 0.23s system 130% cpu 22.927 total
java CollectionLoops RWCollection > /dev/null  43.17s user 0.23s system 129% cpu 33.528 total
java CollectionLoops RWCollection > /dev/null  23.16s user 0.22s system 124% cpu 18.719 total
java CollectionLoops RWCollection > /dev/null  23.44s user 0.23s system 123% cpu 19.109 total
java CollectionLoops RWCollection > /dev/null  44.02s user 0.23s system 129% cpu 34.048 total
java CollectionLoops RWCollection > /dev/null  22.36s user 0.23s system 127% cpu 17.670 total
java CollectionLoops RWCollection > /dev/null  29.10s user 0.23s system 127% cpu 23.037 total
java CollectionLoops RWCollection > /dev/null  42.00s user 0.23s system 129% cpu 32.599 total
java CollectionLoops RWCollection > /dev/null  22.78s user 0.22s system 124% cpu 18.468 total
java CollectionLoops RWCollection > /dev/null  41.38s user 0.22s system 129% cpu 32.238 total
java CollectionLoops RWCollection > /dev/null  41.43s user 0.23s system 129% cpu 32.210 total
java CollectionLoops RWCollection > /dev/null  23.15s user 0.23s system 124% cpu 18.729 total
java CollectionLoops RWCollection > /dev/null  42.79s user 0.23s system 130% cpu 33.057 total
java CollectionLoops RWCollection > /dev/null  43.39s user 0.23s system 129% cpu 33.628 total
java CollectionLoops RWCollection > /dev/null  23.07s user 0.22s system 125% cpu 18.579 total
java CollectionLoops RWCollection > /dev/null  29.65s user 0.22s system 129% cpu 23.037 total
java CollectionLoops RWCollection > /dev/null  23.46s user 0.23s system 126% cpu 18.720 total
java CollectionLoops RWCollection > /dev/null  23.53s user 0.23s system 126% cpu 18.827 total
java CollectionLoops RWCollection > /dev/null  29.93s user 0.23s system 129% cpu 23.258 total
java CollectionLoops RWCollection > /dev/null  37.22s user 0.22s system 126% cpu 29.529 total
java CollectionLoops RWCollection > /dev/null  23.36s user 0.23s system 126% cpu 18.670 total
(; PATH=/java/re/jdk/6.0/promoted/all/b54/binaries/solaris-amd64/bin:$PATH ; )  895.99s user 6.89s system 127% cpu 11:48.20 total

Comments
SUGGESTED FIX Webrev: http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/c2_baseline/2005/20051005111734.kvn.6329104/workspace/webrevs/webrev-2005.10.05/index.html 1. Adjust method's mileage if invocation or backedge counters carry bit is set. 2. Make sure we not prevent inlining when we reset counters in CompilationPolicy::reset_counter_for_back_branch_event() method. We don't need it in CompilationPolicy::reset_counter_for_invocation_event() since it sets countee's carry bit which is checked by the method methodOopDesc::was_executed_more_then(n).
07-10-2005

EVALUATION The difference is that for slow run we generate virtual calls for some call sites whne for fast run it is inlining based on type profiling. The question is why the type profiling for those call sites doesn't work always. ---- When a method's invocation counter reaches the compilation threshold we put the method on the compilation queue and RESET the counter but also set 'carry' flag indicating that the counter reached the threshold. There is the method is_mature() which compare the invocation counter with other limit (about 5300) to determine if the method was called enough time to invoke some optimizations during compilation. We use this method to decide should we do virtual call for a call site or do inlining based on type profiling. Unfortunately is_mature() don't check the carry flag so it fails to recognize the method as mature when we reset the counter. The only check it does is the compare with 3300 and if the counter is less than it decided there was reset and treats the method as mature. So it is only the range between 3300 and 5300 when it fails. I don't know why it becomes more visible now. It exists since Tiger when is_mature() was added.
27-09-2005