United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-5028425 : (coll) Collection.retainAll(Collection) implementations are not optimized

Details
Type:
Enhancement
Submit Date:
2004-04-07
Status:
Open
Updated Date:
2012-10-08
Project Name:
JDK
Resolved Date:
Component:
core-libs
OS:
windows_2000
Sub-Component:
java.util:collections
CPU:
x86
Priority:
P4
Resolution:
Unresolved
Affected Versions:
1.4.2
Targeted Versions:

Related Reports

Sub Tasks

Description

Name: jl125535			Date: 04/07/2004


FULL PRODUCT VERSION :
java version "1.4.2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-b28)
Java HotSpot(TM) Server VM (build 1.4.2-b28, mixed mode)

java version "1.5.0-beta2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-beta2-b45)
Java HotSpot(TM) Client VM (build 1.5.0-beta2-b45, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows 2000 [Version 5.00.2195]

A DESCRIPTION OF THE PROBLEM :
Collection.retainAll(Collection) implementations aren't optimized.
When a sufficiently large collection is used as an argument, the
performance is pathologically slow.  (See attached program results)

There are several factors that could be taken into account to optimize the
performance of the retainAll(Collection) method.  The default implementation
provided by AbstractCollection could, modify its behavior depending on
the size of the specified collection and the cost of its containment test.

The attached code demonstrates a very simple way to avoid the current
pathological worst case (by creating a HashSet).  If the cost of constructing an unnecessary HashSet is to be avoided, an interface could be defined indicating that a Collection has an O(1) contains() method.

The retainAll(Collection) method can be viewed as an operation on two collections.

The concrete Collection implementations (HashSet, TreeSet, ArrayList, etc...)
have at their disposal.
- the sizes of both collections
- the element insertion speed of the implementing collection
- the element deletion speed of the implementing collection
- the containment test speed of the implementing collection
Additionally, the following information is likely to be known
- the element insertion speed of the argument collection
- the element deletion speed of the argument collection
- the containment test speed of the argument collection

So, not only could the implementation provided by AbstractCollection be
greatly improved, but the implementations provided by the concrete
implementations could be improved beyond that.

Similar optimizations should be made for the containsAll(Collection) and
removeAll(Collection) methods, too.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Invoke retainAll(collection) on a collection, using an argument with a relatively slow (>O(1)) containment check.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The execution speed of x.retainAll(y) should be no greater than O(x.size() + y.size()).
ACTUAL -
The execution spped of x.retainAll(y) can be as slow as O(x.size() * y.size()).
Partial execution results follow:

java.util.ArrayList took 32 ms  for 512
java.util.LinkedList took 0 ms  for 512
java.util.HashSet took 0 ms  for 512
java.util.TreeSet took 0 ms  for 512
Test$FastSet took 16 ms  for 512
java.util.ArrayList took 0 ms  for 1024
java.util.LinkedList took 0 ms  for 1024
java.util.HashSet took 0 ms  for 1024
java.util.TreeSet took 0 ms  for 1024
Test$FastSet took 0 ms  for 1024
java.util.ArrayList took 47 ms  for 2048
java.util.LinkedList took 15 ms  for 2048
java.util.HashSet took 16 ms  for 2048
java.util.TreeSet took 31 ms  for 2048
Test$FastSet took 16 ms  for 2048
java.util.ArrayList took 63 ms  for 4096
java.util.LinkedList took 31 ms  for 4096
java.util.HashSet took 47 ms  for 4096
java.util.TreeSet took 47 ms  for 4096
Test$FastSet took 31 ms  for 4096
java.util.ArrayList took 203 ms  for 8192
java.util.LinkedList took 156 ms  for 8192
java.util.HashSet took 156 ms  for 8192
java.util.TreeSet took 141 ms  for 8192
Test$FastSet took 0 ms  for 8192
java.util.ArrayList took 891 ms  for 16384
java.util.LinkedList took 703 ms  for 16384
java.util.HashSet took 719 ms  for 16384
java.util.TreeSet took 719 ms  for 16384
Test$FastSet took 16 ms  for 16384
java.util.ArrayList took 11515 ms  for 32768
java.util.LinkedList took 11047 ms  for 32768
java.util.HashSet took 10031 ms  for 32768
java.util.TreeSet took 9281 ms  for 32768
Test$FastSet took 62 ms  for 32768
java.util.ArrayList took 98015 ms  for 65536
java.util.LinkedList took 92234 ms  for 65536
java.util.HashSet took 91921 ms  for 65536
java.util.TreeSet took 94109 ms  for 65536
Test$FastSet took 78 ms  for 65536
java.util.ArrayList took 492825 ms  for 131072
java.util.LinkedList took 475794 ms  for 131072
java.util.HashSet took 476044 ms  for 131072
java.util.TreeSet took 473700 ms  for 131072
Test$FastSet took 281 ms  for 131072
...

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.*;

public class Test {

private static class FastSet extends HashSet {

    public FastSet(Collection collection) {
        super(collection);
    }

    /**
     * A very easy way to avoid the worst case performance.
     * There is still <b>lots</b> of room for improvement here.
     */
    public boolean retainAll(Collection collection) {
        if (collection.size() > 1000)
	        collection = new HashSet(collection);
	    return super.retainAll(collection);
    }
} // Fast Set

private static void testRetainAll(Collection full, Collection half) {
    int  size  = full.size();
    long start = System.currentTimeMillis();
    full.retainAll(half);
    long end   = System.currentTimeMillis();
    long time  = end - start;
    String name = full.getClass().getName();
    System.out.println(name + " took " + time + " ms " + " for " + size);
}

private static void testRetainAll(int size) {

    // create a list of size integers
    List full = new ArrayList(size);
    for (int i=0; i<size; i++)
        full.add(new Integer(i));

    // create a random list half the size of the full set
    List half = new ArrayList(full);
    java.util.Collections.shuffle(half);
    while (half.size() > full.size() / 2 )
        half.remove(0);

    testRetainAll(new ArrayList(full),half);
    testRetainAll(new LinkedList(full),half);
    testRetainAll(new HashSet(full),half);
    testRetainAll(new TreeSet(full),half);
    testRetainAll(new FastSet(full),half);
}

public static void main(String[] args) {
    int x=512;
    for (int i=0; i<28; i++) {
        testRetainAll(x);
        x *= 2;
    }
}

}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Workarounds:

a) Extend HashSet, TreeSet, etc.. to avoid the problem, and use the replacements, instead.
b) Make sure the collection arguments are transformed, when needed.  In other words, use something like this:
  x.retainAll(new HashSet(y));
c) Write an efficient library method like:
  public static Collection retainAll(Collection a, Collection b);
(Incident Review ID: 227973) 
======================================================================

                                    

Comments
EVALUATION

Some optimizations along the lines suggested might be justified.  Performance work on general-purpose collections is a tricky business.  It's not clear whether to optimize for worst-case or expected case behavior.  The reporter claims that the implementation is likely to know the cost of containment tests on the argument collection, but there is no reliable way to do this (in general).  It would be possible to add a marker interface akin to RandomAccess that allows a collection to say whether it has a "quick" containment test (whatever that means; probably O(log n) or better).

More generally it's not clear where our perforamnce tuning resources are best spent.  There is a relatively easy workaround for this one: if you're going to do removeAll or retainAll and the argument collection is large doesn't have a quick containment test, copy it yourself:

   foo.removeAll(new HashSet(baz));

###@###.### 2004-04-13
                                     
2004-04-13



Hardware and Software, Engineered to Work Together