JDK-8178425 : The AbstractSet.removeAll() method can be more efficient when removing a large list
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.util:collections
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2017-04-09
  • Updated: 2019-02-14
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Description
A DESCRIPTION OF THE REQUEST :
The AbstractSet.removeAll() method has two strategies:
1. Iterate through the collection to be removed and remove every element.
2. Iterate through itself, checking whether the element is in the collection and should be removed.
The strategy is chosen based on the size of each collection - the smallest collection is iterated through. If the collection to be removed is a List with an equal or larger size than *this* the method completes in O(n^2) time rather than O(n) time. This is because List implementations usually exhibit O(n) time for their contains() methods, whereas Set implementations usually exhibit O(1) time.

JUSTIFICATION :
O(n) time is more desirable than O(n^2) time

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Removing any collection from a Set should work in O(n) time with n being the size of the collection to be removed
ACTUAL -
Removing a large List from a Set of equal or smaller size exhibits O(n^2) time

---------- BEGIN SOURCE ----------
package main;

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class HashSetWithFastRemoveAll<T> extends HashSet<T> {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean removeAll(Collection<?> c) {
                /*
                 * copied verbatim from AbstractSet implementation apart from
                 * "|| instanceof List"
                 */
                boolean modified = false;
                if (size() > c.size() || c instanceof List) {
                        for (Iterator<?> i = c.iterator(); i.hasNext();)
                                modified |= remove(i.next());
                } else {
                        for (Iterator<?> i = iterator(); i.hasNext();) {
                                if (c.contains(i.next())) {
                                        i.remove();
                                        modified = true;
                                }
                        }
                }
                return modified;
        }

        // ===testing code===
        public static void main(String[] args) {
                Set<String> elements = new HashSet<>();
                for (int i = 0; i < 20_000; i++) {
                        elements.add(randomString());
                }
                testSet(true, elements, new HashSet<>());
                testSet(true, elements, new HashSetWithFastRemoveAll<>());
                testSet(false, elements, new HashSet<>());
                testSet(false, elements, new HashSetWithFastRemoveAll<>());
        }

        private static void testSet(boolean isWarmUp, Set<String> elements, Set<String> set) {
                Set<String> removed = new HashSet<>();
                for (int i = 0; i < 20_000; i++) {
                        removed.add(randomString());
                }
                // removing all elements is ~4x faster
                // removed = elements;
                List<String> largeList = new ArrayList<>(removed);
                Set<String> largeSet = new HashSet<>(removed);
                set.clear();
                set.addAll(elements);
                double listMS = timeMS(() -> set.removeAll(largeList));
                set.clear();
                set.addAll(elements);
                double setMS = timeMS(() -> set.removeAll(largeSet));
                if (!isWarmUp) {
                        System.out.println("Testing: " + set.getClass());
                        System.out.println("Remove all from List: " + listMS);
                        System.out.println("Remove all from Set: " + setMS);
                        System.out.println();
                }
        }

        private static double timeMS(Runnable runnable) {
                long pre = System.nanoTime();
                runnable.run();
                long post = System.nanoTime();
                return 1E-6 * (post - pre);
        }

        private static String randomString() {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 5; i++) {
                        char ch = (char) (Math.random() * 26 + 'a');
                        sb.append(ch);
                }
                return sb.toString();
        }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
In AbstractSet.removeAll() method, a precheck can be added to the size comparison ie.

if (size() > c.size()) {

can be replaced with

if (size() > c.size() || c instanceof List) {


Comments
The size-based heuristic is problematic for semantic reasons, namely choosing which collection to iterate over is NOT symmetric. The argument collection might have a different notion of equality for the contains() method, e.g. if one is a SortedSet whose comparator isn't consistent with equals. See JDK-6394757.
12-06-2017

Collections.disjoint() has similar heuristics based on size and whether one or both collections is a Set. Possibly this is preferable: if (!(c instanceof Set) || size() > c.size()) as it makes it more explicit that the assumption is that Set.contains() is O(1). The compatibility implications should be examined closely. The current behavior is specified precisely to be based on size, not on the type of the other collection, so this will need to be adjusted if we proceed with this change.
15-04-2017