JDK-8143577 : optimize ArrayList.removeIf
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.util:collections
  • Priority: P5
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2015-11-21
  • Updated: 2018-08-17
  • Resolved: 2016-11-29

Fixed in: JDK 9 b148
Description
A straightforward iterator loop removing selected elements from an ArrayList is O(n^2). (More precisely it's O(m*n) where m is the number of deletions and n is the number of elements.) This is because each removal causes the right-hand tail of the array to be copied one space to the left, and elements farther to the right in the array are copied multiple times.
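
For concreteness, the loop in question looks something like this (a generic sketch; the helper name is made up):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.function.Predicate;

    // Each it.remove() shifts the tail of the backing array one slot to
    // the left via System.arraycopy, so m removals cost O(m*n) overall.
    static <E> void removeWithIterator(ArrayList<E> list,
                                       Predicate<? super E> filter) {
        for (Iterator<E> it = list.iterator(); it.hasNext(); ) {
            if (filter.test(it.next()))
                it.remove();
        }
    }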

The current implementation of ArrayList.removeIf avoids this behavior by using a two-pass algorithm: it calls the predicate on each element, recording the results in a BitSet. It then scans the BitSet and copies each retained element exactly once, directly into its final location. This is O(2n) time, which is simply O(n). However, it uses O(n) space for the BitSet, though in practice closer to n/32 words, thanks to BitSet's packed representation.
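
A sketch of that two-pass shape (a hypothetical standalone helper operating on the backing array and logical size, not the actual JDK source):

    import java.util.BitSet;
    import java.util.function.Predicate;

    // Pass 1 only records matches, so a predicate that throws leaves the
    // array untouched. Pass 2 copies each survivor exactly once.
    // Returns the new logical size.
    static <E> int removeIfTwoPass(Object[] elements, int size,
                                   Predicate<? super E> filter) {
        BitSet removeSet = new BitSet(size);
        for (int i = 0; i < size; i++) {
            @SuppressWarnings("unchecked")
            E e = (E) elements[i];
            if (filter.test(e))
                removeSet.set(i);
        }
        int dest = 0;
        for (int i = 0; i < size; i++)
            if (!removeSet.get(i))
                elements[dest++] = elements[i];
        for (int i = dest; i < size; i++)
            elements[i] = null;   // clear the tail for GC
        return dest;
    }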

An alternative algorithm could be implemented that is O(n) and allocates no extra space. It involves walking two pointers (really, indexes) into the array from left to right. The leading pointer advances over every element, applying the predicate to each. The trailing pointer tracks the destination to which an element is to be copied if it is not to be removed. Elements to be removed are simply skipped. When the leading pointer has passed the last element, the array slots between the trailing and leading pointers are nulled out.

This should be faster and should allocate less space than the BitSet approach. However, the BitSet approach has the characteristic that if the predicate throws an exception, the list remains unchanged. With the two-pointer approach, if the predicate throws an exception, the list becomes corrupted.

Bulk operations don't generally have all-or-nothing semantics. For example, consider list.forEach(consumer) where the consumer has a side effect of modifying each element. If the consumer were to throw an exception, some elements would be modified and some would not be. It is essentially impossible to recover from this. So, while the all-or-nothing characteristic of the current ArrayList.removeIf implementation seems attractive, it's basically unnecessary.
Comments
I'm content with the decision to allow predicates in bulk traversal methods to read but not write the underlying collection. It's a little slower, but safer, in the spirit of Java.
05-12-2016

It may have been wise not to switch to a single-pass algorithm. I just ran across this Stack Overflow question: http://stackoverflow.com/questions/40974247/collection-removeif-works-strangely

Essentially, the submitter has a predicate that calls list.contains(). The straightforward iterator/remove loop behaves differently from the list.removeIf() call. While switching to a single-pass removeIf() algorithm probably wouldn't make this case behave consistently with the iterator/remove loop, this example does show that people sometimes call methods on a collection while iterating over it. The single-pass algorithm would potentially expose inconsistent state to such callers, which seems bad.
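
A minimal reconstruction of that kind of reentrant predicate (hypothetical values, not the exact code from the question):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    List<Integer> list = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5));
    // The predicate reads the very list being filtered. The two-pass
    // strategy leaves the array intact while the predicate runs, so each
    // contains() call sees the original contents; a single-pass compaction
    // would expose a partially shuffled array to it instead.
    list.removeIf(e -> list.contains(e + 1));
    System.out.println(list);   // prints [5] under the two-pass strategy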
05-12-2016

We decided to (continue to?) support read access to the underlying collection during destructive traversals like removeIf, by using the two-pass strategy. It's only slightly slower than a one-pass strategy. All array-based collections will see performance improvements.
19-11-2016

We now have implementations of faster .forEach and .removeIf etc. for ArrayList, ArrayDeque, Vector, and ArrayBlockingQueue in jsr166 CVS. When we benchmark e.g. ABQ.forEach, it's 20x faster! For removeIf, we can do even better: we have an O(n) instead of an O(n^2) algorithm.

BUT ... we're calling user code in the middle of a traversal, and that user code might reentrantly call methods on the collection. I don't think we have any spec outlawing this (there is no explicit Iterator). The problem can already be seen in jdk8 with e.g. ABQ.contains, which calls equals, which could conceivably modify the collection in the middle of iteration. But that's very unlikely. Who would dare file a bug?

The problem with removeIf is worse than the problem with forEach, since with the former the predicate cannot even call read-only methods on the collection, because the invariants are temporarily broken during iteration. A plausible use of reading while traversing with forEach might be for the callback to call contains() on related elements while processing some element.

The performance improvements are so huge that we want to give them to our users, and almost all users will want them, but we should make the conditions on user code clear. For the case of removeIf, we can make read access from the predicate safe if we use the existing two-pass strategy of collecting indices of elements to be removed in a bitset, then physically removing them in a second pass. A bit slower, but preserving the O(n) property.

Maybe for blocking queues, interior access and traversal operations aren't really important, and users of ABQ should not be encouraged to use methods like removeIf?
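
To make the equals() hazard concrete, here is a contrived sketch (a hypothetical class, not anything in jsr166): ABQ's contains() invokes the probe's equals() while holding a reentrant lock, so an equals() with a side effect can mutate the queue while contains() is still scanning it.

    import java.util.concurrent.ArrayBlockingQueue;

    public class EvilEqualsDemo {
        static final ArrayBlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);

        static class Evil {
            // contains() calls probe.equals(element); this equals() polls
            // the queue, and the reentrant lock lets the mutation proceed
            // in the middle of the contains() traversal.
            @Override public boolean equals(Object other) {
                queue.poll();
                return this == other;
            }
            @Override public int hashCode() { return System.identityHashCode(this); }
        }

        public static void main(String[] args) {
            queue.add(new Evil());
            queue.add(new Evil());
            queue.contains(new Evil());   // a "read-only" call that mutates the queue
        }
    }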
10-11-2016

Something like this would do the trick: (note, untested)

    public boolean removeIf(Predicate<? super E> filter) {
        // find the first element to delete
        int firstDel = 0;
        while (true) {
            if (firstDel == size)
                return false;            // nothing matched; list unchanged
            @SuppressWarnings("unchecked")
            E e = (E) elementData[firstDel];
            if (filter.test(e))
                break;
            firstDel++;
        }
        // skip removed elements, copying survivors leftward
        int dest = firstDel;
        while (++firstDel < size) {
            @SuppressWarnings("unchecked")
            E e = (E) elementData[firstDel];
            if (!filter.test(e))
                elementData[dest++] = e;
        }
        // null out the tail of the array and update the size
        int newSize = dest;
        while (dest < size)
            elementData[dest++] = null;
        size = newSize;
        return true;
    }
05-10-2016

I agree with the reporter's analysis. With minimal effort we can ensure that the data structure is never actually corrupted in case of an exception, even though only some of the elements were removed.
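
One hypothetical way to get that guarantee with the single-pass scheme (a sketch, not the committed fix): run the skip-and-copy loop of the sketch above under try/finally, so that if the predicate throws, the not-yet-tested tail is slid down next to the survivors and the size is fixed up before the exception propagates. The list then remains structurally valid, merely with some matching elements already removed.

    // Exception-safe variant of the skip-and-copy loop, starting from
    // firstDel as above. If filter.test throws, the finally block closes
    // the gap so no element is duplicated or lost.
    int i = firstDel, dest = firstDel;
    try {
        for (; i < size; i++) {
            @SuppressWarnings("unchecked")
            E e = (E) elementData[i];
            if (!filter.test(e))
                elementData[dest++] = e;
        }
    } finally {
        if (i < size) {        // predicate threw; keep the untested tail
            System.arraycopy(elementData, i, elementData, dest, size - i);
            dest += size - i;
        }
        for (int k = dest; k < size; k++)
            elementData[k] = null;   // clear freed slots for GC
        size = dest;
    }
    return true;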
05-10-2016