JDK-6982173 : (coll) Performance anomalies in AbstractSet.removeAll
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util:collections
  • Affected Version: 7
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2010-09-03
  • Updated: 2020-11-12
Description
As described in Jon Skeet's blog post
 http://msmvps.com/blogs/jon_skeet/archive/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza.aspx

there can be performance anomalies in AbstractSet.removeAll under some conditions; quoting:

" This implementation [of AbstractSet.removeAll] determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method.

Now that sounds reasonable on the surface of it - iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn't mean it's going to be fast. In our case, the collections are the same size - but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)... whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we've got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren't valid in this case."
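The anomaly is easy to reproduce. In the sketch below (class and variable names are ours, not from the report), a HashSet and an ArrayList of equal size trigger the branch of AbstractSet.removeAll that iterates the set and probes the list:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class RemoveAllAnomaly {
    public static void main(String[] args) {
        int n = 10_000;
        Collection<Integer> list = new ArrayList<>();
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < n; i++) {
            list.add(i);
            set.add(i);
        }

        // Equal sizes: AbstractSet.removeAll iterates over this set and
        // calls list.contains(e) for every element. ArrayList.contains is
        // O(n), so the whole call is O(n^2).
        long t0 = System.nanoTime();
        set.removeAll(list);
        long quadratic = System.nanoTime() - t0;

        // Workaround: probe a HashSet copy of the list instead, O(n) total.
        for (int i = 0; i < n; i++) set.add(i);
        Set<Integer> probe = new HashSet<>(list);
        t0 = System.nanoTime();
        set.removeIf(probe::contains);
        long linear = System.nanoTime() - t0;

        System.out.printf("removeAll(list): %d ms, removeIf(probe): %d ms%n",
                quadratic / 1_000_000, linear / 1_000_000);
    }
}
```

On typical hardware the first timing grows quadratically with n while the second stays linear; the exact numbers are machine-dependent, so the point is the trend, not the figures.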

Comments
Note from Josh: Read this: http://msmvps.com/blogs/jon_skeet/archive/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza.aspx

Clearly I goofed. It would appear that I didn't think of that case when I specified the behavior :( Probably the best way to fix it would be with an interface akin to RandomAccess that allows a collection to indicate that it can do a fast (log n or better?) containment test. Maybe AssociativeAccess, or some such?

Josh

P.S. I'm cc'ing Bill Pugh because this could be turned into a performance puzzler.
12-11-2020
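Josh's RandomAccess-style idea could look roughly like the following sketch. AssociativeAccess is the hypothetical name from his note, and MarkerAwareSet is invented here for illustration; neither type exists in the JDK:

```java
import java.util.Collection;
import java.util.HashSet;

// Hypothetical marker interface (the name comes from Josh's note; it is
// not part of the JDK): implementors promise a fast containment test.
interface AssociativeAccess {
}

// Illustrative set that consults the marker before choosing which side
// of a removeAll call to iterate.
class MarkerAwareSet<E> extends HashSet<E> implements AssociativeAccess {
    @Override
    public boolean removeAll(Collection<?> c) {
        // Only trust c.contains() if c advertises fast lookups.
        if (c instanceof AssociativeAccess && size() <= c.size()) {
            return super.removeAll(c); // iterate this set, probe c
        }
        boolean modified = false;
        for (Object e : c) {           // iterate c, probe this set: O(1) each
            modified |= remove(e);
        }
        return modified;
    }
}
```

With this shape, passing an ArrayList of any size never triggers the quadratic branch, because the list does not carry the marker.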

Jon Skeet's article now appears at this location: https://codeblog.jonskeet.uk/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza/
12-06-2017

EVALUATION Or perhaps something could be done with a marker interface, so that collections can identify themselves as having certain performance characteristics, and that information could be used by removeAll().
03-09-2010

EVALUATION It's unfortunate that AbstractSet.removeAll() is specified to behave exactly in this way. So, while in one sense it's not a bug, it clearly is still a performance anomaly. Maybe removeAll() could be overridden in HashSet to use more sophisticated heuristics, based on what we know about the actual types and their performance characteristics. I'm not sure whether the spec or implementation of AbstractSet.removeAll() could be changed in the same way, though. If not, then we would probably have to look at the other uses of AbstractSet within java.util.
03-09-2010
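Until the spec or the implementations change, callers can sidestep the slow branch themselves. This helper is our sketch, not JDK code: it copies a non-Set argument into a HashSet so that whichever side removeAll decides to probe, contains() is O(1):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

final class RemoveAllWorkaround {
    // One O(n) copy of the argument beats O(m*n) ArrayList scans whenever
    // AbstractSet.removeAll decides to iterate the target and probe c.
    static <E> boolean removeAllFast(Set<E> target, Collection<?> toRemove) {
        Collection<?> probe =
                (toRemove instanceof Set) ? toRemove : new HashSet<>(toRemove);
        return target.removeAll(probe);
    }
}
```

The copy costs extra memory proportional to the argument, which is why a library-side heuristic (rather than a mandatory copy) is what the evaluation above contemplates.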