A colleague reported:
ValidationState is used to validate XML ID and IDREF elements (among other types). To do so, it keeps data structures containing all IDs and IDREFs that have been seen in a document. The only methods ever called on the ID container are add() and contains(), so substituting HashSet for ArrayList makes no difference in behavior, while improving performance by orders of magnitude in large documents. No change is necessary/possible, however, for the IDREF container.This one is only ever iterated over -- never searched -- and order matters, so ArrayList is appropriate.
On a test document with ~1.5M elements, ~330K IDs, and ~430K IDREFs, this
change speeds up parsing (with validation enabled) by a factor of 26 (from
21.4 minutes, ~800ms/element, to 49 seconds, ~31��s/element).
There are further obvious algorithmic improvements possible for additional
constant-factor gains, but this is simple, safe, and brings schema validation
from O(n^2+mn) to O(n+m), where n is the number of IDs and m is the number of
IDREFs.