Bug ID: JDK-4979028 BitSet.hashCode() unnecessarily slow due to extra scanning of zero bits

Type: Enhancement
Component: core-libs
Sub-Component: java.util
Affected Version: 1.4.2

Priority: P5
Status: Resolved
Resolution: Fixed
OS: windows_xp
CPU: x86

Submitted: 2004-01-15
Updated: 2005-02-19
Resolved: 2005-02-19

JDK 6
6 b25Fixed

Name: jl125535			Date: 01/15/2004


A DESCRIPTION OF THE REQUEST :
The algorithm to compute the hashCode of a BitSet scans bits[] entry past the last used entry (which are normally all set to 0).


JUSTIFICATION :
If a BitSet is created then a "high" bit is set and then cleared, computing the hashCode will be unnecessarily slowly scanning the whole size of the BitSet, instead of just hashing the used part from 0 to (unitsInUse-1) (the remaining elements are normally all zeroes and should not be hashed.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The hashing loop should stop at bits[unitsInUse-1] as the hashcode must only rely on elements stored in the set, independantly of its actual size.
Also the statistical distribution of elements in the resulting hash should be stronger, as the computed hash value depends on the position of elements in the bits[] table, with some positions being non primes.
ACTUAL -
The hashing loop currently stop at bits[bits.length - 1] including all units at bits[bits.unitsInUse .. bits.length] which are all zeroes.


Note that BitSets use a dynamic reallocation of the backing store by growing exponentially.

This means that the time to compute the hashCode of a Bitset whose elements are added linearily from the lowest to the highest does not grow linearily with the elements added, but is much slower and "jumps" as soon as some "key" elements are added.

---------- BEGIN SOURCE ----------
The following code provides an optimized version for dynamic BitSets where some bits have been cleared, or been dynamically allocated but still not used. It also provides a bettter distribution of hashCodes, when BitSets are inserted in hash tables, with much less collisions:

    public int hashCode() {
        long h = 1234;
        for (int i = units.unitsInUse - 1; i >= 0; )
            h ^= (units[i--] ^ (h << 3)) + 1;

        return (int)(h >>> 32) ^ (int)h;
    }

It relies on the following documented invariants:
     * INVARIANT: The words in units[] above unitInUse - 1 are zero.
     * INVARIANT: unitsInUse is nonnegative.
     * INVARIANT: units[unitsInUse - 1] is nonzero unless unitsInUse is zero.

The final return has also a small optimization to perform XOR on 32-bit ints instead of 64 bit ints and dropping the unneeded highest 32 bits at end (it does not rely on tricky optimizations by javac at compile-time or at run-time in the Hotspot compiler).

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
No work around necessary. This is a performance issue.
(Incident Review ID: 233280) 
======================================================================

EVALUATION A good idea. ###@###.### 2005-1-25 02:41:04 GMT Unfortunately, the current code does not always maintain the invariant that bits[unitsInUse-1] != 0; That needs to be fixed first. 6222207: BitSet internal invariants may be violated ###@###.### 2005-2-09 20:06:33 GMT

25-01-2005