JDK-8134141 : Arrays.hashCode broken for small arrays
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util
  • Affected Version: 7,8
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • Submitted: 2015-08-16
  • Updated: 2017-10-19
  • Resolved: 2015-08-21
Description
FULL PRODUCT VERSION :
openjdk version "1.8.0_51"
OpenJDK Runtime Environment (build 1.8.0_51-b16)
OpenJDK 64-Bit Server VM (build 25.51-b03, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
4.1.3-201.fc22.x86_64

EXTRA RELEVANT SYSTEM CONFIGURATION :
32GB RAM Quad core sandybridge

A DESCRIPTION OF THE PROBLEM :
I have recently noticed that the built-in Arrays.hashCode is broken for small arrays. This results in massive collisions for arrays that hold only a few values, causing hash maps that use such arrays as keys (via wrappers that override hashCode to call Arrays.hashCode) to degenerate into a series of lists or become completely unusable.

E.g., running Arrays.hashCode on a byte[] of length 4 a quarter of a billion times (with different array contents on each iteration) results in 248 million collisions out of 248,720,625 iterations.
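For illustration, a minimal sketch of the kind of wrapper described above (the class name is hypothetical); equals and hashCode simply delegate to java.util.Arrays:

    final class ByteArrayKey {
        private final byte[] data;

        ByteArrayKey(byte[] data) {
            this.data = data;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof ByteArrayKey
                    && java.util.Arrays.equals(data, ((ByteArrayKey) o).data);
        }

        @Override
        public int hashCode() {
            // Every key built this way inherits the collision behavior
            // of Arrays.hashCode described in this report.
            return java.util.Arrays.hashCode(data);
        }
    }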

Here's my test code so you can see for yourself (lots of RAM needed):

// Requires java.util.Arrays and java.util.HashSet, and a large heap.
// The set is created with load factor 1.0 and a large initial capacity
// so it is not resized mid-run.
HashSet<Integer> hs = new HashSet<>(100000000, 1.0f);
long collide = 0;
long totalLoops = 0;
byte[] ba = new byte[4];
System.gc();
long time = System.currentTimeMillis();
// The first byte takes only 15 values to bound the run; the other three
// loops stop at 126 so the byte counters never wrap around, giving
// 15 * 255 * 255 * 255 = 248,720,625 distinct arrays.
for (byte d = 0; d < 15; d++) {
    ba[0] = d;
    for (byte i = -128; i < 127; i++) {
        ba[1] = i;
        for (byte k = -128; k < 127; k++) {
            ba[2] = k;
            for (byte j = -128; j < 127; j++) {
                ba[3] = j;
                int hash = Arrays.hashCode(ba);
                if (hs.contains(hash)) {
                    collide++;
                } else {
                    hs.add(hash);
                }
                totalLoops++;
            }
        }
    }
}
System.out.println("total time = " + (System.currentTimeMillis() - time));
System.out.println("collisions = " + collide + " loops = " + totalLoops);

System.exit(0);

This occurs because the range of the 31 multiplier is not large enough. If you replace 31 with the prime 257, you get 0 collisions at about 4 times the run time. The massive lookup cost of a HashMap that has degenerated into a series of lists because of a poor array hash function is an order of magnitude worse than the 4x slowdown from using 257 instead of 31. And 10 seconds to hash a quarter of a billion entries is not bad, considering that other good hash functions such as MurmurHash3 took 65 seconds on the test above. I tried other primes, but I either got millions of collisions or more run time. I would advise you to replace the 31 in Arrays.hashCode with 257.
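A minimal sketch of the suggested variant, for reference (the helper name is hypothetical; this is not Arrays.hashCode, and as the resolution below explains, the real multiplier cannot be changed compatibly):

    // Same multiply-and-add shape as the specified algorithm,
    // but with the prime 257 in place of 31.
    static int hashCode257(byte[] a) {
        if (a == null)
            return 0;
        int result = 1;
        for (byte element : a)
            result = 257 * result + element;
        return result;
    }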


REPRODUCIBILITY :
This bug can be reproduced always.


Comments
Arrays.hashCode(byte[]) is defined in terms of List.hashCode() of the equivalent Byte values. [1] In turn, List.hashCode() is defined as a computation involving multiply-by-31-and-add of the list's values in sequence. [2] This is where the value 31 comes from. Finally, the hashCode() of a byte value is simply that value as an integer. [3] All of these are defined by specification and cannot be changed compatibly. Doing so would break voluminous amounts of code. Closing as Won't Fix.

It's unfortunate that, given these specifications, the hash code of a four-byte array has a total range of only about 700,000 values (out of 4 billion possible input values and int hash code values). Thus, hashing 250 million such arrays will inevitably result in a large number of collisions.

As an alternative, four bytes can be encoded into a single int, and the hash code of an int is simply that int, which will at least give a hash code range equivalent to that of the input.

[1] http://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html#hashCode-byte:A-
[2] http://docs.oracle.com/javase/8/docs/api/java/util/List.html#hashCode--
[3] http://docs.oracle.com/javase/8/docs/api/java/lang/Byte.html#hashCode--
19-10-2017
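For illustration, the computation the specification pins down, alongside a sketch of the pack-into-an-int alternative mentioned in the comment above (helper names are hypothetical):

    // The multiply-by-31-and-add computation that Arrays.hashCode(byte[])
    // is specified to be equivalent to, seeded with 1.
    static int specifiedHash(byte[] a) {
        if (a == null)
            return 0;
        int result = 1;
        for (byte element : a)
            result = 31 * result + element;
        return result;
    }

    // The suggested alternative: pack the four bytes into a single int.
    // Integer.hashCode(x) is x itself, so for length-4 byte arrays the
    // hash range is exactly as wide as the input range.
    static int packedKey(byte[] b) {
        return ((b[0] & 0xFF) << 24)
             | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8)
             |  (b[3] & 0xFF);
    }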

As an aside, changing the hash code to use 257 instead of 31 does seem to work a lot better. The reason the test run ran out of memory was probably *because* it was working better. The original (as-specified) hash code produces somewhat fewer than 700,000 distinct values, which can be stored easily in a HashSet. The modified hash code probably produces around 250 million unique values, all of which must be stored in the HashSet. Given that an Integer consumes 16 bytes and a HashMap.Node consumes 32 bytes, that's 12GB right there, and additional memory is required for runtime overhead. If the computer has 16GB, the maximum Java heap size might be lower than what's required, so the program probably spent most of its 30-minute run attempting garbage collection. In addition, the initial HashSet allocation size is "only" 100,000,000, so the table has to be reallocated and copied a couple of times, resulting in more memory pressure and copying time. I was able to get the modified run to complete in a couple of minutes by setting the maximum heap size to 20GB (-Xmx20g -Xms20g) and changing the initial HashSet size to 300 million.
21-08-2015
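A rough check of the arithmetic above (heap sizes are from the comment; the class name is a placeholder):

    // ~250 million unique hash values, each boxed as an Integer (~16 bytes)
    // referenced from a HashMap.Node (~32 bytes):
    //   250,000,000 * (16 + 32) bytes = 12 GB, before runtime overhead.
    // Pre-sizing the set avoids the rehash-and-copy passes mid-run:
    HashSet<Integer> hs = new HashSet<>(300000000, 1.0f);
    // ...and the run needs a heap large enough to hold it, e.g.:
    //   java -Xms20g -Xmx20g HashCollisionTest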

Might be a duplicate of JDK-6530203, not sure at this point. At least the issues are very similar. The hash computed by Arrays.hashCode(byte[]) is fully specified, and relies on hashes computed by Byte.hashCode and List.hashCode. These cannot be changed compatibly.
21-08-2015

**Please note that this bug has been raised on OpenJDK.** I tried this in JDK 8u60. There are indeed collisions for small arrays when Arrays.hashCode is calculated a quarter of a billion times, with each array's contents being different. The user has suggested replacing 31 with the prime 257 in the hashCode calculation. I ran the test case using the 31 multiplier in the hashCode calculation and it took around 3 seconds; running with the 257 multiplier continued for 30 minutes and finally threw an OutOfMemoryError. (I tried on my Lenovo laptop: Windows 7, Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz, 16.0 GB RAM.) The test case is attached. Moving to dev team for further consideration.
21-08-2015