United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-4045622 : java.lang.String.hashCode spec incorrectly describes the hash algorithm

Details
Type:
Bug
Submit Date:
1997-04-17
Status:
Closed
Updated Date:
1999-01-15
Project Name:
JDK
Resolved Date:
1999-01-15
Component:
specification
OS:
solaris_2.5
Sub-Component:
language
CPU:
sparc
Priority:
P2
Resolution:
Fixed
Affected Versions:
1.1.1
Fixed Versions:
1.2.0 (1.2alpha)

Related Reports

Sub Tasks

Description
The JLS 20.12.10 (page 535) describes the java.lang.String.hashCode method in great detail.  Unfortunately, it doesn't describe what the code actually does.  The JLS says that for short strings the hashCode is (excuse the made up syntax for integer exponentiation):

	((c[0] * 37^0) + (c[1] * 37^1) + (c[2] * 37^2) + ...)

and something similar, but sampling, for long strings.  In fact, the code we have been shipping since 1994 is

    public int hashCode() {
	int h = 0;
	int off = offset;
	char val[] = value;
	int len = count;

	if (len < 16) {
	    for (int i = len ; i > 0; i--) {
		h = (h * 37) + val[off++];
	    }
	} else {
	    // only sample some characters
	    int skip = len / 8;
	    for (int i = len ; i > 0; i -= skip, off += skip) {
		h = (h * 39) + val[off];
	    }
	}
	return h;
    }

in which the exponentiation has been strength-reduced to multiplication, but in the process is applied to the characters of the string in reverse order.

Here's a test program that shows the difference:

% cat java/HashTest.java
public class HashTest {
    public static void main(String[] argv) {
	String one = new String("\u0001");
	int oneHash = one.hashCode();
	System.out.println("oneHash: " + Integer.toString(oneHash));
	int oneJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))));
	System.out.println("((1 * 37^0)): " +
			   Integer.toString(oneJLS));

	String two = new String("\u0001\u0002");
	int twoHash = two.hashCode();
	System.out.println("twoHash: " + Integer.toString(twoHash));
	int twoJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
		      (((int) '\u0002') * ((int) Math.pow(37.0, 1))));
	System.out.println("((1 * 37^0) + (2 * 37^1)): " +
			   Integer.toString(twoJLS));

	String three = new String("\u0001\u0002\u0003");
	int threeHash = three.hashCode();
	System.out.println("threeHash: " + Integer.toString(threeHash));
	int threeJLS = ((((int) '\u0001') * ((int) Math.pow(37.0, 0))) +
			(((int) '\u0002') * ((int) Math.pow(37.0, 1))) +
			(((int) '\u0003') * ((int) Math.pow(37.0, 2))));
	System.out.println("((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): " +
			   Integer.toString(threeJLS));
    }
}

% $VM12/build/bin/java -classpath classes:$VM12/build/classes HashTest
oneHash: 1
((1 * 37^0)): 1
twoHash: 39
((1 * 37^0) + (2 * 37^1)): 75
threeHash: 1446
((1 * 37^0) + (2 * 37^1) + (3 * 37^2)): 4182

You can see that it's off even by the two-character string.

Usually I side with the JLS over the code, but in this case I think the JLS was trying to describe what the code was doing but got it wrong.  I don't think we can change the code, since there will be a lot of codes out there that are depending on the unchanging computation of the String.hashCode method (I think of all those serialized hash tables which would be confused if they were deserialized in a VM whose String.hashCode method was different).

Either way, something needs to be changed, and quickly.

peter.kessler@Eng 1997-04-16

   It turns out that there's another, unrelated bug in the spec:  there is an error in the formulae for calculating the increment (k) and maximum value (m) 
that are used in the sigma expression for the "sampled hash" that is used
if n > 15.  As a result, the specified expression references charcters
that lie out of bounds, and would cause runtime exceptions if correctly
implemented.

joshua.bloch@Eng 1997-04-24

                                    

Comments
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
generic

FIXED IN:
1.2alpha

INTEGRATED IN:
1.2alpha


                                     
2004-06-14
SUGGESTED FIX

The following is the accepted proposal for the new string hash function,
followed by a report justifying its selection:

		New String Hash Function (Bugs 4045622, 1258091)
		================================================

The Problem:	The currently specified String hash function does not match
		the currently implemented function.  The specified function
		is not implementable.  (It addresses characters outside of 
		the input string.)  The implemented function performs very
		poorly on certain classes of strings, including URLs.  (The
		poor performance is due to the "sampling" nature of the
		function for strings over 15 characters.)  I view the
		specification problem as the perfect opportunity to replace
		the unfortunate implementation.

Requesters:	The problems with the implementation have been mentioned
		on comp.sys.java.lang.programmer, though the extent of the
		problem may not be known outside of JavaSoft.  The problems
		with the spec were discovered by Peter Kessler and myself.

Proposed API change:  No API would change, per se.  The function computed
		      by String.hashCode() would change to:

			 s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1]

		      where s[i] is the ith character of string s.

		      The Java Language Specification (which specifies the
		      value to be returned by String.hashCode()) would be
		      modified to reflect this.

		      The new hash function was selected after a fair amount
		      of study, as described in Exhibit A.  In the unlikely
		      event that you want even more detail, see me.

Implementation:	      Trivial.  (4 lines of code, which have already been 
		      written.)

Performance impact:   Hashing large strings will be somewhat more expensive,
		      as the new hash function examines every character,
		      but Hashtables performance will, on balance, improve,
		      as hash collisions will be vastly reduced.  While one
		      could construct a program to demonstrate the reduced
		      speed of the new hash function, I would be very
		      surprised if any real applications were adversely
		      affected in any measurable way.

Risk Assessment:      Surprisingly small.  Serialization of Hashtables does
		      not depend on consistent hash values.  It is possible
		      that some customer has implemented some persistent data
		      structure that relies on the String hash function not
		      changing, but it is highly unlikely that more than a
		      handful of customers have done so.

Testing Impact:	      None, really.  I've performed a fair number of tests
		      on the proposed hash function already.  (See below.)

Doc Impact:	      The release notes would have to mention the change, so
		      that any "researchy" customers who do rely on the precise
		      behavior of the hash function would be alerted.
		      The JLS would have to be modified in its next edition.
		      The actual text would be very short (1-2 paragraphs).

******************************************************************************

Exhibit A

    As per Guy's request, here's a note describing the "Pedigree" of my
proposed String hash function.  Surprisingly, I could find very little in the
way of published literature on practical, general-purpose string hash
functions; much of what is published is either outdated (e.g., Knuth V. 3,
Pp. 508-513) or arcane (lots of journal articles on 'perfect hashing').
Most of the standard algorithms and data structures textbooks barely
mention the subject.  The best treatment I was able to find was in Aho,
Sethi and Ullman's "Compilers: Principles, Techniques and Tools" AKA "The
Dragon Book" (Pp. 433-438).

    The class of hash function that I recommend using is polynomials whose
coefficients are the characters in the string:

    P(x) = s[0] * x^(n-1) + s[1] * x^(n-2) + ... + s[n-1]

This polynomial is calculated by the following code fragment:

    int hashVal = 0;
    for (int i=0; i<string.length; i++)
	hashVal = x*hashVal + s.charAt(i);

The value of x is part of the definition of the hash function; choosing
this value is somewhat of a black art.  Clearly 1 or any multiple of 2 is
bad.  Primes seem safer than composite numbers, though some composite
numbers yield excellent performance (as shown below).

    While this class of hash function is recommended in The Dragon Book
(P(65599)) and Kernighan and Ritchie's "The C Programming Language, 2 Ed."
(P(31)), it is not attributed in either of these books.  I went so far as
to call up Dennis Ritchie, who said that he did not know where the hash
function came from.  He walked across the hall and asked Brian Kernighan,
who also had no recollection.  A shareware library from Germany called
"Matpack" contains ten string hash functions, including P(33), which it
attributes to noted Unix hacker Chris Torek.  I sincerely doubt that it's
original with him.  P(37) is used in the definition of the current Java
String hash function for short (< 16 characters) strings.  I'm afraid that
the origins of this little function are lost to history.

     So why do I think we should use this function?  Simply put, it's the
best general purpose string hash function that I was able to find, and
it's cheap to calculate.  By 'general purpose', I mean that it's not
optimized for any specific type of strings (e.g., compiler symbols), but
seems to resist collisions across every collection of strings that I was
able to throw at it.  This is critical given that we have no idea what
sort of strings people will store in Java hash tables.  Also, the
performance of this class of hash functions seems largely unaffected by
whether the size of the hash table is prime or not.  This is important
because Java's auto-growing hash table is typically does not contain a
prime number of buckets.

    In addition to the class of hash functions described above, The Dragon
Book highly recommends a slightly more complex hash function due to P. J.
Weinberger, originally used internally in his C compiler.  Matpak's author,
Dr. Berndt M. Gammel (of Technische Universit??t M??nchen), however, claims that
Weinberger's function does not work well for large collections of arbitrary
strings.  It turns out that Matpak's implementation of Weinberger's function
contains a typo, and the resulting function is far, far worse even than the
currently implemented Java string hash function (see table below).  I notified
Dr. Gammel, and he promised to fix the typo.  The published version of
Weinberger's function (in the Dragon book) differs slightly from the version
that he (Peter Weinberger) personally recommends.  (They differ in the number
of bits to which the hash value is shifted: the published value is 24, the
recommended value is 28.)  The recommended version is significantly better
(see Table).

    Gammel recommends only three of the ten hash functions in Matpak for
"general use."  One is P(33), one is a minor variant on P(129), ascribed
to Phong Vo, and the third is a table lookup based, S-box like algorithm,
relying on a table of 256 "random" 32-bit numbers, from the WAIS project.
(The Vo variant merely adds the constant 987654321 to each character to
form the polynomial coefficients.)

    The table below summarizes the performance of the various hash functions
described above, for three data sets:

    1) All of the words and phrases with entries in Merriam-Webster's
       2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

    2) All of the strings in /bin/*, /usr/bin/*, /usr/lib/*, /usr/ucb/*
       and /usr/openwin/bin/*  (66,304 strings, avg length 21 characters).

    3) A list of URLs gathered by a web-crawler that ran for several
       hours last night (28,372 strings, avg length 49 characters).

    The performance metric shown in the table is the "average chain size"
over all elements in the hash table (i.e., the expected value of the
number of key compares to look up an element).

			Webster's	Code Strings	URLs
			---------       ------------	----
Current Java Fn.	1.2509		1.2738	       13.2560
P(37)    [Java]		1.2508		1.2481		1.2454
P(65599) [Aho et al]	1.2490		1.2510		1.2450
P(31)	 [K+R]		1.2500		1.2488		1.2425
P(33)    [Torek]	1.2500		1.2500		1.2453
Vo's Fn			1.2487		1.2471		1.2462
WAIS Fn			1.2497		1.2519		1.2452
Weinberger's Fn(MatPak) 6.5169		7.2142	       30.6864
Weinberger's Fn(24)	1.3222		1.2791		1.9732
Weinberger's Fn(28)	1.2530		1.2506		1.2439

    Looking at this table, it's clear that all of the functions except for the
current Java function and the two broken versions of Weinberger's function
offer excellent, nearly indistinguishable performance.  I strongly conjecture
that this performance is essentially the "theoretical ideal", which is what
you'd get if you used a true random number generator in place of a hash
function.

    I'd rule out the WAIS function as its specification contains pages of
random numbers, and its performance is no better than any of the far simpler
functions.  Any of the remaining six functions seem like excellent choices,
but we have to pick one.  I suppose I'd rule out Vo's variant and Weinberger's
function because of their added complexity, albeit minor.  Of the remaining
four, I'd probably select P(31), as it's the cheapest to calculate on a RISC
machine (because 31 is the difference of two powers of two).  P(33) is
similarly cheap to calculate, but it's performance is marginally worse, and
33 is composite, which makes me a bit nervous.

					Josh

******************************************************************************

From: Guy Steele - Sun Microsystems Labs <###@###.###>
To: jbloch@Eng
Cc: gls@East, James.Gosling@Eng, wnj@central, ###@###.###,
        riggs@East, ###@###.###
Subject: Re: String Hash Function Redux (long)
Date: Mon, 28 Apr 1997 09:47:22 -0400 (EDT)

Good work!  Besides the fact that you have clearly established that
we ought to revamp Java's hash function and have found a good solution,
I think you have an MPU here (Minimum Publishable Unit)---that is, I would
encourage you to write this up even more carefully and send it to an
ACM conference (PLDI?) or journal.

--Guy
                                     
2004-06-11
PUBLIC COMMENTS

    The String hash function implemented in JDK releases through 1.1.3
did not match the function specified in the first edition of the Java
Language Specification.  In fact, the specified function is not
implementable, in that it addresses characters outside of the input
string.  Additionally, the implemented function performed very poorly on
certain classes of strings, including URLs.  (The poor performance is
due to the "sampling" nature of the function for strings over 15
characters.)

    To bring the implementation into accord with the specification, and
to fix the performance problems, we are modifying the specification and
implementation in tandem.  The new String hash function as of JDK1.2 is
specified to be:

	 s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1]

where s[i] is the ith character of string s.

    This change should have no effect on the overwhelming majority of
Java applications.  If an application has persistent data that depends
on actual String hash values, it could theoretically be affected, but
the serialized representation of a Hashtable does NOT depend on the
actual hash values of the keys stored in the Hashtable. Thus,
applications relying on serialization for storage of persistent data
will be unaffected.

                                     
2004-06-10
EVALUATION

    The spec and implementation do not match.  The specified function CANNNOT
be implemented, as it would throw StringIndexOutOfBoundsExceptions.
Thus, we must change the spec.  I feel strongly that it would be in our
best interests to replace the hash function with one that does NOT "sample"
strings, but includes every character in the hash value computation

joshua.bloch@Eng 1997-04-24
                                     
1997-04-24



Hardware and Software, Engineered to Work Together