JDK-6393408 : javac uses a poor hashing algorithm for names
  • Type: Bug
  • Component: tools
  • Sub-Component: javac
  • Affected Version: 6
  • Priority: P2
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2006-03-03
  • Updated: 2010-04-03
  • Resolved: 2006-06-21
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6
6 b89Fixed
Description
A DESCRIPTION OF THE FIX :
javac uses a poor hashing algorithm for names.

The current hash algorithm is based on first, middle and last character, length and three constants. In some cases, notably synthetic names, the code behaves like a linear search, giving O(n^2) performance.

The patch replaces the sampling algorithm with one based on that of String (since 1.1). Diff is against mustang b73.

Comments
EVALUATION I have a performance test, and the results show the the proposed change is consistently faster when reading identifiers in src.zip than before. The test reads src.zip, processes it with a Scanner to get a list of 1,806,693 Java identifiers in a List<String> and then it times how long it takes to convert all those strings to names. The new hash algorithm is consistently faster, but there is not much difference to get excited about in the normal case. Averaged over 10 runs for each algorithm, the new hash algorithm took 1.25 seconds to process the identifiers in src.zip, as compared to 1.36 seconds for the old hash algorithm. In addition, the slowest new time was faster than the fastest old time. So, on a "typical" distribution of names, the new algorithm is better, both in overall performance, and by analyzing the hash chain distribution that I did before, and we know the new algorithm will do much better on synthetically generated names.
13-06-2006

EVALUATION noreg-perf/noreg-trivial Simple fix well tested with performance analysis.
13-06-2006

EVALUATION I have additional information regarding performance and quality of hash for a number of samples. I ran timing tests to parse a single file (HelloWorld), a medium number of files (431 javac source files) and a large number of files (12659 J2SE source files.) I ran each test 3 times, and in all cases the best time for the new hash algorithm did marginally better than the best time for the older algorithm, although the difference may not be statistically significant. (all times in ms) -Hello-World- ---javac-------- --------j2se------- old 388 388 380 2277 2271 2272 41914 40168 41343 new 377 378 378 2261 2258 2254 40012 40112 39859 In addition, I checked the distribution of the hash table under the old and new functions; as expected, the distribution is more even under the new algorithm: In the following table, the number at the beginning of each line is the weighted average length of hash chain; then in the list that follows, x=y gives the count (y) for a hash chain of length (x). Hello World 1.07 {0=31897, 1=820, 2=47, 3=2, 4=2} 1.01 {0=31850, 1=908, 2=10} javac/tools 1.62 {0=19035, 1=8767, 2=3076, 3=1112, 4=417, 5=162, 6=84, 7=32, 8=29, 9=19, 10=12, 11=12, 12=5, 13=4, 14=1, 24=1} 1.38 {0=16585, 1=11344, 2=3794, 3=873, 4=147, 5=22, 6=3} J2SE 3.94 {0=2549, 1=5168, 2=5929, 3=5403, 4=4176, 5=3007, 6=2111, 7=1424, 8=870, 9=634, 10=417, 11=315, 12=205, 13=175, 14=99, 15=85, 17=45, 16=45, 19=26, 18=24, 21=7, 20=9, 23=5, 22=6, 25=8, 24=6, 27=3, 26=6, 28=2, 30=4, 32=1, 33=1, 38=1, 37=1, 40=1} 3.73 {0=840, 1=3162, 2=5715, 3=6901, 4=6233, 5=4610, 6=2833, 7=1449, 8=617, 9=258, 10=103, 11=43, 12=4} In all these cases, the new algorithm is slightly better than the old one. Thus, it would appear that for a variety of sizes of typical hand-written code, the new hash function is marginally better, both in performance and weighted average of hash chain length. Thus it seems worth adopting this new algorithm, given it will also probably be better for synthesized code as well.
11-05-2006

SUGGESTED FIX Moved from description#1: --- /home/tackline/mustang/current/j2se/src/share/classes/com/sun/tools/javac/util/Name.java 2006-02-26 18:08:46.000000000 +0000 +++ com/sun/tools/javac/util/Name.java 2006-02-13 21:41:57.000000000 +0000 @@ -48,14 +48,13 @@ /** The hashcode of a name. */ private static int hashValue(byte cs[], int start, int len) { - if (len > 0) - return - len * (41 * 41 * 41) + - cs[start] * (41 * 41) + - cs[start + len - 1] * 41 + - cs[start + (len >> 1)]; - else - return 0; + int h = 0; + int off = start; + + for (int i = 0; i < len; i++) { + h = 31*h + cs[off++]; + } + return h; } /** Is (the utf8 representation of) name equal to [ JUnit TESTCASE : /* This program generates code to illustrate the dangers of the hash algorithm. Run as: javac -Dinners=2000 Gen2 javac -Dinners=4000 Gen2 time javac test2000/Test2000.java time javac test4000/Test4000.java */ import java.io.*; class Gen2 { public static void main(String[] args) throws IOException { final int inners = Integer.getInteger("inners"); File dir = new File("test"+inners); if (!dir.mkdir()) { throw new Error("mkdir: "+dir); } File pkgDir = dir; String cls = "Test"+inners; Writer out = new BufferedWriter(new FileWriter(new File( pkgDir, cls+".java" ))); out.write( "class "+cls+" {\n" +" {\n" +"\n" ); for (int ct=0; ct<inners; ++ct) { out.write( " new Runnable() {\n" +" public void run() {\n" +" }\n" +" };\n" ); } out.write( "\n" +" }\n" +"}\n" ); out.close(); } }
05-03-2006

EVALUATION We are currently waiting for additional information from the submitter regarding how the change in hashing affects compilation of regular Java programs.
05-03-2006

EVALUATION Contribution-Forum:https://jdk-collaboration.dev.java.net/servlets/ProjectForumMessageView?messageID=11758&forumID=1463
03-03-2006