JDK-6378821 : bitCount() should use POPC on SPARC processors and AMD+10h
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 6,7
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: solaris,windows_xp
  • CPU: x86,sparc
  • Submitted: 2006-01-30
  • Updated: 2013-11-01
  • Resolved: 2011-03-08
The Version table lists the releases in which this issue/RFE is addressed.

  • JDK 6: 6u18 (Fixed)
  • JDK 7: 7 (Fixed)
  • Other: hs15 (Fixed)
Description
bitCount() should use POPC on SPARC processors where POPC is implemented directly in hardware. (The existing bitCount() implementation comes from "Hacker's Delight" and is fairly fast.) Beware, however, that on some processors POPC is implemented by kernel-level trap-based emulation. In those environments we want to keep using the existing bitCount() implementation. isainfo (try "isainfo -x") should allow the JVM to identify the processors that support POPC in hardware.
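
For reference, the software fallback mentioned above can be sketched as follows. This is a minimal illustration of the branch-free, divide-and-conquer bit-summing technique described in "Hacker's Delight"; it is not quoted from this report, and the class name PopCountSketch is hypothetical. The sketch checks itself against the library method Integer.bitCount().

```java
// Illustrative sketch only; PopCountSketch is a hypothetical name.
final class PopCountSketch {

    /** Counts the one-bits in i without branches or loops. */
    static int bitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);                 // each 2-bit field holds its own bit count
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333);  // sum adjacent pairs into 4-bit fields
        i = (i + (i >>> 4)) & 0x0f0f0f0f;                 // sum adjacent nibbles into 8-bit fields
        i = i + (i >>> 8);                                // fold the four byte counts together
        i = i + (i >>> 16);
        return i & 0x3f;                                  // the result is at most 32, so 6 bits suffice
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0xFF, 0x80000000, -1, 0x12345678};
        for (int v : samples) {
            if (bitCount(v) != Integer.bitCount(v)) {
                throw new AssertionError("mismatch for " + Integer.toHexString(v));
            }
        }
        System.out.println("all samples agree with Integer.bitCount");
    }
}
```

A hardware POPC/popcnt instruction replaces this whole sequence of shifts, masks and adds with a single instruction, which is where the enhancement is expected to gain its speed.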

x86 processors include a popcnt instruction in SSE4a for AMD and SSE4.2 for Intel.

Comments
EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/c771b7f43bbf
13-03-2009

PUBLIC COMMENTS And the numbers for AMD Shanghai:

    $ gamma -XX:-UsePopCountInstruction test
    sum: 629085184
    time: 8504
    $ gamma -XX:+UsePopCountInstruction test
    sum: 629085184
    time: 1807

4.7x speedup.

    $ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    sum: 629085184
    time: 9622
    $ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    sum: 629085184
    time: 2577

3.73x speedup.
12-03-2009

EVALUATION The same numbers on a T2:

    $ java -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 629085184
    time: 35676
    $ java -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 629085184
    time: 20007

And without loop unrolling:

    $ java -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '-UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 41509
    $ java -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '+UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 29470

The speedup is 1.78 and 1.41 respectively.
03-03-2009
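
The speedup figures quoted for the T2 are simply the baseline time divided by the time with the POPC instruction enabled. A minimal sketch of that arithmetic, using the T2 timings reported above (the class and method names are hypothetical, chosen only for illustration):

```java
import java.util.Locale;

// Illustrative sketch only; SpeedupSketch and speedup() are hypothetical names.
final class SpeedupSketch {

    /** Ratio of the run time without the instruction to the run time with it. */
    static double speedup(long baselineMillis, long popcMillis) {
        return (double) baselineMillis / popcMillis;
    }

    public static void main(String[] args) {
        // Timings in milliseconds, taken from the T2 comment above.
        System.out.println(String.format(Locale.ROOT,
                "default unrolling: %.2f", speedup(35676, 20007)));
        System.out.println(String.format(Locale.ROOT,
                "LoopUnrollLimit=1: %.2f", speedup(41509, 29470)));
    }
}
```

The same division applies to the other timing pairs in these comments.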

PUBLIC COMMENTS Just for the record, here is how much slower the kernel-level trap-based emulation on SPARC is (with 20 * 1000000 loop iterations):

    $ gamma -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 238869248
    time: 1011
    $ gamma -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 238869248
    time: 76985
03-03-2009

EVALUATION A very simple micro-benchmark like this:

    public class test {
        public static void main(String[] args) {
            int sum = 0;
            long start = System.currentTimeMillis();
            for (int i = 0; i < 2000 * 1000000; i++) {
                sum += Integer.bitCount(i);
            }
            long end = System.currentTimeMillis();
            System.out.println("sum: " + sum);
            System.out.println("time: " + (end - start));
        }
    }

shows a 5x speedup on a Nehalem processor:

    $ gamma -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 629085184
    time: 8132
    $ gamma -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 629085184
    time: 1604

And with loop unrolling disabled to get more accurate numbers:

    $ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '-UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 8657
    $ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '+UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 1458

It's interesting to see that a tighter loop with popcnt is faster.
19-02-2009