Bug ID: JDK-6378821 bitCount() should use POPC on SPARC processors and AMD+10h

Type: Enhancement
Component: hotspot
Sub-Component: compiler
Affected Version: 6,7

Priority: P4
Status: Closed
Resolution: Fixed
OS: solaris,windows_xp
CPU: x86,sparc

Submitted: 2006-01-30
Updated: 2013-11-01
Resolved: 2011-03-08

JDK 6	JDK 7	Other
6u18 Fixed	7 Fixed	hs15 Fixed

bitCount() should use POPC on SPARC processors where POPC is implemented directly in hardware.  (The existing bitCount() implementation comes from "Hacker's Delight" and is fairly fast.)  Beware, however, that POPC is implemented by kernel-level trap-based emulation on some processors.  In those environments we want to keep using the existing bitCount() implementation.  isainfo (try "isainfo -x") should allow the JVM to identify those processors that support POPC in hardware.
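
For reference, the software fallback mentioned above can be sketched as follows. This is a minimal illustration of the "Hacker's Delight" divide-and-conquer bit count, written to match the well-known shape of Integer.bitCount(); the class name HackersDelightBitCount is hypothetical and is not part of the JDK or of the fix for this bug.

```java
final class HackersDelightBitCount {

    /** Counts the set bits of a 32-bit value without a hardware popcount instruction. */
    static int bitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);                // per-2-bit counts
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333); // per-4-bit counts
        i = (i + (i >>> 4)) & 0x0f0f0f0f;                // per-byte counts
        i = i + (i >>> 8);                               // fold bytes together
        i = i + (i >>> 16);
        return i & 0x3f;                                 // result is at most 32
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0xFF, -1, Integer.MIN_VALUE, 0x12345678};
        for (int v : samples) {
            // Print the sketch's result next to the library result for comparison.
            System.out.println("0x" + Integer.toHexString(v) + " -> " + bitCount(v)
                    + " (Integer.bitCount: " + Integer.bitCount(v) + ")");
        }
    }
}
```

The sequence is a dozen or so simple ALU operations with no branches or memory accesses, which is why it stays competitive until a single-instruction hardware POPC/popcnt is available.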

x86 processors include a popcnt instruction in SSE4a for AMD and SSE4.2 for Intel.

EVALUATION http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/c771b7f43bbf
13-03-2009
PUBLIC COMMENTS And the numbers for AMD Shanghai:
    $ gamma -XX:-UsePopCountInstruction test
    sum: 629085184
    time: 8504
    $ gamma -XX:+UsePopCountInstruction test
    sum: 629085184
    time: 1807
4.7x speedup.
    $ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    sum: 629085184
    time: 9622
    $ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    sum: 629085184
    time: 2577
3.73x speedup.
12-03-2009
EVALUATION The same numbers on a T2:
    $ java -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 629085184
    time: 35676
    $ java -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 629085184
    time: 20007
And without loop unrolling:
    $ java -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '-UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 41509
    $ java -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '+UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 29470
The speedup is 1.78 and 1.41 respectively.
03-03-2009
PUBLIC COMMENTS Just for the record, to see how much slower the kernel-level trap-based emulation on SPARC is (with 20 * 1000000 loops):
    $ gamma -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 238869248
    time: 1011
    $ gamma -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 238869248
    time: 76985
03-03-2009
EVALUATION A very simple micro-benchmark like this:
    public class test {
        public static void main(String[] args) {
            int sum = 0;
            long start = System.currentTimeMillis();
            for (int i = 0; i < 2000 * 1000000; i++) {
                sum += Integer.bitCount(i);
            }
            long end = System.currentTimeMillis();
            System.out.println("sum: " + sum);
            System.out.println("time: " + (end - start));
        }
    }
shows a 5x speedup on a Nehalem processor:
    $ gamma -XX:-UsePopCountInstruction test
    VM option '-UsePopCountInstruction'
    sum: 629085184
    time: 8132
    $ gamma -XX:+UsePopCountInstruction test
    VM option '+UsePopCountInstruction'
    sum: 629085184
    time: 1604
And with disabled loop unrolling to get more accurate numbers:
    $ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '-UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 8657
    $ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
    VM option '+UsePopCountInstruction'
    VM option 'LoopUnrollLimit=1'
    sum: 629085184
    time: 1458
It's interesting to see that a tighter loop with popcnt is faster.
19-02-2009
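
The benchmark runs above toggle -XX:+/-UsePopCountInstruction by hand. As a hedged sketch, a running HotSpot JVM can also be asked what value the flag ended up with, using the com.sun.management.HotSpotDiagnosticMXBean API. The class name PopCountFlagCheck and the helper vmOption are hypothetical, and the code assumes a HotSpot-based JVM with the jdk.management classes available; on other VMs, or if the flag is unknown, it simply reports no value.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

final class PopCountFlagCheck {

    /** Returns the current value of a HotSpot VM option, or empty if it cannot be read. */
    static Optional<String> vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty();
            }
            // getVMOption throws IllegalArgumentException for an unknown option name.
            return Optional.ofNullable(bean.getVMOption(name).getValue());
        } catch (RuntimeException | LinkageError e) {
            // Non-HotSpot VM, missing module, or unknown flag: report "not available".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("UsePopCountInstruction = "
                + vmOption("UsePopCountInstruction").orElse("(not available)"));
    }
}
```

Printing the flag before a measurement makes it easier to confirm which code path a given timing actually exercised.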

Relates :	JDK-6832016 - {DigestMD5Base,Des3DkCrypto}.setParityBit should use Integer.bitCount
Relates :	JDK-7063674 - Wrong results from basic comparisons after calls to Long.bitCount(long)
Relates :	JDK-6832045 - DefaultSynthStyle.{getStateInfo,getMatchCount) should use Integer.bitCount