United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: JDK-6823354 Add intrinsics for {Integer,Long}.{numberOfLeadingZeros,numberOfTrailingZeros}()
JDK-6823354 : Add intrinsics for {Integer,Long}.{numberOfLeadingZeros,numberOfTrailingZeros}()

Details
Type:
Enhancement
Submit Date:
2009-03-27
Status:
Closed
Updated Date:
2011-03-08
Project Name:
JDK
Resolved Date:
2011-03-08
Component:
hotspot
OS:
generic
Sub-Component:
compiler
CPU:
generic
Priority:
P5
Resolution:
Fixed
Affected Versions:
hs15
Fixed Versions:
hs16 (b04)

Related Reports
Backport:
Backport:
Relates:
Relates:
Relates:

Sub Tasks

Description
These methods can be instrinsified by using bit scan, bit test, and population count instructions.  It seems it's not worth to instrinsify lowestOneBit() too, as it already boilds down to about 2 or 3 machine instructions.

Some of these methods can be used in BigInteger and maybe other classes in the JDK.
I removed highestOneBit from that CR as it's not clear the benefit is worth the changes.  I might open another one in the future.

                                    

Comments
EVALUATION

These are the numbers with Vladimir's suggested changes on a Nehalem box:

32-bit (instrinsic):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2317
sum: -2, time: 2190
sum: -2, time: 2560
sum: -2, time: 2560
sum: -2, time: 2561
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 1836
sum: -2147483648, time: 1832
sum: -2147483648, time: 1921
sum: -2147483648, time: 1922
sum: -2147483648, time: 1922
Long.numberOfLeadingZeros:
sum: -34, time: 6986
sum: -34, time: 6981
sum: -34, time: 7082
sum: -34, time: 7071
sum: -34, time: 7074
Long.numberOfTrailingZeros:
sum: -2147483616, time: 5162
sum: -2147483616, time: 5164
sum: -2147483616, time: 5448
sum: -2147483616, time: 5488
sum: -2147483616, time: 5452

64-bit (intrinsic):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2113
sum: -2, time: 2249
sum: -2, time: 2169
sum: -2, time: 2169
sum: -2, time: 2168
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 1882
sum: -2147483648, time: 1876
sum: -2147483648, time: 1830
sum: -2147483648, time: 1831
sum: -2147483648, time: 1830
Long.numberOfLeadingZeros:
sum: -34, time: 3952
sum: -34, time: 3952
sum: -34, time: 4065
sum: -34, time: 4122
sum: -34, time: 4085
Long.numberOfTrailingZeros:
sum: -2147483616, time: 2419
sum: -2147483616, time: 2407
sum: -2147483616, time: 3705
sum: -2147483616, time: 3788
sum: -2147483616, time: 3717

Numbers are sometimes slightly better, but not significantly.
                                     
2009-05-04
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/93c14e5562c4
                                     
2009-05-07
EVALUATION

I used this small micro-benchmark for testing:

public class bits {
    public static void main(String[] args) {
        Integer.highestOneBit(0);
        Integer.lowestOneBit(0);

        Integer.numberOfLeadingZeros(0);
        Integer.numberOfTrailingZeros(0);

        Long.numberOfLeadingZeros(0);
        Long.numberOfTrailingZeros(0);

        System.out.println("Integer.numberOfLeadingZeros:");
        for (int i = 0; i < 5; i++)
            doleadi();

        System.out.println("Integer.numberOfTrailingZeros:");
        for (int i = 0; i < 5; i++)
            dotraili();

        System.out.println("Long.numberOfLeadingZeros:");
        for (int i = 0; i < 5; i++)
            doleadl();

        System.out.println("Long.numberOfTrailingZeros:");
        for (int i = 0; i < 5; i++)
            dotraill();
    }

    static void doleadi() {
        long start = System.currentTimeMillis();
        int sum = 0;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            sum += lead(i);
        }
        long end = System.currentTimeMillis();
        System.out.println("sum: " + sum + ", time: " + (end - start));
    }

    static void dotraili() {
        long start = System.currentTimeMillis();
        int sum = 0;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            sum += trail(i);
        }
        long end = System.currentTimeMillis();
        System.out.println("sum: " + sum + ", time: " + (end - start));
    }

    static void doleadl() {
        long start = System.currentTimeMillis();
        int sum = 0;
        for (long l = 0; l < Integer.MAX_VALUE; l++) {
            sum += lead(l);
        }
        long end = System.currentTimeMillis();
        System.out.println("sum: " + sum + ", time: " + (end - start));
    }

    static void dotraill() {
        long start = System.currentTimeMillis();
        int sum = 0;
        for (long l = 0; l < Integer.MAX_VALUE; l++) {
            sum += trail(l);
        }
        long end = System.currentTimeMillis();
        System.out.println("sum: " + sum + ", time: " + (end - start));
    }

    static int high(int i)   { return Integer.highestOneBit(i); }
    static int low(int i)    { return Integer.lowestOneBit(i);  }

    static int lead(int i)   { return Integer.numberOfLeadingZeros(i); }
    static int trail(int i)  { return Integer.numberOfTrailingZeros(i); }

    static int lead(long l)  { return Long.numberOfLeadingZeros(l); }
    static int trail(long l) { return Long.numberOfTrailingZeros(l); }
}
                                     
2009-04-20
EVALUATION

On an Intel Nehalem w/ 2.93GHz I get these numbers:

32-bit (Java code):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 6347
sum: -2, time: 6674
sum: -2, time: 7442
sum: -2, time: 7429
sum: -2, time: 7471
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 8767
sum: -2147483648, time: 8881
sum: -2147483648, time: 8155
sum: -2147483648, time: 8159
sum: -2147483648, time: 8144
Long.numberOfLeadingZeros:
sum: -34, time: 10080
sum: -34, time: 10075
sum: -34, time: 9703
sum: -34, time: 9702
sum: -34, time: 9702
Long.numberOfTrailingZeros:
sum: -2147483616, time: 10944
sum: -2147483616, time: 10932
sum: -2147483616, time: 10870
sum: -2147483616, time: 10859
sum: -2147483616, time: 10868

32-bit (intrinsic):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2759
sum: -2, time: 2367
sum: -2, time: 2334
sum: -2, time: 2334
sum: -2, time: 2334
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 2066
sum: -2147483648, time: 2059
sum: -2147483648, time: 2059
sum: -2147483648, time: 2059
sum: -2147483648, time: 2060
Long.numberOfLeadingZeros:
sum: -34, time: 6623
sum: -34, time: 6610
sum: -34, time: 7170
sum: -34, time: 7172
sum: -34, time: 7162
Long.numberOfTrailingZeros:
sum: -2147483616, time: 5180
sum: -2147483616, time: 5159
sum: -2147483616, time: 5714
sum: -2147483616, time: 5731
sum: -2147483616, time: 5724

That's a speedup of 3.18 (Integer.numberOfLeadingZeros), 3.96 (Integer.numberOfTrailingZeros), 1.36 (Long.numberOfLeadingZeros), and 1.90 (Long.numberOfTrailingZeros) respectively.

64-bit (Java code):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 6247
sum: -2, time: 7895
sum: -2, time: 7884
sum: -2, time: 7884
sum: -2, time: 7884
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 6642
sum: -2147483648, time: 6720
sum: -2147483648, time: 7033
sum: -2147483648, time: 7022
sum: -2147483648, time: 7022
Long.numberOfLeadingZeros:
sum: -34, time: 8518
sum: -34, time: 8596
sum: -34, time: 8677
sum: -34, time: 8677
sum: -34, time: 8677
Long.numberOfTrailingZeros:
sum: -2147483616, time: 8485
sum: -2147483616, time: 8478
sum: -2147483616, time: 7941
sum: -2147483616, time: 7944
sum: -2147483616, time: 7945

64-bit (intrinsic):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2306
sum: -2, time: 2250
sum: -2, time: 2059
sum: -2, time: 2059
sum: -2, time: 2059
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 1883
sum: -2147483648, time: 1876
sum: -2147483648, time: 1876
sum: -2147483648, time: 1876
sum: -2147483648, time: 1876
Long.numberOfLeadingZeros:
sum: -34, time: 3569
sum: -34, time: 3697
sum: -34, time: 4375
sum: -34, time: 4343
sum: -34, time: 4292
Long.numberOfTrailingZeros:
sum: -2147483616, time: 2405
sum: -2147483616, time: 2400
sum: -2147483616, time: 3711
sum: -2147483616, time: 3667
sum: -2147483616, time: 3797

Speedup: 3.83, 3.74, 2.02, 2.17
                                     
2009-04-20
EVALUATION

These numbers are from an AMD Shanghai w/ 2.6Ghz:

32-bit (Java code):
 
$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 9649
sum: -2, time: 10700
sum: -2, time: 10140
sum: -2, time: 11156
sum: -2, time: 11156
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 10563
sum: -2147483648, time: 10555
sum: -2147483648, time: 13401
sum: -2147483648, time: 13616
sum: -2147483648, time: 13617
Long.numberOfLeadingZeros:
sum: -34, time: 12758
sum: -34, time: 12750
sum: -34, time: 12128
sum: -34, time: 12123
sum: -34, time: 12402
Long.numberOfTrailingZeros:
sum: -2147483616, time: 15583
sum: -2147483616, time: 16279
sum: -2147483616, time: 16420
sum: -2147483616, time: 18200
sum: -2147483616, time: 18201

32-bit (instrinsic w/o lzcnt instruction):

$ gamma -XX:-UseCountLeadingZerosInstruction bits
Integer.numberOfLeadingZeros:
sum: -2, time: 5749
sum: -2, time: 5292
sum: -2, time: 5229
sum: -2, time: 5230
sum: -2, time: 5236
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 3944
sum: -2147483648, time: 3935
sum: -2147483648, time: 3780
sum: -2147483648, time: 3831
sum: -2147483648, time: 3832
Long.numberOfLeadingZeros:
sum: -34, time: 11610
sum: -34, time: 11598
sum: -34, time: 12427
sum: -34, time: 12426
sum: -34, time: 12426
Long.numberOfTrailingZeros:
sum: -2147483616, time: 6726
sum: -2147483616, time: 6760
sum: -2147483616, time: 7455
sum: -2147483616, time: 7559
sum: -2147483616, time: 7456

Speedup: 1.94, 3.55, 0.98, 2.44

32-bit (intrinsic w/ lzcnt):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2325
sum: -2, time: 2028
sum: -2, time: 2071
sum: -2, time: 2071
sum: -2, time: 2071
...
Long.numberOfLeadingZeros:
sum: -34, time: 6653
sum: -34, time: 6641
sum: -34, time: 8301
sum: -34, time: 8671
sum: -34, time: 8286
...

Speedup: 4.90, 1.46

Overall speedup: 4.90, 3.55, 1.46, 2.44

64-bit (Java code):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 9226
sum: -2, time: 9284
sum: -2, time: 10742
sum: -2, time: 12271
sum: -2, time: 12273
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 11102
sum: -2147483648, time: 11707
sum: -2147483648, time: 10873
sum: -2147483648, time: 10880
sum: -2147483648, time: 10873
Long.numberOfLeadingZeros:
sum: -34, time: 10186
sum: -34, time: 10306
sum: -34, time: 11391
sum: -34, time: 12298
sum: -34, time: 12299
Long.numberOfTrailingZeros:
sum: -2147483616, time: 14097
sum: -2147483616, time: 14780
sum: -2147483616, time: 15025
sum: -2147483616, time: 17561
sum: -2147483616, time: 17553

64-bit (instrinsic w/o lzcnt):

$ gamma -XX:-UseCountLeadingZerosInstruction bits
Integer.numberOfLeadingZeros:
sum: -2, time: 5772
sum: -2, time: 5083
sum: -2, time: 4867
sum: -2, time: 4866
sum: -2, time: 4867
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 3591
sum: -2147483648, time: 3572
sum: -2147483648, time: 3521
sum: -2147483648, time: 3623
sum: -2147483648, time: 3739
Long.numberOfLeadingZeros:
sum: -34, time: 5808
sum: -34, time: 5799
sum: -34, time: 6627
sum: -34, time: 6627
sum: -34, time: 6628
Long.numberOfTrailingZeros:
sum: -2147483616, time: 4150
sum: -2147483616, time: 4142
sum: -2147483616, time: 5382
sum: -2147483616, time: 5383
sum: -2147483616, time: 6063

Speedup: 2.52, 3.09, 1.86, 3.26

64-bit (intrinsic w/ lzcnt):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 2320
sum: -2, time: 1942
sum: -2, time: 1812
sum: -2, time: 1812
sum: -2, time: 1812
...
Long.numberOfLeadingZeros:
sum: -34, time: 2499
sum: -34, time: 2487
sum: -34, time: 3313
sum: -34, time: 3313
sum: -34, time: 3468
...

Speedup: 6.77, 3.71

Overall speedup: 6.77, 3.09, 3.71, 3.26
                                     
2009-04-22
EVALUATION

These numbers are from a T2 w/ 1.165 GHz.  The SPARC implementations of the intrinsics need a hardware supported POPC instruction.

32-bit (Java code):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 86712
sum: -2, time: 86898
sum: -2, time: 87017
sum: -2, time: 87455
sum: -2, time: 86938
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 61786
sum: -2147483648, time: 61736
sum: -2147483648, time: 58972
sum: -2147483648, time: 58971
sum: -2147483648, time: 58971
Long.numberOfLeadingZeros:
sum: -34, time: 145686
sum: -34, time: 145644
sum: -34, time: 99939
sum: -34, time: 99939
sum: -34, time: 99939
Long.numberOfTrailingZeros:
sum: -2147483616, time: 121674
sum: -2147483616, time: 121630
sum: -2147483616, time: 81086
sum: -2147483616, time: 81087
sum: -2147483616, time: 81086

32-bit (intrinsic):

$ gamma bits
Integer.numberOfLeadingZeros:
sum: -2, time: 43354
sum: -2, time: 43353
sum: -2, time: 43307
sum: -2, time: 43307
sum: -2, time: 43307
Integer.numberOfTrailingZeros:
sum: -2147483648, time: 26652
sum: -2147483648, time: 26606
sum: -2147483648, time: 26606
sum: -2147483648, time: 26606
sum: -2147483648, time: 26606
Long.numberOfLeadingZeros:
sum: -34, time: 90336
sum: -34, time: 90301
sum: -34, time: 64500
sum: -34, time: 64500
sum: -34, time: 64500
Long.numberOfTrailingZeros:
sum: -2147483616, time: 79282
sum: -2147483616, time: 79244
sum: -2147483616, time: 42386
sum: -2147483616, time: 42386
sum: -2147483616, time: 42386

Speedup: 2.01, 2.22, 1.55, 1.91

64-bit uses the same instructions as 32-bit.
                                     
2009-04-22



Hardware and Software, Engineered to Work Together