Bug ID: JDK-8230525 Missing intrinsics for Integer and Long operations

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

Other
tbdUnresolved

In Long and Integer, the implementations of the following methods are generic, but could be improved for scalar and/or vector code, in a platform-dependent manner:

reverse  (bitwise reversal - RBIT on ARM64, table lookup on x86)
rotateLeft, rotateRight  (instruction if available)

JDK-8283894 covers the following, which as of 2022 have been added and which should also be made intrinsics:

compress / expand  (x86 PEXT/PDEP)

Regarding table lookup to handle each byte separately: It seems unlikely to me that four or eight distinct table lookups will be very fast compared to the log-N (5 or 6 ALU steps) Hacker's Delight algorithm in the JDK. If you can get all the "lanes" to perform a table lookup in parallel, then you can get something more like constant-time (not log-N steps). So on modern x86 I would try using PSHUFB to bit-reverse 4-bit slices. Sketch of algorithm: 1. split scalar x into x1 = x & M, x2 = (x >> 4) & M where M = 0x0F…0F 2. move x1, x2 to vector registers v1, v2 3. load 16-byte reverse-nybble vector: vNR = 0x_F7B3_D591_1E6A_2C480 4. use PSHUFB with each of v1, v2, to steer nybbles from vNR into v1NR, v2NR 5. reverse step 2 from v1NR, v2NR back to x1NR, x2NR 6. swapping nybbles, recombine xBR = x2NR | (x1NR << 4) 7. reverse bytes of xBR The nice thing about this algorithm is that it works on vectors as well as scalars. That is, all of the scalar operations can be performed in the vector unit instead. Step 7 (byte reversal) would be another PSHUFB, of course. As a fallback I'd try a 256-way table indexed by bytes, but even for that I'd consider using a gather instruction to perform the table lookups in parallel. That means moving the operand into a vector register and then using PSHUFB to expand bytes to (64-bit) words. I think the previous algorithm will be faster. It seems people used a similar constant-time trick for pop-count with good results. References: https://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious https://web.archive.org/web/20180215152955/http://wm.ite.pl/articles/sse-popcount.html https://web.archive.org/web/20180713151429/https://arxiv.org/abs/1611.07612 http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html

13-08-2020

I'm sponsoring this issue for Jason Tatton, jptatton@amazon.com.

06-08-2020

Relates :	JDK-8283894 - Intrinsify compress and expand bits on x86
Relates :	JDK-8226721 - Missing intrinsics for Math.ceil, floor, rint
Relates :	JDK-8248830 - C2: Optimize Rotate API on x86