JDK-8230525 : Missing intrinsics for Integer and Long operations
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 14
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2019-09-04
  • Updated: 2022-04-28
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other
tbdUnresolved
Related Reports
Relates :  
Relates :  
Relates :  
Description
In Long and Integer, the implementations of the following methods are generic, but could be improved for scalar and/or vector code, in a platform-dependent manner:

reverse  (bitwise reversal - RBIT on ARM64, table lookup on x86)
rotateLeft, rotateRight  (instruction if available)

JDK-8283894 covers the following, which as of 2022 have been added and which should also be made intrinsics:

compress / expand  (x86 PEXT/PDEP)


Comments
Regarding table lookup to handle each byte separately: It seems unlikely to me that four or eight distinct table lookups will be very fast compared to the log-N (5 or 6 ALU steps) Hacker's Delight algorithm in the JDK. If you can get all the "lanes" to perform a table lookup in parallel, then you can get something more like constant-time (not log-N steps). So on modern x86 I would try using PSHUFB to bit-reverse 4-bit slices. Sketch of algorithm: 1. split scalar x into x1 = x & M, x2 = (x >> 4) & M where M = 0x0F…0F 2. move x1, x2 to vector registers v1, v2 3. load 16-byte reverse-nybble vector: vNR = 0x_F7B3_D591_1E6A_2C480 4. use PSHUFB with each of v1, v2, to steer nybbles from vNR into v1NR, v2NR 5. reverse step 2 from v1NR, v2NR back to x1NR, x2NR 6. swapping nybbles, recombine xBR = x2NR | (x1NR << 4) 7. reverse bytes of xBR The nice thing about this algorithm is that it works on vectors as well as scalars. That is, all of the scalar operations can be performed in the vector unit instead. Step 7 (byte reversal) would be another PSHUFB, of course. As a fallback I'd try a 256-way table indexed by bytes, but even for that I'd consider using a gather instruction to perform the table lookups in parallel. That means moving the operand into a vector register and then using PSHUFB to expand bytes to (64-bit) words. I think the previous algorithm will be faster. It seems people used a similar constant-time trick for pop-count with good results. References: https://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious https://web.archive.org/web/20180215152955/http://wm.ite.pl/articles/sse-popcount.html https://web.archive.org/web/20180713151429/https://arxiv.org/abs/1611.07612 http://0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.html
13-08-2020

I'm sponsoring this issue for Jason Tatton, jptatton@amazon.com.
06-08-2020