Bug ID: JDK-8262982 [vector API] add IR and API points for synthetic multi-vectors

Versions (Unresolved/Resolved/Fixed)

Other: repo-panama (Unresolved)

Support synthetic vector shapes which compile to small ordered sequences of hardware vectors, such as 4x128 bits, 3x256 bits, or 2x512 bits.

Reasons:

1. competitiveness with C APIs, many of which have small multi-vectors.

2. configurable control (via static shape parameter) of loop unrolling, for highly tuned portable loop kernels

3. ability to represent (16 or 32 bit) index vectors for vectors with 128 or more lanes

4. eventual support of 2-D partial reduction operations for configurable unrolled reduction loops (BLAS)

A 3-way unrolled 4-lane loop needs synthetic 12-lane vectors, if it is to avoid textual repetition of code. Textual repetition is not only a maintenance burden, but also a blocker to runtime configuration of algorithms.

A 3-way unrolled 4-lane reduction loop needs to operate on 12-lane vectors, with additional structure which allows a 12-vector to be partially reduced (along the minor 3-axis) to a 4-vector accumulation value.
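The partial reduction described above can be sketched with plain arrays. This is a scalar model, not the Vector API: the class name, the method names, and the lane layout (part p occupies lanes p*4 through p*4+3) are assumptions made for illustration, since the RFE does not specify them. The unroll factor is an ordinary parameter, so one loop body serves every configuration.

```java
import java.util.Arrays;

// Scalar model only: float[] stands in for vectors; none of these names come from the Vector API.
final class MultiVectorSketch {
    static final int LANES = 4; // lanes per modeled hardware vector

    /** Sums the sub-vectors lane by lane, reducing parts*LANES lanes to LANES lanes. */
    static float[] partialReduce(float[] multi, int parts) {
        if (multi.length != parts * LANES) {
            throw new IllegalArgumentException("expected " + parts * LANES + " lanes");
        }
        float[] acc = new float[LANES];
        for (int p = 0; p < parts; p++) {
            for (int l = 0; l < LANES; l++) {
                // Assumed layout: part p occupies lanes [p*LANES, (p+1)*LANES).
                acc[l] += multi[p * LANES + l];
            }
        }
        return acc;
    }

    /** Sums an array with a loop whose unroll factor is a parameter, not repeated text. */
    static float sum(float[] a, int parts) {
        int step = parts * LANES;
        float[] multiAcc = new float[step]; // models the synthetic multi-vector accumulator
        int i = 0;
        for (; i + step <= a.length; i += step) {
            for (int k = 0; k < step; k++) {
                multiAcc[k] += a[i + k];
            }
        }
        float total = 0f;
        for (float v : partialReduce(multiAcc, parts)) {
            total += v; // final reduction of the LANES-wide accumulator
        }
        for (; i < a.length; i++) {
            total += a[i]; // scalar tail for leftover elements
        }
        return total;
    }

    public static void main(String[] args) {
        float[] twelve = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        System.out.println(Arrays.toString(partialReduce(twelve, 3)));
        System.out.println(sum(twelve, 3));
    }
}
```

Calling sum with parts set to 1, 2, or 3 runs the same code at different unroll factors, which is the kind of runtime configuration that textual repetition would block.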

Currently, the representation of shuffles (permutation vectors) uses the same lane size and lane type as the vector to be permuted. This has two drawbacks:

a. floating point types are an unnatural representation for indexes
b. signed byte integral types don't have enough dynamic range if vectors have more than 128 lanes (as they do on some current and future VPUs)

A Java-like fix to both (a) and (b) is to use 32-bit ints as index lanes. A compromise (valid for known future VPUs but slightly less future proof) is to use 16-bit Java shorts (valid up to 32768 lanes). In either case, byte vectors with more than 128 lanes will require at least synthetic multi-vectors of the 2x or 4x size. Also, applying the shuffle API points to synthetic multi-vectors will require indexes larger than 256 bits.
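The arithmetic behind these lane limits can be checked with a small sketch. The helpers below are hypothetical and only illustrate the ranges involved; the 2048-bit hardware vector width in the example is likewise an assumption chosen so that a byte vector has 256 lanes.

```java
// Illustrative arithmetic only; these helpers are hypothetical and not part of the Vector API.
final class ShuffleIndexRange {
    /** Largest lane count whose indexes 0..lanes-1 fit in a signed integer of indexBits bits. */
    static long maxLanes(int indexBits) {
        return 1L << (indexBits - 1);
    }

    /** Hardware vectors of hardwareBits needed to hold one index of indexBits per lane. */
    static int indexVectorsNeeded(int lanes, int indexBits, int hardwareBits) {
        long totalBits = (long) lanes * indexBits;
        return (int) ((totalBits + hardwareBits - 1) / hardwareBits); // round up
    }

    public static void main(String[] args) {
        // A signed byte cannot represent lane index 200: the cast wraps to a negative value.
        System.out.println((byte) 200);
        System.out.println(maxLanes(8) + " " + maxLanes(16));
        // 256 byte lanes (assumed 2048-bit vector) with 8-, 16-, and 32-bit index lanes.
        System.out.println(indexVectorsNeeded(256, 8, 2048) + " "
                + indexVectorsNeeded(256, 16, 2048) + " "
                + indexVectorsNeeded(256, 32, 2048));
    }
}
```

Under these assumptions, 16-bit index lanes need a 2x multi-vector and 32-bit index lanes need a 4x multi-vector to shuffle a single byte vector.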

After this RFE is adopted, we can consider a more uniform and future-proof lane size (16-bit or 32-bit) for the VectorShuffle type.

From what I gather, the idea is to introduce a synthetic vector type associated with an abstract IR node, which is then lowered to a target-mappable IR node in a legalization stage; any shape that cannot be fully or partially lowered to a target-specific species should be scalarized. A 2D reduction could then be a legal species if the target supports operations over 2D vectors or tensors; otherwise it should be lowered to a vector-of-vectors form, with the reduction performed independently on each vector and the results then aggregated. Would it be OK to unlink this from the masking optimizations for AVX512 and SVE and handle it as a separate project, since the scope is well beyond masking optimizations?
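One way to picture the legalization step in this comment is as a function from a synthetic shape to a count of hardware vectors plus a count of scalarized lanes. The sketch below is a guess at that bookkeeping under simple assumptions (a single supported hardware width, with 0 meaning no vector unit); the class, method, and parameters are hypothetical and do not describe any actual C2 implementation.

```java
// Hypothetical model of shape legalization; not taken from HotSpot or the Vector API.
final class LegalizeSketch {
    /**
     * Returns {hardwareVectors, scalarLanes} for a synthetic shape of totalBits.
     * hardwareBits is the vector width of the assumed target, or 0 if it has none.
     */
    static int[] lower(int totalBits, int elementBits, int hardwareBits) {
        if (totalBits <= 0 || elementBits <= 0 || hardwareBits < 0
                || totalBits % elementBits != 0 || hardwareBits % elementBits != 0) {
            throw new IllegalArgumentException("inconsistent shape");
        }
        // Full or partial lowering: use as many whole hardware vectors as fit.
        int vectors = hardwareBits > 0 ? totalBits / hardwareBits : 0;
        // Whatever does not fill a hardware vector is scalarized lane by lane.
        int leftoverBits = totalBits - vectors * hardwareBits;
        return new int[] {vectors, leftoverBits / elementBits};
    }

    public static void main(String[] args) {
        int[] r = lower(768, 32, 256); // a 3x256-bit shape of 32-bit lanes
        System.out.println(r[0] + " vectors, " + r[1] + " scalar lanes");
    }
}
```

A 768-bit shape on a 256-bit target lowers fully to three vectors, while a 96-bit shape on a 128-bit target is scalarized into three 32-bit lanes.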

08-09-2021