Support synthetic vector shapes which compile to small ordered sequences of hardware vectors, such as 4x128 bits, 3x256 bits, or 2x512 bits.
Reasons:
1. competitiveness with C APIs, many of which have small multi-vectors.
2. configurable control (via a static shape parameter) of loop unrolling, for highly tuned portable loop kernels.
3. ability to represent (16-bit or 32-bit) index vectors for vectors with 128 or more lanes.
4. eventual support of 2-D partial reduction operations for configurable unrolled reduction loops (BLAS).
A 3-way unrolled 4-lane loop needs synthetic 12-lane vectors, if it is to avoid textual repetition of code. Textual repetition is not only a maintenance burden, but also a blocker to runtime configuration of algorithms.
A 3-way unrolled 4-lane reduction loop needs to operate on 12-lane vectors, with additional structure which allows a 12-lane vector to be partially reduced (along the minor 3-axis) to a 4-lane accumulation value.
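As a rough illustration of the partial reduction described above, here is a minimal sketch in plain Java. It models a 3x4 synthetic vector as a float[12] and assumes the three 4-lane parts are stored consecutively; the text does not specify a layout, and the class and method names are hypothetical, not part of the Vector API.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any JDK API. It models a 3x4 synthetic
// vector as a plain float[12] holding three consecutive 4-lane parts
// (an assumed layout, chosen only for this sketch).
final class PartialReduceSketch {
    static final int PARTS = 3; // unroll factor (assumed)
    static final int LANES = 4; // lanes per hardware vector (assumed)

    // Adds the three 4-lane parts lane-wise, producing one 4-lane accumulator.
    static float[] partialSum(float[] v12) {
        if (v12.length != PARTS * LANES) {
            throw new IllegalArgumentException("expected " + (PARTS * LANES) + " lanes");
        }
        float[] acc = new float[LANES];
        for (int p = 0; p < PARTS; p++) {
            for (int l = 0; l < LANES; l++) {
                acc[l] += v12[p * LANES + l];
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] v = {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400};
        System.out.println(Arrays.toString(partialSum(v))); // [111.0, 222.0, 333.0, 444.0]
    }
}
```

With a synthetic shape, PARTS would presumably become a static shape parameter rather than a hand-written loop bound, so the same kernel text could serve other unroll factors.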
Currently, the representation of shuffles (permutation vectors) uses the same lane size and lane type as the vector to be permuted. This has two drawbacks:
a. floating point types are an unnatural representation for indexes
b. signed byte integral types don't have enough dynamic range if vectors have more than 128 lanes (as they do on some current and future VPUs)
A Java-like fix to both (a) and (b) is to use 32-bit ints as index lanes. A compromise (valid for known future VPUs but slightly less future-proof) is to use 16-bit Java shorts (valid up to 32768 lanes). In either case, shuffles for byte vectors with more than 128 lanes will require index vectors that are synthetic multi-vectors of at least 2x (16-bit) or 4x (32-bit) the size of the byte vector. Also, applying the shuffle API points to synthetic multi-vectors will require indexes larger than 256 bits.
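The lane-count limits quoted above follow from the range of Java's signed integral types. The sketch below checks them in plain Java; the class and method names are hypothetical and only illustrate the arithmetic, not any Vector API behavior.

```java
// Hypothetical helper, not part of any JDK API: checks which lane indexes
// fit in a signed index lane of a given width.
final class ShuffleIndexRangeSketch {
    // Number of lanes whose indexes 0..n-1 all fit in a signed lane of
    // the given bit width (only non-negative values are usable as indexes).
    static long maxAddressableLanes(int indexBits) {
        return 1L << (indexBits - 1);
    }

    // True if the index survives a narrowing cast to byte unchanged.
    static boolean fitsInByte(int laneIndex) {
        return (byte) laneIndex == laneIndex;
    }

    // True if the index survives a narrowing cast to short unchanged.
    static boolean fitsInShort(int laneIndex) {
        return (short) laneIndex == laneIndex;
    }

    // How many times larger an index vector is than the data vector it
    // permutes, given the same lane count and the two lane widths.
    static int indexVectorScale(int dataLaneBits, int indexLaneBits) {
        return indexLaneBits / dataLaneBits;
    }

    public static void main(String[] args) {
        System.out.println(maxAddressableLanes(8));   // 128
        System.out.println(maxAddressableLanes(16));  // 32768
        System.out.println(fitsInByte(127));          // true
        System.out.println(fitsInByte(128));          // false: (byte) 128 is -128
        System.out.println(indexVectorScale(8, 16));  // 2
        System.out.println(indexVectorScale(8, 32));  // 4
    }
}
```

The last two lines correspond to the 2x and 4x multi-vector sizes mentioned for byte vectors with 16-bit and 32-bit index lanes.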
After this RFE is adopted, we can consider a more uniform and future-proof lane size (16-bit or 32-bit) for the VectorShuffle type.