When masking is better integrated into the IR, introduce a synthetic partial-vector type, which represents a view of a regular non-partial vector whose size can be less than the natural non-partial size. Implement it with a scalar side value, a small integer in the range [1..VLENGTH], which entails a mask with that many 1-bits (starting at lane 0) that is applied to all operations. (The remaining lanes are populated with don't-care values but appear as zero padding to the user.)
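As a rough illustration of the side value and its implied mask, here is a minimal scalar sketch in plain Java. The class and method names (`PartialVectorSketch`, `prefixMask`, `add`) are hypothetical, and the limit of 64 lanes is only an assumption that lets the mask fit in a `long`; none of this is an existing JDK API.

```java
final class PartialVectorSketch {
    // Hypothetical helper: build the lane mask implied by a partial-vector
    // side value. `count` active lanes start at lane 0; vlength <= 64 is an
    // assumption of this sketch so that the mask fits in a long.
    static long prefixMask(int count, int vlength) {
        if (vlength < 1 || vlength > 64) {
            throw new IllegalArgumentException("vlength out of range: " + vlength);
        }
        if (count < 1 || count > vlength) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        // A shift by 64 would wrap around in Java, so handle the full mask separately.
        return count == 64 ? -1L : (1L << count) - 1;
    }

    // Lane-wise add under the implied mask. Inactive lanes are left at zero,
    // modeling the "zero padding" that the user observes.
    static int[] add(int[] a, int[] b, int count) {
        int vlength = a.length;
        long mask = prefixMask(count, vlength);
        int[] result = new int[vlength];
        for (int lane = 0; lane < vlength; lane++) {
            if (((mask >>> lane) & 1L) != 0) {
                result[lane] = a[lane] + b[lane];
            }
        }
        return result;
    }
}
```

In a real IR the mask would presumably not be materialized lane by lane; the sketch only shows that a single small integer is enough to describe the active lanes.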
There are at least three purposes for synthetic partial vectors:
1. Expressing pre- and post-loops in a uniform manner.
2. Expressing alignment operations. It should be possible to configure a pre-loop (or even a mid-loop) to process a number of stream elements in the range [1..VLENGTH-1], so as to ensure that the next iteration starts on a desirable boundary. This should unlock IR-level strength reductions to use aligned memory instructions, when those exist.
3. Better compilation on VPUs which directly support counted partial vectors. (These have some notion of "next number of elements to process". SVE and RISC-V may be examples.)
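The first and third purposes can be sketched with a counted loop in which every iteration, including the final short one, is expressed the same way. This is a scalar model under assumptions: `VLENGTH = 8`, and the names `CountedLoopSketch`, `sum`, and `sumChunk` are hypothetical stand-ins for what would be a single counted partial-vector operation.

```java
final class CountedLoopSketch {
    static final int VLENGTH = 8; // assumed lane count, for illustration only

    // Sum an array in chunks of up to VLENGTH elements. The last chunk may be
    // partial, but it goes through the same code path as the full chunks, so
    // no separately written post-loop is needed.
    static long sum(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i += VLENGTH) {
            // "Next number of elements to process", always in [1..VLENGTH].
            int count = Math.min(VLENGTH, data.length - i);
            total += sumChunk(data, i, count);
        }
        return total;
    }

    // Scalar stand-in for one partial-vector reduction over `count` lanes.
    static long sumChunk(int[] data, int offset, int count) {
        long s = 0;
        for (int lane = 0; lane < count; lane++) {
            s += data[offset + lane];
        }
        return s;
    }
}
```

For a 19-element array, the loop runs three times with counts 8, 8, and 3.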
Note that the GC can change the alignment of on-heap data at any safepoint. This implies that alignment cannot be enforced by the Java source code; it must be advised and enforced (when possible) by the JIT. As a matter of API design, alignment (of one selected source or destination stream) should be continuously requested, not just set up in a manual pre-loop. The JIT should then split the loop structure into pre-loop, aligned-main-loop, unaligned-main-loop, and post-loop, with appropriate phase transitions around safepoints.
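The arithmetic behind such a split can be sketched as follows. This is only an illustration under stated assumptions: addresses are measured in elements rather than bytes, `VLENGTH = 8`, and the names `AlignmentSplitSketch`, `preLoopCount`, and `plan` are hypothetical. Because the GC may move the data, a plan like this would have to be recomputed after each safepoint; the sketch computes it once for a given address.

```java
final class AlignmentSplitSketch {
    static final int VLENGTH = 8; // assumed lane count, for illustration only

    // Number of elements a pre-loop must process so that the next element
    // sits on a VLENGTH boundary. The result is in [0..VLENGTH-1]; zero means
    // the stream is already aligned and no pre-loop is needed.
    static int preLoopCount(long elementAddress, int remaining) {
        int misalign = (int) Math.floorMod(elementAddress, (long) VLENGTH);
        int toBoundary = (VLENGTH - misalign) % VLENGTH;
        // Never process more elements than the stream still holds.
        return Math.min(toBoundary, remaining);
    }

    // Returns {preCount, alignedFullVectors, postCount} for a stream of
    // `length` elements starting at `elementAddress`.
    static int[] plan(long elementAddress, int length) {
        int pre = preLoopCount(elementAddress, length);
        int rest = length - pre;
        return new int[] { pre, rest / VLENGTH, rest % VLENGTH };
    }
}
```

With a start address of 3 and 30 elements, the plan is a 5-element pre-loop, three aligned full vectors, and a 1-element post-loop. If the same data were relocated to address 16, the plan would change to no pre-loop, three full vectors, and a 6-element post-loop, which is why the alignment request needs to stay in force across the whole loop rather than being set up once.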