JDK-8344424 : C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 24
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2024-11-18
  • Updated: 2025-05-26
  • Fix Version: tbd (Unresolved)
Description
I'm filing this as a bug, not an RFE, because it would be a possible performance regression with UseCompactObjectHeaders, were it to leave experimental status or become default. This regression would only affect machines that require strict alignment (see AlignVector and Matcher::misaligned_vectors_ok).

------------------------------------------------------------------------------

JDK-8305895 added UseCompactObjectHeaders, which changed the offset from object base to array payload:

-XX:-UseCompactObjectHeaders
UNSAFE.ARRAY_BYTE_BASE_OFFSET = 16
UNSAFE.ARRAY_SHORT_BASE_OFFSET = 16
UNSAFE.ARRAY_CHAR_BASE_OFFSET = 16
UNSAFE.ARRAY_INT_BASE_OFFSET = 16
UNSAFE.ARRAY_LONG_BASE_OFFSET = 16
UNSAFE.ARRAY_FLOAT_BASE_OFFSET = 16
UNSAFE.ARRAY_DOUBLE_BASE_OFFSET = 16

-XX:+UseCompactObjectHeaders
UNSAFE.ARRAY_BYTE_BASE_OFFSET = 12
UNSAFE.ARRAY_SHORT_BASE_OFFSET = 12
UNSAFE.ARRAY_CHAR_BASE_OFFSET = 12
UNSAFE.ARRAY_INT_BASE_OFFSET = 12
UNSAFE.ARRAY_LONG_BASE_OFFSET = 16
UNSAFE.ARRAY_FLOAT_BASE_OFFSET = 12
UNSAFE.ARRAY_DOUBLE_BASE_OFFSET = 16

---------------------------------------------------------------------

On platforms that require strict alignment, all vector loads/stores must be 8-byte aligned. One might think that alignment to the full vector width is required, but it turns out that 8-byte alignment is sufficient. Relevant code:

src/hotspot/share/opto/vectorization.hpp:  static bool vectors_should_be_aligned() { return !Matcher::misaligned_vectors_ok() || AlignVector; }



src/hotspot/cpu/x86/matcher_x86.hpp:
  // x86 supports misaligned vectors store/load.
  static constexpr bool misaligned_vectors_ok() {
    return true;
  }

src/hotspot/cpu/ppc/matcher_ppc.hpp:
  // PPC implementation uses VSX load/store instructions (if
  // SuperwordUseVSX) which support 4 byte but not arbitrary alignment
  static constexpr bool misaligned_vectors_ok() {
    return false;
  }

src/hotspot/cpu/aarch64/matcher_aarch64.hpp:
  // aarch64 supports misaligned vectors store/load.
  static constexpr bool misaligned_vectors_ok() {
    return true;
  }

src/hotspot/cpu/s390/matcher_s390.hpp:
  // z/Architecture does support misaligned store/load at minimal extra cost.
  static constexpr bool misaligned_vectors_ok() {
    return true;
  }

src/hotspot/cpu/arm/matcher_arm.hpp:
  // ARM doesn't support misaligned vectors store/load.
  static constexpr bool misaligned_vectors_ok() {
    return false;
  }

src/hotspot/cpu/riscv/matcher_riscv.hpp:
  // riscv supports misaligned vectors store/load.
  static constexpr bool misaligned_vectors_ok() {
    return true;
  }
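
As a rough illustration of how the predicate and the per-platform values quoted above combine, here is a small Java model. It is only a sketch: the class AlignmentPolicySketch and its map are hypothetical names, the boolean values are copied from the excerpts above, and AlignVector is passed in as a plain parameter rather than read from a real VM flag.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AlignmentPolicySketch {
    // Hypothetical table; values copied from the matcher_*.hpp excerpts above.
    static final Map<String, Boolean> MISALIGNED_VECTORS_OK = new LinkedHashMap<>();
    static {
        MISALIGNED_VECTORS_OK.put("x86", true);
        MISALIGNED_VECTORS_OK.put("ppc", false);
        MISALIGNED_VECTORS_OK.put("aarch64", true);
        MISALIGNED_VECTORS_OK.put("s390", true);
        MISALIGNED_VECTORS_OK.put("arm", false);
        MISALIGNED_VECTORS_OK.put("riscv", true);
    }

    // Models: !Matcher::misaligned_vectors_ok() || AlignVector
    static boolean vectorsShouldBeAligned(boolean misalignedVectorsOk, boolean alignVector) {
        return !misalignedVectorsOk || alignVector;
    }

    public static void main(String[] args) {
        boolean alignVector = false; // assumption: flag not set by the user or by CPU detection
        for (Map.Entry<String, Boolean> e : MISALIGNED_VECTORS_OK.entrySet()) {
            System.out.println(e.getKey() + ": vectors_should_be_aligned = "
                    + vectorsShouldBeAligned(e.getValue(), alignVector));
        }
    }
}
```

In this model, only ppc and arm require alignment when AlignVector is off; setting alignVector to true makes every platform require it.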

And there are some exceptions, for example on aarch64 and x86:

x86:
src/hotspot/cpu/x86/vm_version_x86.cpp:    AlignVector = !UseUnalignedLoadStores;

      if (supports_sse4_2()) { // new ZX cpus
        if (FLAG_IS_DEFAULT(UseUnalignedLoadStores)) {
          UseUnalignedLoadStores = true; // use movdqu on newest ZX cpus
        }
      }
So I suppose some older platforms may be affected, though I have not seen one yet. They would have to be missing the unaligned movdqu instructions.

aarch64:
src/hotspot/cpu/aarch64/vm_version_aarch64.cpp:    AlignVector = AvoidUnalignedAccesses;

  // Ampere eMAG
  if (_cpu == CPU_AMCC && (_model == CPU_MODEL_EMAG) && (_variant == 0x3)) {
    if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
      FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
    }
and

  // ThunderX
  if (_cpu == CPU_CAVIUM && (_model == 0xA1)) {
    guarantee(_variant != 0, "Pre-release hardware no longer supported.");
    if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
      FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
    }
and

  // ThunderX2
  if ((_cpu == CPU_CAVIUM && (_model == 0xAF)) ||
      (_cpu == CPU_BROADCOM && (_model == 0x516))) {
    if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
      FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
    }
and

  // HiSilicon TSV110
  if (_cpu == CPU_HISILICON && _model == 0xd01) {
    if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
      FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
    }

--------------------------------------------------------------

If we do not require strict alignment, then we can use unaligned memory accesses, such as vmovdqu.

With a strict alignment requirement (i.e. 8-byte alignment) / AlignVector, we need to make sure that the address of every vector load/store satisfies:
adr % 8 = 0

Of course all object bases are aligned with ObjectAlignmentInBytes = 8.

---------------------------

Now let's try to get that 8-byte alignment in an example:

    public short[] convertFloatToShort() {
        short[] res = new short[SIZE];
        for (int i = 0; i < SIZE; i++) {
            res[i] = (short) floats[i];
        }
        return res;
    }

Let's look at the two addresses with UseCompactObjectHeaders=false, where we can vectorize:

F_adr = base + 16 + 4 * i
-> aligned for: i % 2 = 0
S_adr = base + 16 + 2 * i
-> aligned for: i % 4 = 0

-> solution for both: i % 4 = 0, i.e. we have alignment for both vector accesses every 4th iteration.


Let's look at the two addresses with UseCompactObjectHeaders=true, where we cannot vectorize:

F_adr = base + 12 + 4 * i
-> aligned for: i % 2 = 1
S_adr = base + 12 + 2 * i
-> aligned for: i % 4 = 2

-> There is no solution to satisfy both alignment constraints!
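
The two cases above can be checked mechanically. The following Java sketch brute-forces the iterations i (mod 8) at which every access is 8-byte aligned. It is not how C2 computes the alignment solution; the class and method names are hypothetical, and it assumes base % 8 == 0 and the same payload offset for all arrays in the loop.

```java
import java.util.ArrayList;
import java.util.List;

public class AlignVectorSolutions {
    static final int ALIGNMENT = 8; // required alignment in bytes, as stated above

    // Returns every i in [0, 8) for which (payloadOffset + elemSize * i) % 8 == 0
    // holds for all given element sizes. Assumes the object base is 8-byte aligned.
    static List<Integer> alignedIterations(int payloadOffset, int... elemSizes) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < ALIGNMENT; i++) {
            boolean allAligned = true;
            for (int size : elemSizes) {
                if ((payloadOffset + size * i) % ALIGNMENT != 0) {
                    allAligned = false;
                }
            }
            if (allAligned) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // float (4 bytes) and short (2 bytes), as in convertFloatToShort
        System.out.println(alignedIterations(16, 4, 2)); // prints [0, 4]
        System.out.println(alignedIterations(12, 4, 2)); // prints []
        // a single-type float loop with the 12-byte offset still has solutions
        System.out.println(alignedIterations(12, 4));    // prints [1, 3, 5, 7]
    }
}
```

An empty list corresponds to "no solution": no pre-loop iteration count can align both accesses at once.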

----------------------------

Of course, this is not strictly due to UseCompactObjectHeaders: other flags also affect the distance to the payload, such as UseCompressedClassPointers, which I think everyone has enabled now. But the question is whether we are OK with enabling UseCompactObjectHeaders, given that some mixed-type (e.g. conversion) loops will then fail to vectorize due to impossible alignment constraints.

-------------------------------------

If you are more interested in how we currently compute the alignment solution for AlignVector, please see:
JDK-8310190
https://github.com/openjdk/jdk/pull/14785
Comments
PPC64 is no longer affected: JDK-8348678
26-05-2025

ILW = Some loops don't vectorize anymore (potential impact on performance), with experimental option -XX:+UseCompactObjectHeaders on platforms that require alignment, no workaround = MLH = P4
22-11-2024

> The simple loop would be preferable, I think. Don't you agree?
Yes, I agree. But I think we can still vectorize the other examples if we emulate unaligned accesses using two aligned accesses and a shift. In the worst case of a short loop, this means twice as many memory accesses, but in the best case, it should be only one extra access for the loop, because the "2nd 8 bytes" of the vector pair in the current iteration becomes the "1st 8 bytes" of the next iteration.
20-11-2024
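
The "two aligned accesses and a shift" idea from the comment above can be sketched in scalar Java. This is only a model under stated assumptions, not C2 code: memory is represented as an array of aligned 64-bit words, bytes are numbered little-endian within a word, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class UnalignedLoadEmulation {
    // Reads `count` consecutive 8-byte values starting at byteOffset, using only
    // aligned word reads. Each iteration reuses the previous high word as its low
    // word, so the loop needs just one extra aligned read overall.
    static long[] loadStream(long[] words, int byteOffset, int count) {
        int index = byteOffset / 8;
        int shift = (byteOffset % 8) * 8; // misalignment in bits
        if (shift == 0) {
            // Already aligned; also avoids a shift by 64, which Java treats as 0.
            return Arrays.copyOfRange(words, index, index + count);
        }
        long[] out = new long[count];
        long lo = words[index]; // the one extra access for the whole loop
        for (int k = 0; k < count; k++) {
            long hi = words[index + k + 1];
            out[k] = (lo >>> shift) | (hi << (64 - shift));
            lo = hi; // the "2nd 8 bytes" become the "1st 8 bytes" of the next iteration
        }
        return out;
    }

    public static void main(String[] args) {
        long[] words = {
            0x0706050403020100L, 0x0F0E0D0C0B0A0908L, 0x1716151413121110L
        };
        for (long v : loadStream(words, 4, 2)) {
            System.out.println(Long.toHexString(v));
        }
    }
}
```

With a 4-byte misalignment, as in the 12-byte payload offset case, the two results are assembled from three aligned reads instead of four.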

[~dlong] we "could" do a lot of things to remedy this vectorization blocker. But the question is: does anybody care enough to put in the work? I don't want to deal much more with strict alignment than I have to. It is really annoying, and basically all modern CPUs actually support unaligned vector load/store. As far as I know, there is not even a performance impact of not aligning on modern CPUs.
20-11-2024

Hi [~epeter], thanks for asking. The Arm Architecture Reference Manual, https://developer.arm.com/documentation/ddi0487/latest, doesn't require these alignment constraints, so this wouldn't break those platforms. However, the change was made because performance on those machines was particularly poor with unaligned accesses. See https://bugs.openjdk.org/browse/JDK-8159063. I no longer have access to these machines. As they are rather old, I think a performance regression there is acceptable for the sake of improving overall performance on recent hardware. Thanks.
19-11-2024

[~rkennke] also mentioned that it is preferable to have short arrays, especially because Strings are based on byte/char arrays, and they are often short, so the 4-byte win is significant. Of course there are also String operations where we go from ASCII to Unicode, and that is a mixed-type loop, like this:
    for (int i = 0; i < SIZE; i++) {
        chars[i] = bytes[i];
    }
In the end it is a trade-off between a lower memory footprint and the performance of loops with vectorization.
19-11-2024

[~dlong] Let's analyze that case:
    for (int i = 0; i + 2 < SIZE; i++) {
        res[i + 2] = (short) floats[i + 1];
    }
S_adr = base + 16 + 2 * (i+2) = base + 20 + 2*i
-> i % 4 = 2
F_adr = base + 16 + 4 * (i+1) = base + 20 + 4*i
-> i % 2 = 1
So yes, there is no solution here. Yes, there will always be examples that do not work. But the question is which are the most used examples, where one would hope it would vectorize. The simple loop would be preferable, I think. Don't you agree?
    for (int i = 0; i + 2 < SIZE; i++) {
        res[i] = (short) floats[i];
    }
19-11-2024

I'm fixing the tests here: JDK-8340010
Responding to some comments from over there:
[~rkennke]
> Some related code may have changed while we reviewed the big PR, maybe?
I doubt it, I think we just did not run sufficiently high tiers with all relevant flag combinations.
> Can it be made that vectorization does not work in those mixed scenarios but does work when both arrays are aligned? Or maybe it already works that way?
Yes, it works in non-mixed cases. It also works in some mixed cases. It only does not work if both types are less than 8 bytes, otherwise we can usually make the alignment happen.
> There is work in progress to make the object header even more compact, down to 4 bytes. I suppose that would solve the issue. But I'm hearing some significant resistance, so let's see if that happens.
[~fgao] I know that you at ARM care about some machines that require strict alignment. What is your opinion on all this?
19-11-2024

Is there an existing loop transformation that allows us to vectorize the following with -XX:-UseCompactObjectHeaders -XX:+AlignVector?
    for (int i = 0; i + 2 < SIZE; i++) {
        res[i + 2] = (short) floats[i + 1];
    }
If I'm not mistaken, it has a similar issue.
19-11-2024

Up to now, the discussion has been on GitHub: https://github.com/openjdk/jdk/pull/20677#issuecomment-2483138198
18-11-2024