JDK-8371603 : C2: Missing Ideal optimizations for load and store vectors on SVE
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 26
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: linux
  • CPU: aarch64
  • Submitted: 2025-11-11
  • Updated: 2025-12-08
Description
The issue appears to be a combination of mistakes introduced in JDK-8286941 (JDK 20) and the verification added in JDK-8367389 (JDK 26).

JDK-8286941 has the consequence that we do not perform all Ideal optimizations. This is an issue on its own, because we do not optimize as much as we could.

However, with JDK-8367389 we now encounter a particular case with multiple LoadVectorNodes that have separate MergeMem nodes. We could have stepped over these MergeMems to reach the same memory state, but the mistakes from JDK-8286941 prevent this. As a result, we have multiple loads that SHOULD have the same memory state, but instead have different MergeMem nodes. This triggers the assert in SuperWord introduced with JDK-8367389.

---------------------------------------------- ORIGINAL REPORT ------------------------------------------------------------

When running a JMH benchmark on an **AWS Graviton3 machine (with 256-bit SVE support)**, the following assert fails. The log is as follows:

```
# Run progress: 0.00% complete, ETA 00:00:14
# Fork: 1 of 1
WARNING: Using incubator modules: jdk.incubator.vector
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::objectFieldOffset has been called by org.openjdk.jmh.util.Utils (file:/localhome/jadmin/erfang/jdk/build/linux-aarch64-server-fastdebug/images/test/micro/benchmarks.jar)
WARNING: Please consider reporting this to the maintainers of class org.openjdk.jmh.util.Utils
WARNING: sun.misc.Unsafe::objectFieldOffset will be removed in a future release
# Warmup Iteration   1: #
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/localhome/jadmin/erfang/jdk/src/hotspot/share/opto/vectorization.cpp:231), pid=167298, tid=167336
#  assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
#
# JRE version: OpenJDK Runtime Environment (26.0) (fastdebug build 26-internal-adhoc.jadmin.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 26-internal-adhoc.jadmin.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x1b2d5ec]  VLoopMemorySlices::find_memory_slices()+0x29c
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -F%F -- %E" (or dumping to /localhome/jadmin/erfang/jdk/build/linux-aarch64-server-fastdebug/images/test/core.167298)
#
# An error report file with more information is saved as:
# /localhome/jadmin/erfang/jdk/build/linux-aarch64-server-fastdebug/images/test/hs_err_pid167298.log
^C
ERROR: Build failed for target 'test' in configuration 'linux-aarch64-server-fastdebug' (exit code 141)

No indication of failed target found.
HELP: Try searching the build log for '] Error'.
HELP: Run 'make doctor' to diagnose build problems.

make[1]: *** [/localhome/jadmin/erfang/jdk/make/Init.gmk:151: main] Error 141
make: *** [/localhome/jadmin/erfang/jdk/make/PreInit.gmk:159: test] Interrupt
```

The test case is a new JMH benchmark file; you can put it at ```test/micro/org/openjdk/bench/jdk/incubator/vector/MaskLastTrueBenchmark.java```

The code is as follows:
```java
/*
 * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

package org.openjdk.bench.jdk.incubator.vector;

import java.util.Random;
import jdk.incubator.vector.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Warmup(iterations = 4, time = 2)
@Measurement(iterations = 6, time = 1)
@Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
public class MaskLastTrueBenchmark {
    @Param({"128"})
    int size;

    private static final VectorSpecies<Byte> bspecies = VectorSpecies.ofLargestShape(byte.class);
    private static final VectorSpecies<Short> sspecies = VectorSpecies.ofLargestShape(short.class);
    private static final VectorSpecies<Integer> ispecies = VectorSpecies.ofLargestShape(int.class);
    private static final VectorSpecies<Long> lspecies = VectorSpecies.ofLargestShape(long.class);
    private static final VectorSpecies<Float> fspecies = VectorSpecies.ofLargestShape(float.class);
    private static final VectorSpecies<Double> dspecies = VectorSpecies.ofLargestShape(double.class);

    byte[] byte_arr;
    short[] short_arr;
    int[] int_arr;
    long[] long_arr;
    float[] float_arr;
    double[] double_arr;
    boolean[] mask_arr;

    @Setup(Level.Trial)
    public void BmSetup() {
        Random r = new Random();
        byte_arr = new byte[size];
        short_arr = new short[size];
        int_arr = new int[size];
        long_arr = new long[size];
        float_arr = new float[size];
        double_arr = new double[size];
        mask_arr = new boolean[size];

        for (int i = 0; i < size; i++) {
            byte_arr[i] = (byte) r.nextInt();
            short_arr[i] = (short) r.nextInt();
            int_arr[i] = r.nextInt();
            long_arr[i] = r.nextLong();
            float_arr[i] = r.nextFloat();
            double_arr[i] = r.nextDouble();
            mask_arr[i] = r.nextBoolean();
        }
    }

    // VectorMask.fromArray + lastTrue

    @Benchmark
    public int testLastTrueFromArrayByte() {
        int sum = 0;
        for (int i = 0; i < size; i += bspecies.length()) {
            VectorMask<Byte> m = VectorMask.fromArray(bspecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromArrayShort() {
        int sum = 0;
        for (int i = 0; i < size; i += sspecies.length()) {
            VectorMask<Short> m = VectorMask.fromArray(sspecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromArrayInt() {
        int sum = 0;
        for (int i = 0; i < size; i += ispecies.length()) {
            VectorMask<Integer> m = VectorMask.fromArray(ispecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromArrayLong() {
        int sum = 0;
        for (int i = 0; i < size; i += lspecies.length()) {
            VectorMask<Long> m = VectorMask.fromArray(lspecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromArrayFloat() {
        int sum = 0;
        for (int i = 0; i < size; i += fspecies.length()) {
            VectorMask<Float> m = VectorMask.fromArray(fspecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromArrayDouble() {
        int sum = 0;
        for (int i = 0; i < size; i += dspecies.length()) {
            VectorMask<Double> m = VectorMask.fromArray(dspecies, mask_arr, i);
            sum += m.lastTrue();
        }
        return sum;
    }


    // Vector.compare + lastTrue

    @Benchmark
    public int testLastTrueCompareByte() {
        int sum = 0;
        for (int i = 0; i < size; i += bspecies.length()) {
            ByteVector v = ByteVector.fromArray(bspecies, byte_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueCompareShort() {
        int sum = 0;
        for (int i = 0; i < size; i += sspecies.length()) {
            ShortVector v = ShortVector.fromArray(sspecies, short_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueCompareInt() {
        int sum = 0;
        for (int i = 0; i < size; i += ispecies.length()) {
            IntVector v = IntVector.fromArray(ispecies, int_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueCompareLong() {
        int sum = 0;
        for (int i = 0; i < size; i += lspecies.length()) {
            LongVector v = LongVector.fromArray(lspecies, long_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueCompareFloat() {
        int sum = 0;
        for (int i = 0; i < size; i += fspecies.length()) {
            FloatVector v = FloatVector.fromArray(fspecies, float_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueCompareDouble() {
        int sum = 0;
        for (int i = 0; i < size; i += dspecies.length()) {
            DoubleVector v = DoubleVector.fromArray(dspecies, double_arr, i);
            sum += v.compare(VectorOperators.LT, 0).lastTrue();
        }
        return sum;
    }


    // VectorMask.indexInRange + lastTrue

    @Benchmark
    public int testLastTrueIndexInRangeByte() {
        int sum = 0;
        int limit = 0;
        VectorMask<Byte> m = VectorMask.fromArray(bspecies, mask_arr, 0);
        for (int i = 0; i < size; i += bspecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueIndexInRangeShort() {
        int sum = 0;
        int limit = 0;
        VectorMask<Short> m = VectorMask.fromArray(sspecies, mask_arr, 0);
        for (int i = 0; i < size; i += sspecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueIndexInRangeInt() {
        int sum = 0;
        int limit = 0;
        VectorMask<Integer> m = VectorMask.fromArray(ispecies, mask_arr, 0);
        for (int i = 0; i < size; i += ispecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueIndexInRangeLong() {
        int sum = 0;
        int limit = 0;
        VectorMask<Long> m = VectorMask.fromArray(lspecies, mask_arr, 0);
        for (int i = 0; i < size; i += lspecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueIndexInRangeFloat() {
        int sum = 0;
        int limit = 0;
        VectorMask<Float> m = VectorMask.fromArray(fspecies, mask_arr, 0);
        for (int i = 0; i < size; i += fspecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueIndexInRangeDouble() {
        int sum = 0;
        int limit = 0;
        VectorMask<Double> m = VectorMask.fromArray(dspecies, mask_arr, 0);
        for (int i = 0; i < size; i += dspecies.length()) {
            sum += m.indexInRange(0, limit++ % (m.length())).lastTrue();
        }
        return sum;
    }


    // VectorMask.fromLong + lastTrue

    @Benchmark
    public int testLastTrueFromLongByte() {
        int sum = 0;
        for (int i = 0; i < size; i += bspecies.length()) {
            VectorMask<Byte> m = VectorMask.fromLong(bspecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromLongShort() {
        int sum = 0;
        for (int i = 0; i < size; i += sspecies.length()) {
            VectorMask<Short> m = VectorMask.fromLong(sspecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromLongInt() {
        int sum = 0;
        for (int i = 0; i < size; i += ispecies.length()) {
            VectorMask<Integer> m = VectorMask.fromLong(ispecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromLongLong() {
        int sum = 0;
        for (int i = 0; i < size; i += lspecies.length()) {
            VectorMask<Long> m = VectorMask.fromLong(lspecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromLongFloat() {
        int sum = 0;
        for (int i = 0; i < size; i += fspecies.length()) {
            VectorMask<Float> m = VectorMask.fromLong(fspecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }

    @Benchmark
    public int testLastTrueFromLongDouble() {
        int sum = 0;
        for (int i = 0; i < size; i += dspecies.length()) {
            VectorMask<Double> m = VectorMask.fromLong(dspecies, i);
            sum += m.lastTrue();
        }
        return sum;
    }


    // VectorMask.fromArray + lastTrue & toLong
    // Before:
    //   LoadVector + VectorLoadMask + VectorMaskLastTrue
    //                               + VectorMaskToLong
    // After:
    //   LoadVector + VectorMaskLastTrue
    //              + VectorLoadMask + VectorMaskToLong
    //
    // Match rule of "LoadVector + VectorLoadMask" doesn't match since LoadVector is multi used.

    @Benchmark
    public long testMultiUsesFromArrayByte() {
        long sum = 0;
        for (int i = 0; i < size; i += bspecies.length()) {
            VectorMask<Byte> m = VectorMask.fromArray(bspecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesFromArrayShort() {
        long sum = 0;
        for (int i = 0; i < size; i += sspecies.length()) {
            VectorMask<Short> m = VectorMask.fromArray(sspecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesFromArrayInt() {
        long sum = 0;
        for (int i = 0; i < size; i += ispecies.length()) {
            VectorMask<Integer> m = VectorMask.fromArray(ispecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesFromArrayLong() {
        long sum = 0;
        for (int i = 0; i < size; i += lspecies.length()) {
            VectorMask<Long> m = VectorMask.fromArray(lspecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesFromArrayFloat() {
        long sum = 0;
        for (int i = 0; i < size; i += fspecies.length()) {
            VectorMask<Float> m = VectorMask.fromArray(fspecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesFromArrayDouble() {
        long sum = 0;
        for (int i = 0; i < size; i += dspecies.length()) {
            VectorMask<Double> m = VectorMask.fromArray(dspecies, mask_arr, i);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }


    // Vector.compare + lastTrue & toLong

    @Benchmark
    public long testMultiUsesCompareByte() {
        long sum = 0;
        for (int i = 0; i < size; i += bspecies.length()) {
            ByteVector v = ByteVector.fromArray(bspecies, byte_arr, i);
            VectorMask<Byte> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesCompareShort() {
        long sum = 0;
        for (int i = 0; i < size; i += sspecies.length()) {
            ShortVector v = ShortVector.fromArray(sspecies, short_arr, i);
            VectorMask<Short> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesCompareInt() {
        long sum = 0;
        for (int i = 0; i < size; i += ispecies.length()) {
            IntVector v = IntVector.fromArray(ispecies, int_arr, i);
            VectorMask<Integer> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesCompareLong() {
        long sum = 0;
        for (int i = 0; i < size; i += lspecies.length()) {
            LongVector v = LongVector.fromArray(lspecies, long_arr, i);
            VectorMask<Long> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesCompareFloat() {
        long sum = 0;
        for (int i = 0; i < size; i += fspecies.length()) {
            FloatVector v = FloatVector.fromArray(fspecies, float_arr, i);
            VectorMask<Float> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

    @Benchmark
    public long testMultiUsesCompareDouble() {
        long sum = 0;
        for (int i = 0; i < size; i += dspecies.length()) {
            DoubleVector v = DoubleVector.fromArray(dspecies, double_arr, i);
            VectorMask<Double> m = v.compare(VectorOperators.LT, 0);
            sum += m.lastTrue();
            sum += m.toLong();
        }
        return sum;
    }

}
```

To reproduce the crash, run the following test command:
```
make test TEST=micro:org.openjdk.bench.jdk.incubator.vector.MaskLastTrueBenchmark.*
```

[~epeter] Would you mind taking a look since the assert was introduced by https://github.com/openjdk/jdk/commit/2ac24bf1bac9c32704ebd72b93a75819b9404063, thanks!
Comments
A pull request was submitted for review. Branch: master URL: https://git.openjdk.org/jdk/pull/28651 Date: 2025-12-04 01:41:19 +0000
04-12-2025

I closed JDK-8372717 as a duplicate of this, because this issue contains more information and it's really a bug, not an enhancement.
02-12-2025

[~xgong] Yes, absolutely, we have some time. The fork of JDK26 is coming up quickly, but we still have many weeks to get a P3 bug quickly backported during rampdown.
28-11-2025

Thanks for your regression tests [~epeter]. I'd like to fix the LoadVector/StoreVector issues in JDK-8372717, as they are caused by the partial operations PR. I will start the patch next Monday. Is that fine? As for the others that might be missing optimizations because the superclass's Ideal is not called, I think we can fix them with another patch. WDYT?
28-11-2025

[~erfang] [~xgong] Can you two coordinate how to fix this issue? [~xgong] has now also filed JDK-8372717. I could also do it, but over QEMU it is very, very slow and annoying ;) We should make sure that at least the LoadVector case is fixed, so that we can fix the assert here for JDK26, as it is a JDK26 regression. The other issues we found, which are mere missing optimizations, can be fixed in JDK27 or later; they are less pressing. Let me know how you want to proceed. Feel free to take over this issue, or close it as a duplicate of JDK-8372717. If you do decide to take it over, I would appreciate it if all attached regression tests were in the PR, and I get tagged as contributor ;)
28-11-2025

And the following ones may also be bad patterns:
- RotateLeftVNode::Ideal
- RotateRightVNode::Ideal
- VectorUnboxNode::Ideal
- VectorLongToMaskNode::Ideal, specifically this if statement: `if (src->Opcode() != Op_VectorStoreMask) { return nullptr; }`
- FmaVNode::Ideal
- NegVNode::Ideal
28-11-2025

[~epeter] Nice work! Just confirmed that with the following changes, the crash disappeared:

```
diff --git a/src/hotspot/share/opto/vectornode.cpp b/src/hotspot/share/opto/vectornode.cpp
index bc79399900c..575cc7321da 100644
--- a/src/hotspot/share/opto/vectornode.cpp
+++ b/src/hotspot/share/opto/vectornode.cpp
@@ -1138,7 +1138,10 @@ LoadVectorNode* LoadVectorNode::make(int opc, Node* ctl, Node* mem,
 Node* LoadVectorNode::Ideal(PhaseGVN* phase, bool can_reshape) {
   const TypeVect* vt = vect_type();
   if (Matcher::vector_needs_partial_operations(this, vt)) {
-    return VectorNode::try_to_gen_masked_vector(phase, this, vt);
+    Node* res = VectorNode::try_to_gen_masked_vector(phase, this, vt);
+    if (res != nullptr) {
+      return res;
+    }
   }
   return LoadNode::Ideal(phase, can_reshape);
 }
```
28-11-2025

[~xgong] I now found a nice reproducer, where we fail to perform the LoadNode::Ideal optimization. See TestOptimizeLoadVector.java, where we get 2 LoadVector nodes instead of a single LoadVector in the final graph. I'll now investigate whether there are similar reproducers for the other cases of JDK-8286941. These are the cases I'll look at:
- LoadVectorNode::Ideal -> see TestOptimizeLoadVector.java
- StoreVectorNode::Ideal -> see TestOptimizeStoreVector.java
- VectorMaskOpNode::Ideal -> probably ok, but still a bad pattern
- ReductionNode::Ideal -> probably ok, but still a bad pattern
- VectorNode::Ideal -> probably ok, but still a bad pattern
28-11-2025

`Matcher::vector_needs_partial_operations(this, vt)` returns true for `LoadVectorNode` if the vector size exceeds 16 bytes. That is the case here, and the issue happens on 256-bit SVE. The condition is true, and `try_to_gen_masked_vector(phase, this, vt)` returns nullptr if the vector size equals the SVE vector register size. For these cases, it seems `LoadNode::Ideal` is skipped. So I think that's the issue, which is my fault. I just wonder why such cases did not crash before.
28-11-2025

The code was added in JDK-8286941. It was later refactored a bit, but the condition looks the same. [~xgong] Do you remember whether it was on purpose that we never call LoadNode::Ideal? The same pattern can be found for other nodes. If I understand correctly, the idea is that if a vector is too large for NEON, we cannot use NEON vectors and have to use SVE vectors. But those may only be available at the maximum vector length, and if we need something in between the NEON and maximum SVE size, then we need to mask off some lanes. But if it turns out that try_to_gen_masked_vector does nothing, we should be able to continue with LoadNode::Ideal. Honestly, I'm a bit surprised that this did not create other issues on 256-bit SVE, because we would be missing a lot of optimizations that way!
27-11-2025

Update: I was able to step through the SVE execution, and I finally noticed that the difference in the graph comes from here:

```
Node* LoadVectorNode::Ideal(PhaseGVN* phase, bool can_reshape) {
  const TypeVect* vt = vect_type();
  if (Matcher::vector_needs_partial_operations(this, vt)) {
    return VectorNode::try_to_gen_masked_vector(phase, this, vt);
  }
  return LoadNode::Ideal(phase, can_reshape);
}
```

And further from here:

```
bool Matcher::vector_needs_partial_operations(Node* node, const TypeVect* vt) {
  // Only SVE has partial vector operations
  if (UseSVE == 0) {
    return false;
  }
  ...
    case Op_LoadVector:
    case Op_StoreVector:
      // We use NEON load/store instructions if the vector length is <= 128 bits.
      return vt->length_in_bytes() > 16;
```

So it turns out that we need an SVE machine with vectors of more than 16 bytes, i.e. at least 256 bits. And that is exactly what happens on the 256-bit SVE Graviton 3 machine. Now we only call VectorNode::try_to_gen_masked_vector, which may or may not do anything. In this case, it does nothing and just returns nullptr. But that means we do not call LoadNode::Ideal, and we miss those optimizations. One of them is step_through_mergemem. On my AVX512 machine, we are able to eliminate the MergeMem and go directly to its input. But on this 256-bit SVE machine we do not step over the MergeMem, and so we keep the MergeMem in the graph. We have multiple LoadVector nodes, possibly with multiple different MergeMem nodes, but all those MergeMem nodes eventually end up with the same inputs. So when we do step over them on AVX512, we end up with the same memory inputs for all loads. But on 256-bit SVE we keep the different MergeMem nodes. I'll now investigate why we do this on 256-bit SVE; maybe it is just a basic logic error that we do not enter LoadNode::Ideal when VectorNode::try_to_gen_masked_vector returns null. Or maybe it is on purpose, and things get a bit more complicated; let's see.
27-11-2025

Still debugging, but I have made an observation: MergeMemNode::hash returns NO_HASH. This means that MergeMem nodes are not commoned, so it is not entirely surprising that we find different MergeMems (with identical inputs) where we expect identical memory state. I suspect that we return NO_HASH so that we do not common during GVN while parsing; otherwise incomplete memory states would already be commoned. But during IGVN we could probably safely common the MergeMems. Also: I have compared the graph on my AVX512 (MaxVectorSize=32) machine with the 256-bit SVE one. The graphs are very similar, but the memory states do look a little different. I'll keep investigating.
24-11-2025

Update: I have a stand-alone test now, though it needs to be reduced further.

```
/home/opc/qemu/build/qemu-aarch64 -cpu max,sve=on,sve256=on ./java -Xbatch -XX:-TieredCompilation -XX:CompileCommand=compileonly,Test*::test -XX:CompileCommand=printcompilation,Test*::test -esa Test1.java
CompileCommand: compileonly Test*.test bool compileonly = true
CompileCommand: PrintCompilation Test*.test bool PrintCompilation = true
test v1
172740  114 %  b  Test1::test @ 53 (188 bytes)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/opc/jdk-fork0/open/src/hotspot/share/opto/vectorization.cpp:231), pid=281451, tid=281467
#  assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
```
20-11-2025

Update: I was able to reduce Int256VectorTests.java a little; file attached. It still relies on quite a bit of test infrastructure, though. The next goal is to create a stand-alone test.
20-11-2025

[~shade][~erfang] I was just able to reproduce the issue using QEMU. What I did: ran Int256VectorTests.java on NEON, then copied the command line and ran it through:

```
qemu-aarch64 -cpu max,sve=on,sve256=on
```

And this got me:

```
# Internal Error (/home/opc/jdk-fork0/open/src/hotspot/share/opto/vectorization.cpp:231), pid=260743, tid=260759
#  assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
```

Now the true investigation can begin :) Thanks again [~shade] for reporting the additional failures; those seem to be much easier to reproduce! And thanks to [~erfang] for reporting the issue in the first place :)
20-11-2025

[~shade] Great, it should be easier to extract a small case from these tests.
20-11-2025

[~shade] Thanks a lot for these reports!
20-11-2025

Seeing these kinds of failures in jdk/incubator/vector on a Graviton 3 machine. 12 tests are failing, all in the same mode.

```
$ make test TEST=jdk/incubator/vector 2>&1 | tee vector.log
...
==============================
Test summary
==============================
   TEST                                   TOTAL  PASS  FAIL ERROR SKIP
>> jtreg:test/jdk/jdk/incubator/vector       83    67    12     0    4 <<
==============================
TEST FAILURE

$ grep ^TEST: vector.log | nl
     1  TEST: jdk/incubator/vector/Double256VectorTests.java
     2  TEST: jdk/incubator/vector/Byte256VectorTests.java
     3  TEST: jdk/incubator/vector/DoubleMaxVectorTests.java
     4  TEST: jdk/incubator/vector/ByteMaxVectorTests.java
     5  TEST: jdk/incubator/vector/Float256VectorTests.java
     6  TEST: jdk/incubator/vector/FloatMaxVectorTests.java
     7  TEST: jdk/incubator/vector/Int256VectorTests.java
     8  TEST: jdk/incubator/vector/IntMaxVectorTests.java
     9  TEST: jdk/incubator/vector/Long256VectorTests.java
    10  TEST: jdk/incubator/vector/LongMaxVectorTests.java
    11  TEST: jdk/incubator/vector/Short256VectorTests.java
    12  TEST: jdk/incubator/vector/ShortMaxVectorTests.java

$ find build/ -iname hs_err\* -exec grep -H assert {} \; | grep failed | nl
     1  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Double256VectorTests/hs_err_pid183875.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     2  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Byte256VectorTests/hs_err_pid184298.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     3  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/DoubleMaxVectorTests/hs_err_pid184357.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     4  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/ByteMaxVectorTests/hs_err_pid184455.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     5  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Float256VectorTests/hs_err_pid184715.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     6  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/FloatMaxVectorTests/hs_err_pid185045.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     7  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Int256VectorTests/hs_err_pid185209.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     8  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/IntMaxVectorTests/hs_err_pid185505.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
     9  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Long256VectorTests/hs_err_pid185702.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
    10  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/LongMaxVectorTests/hs_err_pid186028.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
    11  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/Short256VectorTests/hs_err_pid186533.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
    12  build/linux-aarch64-server-fastdebug/test-support/jtreg_test_jdk_jdk_incubator_vector/jdk/incubator/vector/ShortMaxVectorTests/hs_err_pid186844.log:# assert(_inputs.at(alias_idx) == nullptr || _inputs.at(alias_idx) == load->in(1)) failed: not yet touched or the same input
```

Sample hs_err attached as hs_err_pid186844.log.
20-11-2025

I'm considering writing a more general verification method that checks that all slices have a unique memory state before a Loop. Maybe that can turn up some other examples and provide a path to a reproducer of the SuperWord issue.
17-11-2025

[~erfang]
> Just wondering if we can slightly improve this condition

In theory maybe yes, but I can only think of hacky ways here. I really do need to find a way to reproduce the issue on my side anyway, just to see how this came about. Ideally we would have a reproducer that can also run on other platforms (aarch64 NEON and x64).
14-11-2025

This is a bit frustrating. I can't even get the method that has the compilation failure to compile, even without the emulator:

```
/home/opc/jdk-fork0/build/linux-aarch64-debug/jdk/bin/java -XX:MaxVectorSize=32 -XX:+TraceNewVectors -XX:CompileCommand=printcompilation,*::*testLastTrueCompareInt* -XX:CompileCommand=compileonly,*::*testLastTrueCompareInt* -Xcomp -jar /home/opc/jdk-fork0/build/linux-aarch64-debug/images/test/micro/benchmarks.jar org.openjdk.bench.jdk.incubator.vector.MaskLastTrueBenchmark.testLastTrueCompareInt
```

It never shows the compilation of *::*testLastTrueCompareInt*, despite compileonly and -Xcomp. Obviously, I'm doing something wrong here.
14-11-2025

Loading the replay file sadly lands me in an assert:

```
/home/opc/jdk-fork0/build/linux-aarch64-debug/jdk/bin/java -XX:+ReplayCompiles -XX:+ReplayIgnoreInitErrors -XX:ReplayDataFile=r.log -jar /home/opc/jdk-fork0/build/linux-aarch64-debug/images/test/micro/benchmarks.jar
Resolving klass jdk/internal/vm/vector/VectorSupport$VectorMask at 20
Resolving klass jdk/incubator/vector/Int256Vector$Int256Mask at 12
Error while parsing line 2040 at position 105: Can't find method
Error while parsing line 2041 at position 99: Can't find method
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/opc/jdk-fork0/open/src/hotspot/share/ci/ciKlass.hpp:60), pid=4064290, tid=4064306
# assert(k != nullptr) failed: illegal use of unloaded klass
```
14-11-2025

[~erfang] Thanks for the graph dump! Ah interesting, it really does seem like the two memory states "1517 MergeMem" and "1809 MergeMem" have the same inputs, and so they represent the same memory state.

```
_inputs.at(alias_idx) = 1517 MergeMem === _ 1 206 207 [[ 1596 1860 ]] { N207:rawptr:BotPTR } Memory: @BotPTR *+bot, idx=Bot; !orig=[316] !jvms: Int256Vector$Int256Mask::lastTrue @ bci:16 (line 750) MaskLastTrueBenchmark::testLastTrueCompareInt @ bci:33 (line 167) MaskLastTrueBenchmark_testLastTrueCompareInt_jmhTest::testLastTrueCompareInt_thrpt_jmhStub @ bci:17 (line 121)

load->in(1) = 1809 MergeMem === _ 1 206 207 [[ 1798 1863 ]] { N207:rawptr:BotPTR } Memory: @BotPTR *+bot, idx=Bot; !orig=1517,[316] !jvms: Int256Vector$Int256Mask::lastTrue @ bci:16 (line 750) MaskLastTrueBenchmark::testLastTrueCompareInt @ bci:33 (line 167) MaskLastTrueBenchmark_testLastTrueCompareInt_jmhTest::testLastTrueCompareInt_thrpt_jmhStub @ bci:17 (line 121)
```

It even seems like one is the clone of the other: "1809" has "!orig=1517,[316]", so it must have been cloned from "1517". At least this tells me that they both produce equivalent memory states, so swapping one for the other does not lead to correctness issues. I do wonder why they were not commoned, because IGVN would do that, I think. Can you run with -XX:VerifyIterativeGVN=1110, to see if this is a missed opportunity? It is possible that "1809" was only just cloned in the same loopopts phase, and we did not yet have a chance to common it.

And thanks for the tips about the flags etc. I'm struggling a bit with the benchmark because the emulator leads to drastically different timing, which affects what methods get compiled at what time, and with what profiling info. Maybe I'll have more luck later, perhaps using the replay file.
14-11-2025

Hi [~epeter] I have uploaded two files "igv-dump.txt" and "tmp.xml" which contain the node information. Please check them. From the ideal graph we can see that "1517 MergeMem" and "1809 MergeMem" really are two different nodes, but they have exactly the same inputs. So are they aliases, i.e. different nodes for the same memory state? As for the CPU flags: when I ran the test I didn't set any flags, so all flags were at their defaults. This is an AWS Graviton3 machine with 256-bit SVE support, so to emulate this machine you may have to set "-XX:MaxVectorSize=32"; on a 512-bit x86 machine, QEMU emulates 512-bit SVE by default. There is detailed machine information in the attached hs_err_pid144663.log.
14-11-2025

Hi [~epeter], after using `-XX:VerifyIterativeGVN=1110`, no more information is printed, because the program crashes before printing any additional output.

> At least this tells me that they both produce equivalent memory states, and so swapping one for the other does not lead to correctness issues.

Just wondering if we can slightly improve this condition `_inputs.at(alias_idx) == load->in(1)`, for example by checking whether they are aliased or cloned nodes with the same memory state, rather than requiring completely identical nodes? I'm not familiar with the logic here, just a guess.
14-11-2025

[~erfang] Do you know what exact "-cpu" flag I need to feed into QEMU to simulate the machine that reproduced the issue? This here did not reproduce:

```
/home/opc/qemu/build/qemu-aarch64 -cpu max,sve=on /home/opc/jdk-fork0/build/linux-aarch64-debug/jdk/bin/java -jar /home/opc/jdk-fork0/build/linux-aarch64-debug/images/test/micro/benchmarks.jar org.openjdk.bench.jdk.incubator.vector.MaskLastTrueBenchmark.testLastTrueCompareInt
```

But it is probably missing the 256-bit part.
13-11-2025

[~erfang] Would it be possible for you to print the graph at the time of the assert? Just dump all nodes. And also point out which one is the "load" from the assert. That would help me see what assumption was wrong. I'll still try to emulate the machine, let's see if I can get it all set up.
13-11-2025

Hi [~epeter], it won't take that long; it crashes when running the first benchmark. We can debug it with this command:

```
java -jar build/linux-aarch64-server-fastdebug/images/test/micro/benchmarks.jar org.openjdk.bench.jdk.incubator.vector.MaskLastTrueBenchmark.testLastTrueCompareInt
```
12-11-2025

[~erfang] Ok, thanks for trying to reduce the JMH benchmark. And thanks for the extra files. Maybe someone will have some luck on an emulator. It's a bit tricky because the benchmark currently seems to take quite a long time to run, maybe around 10 min. Not sure how long that will take on the emulator.
12-11-2025

> I ran the benchmark on my avx512 machine, and could not reproduce :/

Yeah, I can't reproduce it on my 128-bit SVE machine either. I also tried to reproduce it with "-XX:MaxVectorSize=32" on a 512-bit x86 machine, with no luck either. Maybe it only exists on 256-bit SVE machines. I don't have access to a 512-bit SVE machine, so I'm not sure whether the problem exists there.
12-11-2025

[~epeter] [~thartmann] I have attached the hs_err_pid144663.log and replay_pid144663.log files. I tried extracting a small case from the JMH test, but without success. I've been busy with other things, so I haven't had much time to investigate this issue in depth. If you have any leads, I can help verify them. Thanks!
12-11-2025

Please also share the replay_pid* file.
12-11-2025

ILW = Assert during C2 compilation in SuperWord (regression), single benchmark on AArch64 with 256-bit SVE support, disable SuperWord or compilation of affected method = HLM = P3
12-11-2025

I ran the benchmark on my avx512 machine, and could not reproduce :/
11-11-2025

[~erfang] Thanks for the report! I don't have access to any SVE machine sadly. But let's try to dig into it together. Optimal would be if we could reduce the test a bit, and run it without JMH. Also: can you please drop the hs_err file, so I get a bit more information about how it fails?
11-11-2025