JDK-8336000 : C2 SuperWord: report that 2-element reductions do not vectorize
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 21,22
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: aarch64
  • Submitted: 2024-07-09
  • Updated: 2025-05-23
Description
As described in the thread below, the issue is with 2-element reductions. SuperWord currently never deems them profitable, and on some platforms they may indeed not be profitable.

The reported example first assumed that the issue was with bitCount, but it is really an issue with 2-element reductions not being vectorized. We get a 2-element reduction because MaxVectorSize is 16 bytes on the aarch64 machine, so long vectors can only hold 2 elements.

Relates to JDK-8307516, and may be fixed with JDK-8340093.
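For concreteness, the affected shape is a sum reduction over Long.bitCount (this mirrors the test4 example from the comments below); with 16-byte vectors a long vector holds only 2 elements, and 2-element int/long reductions are currently rejected:

    // Sum reduction over Long.bitCount. With MaxVectorSize=16, long vectors hold
    // only 2 elements, and SuperWord::implemented rejects 2-element int/long
    // reductions as unprofitable (see the superword.cpp excerpt in the comments).
    static int test4(long[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Long.bitCount(a[i]);
        }
        return sum;
    }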

---------------------- Original Description -------------------------

Contrary to intuition, Integer::bitCount performs better than Long::bitCount on the same underlying data on AArch64.

Summary

The issue is that Long::bitCount does not vectorize, while Integer::bitCount uses NEON SIMD instructions to operate on groups of four 32-bit ints at a time. This is counterintuitive, as one would expect striding at a wider bit-width to be more performant.

Linux x64 shows no such issue - performance of int and long bit counts is identical.

Background

We ran into this anomaly when implementing hamming distance between two bit vectors. The values for each bit vector are packed into a byte[], where each bit represents a feature of the vector. We then chose to view the vector data as longs, via MethodHandles::byteArrayViewVarHandle(long[].class, ..), in order to perform an xor between the vectors and count the set bits. Surprisingly, an otherwise identical implementation using MethodHandles::byteArrayViewVarHandle(int[].class, ..) performs significantly better on ARM.
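A minimal sketch of that pattern (the class and method names here are illustrative, not the actual Lucene code; it assumes equal-length inputs whose length is a multiple of 8 bytes):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.nio.ByteOrder;

    class HammingSketch {
        static final VarHandle LONG_VIEW =
                MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

        // XOR the packed bit vectors 8 bytes at a time and count the set bits.
        static int hammingDistance(byte[] a, byte[] b) {
            int distance = 0;
            for (int i = 0; i < a.length; i += Long.BYTES) {
                long x = (long) LONG_VIEW.get(a, i);
                long y = (long) LONG_VIEW.get(b, i);
                distance += Long.bitCount(x ^ y);
            }
            return distance;
        }
    }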

There may be more going on between our implementation of hamming distance and plain use of bitCount, but it's likely that this is the root cause for the performance differential that we see. Additionally, it is useful to analyse the plain use of bitCount independently (of our hamming distance impl).

Long::bitCount - unrolling appears to almost "just" stamp out the non-unrolled loop body 8 times. So, one unrolled iteration operates on 8 64-bit long values at a time, which covers 512 bits per loop iteration.

Integer::bitCount - unrolling vectorizes. So, one unrolled iteration operates on 32 32-bit int values at a time - effectively loading 4 32-bit int values into a single 128-bit register. And it does this 8 times, which covers 1024 bits per loop iteration.

The int variant vectorizes (uses NEON SIMD) when it unrolls, whereas the long variant does not. The result is that the unrolled int variant processes twice as many values per iteration of the unrolled loop, when compared to that of the long variant.

Trivial benchmark:

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class BitCountBenchmark {
    @Param({ "100000" })
    int times = 100_000;

    @Param({ "1024" })
    int size = 1024;

    long[][] la;
    int[][] ia;

    @Setup
    public void setup() {
        Random rand = new Random();
        this.ia = new int[times][size];
        this.la = new long[times][size / 2];
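        // Pack two random ints into each long so both arrays contain identical bits.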
        for (int i = 0; i < times; i++) {
            for (int j = 0; j < size / 2; j++) {
                int x = rand.nextInt();
                int y = rand.nextInt();
                ia[i][j * 2] = x;
                ia[i][j * 2 + 1] = y;
                la[i][j] = (((long)x) << 32) | (y & 0xffffffffL);
            }
        }

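        // Sanity check: both representations must yield the same total bit count.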
        if (bitCountLong() != bitCountInt()) {
            throw new AssertionError();
        }
    }

    @Benchmark
    public int bitCountLong() {
        int tot = 0;
        for (int i = 0; i < times; i++) {
            tot += longBitCount(la[i]);
        }
        return tot;
    }

    @Benchmark
    public int bitCountInt()  {
        int tot = 0;
        for (int i = 0; i < times; i++) {
            tot += intBitCount(ia[i]);
        }
        return tot;    
    }

    static int longBitCount(long[] la) {
        int distance = 0;
        for (int i = 0;  i < la.length;  i++) {
            distance += Long.bitCount(la[i]);
        }
        return distance;
    }

    static int intBitCount(int[] ia) {
        int distance = 0;
        for (int i = 0; i < ia.length;  i++) {
            distance += Integer.bitCount(ia[i]);
        }
        return distance;
    }
}
Comments
[~uschindler] You can of course work around the compiler and its limitations if you think that is worth it for you. I hope to improve the compiler with JDK 24 or JDK 25 so that this case would vectorize as expected. But I cannot make promises ;)
12-07-2024

So maybe we should add a condition in Lucene based on the vector size. We have the framework available (in Constants.java): https://github.com/apache/lucene/blob/cc854f408fd06553de4780b1874992c16b71f26e/lucene/core/src/java/org/apache/lucene/util/Constants.java#L92-L94

This is unrelated to this issue, but maybe we should change the condition in our code to use the above constant >= XY and only use longs when that is OK (see the sketch after this comment)? If the runtime is non-HotSpot the constant will be 0 and we would use the 32-bit path (which is fine, as it is less risky for a performance regression). Thanks @Emanuel for giving us enough background information.
11-07-2024
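A hedged sketch of the guard suggested in the comment above; the constant name and the threshold are hypothetical placeholders, not the actual fields in Lucene's Constants.java:

    final class BitCountPathChooser {
        // Stand-in for a detected JVM vector size constant; per the comment
        // above it would be 0 on a non-HotSpot runtime.
        static final int VECTOR_SIZE_BYTES = 16; // hypothetical placeholder

        // Use the long-striding path only when a vector holds more than two
        // longs (>= 32 bytes); otherwise fall back to the int path, which also
        // covers the non-HotSpot case where the constant is 0.
        static boolean useLongStride() {
            return VECTOR_SIZE_BYTES >= 32;
        }
    }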

> I think this judgement may be wrong, i.e. that 2-element-reductions for int/long are sometimes actually performance profitable. But it may not always be the case.

Thanks for finding this. It would certainly seem the case here that the reduction is worthwhile, but as you say it may not generally be the case.

> I hope to address this soon, with JDK-8307516 or other related RFE's. But as such, I think this is more of an RFE, not a BUG.

That sounds fine to me.

> It also does not relate to Aarch64, the same effect can be observed on x64.

Hmm... I've not looked at the generated code on x64, just the results of a JMH benchmark, and Long::bitCount seems faster than Integer::bitCount. But maybe it depends on the architecture and MaxVectorSize, which is 64 on my AVX-512 machine.
11-07-2024

[~chegar]

> > It also does not relate to Aarch64, the same effect can be observed on x64.
>
> Hmm... I've not looked at the generated code on x64, just the results of a JMH benchmark, and Long::bitCount seems faster than Integer::bitCount. But maybe it depends on the architecture and MaxVectorSize, which is 64 on my AVX-512 machine.

Yes, it depends on MaxVectorSize. If you set MaxVectorSize=16, you will see the same results. But your machine probably has 32 or even 64 bytes, as is more standard these days. So with 32 bytes you can make 4-long vectors, and with 64 bytes 8-long vectors. Those are then obviously not 2-element vectors, and so the heuristic might deem them profitable.
10-07-2024

> It is indeed the reduction, I think. Can you figure out what MaxVectorSize is on your aarch64 machine?

    $ /Users/chegar/binaries/jdk-22.0.1.jdk/Contents/Home/bin/java -XX:+PrintFlagsFinal -version | grep MaxVectorSize
    intx MaxVectorSize = 16    {C2 product} {default}
10-07-2024

I added this report to my long list of issues in JDK-8317424. I hope to address this soon. The goal is to cost-model reductions better, rather than have such a rigid rule like "2-element-reductions for int/long are never performance profitable".
09-07-2024

[~chegar] I did some investigating. See my BitCount.java. It is indeed the reduction, I think. Can you figure out what MaxVectorSize is on your aarch64 machine? If it is indeed only 16 bytes, then you must encounter the same as I do. The problem is that we have long-instructions; thus, we can only have 2 per vector. This is fine for normal loops like this:

    static void test2(long[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Long.bitCount(a[i]);
        }
    }

But not if we also have a reduction with only 2 elements, like this:

    static int test4(long[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Long.bitCount(a[i]);
        }
        return sum;
    }

The reason is apparent in this code from superword.cpp:

    // Can code be generated for the pack, restricted to size nodes?
    bool SuperWord::implemented(const Node_List* pack, const uint size) const {
      assert(size >= 2 && size <= pack->size() && is_power_of_2(size), "valid size");
      bool retValue = false;
      Node* p0 = pack->at(0);
      if (p0 != nullptr) {
        int opc = p0->Opcode();
        if (is_marked_reduction(p0)) {
          const Type* arith_type = p0->bottom_type();
          // Length 2 reductions of INT/LONG do not offer performance benefits
          if (((arith_type->basic_type() == T_INT) ||
               (arith_type->basic_type() == T_LONG)) && (size == 2)) {
            retValue = false;
          } else {
            retValue = ReductionNode::implemented(opc, size, arith_type->basic_type());
          }

I think this judgement may be wrong, i.e. that 2-element-reductions for int/long are sometimes actually performance profitable. But it may not always be the case. I hope to address this soon, with JDK-8307516 or other related RFE's. But as such, I think this is more of an RFE, not a BUG. It also does not relate to Aarch64; the same effect can be observed on x64. Still: thank you for reporting. I will look more at these examples in the future.
09-07-2024

Ah, this could also be an issue with reductions. Investigating...
09-07-2024

[~fgao] Knows more here?
09-07-2024

[~chegar] Thanks for reporting this. Looking through our "IR" tests, I find this:

    grep bitCount test/hotspot/jtreg/compiler/vectorization/ -r
    test/hotspot/jtreg/compiler/vectorization/runner/BasicIntOpTest.java:    res[i] = Integer.bitCount(a[i]);
    test/hotspot/jtreg/compiler/vectorization/TestPopCountVector.java:        output[i] = Integer.bitCount(input[i]);
    test/hotspot/jtreg/compiler/vectorization/TestPopCountVector.java:        int expected = Integer.bitCount(input[i]);
    test/hotspot/jtreg/compiler/vectorization/TestPopCountVectorLong.java:    output[i] = Long.bitCount(input[i]);
    test/hotspot/jtreg/compiler/vectorization/TestPopCountVectorLong.java:    int expected = Long.bitCount(input[i]);

-> these are actually restricted to x64, so aarch64 is not included.

    emanuel@emanuel-oracle:/oracle-work/jdk-fork2/open$ grep bitCount test/hotspot/jtreg/compiler/loopopts/superword/ -r
    test/hotspot/jtreg/compiler/loopopts/superword/TestGeneralizedReductions.java:    acc += Long.bitCount(array[i]);

-> also this one is restricted to AVX2, so not expected on aarch64 with asimd (i.e. NEON).

In C2, we have these intrinsics:

    src/hotspot/share/opto/library_call.cpp:  case vmIntrinsics::_bitCount_i:  n = new PopCountINode(arg);  break;
    src/hotspot/share/opto/library_call.cpp:  case vmIntrinsics::_bitCount_l:  n = new PopCountLNode(arg);  break;

These can get translated to vector nodes:

    src/hotspot/share/opto/vectornode.cpp:  case Op_PopCountI:
    src/hotspot/share/opto/vectornode.cpp:    return Op_PopCountVI;
    src/hotspot/share/opto/vectornode.cpp:  case Op_PopCountL:
    src/hotspot/share/opto/vectornode.cpp:    return Op_PopCountVL;

For SuperWord, our AutoVectorizer, this is the method that determines if a vector-instruction is available on the platform: Matcher::match_rule_supported_auto_vectorization

On x86, in src/hotspot/cpu/x86/x86.ad:

    bool Matcher::match_rule_supported_auto_vectorization(int opcode, int vlen, BasicType bt) {
      return match_rule_supported_vector(opcode, vlen, bt);
    }

Looking at src/hotspot/cpu/x86/x86.ad:

    bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt) {
      ...
      case Op_PopCountVI:
      case Op_PopCountVL: {
        if (!is_pop_count_instr_target(bt) &&
            (size_in_bits == 512) && !VM_Version::supports_avx512bw()) {
          return false;
        }
      }
      break;

And there are corresponding instructions in the backend (src/hotspot/cpu/x86/x86.ad):

    instruct vpopcount_integral_reg_evex(vec dst, vec src) %{
      predicate(is_vector_popcount_predicate(Matcher::vector_element_basic_type(n->in(1))));
      match(Set dst (PopCountVI src));
      match(Set dst (PopCountVL src));

But looking at aarch64, src/hotspot/cpu/aarch64/aarch64_vector.ad:

    bool Matcher::match_rule_supported_auto_vectorization(int opcode, int vlen, BasicType bt) {
      if (UseSVE == 0) {
        // These operations are not profitable to be vectorized on NEON, because no direct
        // NEON instructions support them. But the match rule support for them is profitable for
        // Vector API intrinsics.
        if ((opcode == Op_VectorCastD2X && bt == T_INT) ||
            (opcode == Op_VectorCastL2X && bt == T_FLOAT) ||
            (opcode == Op_CountLeadingZerosV && bt == T_LONG) ||
            (opcode == Op_CountTrailingZerosV && bt == T_LONG) ||
            // The implementations of Op_AddReductionVD/F in Neon are for the Vector API only.
            // They are not suitable for auto-vectorization because the result would not conform
            // to the JLS, Section Evaluation Order.
            opcode == Op_AddReductionVD || opcode == Op_AddReductionVF ||
            opcode == Op_MulReductionVD || opcode == Op_MulReductionVF ||
            opcode == Op_MulVL) {
          return false;
        }
      }
      return match_rule_supported_vector(opcode, vlen, bt);
    }
    ...
    bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType bt) {

It looks like it should be supported by default. I will have to look at this in the debugger... Because the instruction is implemented:

    instruct vpopcountL(vReg dst, vReg src) %{
      match(Set dst (PopCountVL src));
      format %{ "vpopcountL $dst, $src" %}
      ins_encode %{
        if (UseSVE == 0) {
          __ cnt($dst$$FloatRegister, __ T16B, $src$$FloatRegister);
          __ uaddlp($dst$$FloatRegister, __ T16B, $dst$$FloatRegister);
          __ uaddlp($dst$$FloatRegister, __ T8H, $dst$$FloatRegister);
          __ uaddlp($dst$$FloatRegister, __ T4S, $dst$$FloatRegister);
        } else {
          __ sve_cnt($dst$$FloatRegister, __ D, ptrue, $src$$FloatRegister);
        }
      %}
      ins_pipe(pipe_slow);
    %}

Maybe this was an oversight, or it is simply not profitable... idk yet.
09-07-2024

Reproducer and more details here: https://github.com/ChrisHegarty/hammingBench/

We work around this in Lucene by striding over the data as an int[] on ARM. See https://github.com/apache/lucene/pull/13545

Somewhat relates to JDK-8283749.
09-07-2024