JDK-8329077 : C2 SuperWord: MoveF2I cases do not vectorize
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 23,24,25
  • Priority: P4
  • Status: In Progress
  • Resolution: Unresolved
  • Submitted: 2024-03-26
  • Updated: 2025-07-04
Description
This came up in:
https://mail.openjdk.org/pipermail/panama-dev/2024-March/020327.html

"But then why does this one:

    public void scalarSegmentArray(Data state) {
        final MemorySegment input = state.inputSegment;
        final double[] output = state.outputArray;
        for(int i = 0; i < SIZE; i++) {
            output[i] += input.getAtIndex(JAVA_DOUBLE, i);
        }
    }

not vectorize? input and output can't overlap because one is off heap
and the other is on heap. It seems for doubles the MemorySegment API
reads a double in 2 steps: use getLongUnaligned() and then convert the
result to double with Double.longBitsToDouble(). The vectorizer doesn't
support vectorization of that long to double move."

"MemorySegment reads double in the same way as other (pre-existing) 
VarHandles - e.g. as longs which are then converted to doubles (w/o 
using Unsafe::getDouble directly)."
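The two-step double read described above can be sketched with plain JDK calls. This is only an illustration of the bit-level round trip (raw long bits, then reinterpret as double), not the actual VarHandle implementation, which uses Unsafe::getLongUnaligned internally:

```java
public class DoubleReadDemo {
    public static void main(String[] args) {
        // Step 1: obtain the raw 64 bits as a long (the VarHandle reads these
        // from memory with getLongUnaligned; here we just fabricate the bits).
        long bits = Double.doubleToRawLongBits(1.5);

        // Step 2: reinterpret the long bits as a double (MoveL2D in the C2 IR).
        double d = Double.longBitsToDouble(bits);

        System.out.println(Long.toHexString(bits)); // 3ff8000000000000
        System.out.println(d);                      // 1.5
    }
}
```

It is this second step, the long-to-double reinterpretation, that the vectorizer currently rejects.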

"I prefer to see vectorizer handling "MoveX2Y (LoadX mem)" 
=> "VectorReinterpret (LoadVector mem)" well and then introduce rules to 
strength-reduce it to mismatched access."
Comments
FYI I'm currently working on the implementation
04-07-2025

[~galder] Are you still interested in this issue?
02-06-2025

[~galder] I changed the title from "C2: MemorySegment double accesses don't vectorize" to "C2 SuperWord: MoveF2I cases do not vectorize", because it is not really about MemorySegment but about core SuperWord functionality, as the Test.java shows.
07-01-2025

[~roland] I had asked you on Slack on April 23, without any response. You can keep it, if you want.
07-05-2024

Well wait, that actually was no surprise: ConvF2I actually converts the numbers, but MoveF2I just reinterprets them!

Ok, what are the corresponding Vector operations? For ConvF2I, we pick it in VectorCastNode::opcode. Maybe I would have to make a special exception there, so that I am not producing a VectorCastF2X, but a VectorReinterpret.

The only tricky thing is that VectorCastNode::implemented only seems to pass the dst_type to Matcher::match_rule_supported_auto_vectorization, and not the src_type. Not sure what to do about that... Maybe I have to create a new "VectorReinterpretNode::opcode / ::implemented" method, in parallel to those for the casts.

How can we check if they are implemented? I see some checks in src/hotspot/share/opto/vectorIntrinsics.cpp:

    arch_supports_vector(Op_VectorReinterpret, mem_num_elem, mem_elem_bt, VecMaskNotUsed)

I'd have to look more closely at how that one gets processed...
07-05-2024

Hoping it's ok is definitely not enough. Asking is the minimum. So no, assigning to yourself a bug that was assigned to someone else, without any form of prior discussion, is not ok. I'm not sure what made you think this was not actively being worked on. Please don't do that. In this particular case, though, we hadn't spent a lot of time on our side working on it (but we could have), so you can keep it.
07-05-2024

Looking into x64 AD files:

    instruct convF2I_reg_reg(rRegI dst, regF src, rFlagsReg cr) %{
      match(Set dst (ConvF2I src));
      effect(KILL cr);
      format %{ "convert_f2i $dst, $src" %}
      ins_encode %{
        __ convertF2I(T_INT, T_FLOAT, $dst$$Register, $src$$XMMRegister);
      %}
      ins_pipe(pipe_slow);
    %}

    instruct MoveF2I_stack_reg(rRegI dst, stackSlotF src) %{
      match(Set dst (MoveF2I src));
      effect(DEF dst, USE src);
      ins_cost(125);
      format %{ "movl $dst, $src\t# MoveF2I_stack_reg" %}
      ins_encode %{
        __ movl($dst$$Register, Address(rsp, $src$$disp));
      %}
      ins_pipe(ialu_reg_mem);
    %}

    instruct MoveF2I_reg_stack(stackSlotI dst, regF src) %{
      match(Set dst (MoveF2I src));
      effect(DEF dst, USE src);
      ins_cost(95); // XXX
      format %{ "movss $dst, $src\t# MoveF2I_reg_stack" %}
      ins_encode %{
        __ movflt(Address(rsp, $dst$$disp), $src$$XMMRegister);
      %}
      ins_pipe(pipe_slow);
    %}

    instruct MoveF2I_reg_reg(rRegI dst, regF src) %{
      match(Set dst (MoveF2I src));
      effect(DEF dst, USE src);
      ins_cost(85);
      format %{ "movd $dst,$src\t# MoveF2I" %}
      ins_encode %{
        __ movdl($dst$$Register, $src$$XMMRegister);
      %}
      ins_pipe( pipe_slow );
    %}

Well, ConvF2I actually does a call to C2_MacroAssembler::convertF2I, which is not just a simple no-op: it checks the value and does some cleanup. If I read the stubs right, this is for the NaN values. But I think MoveF2I directly does the move, with no cleanup... so these have different semantics.
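The semantic difference is visible at the Java level, too: ConvF2I backs the (int) cast, which converts the value and maps NaN to 0 (hence the extra check-and-cleanup in convertF2I), while MoveF2I backs Float.floatToRawIntBits, which only reinterprets the bits. A minimal sketch:

```java
public class ConvVsMove {
    public static void main(String[] args) {
        float f = -1.5f;
        // ConvF2I: arithmetic conversion, truncates toward zero.
        System.out.println((int) f); // -1
        // MoveF2I: raw bit pattern of -1.5f (sign, exponent, mantissa).
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(f))); // bfc00000
        // NaN shows why ConvF2I needs the extra checks: the cast maps NaN to 0.
        System.out.println((int) Float.NaN); // 0
    }
}
```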
07-05-2024

Trying to figure out what the difference is between "MoveF2I" and "ConvF2I" and their analogues for I/L and F/D. Use of "MoveF2I" is quite limited:

    case vmIntrinsics::_floatToRawIntBits:
      result = new MoveF2INode(arg);
      break;
    case vmIntrinsics::_floatToIntBits: {
      // ... adds some branching for the NaN ...
      phi->init_req(2, _gvn.transform(new MoveF2INode(arg)));

    Node* MoveI2FNode::Identity(PhaseGVN* phase) {
      if (in(1)->Opcode() == Op_MoveF2I) {
        return in(1)->in(1);
      }
      return this;
    }

    const Type* MoveF2INode::Value(PhaseGVN* phase) const {
      const Type *t = phase->type( in(1) );
      if( t == Type::TOP ) return Type::TOP;
      if( t == Type::FLOAT ) return TypeInt::INT;
      const TypeF *tf = t->is_float_constant();
      JavaValue v;
      v.set_jfloat(tf->getf());
      return TypeInt::make( v.get_jint() );
    }

    bool LibraryCallKit::inline_vector_reduction() {
      ...
      switch (elem_bt) {
        case T_BYTE:
        case T_SHORT:
        case T_INT: {
          bits = gvn().transform(new ConvI2LNode(value));
          break;
        }
        case T_FLOAT: {
          value = gvn().transform(new MoveF2INode(value));
          bits = gvn().transform(new ConvI2LNode(value));
          break;
        }
        case T_DOUBLE: {
          bits = gvn().transform(new MoveD2LNode(value));
          break;
        }

    bool PhaseIdealLoop::intrinsify_fill(IdealLoopTree* lpt) {
      ...
      // Convert float/double to int/long for fill routines
      if (t == T_FLOAT) {
        store_value = new MoveF2INode(store_value);
        _igvn.register_new_node_with_optimizer(store_value);
      } else if (t == T_DOUBLE) {
        store_value = new MoveD2LNode(store_value);
        _igvn.register_new_node_with_optimizer(store_value);
      }

So far, it almost looks like these nodes basically do the same thing... why do we have them? Let's also look into the backend, to see what they produce.
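The NaN branching that _floatToIntBits adds around the MoveF2I node corresponds to the spec of Float.floatToIntBits, which collapses every NaN to the canonical bit pattern, while floatToRawIntBits is a pure bit move. A small sketch of the observable difference:

```java
public class NanBits {
    public static void main(String[] args) {
        // floatToIntBits canonicalizes all NaNs to 0x7FC00000 -- this is
        // the extra branching the intrinsic wraps around MoveF2I.
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.NaN))); // 7fc00000
        // For non-NaN values the two methods agree; the raw variant is
        // a bare MoveF2I with no NaN check.
        System.out.println(Float.floatToIntBits(2.0f) == Float.floatToRawIntBits(2.0f)); // true
    }
}
```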
07-05-2024

I'll work on this on the side, hope that is ok.

    ./java -XX:CompileOnly=Test::test5 -Xbatch -XX:+TraceLoopOpts -XX:CompileCommand=TraceAutoVectorization,Test::test*,SW_REJECTIONS -XX:+TraceNewVectors Test.java
07-05-2024

Here is an extract of some IR tests I have in a draft elsewhere, feel free to use it here:

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // In theory, one would expect this to be a simple 4byte -> 4byte conversion.
    // But there is a CmpF and CMove here because we check for isNaN. Plus a MoveF2I.
    //
    // Would be nice to vectorize: Missing support for CmpF, CMove and MoveF2I.
    static Object[] test5(int[] a, float[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Float.floatToIntBits(b[i]);
        }
        return new Object[]{ a, b };
    }

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // Missing support for MoveF2I
    static Object[] test6(int[] a, float[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Float.floatToRawIntBits(b[i]);
        }
        return new Object[]{ a, b };
    }

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // Missing support for MoveI2F
    static Object[] test7(int[] a, float[] b) {
        for (int i = 0; i < a.length; i++) {
            b[i] = Float.intBitsToFloat(a[i]);
        }
        return new Object[]{ a, b };
    }

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // Missing support for CmpD, CMove and MoveD2L
    static Object[] test8(long[] a, double[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Double.doubleToLongBits(b[i]);
        }
        return new Object[]{ a, b };
    }

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // Missing support for MoveD2L
    static Object[] test9(long[] a, double[] b) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Double.doubleToRawLongBits(b[i]);
        }
        return new Object[]{ a, b };
    }

    @Test
    @IR(counts = {IRNode.STORE_VECTOR, "= 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})
    // Missing support for MoveL2D
    static Object[] test10(long[] a, double[] b) {
        for (int i = 0; i < a.length; i++) {
            b[i] = Double.longBitsToDouble(a[i]);
        }
        return new Object[]{ a, b };
    }
23-04-2024

This really does seem to be a missing feature of SuperWord. I also don't see these vectorized, for the same reason:

    for (int i = 0; i < a.length; i++) {
        a[i] = Double.doubleToRawLongBits(b[i]);
    }

    for (int i = 0; i < a.length; i++) {
        b[i] = Double.longBitsToDouble(a[i]);
    }

Note that this also blocks:

    for (int i = 0; i < a.length; i++) {
        a[i] = Double.doubleToLongBits(b[i]);
    }

But there also seems to be an additional issue with CMove / CmpF here. I have not investigated further. And exactly the same issues exist for Float.
23-04-2024

Yes, I added it to my list in: JDK-8317424: C2 SuperWord Umbrella: improvements
27-03-2024

Converting this to an enhancement. [~epeter], you might want to link this to the other superword enhancements.
27-03-2024