JDK-8317424 : C2 SuperWord Umbrella: improvements
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 22
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2023-10-03
  • Updated: 2025-02-01
  • Fix Version: tbd (Unresolved)
Description
Here is a list of RFEs and BUGs related to SuperWord / AutoVectorization

You can also refer to my visual representation here:
https://eme64.github.io/blog/2025/01/01/AutoVectorization-Status.html

------------------------------------ FIXED BUGS ---------------------------------------------------------------------

JDK-8332905: C2 SuperWord: bad AD file, with RotateRightV and first operand not a pack
JDK-8330819: C2 SuperWord: bad dominance after pre-loop limit adjustment with base that has CastLL after pre-loop
JDK-8316679: C2 SuperWord: wrong result, load should not be moved before store if not comparable
JDK-8316594: C2 SuperWord: wrong result with hand unrolled loops
JDK-8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs
(JDK-8311586, JDK-8309662, JDK-8303827)
JDK-8314612: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
JDK-8313720: C2 SuperWord: wrong result with -XX:+UseVectorCmov -XX:+UseCMoveUnconditionally
JDK-8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
JDK-8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs
JDK-8310130: C2: assert(false) failed: scalar_input is neither phi nor a matchin reduction
JDK-8309268: C2: "assert(in_bb(n)) failed: must be" after JDK-8306302
JDK-8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph
JDK-8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies
JDK-8340010: Fix vectorization tests with compact headers
JDK-8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures

------------------------------------ TODO BUGS ---------------------------------------------------------------------

JDK-8323582: C2 SuperWord AlignVector: misaligned vector memory access with Unsafe.allocateMemory
(working on it. will give us the infrastructure for Aliasing-Analysis)

------------------------------------ PROBABLY NEVER TO BE FIXED ------------------------------------------

Lilliput collateral damage:
JDK-8344424: C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector

------------------------------------ COMPLETED IMPROVEMENTS -----------------------------------------------
JDK-8317572: C2 SuperWord: refactor/improve VectorizeDebugOption and TraceSuperWord
JDK-8309267: C2 SuperWord: some tests fail on KNL machines - fail to vectorize
JDK-8302652: [SuperWord] Reduction should happen after loop, when possible
JDK-8308606: C2 SuperWord: remove alignment checks when not required
JDK-8308917: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
JDK-8260943: C2 SuperWord: Remove dead vectorization optimization added by 8076284
JDK-8318703: C2 SuperWord: take reduction nodes into account in early unrolling analysis

JDK-8325155: C2 SuperWord: remove alignment boundaries
JDK-8325541: C2 SuperWord: refactor filter / split
JDK-8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence)
JDK-8332163: C2 SuperWord: refactor PacksetGraph and SuperWord::output into VTransformGraph

Cleanup:
JDK-8309204: Obsolete DoReserveCopyInSuperWord
JDK-8323577: C2 SuperWord: remove AlignVector restrictions on IR tests added in JDK-8305055
JDK-8325159: C2 SuperWord: measure time for CITime
JDK-8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion

Testing / Benchmarking:
JDK-8329273: C2 SuperWord: some basic MemorySegment IR tests
JDK-8333647: C2 SuperWord: some additional PopulateIndex tests
JDK-8310308: IR Framework: check for type and size of vector nodes
JDK-8340272: C2 SuperWord: JMH benchmark for Reduction vectorization
JDK-8344118: C2 SuperWord: add VectorThroughputForIterationCount benchmark
JDK-8342387: C2 SuperWord: refactor and improve compiler/loopopts/superword/TestDependencyOffsets.java
JDK-8347545: C2 SuperWord: AutoVectorization benchmark to motivate future work

------------------------------------ TODO VARIOUS IMPROVEMENTS ----------------------------------------------------------

JDK-8309908: C2 SuperWord: IGVN commute swap_edges can prevent vectorization
JDK-8308841: C2 SuperWord: implement vectorization of integer CMove
JDK-8303113: [SuperWord] investigate if enabling _do_vector_loop by default creates speedup
JDK-8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
JDK-8299808: ArrayFill should be preferred over unrolling
JDK-8332878: C2 SuperWord: improve PopulateIndex detection for L/F/D

JDK-8342095: Add autovectorizer support for subword vector casts

JDK-8307084: C2: Vector atomic post loop is not executed for some small trip counts
(Found by ARM, I hope they take this one up soon!)

JDK-8344085: C2 SuperWord: improve vectorization for small loop iteration count


JDK-8328678: C2: hand unrolled loops don't vectorize/unroll as well as loops unrolled by the compiler
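
For illustration, here is a minimal sketch of what "hand unrolled" means in this context. The class and method names are hypothetical and not taken from the issue; both methods assume `a` and `b` are at least as long as `out`. The first loop leaves unrolling to the compiler, the second is already unrolled by a factor of 2 in the source, which is the kind of shape the issue title refers to.

```java
final class HandUnrolledExample {
    // Plain loop: the compiler is free to unroll this itself.
    static void addPlain(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Same computation, unrolled by hand by a factor of 2, with a scalar tail loop.
    static void addHandUnrolled(int[] a, int[] b, int[] out) {
        int i = 0;
        for (; i + 1 < out.length; i += 2) {
            out[i] = a[i] + b[i];
            out[i + 1] = a[i + 1] + b[i + 1];
        }
        for (; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }
}
```

Both methods compute the same result; only the loop shape seen by the compiler differs.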

Reductions
JDK-8343597: C2 SuperWord: RelaxedMath for faster float reductions
JDK-8345044: Sum of array elements not vectorized
(should be addressed by cost-model, see other comments below)
JDK-8345107: C2 SuperWord: implement polynomial reductions (for hashing)
More ideas: generalize to prefix-sum, scans, and even segmented scans. Probably this requires a cost-model. And maybe some prior transformations on the scalar graph?
JDK-8345245: C2 SuperWord: further improve latency after PhaseIdealLoop::move_unordered_reduction_out_of_loop
JDK-8345549: C2 SuperWord: prefix-sum
JDK-8255030: Vectorize equality comparison of some inline types: Although the issue is about inline types, it also applies to other types (e.g. record Quadrilateral(int xA, int yA, int xB, int yB, int xC, int yC, int xD, int yD)). Inline types make objects flatter and so expand where this applies (e.g. record Quadrilateral(Point! A, Point! B, Point! C, Point! D)).
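
To make the reduction shapes named above concrete, here is a small sketch of a plain sum, a polynomial (hash-style) reduction, and an inclusive prefix-sum. The class and method names are hypothetical; the loops are generic examples of each shape and are not copied from the linked issues.

```java
final class ReductionShapes {
    // Plain sum reduction: a single accumulator carried across iterations.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Polynomial reduction as used for hashing: h = 31 * h + a[i].
    // Starting from 1 makes it agree with java.util.Arrays.hashCode(int[]).
    static int polyHash(int[] a) {
        int h = 1;
        for (int i = 0; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    // Inclusive prefix-sum: every intermediate accumulator value is stored.
    static void prefixSum(int[] a, int[] out) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
            out[i] = s;
        }
    }
}
```

In the plain sum only the final accumulator is needed. The polynomial form multiplies the accumulator in every iteration, and the prefix-sum needs every intermediate value, which is one way to see why these are listed as separate items.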

Tests:
JDK-8310891: C2 SuperWord tests: move platform requirements to IR rules
JDK-8310523: Add IR tests for nodes that have too few IR tests yet
JDK-8327671: C2 SuperWord: move all tests to test/hotspot/jtreg/compiler/autovectorization

IR Framework:
JDK-8320224: IR Framework: add MaxVectorSize to JTREG_WHITELIST_FLAGS
JDK-8309183: [IR Framework] Add UseKNLSetting to whitelist
JDK-8310533: [IR Framework] Add possibility to automatically verify that a test method always returns the same result

More Testing infrastructure:
JDK-8346106: Verify.checkEQ: testing utility for recursive value verification
JDK-8346107: Generators: testing utility for random value generation
JDK-8344942: Template-Based Testing Framework

------------------------------------ TODO MemorySegment ------------------------------------------------------------------

JDK-8330991: C2 SuperWord: refactor VPointer
JDK-8331576: C2 SuperWord: Unsafe access with long address that is a CastX2P does not vectorize

I'm first working on a more general MemPointer, which can also be used outside of loopopts.
My first target is the MergeMem optimization: JDK-8335392: C2 MergeStores: enhanced pointer parsing

JDK-8327209: C2 MemorySegment: missing RCE and vectorization
JDK-8324751: C2 SuperWord: Aliasing Analysis
JDK-8329077: C2: MemorySegment double accesses don't vectorize
JDK-8330274: C2 SuperWord: VPointer invar: same sum with different addition order should be equal
JDK-8331659: C2 SuperWord: investigate failed vectorization in compiler/loopopts/superword/TestMemorySegment.java
JDK-8343536: C2 SuperWord / MergeStores: investigate missing optimizations in MemorySegment examples

------------------------------------ TODO COST MODELING ------------------------------------------------------------------
JDK-8340093: C2 SuperWord: implement cost model
Systematically estimate the cost of the scalar vs vector loop.
This would be a better profitability heuristic than what we have now.
It would make it easier to estimate if reductions are profitable.
And it would allow us to estimate if vectorization is profitable with shuffles / insert / extract nodes,
which are additional operations: is their extra work outweighed by the vectorization gains?
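
As a rough illustration of the comparison described here, the following toy sketch compares a scalar cost against a vector cost that includes extra shuffle / insert / extract operations. Everything in it is an assumption made for illustration: the class, the method names, and the unit costs are hypothetical and do not reflect how C2 does or will compute costs.

```java
final class ToyCostModel {
    // Cost of `lanes` scalar iterations, each executing `opsPerIteration` unit-cost operations.
    static int scalarCost(int opsPerIteration, int lanes) {
        return opsPerIteration * lanes;
    }

    // Cost of one vector iteration covering the same work: one vector operation per
    // scalar operation, plus the extra operations (shuffle / insert / extract).
    static int vectorCost(int opsPerIteration, int vectorOpCost, int extraOps, int extraOpCost) {
        return opsPerIteration * vectorOpCost + extraOps * extraOpCost;
    }

    // Vectorization is considered profitable only if the vector version is strictly cheaper.
    static boolean isProfitable(int opsPerIteration, int lanes,
                                int vectorOpCost, int extraOps, int extraOpCost) {
        return vectorCost(opsPerIteration, vectorOpCost, extraOps, extraOpCost)
                < scalarCost(opsPerIteration, lanes);
    }
}
```

With these made-up numbers, 3 operations over 8 lanes with 2 extra operations of cost 2 come out profitable (7 versus 24), while 1 operation over 2 lanes with the same extras does not (5 versus 2).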

Below are some issues that are related to cost-modeling:
JDK-8307516: C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
(goal: replace heuristics with cost-model)

JDK-8336000: Long::bitCount does not auto-vectorize on AArch64
(actually reports an issue with 2-element reductions: they are marked as not profitable in SuperWord::implemented, which must be re-evaluated)

https://www.elastic.co/search-labs/blog/articles/Vector%20Similarity%20Computations%20-%20ludicrous%20speed
Can we do this with auto-vectorization?
The embarrassing thing here is: even a simple dot-product did not vectorize (example with bytes)
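
For reference, a simple byte dot-product of the kind mentioned could look like the sketch below. The class and method names are hypothetical, and this is not necessarily the exact loop from the linked article; it assumes `b` is at least as long as `a`.

```java
final class ByteDotProduct {
    // Dot-product over byte arrays with an int accumulator.
    // Each byte is widened to int before the multiply, so the loop mixes
    // 1-byte loads with 4-byte arithmetic.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

The loop is both a mixed-type loop (byte inputs, int result) and a reduction.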


JDK-8305717: SuperWord: Vectorization in opposite direction traversal cases
JDK-8305707: SuperWord should vectorize reverse-order reduction loops
(requires shuffles, and maybe reverse-order reductions in the backend?)
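
The two loop shapes could be sketched as follows. The names are hypothetical and the loops are generic illustrations rather than the reproducers from the issues; `reverseCopy` assumes `src` is at least as long as `dst`.

```java
final class ReverseOrderLoops {
    // Opposite-direction traversal: dst is written front to back
    // while src is read back to front.
    static void reverseCopy(int[] src, int[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = src[src.length - 1 - i];
        }
    }

    // Reverse-order reduction: the induction variable counts down.
    static int sumBackwards(int[] a) {
        int s = 0;
        for (int i = a.length - 1; i >= 0; i--) {
            s += a[i];
        }
        return s;
    }
}
```

In `reverseCopy`, a block of consecutive source elements would have to be reversed before it can be stored, which is where the shuffles mentioned above come in.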
-----------------------------------------------------------------------------------------------------

BIG GOAL

JDK-8347116: C2 SuperWord: If-Conversion
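
As a source-level illustration of the idea, the sketch below shows a loop with control flow in its body and a branch-free rewrite of the same computation. The names are hypothetical, and the rewrite is done by hand here only to show the target shape; how the compiler represents either form internally is not implied.

```java
final class IfConversionExample {
    // Loop body contains control flow: the store only happens on one path.
    static void clampBranchy(int[] a, int limit) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > limit) {
                a[i] = limit;
            }
        }
    }

    // Same result with the condition expressed as a value selection
    // and an unconditional store in every iteration.
    static void clampSelect(int[] a, int limit) {
        for (int i = 0; i < a.length; i++) {
            int v = a[i];
            a[i] = (v > limit) ? limit : v;
        }
    }
}
```

Note that the first form stores conditionally and the second stores in every iteration. Both produce the same array contents, but the memory behavior differs, which is part of what makes the general transformation non-trivial.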

------------------------------------ VALHALLA ------------------------------------------------------------------
JDK-8253160: C2's superword optimization should vectorize flat inline type array accesses

Comments
Walking through the SuperWord code again, and trying to see what code looks old/problematic/worthy of refactoring.

Worth more investigation:
_race_possible -> I don't understand what this means, or what it does. May be the reason for some bugs?
order_def_uses -> orders the uses of packsets to smooth out conflicts. This basically just pushes the issue down to the uses, as far as possible. The question is why this does not always happen, and why only among packsets? I feel this is incomplete. It feels like it should be able to fix issues like JDK-8309908, why does it not?
opnd_positions_match -> is supposed to re-order some inputs to ensure the nodes can be packed. Not sure if this works, and if it works correctly.
_early_return -> seems like some alternate failure state, where the loop is not canonical. We could refactor this to be nicer. Not sure if worth it. But it could be interesting to validate that it is still valid after unrolling. But that might just fail on some edge-case.
_num_work_vecs, _num_reductions -> used for the reduction heuristic. Not very accurate anymore. But we will probably only remove them once we get a better general heuristic.
SuperWord::mark_reductions / SuperWordReductions -> we eventually want to make this much stronger. It should do a general path search, probably re-order the graph if necessary. And it should also look for polynomial reductions (mul-add). The question is if such reduction recognition should be packaged differently. Basically, it would be nice to package the polynomial reduction into special nodes already...
SuperWord::transform_loop -> checks for main_loop and finds pre-loop. Is this relevant? When might we not find the pre-loop?
SuperWord::find_adjacent_refs -> all alignment stuff can eventually be removed. We can then also develop into non-adjacent refs. Probably first take all adjacent refs, then take strided refs. Eventually, we could also try to gather/scatter more generally. Additionally, this becomes quite expensive because it has nested loops. Maybe we can make it linear somehow? Split them into groups, then put them in an order. If overlap, try to separate by origin node (CloneMap).
SuperWordRTDepCheck, _disjoint_ptrs -> no use so far. Remove? If anything we should only trace dependencies between those that actually get edges added for them. There, we could then add speculation somehow, and retry with them properly separated. Currently, it is quite useless.
SuperWord::mem_slice_preds -> has a FIXME. I should probably investigate a bit what cases lead to those "special cases". Maybe the assert can be simplified to "!in_bb(out)"?
SuperWord::stmts_can_pack -> exists_at seems a bit expensive. But we also do not want to double-pack nodes. Maybe there is an alternative.
SuperWord::isomorphic -> seems overly complicated. Did Christian not just redo that code? JDK-8310886
SuperWord::have_similar_inputs -> commented out asserts?
SuperWord::adjust_alignment_for_type_conversion -> looks like another part of the alignment stuff that I'd like to remove.
SuperWord::est_savings -> check if it ever prevents some vectorization. And why. Looks like premature cost analysis... not sure what this is. But it is possible that it is very intricate and still kinda correct. The question is if we really want that...
SuperWord::filter_packs -> checks for implemented & profitable. We probably want to eventually break this up more. I guess we can move implemented much earlier, just when it is combined, and then split.
SuperWord::profitable -> probably deserves a renaming? Does lots of non-cost related things.
SuperWord::insert_extracts -> not sure if this ever works. Maybe add assert and investigate?
SuperWord::is_vector_use -> seems more complicated than required. Not sure about all the alignment code there.
SuperWord::bb_insert_after -> does this do anything useful?

FYI:
SuperWord::transform_loop -> checks that there is no control in the loop. Will be interesting when we want to allow if-conversion (beyond tackling CMove).
SuperWord::unrolling_analysis -> ignores some Cmp nodes. That will generate some issues when doing if-conversion.
06-10-2023