Found during work on JDK-8340093
The test test/hotspot/jtreg/compiler/loopopts/superword/TestReductions.java has a few cases where we do not vectorize because of long mul reductions / element-wise long mul vectors.
However, TestReductions.longMulSimple does vectorize, and the vectorization leads to a performance regression compared to the scalar code (see the sketch below).
We already saw this here:
https://github.com/openjdk/jdk/pull/25387
(the vectorized code reached only 0.38x of the scalar performance)
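For reference, here is a minimal sketch of the kind of long mul reduction loop involved; the class and method names are made up, this is not the exact TestReductions code:

    // Hypothetical sketch of a long mul reduction loop of the shape that
    // auto-vectorization turns into MulVL / MulReductionVL nodes.
    public class LongMulReductionSketch {
        static long longMulReduce(long[] data) {
            long acc = 1;
            for (int i = 0; i < data.length; i++) {
                acc *= data[i]; // long multiply, accumulated across iterations
            }
            return acc;
        }
    }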
The issue seems to be this:
- Matcher::match_rule_supported_vector has a comment saying that 64-bit/128-bit vector reductions for MulReductionVL are supported.
- Matcher::match_rule_supported_auto_vectorization excludes MulVL from auto-vectorization, because apparently no NEON implementation is available.
- However, in the backend we implement both MulVL and MulReductionVL, but we do it with a scalar implementation: pack and unpack.
- That is very inefficient and can lead to slowdowns. I wonder if that also has an impact on the Vector API; probably yes (see the Vector API sketch below).
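For the Vector API question: a reduction like the one below would presumably end up on the same MulReductionVL backend path. This is a hedged sketch, not taken from any existing test; it only assumes the incubating jdk.incubator.vector API (run with --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class VectorApiMulReductionSketch {
        static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

        // Explicit Vector API version of a long mul reduction; reduceLanes(MUL)
        // is expected to map to a MulReductionVL node in C2.
        static long mulReduce(long[] data) {
            long acc = 1;
            int i = 0;
            for (; i <= data.length - SPECIES.length(); i += SPECIES.length()) {
                LongVector v = LongVector.fromArray(SPECIES, data, i);
                acc *= v.reduceLanes(VectorOperators.MUL);
            }
            for (; i < data.length; i++) {
                acc *= data[i]; // scalar tail
            }
            return acc;
        }
    }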
We have multiple options here:
- We could just prevent long mul reductions for NEON completely.
- But in some odd cases vectorization may still be profitable. For those, we could instead adjust the cost model: make MulVL and MulReductionVL more expensive, so they are only chosen when vectorization still pays off. This is probably the preferable approach (a benchmark sketch for comparing vectorized and scalar code follows below).
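Independent of which option we pick, we should measure the impact. A quick way is to run the same reduction with and without auto-vectorization, e.g. by toggling -XX:-UseSuperWord. Below is a hedged JMH sketch (benchmark and class names are made up, not an existing benchmark):

    import org.openjdk.jmh.annotations.*;

    // Run once as-is (vectorized) and once with
    //   -jvmArgsAppend -XX:-UseSuperWord
    // to get the scalar baseline for comparison.
    @State(Scope.Thread)
    @Fork(1)
    public class LongMulReductionBench {
        @Param("10000")
        int size;
        long[] data;

        @Setup
        public void setup() {
            data = new long[size];
            for (int i = 0; i < size; i++) {
                data[i] = i | 1; // arbitrary odd values, avoids zeros
            }
        }

        @Benchmark
        public long longMulReduce() {
            long acc = 1;
            for (int i = 0; i < data.length; i++) {
                acc *= data[i];
            }
            return acc;
        }
    }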