Bug ID: JDK-8262067 SuperWord loop optimization lost after method inlining

Versions (Unresolved/Resolved/Fixed)

Other: tbd (Resolved)

It was reported that after a method containing a loop is inlined, the loop is no longer vectorized (it is not even converted to a counted loop):

I am encountering a performance issue caused by the interaction between
method inlining and automatic vectorization.

Our application aggregates arrays intensively using a method named
ArrayFloatToArrayFloatVectorBinding.plus() with the following code:

    for (int i = 0; i < srcLen; ++i) {
        dstArray[i] += srcArray[i];
    }

When we microbenchmark this method, we observe fast performance, close to the practical memory bandwidth, and when we print the assembly code we observe loop unrolling and automatic vectorization with SIMD instructions.

In the real application, this method is actually inlined in a higher level
method named AVector.plus(). Unfortunately, the inlined version of the
aggregation code is no longer vectorized.

This causes a significant performance drop, compared to a run where we explicitly disable the inlining and observe automatically vectorized code
again (-XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus).
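
For illustration only, the call shape described in this report can be sketched as a standalone Java class. The names InlineVectorizationSketch, bindingPlus and vectorPlus are hypothetical stand-ins for the application's classes, the real class hierarchy is not shown in this report, and a shape this simple is not known to reproduce the problem.

```java
import java.util.Arrays;

public class InlineVectorizationSketch {

    // Hypothetical stand-in for ArrayFloatToArrayFloatVectorBinding.plus():
    // the aggregation loop quoted in the report.
    static void bindingPlus(float[] dstArray, float[] srcArray, int srcLen) {
        for (int i = 0; i < srcLen; ++i) {
            dstArray[i] += srcArray[i];
        }
    }

    // Hypothetical stand-in for AVector.plus(): a small higher level method
    // that the JIT compiler may inline bindingPlus() into.
    static void vectorPlus(float[] dst, float[] src) {
        bindingPlus(dst, src, src.length);
    }

    public static void main(String[] args) {
        float[] dst = new float[1024];
        float[] src = new float[1024];
        Arrays.fill(src, 1.0f);
        // Repeat the call so that the JIT compiler has a chance to compile
        // vectorPlus(); the iteration count is an arbitrary assumption.
        for (int iter = 0; iter < 20_000; iter++) {
            vectorPlus(dst, src);
        }
        System.out.println(dst[0]); // prints 20000.0
    }
}
```

Whether the loop is vectorized cannot be seen from the program output. One way to check would be to compare the -XX:+PrintAssembly output (which needs the hsdis disassembler library) of a normal run against a run with -XX:CompileCommand=dontinline,InlineVectorizationSketch.bindingPlus.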

Roland provided a JDK 11u build with the fix for JDK-8253923, and the customer verified that it fixed their issue. Closing this bug as a duplicate.
05-03-2021
I dropped this into micros-jdk8 of https://github.com/openjdk/jmh-jdk-microbenchmarks, built it with a JDK 8 JAVA_HOME, and ran it like this:

    $ java -jar target/micros-jdk8-1.0-SNAPSHOT.jar SuperWordPlus

Alternatively, put it into jdk/test/micro/ and build as shown in doc/testing.md:

    make test TEST="micro:SuperWordPlus" MICRO="FORK=1;WARMUP_ITER=2"
25-02-2021
[~ecaspole] You need to explain how to run your test with JMH.
24-02-2021
I tried to mimic the class hierarchy here in the attached SuperWordPlus JMH benchmark, but I cannot reproduce the problem. Maybe the reporter can give some guidance to make it closer to the real application?
24-02-2021
ILW = Performance issue due to missing vectorization, reproducible with customer application, disable inlining of affected method = MMM = P3
23-02-2021
[~roland] Do you have any idea what could go wrong here?
23-02-2021
Note: the code was collected with strip mining ON. The assembly for the loop in the inlined version in AVector::plus() shows that it was strip mined (the outer loop has a safepoint and the inner loop does not). But the inner loop still has range checks for both array accesses, which were not moved to predicates.
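
As a conceptual illustration of what "range checks moved to predicates" means, the two loop shapes can be written out in Java source. This is only a sketch of the idea: the class and method names are hypothetical, the explicit checks stand in for the implicit array bounds checks, and in the compiler a failed predicate leads to deoptimization rather than to the fallback call used here.

```java
public class RangeCheckPredicateSketch {

    // Shape described above: both array accesses are checked on every
    // iteration inside the loop body.
    static void plusWithChecksInLoop(float[] dst, float[] src, int len) {
        for (int i = 0; i < len; ++i) {
            if (i >= dst.length || i >= src.length) {
                throw new ArrayIndexOutOfBoundsException(i);
            }
            dst[i] += src[i];
        }
    }

    // Shape after hoisting: one check before the loop (the "predicate")
    // covers the whole iteration range, so the loop body has no checks.
    static void plusWithHoistedPredicate(float[] dst, float[] src, int len) {
        if (len > dst.length || len > src.length) {
            // Predicate failed: fall back to the checked loop so that the
            // observable behavior (partial update, then exception) is kept.
            plusWithChecksInLoop(dst, src, len);
            return;
        }
        for (int i = 0; i < len; ++i) {
            dst[i] += src[i];
        }
    }

    public static void main(String[] args) {
        float[] dst = {1f, 2f, 3f};
        float[] src = {10f, 20f, 30f};
        plusWithHoistedPredicate(dst, src, 3);
        System.out.println(dst[2]); // prints 33.0
    }
}
```

A loop body without per-iteration checks is generally the easier shape for later optimizations to work with, which is why the remaining range checks are worth noting.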
22-02-2021
It reminds me of a case (a loop was not transformed into a counted loop) that was fixed in JDK 11.0.3: JDK-8211451. The customer uses 'OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.9.1+1, mixed mode)', but I think Oracle JDK may have the same issue. I asked the customer to run with the loop strip mining optimization off: -XX:-UseCountedLoopSafepoints -XX:LoopStripMiningIter=0. But they still see the issue.
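
For readers unfamiliar with loop strip mining, the transformation can be pictured in Java source roughly as follows. This is a hand-written sketch under assumptions, not compiler output: the class name, method name and stripLen parameter are hypothetical, and the comment marking the safepoint only indicates where the JIT-compiled code would poll.

```java
public class StripMiningSketch {

    // Splits one long counted loop into an outer loop over strips and an
    // inner loop over the elements of one strip.
    static void plusStripMined(float[] dst, float[] src, int len, int stripLen) {
        if (stripLen <= 0) {
            throw new IllegalArgumentException("stripLen must be positive");
        }
        int start = 0;
        while (start < len) {
            // End of the current strip, computed without int overflow.
            int end = (len - start > stripLen) ? start + stripLen : len;
            // Inner loop: no safepoint poll, so it can be optimized freely.
            for (int i = start; i < end; ++i) {
                dst[i] += src[i];
            }
            // Outer loop: this is where a safepoint poll would sit.
            start = end;
        }
    }

    public static void main(String[] args) {
        float[] dst = new float[10];
        float[] src = new float[10];
        java.util.Arrays.fill(src, 1.0f);
        plusStripMined(dst, src, 10, 3);
        System.out.println(dst[9]); // prints 1.0
    }
}
```

The result is the same as for the plain loop; only the placement of the safepoint poll changes.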
22-02-2021
I attached a very simple test, TestVectPlus.java, which has similar LogCompilation output. But the loop is vectorized, as expected, when method test() is inlined.
22-02-2021
I uploaded the -XX:+LogCompilation output for the ArrayFloatToArrayFloatVectorBinding::plus() method, which was compiled with a vectorized loop, and the output for the AVector::plus() method, which was compiled with ArrayFloatToArrayFloatVectorBinding::plus() inlined but without the loop being vectorized. I also attached the -XX:+PrintAssembly output for these methods. The application was run with the following JVM flags: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:+LogCompilation -XX:-TieredCompilation
22-02-2021