Bug ID: JDK-8304720 SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 11,17,21

Priority: P3
Status: Resolved
Resolution: Fixed

Submitted: 2023-03-22
Updated: 2023-07-10
Resolved: 2023-05-08

JDK 21
21 b22Fixed

During work of JDK-8304042 (and ultimately of JDK-8298935), I found an example where we have no cyclic dependency, but the C2 graph after SuperWord is not correct. It leads to WRONG RESULTS.

The issue is that we do not re-order the graph for non-vectorized memory operations. But the vectorization of some memory operations may lead to a reordering of memory ops, not just for the vectorized ones but also the unvectorized ones.

In SuperWord::co_locate_pack we have some complicated logic for Stores, where we try to find out what needs to be scheduled before (and presumably after). But we have nothing equivalent for Load.

We could try to implement something equivalent to the logic for Stores. But I am not convinced that this works, because we might have to also re-order stores that are not packed.

My suggestion:
We schedule all packs and left over scalars into a linear order (I already do that in JDK-8304042). From that order, we can re-construct all memory-state edges in the C2 memory slices.

Reproduce like this:

./java -XX:-TieredCompilation -Xbatch -XX:CompileCommand=compileonly,Test4::test -XX:CompileCommand=printcompilation,Test4::* -XX:LoopUnrollLimit=250 -XX:+TraceSuperWord  Test4.java

./java -XX:-TieredCompilation -Xbatch --add-modules java.base --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -XX:CompileCommand=compileonly,Test5::test -XX:CompileCommand=printcompilation,Test5::* -XX:LoopUnrollLimit=250 -XX:+TraceSuperWord -XX:LoopMaxUnroll=5 Test5.java

I will describe Test4.java. Test5.java is very similar, just uses Unsafe so that it reproduces before we implemented conversions in SuperWord.

We have:

        for (int i = 0; i < RANGE; i+=2) {
            dataFa[i + 0] = dataIa[i + 0] * 1.3f;
            dataIb[i + 0] = (int)dataFb[i + 0] * 11;
            dataIb[i + 1] = (int)dataFb[i + 1] * 11;
            dataFa[i + 1] = dataIa[i + 1] * 1.2f;
        }

This gets unrolled, for simplicity let's say only 2x.

        for (int i = 0; i < RANGE; i+=4) {
            dataFa[i + 0] = dataIa[i + 0] * 1.3f;          -> scalar, before
            dataIb[i + 0] = (int)dataFb[i + 0] * 11;         -> vectorize
            dataIb[i + 1] = (int)dataFb[i + 1] * 11;         -> vectorize
            dataFa[i + 1] = dataIa[i + 1] * 1.2f;          -> scalar, after
            dataFa[i + 2] = dataIa[i + 2] * 1.3f;          -> scalar, before
            dataIb[i + 2] = (int)dataFb[i + 2] * 11;         -> vectorize
            dataIb[i + 3] = (int)dataFb[i + 3] * 11;         -> vectorize
            dataFa[i + 3] = dataIa[i + 3] * 1.2f;          -> scalar, after
        }

We can only vectorize half of the statements, the other half does not work because the constants are not matching.

To enable this vectorization, the statements have to be re-ordered.
We have to pay attention to the Store-Load dependencies.
For (i+0) and (i+2), we have a StoreF-LoadF dependency ("before"), for (i+1) and (i+3) we have a StoreI-LoadI dependency ("after").

        for (int i = 0; i < RANGE; i+=4) {
            dataFa[i + 0] = dataIa[i + 0] * 1.3f;          -> scalar, before
            dataFa[i + 2] = dataIa[i + 2] * 1.3f;          -> scalar, before
            dataIb[i + 0] = (int)dataFb[i + 0] * 11;         -> vectorize
            dataIb[i + 1] = (int)dataFb[i + 1] * 11;         -> vectorize
            dataIb[i + 2] = (int)dataFb[i + 2] * 11;         -> vectorize
            dataIb[i + 3] = (int)dataFb[i + 3] * 11;         -> vectorize
            dataFa[i + 1] = dataIa[i + 1] * 1.2f;          -> scalar, after
            dataFa[i + 3] = dataIa[i + 3] * 1.2f;          -> scalar, after
        }

We see that we had to re-order the memory ops. For example we had to swap the order of the StoreF for (i+1) and (i+2). This must be reflected in the C2 graph. Some of the scalar memory ops must be moved before the vector load/stores, some must be pushed to later. This should be determined by the dependency-graph of SuperWord.

I think my Test5.java fails for JDK17.

Task:
Un-comment "test5" from JDK-8304042.
And lift the platform restrictions at the test definitions.

Changeset: ad0e5a99 Author: Emanuel Peter <epeter@openjdk.org> Date: 2023-05-08 06:09:10 +0000 URL: https://git.openjdk.org/jdk/commit/ad0e5a99ca1ad9dd04105f502985735a3536c3f4

08-05-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/13354 Date: 2023-04-05 14:55:55 +0000

25-04-2023

ILW = Wrong execution due to wrong scheduling in SuperWord, rare, disable compilation of affected method or use -XX:-UseSuperWord = HLM = P3

23-03-2023

Relates :	JDK-8306302 - C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
Relates :	JDK-8304042 - C2 SuperWord: schedule must remove packs with cyclic dependencies
Relates :	JDK-8306088 - C2 Superword: verify consistency of packset and my_pack