JDK-8314612 : TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
  • Type: Bug
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 21,22
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: aarch64
  • Submitted: 2023-08-19
  • Updated: 2023-12-15
  • Resolved: 2023-09-13
JDK 21: 21.0.3-oracle (Fixed)
JDK 22: 22 b15 (Fixed)
Description
1. How to produce the bug

When -XX:MaxVectorSize is changed to 32 and -XX:+AvoidUnalignedAccesses is added in `test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java` (see the diff below), running the test with the jtreg command that follows fails with a wrong result.

```
zifeihan@d915263bc793:~/jdk$ git diff test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
diff --git a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
index 18f3b6930ea..952a56dd842 100644
--- a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
+++ b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
@@ -40,7 +40,8 @@ public class TestUnorderedReduction {
     public static void main(String[] args) {
         TestFramework.runWithFlags("-Xbatch",
                                    "-XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*",
-                                   "-XX:MaxVectorSize=16");
+                                   "-XX:MaxVectorSize=32",
+                                   "-XX:+AvoidUnalignedAccesses");
     }
 
     @Run(test = {"test1", "test2", "test3"})
```

Run the test as follows (the JDK used here is an SVE build wrapped with qemu-user):

```
/home/zifeihan/jtreg/bin/jtreg \
-J-Djavatest.maxOutputSize=500000 \
-Djdk.lang.Process.launchMechanism=vfork \
-v:default \
-concurrency:32 \
-timeout:50 \
-javaoption:-XX:UseSVE=2 \
-jdk:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk \
/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
```

The exceptions are as follows:

```
----------System.out:(19/3921)----------
Run Flag VM:
Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dtest.jdk=/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk -Djava.library.path=. -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0 -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses compiler.lib.ir_framework.flag.FlagVM compiler.loopopts.superword.TestUnorderedReduction ]
[2023-08-19T02:00:01.969752090Z] Gathering output for process 91032
[2023-08-19T02:00:08.177648302Z] Waiting for completion for process 91032
[2023-08-19T02:00:08.180735552Z] Waiting for completion finished for process 91032
Output and diagnostic info for process 91032 was saved into 'pid-91032-output.log'
[2023-08-19T02:00:08.206587843Z] Waiting for completion for process 91032
[2023-08-19T02:00:08.208777677Z] Waiting for completion finished for process 91032
Run Test VM - [-Xbatch, -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*, -XX:MaxVectorSize=32, -XX:+AvoidUnalignedAccesses]:
Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction ]
[2023-08-19T02:00:08.628667469Z] Gathering output for process 91057
[2023-08-19T02:00:16.553228083Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.554576500Z] Waiting for completion finished for process 91057
Output and diagnostic info for process 91057 was saved into 'pid-91057-output.log'
[2023-08-19T02:00:16.712613250Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.713073750Z] Waiting for completion finished for process 91057
[2023-08-19T02:00:16.719087791Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.723273500Z] Waiting for completion finished for process 91057

----------System.err:(65/4837)----------

TestFramework test VM exited with code 1

Command Line:
/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -DReproduce=true -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction


Error Output
------------
Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException: 

Test Failures (1)
-----------------
Custom Run Test: @Run: runTests - @Tests: {test1,test2,test3}:
compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void compiler.loopopts.superword.TestUnorderedReduction.runTests() throws java.lang.Exception
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162)
	at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:104)
	at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89)
	at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822)
	at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
	at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
Caused by: java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159)
	... 5 more
Caused by: java.lang.RuntimeException: Wrong result test2: 3469730 != 5772800
	at compiler.loopopts.superword.TestUnorderedReduction.runTests(TestUnorderedReduction.java:65)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	... 7 more



	at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:857)
	at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
	at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)


  #############################################################
   - To only run the failed tests use -DTest, -DExclude,
     and/or -DScenarios.
   - To also get the standard output of the test VM run with
     -DReportStdout=true or for even more fine-grained logging
     use -DVerbose=true.
  #############################################################
```



2. Reduced test case

```
public class TestUnorderedReduction {
    static final int RANGE = 512;
    static final int ITER  = 10;

    public static void main(String[] args) throws Exception {
        final TestUnorderedReduction testUnorderedReduction = new TestUnorderedReduction();
        for (int i = 0; i < 500; i++) {
            testUnorderedReduction.runTests();
        }
    }

    public void runTests() throws Exception {
        int[] data = new int[RANGE];

        init(data);
        for (int i = 0; i < ITER; i++) {
            int r1 = test2(data, i);
            int r2 = ref2(data, i);
            if (r1 != r2) {
                throw new RuntimeException("Wrong result test2: " + r1 + " != " + r2);
            }
        }
    }

    // Compiled by C2: matches the "test*" pattern of -XX:CompileCommand=compileonly.
    static int test2(int[] data, int sum) {
        for (int i = 0; i < RANGE; i+=8) {
            sum += 3 * data[i+0];
            sum += 3 * data[i+1];
            sum += 3 * data[i+2];
            sum += 3 * data[i+3];
            sum += 3 * data[i+4];
            sum += 3 * data[i+5];
            sum += 3 * data[i+6];
            sum += 3 * data[i+7];
        }
        return sum;
    }

    // Reference result: same loop, but not matched by compileonly, so it is not compiled.
    static int ref2(int[] data, int sum) {
        for (int i = 0; i < RANGE; i+=8) {
            sum += 3 * data[i+0];
            sum += 3 * data[i+1];
            sum += 3 * data[i+2];
            sum += 3 * data[i+3];
            sum += 3 * data[i+4];
            sum += 3 * data[i+5];
            sum += 3 * data[i+6];
            sum += 3 * data[i+7];
        }
        return sum;
    }

    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = 1;
        }
    }
}
```

Running this reduced test case with `./java -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction` passes. It fails when -XX:+AvoidUnalignedAccesses is added: `./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction`.
```
zifeihan@d915263bc793:~/jdk/build/linux-aarch64-server-fastdebug/jdk/bin$ /home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java-bak -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction


CompileCommand: compileonly TestUnorderedReduction.test* bool compileonly = true
Exception in thread "main" java.lang.RuntimeException: Wrong result test2: 1034 != 1538
        at TestUnorderedReduction.runTests(TestUnorderedReduction.java:20)
        at TestUnorderedReduction.main(TestUnorderedReduction.java:8)
```

SVE is emulated with qemu-user, with the SVE vector width set to 256 bits:

```
/home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction
```

3. C2 JIT code

3.1 C2 JIT code for TestUnorderedReduction::test2 when the test case passes.

```
160     B15: #	out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
160     spill R13 -> R10	# spill size = 32

164     B16: #	out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
164     add R12, R1, R10, I2L #2	# ptr
168     add R13, R12, #16	# ptr
16c     loadV V17, [R13]	# vector (sve)
170     add R13, R12, #48	# ptr
174     loadV V18, [R13]	# vector (sve)
178     vlsl_imm V19, V17, #1
17c     add R13, R12, #80	# ptr
180     vaddI V17, V19, V17
184     loadV V19, [R13]	# vector (sve)
188     vlsl_imm V20, V18, #1
18c     vaddI V16, V16, V17
190     vaddI V17, V20, V18
194     add R13, R12, #112	# ptr
198     loadV V18, [R13]	# vector (sve)
19c     vlsl_imm V20, V19, #1
1a0     vaddI V16, V16, V17
1a4     vaddI V17, V20, V19
1a8     add R13, R12, #144	# ptr
1ac     loadV V19, [R13]	# vector (sve)
1b0     vlsl_imm V20, V18, #1
1b4     add R13, R12, #176	# ptr
1b8     vaddI V18, V20, V18
1bc     loadV V20, [R13]	# vector (sve)
1c0     vaddI V16, V16, V17
1c4     vlsl_imm V17, V19, #1
1c8     vaddI V16, V16, V18
1cc     vaddI V17, V17, V19
1d0     add R13, R12, #208	# ptr
1d4     loadV V18, [R13]	# vector (sve)
1d8     vlsl_imm V19, V20, #1
1dc     add R12, R12, #240	# ptr
1e0     vaddI V19, V19, V20
1e4     loadV V20, [R12]	# vector (sve)
1e8     vaddI V16, V16, V17
1ec     vlsl_imm V17, V18, #1
1f0     vaddI V16, V16, V19
1f4     vaddI V17, V17, V18
1f8     vlsl_imm V18, V20, #1
1fc     vaddI V16, V16, V17
200     vaddI V17, V18, V20
204     vaddI V16, V16, V17
208     addw R13, R10, #64
20c     cmpw  R13, #456
210     blt B15 	// counted loop end  P=0.984636 C=48127.000000

214     B17: #	out( B22 B18 ) <- in( B16 )  Freq: 0.999989
214     reduce_addI_sve R0, R2, V16	# KILL V17
220     cmpw  R13, #512
224     bge  B22  P=0.500000 C=-1.000000
```

3.2 C2 JIT code for TestUnorderedReduction::test2 when the test case fails.

```
170     B15: #	out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
170     spill R13 -> R11	# spill size = 32

174     B16: #	out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
174     add R13, R10, R11, I2L #2	# ptr
178     loadV16 V17, [R13, #16]	# vector (128 bits)
17c     loadV V18, [R13, #32]	# vector (sve)
180     vlsl_imm V19, V17, #1
184     loadV V20, [R13, #64]	# vector (sve)
188     vaddI V17, V19, V17
18c     vlsl_imm V19, V18, #1
190     vaddI V16, V16, V17
194     vaddI V17, V19, V18
198     loadV V18, [R13, #96]	# vector (sve)
19c     vlsl_imm V19, V20, #1
1a0     vaddI V16, V16, V17
1a4     vaddI V17, V19, V20
1a8     loadV16 V19, [R13, #128]	# vector (128 bits)
1ac     vlsl_imm V20, V18, #1
1b0     vaddI V16, V16, V17
1b4     vaddI V17, V20, V18
1b8     loadV16 V18, [R13, #144]	# vector (128 bits)
1bc     vlsl_imm V20, V19, #1
1c0     loadV V21, [R13, #160]	# vector (sve)
1c4     vaddI V19, V20, V19
1c8     vaddI V16, V16, V17
1cc     vlsl_imm V17, V18, #1
1d0     vaddI V16, V16, V19
1d4     vaddI V17, V17, V18
1d8     loadV V18, [R13, #192]	# vector (sve)
1dc     vlsl_imm V19, V21, #1
1e0     loadV V20, [R13, #224]	# vector (sve)
1e4     vaddI V19, V19, V21
1e8     vaddI V16, V16, V17
1ec     vlsl_imm V17, V18, #1
1f0     loadV16 V21, [R13, #256]	# vector (128 bits)
1f4     vaddI V17, V17, V18
1f8     vaddI V16, V16, V19
1fc     vlsl_imm V18, V20, #1
200     vaddI V16, V16, V17
204     vaddI V17, V18, V20
208     vlsl_imm V18, V21, #1
20c     vaddI V16, V16, V17
210     vaddI V17, V18, V21
214     vaddI V16, V16, V17
218     addw R13, R11, #64
21c     cmpw  R13, #456
220     blt B15 	// counted loop end  P=0.984636 C=48127.000000

224     B17: #	out( B22 B18 ) <- in( B16 )  Freq: 0.999989
224     reduce_addI_neon R0, R2, V16	# KILL V17
230     cmpw  R13, #512
234     bge  B22  P=0.500000 C=-1.000000
```

From the C2 JIT code, we can see that `reduce_addI_neon R0, R2, V16 # KILL V17` reduces V16, which is accumulated by the `vaddI V16, V16, V17` instructions in the loop. In the failing case, however, the vectors feeding those additions have different lengths (128-bit `loadV16` loads mixed with SVE `loadV` loads), which may result in omitted or over-processed data.
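To make the effect of mixed vector lengths concrete, here is a minimal plain-Java model. It is only an illustration under stated assumptions: `LaneMismatchDemo`, `addLanes`, `reduce` and `filled` are hypothetical names, arrays stand in for vector registers, and the model does not claim to reproduce the exact wrong values reported above.

```java
import java.util.Arrays;

public class LaneMismatchDemo {
    // Lane-wise add of src into the first src.length lanes of acc.
    // Models a vector add whose input may have fewer lanes than the accumulator.
    static void addLanes(int[] acc, int[] src) {
        for (int lane = 0; lane < src.length; lane++) {
            acc[lane] += src[lane];
        }
    }

    // Horizontal reduction over the first `lanes` lanes only.
    static int reduce(int[] acc, int lanes) {
        int sum = 0;
        for (int lane = 0; lane < lanes; lane++) {
            sum += acc[lane];
        }
        return sum;
    }

    // Helper: a vector of `lanes` elements, all set to `value`.
    static int[] filled(int lanes, int value) {
        int[] v = new int[lanes];
        Arrays.fill(v, value);
        return v;
    }

    public static void main(String[] args) {
        int[] acc = new int[8];        // 8 int lanes, like a 256-bit accumulator
        addLanes(acc, filled(8, 3));   // 8-lane input (assumed analogue of loadV, sve)
        addLanes(acc, filled(4, 3));   // 4-lane input (assumed analogue of loadV16)

        int expected = 8 * 3 + 4 * 3;  // every element counted exactly once
        int narrow = reduce(acc, 4);   // a 4-lane reduction ignores lanes 4..7
        System.out.println("expected=" + expected + " narrowReduction=" + narrow);
        // prints: expected=36 narrowReduction=24
    }
}
```

In this model the 4-lane reduction silently drops the contribution held in the upper lanes, which is the "omitted data" case described above.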

Comments
Thanks!
15-12-2023

Hi Severin, apologies for that. I’ve updated the fix request and left details on additional testing on the PR.
15-12-2023

Fix request [21u] I backport this for parity with 21.0.3-oracle. Patch applies cleanly. Affected test passes, compiler test suite passes and tier 1 passes with GHA. [Edited to correct mistake on what the backport touches on and add testing details]
15-12-2023

[~szaldana] The fix request comment is misleading. It's not only a test change. Changes some loop opt in c2 as well. Please mention what testing you have done. Should at least run the test before/after patch and the c2 compiler test suite.
14-12-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk21u-dev/pull/24 Date: 2023-12-13 18:20:55 +0000
13-12-2023

Changeset: f804f865 Author: Emanuel Peter <epeter@openjdk.org> Date: 2023-09-13 10:47:20 +0000 URL: https://git.openjdk.org/jdk/commit/f804f8652da71b18cc654c08c12d07d6fd43c2a7
13-09-2023

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/15654 Date: 2023-09-11 09:17:46 +0000
11-09-2023

My suspicion was that we get vectors with different vector lengths. I asked [~chagedorn] to run it with "-XX:+TraceNewVectors":

with AlignVector:
TraceNewVectors [SuperWord]: 1606 MulVI === _ 1604 1605 [[ 930 929 928 927 918 917 916 915 ]] #vectory[8]:{int} !orig=[938],796,225 !jvms: TestUnorderedReduction::test2 @ bci:61 (line 77)
TraceNewVectors [SuperWord]: 1627 MulVI === _ 1625 1626 [[ 226 251 276 301 ]] #vectorx[4]:{int} !orig=[225] !jvms: TestUnorderedReduction::test2 @ bci:61 (line 77)

without:
Only size 8

I suspect that AlignVector creates different alignment boundaries. It cuts it into 4 and 8 elements. Without AlignVector, they happen to all have 8 elements.

We need to add a corresponding check, to ensure that all vectors in question have the same number of elements (otherwise we miss some elements, or start hallucinating elements): https://github.com/openjdk/jdk/blob/06b0a5e03852dfed9f1dee4791fc71b4e4e1eeda/src/hotspot/share/opto/loopopts.cpp#L4211C40-L4211C40

I can do it in two weeks when I am back, or someone else can take it over if it is more urgent.
29-08-2023
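The check proposed in the comment above can be sketched in plain Java. This is not the HotSpot code (which is C++ in loopopts.cpp) and not the actual fix: `SameLengthCheck`, `VectorOp` and `allSameLength` are hypothetical names, and the lane counts 8 and 4 are taken from the TraceNewVectors output quoted in that comment.

```java
import java.util.List;

public class SameLengthCheck {
    // Hypothetical stand-in for a C2 vector node: an opcode name and a lane count.
    record VectorOp(String name, int lanes) {}

    // Returns true only if every vector in the reduction chain has the same
    // number of elements. If this fails, the reduction should stay in the loop.
    static boolean allSameLength(List<VectorOp> chain) {
        if (chain.isEmpty()) {
            return true;
        }
        int lanes = chain.get(0).lanes();
        for (VectorOp op : chain) {
            if (op.lanes() != lanes) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Lane counts as reported with AlignVector: vectory[8] and vectorx[4].
        List<VectorOp> withAlignVector =
                List.of(new VectorOp("MulVI", 8), new VectorOp("MulVI", 4));
        // Without AlignVector: only size 8.
        List<VectorOp> withoutAlignVector =
                List.of(new VectorOp("MulVI", 8), new VectorOp("MulVI", 8));

        System.out.println(allSameLength(withAlignVector));    // false
        System.out.println(allSameLength(withoutAlignVector)); // true
    }
}
```

The design point is simply to bail out of the transformation when lane counts differ, rather than trying to widen or narrow the mismatched vectors.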

Thanks for the details [~fgao]. Makes sense. Thanks for confirming [~gcao]. Emanuel can look into this once he's back from vacation but feel free to take it if you have time.
21-08-2023

Hi [~thartmann], I ran some tests to verify that the issue was introduced by JDK-8302652.
21-08-2023

Hi [~thartmann], "AvoidUnalignedAccesses" is AArch64- or RISC-V-specific, but its value is also passed on to "AlignVector" in superword. So I reproduced the same bug on a 512-bit x86_64 platform as well, using "-XX:MaxVectorSize=32 -XX:+AlignVector" instead.
21-08-2023

ILW = Wrong result with C2 compiled code, reproducible on AArch64 with SVE2 and -XX:+AvoidUnalignedAccesses, -XX:-SuperWordReductions or disable compilation of affected method = HLM = P3
21-08-2023

Might be a regression from JDK-8302652. [~gcao] could you check if disabling PhaseIdealLoop::move_unordered_reduction_out_of_loop helps?
21-08-2023
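For readers unfamiliar with the optimization named in the comment above, the following plain-Java sketch shows the general idea of moving an unordered reduction out of a loop, as far as it can be inferred from the JIT code in section 3: instead of reducing every vector to a scalar inside the loop, the loop accumulates lane-wise and a single reduction runs after it. `ReductionOutOfLoop`, `reduceInLoop`, `reduceAfterLoop` and `LANES` are hypothetical names; arrays model vector registers.

```java
public class ReductionOutOfLoop {
    static final int LANES = 8; // assumed lane count: 256-bit vector of ints

    // Shape before the transformation: each "vector" is reduced to a scalar
    // inside the loop and added to sum.
    static int reduceInLoop(int[] data, int sum) {
        for (int i = 0; i < data.length; i += LANES) {
            int partial = 0;
            for (int lane = 0; lane < LANES; lane++) {
                partial += 3 * data[i + lane];
            }
            sum += partial;
        }
        return sum;
    }

    // Shape after the transformation: the loop only does lane-wise adds into
    // a vector accumulator, and one reduction runs after the loop.
    static int reduceAfterLoop(int[] data, int sum) {
        int[] acc = new int[LANES];
        for (int i = 0; i < data.length; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                acc[lane] += 3 * data[i + lane];
            }
        }
        for (int lane = 0; lane < LANES; lane++) {
            sum += acc[lane];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[512];
        java.util.Arrays.fill(data, 1);
        // Both print 1538, matching the reference value for sum = 2.
        System.out.println(reduceInLoop(data, 2));
        System.out.println(reduceAfterLoop(data, 2));
    }
}
```

The two forms agree because int addition is associative and commutative even under overflow, but only as long as every vector added into the accumulator has the same number of lanes as the accumulator itself.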

Looks like a duplicate of JDK-8310190 to me. [~epeter] is currently on vacation. [~fgao], maybe you can confirm? EDIT: Okay, JDK-8310190 is only about misaligned accesses. In this case, we get a wrong result. It's probably something different then.
21-08-2023