1. How to reproduce the bug
Change `-XX:MaxVectorSize` to 32 and add `-XX:+AvoidUnalignedAccesses` in `test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java`, as in the diff below. Running the test with the command that follows then fails with a wrong result.
```
zifeihan@d915263bc793:~/jdk$ git diff test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
diff --git a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
index 18f3b6930ea..952a56dd842 100644
--- a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
+++ b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
@@ -40,7 +40,8 @@ public class TestUnorderedReduction {
     public static void main(String[] args) {
         TestFramework.runWithFlags("-Xbatch",
                                    "-XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*",
-                                   "-XX:MaxVectorSize=16");
+                                   "-XX:MaxVectorSize=32",
+                                   "-XX:+AvoidUnalignedAccesses");
     }
 
     @Run(test = {"test1", "test2", "test3"})
```
Run the test with the following command (the JDK used here is an SVE-enabled build that runs under qemu-user):
```
/home/zifeihan/jtreg/bin/jtreg \
-J-Djavatest.maxOutputSize=500000 \
-Djdk.lang.Process.launchMechanism=vfork \
-v:default \
-concurrency:32 \
-timeout:50 \
-javaoption:-XX:UseSVE=2 \
-jdk:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk \
/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
```
The test fails with the following output:
```
----------System.out:(19/3921)----------
Run Flag VM:
Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dtest.jdk=/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk -Djava.library.path=. -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0 -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses compiler.lib.ir_framework.flag.FlagVM compiler.loopopts.superword.TestUnorderedReduction ]
[2023-08-19T02:00:01.969752090Z] Gathering output for process 91032
[2023-08-19T02:00:08.177648302Z] Waiting for completion for process 91032
[2023-08-19T02:00:08.180735552Z] Waiting for completion finished for process 91032
Output and diagnostic info for process 91032 was saved into 'pid-91032-output.log'
[2023-08-19T02:00:08.206587843Z] Waiting for completion for process 91032
[2023-08-19T02:00:08.208777677Z] Waiting for completion finished for process 91032
Run Test VM - [-Xbatch, -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*, -XX:MaxVectorSize=32, -XX:+AvoidUnalignedAccesses]:
Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction ]
[2023-08-19T02:00:08.628667469Z] Gathering output for process 91057
[2023-08-19T02:00:16.553228083Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.554576500Z] Waiting for completion finished for process 91057
Output and diagnostic info for process 91057 was saved into 'pid-91057-output.log'
[2023-08-19T02:00:16.712613250Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.713073750Z] Waiting for completion finished for process 91057
[2023-08-19T02:00:16.719087791Z] Waiting for completion for process 91057
[2023-08-19T02:00:16.723273500Z] Waiting for completion finished for process 91057
----------System.err:(65/4837)----------
TestFramework test VM exited with code 1
Command Line:
/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -DReproduce=true -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction
Error Output
------------
Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException:
Test Failures (1)
-----------------
Custom Run Test: @Run: runTests - @Tests: {test1,test2,test3}:
compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void compiler.loopopts.superword.TestUnorderedReduction.runTests() throws java.lang.Exception
at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162)
at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:104)
at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89)
at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822)
at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
Caused by: java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159)
... 5 more
Caused by: java.lang.RuntimeException: Wrong result test2: 3469730 != 5772800
at compiler.loopopts.superword.TestUnorderedReduction.runTests(TestUnorderedReduction.java:65)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
... 7 more
at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:857)
at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
#############################################################
- To only run the failed tests use -DTest, -DExclude,
and/or -DScenarios.
- To also get the standard output of the test VM run with
-DReportStdout=true or for even more fine-grained logging
use -DVerbose=true.
#############################################################
```
2. Reduced test case
```
public class TestUnorderedReduction {
    static final int RANGE = 512;
    static final int ITER = 10;
    public static void main(String[] args) throws Exception {
        final TestUnorderedReduction testUnorderedReduction = new TestUnorderedReduction();
        for (int i = 0; i < 500; i++) {
            testUnorderedReduction.runTests();
        }
    }
    public void runTests() throws Exception {
        int[] data = new int[RANGE];
        init(data);
        for (int i = 0; i < ITER; i++) {
            int r1 = test2(data, i);
            int r2 = ref2(data, i);
            if (r1 != r2) {
                throw new RuntimeException("Wrong result test2: " + r1 + " != " + r2);
            }
        }
    }
    static int test2(int[] data, int sum) {
        for (int i = 0; i < RANGE; i+=8) {
            sum += 3 * data[i+0];
            sum += 3 * data[i+1];
            sum += 3 * data[i+2];
            sum += 3 * data[i+3];
            sum += 3 * data[i+4];
            sum += 3 * data[i+5];
            sum += 3 * data[i+6];
            sum += 3 * data[i+7];
        }
        return sum;
    }
    static int ref2(int[] data, int sum) {
        for (int i = 0; i < RANGE; i+=8) {
            sum += 3 * data[i+0];
            sum += 3 * data[i+1];
            sum += 3 * data[i+2];
            sum += 3 * data[i+3];
            sum += 3 * data[i+4];
            sum += 3 * data[i+5];
            sum += 3 * data[i+6];
            sum += 3 * data[i+7];
        }
        return sum;
    }
    static void init(int[] data) {
        for (int i = 0; i < RANGE; i++) {
            data[i] = 1;
        }
    }
}
```
This reduced test case passes when run with `./java -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction`. It fails once `-XX:+AvoidUnalignedAccesses` is added: `./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction`.
```
zifeihan@d915263bc793:~/jdk/build/linux-aarch64-server-fastdebug/jdk/bin$ /home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java-bak -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction
CompileCommand: compileonly TestUnorderedReduction.test* bool compileonly = true
Exception in thread "main" java.lang.RuntimeException: Wrong result test2: 1034 != 1538
at TestUnorderedReduction.runTests(TestUnorderedReduction.java:20)
at TestUnorderedReduction.main(TestUnorderedReduction.java:8)
```
SVE is emulated with qemu-user, with the SVE vector width set to 256 bits:
```
/home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction
```
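As a sanity check on the numbers in the exception message: `init` fills `data` with ones, so both `test2` and `ref2` should return `sum + 3 * RANGE`. The sketch below works this out for the failing call. The class and method names are illustrative only and are not part of the test.

```java
// Illustrative only: ExpectedSum and expected() are hypothetical names,
// not part of TestUnorderedReduction.
final class ExpectedSum {
    static final int RANGE = 512;

    // With data[i] == 1 for every i, the loop adds 3 per element
    // on top of the initial value of sum.
    static int expected(int initialSum) {
        return initialSum + 3 * RANGE;
    }

    public static void main(String[] args) {
        // The reference value in the failure message, 1538, matches the call with sum == 2.
        System.out.println(expected(2)); // prints 1538

        // test2 returned 1034, so the shortfall is 504, i.e. the
        // contribution of 168 elements (504 / 3).
        System.out.println((1538 - 1034) / 3); // prints 168
    }
}
```

So the reference value is correct, and the compiled `test2` loses the equivalent of 168 of the 512 element contributions in that call.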
3. C2 JIT code
3.1 C2 JIT code for TestUnorderedReduction::test2 when the test case passes.
```
160 B15: # out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
160 spill R13 -> R10 # spill size = 32
164 B16: # out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
164 add R12, R1, R10, I2L #2 # ptr
168 add R13, R12, #16 # ptr
16c loadV V17, [R13] # vector (sve)
170 add R13, R12, #48 # ptr
174 loadV V18, [R13] # vector (sve)
178 vlsl_imm V19, V17, #1
17c add R13, R12, #80 # ptr
180 vaddI V17, V19, V17
184 loadV V19, [R13] # vector (sve)
188 vlsl_imm V20, V18, #1
18c vaddI V16, V16, V17
190 vaddI V17, V20, V18
194 add R13, R12, #112 # ptr
198 loadV V18, [R13] # vector (sve)
19c vlsl_imm V20, V19, #1
1a0 vaddI V16, V16, V17
1a4 vaddI V17, V20, V19
1a8 add R13, R12, #144 # ptr
1ac loadV V19, [R13] # vector (sve)
1b0 vlsl_imm V20, V18, #1
1b4 add R13, R12, #176 # ptr
1b8 vaddI V18, V20, V18
1bc loadV V20, [R13] # vector (sve)
1c0 vaddI V16, V16, V17
1c4 vlsl_imm V17, V19, #1
1c8 vaddI V16, V16, V18
1cc vaddI V17, V17, V19
1d0 add R13, R12, #208 # ptr
1d4 loadV V18, [R13] # vector (sve)
1d8 vlsl_imm V19, V20, #1
1dc add R12, R12, #240 # ptr
1e0 vaddI V19, V19, V20
1e4 loadV V20, [R12] # vector (sve)
1e8 vaddI V16, V16, V17
1ec vlsl_imm V17, V18, #1
1f0 vaddI V16, V16, V19
1f4 vaddI V17, V17, V18
1f8 vlsl_imm V18, V20, #1
1fc vaddI V16, V16, V17
200 vaddI V17, V18, V20
204 vaddI V16, V16, V17
208 addw R13, R10, #64
20c cmpw R13, #456
210 blt B15 // counted loop end P=0.984636 C=48127.000000
214 B17: # out( B22 B18 ) <- in( B16 ) Freq: 0.999989
214 reduce_addI_sve R0, R2, V16 # KILL V17
220 cmpw R13, #512
224 bge B22 P=0.500000 C=-1.000000
```
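In this listing, each loaded vector is shifted left by one (`vlsl_imm ..., #1`) and then added to itself (`vaddI`) before being accumulated into V16. That pair is how the `3 * data[i]` multiplication shows up in the compiled code. A scalar Java sketch of the same identity, with hypothetical names:

```java
// Illustrative only: TimesThree is a hypothetical helper, not JDK code.
final class TimesThree {
    // Scalar equivalent of "vlsl_imm tmp, x, #1" followed by "vaddI x, tmp, x".
    static int timesThree(int x) {
        return (x << 1) + x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -7, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            // Both sides wrap the same way on int overflow.
            if (timesThree(x) != 3 * x) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("ok");
    }
}
```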
3.2 C2 JIT code for TestUnorderedReduction::test2 when the test case fails.
```
170 B15: # out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
170 spill R13 -> R11 # spill size = 32
174 B16: # out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
174 add R13, R10, R11, I2L #2 # ptr
178 loadV16 V17, [R13, #16] # vector (128 bits)
17c loadV V18, [R13, #32] # vector (sve)
180 vlsl_imm V19, V17, #1
184 loadV V20, [R13, #64] # vector (sve)
188 vaddI V17, V19, V17
18c vlsl_imm V19, V18, #1
190 vaddI V16, V16, V17
194 vaddI V17, V19, V18
198 loadV V18, [R13, #96] # vector (sve)
19c vlsl_imm V19, V20, #1
1a0 vaddI V16, V16, V17
1a4 vaddI V17, V19, V20
1a8 loadV16 V19, [R13, #128] # vector (128 bits)
1ac vlsl_imm V20, V18, #1
1b0 vaddI V16, V16, V17
1b4 vaddI V17, V20, V18
1b8 loadV16 V18, [R13, #144] # vector (128 bits)
1bc vlsl_imm V20, V19, #1
1c0 loadV V21, [R13, #160] # vector (sve)
1c4 vaddI V19, V20, V19
1c8 vaddI V16, V16, V17
1cc vlsl_imm V17, V18, #1
1d0 vaddI V16, V16, V19
1d4 vaddI V17, V17, V18
1d8 loadV V18, [R13, #192] # vector (sve)
1dc vlsl_imm V19, V21, #1
1e0 loadV V20, [R13, #224] # vector (sve)
1e4 vaddI V19, V19, V21
1e8 vaddI V16, V16, V17
1ec vlsl_imm V17, V18, #1
1f0 loadV16 V21, [R13, #256] # vector (128 bits)
1f4 vaddI V17, V17, V18
1f8 vaddI V16, V16, V19
1fc vlsl_imm V18, V20, #1
200 vaddI V16, V16, V17
204 vaddI V17, V18, V20
208 vlsl_imm V18, V21, #1
20c vaddI V16, V16, V17
210 vaddI V17, V18, V21
214 vaddI V16, V16, V17
218 addw R13, R11, #64
21c cmpw R13, #456
220 blt B15 // counted loop end P=0.984636 C=48127.000000
224 B17: # out( B22 B18 ) <- in( B16 ) Freq: 0.999989
224 reduce_addI_neon R0, R2, V16 # KILL V17
230 cmpw R13, #512
234 bge B22 P=0.500000 C=-1.000000
```
From the C2 JIT code, we can see that `reduce_addI_neon R0, R2, V16 # KILL V17` reduces V16, which is produced by the `vaddI V16, V16, V17` instructions in the loop. The failing loop mixes 128-bit loads (`loadV16`) with SVE-length loads (`loadV`), so V16 and V17 have different vector lengths, which may result in omitted or over-processed data.
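To illustrate how a length mismatch can drop data, here is a deliberately simplified model. It assumes an 8-lane (256-bit) accumulator that is then reduced over only its low 4 lanes (128 bits). All names are hypothetical, and the model does not reproduce the exact value 1034 from the failure, because the real loop mixes 128-bit and SVE-length operations and also has pre- and post-loops.

```java
// Simplified model, not the actual C2 behaviour: all names are hypothetical.
final class LaneMismatchModel {
    static final int SVE_LANES = 8;   // 256-bit vector of 32-bit ints
    static final int NEON_LANES = 4;  // 128-bit vector of 32-bit ints

    // Accumulate 3 * data[i] lane-wise into an 8-lane accumulator.
    // Assumes data.length is a multiple of SVE_LANES.
    static int[] accumulate(int[] data) {
        int[] acc = new int[SVE_LANES];
        for (int i = 0; i < data.length; i += SVE_LANES) {
            for (int lane = 0; lane < SVE_LANES; lane++) {
                acc[lane] += 3 * data[i + lane];
            }
        }
        return acc;
    }

    // Sum only the first `lanes` lanes of the accumulator.
    static int reduce(int[] acc, int lanes) {
        int sum = 0;
        for (int lane = 0; lane < lanes; lane++) {
            sum += acc[lane];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[512];
        java.util.Arrays.fill(data, 1);
        int[] acc = accumulate(data);
        System.out.println(reduce(acc, SVE_LANES));  // prints 1536: every lane counted
        System.out.println(reduce(acc, NEON_LANES)); // prints 768: upper four lanes omitted
    }
}
```

In this model, a reduction that is narrower than the accumulator silently discards the upper lanes, which is the kind of omission described above.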