Bug ID: JDK-8183103 Post loop vectorization produces incorrect results

Type: Bug
Component: hotspot
Sub-Component: compiler
Affected Version: 9,10

Priority: P1
Status: Closed
Resolution: Fixed
CPU: x86_64

Submitted: 2017-06-28
Updated: 2017-10-09
Resolved: 2017-07-05

JDK 10	JDK 9
10Fixed	9 b178Fixed

compiler/codegen/TestIntFloatVect.java failed w/ jdk9+175 on hosts w/ Intel SKL -- Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz

#section:main
----------messages:(4/372)----------
command: main -Xbatch -XX:+IgnoreUnrecognizedVMOptions -XX:-TieredCompilation -XX:-OptimizeFill compiler.codegen.TestIntFloatVect
reason: User specified action: run main/othervm/timeout=300 -Xbatch -XX:+IgnoreUnrecognizedVMOptions -XX:-TieredCompilation -XX:-OptimizeFill compiler.codegen.TestIntFloatVect 
Mode: othervm [/othervm specified]
elapsed time (seconds): 0.847
----------configuration:(0/0)----------
----------System.out:(3/52)----------
Testing Integer + Float vectors
Warmup
Verification
----------System.err:(3/85)----------
test_ci_oppos: a1[0] = -1 != -123
test_ci_oppos: a1[1] = -1 != -123
FAILED: 2 errors

Fix verified by regression test.
07-08-2017
Thanks Igor.
05-07-2017
Approved for JDK9.
05-07-2017
Thanks Vladimir. [~epavlova], could you please approve from SQE side?
04-07-2017
Approved for JDK 9.
04-07-2017
Fix Request This is a regression in JDK 9 that causes loops to produce incorrect results on AVX-512 machines. The fix was reviewed by Vladimir K. and is low risk because it just disables an optimization. http://cr.openjdk.java.net/~thartmann/8183103/webrev.00/ Tested with all HotSpot tests on newer hardware and with RBT.
04-07-2017
Here's the fix that disables and makes PostLoopMultiversioning and AVX=3 experimental in JDK 9: http://cr.openjdk.java.net/~thartmann/8183103/webrev.00/ I've executed all HotSpot tests on an Intel SKL machine and will run pre-integration RBT testing. I've filed JDK-8183390 for the full fix in JDK 10.
03-07-2017
More details: C2 compiles the following loop: static void test_ci_oppos(int[] a, float[] b) { int limit = a.length-1; for (int i = 0; i < a.length; i+=1) { a[limit-i] = -123; b[i] = -103.f; } } The SuperWord optimization is only able to vectorize the b[i] (rdx) array access in the main loop: B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N105 Freq: 958.844 movslq R10, RDI # i2l vmovdqul [RDX + #16 + R10 << #2] k0,XMM0 ! store vector (64 bytes) movq R11, R9 # spill subq R11, R10 # long movl [RSI + #12 + R11 << #2], #-123 # int movl [RSI + #8 + R11 << #2], #-123 # int [...] movl [RSI + #-44 + R11 << #2], #-123 # int movl [RSI + #-48 + R11 << #2], #-123 # int addl RDI, #16 # int cmpl RDI, RCX jl B12 # loop end P=0.998958 C=7671.000000 The (correct) post loop looks like this: B18: # B18 B19 <- B17 B18 Loop: B18-B18 inner post of N190 Freq: 0.998953 movl [RDX + #16 + RDI << #2], #-103.000000 # float movslq R10, RDI # i2l movq R11, R9 # spill subq R11, R10 # long movl [RSI + #12 + R11 << #2], #-123 # int incl RDI # int cmpl RDI, RAX jl,s B18 # loop end P=0.500000 C=7671.000000 Now PostLoopMultiversioning vectorizes the post loop: B15: # B15 B16 <- B14 B15 Loop: B15-B15 inner post of N190 Freq: 0.998952 vmovdqul [RDX + #16 + RDI << #2] k0,XMM0 ! store vector (64 bytes) movslq R11, RDI # i2l movq R8, R9 # spill subq R8, R11 # long vmovdqul [RSI + #12 + R8 << #2] k0,XMM1 ! store vector (64 bytes) addl RDI, R10 # int cmpl RDI, RAX jl B15 # loop end This time both b[i] (rdx) and a[limit - i] (rsi) are vectorized but the offset into a is incorrect. For the vmovdqul which writes 16 ints (64 bytes) to 'a' we use offset "R8 = R9 - RDI = limit - i" which will write a[limit - i] a[limit - i + 1] [...] a[limit - i + 15] but we should write to a[limit - i] a[limit - i - 1] [...] a[limit - i - 15] because i is incremented in each loop iteration and therefore "limit - i" decreases. As a result, array elements a[0] ... a[limit - i - 1] are not initialized. Now my understanding of the post loop vectorization optimization is very limited but here is what I found: In SuperWord::SLP_extract() we add both the StoreF and the StoreI nodes from the post loop to a singleton packset, i.e. we directly vectorize these instructions without going through the regular SuperWord optimization steps of selecting adjacent operations and creating/sorting packsets: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/ccfc68592c92#l16.287 In SuperWord::output() we then replace both the StoreF and the StoreI by the corresponding VectorNodes by using the lowest address 'low_adr' from the same packset as input. This works for the StoreF to b[i] because the existing (and only) instruction in the singleton pack set is the one with the lowest address. However, for the StoreI to a[limit-i] this is not correct because the lowest address would be the StoreI to a[limit - i - 15]. For non-post loops this works because all store instructions of the unrolled loop are added to the packset and SuperWord::packset_sort() guarantees that the packset entries are sorted according to alignment, i.e. we alway choose the correct lowest address. This was only triggered by the fix for JDK-8177095 because without the fix post loop vectorization fails with: slp analysis fails: unroll limit greater than max vector CountedLoopReserveKit::create_reserve: 352 not canonical loop SWPointer::output: loop was not reserved correctly, exiting SuperWord
30-06-2017
Here's the workaround fix that makes UseAVX=3 and PostLoopMultiversioning experimental in JDK 9 and disabled by default: http://cr.openjdk.java.net/~thartmann/8183103/webrev.00/
30-06-2017
Here's the method: static void test_ci_oppos(int[] a, float[] b) { int limit = a.length-1; for (int i = 0; i < a.length; i+=1) { a[limit-i] = -123; b[i] = -103.f; } } I attached a debugger and tracked the contents of 'a' = rsi ('b' = rdx is initialized just fine). The problem seems to be that we use a wrong offset into 'a' in the first (vector) post loop: 1bc B15: # B15 B16 <- B14 B15 Loop: B15-B15 inner post of N190 Freq: 0.998952 1bc vmovdqul [RDX + #16 + RDI << #2] k0,XMM0 ! store vector (64 bytes) 1c7 movslq R11, RDI # i2l 1ca movq R8, R9 # spill 1cd subq R8, R11 # long 1d0 vmovdqul [RSI + #12 + R8 << #2] k0,XMM1 ! store vector (64 bytes) 1db addl RDI, R10 # int 1de cmpl RDI, RAX 1e0 jl B15 # loop end Here are the relevant register contents at the 'vmovdqul' instruction: rdi = 982 r10 = 15 r8 = 15 r9 = 997 rax = length = 997 r11 = 982 That means we are initializing elements 982 - 997 (the last 16 elements) of 'b' and should therefore initialize the first 16 elements in 'a'. However, we write at a[R8] with R8 = 15 and therefore fail to initialize the first 14 elements of 'a'. I also noticed that only the stores to 'b' are vectorized in the main loop. The stores to 'a' are not vectorized (although they could be). Maybe this confuses the post loop vectorization.
29-06-2017
Could be related to JDK-8149421 / JDK-8153998/ JDK-8151573. The problem disappears with -XX:-PostLoopMultiversioning.
29-06-2017
Updated ILW = Loop optimizations return incorrect result, regression in JDK 9 that only showed up on Intel Skylake (with AVX3) but may affect other platforms as well, -XX:-PostLoopMultiversioning = HHM = P1
29-06-2017
I'm able to reproduce this with -XX:CompileCommand=compileonly,compiler.codegen.TestIntFloatVect::test_ci_oppos and attached the opto and gdb assembly output for that method.
29-06-2017
Try to run with -XX:+TraceLoopOpts -XX:+TraceSuperWord -XX:+TraceNewVectors with and without 8177095. It could be different loop opts happened which allow different vectorization.
28-06-2017
AVX3 support was added in JDK 9.
28-06-2017
We can fallback to limit AVX in JDK 9 RC if the fix is complicated and do proper fix in JDK 9 update.
28-06-2017
at this point in the release cycle, I'd recommend to change default value for UseAVX on SKL and KNL hosts to 2.
28-06-2017
If it is regression in JDK 9 we should set it as P1 and fix it in JDK 9 RC.
28-06-2017
[~vdeshpande] [~mcberg] Vivek and Michael, please, look on this too. It is urgent. We very short on time to be able fixed it in JDK 9.
28-06-2017
initial ILW = incorrect codegen by C2, at present only reported with Intel Skylake, disable vectorization = HMM = P2
28-06-2017

Relates :	JDK-8183390 - Fix and re-enable post loop vectorization
Relates :	JDK-8184036 - Restore -XX:UseAVX=3 as product value
Relates :	JDK-8183319 - Post loop vectorization fails with "Error mixing types: int and vectorz"
Relates :	JDK-8153998 - Masked vector post loops
Relates :	JDK-8185802 - Enable AVX3 by default
Relates :	JDK-8151573 - Multiversioning for range check elimination