JDK-8175096 : Analyse subword in the loop to set maximum vector size.
  • Type: Enhancement
  • Component: hotspot
  • Sub-Component: compiler
  • Affected Version: 9,10
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2017-02-16
  • Updated: 2018-03-03
  • Resolved: 2017-07-19
JDK 10: 10 b21, Fixed
Description
Currently, subword types (such as byte and short) cannot use the entire vector width when vectorized by SLP (superword-level parallelism).
This work analyzes subword operations in the loop to set the maximum vector size, so that subword types can take advantage of the full vector width.
The analysis checks whether narrowing is likely to happen; if it is, the vector size is set more aggressively.
The possibility of narrowing is checked by looking through chains of operations that use subword types.
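The effect described above can be illustrated with simple arithmetic. The sketch below is not HotSpot code; it assumes 64-byte vectors and a common lane count of 16 dictated by int, which is what the "latest JDK10/hs" listing in the comments suggests (add packed16B on 16-byte vectors next to add packed16I on 64-byte vectors). The class and method names are hypothetical.

```java
public class SubwordLanes {
    // Number of elements of the given size that fit in one vector of vectorBytes.
    static int fullWidthLanes(int vectorBytes, int elementBytes) {
        return vectorBytes / elementBytes;
    }

    // Bytes of the vector actually occupied when every type is held to the same lane count.
    static int bytesUsed(int commonLanes, int elementBytes) {
        return commonLanes * elementBytes;
    }

    public static void main(String[] args) {
        int vectorBytes = 64; // assumption: 512-bit vectors
        // Assumption: the common lane count is the one that fits int elements.
        int commonLanes = fullWidthLanes(vectorBytes, Integer.BYTES);

        String[] names = {"byte", "short", "int"};
        int[] sizes = {Byte.BYTES, Short.BYTES, Integer.BYTES};
        for (int k = 0; k < sizes.length; k++) {
            System.out.println(names[k]
                + ": common lane count uses " + bytesUsed(commonLanes, sizes[k])
                + " of " + vectorBytes + " bytes; full width holds "
                + fullWidthLanes(vectorBytes, sizes[k]) + " lanes");
        }
    }
}
```

Under these assumptions byte data occupies only 16 of 64 bytes per vector, while the full width would hold 64 byte lanes or 32 short lanes.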
Comments
Pre-integration testing passed.
19-07-2017

Main loop for short[]:
0c0 B13: # B13 B14 <- B12 B13 Loop: B13-B13 inner main of N77 Freq: 1024.99
0c0 vmovdqul XMM0 k0,[R10 + #16 + RDX << #1] ! load vector (64 bytes)
0cb vpaddw XMM0,XMM0,[R8 + #16 + RDX << #1] ! add packed32S
0d6 vmovdqul [R11 + #16 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
0e1 vmovdqul XMM0 k0,[R10 + #80 + RDX << #1] ! load vector (64 bytes)
0ec vpaddw XMM0,XMM0,[R8 + #80 + RDX << #1] ! add packed32S
0f7 vmovdqul [R11 + #80 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
102 vmovdqul XMM0 k0,[R10 + #144 + RDX << #1] ! load vector (64 bytes)
10d vpaddw XMM0,XMM0,[R8 + #144 + RDX << #1] ! add packed32S
118 vmovdqul [R11 + #144 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
123 vmovdqul XMM0 k0,[R10 + #208 + RDX << #1] ! load vector (64 bytes)
12e vpaddw XMM0,XMM0,[R8 + #208 + RDX << #1] ! add packed32S
139 vmovdqul [R11 + #208 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
144 vmovdqul XMM0 k0,[R10 + #272 + RDX << #1] ! load vector (64 bytes)
14f vpaddw XMM0,XMM0,[R8 + #272 + RDX << #1] ! add packed32S
15a vmovdqul [R11 + #272 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
165 vmovdqul XMM0 k0,[R10 + #336 + RDX << #1] ! load vector (64 bytes)
170 vpaddw XMM0,XMM0,[R8 + #336 + RDX << #1] ! add packed32S
17b vmovdqul [R11 + #336 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
186 vmovdqul XMM0 k0,[R10 + #400 + RDX << #1] ! load vector (64 bytes)
191 vpaddw XMM0,XMM0,[R8 + #400 + RDX << #1] ! add packed32S
19c vmovdqul [R11 + #400 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
1a7 vmovdqul XMM0 k0,[R10 + #464 + RDX << #1] ! load vector (64 bytes)
1b2 vpaddw XMM0,XMM0,[R8 + #464 + RDX << #1] ! add packed32S
1bd vmovdqul [R11 + #464 + RDX << #1] k0,XMM0 ! store vector (64 bytes)
1c8 addl RDX, #256 # int
1ce cmpl RDX, #769
1d4 jl B13 # loop end P=0.999024 C=24462636.000000
18-07-2017

Main loop for byte[]:
0c0 B13: # B13 B14 <- B12 B13 Loop: B13-B13 inner main of N77 Freq: 1024.99
0c0 movslq R11, R10 # i2l
0c3 vmovdqul XMM0 k0,[RDI + #16 + R11] ! load vector (64 bytes)
0ce vpaddb XMM0,XMM0,[RBX + #16 + R11] ! add packed64B
0d9 vmovdqul [RDX + #16 + R11] k0,XMM0 ! store vector (64 bytes)
0e4 vmovdqul XMM0 k0,[RDI + #80 + R11] ! load vector (64 bytes)
0ef vpaddb XMM0,XMM0,[RBX + #80 + R11] ! add packed64B
0fa vmovdqul [RDX + #80 + R11] k0,XMM0 ! store vector (64 bytes)
105 vmovdqul XMM0 k0,[RDI + #144 + R11] ! load vector (64 bytes)
110 vpaddb XMM0,XMM0,[RBX + #144 + R11] ! add packed64B
11b vmovdqul [RDX + #144 + R11] k0,XMM0 ! store vector (64 bytes)
126 vmovdqul XMM0 k0,[RDI + #208 + R11] ! load vector (64 bytes)
131 vpaddb XMM0,XMM0,[RBX + #208 + R11] ! add packed64B
13c vmovdqul [RDX + #208 + R11] k0,XMM0 ! store vector (64 bytes)
147 addl R10, #256 # int
14e cmpl R10, #769
155 jl B13 # loop end P=0.999024 C=66936736.000000
18-07-2017

Main loop generated on Skylake server:
1d0 B31: # B31 B32 <- B30 B31 Loop: B31-B31 inner main of N161 Freq: 1024.97
1d0 movslq R11, RBP # i2l
1d3 vmovdqul XMM1 k0,[RBX + #16 + R11 << #2] ! load vector (64 bytes)
1de vpaddd XMM1,XMM1,[R9 + #16 + R11 << #2] ! add packed16I
1e9 movdq R10, XMM0 # spill
1ee vmovdqul XMM2 k0,[R10 + #16 + R11] ! load vector (64 bytes)
1f9 vpaddb XMM2,XMM2,[R13 + #16 + R11] ! add packed64B
204 vmovdqul [R14 + #16 + R11] k0,XMM2 ! store vector (64 bytes)
20f vmovdqul XMM2 k0,[RDX + #80 + R11 << #1] ! load vector (64 bytes)
21a vmovdqul XMM3 k0,[R10 + #80 + R11] ! load vector (64 bytes)
225 vpaddb XMM3,XMM3,[R13 + #80 + R11] ! add packed64B
230 vmovdqul [R14 + #80 + R11] k0,XMM3 ! store vector (64 bytes)
23b vmovdqul XMM3 k0,[RAX + #80 + R11 << #1] ! load vector (64 bytes)
246 vmovdqul XMM4 k0,[R10 + #144 + R11] ! load vector (64 bytes)
251 vpaddb XMM4,XMM4,[R13 + #144 + R11] ! add packed64B
25c vmovdqul [R14 + #144 + R11] k0,XMM4 ! store vector (64 bytes)
267 vmovdqul XMM4 k0,[RBX + #80 + R11 << #2] ! load vector (64 bytes)
272 vmovdqul XMM5 k0,[R10 + #208 + R11] ! load vector (64 bytes)
27d vpaddb XMM5,XMM5,[R13 + #208 + R11] ! add packed64B
288 vmovdqul [R14 + #208 + R11] k0,XMM5 ! store vector (64 bytes)
293 vmovdqul XMM5 k0,[RDX + #16 + R11 << #1] ! load vector (64 bytes)
29e vpaddw XMM5,XMM5,[RAX + #16 + R11 << #1] ! add packed32S
2a9 vmovdqul [RSI + #16 + R11 << #1] k0,XMM5 ! store vector (64 bytes)
2b4 vmovdqul XMM5 k0,[RBX + #208 + R11 << #2] ! load vector (64 bytes)
2bf vmovdqul XMM6 k0,[R9 + #208 + R11 << #2] ! load vector (64 bytes)
2ca vmovdqul XMM7 k0,[RBX + #144 + R11 << #2] ! load vector (64 bytes)
2d5 vmovdqul XMM8 k0,[R9 + #80 + R11 << #2] ! load vector (64 bytes)
2e0 vmovdqul XMM9 k0,[R9 + #144 + R11 << #2] ! load vector (64 bytes)
2eb vmovdqul [RDI + #16 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
2f6 vpaddw XMM1,XMM3,XMM2 ! add packed32S
2fc vmovdqul [RSI + #80 + R11 << #1] k0,XMM1 ! store vector (64 bytes)
307 vmovdqul XMM1 k0,[RDX + #144 + R11 << #1] ! load vector (64 bytes)
312 vpaddw XMM1,XMM1,[RAX + #144 + R11 << #1] ! add packed32S
31d vmovdqul XMM2 k0,[RDX + #208 + R11 << #1] ! load vector (64 bytes)
328 vmovdqul XMM3 k0,[RAX + #208 + R11 << #1] ! load vector (64 bytes)
333 vmovdqul [RSI + #144 + R11 << #1] k0,XMM1 ! store vector (64 bytes)
33e vpaddd XMM1,XMM9,XMM7 ! add packed16I
344 vpaddw XMM2,XMM3,XMM2 ! add packed32S
34a vmovdqul [RSI + #208 + R11 << #1] k0,XMM2 ! store vector (64 bytes)
355 vmovdqul XMM2 k0,[RDX + #272 + R11 << #1] ! load vector (64 bytes)
360 vpaddw XMM2,XMM2,[RAX + #272 + R11 << #1] ! add packed32S
36b vmovdqul XMM3 k0,[RDX + #336 + R11 << #1] ! load vector (64 bytes)
376 vmovdqul XMM7 k0,[RAX + #336 + R11 << #1] ! load vector (64 bytes)
381 vmovdqul [RSI + #272 + R11 << #1] k0,XMM2 ! store vector (64 bytes)
38c vpaddd XMM2,XMM4,XMM8 ! add packed16I
392 vmovdqul [RDI + #80 + R11 << #2] k0,XMM2 ! store vector (64 bytes)
39d vmovdqul [RDI + #144 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
3a8 vpaddw XMM1,XMM7,XMM3 ! add packed32S
3ae vmovdqul [RSI + #336 + R11 << #1] k0,XMM1 ! store vector (64 bytes)
3b9 vmovdqul XMM1 k0,[RDX + #400 + R11 << #1] ! load vector (64 bytes)
3c4 vpaddw XMM1,XMM1,[RAX + #400 + R11 << #1] ! add packed32S
3cf vmovdqul XMM2 k0,[RDX + #464 + R11 << #1] ! load vector (64 bytes)
3da vmovdqul XMM3 k0,[RAX + #464 + R11 << #1] ! load vector (64 bytes)
3e5 vmovdqul [RSI + #400 + R11 << #1] k0,XMM1 ! store vector (64 bytes)
3f0 vpaddd XMM1,XMM6,XMM5 ! add packed16I
3f6 vmovdqul [RDI + #208 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
401 vmovdqul XMM1 k0,[RBX + #272 + R11 << #2] ! load vector (64 bytes)
40c vpaddd XMM1,XMM1,[R9 + #272 + R11 << #2] ! add packed16I
417 vmovdqul XMM4 k0,[RBX + #464 + R11 << #2] ! load vector (64 bytes)
422 vmovdqul XMM5 k0,[R9 + #464 + R11 << #2] ! load vector (64 bytes)
42d vmovdqul XMM6 k0,[RBX + #400 + R11 << #2] ! load vector (64 bytes)
438 vmovdqul XMM7 k0,[RBX + #336 + R11 << #2] ! load vector (64 bytes)
443 vmovdqul XMM8 k0,[R9 + #400 + R11 << #2] ! load vector (64 bytes)
44e vmovdqul XMM9 k0,[R9 + #336 + R11 << #2] ! load vector (64 bytes)
459 vmovdqul [RDI + #272 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
464 vpaddw XMM1,XMM3,XMM2 ! add packed32S
46a vmovdqul [RSI + #464 + R11 << #1] k0,XMM1 ! store vector (64 bytes)
475 vpaddd XMM1,XMM7,XMM9 ! add packed16I
47b vmovdqul [RDI + #336 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
486 vpaddd XMM1,XMM8,XMM6 ! add packed16I
48c vmovdqul [RDI + #400 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
497 vpaddd XMM1,XMM5,XMM4 ! add packed16I
49d vmovdqul [RDI + #464 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
4a8 vmovdqul XMM1 k0,[RBX + #528 + R11 << #2] ! load vector (64 bytes)
4b3 vpaddd XMM1,XMM1,[R9 + #528 + R11 << #2] ! add packed16I
4be vmovdqul XMM2 k0,[RBX + #720 + R11 << #2] ! load vector (64 bytes)
4c9 vmovdqul XMM3 k0,[R9 + #720 + R11 << #2] ! load vector (64 bytes)
4d4 vmovdqul XMM4 k0,[RBX + #656 + R11 << #2] ! load vector (64 bytes)
4df vmovdqul XMM5 k0,[RBX + #592 + R11 << #2] ! load vector (64 bytes)
4ea vmovdqul XMM6 k0,[R9 + #656 + R11 << #2] ! load vector (64 bytes)
4f5 vmovdqul XMM7 k0,[R9 + #592 + R11 << #2] ! load vector (64 bytes)
500 vmovdqul [RDI + #528 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
50b vpaddd XMM1,XMM3,XMM2 ! add packed16I
511 vpaddd XMM2,XMM5,XMM7 ! add packed16I
517 vmovdqul [RDI + #592 + R11 << #2] k0,XMM2 ! store vector (64 bytes)
522 vpaddd XMM2,XMM6,XMM4 ! add packed16I
528 vmovdqul [RDI + #656 + R11 << #2] k0,XMM2 ! store vector (64 bytes)
533 vmovdqul [RDI + #720 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
53e vmovdqul XMM1 k0,[RBX + #784 + R11 << #2] ! load vector (64 bytes)
549 vpaddd XMM1,XMM1,[R9 + #784 + R11 << #2] ! add packed16I
554 vmovdqul XMM2 k0,[RBX + #976 + R11 << #2] ! load vector (64 bytes)
55f vmovdqul XMM3 k0,[R9 + #976 + R11 << #2] ! load vector (64 bytes)
56a vmovdqul XMM4 k0,[RBX + #912 + R11 << #2] ! load vector (64 bytes)
575 vmovdqul XMM5 k0,[RBX + #848 + R11 << #2] ! load vector (64 bytes)
580 vmovdqul XMM6 k0,[R9 + #912 + R11 << #2] ! load vector (64 bytes)
58b vmovdqul XMM7 k0,[R9 + #848 + R11 << #2] ! load vector (64 bytes)
596 vmovdqul [RDI + #784 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
5a1 vpaddd XMM1,XMM3,XMM2 ! add packed16I
5a7 vpaddd XMM2,XMM5,XMM7 ! add packed16I
5ad vmovdqul [RDI + #848 + R11 << #2] k0,XMM2 ! store vector (64 bytes)
5b8 vpaddd XMM2,XMM6,XMM4 ! add packed16I
5be vmovdqul [RDI + #912 + R11 << #2] k0,XMM2 ! store vector (64 bytes)
5c9 vmovdqul [RDI + #976 + R11 << #2] k0,XMM1 ! store vector (64 bytes)
5d4 addl RBP, #256 # int
5da cmpl RBP, #769
5e0 jl B31 # loop end P=0.999024 C=103415288.000000
18-07-2017
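As a rough cross-check of the Skylake listing above: the loop counter advances by 256 elements per iteration (addl RBP, #256) and every vector is 64 bytes wide, so each element type needs a different number of vector additions to cover one iteration. The sketch below only does that arithmetic; it is not taken from the patch, and the class and method names are hypothetical.

```java
public class UnrolledOpCount {
    // Vector operations needed to cover `stride` elements per loop iteration
    // when each vector holds vectorBytes / elementBytes elements.
    static int vectorOpsPerIteration(int stride, int vectorBytes, int elementBytes) {
        int lanes = vectorBytes / elementBytes;
        return stride / lanes;
    }

    public static void main(String[] args) {
        int stride = 256;     // from "addl RBP, #256" in the listing
        int vectorBytes = 64; // from "(64 bytes)" in the listing
        System.out.println("byte  adds: " + vectorOpsPerIteration(stride, vectorBytes, Byte.BYTES));
        System.out.println("short adds: " + vectorOpsPerIteration(stride, vectorBytes, Short.BYTES));
        System.out.println("int   adds: " + vectorOpsPerIteration(stride, vectorBytes, Integer.BYTES));
    }
}
```

The results (4, 8, and 16) agree with the number of vpaddb, vpaddw, and vpaddd instructions in the listing.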

Updated Webrev: http://cr.openjdk.java.net/~vdeshpande/8175096/webrev.01/
With this update the generated code for the main loop is like this:
140 B22: # B22 B23 <- B21 B22 Loop: B22-B22 inner main of N119 Freq: 1024.98
140 movslq R10, RBX # i2l
143 vmovdqu XMM0,[R9 + #16 + R10 << #2] ! load vector (32 bytes)
14a vpaddd XMM0,XMM0,[R11 + #16 + R10 << #2] ! add packed8I
151 vmovdqu XMM1,[RDI + #16 + R10] ! load vector (32 bytes)
158 vpaddb XMM1,XMM1,[RAX + #16 + R10] ! add packed32B
15f vmovdqu [RSI + #16 + R10],XMM1 ! store vector (32 bytes)
166 vmovdqu XMM1,[R9 + #112 + R10 << #2] ! load vector (32 bytes)
16d vmovdqu XMM2,[RDI + #48 + R10] ! load vector (32 bytes)
174 vpaddb XMM2,XMM2,[RAX + #48 + R10] ! add packed32B
17b vmovdqu [RSI + #48 + R10],XMM2 ! store vector (32 bytes)
182 vmovdqu XMM2,[R11 + #112 + R10 << #2] ! load vector (32 bytes)
189 vmovdqu XMM3,[RDI + #80 + R10] ! load vector (32 bytes)
190 vpaddb XMM3,XMM3,[RAX + #80 + R10] ! add packed32B
197 vmovdqu [RSI + #80 + R10],XMM3 ! store vector (32 bytes)
19e vmovdqu XMM3,[R9 + #48 + R10 << #2] ! load vector (32 bytes)
1a5 vmovdqu XMM4,[RDI + #112 + R10] ! load vector (32 bytes)
1ac vpaddb XMM4,XMM4,[RAX + #112 + R10] ! add packed32B
1b3 vmovdqu [RSI + #112 + R10],XMM4 ! store vector (32 bytes)
1ba vmovdqu XMM4,[R9 + #80 + R10 << #2] ! load vector (32 bytes)
1c1 vmovdqu XMM5,[R11 + #80 + R10 << #2] ! load vector (32 bytes)
1c8 vmovdqu XMM6,[R11 + #48 + R10 << #2] ! load vector (32 bytes)
1cf vmovdqu [RCX + #16 + R10 << #2],XMM0 ! store vector (32 bytes)
1d6 vpaddd XMM0,XMM2,XMM1 ! add packed8I
1da vpaddd XMM1,XMM3,XMM6 ! add packed8I
1de vmovdqu [RCX + #48 + R10 << #2],XMM1 ! store vector (32 bytes)
1e5 vpaddd XMM1,XMM5,XMM4 ! add packed8I
1e9 vmovdqu [RCX + #80 + R10 << #2],XMM1 ! store vector (32 bytes)
1f0 vmovdqu [RCX + #112 + R10 << #2],XMM0 ! store vector (32 bytes)
1f7 vmovdqu XMM0,[R9 + #144 + R10 << #2] ! load vector (32 bytes)
201 vpaddd XMM0,XMM0,[R11 + #144 + R10 << #2] ! add packed8I
20b vmovdqu XMM1,[R9 + #240 + R10 << #2] ! load vector (32 bytes)
215 vmovdqu XMM2,[R11 + #240 + R10 << #2] ! load vector (32 bytes)
21f vmovdqu XMM3,[R9 + #208 + R10 << #2] ! load vector (32 bytes)
229 vmovdqu XMM4,[R9 + #176 + R10 << #2] ! load vector (32 bytes)
233 vmovdqu XMM5,[R11 + #208 + R10 << #2] ! load vector (32 bytes)
23d vmovdqu XMM6,[R11 + #176 + R10 << #2] ! load vector (32 bytes)
247 vmovdqu [RCX + #144 + R10 << #2],XMM0 ! store vector (32 bytes)
251 vpaddd XMM0,XMM2,XMM1 ! add packed8I
255 vpaddd XMM1,XMM4,XMM6 ! add packed8I
259 vmovdqu [RCX + #176 + R10 << #2],XMM1 ! store vector (32 bytes)
263 vpaddd XMM1,XMM5,XMM3 ! add packed8I
267 vmovdqu [RCX + #208 + R10 << #2],XMM1 ! store vector (32 bytes)
271 vmovdqu [RCX + #240 + R10 << #2],XMM0 ! store vector (32 bytes)
27b vmovdqu XMM0,[R9 + #272 + R10 << #2] ! load vector (32 bytes)
285 vpaddd XMM0,XMM0,[R11 + #272 + R10 << #2] ! add packed8I
28f vmovdqu XMM1,[R9 + #368 + R10 << #2] ! load vector (32 bytes)
299 vmovdqu XMM2,[R11 + #368 + R10 << #2] ! load vector (32 bytes)
2a3 vmovdqu XMM3,[R9 + #336 + R10 << #2] ! load vector (32 bytes)
2ad vmovdqu XMM4,[R9 + #304 + R10 << #2] ! load vector (32 bytes)
2b7 vmovdqu XMM5,[R11 + #336 + R10 << #2] ! load vector (32 bytes)
2c1 vmovdqu XMM6,[R11 + #304 + R10 << #2] ! load vector (32 bytes)
2cb vmovdqu [RCX + #272 + R10 << #2],XMM0 ! store vector (32 bytes)
2d5 vpaddd XMM0,XMM2,XMM1 ! add packed8I
2d9 vpaddd XMM1,XMM4,XMM6 ! add packed8I
2dd vmovdqu [RCX + #304 + R10 << #2],XMM1 ! store vector (32 bytes)
2e7 vpaddd XMM1,XMM5,XMM3 ! add packed8I
2eb vmovdqu [RCX + #336 + R10 << #2],XMM1 ! store vector (32 bytes)
2f5 vmovdqu [RCX + #368 + R10 << #2],XMM0 ! store vector (32 bytes)
2ff vmovdqu XMM0,[R9 + #400 + R10 << #2] ! load vector (32 bytes)
309 vpaddd XMM0,XMM0,[R11 + #400 + R10 << #2] ! add packed8I
313 vmovdqu XMM1,[R9 + #496 + R10 << #2] ! load vector (32 bytes)
31d vmovdqu XMM2,[R11 + #496 + R10 << #2] ! load vector (32 bytes)
327 vmovdqu XMM3,[R9 + #464 + R10 << #2] ! load vector (32 bytes)
331 vmovdqu XMM4,[R9 + #432 + R10 << #2] ! load vector (32 bytes)
33b vmovdqu XMM5,[R11 + #464 + R10 << #2] ! load vector (32 bytes)
345 vmovdqu XMM6,[R11 + #432 + R10 << #2] ! load vector (32 bytes)
34f vmovdqu [RCX + #400 + R10 << #2],XMM0 ! store vector (32 bytes)
359 vpaddd XMM0,XMM2,XMM1 ! add packed8I
35d vpaddd XMM1,XMM4,XMM6 ! add packed8I
361 vmovdqu [RCX + #432 + R10 << #2],XMM1 ! store vector (32 bytes)
36b vpaddd XMM1,XMM5,XMM3 ! add packed8I
36f vmovdqu [RCX + #464 + R10 << #2],XMM1 ! store vector (32 bytes)
379 vmovdqu [RCX + #496 + R10 << #2],XMM0 ! store vector (32 bytes)
383 addl RBX, #128 # int
389 cmpl RBX, #897
38f jl B22 # loop end P=0.999024 C=29277936.000000
13-07-2017

So it looks like this is not the correct place for the fix. Loop unrolling should be increased too, but the main fix should be to allow a different vector element count for different types.
19-05-2017

I got the same (not unrolled) result with just a simple change:
@@ -345,7 +345,7 @@
 // Map the maximal common vector
 if (VectorNode::implemented(n->Opcode(), cur_max_vector, bt)) {
-  if (cur_max_vector < max_vector) {
+  if (cur_max_vector > max_vector) {
     max_vector = cur_max_vector;
   }
19-05-2017
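The one-character change in the diff above flips which value survives the comparison: with `<` the smallest per-operation element count is kept, with `>` the largest. The sketch below mimics only that comparison. It is not the HotSpot code: the report does not show how max_vector is initialized or which nodes are visited, so the `start` parameter, the sample lane counts (64, 32, 16 for byte, short, int on a 64-byte vector), and the class and method names are all hypothetical.

```java
public class CommonVectorPick {
    // Fold per-operation lane counts into a single value.
    // keepLarger == false mirrors "cur_max_vector < max_vector" (keeps the minimum);
    // keepLarger == true mirrors "cur_max_vector > max_vector" (keeps the maximum).
    static int pick(int start, int[] perOpLanes, boolean keepLarger) {
        int maxVector = start; // hypothetical initial value
        for (int cur : perOpLanes) {
            boolean replace = keepLarger ? cur > maxVector : cur < maxVector;
            if (replace) {
                maxVector = cur;
            }
        }
        return maxVector;
    }

    public static void main(String[] args) {
        int[] lanes = {64, 32, 16}; // assumed: byte, short, int on a 64-byte vector
        System.out.println("with '<': " + pick(lanes[0], lanes, false));
        System.out.println("with '>': " + pick(lanes[0], lanes, true));
    }
}
```

With these sample values the `<` form yields 16 and the `>` form yields 64.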

I ran it with the following command:
$JAVA_HOME/bin/java -XX:-TieredCompilation -XX:+PrintCompilation -Xbatch -XX:CICompilerCount=1 -XX:+PrintOptoAssembly -XX:CompileOnly=TestVect.doit2 TestVect
19-05-2017

As you can see, the only difference produced by the patch is that the loop with vector operations is no longer unrolled, and the number of elements in the vectors does not change. I think it is a regression and not an improvement.
19-05-2017

With webrev.00 changes the loop code is:
1e0 B31: # B32 <- B32 top-of-loop Freq: 992.001
1e0 movdq XMM0, R9 # spill
1e5 movdq XMM1, R11 # spill
1e5
1ea B32: # B31 B33 <- B30 B31 Loop: B32-B31 inner main of N161 Freq: 993.001
1ea movslq R10, RDI # i2l
1ed movdq R11, XMM1 # spill
1f2 movdqu XMM1,[R11 + #16 + R10] ! load vector (16 bytes)
1f9 movdq R9, XMM0 # spill
1fe vpaddb XMM0,XMM1,[R9 + #16 + R10] ! add packed16B
205 movdqu [R14 + #16 + R10],XMM0 ! store vector (16 bytes)
20c vmovdqul XMM0 k0,[RCX + #16 + R10 << #2] ! load vector (64 bytes)
217 vpaddd XMM0,XMM0,[RBX + #16 + R10 << #2] ! add packed16I
222 vmovdqul [RDX + #16 + R10 << #2] k0,XMM0 ! store vector (64 bytes)
22d vmovdqu XMM0,[RSI + #16 + R10 << #1] ! load vector (32 bytes)
234 vpaddw XMM0,XMM0,[R13 + #16 + R10 << #1] ! add packed16S
23b vmovdqu [R8 + #16 + R10 << #1],XMM0 ! store vector (32 bytes)
242 addl RDI, #16 # int
245 cmpl RDI, #1009
24b jl,s B31 # loop end P=0.998993 C=6944.000000
19-05-2017

$ cat TestVect.java
public class TestVect {
  public static void main(String[] args) {
    System.out.println("Speed: " + count() + " ops/s");
  }

  static final int NUM = 1024;
  static final int LIM = 10000;
  static byte[] data = new byte[NUM], data2 = new byte[NUM], data3 = new byte[NUM];
  static short[] data4 = new short[NUM], data5 = new short[NUM], data6 = new short[NUM];
  static int[] data7 = new int[NUM], data8 = new int[NUM], data9 = new int[NUM];

  public static double count() {
    // Warmup
    for (int i = 0; i < 1000; i++) {
      doit2();
    }
    long time1, time0 = System.nanoTime();
    for (int i = 0; i < LIM; i++) {
      doit2();
    }
    time1 = System.nanoTime();
    return 1f*10000/(time1-time0)*1e9;
  }

  public static void doit2() {
    for (int i = 0; i < NUM; i++) {
      data[i] = (byte)(data2[i] + data3[i]);
      data4[i] = (short)(data5[i] + data6[i]);
      data7[i] = data8[i] + data9[i];
    }
  }
}
19-05-2017

Here is the currently generated code in the 'main' loop (latest JDK10/hs):
1e0 B31: # B32 <- B32 top-of-loop Freq: 992.001
1e0 movdq XMM0, R8 # spill
1e5 movdq XMM1, R11 # spill
1e5
1ea B32: # B31 B33 <- B30 B31 Loop: B32-B31 inner main of N169 Freq: 993.001
1ea movslq R10, RBX # i2l
1ed movdq R11, XMM1 # spill
1f2 movdqu XMM1,[R11 + #16 + R10] ! load vector (16 bytes)
1f9 movdq R8, XMM0 # spill
1fe vpaddb XMM0,XMM1,[R8 + #16 + R10] ! add packed16B
205 movdqu [R14 + #16 + R10],XMM0 ! store vector (16 bytes)
20c vmovdqul XMM0 k0,[RCX + #16 + R10 << #2] ! load vector (64 bytes)
217 vpaddd XMM0,XMM0,[RDI + #16 + R10 << #2] ! add packed16I
222 vmovdqul [RDX + #16 + R10 << #2] k0,XMM0 ! store vector (64 bytes)
22d movdqu XMM0,[R11 + #32 + R10] ! load vector (16 bytes)
234 vpaddb XMM0,XMM0,[R8 + #32 + R10] ! add packed16B
23b movdqu [R14 + #32 + R10],XMM0 ! store vector (16 bytes)
242 vmovdqul XMM0 k0,[RCX + #80 + R10 << #2] ! load vector (64 bytes)
24d vpaddd XMM0,XMM0,[RDI + #80 + R10 << #2] ! add packed16I
258 vmovdqul [RDX + #80 + R10 << #2] k0,XMM0 ! store vector (64 bytes)
263 vmovdqu XMM0,[RAX + #16 + R10 << #1] ! load vector (32 bytes)
26a vpaddw XMM0,XMM0,[R13 + #16 + R10 << #1] ! add packed16S
271 vmovdqu [R9 + #16 + R10 << #1],XMM0 ! store vector (32 bytes)
278 vmovdqu XMM0,[RAX + #48 + R10 << #1] ! load vector (32 bytes)
27f vpaddw XMM0,XMM0,[R13 + #48 + R10 << #1] ! add packed16S
286 vmovdqu [R9 + #48 + R10 << #1],XMM0 ! store vector (32 bytes)
28d addl RBX, #32 # int
290 cmpl RBX, #993
296 jl B31 # loop end P=0.998993 C=6944.000000
19-05-2017

Yes, it should go into JDK 10. We can consider backporting it to 9u later.
22-02-2017

This looks like a performance enhancement to me and not a correctness bug. I'd recommend we fix this in 10 (and set the "Fix Version" accordingly).
17-02-2017

http://cr.openjdk.java.net/~vdeshpande/8175096/webrev.00/
16-02-2017