Relates :
|
This uses SIMD ldp/stp Qx, Qy instructions instead of scalar ldp/stp instructions, thereby loading/storing 32 bytes at a time instead of 16. It also extends the small copy code to copy 0-96 instead of 0-80 (because 80 is not divisible by 32). This improves performance on some micro-arches and not on others so I have provided a -XX:+UseSIMDForMemoryOps switch which defaults to false (we could look at enabling this by default for micro-arches where we know SIMD is better).