Spotted this during related performance work. If you run the current ByteBuffer microbenchmarks, one of them stands out:
```
% CONF=linux-x86_64-server-release make images test TEST="micro:ByteBuffers.testDirect.*Long" MICRO="FORK=1;OPTIONS=-p size=131072"
ByteBuffers.testDirectLoopGetLong: 1904.220 +- 0.555 ns/op
ByteBuffers.testDirectLoopGetLongRO: 1914.562 +- 7.225 ns/op
ByteBuffers.testDirectLoopGetLongSwap: 4839.337 +- 2.398 ns/op <---- !!!
ByteBuffers.testDirectLoopGetLongSwapRO: 1902.759 +- 0.812 ns/op
ByteBuffers.testDirectLoopPutLong: 2068.266 +- 2.197 ns/op
ByteBuffers.testDirectLoopPutLongSwap: 2104.532 +- 2.153 ns/op
```
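For context, the get-loop benchmarks are essentially summing loops over a direct ByteBuffer. A minimal JMH sketch of the shape being measured (class and method names here are made up for illustration, not the actual JDK benchmark source):

```java
import java.nio.ByteBuffer;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class DirectGetLongSketch {

    @Param("131072")
    int size;              // bytes, matching -p size=131072 above

    ByteBuffer direct;

    @Setup
    public void setup() {
        direct = ByteBuffer.allocateDirect(size);
    }

    // Shape of testDirectLoopGetLong: sum longs read at absolute offsets.
    // This is the loop the JIT does (or does not) auto-vectorize.
    @Benchmark
    public long loopGetLong() {
        long acc = 0;
        for (int i = 0; i < size; i += 8) {
            acc += direct.getLong(i);
        }
        return acc;
    }
}
```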
testDirectLoopGetLongSwap is way out of band, with over 2x throughput loss.
Perfasm shows that in the bad case the loop is not auto-vectorized: there is a sequence of scalar 8-byte reads+adds. The good cases are all auto-vectorized with 256-byte reads. What is even funkier is that the bad case gets "repaired" when you ask for the read-only (RO) version of the buffer, see testDirectLoopGetLongSwapRO!
(Note that "swap" is misleading: it "swaps" the default big-endian ByteBuffer order to little-endian, which matches the native x86 order.)
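To make the naming concrete, here is a hedged sketch of how the "Swap" and "RO" views are presumably set up (assumed for illustration, not the actual benchmark code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SwapViewsSketch {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(131072);

        // "Swap": byte order flipped from the default big-endian to little-endian,
        // the native order on x86, so getLong() should not need an actual bswap.
        ByteBuffer swap = bb.order(ByteOrder.LITTLE_ENDIAN);

        // "SwapRO": a read-only view with the same little-endian order. Per the
        // numbers above, the get-loop over this view auto-vectorizes again, while
        // the loop over the writable swapped buffer does not.
        ByteBuffer swapRO = bb.asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);

        System.out.println(swap.order() + " / " + swapRO.order()
                + " / readOnly=" + swapRO.isReadOnly());
    }
}
```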
This reliably reproduces on a Xeon Platinum 8124M. I have not investigated deeply (at least not yet).