The attached benchmark has interesting results:
```
Benchmark                               (size)  Mode  Cnt    Score   Error  Units
TestLoadBytes.arrayScalar                 1024  avgt   10  241.256 ± 1.028  ns/op
TestLoadBytes.arrayScalarConst            1024  avgt   10  244.251 ± 5.218  ns/op
TestLoadBytes.bufferNativeScalar          1024  avgt   10  262.128 ± 1.251  ns/op
TestLoadBytes.bufferNativeScalarConst     1024  avgt   10  250.552 ± 2.710  ns/op
TestLoadBytes.segmentNativeScalar         1024  avgt   10  722.670 ± 6.427  ns/op
TestLoadBytes.segmentNativeScalarConst    1024  avgt   10  253.419 ± 3.043  ns/op
```
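Access through a native segment (`segmentNativeScalar`) is almost 3x slower than through a byte buffer (722.670 vs. 262.128 ns/op), while the `Const` variant of the same segment benchmark is on par with the array and buffer cases. Looking at the generated compiled code, it seems that all the time is spent in the post-loop, and that the main loop (which appears to be unrolled correctly) is never executed.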
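For readers without the attachment, here is a minimal sketch of the loop shapes being compared. It is not the attached benchmark: the class name `LoadBytesSketch`, the helper `filledSegment`, and the reading of `Const` as "loop bound is a compile-time constant" are all assumptions made for illustration, and it assumes JDK 22 or later, where `java.lang.foreign` is a final API.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;

public class LoadBytesSketch {
    // Hypothetical size, matching the (size) = 1024 parameter in the results.
    static final int SIZE = 1024;

    // Assumed shape of the slow case: the loop bound is read from the segment.
    static int segmentScalar(MemorySegment segment) {
        int sum = 0;
        for (int i = 0; i < segment.byteSize(); i++) {
            sum += segment.get(ValueLayout.JAVA_BYTE, i);
        }
        return sum;
    }

    // Assumed shape of the Const case: the loop bound is a compile-time constant.
    static int segmentScalarConst(MemorySegment segment) {
        int sum = 0;
        for (int i = 0; i < SIZE; i++) {
            sum += segment.get(ValueLayout.JAVA_BYTE, i);
        }
        return sum;
    }

    // Byte buffer equivalent, with the bound read from the buffer.
    static int bufferScalar(ByteBuffer buffer) {
        int sum = 0;
        for (int i = 0; i < buffer.capacity(); i++) {
            sum += buffer.get(i);
        }
        return sum;
    }

    // Hypothetical helper: allocates SIZE native bytes and fills them with (byte) i.
    static MemorySegment filledSegment(Arena arena) {
        MemorySegment segment = arena.allocate(SIZE);
        for (int i = 0; i < SIZE; i++) {
            segment.set(ValueLayout.JAVA_BYTE, i, (byte) i);
        }
        return segment;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = filledSegment(arena);
            ByteBuffer buffer = segment.asByteBuffer();
            // All three loops read the same bytes, so the sums must agree.
            System.out.println(segmentScalar(segment) == segmentScalarConst(segment));
            System.out.println(segmentScalar(segment) == bufferScalar(buffer));
        }
    }
}
```

The sketch only checks that the variants compute the same result; it does not measure anything. Reproducing the timings requires running the loops under JMH, as the attached benchmark does.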