This benchmark:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Warmup;
import sun.misc.Unsafe;

import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.concurrent.TimeUnit;

import static jdk.incubator.foreign.MemoryLayout.PathElement.sequenceElement;
import static jdk.incubator.foreign.MemoryLayouts.JAVA_FLOAT;
import static jdk.incubator.foreign.MemoryLayouts.JAVA_INT;

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(org.openjdk.jmh.annotations.Scope.Thread)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(value = 3, jvmArgsAppend = { "--add-modules=jdk.incubator.foreign" })
public class LoopOverPolluted {

    static final int ELEM_SIZE = 1_000_000;
    static final int CARRIER_SIZE = (int) JAVA_INT.byteSize();
    static final int ALLOC_SIZE = ELEM_SIZE * CARRIER_SIZE;

    static final Unsafe unsafe = Utils.unsafe;

    ByteBuffer bb = ByteBuffer.allocateDirect(ALLOC_SIZE).order(ByteOrder.nativeOrder());
    byte[] arr = new byte[ALLOC_SIZE];
    FloatBuffer fb = ByteBuffer.wrap(arr).order(ByteOrder.nativeOrder()).asFloatBuffer();

    @Setup
    public void setup() {
        for (int i = 0; i < ELEM_SIZE; i++) {
            bb.putFloat(i * 4, i);
        }
        for (int i = 0; i < ELEM_SIZE; i++) {
            fb.put(i, i);
        }
    }

    @TearDown
    public void tearDown() {
        unsafe.invokeCleaner(bb);
        arr = null;
        fb = null;
    }

    @Benchmark
    public int byte_buffer_get_float() {
        int sum = 0;
        for (int k = 0; k < ELEM_SIZE; k++) {
            bb.putFloat(k, (float)k + 1);
            float v = bb.getFloat(k * 4);
            sum += (int)v;
        }
        return sum;
    }

    @Benchmark
    public int float_buffer_get() {
        int sum = 0;
        for (int k = 0; k < ELEM_SIZE; k++) {
            fb.put(k, k + 1);
            float v = fb.get(k);
            sum += (int)v;
        }
        return sum;
    }

    @Benchmark
    public int unsafe_get_float() {
        int sum = 0;
        for (int k = 0; k < ALLOC_SIZE; k += 4) {
            unsafe.putFloat(arr, k + Unsafe.ARRAY_BYTE_BASE_OFFSET, k + 1);
            float v = unsafe.getFloat(arr, k + Unsafe.ARRAY_BYTE_BASE_OFFSET);
            sum += (int)v;
        }
        return sum;
    }
}
It reveals a performance regression between Java 15 and Java 16. Here are the results on Java 15:
Benchmark                               Mode  Cnt  Score   Error  Units
LoopOverPolluted.byte_buffer_get_float  avgt   30  0.802 ± 0.011  ms/op
LoopOverPolluted.float_buffer_get       avgt   30  0.789 ± 0.009  ms/op
LoopOverPolluted.unsafe_get_float       avgt   30  0.494 ± 0.006  ms/op
On Java 16 we get this:
Benchmark                               Mode  Cnt  Score   Error  Units
LoopOverPolluted.byte_buffer_get_float  avgt   30  0.590 ± 0.012  ms/op
LoopOverPolluted.float_buffer_get       avgt   30  2.432 ± 0.060  ms/op
LoopOverPolluted.unsafe_get_float       avgt   30  0.504 ± 0.008  ms/op
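Only float_buffer_get, which goes through a heap view, regresses: it goes from 0.789 to 2.432 ms/op, roughly 3x slower. byte_buffer_get_float is faster on Java 16 and unsafe_get_float is essentially unchanged. This is likely caused by profile pollution in ScopedMemoryAccess, which is now used by the ByteBuffer API to access memory (at least in the heap views).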
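For readers unfamiliar with the term, the following is a minimal sketch of what profile pollution looks like in general. It is an analogy only, not the ScopedMemoryAccess code: the class ProfilePollutionSketch and the types Source, HeapSource, ConstantSource and NegatedSource are hypothetical names invented for this illustration. The point is that a single shared helper has a single profile, so every kind of caller contributes to it, and a caller that would be fast in isolation may be compiled against a profile that no longer describes it.

```java
// Hypothetical illustration of profile pollution; none of these names come from the JDK.
public class ProfilePollutionSketch {

    interface Source {
        float get(int index);
    }

    // Array-backed source, standing in for "the case we care about".
    static final class HeapSource implements Source {
        private final float[] data;
        HeapSource(float[] data) { this.data = data; }
        public float get(int index) { return data[index]; }
    }

    // Two further implementations whose only role is to add types to the profile.
    static final class ConstantSource implements Source {
        private final float value;
        ConstantSource(float value) { this.value = value; }
        public float get(int index) { return value; }
    }

    static final class NegatedSource implements Source {
        public float get(int index) { return -index; }
    }

    // Shared helper: src.get(i) is one bytecode location, so the JIT keeps one
    // type profile for it no matter which caller reached this method.
    static int sum(Source src, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (int) src.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] data = new float[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        Source heap = new HeapSource(data);

        // "Pollute" the shared profile: the call site in sum() now sees three
        // receiver types instead of one.
        Source[] others = { new ConstantSource(2f), new NegatedSource() };
        int sink = 0;
        for (int round = 0; round < 20_000; round++) {
            sink += sum(others[round % others.length], 10);
        }

        // The result is still correct; only the speed of the compiled code is
        // affected, which is what a benchmark such as the one above measures.
        System.out.println(sum(heap, data.length) + " (sink=" + sink + ")");
    }
}
```

Whether this toy program shows a measurable slowdown depends on the JVM version and its inlining decisions, so it should be read as an explanation of the mechanism and not as a reproducer for the regression reported here.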