| Other |
|---|
| tbdUnresolved |
|
First investigation into benchmarks done here: https://github.com/openjdk/jdk/pull/26747#issuecomment-3269114783 / JDK-8365290.

It seems to me that people are making decisions about fill and copy intrinsics based on benchmarks that are noisy and don't properly control for alignment, which can give us misleading results. It turns out that we barely have any fill and copy benchmarks that really test automatic alignment. We should also compare to auto-vectorization performance.

We should test Arrays.fill and System.arraycopy, but also some MemorySegment bulk operations. Then also compare to naive loops, both with intrinsics enabled and disabled: -XX:-OptimizeFill

Also look at JDK-8299808, and the discussion there.

We could take a similar approach as in JDK-8355094 with: test/micro/org/openjdk/bench/vm/compiler/VectorAutoAlignment.java

We should also go through the benchmarks mentioned in https://github.com/openjdk/jdk/pull/26747#issuecomment-3269114783 and see if they still behave as the comments in them suggest:
- alignment assumptions
- performance assumptions / comparison with SuperWord, especially after JDK-8324751.

This is also a really good way to better understand the performance of auto-vectorization (SuperWord) on small iteration counts. This is where the intrinsics are currently much better than auto-vectorization. See also JDK-8344085. But it is possible that auto-vectorization is actually faster with large iteration counts.

For MemorySegment, we already have:
- ./test/micro/org/openjdk/bench/java/lang/foreign/BulkOps.java
- ./test/micro/org/openjdk/bench/java/lang/foreign/SegmentBulkFill.java
- ./test/micro/org/openjdk/bench/java/lang/foreign/SegmentBulkCopy.java

We should also make sure to check fill for zero separately, since some platforms are much faster when they zero out memory.

We should also check the impact of Lilliput / CompactObjectHeaders, as those change the alignment of some element types.

We should also benchmark Oop copy / fill.
Auto-vectorization could pay off here too, though it would be harder because of GC barriers in the vectorized LoadP and StoreP.

./java -XX:CompileCommand=compileonly,TestOopCopy::copy* -XX:CompileCommand=printcompilation,TestOopCopy::copy* -Xbatch TestOopCopy.java
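
The TestOopCopy.java file used in the command above is not attached here. A minimal sketch of what such a test might look like follows. The method names copyLoop and copyArraycopy, the array size, and the iteration count are all assumptions, chosen only so that the methods match the copy* pattern in the CompileCommand options.

```java
import java.util.Arrays;

// Hypothetical reconstruction: the real TestOopCopy.java is not shown in this issue.
class TestOopCopy {
    static final int SIZE = 1_000;        // assumed array length
    static final int ITERATIONS = 10_000; // assumed; enough calls to trigger compilation

    // Naive element-wise copy of object references (oops).
    // dst must be at least as long as src.
    static void copyLoop(Object[] src, Object[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // The same copy expressed through System.arraycopy, for comparison.
    static void copyArraycopy(Object[] src, Object[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        Object[] src = new Object[SIZE];
        for (int i = 0; i < SIZE; i++) {
            src[i] = Integer.valueOf(i);
        }
        Object[] dstLoop = new Object[SIZE];
        Object[] dstArraycopy = new Object[SIZE];

        // Call both variants repeatedly so that the JIT compiles them.
        for (int i = 0; i < ITERATIONS; i++) {
            copyLoop(src, dstLoop);
            copyArraycopy(src, dstArraycopy);
        }

        // Sanity check: both destinations must hold the same references as src.
        if (!Arrays.equals(src, dstLoop) || !Arrays.equals(src, dstArraycopy)) {
            throw new AssertionError("copy mismatch");
        }
        System.out.println("done");
    }
}
```

With a file like this, the compileonly option restricts compilation to the two copy methods, which keeps the printed compilation output focused on the code of interest.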
|
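
The comparison described in this issue (library fill versus a naive loop, at varying start offsets, with zero and non-zero fill values) could be sketched roughly as below. This is only an illustration of the variants to compare, not a proposed benchmark: the class name FillAlignmentSketch, the helper names, the sizes, and the crude System.nanoTime timing are all assumptions, and a real microbenchmark would use JMH under test/micro instead. Shifting the start offset changes the address of the first written element relative to the array base, which is one simple way to vary alignment.

```java
import java.util.Arrays;

// Hypothetical sketch; names and parameters are illustrative only.
class FillAlignmentSketch {

    // Library fill over the range [offset, offset + len).
    static void fillLibrary(byte[] a, int offset, int len, byte v) {
        Arrays.fill(a, offset, offset + len, v);
    }

    // Naive loop over the same range.
    static void fillLoop(byte[] a, int offset, int len, byte v) {
        for (int i = offset; i < offset + len; i++) {
            a[i] = v;
        }
    }

    // Crude timing helper; far noisier than JMH and only meant for illustration.
    static long timeNanos(Runnable r, int reps) {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            r.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int len = 64;       // small iteration count, assumed
        final int reps = 200_000; // assumed repetition count
        byte[] a = new byte[len + 16];

        for (int offset = 0; offset < 8; offset++) {
            final int off = offset; // lambdas need an effectively final copy
            // Check zero separately from a non-zero value.
            for (byte v : new byte[] {0, 42}) {
                Runnable lib = () -> fillLibrary(a, off, len, v);
                Runnable loop = () -> fillLoop(a, off, len, v);
                // Warm-up runs, results discarded.
                timeNanos(lib, reps);
                timeNanos(loop, reps);
                long tLib = timeNanos(lib, reps);
                long tLoop = timeNanos(loop, reps);
                System.out.printf("offset=%d value=%d library=%dns loop=%dns%n",
                                  off, v, tLib, tLoop);
            }
        }
    }
}
```

Running such a program once with default flags and once with -XX:-OptimizeFill would give the enabled and disabled variants mentioned above.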