JDK 18 | Resolved
Apache Lucene is still working on integrating Project Panama for accessing memory-mapped files. We now have an implementation for JDK 17 (4th incubator of Panama) available which performs quite well for direct access to off-heap memory using VarHandles. You can get more details about the performance problems described here by looking at the benchmark results posted there: https://github.com/apache/lucene/pull/177

There is still the requirement to copy data from off-heap memory segments to standard Java arrays (and vice versa), so that normal on-heap processing with the Java language is possible. With the vector APIs we will certainly move away from that more and more, but currently we often need to read data into byte[], float[] or long[] to process it with conventional code. Lucene's API for this copying looks like:

private static void copySegmentToHeap(MemorySegment src, long srcOffset, byte[] target, int targetOffset, int len) {
  MemorySegment.ofArray(target).asSlice(targetOffset, len).copyFrom(src.asSlice(srcOffset, len));
}

We analyzed the performance of this in our code (it is often called millions of times with quite small byte arrays, but also with huge ones). To my knowledge of escape analysis, the code above should IMHO optimize perfectly and remove all allocations of HeapMemorySegment.OfByte instances, but it doesn't! The problems with this code are:

(1) It produces tons of heap allocations (we ran our benchmark with JFR enabled):

WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 14415681 events (total: 5004408M)
  tests.profile.mode=heap
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT  HEAP SAMPLES  STACK
 55.11%      2757776M  jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
 22.41%      1121632M  jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#dup()
 20.85%      1043452M  jdk.internal.foreign.MappedMemorySegmentImpl#dup()
  0.71%        35428M  jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#dup()
  0.18%         9206M  jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#fromArray()
  0.12%         6150M  org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
  0.09%         4729M  org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
  0.09%         4620M  org.apache.lucene.util.FixedBitSet#<init>()
  0.06%         3099M  java.util.AbstractList#iterator()
  0.04%         1773M  java.util.ArrayList#grow()
  0.03%         1360M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
  0.02%         1090M  java.util.ArrayList#iterator()
  0.02%         1085M  org.apache.lucene.util.PriorityQueue#<init>()
  0.02%          908M  org.apache.lucene.util.ArrayUtil#growExact()
  0.02%          796M  org.apache.lucene.queryparser.charstream.FastCharStream#refill()
  0.02%          787M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
  0.01%          577M  org.apache.lucene.util.BytesRef#<init>()
  0.01%          537M  jdk.internal.misc.Unsafe#allocateUninitializedArray()
  0.01%          517M  org.apache.lucene.util.fst.ByteSequenceOutputs#read()
  0.01%          394M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
  0.01%          370M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
  0.01%          366M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
  0.01%          352M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
  0.01%          349M  org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
  0.01%          322M  org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
  0.01%          317M  org.apache.lucene.codecs.lucene90.ForUtil#<init>()
  0.01%          295M  org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
  0.01%          261M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
  0.00%          240M  org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
  0.00%          237M  java.util.Arrays#copyOfRange()

(2) The code is much slower than our old MappedByteBuffer code (which simply set position() on the ByteBuffer and called the getBytes(byte[], int offset, int length) method).

So it looks like there is some problem with escape analysis. It does not matter whether we enable tiered compilation or not; the results are similar. With tiered compilation enabled, less garbage seems to be created, but there is still a lot of it. The JFR output of our old MappedByteBuffer#getBytes() code shows the same heap statistics, except that the first entries (HeapMemorySegment) are missing.
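For comparison, here is a minimal sketch of what the old ByteBuffer-based copy path in (2) looks like (simplified, with illustrative names; this is not Lucene's actual code):

import java.nio.ByteBuffer;

// Sketch only: the pre-Panama copy path using a (mapped) ByteBuffer.
final class ByteBufferCopySketch {
  static void copyBufferToHeap(ByteBuffer src, int srcOffset, byte[] target, int targetOffset, int len) {
    src.position(srcOffset);              // seek to the source position inside the buffer
    src.get(target, targetOffset, len);   // bulk copy into the heap array, no wrapper objects needed
  }
}

This does the same bounds-checked bulk copy without materializing any intermediate segment/slice objects, which is consistent with the JFR statistics of the old code mentioned above.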
Nevertheless, we would like to have utility methods like the copySegmentToHeap example above in the MemoryAccess class, for all array types. Such utility methods could also be annotated with @ForceInline or similar, which cannot be done in our code.

In addition, there is a copy operation missing that takes a byte order and automatically swaps bytes. E.g., for copying from a MemorySegment to a long[] or float[] we first have to check that the endianness of the platform matches the data in the file. Therefore it would be good to have a copy method that allows specifying the byte order of the source/target memory segment. As mentioned above, it is essential to have methods that take offset/length parameters to avoid the mess of wrapping.

In short: please add all combinations of System.arraycopy variants for transferring slices from/to MemorySegments and on-heap arrays.

WORKAROUND: As a workaround, I implemented the following hack:

private static void copySegmentToHeap(MemorySegment src, long srcOffset, byte[] target, int targetOffset, int len) {
  Objects.checkFromIndexSize(srcOffset, len, src.byteSize());
  theUnsafe.copyMemory(null, src.address().toRawLongValue() + srcOffset,
      target, Unsafe.ARRAY_BYTE_BASE_OFFSET + targetOffset, len);
  //MemorySegment.ofArray(target).asSlice(targetOffset, len).copyFrom(src.asSlice(srcOffset, len));
}

When using this code, the performance is identical to our previous ByteBuffer-based code, with no object allocations! The JFR statistics on heap allocations look identical to our old code.

I also implemented another workaround that used a simple for-loop to copy instead of sun.misc.Unsafe. This worked well for small arrays, but was much slower for large ones (because it does not seem to get optimized into 8-bytes-at-a-time copying).

Of course this code is not safe, as it does not use ScopedMemoryAccess, but it was enough for performance testing. I propose to add methods like the implementation I did (using ScopedMemoryAccess) to the MemoryAccess class, and also to add byte swapping as needed (to adapt the byte order).
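To illustrate the kind of helper we are asking for (a sketch only, assuming JDK 17's jdk.incubator.foreign API; the class and method names are made up, and the shown implementation is just a bulk copy plus a conditional swap, not a proposal for how the JDK should implement it internally):

import java.nio.ByteOrder;
import jdk.incubator.foreign.MemorySegment;

// Sketch: byte-order-aware bulk copy from a MemorySegment slice into a long[] (hypothetical helper).
final class SegmentCopySketch {
  static void copySegmentToLongArray(MemorySegment src, long srcByteOffset,
                                     long[] target, int targetOffset, int numLongs,
                                     ByteOrder srcOrder) {
    final long numBytes = (long) numLongs * Long.BYTES;
    // bulk copy of the raw bytes (today this is the part that allocates the HeapMemorySegment wrappers)
    MemorySegment.ofArray(target)
        .asSlice((long) targetOffset * Long.BYTES, numBytes)
        .copyFrom(src.asSlice(srcByteOffset, numBytes));
    // adapt the byte order only if the data's endianness differs from the platform's
    if (srcOrder != ByteOrder.nativeOrder()) {
      for (int i = targetOffset; i < targetOffset + numLongs; i++) {
        target[i] = Long.reverseBytes(target[i]);
      }
    }
  }
}

A JDK-provided version of this could do the copy and the optional swap in one pass inside ScopedMemoryAccess and be annotated with @ForceInline, which is exactly what we cannot do from user code.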
Thanks,
Uwe and Robert Muir (for the Lucene team)