JDK 18 | Resolved
Apache Lucene is still working on integrating Project Panama for accessing memory-mapped files. We now have an implementation for JDK 17 (4th incubator of Panama) available which performs quite well for direct access to off-heap memory using VarHandles. You can get more details about the performance problems described here by looking at the benchmark results posted there: https://github.com/apache/lucene/pull/177

There is still the requirement to copy data from off-heap memory segments to standard Java arrays (and vice versa), so that normal on-heap processing with the Java language is possible. With the vector APIs we will certainly move away from that more and more, but currently we often need to read data into byte[], float[] or long[] to process it with conventional code. Lucene's API for this copying looks like:

private static void copySegmentToHeap(MemorySegment src, long srcOffset, byte[] target, int targetOffset, int len) {
  MemorySegment.ofArray(target).asSlice(targetOffset, len).copyFrom(src.asSlice(srcOffset, len));
}

We analyzed the performance of this in our code (it is often called millions of times with quite small byte arrays, but also with huge ones). To my knowledge of escape analysis, the code above should IMHO optimize perfectly and remove all allocations of HeapMemorySegment.OfByte instances, but it doesn't! The problems with this code are:

(1) It produces tons of heap allocations (we ran our benchmark with JFR enabled):

WARNING: Using incubator modules: jdk.incubator.foreign
PROFILE SUMMARY from 14415681 events (total: 5004408M)
  tests.profile.mode=heap
  tests.profile.count=30
  tests.profile.stacksize=1
  tests.profile.linenumbers=false
PERCENT  HEAP SAMPLES  STACK
 55.11%      2757776M  jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#fromArray()
 22.41%      1121632M  jdk.internal.foreign.HeapMemorySegmentImpl$OfByte#dup()
 20.85%      1043452M  jdk.internal.foreign.MappedMemorySegmentImpl#dup()
  0.71%        35428M  jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#dup()
  0.18%         9206M  jdk.internal.foreign.HeapMemorySegmentImpl$OfLong#fromArray()
  0.12%         6150M  org.apache.lucene.search.ExactPhraseMatcher$1$1#getImpacts()
  0.09%         4729M  org.apache.lucene.search.ExactPhraseMatcher$1#getImpacts()
  0.09%         4620M  org.apache.lucene.util.FixedBitSet#<init>()
  0.06%         3099M  java.util.AbstractList#iterator()
  0.04%         1773M  java.util.ArrayList#grow()
  0.03%         1360M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame#<init>()
  0.02%         1090M  java.util.ArrayList#iterator()
  0.02%         1085M  org.apache.lucene.util.PriorityQueue#<init>()
  0.02%          908M  org.apache.lucene.util.ArrayUtil#growExact()
  0.02%          796M  org.apache.lucene.queryparser.charstream.FastCharStream#refill()
  0.02%          787M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#<init>()
  0.01%          577M  org.apache.lucene.util.BytesRef#<init>()
  0.01%          537M  jdk.internal.misc.Unsafe#allocateUninitializedArray()
  0.01%          517M  org.apache.lucene.util.fst.ByteSequenceOutputs#read()
  0.01%          394M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#getFrame()
  0.01%          370M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#newTermState()
  0.01%          366M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#<init>()
  0.01%          352M  org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum#<init>()
  0.01%          349M  org.apache.lucene.queryparser.charstream.FastCharStream#GetImage()
  0.01%          322M  org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame#load()
  0.01%          317M  org.apache.lucene.codecs.lucene90.ForUtil#<init>()
  0.01%          295M  org.apache.lucene.store.MemorySegmentIndexInput#buildSlice()
  0.01%          261M  org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsPostingsEnum#<init>()
  0.00%          240M  org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
  0.00%          237M  java.util.Arrays#copyOfRange()

(2) The code is much slower than our old MappedByteBuffer code (which simply set position() on the ByteBuffer and called the getBytes(byte[], int offset, int length) method).

So it looks like there is some problem with escape analysis. It does not matter whether we enable tiered compilation or not; the results are similar. With tiered compilation enabled, less garbage seems to be created, but there is still a lot of it. The JFR output of our old MappedByteBuffer#getBytes() code shows the same heap statistics, except that the first entries (HeapMemorySegment) are missing.
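For comparison, here is a minimal sketch of what the old ByteBuffer-based copy path in (2) looks like (simplified, with illustrative names; this is not Lucene's actual code):

import java.nio.ByteBuffer;

// Sketch only: the pre-Panama copy path using a (mapped) ByteBuffer.
final class ByteBufferCopySketch {
  static void copyBufferToHeap(ByteBuffer src, int srcOffset, byte[] target, int targetOffset, int len) {
    src.position(srcOffset);              // seek to the source position inside the buffer
    src.get(target, targetOffset, len);   // bulk copy into the heap array, no wrapper objects needed
  }
}

This does the same bounds-checked bulk copy without materializing any intermediate segment/slice objects, which is consistent with the JFR statistics of the old code mentioned above.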
Nevertheless, we would like to have utility methods like the copySegmentToHeap example above in the MemoryAccess class, for all array types. Such utility methods could also be annotated with @ForceInline or similar, which cannot be done in our code.

In addition, there is a copy operation missing that takes a byte order and automatically swaps bytes. E.g., for copying from a MemorySegment to a long[] or float[] we first have to check that the endianness of the platform matches the data in the file. Therefore it would be good to have a copy method that allows specifying the byte order of the source/target memory segment. As mentioned above, it is essential to have methods that take offset/length parameters to avoid the mess of wrapping.

In short: please add all combinations of System.arraycopy variants for transferring slices from/to MemorySegments and on-heap arrays.

WORKAROUND: As a workaround, I implemented the following hack:

private static void copySegmentToHeap(MemorySegment src, long srcOffset, byte[] target, int targetOffset, int len) {
  Objects.checkFromIndexSize(srcOffset, len, src.byteSize());
  theUnsafe.copyMemory(null, src.address().toRawLongValue() + srcOffset,
      target, Unsafe.ARRAY_BYTE_BASE_OFFSET + targetOffset, len);
  //MemorySegment.ofArray(target).asSlice(targetOffset, len).copyFrom(src.asSlice(srcOffset, len));
}

When using this code, the performance is identical to our previous ByteBuffer-based code, with no object allocations! The JFR statistics on heap allocations look identical to our old code.

I also implemented another workaround that used a simple for-loop to copy instead of sun.misc.Unsafe. This worked well for small arrays, but was much slower for large ones (because it does not seem to get optimized into 8-bytes-at-a-time copying).

Of course this code is not safe, as it does not use ScopedMemoryAccess, but it was enough for performance testing. I propose to add methods like the implementation I did (using ScopedMemoryAccess) to the MemoryAccess class, and also to add byte swapping as needed (to adapt the byte order).
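To illustrate the kind of helper we are asking for (a sketch only, assuming JDK 17's jdk.incubator.foreign API; the class and method names are made up, and the shown implementation is just a bulk copy plus a conditional swap, not a proposal for how the JDK should implement it internally):

import java.nio.ByteOrder;
import jdk.incubator.foreign.MemorySegment;

// Sketch: byte-order-aware bulk copy from a MemorySegment slice into a long[] (hypothetical helper).
final class SegmentCopySketch {
  static void copySegmentToLongArray(MemorySegment src, long srcByteOffset,
                                     long[] target, int targetOffset, int numLongs,
                                     ByteOrder srcOrder) {
    final long numBytes = (long) numLongs * Long.BYTES;
    // bulk copy of the raw bytes (today this is the part that allocates the HeapMemorySegment wrappers)
    MemorySegment.ofArray(target)
        .asSlice((long) targetOffset * Long.BYTES, numBytes)
        .copyFrom(src.asSlice(srcByteOffset, numBytes));
    // adapt the byte order only if the data's endianness differs from the platform's
    if (srcOrder != ByteOrder.nativeOrder()) {
      for (int i = targetOffset; i < targetOffset + numLongs; i++) {
        target[i] = Long.reverseBytes(target[i]);
      }
    }
  }
}

A JDK-provided version of this could do the copy and the optional swap in one pass inside ScopedMemoryAccess and be annotated with @ForceInline, which is exactly what we cannot do from user code.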
Thanks,
Uwe and Robert Muir (for the Lucene team)