United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
JDK-6347575 : FileImageInputStream.readInt() and similar methods are inefficient

Submit Date:
Updated Date:
Project Name:
Resolved Date:
Affected Versions:
Fixed Versions:

Related Reports

Sub Tasks

In the evaluation of 6215304, an observation was made that the PNGImageReader seemed
to be making a large number of calls to ImageInputStream.read(), in other words,
reading a single byte at a time.  For a FileImageInputStream, this corresponds to
a JNI downcall and a system call to read a single byte from a RandomAccessFile object.
It turns out that the source of these frequent read() calls was in PNGImageReader's
readHeader() and readMetadata() methods.  In particular, these methods make a number
of calls to methods like ImageInputStream.readInt().  Digging further, I found that
ImageInputStreamImpl.readInt(), as currently implemented, calls read() 4x, and then
composes the 4 bytes to form a single int value.

Experiments show that calling read() 4x is much more expensive than making a single
call to read(byte[]) with a cached byte array.  Interestingly, this (and related
changes) have a profound impact on performance of reading small PNG and other images.



I made similar changes to ImageOutputStreamImpl's writeShort(), writeInt(), and
writeLong() methods.  (Until 6277756 is fixed, I've changed writeLong() as
I did with readLong(), so that it writes the long value in two 4-byte chunks.
Again, not the ideal 7x gain we should be seeing, but still 2-3.5x the
performance of the current builds.)

Here are some J2DBench results for the same solaris-sparc configuration
mentioned above ("base" is 1.6.0-b60, "test" is based on that build and
includes the optimizations described above, scores are in bytes/millisecond):

base: 2245.642260 (var=2.8%) (100.0%)
test: 5268.056220 (var=1.92%) (234.59%)
base: 107.1258278 (var=1.17%) (100.0%)
test: 406.3215137 (var=1.79%) (379.29%)
base: 2225.153685 (var=2.36%) (100.0%)
test: 5268.491466 (var=1.35%) (236.77%)
base: 2224.693881 (var=2.04%) (100.0%)
test: 5291.293159 (var=0.02%) (237.84%)
base: 107.9310165 (var=1.08%) (100.0%)
test: 423.7662979 (var=0.32%) (392.63%)
base: 2227.985565 (var=0.66%) (100.0%)
test: 5281.228567 (var=0.98%) (237.04%)
base: 2186.983911 (var=1.31%) (100.0%)
test: 2673.064879 (var=1.69%) (122.23%)
base: 108.5753759 (var=0.05%) (100.0%)
test: 211.2727644 (var=1.37%) (194.59%)
base: 2207.045604 (var=2.38%) (100.0%)
test: 2678.192426 (var=1.78%) (121.35%)

This change also has a positive impact on performance of reading small BMP
images from disk (since the BMPImageReader makes a number of calls to readInt(),
readShort(), etc when reading the image header):

base: 2.129012269 (var=1.81%) (100.0%)
test: 3.725943646 (var=0.83%) (175.01%)
base: 628.9602381 (var=1.31%) (100.0%)
test: 890.0648177 (var=1.42%) (141.51%)

My original thought was to apply a fix only to FileImageInputStream by
overriding the readShort(), readInt(), and readLong() methods.  Instead of
calling read() 2, 4, or 8x, we would call read(byte[]) just once, and then
pack the bytes into the appropriate primitive value.  This change makes for
a big win in performance for FileImageInputStream, for example, here are
some J2DBench results for solaris-sparc, 900 MHz USIII, SB2000 (I'm seeing
similar numbers on solaris-i586, etc):

base: 155.6176279 (var=3.08%) (100.0%)
test: 533.5529643 (var=1.86%) (342.86%)
base: 156.4748700 (var=1.05%) (100.0%)
test: 269.8893829 (var=1.56%) (172.48%)

These numbers are in bytes/millisecond.  As you can see, using this buffering
technique, readInt() is about 3.5x faster, and readShort() is 70% faster.
I'll come back to readLong() later, that's where it gets tricky.

These numbers looked promising, but if I made the change only in
FileImageInputStream, then other similar classes would not benefit.  For
example, I would probably want to make the same change to
FileCacheImageInputStream (which also uses RandomAccessFile) and to
MemoryCacheImageInputStream as well.  Also, their related ImageOutputStreams
have readInt() and similar methods, so we'd need to override there as well.
This would be a hassle, so my next thought was that maybe we should just make
this optimization in the ImageInputStreamImpl base class, and then all those
subclasses mentioned above would benefit.  This assumes of course that for
all subclasses of IISI, calling read(byte[4]) for example would be faster than
calling read() 4x.

It turns out (not surprisingly) that making this change in the base class
also benefits FileCacheImageInputStream and MemoryCacheImageInputStream.
Here are more results from the same system as above:

base: 2167.563921 (var=0.47%) (100.0%)
test: 5597.578478 (var=1.57%) (258.24%)
base: 114.5184795 (var=0.4%) (100.0%)
test: 393.7213387 (var=1.68%) (343.81%)

In these tests we have a URLInputStream from an on-disk file, useCache=false
corresponds to MemoryCacheImageInputStream, and useCache=true corresponds to
FileCacheImageInputStream.  As expected, we get almost exactly the same
benefit for the latter as we did in the FileImageInputStream case since they
both use RandomAccessFile under the covers.  MCIIS benefits less dramatically,
but 2.5x is nothing to sneeze at.  So far so good.

Again, readShort() also benefits, although not as dramatically:

base: 2097.001822 (var=0.02%) (100.0%)
test: 2799.464220 (var=0.79%) (133.5%)
base: 113.8649677 (var=0.48%) (100.0%)
test: 196.9390994 (var=2.07%) (172.96%)

I also tested this assumption using the FileChannelImageInputStream class from
the JAI IIO Tools package.  Here we don't see much benefit for readShort(),
but it's no worse either, and it's a modest win for readInt():

base: 6254.436884 (var=1.85%) (100.0%)
test: 6187.474271 (var=1.58%) (98.93%)
base: 6688.157979 (var=0.02%) (100.0%)
test: 9346.610076 (var=1.05%) (139.75%)

The problem child though is readLong(), and only for
FileChannelImageInputStream.  For the other stream impls, this
operation scales just as nicely as readShort() and readInt():

base: 157.9105603 (var=0.34%) (100.0%)
test: 1149.623363 (var=1.77%) (728.02%)
base: 2179.529160 (var=0.97%) (100.0%)
test: 10406.09603 (var=1.63%) (477.45%)
base: 114.7144744 (var=0.04%) (100.0%)
test: 796.6563779 (var=1.76%) (694.47%)

Again, so far so good, we're looking at 4-7x improvement in readLong() scores.
But what about FileChannelImageInputStream?

base: 6688.157979 (var=0.02%) (100.0%)
test: 4861.068985 (var=1.65%) (72.68%)

D'oh!  It's slower?  How can this be?  Well, it turns out that I'm being
bitten for the second time by 6277756, which was a bug that I filed against
NIO back when I was working on STR.  In that case, I was concerned with the
performance of direct ByteBuffers, but this case is hitting the same code
since FileChannelImageInputStream uses MappedByteBuffer under the hood, which
is just another variant of DirectByteBuffer and shares the same code.  As
described in that bug report, the heuristics for the ByteBuffer.read(<prim>[])
are wrong, currently they are set as if to say "if the length of the array
in this operation is greater than 6 bytes, then we will use our super-duper
fast native memcpy-based operation" which also incurs an added hit of JNI
downcall overhead and Get/ReleasePrimArrayCritical.  So we don't run into
this problem for readShort() (where length is 2 bytes) or for readInt()
(where length is 4 bytes), but we do run into this problem for readLong()
(where length is 8 bytes, larger than the 6 byte threshold for the
ByteBuffer.get(byte[]) method).

So clearly this is more evidence that the thresholds are set incorrectly
and need to be raised so that ByteBuffer.get(byte[]) is always as fast as,
or faster than, making the equivalent number of calls to ByteBuffer.get().
I'll prod whoever is responsible for 6277756 to see if we can come up with
some better values, but it's not clear that we could make that change for
Mustang, so in the meantime I will propose a middle of the road solution...

We should fix ImageInputStreamImpl.readShort() and readInt() as I've
suggested above, since only goodness comes from those changes for all
known implementations of ImageInputStream.  As for readLong(), I suggest
that we leave it as is.  As it is currently implemented, IISI.readLong()
simply makes two calls to readInt() and then composes those int values into
one long value.  If we leave this method as is, we will still benefit from
the readInt() optimizations, and while readLong() may not end up being
7x faster, it will still be up to 3.5x faster than it currently is for
FileImageInputStreams and the *CacheImageInputStreams:

base: 157.9105603 (var=0.34%) (100.0%)
test: 533.4320692 (var=1.42%) (337.81%)
base: 2179.529160 (var=0.97%) (100.0%)
test: 5376.337126 (var=3.49%) (246.67%)
base: 114.7144744 (var=0.04%) (100.0%)
test: 392.5908877 (var=0.07%) (342.23%)

And for FileChannelImageInputStreams, we will no longer have a regression
and in fact it will be faster than the status quo (a nice 40% gain):

base: 6688.157979 (var=0.02%) (100.0%)
test: 9346.610076 (var=1.05%) (139.75%)

It is interesting to note that the similarly named methods on RandomAccessFile
could probably benefit from this same optimization, so perhaps I will file
a separate bug report for that issue.

"And this we shall call the electrifying conclusion":
As mentioned in the description field, this change actually has a significant
improvement on PNGImageReader performance, otherwise, I don't think I would
have wasted all this time investigating this change.  These J2DBench results
were taken on the same configuration as above, with the core PNGImageReader,
useCache=false, testing 1x1 and 20x20 PNG images:

base: 1.583562001 (var=2.22%) (100.0%)
test: 2.172492509 (var=1.77%) (137.19%)
base: 571.7214069 (var=2.76%) (100.0%)
test: 738.4439359 (var=7.06%) (129.16%)
base: 1.663441088 (var=4.26%) (100.0%)
test: 1.601706418 (var=2.79%) (96.29%)
base: 590.0220102 (var=8.26%) (100.0%)
test: 572.8101037 (var=3.03%) (97.08%)
base: 1.854567177 (var=2.18%) (100.0%)
test: 1.859985261 (var=2.37%) (100.29%)
base: 652.6736001 (var=6.0%) (100.0%)
test: 647.9447689 (var=0.74%) (99.28%)

(There's not much benefit for useCache=false, but useCache=true improves just
as much as the FileImageInputStream numbers.)  So all told, this change helps
reduce overhead of reading small PNG images by 30% or more.  I think this also
(modestly) helps larger images that have a number of IDAT chunks, since we need
to read the IDAT length, type, and crc values every so often via readInt().
There are plenty more optimizations like this in the works that will be
tracked under 6215304.

Probably we should make a similar change to ImageOutputStreamImpl's
writeShort() and writeInt() methods, but I haven't measured the impact on
those yet.  If you've read this far: congrats, you deserve a break.

Hardware and Software, Engineered to Work Together