Bug ID: JDK-6347575 FileImageInputStream.readInt() and similar methods are inefficient

Type: Bug
Component: client-libs
Sub-Component: javax.imageio
Affected Version: 1.4.2

Priority: P4
Status: Resolved
Resolution: Fixed
OS: solaris_9
CPU: generic

Submitted: 2005-11-08
Updated: 2008-02-05
Resolved: 2005-12-05

JDK 6
6 b63Fixed

In the evaluation of 6215304, an observation was made that the PNGImageReader seemed
to be making a large number of calls to ImageInputStream.read(), in other words,
reading a single byte at a time.  For a FileImageInputStream, this corresponds to
a JNI downcall and a system call to read a single byte from a RandomAccessFile object.
It turns out that the source of these frequent read() calls was in PNGImageReader's
readHeader() and readMetadata() methods.  In particular, these methods make a number
of calls to methods like ImageInputStream.readInt().  Digging further, I found that
ImageInputStreamImpl.readInt(), as currently implemented, calls read() 4x, and then
composes the 4 bytes to form a single int value.

Experiments show that calling read() 4x is much more expensive than making a single
call to read(byte[]) with a cached byte array.  Interestingly, this (and related
changes) have a profound impact on performance of reading small PNG and other images.

EVALUATION I made similar changes to ImageOutputStreamImpl's writeShort(), writeInt(), and writeLong() methods. (Until 6277756 is fixed, I've changed writeLong() as I did with readLong(), so that it writes the long value in two 4-byte chunks. Again, not the ideal 7x gain we should be seeing, but still 2-3.5x the performance of the current builds.) Here are some J2DBench results for the same solaris-sparc configuration mentioned above ("base" is 1.6.0-b60, "test" is based on that build and includes the optimizations described above, scores are in bytes/millisecond): imageio.output.stream.tests.writeInt,imageio.output.opts.general.dest=byteArray: base: 2245.642260 (var=2.8%) (100.0%) test: 5268.056220 (var=1.92%) (234.59%) imageio.output.stream.tests.writeInt,imageio.output.opts.general.dest=file: base: 107.1258278 (var=1.17%) (100.0%) test: 406.3215137 (var=1.79%) (379.29%) imageio.output.stream.tests.writeInt,imageio.output.opts.general.dest=fileChannel: base: 2225.153685 (var=2.36%) (100.0%) test: 5268.491466 (var=1.35%) (236.77%) imageio.output.stream.tests.writeLong,imageio.output.opts.general.dest=byteArray: base: 2224.693881 (var=2.04%) (100.0%) test: 5291.293159 (var=0.02%) (237.84%) imageio.output.stream.tests.writeLong,imageio.output.opts.general.dest=file: base: 107.9310165 (var=1.08%) (100.0%) test: 423.7662979 (var=0.32%) (392.63%) imageio.output.stream.tests.writeLong,imageio.output.opts.general.dest=fileChannel: base: 2227.985565 (var=0.66%) (100.0%) test: 5281.228567 (var=0.98%) (237.04%) imageio.output.stream.tests.writeShort,imageio.output.opts.general.dest=byteArray: base: 2186.983911 (var=1.31%) (100.0%) test: 2673.064879 (var=1.69%) (122.23%) imageio.output.stream.tests.writeShort,imageio.output.opts.general.dest=file: base: 108.5753759 (var=0.05%) (100.0%) test: 211.2727644 (var=1.37%) (194.59%) imageio.output.stream.tests.writeShort,imageio.output.opts.general.dest=fileChannel: base: 2207.045604 (var=2.38%) (100.0%) test: 2678.192426 (var=1.78%) (121.35%)

11-11-2005

EVALUATION This change also has a positive impact on performance of reading small BMP images from disk (since the BMPImageReader makes a number of calls to readInt(), readShort(), etc when reading the image header): imageio.opts.size=1: base: 2.129012269 (var=1.81%) (100.0%) test: 3.725943646 (var=0.83%) (175.01%) imageio.opts.size=20: base: 628.9602381 (var=1.31%) (100.0%) test: 890.0648177 (var=1.42%) (141.51%)

09-11-2005

EVALUATION My original thought was to apply a fix only to FileImageInputStream by overriding the readShort(), readInt(), and readLong() methods. Instead of calling read() 2, 4, or 8x, we would call read(byte[]) just once, and then pack the bytes into the appropriate primitive value. This change makes for a big win in performance for FileImageInputStream, for example, here are some J2DBench results for solaris-sparc, 900 MHz USIII, SB2000 (I'm seeing similar numbers on solaris-i586, etc): imageio.input.stream.tests.readInt,imageio.input.opts.general.source=file,imageio.input.opts.imageio.useCache=false: base: 155.6176279 (var=3.08%) (100.0%) test: 533.5529643 (var=1.86%) (342.86%) imageio.input.stream.tests.readShort,imageio.input.opts.general.source=file,imageio.input.opts.imageio.useCache=false: base: 156.4748700 (var=1.05%) (100.0%) test: 269.8893829 (var=1.56%) (172.48%) These numbers are in bytes/millisecond. As you can see, using this buffering technique, readInt() is about 3.5x faster, and readShort() is 70% faster. I'll come back to readLong() later, that's where it gets tricky. These numbers looked promising, but if I made the change only in FileImageInputStream, then other similar classes would not benefit. For example, I would probably want to make the same change to FileCacheImageInputStream (which also uses RandomAccessFile) and to MemoryCacheImageInputStream as well. Also, their related ImageOutputStreams have readInt() and similar methods, so we'd need to override there as well. This would be a hassle, so my next thought was that maybe we should just make this optimization in the ImageInputStreamImpl base class, and then all those subclasses mentioned above would benefit. This assumes of course that for all subclasses of IISI, calling read(byte[4]) for example would be faster than calling read() 4x. It turns out (not surprisingly) that making this change in the base class also benefits FileCacheImageInputStream and MemoryCacheImageInputStream. Here are more results from the same system as above: imageio.input.stream.tests.readInt,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=false: base: 2167.563921 (var=0.47%) (100.0%) test: 5597.578478 (var=1.57%) (258.24%) imageio.input.stream.tests.readInt,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=true: base: 114.5184795 (var=0.4%) (100.0%) test: 393.7213387 (var=1.68%) (343.81%) In these tests we have a URLInputStream from an on-disk file, useCache=false corresponds to MemoryCacheImageInputStream, and useCache=true corresponds to FileCacheImageInputStream. As expected, we get almost exactly the same benefit for the latter as we did in the FileImageInputStream case since they both use RandomAccessFile under the covers. MCIIS benefits less dramatically, but 2.5x is nothing to sneeze at. So far so good. Again, readShort() also benefits, although not as dramatically: imageio.input.stream.tests.readShort,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=false: base: 2097.001822 (var=0.02%) (100.0%) test: 2799.464220 (var=0.79%) (133.5%) imageio.input.stream.tests.readShort,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=true: base: 113.8649677 (var=0.48%) (100.0%) test: 196.9390994 (var=2.07%) (172.96%) I also tested this assumption using the FileChannelImageInputStream class from the JAI IIO Tools package. Here we don't see much benefit for readShort(), but it's no worse either, and it's a modest win for readInt(): imageio.input.stream.tests.readShort,imageio.input.opts.general.source=fileChannel,imageio.input.opts.imageio.useCache=false: base: 6254.436884 (var=1.85%) (100.0%) test: 6187.474271 (var=1.58%) (98.93%) imageio.input.stream.tests.readLong,imageio.input.opts.general.source=fileChannel,imageio.input.opts.imageio.useCache=false: base: 6688.157979 (var=0.02%) (100.0%) test: 9346.610076 (var=1.05%) (139.75%) The problem child though is readLong(), and only for FileChannelImageInputStream. For the other stream impls, this operation scales just as nicely as readShort() and readInt(): imageio.input.stream.tests.readLong,imageio.input.opts.general.source=file,imageio.input.opts.imageio.useCache=false: base: 157.9105603 (var=0.34%) (100.0%) test: 1149.623363 (var=1.77%) (728.02%) imageio.input.stream.tests.readLong,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=false: base: 2179.529160 (var=0.97%) (100.0%) test: 10406.09603 (var=1.63%) (477.45%) imageio.input.stream.tests.readLong,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=true: base: 114.7144744 (var=0.04%) (100.0%) test: 796.6563779 (var=1.76%) (694.47%) Again, so far so good, we're looking at 4-7x improvement in readLong() scores. But what about FileChannelImageInputStream? imageio.input.stream.tests.readLong,imageio.input.opts.general.source=fileChannel,imageio.input.opts.imageio.useCache=false: base: 6688.157979 (var=0.02%) (100.0%) test: 4861.068985 (var=1.65%) (72.68%) D'oh! It's slower? How can this be? Well, it turns out that I'm being bitten for the second time by 6277756, which was a bug that I filed against NIO back when I was working on STR. In that case, I was concerned with the performance of direct ByteBuffers, but this case is hitting the same code since FileChannelImageInputStream uses MappedByteBuffer under the hood, which is just another variant of DirectByteBuffer and shares the same code. As described in that bug report, the heuristics for the ByteBuffer.read(<prim>[]) are wrong, currently they are set as if to say "if the length of the array in this operation is greater than 6 bytes, then we will use our super-duper fast native memcpy-based operation" which also incurs an added hit of JNI downcall overhead and Get/ReleasePrimArrayCritical. So we don't run into this problem for readShort() (where length is 2 bytes) or for readInt() (where length is 4 bytes), but we do run into this problem for readLong() (where length is 8 bytes, larger than the 6 byte threshold for the ByteBuffer.get(byte[]) method). So clearly this is more evidence that the thresholds are set incorrectly and need to be raised so that ByteBuffer.get(byte[]) is always as fast as, or faster than, making the equivalent number of calls to ByteBuffer.get(). I'll prod whoever is responsible for 6277756 to see if we can come up with some better values, but it's not clear that we could make that change for Mustang, so in the meantime I will propose a middle of the road solution... We should fix ImageInputStreamImpl.readShort() and readInt() as I've suggested above, since only goodness comes from those changes for all known implementations of ImageInputStream. As for readLong(), I suggest that we leave it as is. As it is currently implemented, IISI.readLong() simply makes two calls to readInt() and then composes those int values into one long value. If we leave this method as is, we will still benefit from the readInt() optimizations, and while readLong() may not end up being 7x faster, it will still be up to 3.5x faster than it currently is for FileImageInputStreams and the *CacheImageInputStreams: imageio.input.stream.tests.readLong,imageio.input.opts.general.source=file,imageio.input.opts.imageio.useCache=false: base: 157.9105603 (var=0.34%) (100.0%) test: 533.4320692 (var=1.42%) (337.81%) imageio.input.stream.tests.readLong,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=false: base: 2179.529160 (var=0.97%) (100.0%) test: 5376.337126 (var=3.49%) (246.67%) imageio.input.stream.tests.readLong,imageio.input.opts.general.source=url,imageio.input.opts.imageio.useCache=true: base: 114.7144744 (var=0.04%) (100.0%) test: 392.5908877 (var=0.07%) (342.23%) And for FileChannelImageInputStreams, we will no longer have a regression and in fact it will be faster than the status quo (a nice 40% gain): imageio.input.stream.tests.readLong,imageio.input.opts.general.source=fileChannel,imageio.input.opts.imageio.useCache=false: base: 6688.157979 (var=0.02%) (100.0%) test: 9346.610076 (var=1.05%) (139.75%) It is interesting to note that the similarly named methods on RandomAccessFile could probably benefit from this same optimization, so perhaps I will file a separate bug report for that issue. "And this we shall call the electrifying conclusion": As mentioned in the description field, this change actually has a significant improvement on PNGImageReader performance, otherwise, I don't think I would have wasted all this time investigating this change. These J2DBench results were taken on the same configuration as above, with the core PNGImageReader, useCache=false, testing 1x1 and 20x20 PNG images: imageio.input.opts.general.source=file,imageio.opts.size=1: base: 1.583562001 (var=2.22%) (100.0%) test: 2.172492509 (var=1.77%) (137.19%) imageio.input.opts.general.source=file,imageio.opts.size=20: base: 571.7214069 (var=2.76%) (100.0%) test: 738.4439359 (var=7.06%) (129.16%) imageio.input.opts.general.source=fileChannel,imageio.opts.size=1: base: 1.663441088 (var=4.26%) (100.0%) test: 1.601706418 (var=2.79%) (96.29%) imageio.input.opts.general.source=fileChannel,imageio.opts.size=20: base: 590.0220102 (var=8.26%) (100.0%) test: 572.8101037 (var=3.03%) (97.08%) imageio.input.opts.general.source=url,imageio.opts.size=1: base: 1.854567177 (var=2.18%) (100.0%) test: 1.859985261 (var=2.37%) (100.29%) imageio.input.opts.general.source=url,imageio.opts.size=20: base: 652.6736001 (var=6.0%) (100.0%) test: 647.9447689 (var=0.74%) (99.28%) (There's not much benefit for useCache=false, but useCache=true improves just as much as the FileImageInputStream numbers.) So all told, this change helps reduce overhead of reading small PNG images by 30% or more. I think this also (modestly) helps larger images that have a number of IDAT chunks, since we need to read the IDAT length, type, and crc values every so often via readInt(). There are plenty more optimizations like this in the works that will be tracked under 6215304. Probably we should make a similar change to ImageOutputStreamImpl's writeShort() and writeInt() methods, but I haven't measured the impact on those yet. If you've read this far: congrats, you deserve a break.

09-11-2005