Bug ID: JDK-6514990 OGL: accelerate Convolve/Rescale/LookupOp using fragment shaders

Type: Enhancement
Component: client-libs
Sub-Component: 2d
Affected Version: 6

Priority: P3
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2007-01-19
Updated: 2011-03-08
Resolved: 2011-03-08

JDK 6	JDK 7
6u4 Fixed	7 b08 Fixed

We can accelerate various BufferedImageOps in hardware using fragment shaders when
the OGL pipeline is enabled.  These operations can go anywhere from 2x to 500x
faster when executed in hardware, and those numbers will continue to improve with
each new generation of shader-level GPUs.

EVALUATION There are a couple of limitations in this first putback that are worth documenting:

  ConvolveOp
    - works on any source image
    - accelerated only for 3x3 and 5x5 kernels (in the future, we may be able to generalize this and increase the supported kernel dimensions, but only on newer hardware)

  LookupOp
    - works only on 1-band ColorSpace.TYPE_GRAY and 3/4-band ColorSpace.TYPE_RGB images
    - accelerated only for ByteLookupTables and ShortLookupTables with no more than 256 entries

  RescaleOp
    - works only on 1-band ColorSpace.TYPE_GRAY and 3/4-band ColorSpace.TYPE_RGB images

In the cases where an image or op does not meet these restrictions, we will simply fall back on the existing (slower) software code paths.
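
The restrictions above can be sketched as a predicate. This is only an illustration written against the public java.awt.image API: the class OglOpEligibility and its methods are hypothetical names, and the actual check inside the OGL pipeline may be structured differently.

```java
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;
import java.awt.image.ByteLookupTable;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.awt.image.LookupOp;
import java.awt.image.LookupTable;
import java.awt.image.RescaleOp;
import java.awt.image.ShortLookupTable;

public class OglOpEligibility {

    /** Hypothetical helper: true if (img, op) stays within the limits listed above. */
    public static boolean meetsDocumentedLimits(BufferedImage img, BufferedImageOp op) {
        if (op instanceof ConvolveOp) {
            // Any source image is fine; only the kernel dimensions matter.
            Kernel k = ((ConvolveOp) op).getKernel();
            int w = k.getWidth();
            int h = k.getHeight();
            return (w == 3 && h == 3) || (w == 5 && h == 5);
        }
        if (op instanceof LookupOp) {
            int entries = tableEntries(((LookupOp) op).getTable());
            return hasSupportedLayout(img) && entries >= 0 && entries <= 256;
        }
        if (op instanceof RescaleOp) {
            return hasSupportedLayout(img);
        }
        return false; // other ops are not covered by this putback
    }

    /** 1-band TYPE_GRAY or 3/4-band TYPE_RGB, judged by color space type and raster band count. */
    static boolean hasSupportedLayout(BufferedImage img) {
        int type = img.getColorModel().getColorSpace().getType();
        int bands = img.getRaster().getNumBands();
        return (type == ColorSpace.TYPE_GRAY && bands == 1)
            || (type == ColorSpace.TYPE_RGB && (bands == 3 || bands == 4));
    }

    /** Entries per band for byte/short tables, or -1 for any other LookupTable subclass. */
    static int tableEntries(LookupTable table) {
        if (table instanceof ByteLookupTable) {
            return ((ByteLookupTable) table).getTable()[0].length;
        }
        if (table instanceof ShortLookupTable) {
            return ((ShortLookupTable) table).getTable()[0].length;
        }
        return -1;
    }

    public static void main(String[] args) {
        BufferedImage rgb = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        ConvolveOp conv3 = new ConvolveOp(new Kernel(3, 3, new float[9]));
        LookupOp lookup = new LookupOp(new ByteLookupTable(0, new byte[256]), null);
        System.out.println("3x3 convolve: " + meetsDocumentedLimits(rgb, conv3));
        System.out.println("256-entry byte lookup: " + meetsDocumentedLimits(rgb, lookup));
    }
}
```

An application would not normally need such a check, since ops outside the limits still render correctly through the software fallback; it is only useful for predicting which path a given call is likely to take.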

29-01-2007

EVALUATION More comprehensive performance numbers are still being prepared, but in the meantime, here are some results for an Nvidia GeForce 6800 GT (AGP) with 93.71 drivers, Windows XP, 2x 2.8GHz P4, 1 GB RAM.

Here's a legend:
  6800.ddraw: no command line options, using jdk1.7.0-b06
  6800.old:   -Dsun.java2d.opengl=True, using jdk1.7.0-b06
  6800.new:   -Dsun.java2d.opengl=True, using test build with OGL-accelerated BufImgOps

So the "ddraw" and "old" results are mainly measuring the cost of BufferedImageOp.filter() (which uses software loops executed on the CPU) plus rendering the filtered image to a VolatileImage destination. (The "ddraw" vs "old" comparison is useful to get a sense of the performance differences in that last drawImage() step between the DirectDraw and OpenGL pipelines.) The "new" results use the accelerated fast path executed almost entirely on the GPU.

In picture form:

"old"
  src BufferedImage --> BufferedImageOp.filter() --> dst BufferedImage
                                                           |
                                                           V
                                                     glDrawPixels()
                                                           |
                                                           V
                                                     dst VolatileImage

"new"
  src BufferedImage
        |
        V
  glTexSubImage2D()
        |
        V
  src OpenGL texture --> OpenGL fragment shader --> dst VolatileImage

Options common across all tests:
  testname=graphics.imaging.imageops.tests.graphics2d.drawimageop
  graphics.opts.xormode=false
  graphics.opts.renderhint=Default
  graphics.opts.alpharule=SrcOver
  graphics.opts.extraalpha=false
  global.dest=VolatileImg
  graphics.imaging.src=IntXrgb opaque
  graphics.opts.clip=false
  graphics.opts.anim=2

graphics.imaging.imageops.opts.op=convolve3x3noop,graphics.opts.sizes=1000:
  6800.ddraw: 6260.869565 (var=0.0%) (100.0%)
  6800.old: 5005.362888 (var=0.57%) (79.95%)
  6800.new: 255333.33333 (var=1.58%) (4078.24%)

graphics.imaging.imageops.opts.op=convolve3x3noop,graphics.opts.sizes=250:
  6800.ddraw: 8233.152726 (var=0.54%) (100.0%)
  6800.old: 9391.846361 (var=0.0%) (114.07%)
  6800.new: 212500.0 (var=7.26%) (2581.03%)

graphics.imaging.imageops.opts.op=convolve3x3zero,graphics.opts.sizes=1000:
  6800.ddraw: 6466.984343 (var=0.51%) (100.0%)
  6800.old: 5107.252298 (var=0.03%) (78.97%)
  6800.new: 271551.72413 (var=1.03%) (4199.05%)

graphics.imaging.imageops.opts.op=convolve3x3zero,graphics.opts.sizes=250:
  6800.ddraw: 8250.0 (var=0.0%) (100.0%)
  6800.old: 9401.172529 (var=1.59%) (113.95%)
  6800.new: 214297.74334 (var=0.51%) (2597.55%)

graphics.imaging.imageops.opts.op=convolve5x5noop,graphics.opts.sizes=1000:
  6800.ddraw: 3686.327077 (var=0.0%) (100.0%)
  6800.old: 3199.431212 (var=0.57%) (86.79%)
  6800.new: 120358.09018 (var=0.53%) (3264.99%)

graphics.imaging.imageops.opts.op=convolve5x5noop,graphics.opts.sizes=250:
  6800.ddraw: 4333.333333 (var=0.0%) (100.0%)
  6800.old: 4625.0 (var=0.0%) (106.73%)
  6800.new: 115851.65736 (var=1.02%) (2673.5%)

graphics.imaging.imageops.opts.op=convolve5x5zero,graphics.opts.sizes=1000:
  6800.ddraw: 3764.544832 (var=0.0%) (100.0%)
  6800.old: 3253.796095 (var=0.0%) (86.43%)
  6800.new: 125000.0 (var=1.08%) (3320.45%)

graphics.imaging.imageops.opts.op=convolve5x5zero,graphics.opts.sizes=250:
  6800.ddraw: 4395.833333 (var=0.0%) (100.0%)
  6800.old: 4708.333333 (var=0.0%) (107.11%)
  6800.new: 118979.16666 (var=1.04%) (2706.64%)

graphics.imaging.imageops.opts.op=lookup1byte,graphics.opts.sizes=1000:
  6800.ddraw: 13404.82573 (var=0.03%) (100.0%)
  6800.old: 8695.652173 (var=0.0%) (64.87%)
  6800.new: 1367333.33333 (var=0.54%) (10200.31%)

graphics.imaging.imageops.opts.op=lookup1byte,graphics.opts.sizes=250:
  6800.ddraw: 27345.06567 (var=0.0%) (100.0%)
  6800.old: 11732.92642 (var=2.08%) (42.91%)
  6800.new: 831312.5 (var=0.5%) (3040.08%)

graphics.imaging.imageops.opts.op=lookup1short,graphics.opts.sizes=1000:
  6800.ddraw: 4108.182129 (var=1.62%) (100.0%)
  6800.old: 3536.067892 (var=0.0%) (86.07%)
  6800.new: 1366000.0 (var=0.0%) (33250.72%)

graphics.imaging.imageops.opts.op=lookup1short,graphics.opts.sizes=250:
  6800.ddraw: 4750.0 (var=0.0%) (100.0%)
  6800.old: 5099.502487 (var=0.03%) (107.36%)
  6800.new: 819083.33333 (var=0.54%) (17243.86%)

graphics.imaging.imageops.opts.op=lookup3byte,graphics.opts.sizes=1000:
  6800.ddraw: 13404.82573 (var=0.03%) (100.0%)
  6800.old: 8713.136729 (var=0.03%) (65.0%)
  6800.new: 1362935.65683 (var=0.54%) (10167.5%)

graphics.imaging.imageops.opts.op=lookup3byte,graphics.opts.sizes=250:
  6800.ddraw: 27250.0 (var=0.5%) (100.0%)
  6800.old: 11746.37925 (var=2.17%) (43.11%)
  6800.new: 801312.5 (var=0.53%) (2940.6%)

graphics.imaging.imageops.opts.op=lookup3short,graphics.opts.sizes=1000:
  6800.ddraw: 4106.776180 (var=0.58%) (100.0%)
  6800.old: 3554.923569 (var=0.53%) (86.56%)
  6800.new: 1359249.32975 (var=0.03%) (33097.72%)

graphics.imaging.imageops.opts.op=lookup3short,graphics.opts.sizes=250:
  6800.ddraw: 4724.801061 (var=0.03%) (100.0%)
  6800.old: 5097.811671 (var=0.03%) (107.89%)
  6800.new: 761354.16666 (var=0.54%) (16113.99%)

graphics.imaging.imageops.opts.op=rescale1band,graphics.opts.sizes=1000:
  6800.ddraw: 4496.713939 (var=0.52%) (100.0%)
  6800.old: 3786.444528 (var=0.61%) (84.2%)
  6800.new: 2653666.66666 (var=0.54%) (59013.46%)

graphics.imaging.imageops.opts.op=rescale1band,graphics.opts.sizes=250:
  6800.ddraw: 5342.085090 (var=0.04%) (100.0%)
  6800.old: 5749.329244 (var=0.58%) (107.62%)
  6800.new: 2376312.5 (var=0.5%) (44482.87%)

graphics.imaging.imageops.opts.op=rescale3band,graphics.opts.sizes=1000:
  6800.ddraw: 4473.503097 (var=1.63%) (100.0%)
  6800.old: 3787.878787 (var=0.57%) (84.67%)
  6800.new: 1752116.49170 (var=1.64%) (39166.54%)

graphics.imaging.imageops.opts.op=rescale3band,graphics.opts.sizes=250:
  6800.ddraw: 5389.996167 (var=2.44%) (100.0%)
  6800.old: 5769.976726 (var=1.85%) (107.05%)
  6800.new: 1648667.89544 (var=0.54%) (30587.55%)

Summary:
  6800.ddraw:
    Number of tests: 20
    Overall average: 8209.391001471771
    Best spread: 0.0% variance
    Worst spread: 2.44% variance
    (Basis for results comparison)

  6800.old:
    Number of tests: 20
    Overall average: 6098.111210436934
    Best spread: 0.0% variance
    Worst spread: 2.17% variance
    Comparison to basis:
      Best result: 114.07% of basis
      Worst result: 42.91% of basis
      Number of wins: 8
      Number of ties: 0
      Number of losses: 12

  6800.new:
    Number of tests: 20
    Overall average: 926660.8044390163
    Best spread: 0.0% variance
    Worst spread: 7.26% variance
    Comparison to basis:
      Best result: 59013.46% of basis
      Worst result: 2581.03% of basis
      Number of wins: 20
      Number of ties: 0
      Number of losses: 0

(I can't yet explain why some of the "old" numbers are so much slower, sometimes as much as 50%, than the "ddraw" numbers, mainly for the LookupOp tests. For all of these cases, I'd expect the op.filter() method to be producing IntRgb images, so it's strange that the numbers would be different between the different ops.)

It's worth noting that this particular board is two generations old, and it's running on AGP, not PCI-Express. I'm working on getting a full set of numbers that covers newer boards (ATI Radeon X1900, Nvidia GeForce 7xxx/8xxx) as well as older ones (ATI Radeon 9800, Nvidia GeForce FX 5600). But the results above are indicative of what we can expect (and just imagine what they'll look like on newer boards).

So the executive summary (for the Nvidia GeForce 6800 GT, at least), expressed as speedup over the "old" OpenGL results:
  ConvolveOp: 25x to 50x
  LookupOp:   70x to 160x
  RescaleOp:  400x to 700x

These numbers reflect the maximum possible performance when the source image is unchanging, and is therefore always cached in texture memory. I'm planning to run some more tests that touch the source image (a simple 1x1 fill rect) between drawImage() calls so that we include the cost of uploading the source image to texture memory each time; this would more adequately reflect a common user scenario where the BufferedImageOp is applied only once to an image, or the case where the image is dynamically changing between each drawImage() call.

It's also worth noting that I did some more experimentation with the idea above of having the various BufferedImageOp.filter() implementations delegate to this new accelerated codepath (only when the OGL pipeline is enabled, of course).
The way I made this work was to hack in some code in ConvolveOp.filter() that creates a temporary VolatileImage, then calls vimgGraphics.drawImage(src, this, 0, 0), then copies the VolatileImage contents into the destination BufferedImage. So far I only have performance data on the AGP configuration described above, and the results aren't great (this is for convolve3x3, 1000x1000 source image):

  swonly:        7528.230865 (var=0.04%) (100.0%)
  ogl:           6260.434056 (var=1.3%) (83.16%)
  ogl.cachevimg: 6606.110652 (var=0.66%) (87.75%)
  ogl.skipread:  19846.09153 (var=0.61%) (263.62%)

Here's a legend to explain the above:
  swonly:        OGL pipeline is disabled, just use the existing medialib (software) path
  ogl:           uses new accelerated codepath, no extra optimizations
  ogl.cachevimg: same as "ogl", except cache the VolatileImage to avoid re-creation
  ogl.skipread:  same as "ogl.cachevimg", except skip the part that copies the VolatileImage into the destination BufferedImage

What these numbers show is that on an AGP bus, the VRAM->sysmem readback step is a killer (no news there). So it seems that this sort of optimization will not make sense for older machines, but I would like to repeat the experiment on a newer machine with a PCI-Express bus (where readback is much much faster).

Anyway, this was just an experiment, and has no bearing on the RFE at hand, but I did want to mention it now in case we want to consider it as a future performance enhancement (because it could certainly be a win on newer systems).
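
The shape of that experiment can be sketched as application-level code. This is not the hack that was made inside ConvolveOp.filter(); FilterViaScratchSketch and filterViaScratch are hypothetical names, and the scratch surface is passed in as a plain Image so the sketch also runs without a display. In the experiment the scratch surface would be a VolatileImage (for example from GraphicsConfiguration.createCompatibleVolatileImage), and handling of lost VolatileImage contents is omitted here.

```java
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;

public class FilterViaScratchSketch {

    /**
     * Hypothetical helper: render src through op into a scratch surface,
     * then copy the scratch contents into dst.
     */
    public static void filterViaScratch(BufferedImage src, BufferedImageOp op,
                                        Image scratch, BufferedImage dst) {
        // Step 1: the drawImage(img, op, x, y) call that the OGL pipeline can accelerate.
        Graphics2D sg = (Graphics2D) scratch.getGraphics();
        try {
            sg.drawImage(src, op, 0, 0);
        } finally {
            sg.dispose();
        }
        // Step 2: copy back into system memory. When scratch lives in VRAM this is
        // the readback that the "ogl.skipread" configuration leaves out.
        Graphics2D dg = dst.createGraphics();
        try {
            dg.setComposite(AlphaComposite.Src);
            dg.drawImage(scratch, 0, 0, null);
        } finally {
            dg.dispose();
        }
    }
}
```

Reusing one scratch surface across calls corresponds to the "ogl.cachevimg" configuration; allocating a new one per call corresponds to plain "ogl".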

25-01-2007

EVALUATION For now we will accelerate these operations only when the user calls:

  Graphics2D.drawImage(BufferedImage img, BufferedImageOp op, int x, int y);

Sometime later, we could consider putting hooks into the BufferedImageOp.filter() implementations, but that's a little less straightforward.
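
For reference, a minimal sketch of a call that goes through that entry point. The class DrawImageOpExample and the method drawBlurred are illustrative names only, and the 3x3 box-blur kernel is just one op that fits the supported kernel sizes. The destination here is a BufferedImage so that the example runs without a display; acceleration would only be expected when the destination is an OGL-backed surface such as a VolatileImage and the pipeline is enabled with -Dsun.java2d.opengl=True.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;

public class DrawImageOpExample {

    /** Draws src at (x, y) through a 3x3 box blur using drawImage(img, op, x, y). */
    public static void drawBlurred(Graphics2D g, BufferedImage src, int x, int y) {
        float ninth = 1.0f / 9.0f;
        float[] weights = {
            ninth, ninth, ninth,
            ninth, ninth, ninth,
            ninth, ninth, ninth
        };
        ConvolveOp blur = new ConvolveOp(new Kernel(3, 3, weights));
        // This overload is the one the evaluation says will be accelerated.
        g.drawImage(src, blur, x, y);
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D sg = src.createGraphics();
        sg.setColor(Color.WHITE);
        sg.fillRect(0, 0, 64, 64);
        sg.dispose();

        BufferedImage dst = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        drawBlurred(g, src, 0, 0);
        g.dispose();
        System.out.println("center pixel: 0x" + Integer.toHexString(dst.getRGB(32, 32)));
    }
}
```

Calling blur.filter(src, dst) directly produces the same pixels but, per the note above, stays on the software path in this putback.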

19-01-2007