Bug ID: JDK-6514990 OGL: accelerate Convolve/Rescale/LookupOp using fragment shaders

Details
Type: Enhancement
Submit Date: 2007-01-19
Status: Closed
Updated Date: 2011-03-08
Project Name: JDK
Resolved Date: 2011-03-08
Component: client-libs
OS: generic
Sub-Component: 2d
CPU: generic
Priority: P3
Resolution: Fixed
Affected Versions: 6
Fixed Versions:

Related Reports
Backport:
Relates:

Sub Tasks

Description
We can accelerate various BufferedImageOps in hardware using fragment shaders when
the OGL pipeline is enabled.  These operations can go anywhere from 2x to 500x
faster when executed in hardware, and those numbers will continue to improve with
each new generation of shader-level GPUs.

                                    

Comments
EVALUATION

For now we will accelerate these operations only when the user calls:
    Graphics2D.drawImage(BufferedImage img, BufferedImageOp op, int x, int y);

Sometime later, we could consider putting hooks into the BufferedImageOp.filter()
implementations, but that's a little less straightforward.
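
For reference, here is a minimal sketch of the kind of call described above. The class and method names (DrawImageOpSketch, drawBlurred) are made up for illustration, and the 3x3 box-blur kernel is just an example op. The main() method draws into a BufferedImage so that it runs anywhere; in that case the software path is used, and any acceleration would presumably apply only when the Graphics2D belongs to an accelerated destination (such as a VolatileImage) with the OGL pipeline enabled.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.util.Arrays;

// Hypothetical helper class; not part of the JDK.
final class DrawImageOpSketch {

    // Draws src through a 3x3 box blur using the
    // Graphics2D.drawImage(BufferedImage, BufferedImageOp, int, int) overload.
    static void drawBlurred(Graphics2D g, BufferedImage src) {
        float[] weights = new float[9];
        Arrays.fill(weights, 1.0f / 9.0f);
        ConvolveOp blur =
            new ConvolveOp(new Kernel(3, 3, weights), ConvolveOp.EDGE_NO_OP, null);
        g.drawImage(src, blur, 0, 0);
    }

    public static void main(String[] args) {
        // Opaque IntRgb source, similar to the source format used in the tests.
        BufferedImage src = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D sg = src.createGraphics();
        sg.setColor(Color.WHITE);
        sg.fillRect(16, 16, 32, 32);
        sg.dispose();

        // A BufferedImage destination keeps the sketch runnable without a display.
        BufferedImage dst = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            drawBlurred(g, src);
        } finally {
            g.dispose();
        }
        System.out.printf("pixel at (32,32): 0x%08X%n", dst.getRGB(32, 32));
    }
}
```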
                                     
2007-01-19
EVALUATION

More comprehensive performance numbers are still being prepared, but in the
meantime, here are some results for an Nvidia GeForce 6800 GT (AGP) with
93.71 drivers, Windows XP, 2x 2.8GHz P4, 1 GB RAM.

Here's a legend:
  6800.ddraw:
    no command line options, using jdk1.7.0-b06
  6800.old:
    -Dsun.java2d.opengl=True, using jdk1.7.0-b06
  6800.new:
    -Dsun.java2d.opengl=True, using test build with OGL-accelerated BufImgOps

So the "ddraw" and "old" results are mainly measuring the cost of
BufferedImageOp.filter() (which uses software loops executed on the CPU)
plus rendering the filtered image to a VolatileImage destination.  (The
"ddraw" vs "old" comparison is useful to get a sense of the performance
differences in that last drawImage() step between the DirectDraw and OpenGL
pipelines.)  The "new" results use the accelerated fast path executed almost
entirely on the GPU.  In picture form:

"old"
  src BufferedImage --> BufferedImageOp.filter() --> dst BufferedImage
                                                            |
                                                            V
                                                       glDrawPixels()
                                                            |
                                                            V
                                                     dst VolatileImage

"new"
  src BufferedImage
         |
         V
  glTexSubImage2D()
         |
         V
  src OpenGL texture --> OpenGL fragment shader --> dst VolatileImage

Options common across all tests:
  testname=graphics.imaging.imageops.tests.graphics2d.drawimageop
  graphics.opts.xormode=false
  graphics.opts.renderhint=Default
  graphics.opts.alpharule=SrcOver
  graphics.opts.extraalpha=false
  global.dest=VolatileImg
  graphics.imaging.src=IntXrgb opaque
  graphics.opts.clip=false
  graphics.opts.anim=2

graphics.imaging.imageops.opts.op=convolve3x3noop,graphics.opts.sizes=1000:
6800.ddraw: 6260.869565 (var=0.0%) (100.0%)
6800.old: 5005.362888 (var=0.57%) (79.95%)
6800.new: 255333.33333 (var=1.58%) (4078.24%)
graphics.imaging.imageops.opts.op=convolve3x3noop,graphics.opts.sizes=250:
6800.ddraw: 8233.152726 (var=0.54%) (100.0%)
6800.old: 9391.846361 (var=0.0%) (114.07%)
6800.new: 212500.0 (var=7.26%) (2581.03%)
graphics.imaging.imageops.opts.op=convolve3x3zero,graphics.opts.sizes=1000:
6800.ddraw: 6466.984343 (var=0.51%) (100.0%)
6800.old: 5107.252298 (var=0.03%) (78.97%)
6800.new: 271551.72413 (var=1.03%) (4199.05%)
graphics.imaging.imageops.opts.op=convolve3x3zero,graphics.opts.sizes=250:
6800.ddraw: 8250.0 (var=0.0%) (100.0%)
6800.old: 9401.172529 (var=1.59%) (113.95%)
6800.new: 214297.74334 (var=0.51%) (2597.55%)
graphics.imaging.imageops.opts.op=convolve5x5noop,graphics.opts.sizes=1000:
6800.ddraw: 3686.327077 (var=0.0%) (100.0%)
6800.old: 3199.431212 (var=0.57%) (86.79%)
6800.new: 120358.09018 (var=0.53%) (3264.99%)
graphics.imaging.imageops.opts.op=convolve5x5noop,graphics.opts.sizes=250:
6800.ddraw: 4333.333333 (var=0.0%) (100.0%)
6800.old: 4625.0 (var=0.0%) (106.73%)
6800.new: 115851.65736 (var=1.02%) (2673.5%)
graphics.imaging.imageops.opts.op=convolve5x5zero,graphics.opts.sizes=1000:
6800.ddraw: 3764.544832 (var=0.0%) (100.0%)
6800.old: 3253.796095 (var=0.0%) (86.43%)
6800.new: 125000.0 (var=1.08%) (3320.45%)
graphics.imaging.imageops.opts.op=convolve5x5zero,graphics.opts.sizes=250:
6800.ddraw: 4395.833333 (var=0.0%) (100.0%)
6800.old: 4708.333333 (var=0.0%) (107.11%)
6800.new: 118979.16666 (var=1.04%) (2706.64%)
graphics.imaging.imageops.opts.op=lookup1byte,graphics.opts.sizes=1000:
6800.ddraw: 13404.82573 (var=0.03%) (100.0%)
6800.old: 8695.652173 (var=0.0%) (64.87%)
6800.new: 1367333.33333 (var=0.54%) (10200.31%)
graphics.imaging.imageops.opts.op=lookup1byte,graphics.opts.sizes=250:
6800.ddraw: 27345.06567 (var=0.0%) (100.0%)
6800.old: 11732.92642 (var=2.08%) (42.91%)
6800.new: 831312.5 (var=0.5%) (3040.08%)
graphics.imaging.imageops.opts.op=lookup1short,graphics.opts.sizes=1000:
6800.ddraw: 4108.182129 (var=1.62%) (100.0%)
6800.old: 3536.067892 (var=0.0%) (86.07%)
6800.new: 1366000.0 (var=0.0%) (33250.72%)
graphics.imaging.imageops.opts.op=lookup1short,graphics.opts.sizes=250:
6800.ddraw: 4750.0 (var=0.0%) (100.0%)
6800.old: 5099.502487 (var=0.03%) (107.36%)
6800.new: 819083.33333 (var=0.54%) (17243.86%)
graphics.imaging.imageops.opts.op=lookup3byte,graphics.opts.sizes=1000:
6800.ddraw: 13404.82573 (var=0.03%) (100.0%)
6800.old: 8713.136729 (var=0.03%) (65.0%)
6800.new: 1362935.65683 (var=0.54%) (10167.5%)
graphics.imaging.imageops.opts.op=lookup3byte,graphics.opts.sizes=250:
6800.ddraw: 27250.0 (var=0.5%) (100.0%)
6800.old: 11746.37925 (var=2.17%) (43.11%)
6800.new: 801312.5 (var=0.53%) (2940.6%)
graphics.imaging.imageops.opts.op=lookup3short,graphics.opts.sizes=1000:
6800.ddraw: 4106.776180 (var=0.58%) (100.0%)
6800.old: 3554.923569 (var=0.53%) (86.56%)
6800.new: 1359249.32975 (var=0.03%) (33097.72%)
graphics.imaging.imageops.opts.op=lookup3short,graphics.opts.sizes=250:
6800.ddraw: 4724.801061 (var=0.03%) (100.0%)
6800.old: 5097.811671 (var=0.03%) (107.89%)
6800.new: 761354.16666 (var=0.54%) (16113.99%)
graphics.imaging.imageops.opts.op=rescale1band,graphics.opts.sizes=1000:
6800.ddraw: 4496.713939 (var=0.52%) (100.0%)
6800.old: 3786.444528 (var=0.61%) (84.2%)
6800.new: 2653666.66666 (var=0.54%) (59013.46%)
graphics.imaging.imageops.opts.op=rescale1band,graphics.opts.sizes=250:
6800.ddraw: 5342.085090 (var=0.04%) (100.0%)
6800.old: 5749.329244 (var=0.58%) (107.62%)
6800.new: 2376312.5 (var=0.5%) (44482.87%)
graphics.imaging.imageops.opts.op=rescale3band,graphics.opts.sizes=1000:
6800.ddraw: 4473.503097 (var=1.63%) (100.0%)
6800.old: 3787.878787 (var=0.57%) (84.67%)
6800.new: 1752116.49170 (var=1.64%) (39166.54%)
graphics.imaging.imageops.opts.op=rescale3band,graphics.opts.sizes=250:
6800.ddraw: 5389.996167 (var=2.44%) (100.0%)
6800.old: 5769.976726 (var=1.85%) (107.05%)
6800.new: 1648667.89544 (var=0.54%) (30587.55%)

Summary:
  6800.ddraw: 
    Number of tests:  20
    Overall average:  8209.391001471771
    Best spread:      0.0% variance
    Worst spread:     2.44% variance
    (Basis for results comparison)

  6800.old: 
    Number of tests:  20
    Overall average:  6098.111210436934
    Best spread:      0.0% variance
    Worst spread:     2.17% variance
    Comparison to basis:
      Best result:      114.07% of basis
      Worst result:     42.91% of basis
      Number of wins:   8
      Number of ties:   0
      Number of losses: 12

  6800.new: 
    Number of tests:  20
    Overall average:  926660.8044390163
    Best spread:      0.0% variance
    Worst spread:     7.26% variance
    Comparison to basis:
      Best result:      59013.46% of basis
      Worst result:     2581.03% of basis
      Number of wins:   20
      Number of ties:   0
      Number of losses: 0

(I can't yet explain why some of the "old" numbers are so much slower
than the "ddraw" numbers, mainly for the LookupOp tests, where "old"
sometimes reaches only about 43% of the "ddraw" score.  For all of these
cases, I'd expect the op.filter() method to be producing IntRgb images,
so it's strange that the gap would differ from one op to another.)

It's worth noting that this particular board is two generations old,
and it's running on AGP, not PCI-Express.  I'm working on getting a full
set of numbers that covers newer boards (ATI Radeon X1900,
Nvidia GeForce 7xxx/8xxx) as well as older ones (ATI Radeon 9800,
Nvidia GeForce FX 5600).  But the results above are indicative of what we
can expect (and just imagine what they'll look like on newer boards).

So the executive summary (for the Nvidia GeForce 6800 GT, at least):
  ConvolveOp: 25x to 50x (faster than "old" OpenGL results)
  LookupOp:   70x to 160x
  RescaleOp:  400x to 700x
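
As a worked example of where ratios like these come from, the sketch below divides a "new" score by the corresponding "old" score for three rows of the listing above. It assumes that the scores are throughput-style figures where higher is better, which is consistent with the percent-of-basis column. The class name SpeedupSketch is made up for illustration.

```java
// Hypothetical helper class; the scores are copied from the listing above.
final class SpeedupSketch {

    // Ratio of the accelerated score to the baseline score.
    static double speedup(double newScore, double oldScore) {
        return newScore / oldScore;
    }

    public static void main(String[] args) {
        // convolve3x3noop, size 1000: roughly 51x
        System.out.printf("convolve3x3noop/1000: %.1fx%n",
                          speedup(255333.33333, 5005.362888));
        // lookup1byte, size 250: roughly 71x
        System.out.printf("lookup1byte/250:      %.1fx%n",
                          speedup(831312.5, 11732.92642));
        // rescale1band, size 1000: roughly 701x
        System.out.printf("rescale1band/1000:    %.1fx%n",
                          speedup(2653666.66666, 3786.444528));
    }
}
```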

These numbers reflect the maximum possible performance when the source image
is unchanging, and is therefore always cached in texture memory.  I'm planning
to run some more tests that touch the source image (a simple 1x1 fill rect)
between drawImage() calls so that we include the cost of uploading the
source image to texture memory each time; this would more adequately reflect
a common user scenario where the BufferedImageOp is applied only once to an
image, or the case where the image is dynamically changing between each
drawImage() call.

It's also worth noting that I did some more experimentation with the idea
above of having the various BufferedImageOp.filter() implementations delegate
to this new accelerated codepath (only when the OGL pipeline is enabled,
of course).  The way I made this work was to hack in some code in
ConvolveOp.filter() that creates a temporary VolatileImage, then calls
vimgGraphics.drawImage(src, this, 0, 0), then copies the VolatileImage
contents into the destination BufferedImage.  So far I only have performance
data on the AGP configuration described above, and the results aren't great
(this is for convolve3x3, 1000x1000 source image):

swonly: 7528.230865 (var=0.04%) (100.0%)
ogl: 6260.434056 (var=1.3%) (83.16%)
ogl.cachevimg: 6606.110652 (var=0.66%) (87.75%)
ogl.skipread: 19846.09153 (var=0.61%) (263.62%)

Here's a legend to explain the above:
  swonly:
    OGL pipeline is disabled, just use the existing medialib (software) path
  ogl:
    uses new accelerated codepath, no extra optimizations
  ogl.cachevimg:
    same as "ogl", except cache the VolatileImage to avoid re-creation
  ogl.skipread:
    same as "ogl.cachevimg", except skip the part that copies the VolatileImage
    into the destination BufferedImage

What these numbers show is that on an AGP bus, the VRAM->sysmem readback step
is a killer (no news there).  So it seems that this sort of optimization will
not make sense for older machines, but I would like to repeat the experiment
on a newer machine with a PCI-Express bus (where readback is much much faster).
Anyway, this was just an experiment, and has no bearing on the RFE at hand,
but I did want to mention it now in case we want to consider it as a future
performance enhancement (because it could certainly be a win on newer
systems).
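
The filter() experiment described above can be sketched roughly as follows. This is not the code that was hacked into ConvolveOp.filter(); it is a standalone approximation with made-up names (FilterViaDrawImageSketch, createIntermediate, filterViaDrawImage). It falls back to a BufferedImage as the intermediate surface when no display is available, and it skips the contents-lost handling that a real VolatileImage user would need.

```java
import java.awt.Graphics2D;
import java.awt.GraphicsEnvironment;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;
import java.awt.image.ByteLookupTable;
import java.awt.image.LookupOp;

// Hypothetical helper class; not part of the JDK.
final class FilterViaDrawImageSketch {

    // Intermediate surface: a VolatileImage when a display is available,
    // otherwise a plain BufferedImage so the sketch still runs headless.
    static Image createIntermediate(int w, int h) {
        if (GraphicsEnvironment.isHeadless()) {
            return new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        }
        return GraphicsEnvironment.getLocalGraphicsEnvironment()
                .getDefaultScreenDevice()
                .getDefaultConfiguration()
                .createCompatibleVolatileImage(w, h);
    }

    static BufferedImage filterViaDrawImage(BufferedImage src, BufferedImageOp op) {
        int w = src.getWidth();
        int h = src.getHeight();

        // Step 1: apply the op while drawing into the intermediate surface.
        Image intermediate = createIntermediate(w, h);
        Graphics2D vg = (Graphics2D) intermediate.getGraphics();
        try {
            vg.drawImage(src, op, 0, 0);
        } finally {
            vg.dispose();
        }

        // Step 2: copy the result back into system memory. With a
        // VolatileImage this is the readback step measured above.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D dg = dst.createGraphics();
        try {
            dg.drawImage(intermediate, 0, 0, null);
        } finally {
            dg.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        byte[] invert = new byte[256];
        for (int i = 0; i < 256; i++) {
            invert[i] = (byte) (255 - i);
        }
        BufferedImage out =
            filterViaDrawImage(src, new LookupOp(new ByteLookupTable(0, invert), null));
        System.out.println(out.getWidth() + "x" + out.getHeight());
    }
}
```

The "ogl.skipread" variant in the numbers above would correspond to dropping step 2, and "ogl.cachevimg" to reusing the intermediate image across calls instead of creating it each time.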
                                     
2007-01-25
EVALUATION

There are a couple of limitations in this first putback that are worth
documenting:

  ConvolveOp
    - works on any source image
    - accelerated only for 3x3 and 5x5 kernels (in the future, we may be
      able to generalize this and increase the supported kernel dimensions,
      but only on newer hardware)

  LookupOp
    - works only on 1-band ColorSpace.TYPE_GRAY and
      3/4-band ColorSpace.TYPE_RGB images
    - accelerated only for ByteLookupTables and ShortLookupTables with
      no more than 256 entries

  RescaleOp
    - works only on 1-band ColorSpace.TYPE_GRAY and
      3/4-band ColorSpace.TYPE_RGB images

In the cases where an image or op does not meet these restrictions, we will
simply fall back on the existing (slower) software code paths.
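
To make the restrictions listed above concrete, here is a sketch of a check that mirrors them. It is written purely from the list in this comment, not from the actual pipeline code, and the names (AccelEligibilitySketch, meetsRestrictions, isGrayOrRgb) are made up. It also assumes that "band" means a band of the image's SampleModel and that the entry count is the length of the first lookup array.

```java
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;
import java.awt.image.ByteLookupTable;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.awt.image.LookupOp;
import java.awt.image.LookupTable;
import java.awt.image.RescaleOp;
import java.awt.image.ShortLookupTable;

// Hypothetical helper class; an approximation of the documented limits.
final class AccelEligibilitySketch {

    // 1-band TYPE_GRAY or 3/4-band TYPE_RGB images only.
    static boolean isGrayOrRgb(BufferedImage img) {
        int type = img.getColorModel().getColorSpace().getType();
        int bands = img.getSampleModel().getNumBands();
        return (type == ColorSpace.TYPE_GRAY && bands == 1)
            || (type == ColorSpace.TYPE_RGB && (bands == 3 || bands == 4));
    }

    static boolean meetsRestrictions(BufferedImage src, BufferedImageOp op) {
        if (op instanceof ConvolveOp) {
            // Any source image, but only 3x3 and 5x5 kernels.
            Kernel k = ((ConvolveOp) op).getKernel();
            int w = k.getWidth();
            int h = k.getHeight();
            return (w == 3 && h == 3) || (w == 5 && h == 5);
        }
        if (op instanceof LookupOp) {
            // Byte or short tables with no more than 256 entries.
            LookupTable t = ((LookupOp) op).getTable();
            int entries;
            if (t instanceof ByteLookupTable) {
                entries = ((ByteLookupTable) t).getTable()[0].length;
            } else if (t instanceof ShortLookupTable) {
                entries = ((ShortLookupTable) t).getTable()[0].length;
            } else {
                return false;
            }
            return entries <= 256 && isGrayOrRgb(src);
        }
        if (op instanceof RescaleOp) {
            return isGrayOrRgb(src);
        }
        return false; // Other ops are not covered by this RFE.
    }

    public static void main(String[] args) {
        BufferedImage rgb = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        ConvolveOp op7 = new ConvolveOp(new Kernel(7, 7, new float[49]));
        System.out.println("7x7 convolve eligible: " + meetsRestrictions(rgb, op7));
    }
}
```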
                                     
2007-01-29


