Bug ID: JDK-6536952 OGL: ConvolveOp with 5x5 kernel extremely slow on ATI Radeon 9800

Details
Type: Bug
Submit Date: 2007-03-21
Status: Closed
Updated Date: 2011-03-08
Project Name: JDK
Resolved Date: 2011-03-08
Component: client-libs
OS: generic
Sub-Component: 2d
CPU: generic
Priority: P3
Resolution: Fixed
Affected Versions: 7
Fixed Versions:

Description
In 6514990, we added GPU acceleration for ConvolveOp for kernel sizes of 3x3 and 5x5.
Things work great on all shader-level Nvidia hardware from GeForce FX 5600 on up,
and on ATI hardware from R5xx on up.  But on R300 boards, such as Radeon 9600 and 9800,
performance of 5x5 ConvolveOp is unacceptably slow.  Enabling native tracing with
J2D_TRACE_LEVEL=2 shows the following on Radeon 9800 with Catalyst 7.2 (same can be
seen with earlier Catalyst drivers on Windows, or on Radeon 9600 with 8.34 on Linux):

[W] OGLContext_CreateFragmentProgram: linker msg (106):
  Link successful. The GLSL fragment shader will run in software - available number
  of constants exceeded.

Clearly the operation is causing the driver to fall back to a software path, which
means even a simple convolution operation can take seconds to complete, instead of
a few milliseconds.
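
For context, a 5x5 convolution of the kind described here can be set up roughly as
in the sketch below. The class and method names are hypothetical. Whether the GPU
path is taken depends on running with the OpenGL pipeline enabled
(-Dsun.java2d.opengl=true) and drawing to an accelerated destination, which this
sketch does not verify; filtering into a BufferedImage, as done here, only
exercises the software path.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class Blur5x5 {
    // Builds a 5x5 box blur: 25 taps, each weighted 1/25.
    static ConvolveOp boxBlur5x5() {
        float[] weights = new float[25];
        Arrays.fill(weights, 1f / 25f);
        return new ConvolveOp(new Kernel(5, 5, weights), ConvolveOp.EDGE_NO_OP, null);
    }

    // Filters into a newly created destination image of the same size.
    static BufferedImage blur(BufferedImage src) {
        return boxBlur5x5().filter(src, null);
    }

    // Applies the op while drawing; this is the call form that a
    // hardware-accelerated pipeline has a chance to intercept.
    static void drawBlurred(Graphics2D g, BufferedImage src, int x, int y) {
        g.drawImage(src, boxBlur5x5(), x, y);
    }
}
```

Running such a test with J2D_TRACE_LEVEL=2 set in the environment is what produces
the linker message quoted above on affected hardware.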

                                    

Comments
EVALUATION

The problem is that R300 has a limited number of constant registers, and our ConvolveOp
shader currently makes inefficient use of uniform arrays.  The driver appears to give
every uniform, and every element of a uniform array, its own vec4 register no matter how
few components it uses, so for a 5x5 ConvolveOp, our current code uses:
    uniform vec2 imgMin;
    uniform vec2 imgMax;
    uniform vec2 offsets[25];
    uniform float kernelVals[25];
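
As a rough illustration of the register pressure, the arithmetic can be sketched as
follows. It assumes that each uniform, and each element of a uniform array, takes one
full vec4 register; that is an inference from this evaluation, not a documented driver
rule. The helper names are hypothetical, and the register budget is left as a
parameter because the report does not state the R300 limit.

```java
// Hypothetical helper, for illustration only.
final class UniformSlots {
    // Assumption: one vec4 register per uniform and per array element,
    // regardless of how many components the element actually uses.
    static int slots(int... elementCounts) {
        int total = 0;
        for (int n : elementCounts) {
            total += n;
        }
        return total;
    }

    // imgMin + imgMax + offsets[taps] + kernelVals[taps]
    static int unpackedLayout(int taps) {
        return slots(1, 1, taps, taps);
    }

    // True if the layout fits within a given register budget.
    static boolean fits(int slotCount, int budget) {
        return slotCount <= budget;
    }
}
```

Under that assumption the 5x5 case needs 1 + 1 + 25 + 25 = 52 registers, while the
3x3 case needs only 20.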

ATI's drivers aren't smart enough to know to pack the offsets and kernelVals into
a single array, so we should take care of that ourselves.  Also, we can do the
same for imgMin and imgMax.  Ultimately we end up with:
    // image edge limits:
    //   imgEdge.xy = imgMin.xy (anything < will be treated as edge case)
    //   imgEdge.zw = imgMax.xy (anything > will be treated as edge case)
    "uniform vec4 imgEdge;"
    // value for each location in the convolution kernel:
    //   kernelVals[i].x = offsetX[i]
    //   kernelVals[i].y = offsetY[i]
    //   kernelVals[i].z = kernel[i]
    "uniform vec3 kernelVals[MAX_KERNEL_SIZE];"

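The host side then has to supply the data in the packed form. The sketch below shows
one way to build the (offsetX, offsetY, weight) triples for an odd-sized kernel. It is
written in Java for illustration and is not the JDK's native code; the class name, the
method name, and the texelW/texelH parameters (the size of one texel in texture
coordinates) are assumptions.

```java
// Hypothetical helper, for illustration only.
final class KernelPacking {
    // Returns kw*kh triples laid out as [offsetX, offsetY, weight, ...],
    // matching the vec3 kernelVals[] layout described above.
    static float[] pack(int kw, int kh, float[] weights, float texelW, float texelH) {
        if (kw % 2 == 0 || kh % 2 == 0) {
            throw new IllegalArgumentException("kernel dimensions must be odd");
        }
        if (weights.length != kw * kh) {
            throw new IllegalArgumentException("expected " + (kw * kh) + " weights");
        }
        float[] packed = new float[kw * kh * 3];
        int edgeX = kw / 2;
        int edgeY = kh / 2;
        int k = 0;
        for (int row = -edgeY; row <= edgeY; row++) {
            for (int col = -edgeX; col <= edgeX; col++) {
                packed[k * 3] = col * texelW;      // kernelVals[k].x
                packed[k * 3 + 1] = row * texelH;  // kernelVals[k].y
                packed[k * 3 + 2] = weights[k];    // kernelVals[k].z
                k++;
            }
        }
        return packed;
    }
}
```

If each array element occupies one vec4 register, the packed 5x5 layout needs
1 + 25 = 26 registers (imgEdge plus kernelVals) instead of 52.
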
After making these changes, the shader compiler no longer complains about exceeding
the number of available constants, but on Catalyst 7.2 and earlier, it now complains
about something else:
  Link successful. The GLSL fragment shader will run in software - available number
  of texture instructions exceeded.

This problem only occurs when the source texture has non-pow2 dimensions because
we use the GL_ARB_texture_rectangle extension in this case.  We worked with ATI
to confirm that this is indeed a driver issue that has been fixed for their upcoming
Catalyst 7.3 release (and hopefully fixed soon on Linux as well).

So in summary, we're making the changes described above in the JDK to work around
the constant register limit issue, but folks will need to install Catalyst 7.3
or later for the complete problem to go away.  (A workaround for Catalyst 7.2 and
earlier is to simply use pow2-sized images only, but that's a fairly limiting
restriction, so it would be better to just install 7.3 when it becomes available.)
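
For anyone who needs the pow2 workaround in the meantime, a minimal sketch follows.
The class and method names are hypothetical, and padding is only one possible
approach: the padded area is transparent, so the result differs near the original
right and bottom edges, and the larger image costs extra memory.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper, for illustration only.
final class Pow2 {
    static boolean isPow2(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Smallest power of two that is >= n.
    static int nextPow2(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        int p = Integer.highestOneBit(n);
        return p == n ? n : p << 1;
    }

    // Copies src into the top-left corner of a pow2-sized ARGB image.
    // Returns src unchanged if both dimensions are already powers of two.
    static BufferedImage padToPow2(BufferedImage src) {
        int w = nextPow2(src.getWidth());
        int h = nextPow2(src.getHeight());
        if (w == src.getWidth() && h == src.getHeight()) {
            return src;
        }
        BufferedImage padded = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = padded.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return padded;
    }
}
```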
                                     
2007-03-21


