Bug ID: JDK-6536952 OGL: ConvolveOp with 5x5 kernel extremely slow on ATI Radeon 9800

Versions (Unresolved/Resolved/Fixed)

JDK 6	JDK 7
6u4 (Fixed)	7 b14 (Fixed)

In 6514990, we added GPU acceleration for ConvolveOp for kernel sizes of 3x3 and 5x5.
Things work great on all shader-level Nvidia hardware from GeForce FX 5600 on up,
and on ATI hardware from R5xx on up.  But on R300 boards, such as Radeon 9600 and 9800,
performance of 5x5 ConvolveOp is unacceptably slow.  Enabling native tracing with
J2D_TRACE_LEVEL=2 shows the following on Radeon 9800 with Catalyst 7.2 (same can be
seen with earlier Catalyst drivers on Windows, or on Radeon 9600 with 8.34 on Linux):

[W] OGLContext_CreateFragmentProgram: linker msg (106):
  Link successful. The GLSL fragment shader will run in software - available number
  of constants exceeded.

Clearly the operation is causing the driver to fall back to a software path, which
means even a simple convolution operation can take seconds to complete, instead of
a few milliseconds.
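
For reference, the kind of call that exercises this path can be sketched as below. This is only an illustration, not the test case from the report: the class and method names are made up, the kernel is a plain box blur, and the destination is a BufferedImage so the sketch runs anywhere (which means it goes through the software loops). The assumption is that the slow path described above is only reached when the same drawImage() call targets an OpenGL-backed surface, for example with -Dsun.java2d.opengl=true and J2D_TRACE_LEVEL=2 set.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative and not taken from the JDK sources.
final class Convolve5x5Demo {

    /** Builds a 5x5 box-blur kernel: 25 equal weights that sum to 1. */
    static Kernel boxKernel5x5() {
        float[] weights = new float[25];
        Arrays.fill(weights, 1.0f / 25.0f);
        return new Kernel(5, 5, weights);
    }

    /** Draws src through a 5x5 ConvolveOp into a new image of the same size. */
    static BufferedImage blur(BufferedImage src) {
        ConvolveOp op = new ConvolveOp(boxKernel5x5(), ConvolveOp.EDGE_NO_OP, null);
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Same Graphics2D call an application would make on an accelerated
            // destination; here the destination is an ordinary BufferedImage.
            g.drawImage(src, op, 0, 0);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // 200x150 is deliberately not a power of two in either dimension.
        BufferedImage src = new BufferedImage(200, 150, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = src.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(50, 40, 100, 70);
        g.dispose();

        long start = System.nanoTime();
        BufferedImage out = blur(src);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Blurred " + out.getWidth() + "x" + out.getHeight()
                + " image in " + elapsedMs + " ms");
    }
}
```

Timing the call this way gives a rough feel for the gap between a few milliseconds and several seconds, though the numbers from a BufferedImage destination say nothing about the GPU path itself.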

EVALUATION

The problem is that R300 has a limited set of constant registers, and our ConvolveOp shader is currently making inefficient use of uniform arrays. The hardware has only so many vec4 registers, and for a 5x5 ConvolveOp, our current code uses:

  uniform vec2 imgMin;
  uniform vec2 imgMax;
  uniform vec2 offsets[25];
  uniform float kernelVals[25];

ATI's drivers aren't smart enough to know to pack the offsets and kernelVals into a single array, so we should take care of that ourselves. Also, we can do the same for imgMin and imgMax. Ultimately we end up with:

  // image edge limits:
  //   imgEdge.xy = imgMin.xy (anything < will be treated as edge case)
  //   imgEdge.zw = imgMax.xy (anything > will be treated as edge case)
  "uniform vec4 imgEdge;"
  // value for each location in the convolution kernel:
  //   kernelVals[i].x = offsetX[i]
  //   kernelVals[i].y = offsetY[i]
  //   kernelVals[i].z = kernel[i]
  "uniform vec3 kernelVals[MAX_KERNEL_SIZE];"

After making these changes, the shader compiler no longer complains about exceeding the number of available constants, but on Catalyst 7.2 and earlier, it now complains about something else:

  Link successful. The GLSL fragment shader will run in software - available number
  of texture instructions exceeded.

This problem only occurs when the source texture has non-pow2 dimensions, because we use the GL_ARB_texture_rectangle extension in this case. We worked with ATI to confirm that this is indeed a driver issue that has been fixed for their upcoming Catalyst 7.3 release (and hopefully fixed soon on Linux as well).

So in summary, we're making the changes described above in the JDK to work around the constant register limit issue, but folks will need to install Catalyst 7.3 or later for the complete problem to go away. (A workaround for Catalyst 7.2 and earlier is to simply use pow2-sized images only, but that's a fairly limiting restriction, so it would be better to just install 7.3 when it becomes available.)
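
The packing described in the evaluation can be sketched on the host side as follows. This is a minimal Java illustration rather than the actual JDK change (which lives in native code): the class, the method names, and the xoff/yoff texel-step parameters are hypothetical. The register counts rest on a stated assumption, namely that the driver spends one full vec4 constant slot on every uniform and every array element, which is the behavior the evaluation attributes to ATI's drivers.

```java
// Hypothetical sketch; names are illustrative and not taken from the JDK sources.
final class PackedKernelSketch {

    /** Largest kernel the shader is declared for (5x5 = 25 taps). */
    static final int MAX_KERNEL_SIZE = 25;

    /**
     * Packs one {offsetX, offsetY, weight} triple per kernel tap, in row-major
     * order, matching the layout of "uniform vec3 kernelVals[MAX_KERNEL_SIZE]".
     * xoff and yoff are the texture-coordinate steps between neighboring texels
     * (assumed to be supplied by the caller).
     */
    static float[] pack(int kernelWidth, int kernelHeight, float[] weights,
                        float xoff, float yoff) {
        int taps = kernelWidth * kernelHeight;
        if (weights.length != taps || taps > MAX_KERNEL_SIZE) {
            throw new IllegalArgumentException("unsupported kernel size");
        }
        float[] packed = new float[3 * taps];
        int k = 0;
        for (int row = 0; row < kernelHeight; row++) {
            int dy = row - kernelHeight / 2;      // -2..2 for a 5x5 kernel
            for (int col = 0; col < kernelWidth; col++) {
                int dx = col - kernelWidth / 2;   // -2..2 for a 5x5 kernel
                packed[k++] = dx * xoff;                       // kernelVals[i].x
                packed[k++] = dy * yoff;                       // kernelVals[i].y
                packed[k++] = weights[row * kernelWidth + col]; // kernelVals[i].z
            }
        }
        return packed;
    }

    /** Slots for imgMin, imgMax, offsets[taps], kernelVals[taps] (one slot each, by assumption). */
    static int slotsBeforePacking(int taps) {
        return 1 + 1 + taps + taps;
    }

    /** Slots for imgEdge plus the packed kernelVals[taps] (one slot each, by assumption). */
    static int slotsAfterPacking(int taps) {
        return 1 + taps;
    }

    public static void main(String[] args) {
        float[] weights = new float[25];
        java.util.Arrays.fill(weights, 1.0f / 25.0f);
        float[] packed = pack(5, 5, weights, 1.0f, 1.0f);
        System.out.println("packed floats: " + packed.length);
        System.out.println("slots before:  " + slotsBeforePacking(25));
        System.out.println("slots after:   " + slotsAfterPacking(25));
    }
}
```

Under that one-slot-per-element assumption, a 5x5 kernel drops from 1 + 1 + 25 + 25 = 52 constant slots to 1 + 25 = 26, which is consistent with the "constants exceeded" message going away after the change.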

21-03-2007