JDK-6521533 : OGL: accelerate Linear/RadialGradientPaint using fragment shaders
  • Type: Bug
  • Component: client-libs
  • Sub-Component: 2d
  • Affected Version: 6
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2007-02-05
  • Updated: 2011-03-08
  • Resolved: 2011-03-08
Fixed Versions
  • JDK 6: 6u4
  • JDK 7: 7 b10
Description
In JDK 6 we added new LinearGradientPaint and RadialGradientPaint implementations
(see 6296064).  We should be able to accelerate these operations via OpenGL just
as we've already done for GradientPaint and TexturePaint.

Comments
EVALUATION It would be pretty easy to get carried away generating performance data for these changes, so to make things a little easier I've just gathered numbers for three boards that I had handy:

  Nvidia Quadro FX 1100 (Solaris 10, 2x 2.0GHz Opteron, 97.46)
  Nvidia GeForce 6800 GT (Windows XP, 2x 2.8GHz P4, 93.71)
  ATI Radeon 9800 (Windows XP, 2x 2.8GHz P4, 7.1)

Note that these boards are relatively old for shader-level hardware, so the performance benefits seen in these changes will look even better on the latest and greatest hardware.

For each of these boards I measured the following:

  - linear2,linear3,radial2,radial3
  - fillRect,fillOval
  - aaOff,aaOn

for a total of 16 tests each. I ran each set under three different conditions:

  NoOGL   - 1.7.0-b07, sun.java2d.opengl=false
  SlowOGL - 1.7.0-b07, sun.java2d.opengl=True
  FastOGL - test build, sun.java2d.opengl=True

Here are the results for each individual board, with SlowOGL as a baseline...

Options common across all tests:
    graphics.opts.xormode=false
    graphics.opts.renderhint=Default
    graphics.opts.alpharule=SrcOver
    graphics.opts.extraalpha=false
    graphics.render.opts.alphacolor=false
    global.dest=VolatileImg
    graphics.opts.sizes=250
    graphics.opts.clip=false
    graphics.opts.anim=2

Nvidia Quadro FX 1100
--------------------------

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 503.1894662 (var=1.35%) (100.0%)
    FastOGL: 426025.72541 (var=0.87%) (84665.07%)
    NoOGL: 6540.594630 (var=0.73%) (1299.83%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 486.8409712 (var=0.34%) (100.0%)
    FastOGL: 33956.88503 (var=0.03%) (6974.94%)
    NoOGL: 4933.449747 (var=0.1%) (1013.36%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 468.4512610 (var=0.14%) (100.0%)
    FastOGL: 22243.82560 (var=0.07%) (4748.38%)
    NoOGL: 4145.916199 (var=24.4%) (885.03%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 468.9307403 (var=0.34%) (100.0%)
    FastOGL: 22229.00133 (var=0.03%) (4740.36%)
    NoOGL: 4148.653681 (var=0.23%) (884.7%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 9947.431788 (var=4.75%) (100.0%)
    FastOGL: 31985.18853 (var=0.2%) (321.54%)
    NoOGL: 36578.21518 (var=0.14%) (367.72%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 6839.311464 (var=1.05%) (100.0%)
    FastOGL: 23346.65597 (var=0.2%) (341.36%)
    NoOGL: 13814.41413 (var=0.33%) (201.99%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 4026.218613 (var=0.67%) (100.0%)
    FastOGL: 19227.24807 (var=0.03%) (477.55%)
    NoOGL: 5414.834369 (var=0.03%) (134.49%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 3995.831237 (var=0.07%) (100.0%)
    FastOGL: 19197.96018 (var=0.03%) (480.45%)
    NoOGL: 5414.834369 (var=0.07%) (135.51%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 13198.16373 (var=1.27%) (100.0%)
    FastOGL: 1480669.86095 (var=0.04%) (11218.76%)
    NoOGL: 58518.68327 (var=8.92%) (443.39%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 10000.0 (var=2.15%) (100.0%)
    FastOGL: 69421.14093 (var=0.03%) (694.21%)
    NoOGL: 24914.82112 (var=1.84%) (249.15%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 5065.856129 (var=1.12%) (100.0%)
    FastOGL: 44678.71485 (var=0.1%) (881.96%)
    NoOGL: 7132.815084 (var=5.0%) (140.8%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 5009.221998 (var=10.07%) (100.0%)
    FastOGL: 44980.74346 (var=1.07%) (897.96%)
    NoOGL: 7128.267973 (var=0.0%) (142.3%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 12403.62051 (var=3.85%) (100.0%)
    FastOGL: 62994.60431 (var=0.14%) (507.87%)
    NoOGL: 73072.66848 (var=0.38%) (589.12%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 9596.211365 (var=2.06%) (100.0%)
    FastOGL: 38575.85398 (var=0.03%) (401.99%)
    NoOGL: 26958.73705 (var=0.07%) (280.93%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 4912.483487 (var=1.13%) (100.0%)
    FastOGL: 28534.99824 (var=0.07%) (580.87%)
    NoOGL: 7067.062818 (var=0.03%) (143.86%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 4910.420775 (var=0.07%) (100.0%)
    FastOGL: 28498.49347 (var=0.03%) (580.37%)
    NoOGL: 7067.062818 (var=0.07%) (143.92%)

Summary:
    SlowOGL:
        Number of tests: 16
        Overall average: 5739.511471902666
        Best spread: 0.07% variance
        Worst spread: 10.07% variance
        (Basis for results comparison)
    FastOGL:
        Number of tests: 16
        Overall average: 149785.43127528526
        Best spread: 0.03% variance
        Worst spread: 1.07% variance
        Comparison to basis:
            Best result: 84665.07% of basis
            Worst result: 321.54% of basis
            Number of wins: 16
            Number of ties: 0
            Number of losses: 0
    NoOGL:
        Number of tests: 16
        Overall average: 18303.189434133998
        Best spread: 0.0% variance
        Worst spread: 24.4% variance
        Comparison to basis:
            Best result: 1299.83% of basis
            Worst result: 134.49% of basis
            Number of wins: 16
            Number of ties: 0
            Number of losses: 0

Nvidia GeForce 6800 GT
--------------------------

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 624.8931323 (var=0.5%) (100.0%)
    FastOGL: 452759.90912 (var=0.03%) (72453.97%)
    NoOGL: 607.8885448 (var=2.68%) (97.28%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 611.7275176 (var=0.57%) (100.0%)
    FastOGL: 205118.21066 (var=0.54%) (33530.98%)
    NoOGL: 8183.922196 (var=1318.05%) (1337.84%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 572.6816666 (var=0.0%) (100.0%)
    FastOGL: 144168.51900 (var=0.54%) (25174.29%)
    NoOGL: 4232.493095 (var=1.04%) (739.07%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 575.5594639 (var=0.54%) (100.0%)
    FastOGL: 144037.62033 (var=0.54%) (25025.67%)
    NoOGL: 4194.767091 (var=0.54%) (728.82%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 11868.79701 (var=0.03%) (100.0%)
    FastOGL: 29254.86532 (var=0.03%) (246.49%)
    NoOGL: 34479.56452 (var=0.03%) (290.51%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 9636.385032 (var=0.04%) (100.0%)
    FastOGL: 22101.48341 (var=0.54%) (229.35%)
    NoOGL: 19929.32200 (var=0.0%) (206.81%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 4123.307999 (var=0.0%) (100.0%)
    FastOGL: 20665.62700 (var=0.0%) (501.19%)
    NoOGL: 5328.036180 (var=0.03%) (129.22%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 4123.307999 (var=0.0%) (100.0%)
    FastOGL: 20420.19199 (var=0.5%) (495.24%)
    NoOGL: 5329.821715 (var=0.03%) (129.26%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 9240.246406 (var=0.03%) (100.0%)
    FastOGL: 2192552.20613 (var=0.0%) (23728.29%)
    NoOGL: 12881.19973 (var=2.12%) (139.4%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 7938.170241 (var=0.03%) (100.0%)
    FastOGL: 526479.84886 (var=20.51%) (6632.26%)
    NoOGL: 36424.82221 (var=251.34%) (458.86%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 4209.953083 (var=0.03%) (100.0%)
    FastOGL: 336520.83333 (var=0.54%) (7993.46%)
    NoOGL: 7100.368632 (var=0.03%) (168.66%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 5292.145593 (var=0.04%) (100.0%)
    FastOGL: 336520.83333 (var=0.5%) (6358.87%)
    NoOGL: 7083.333333 (var=0.0%) (133.85%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 8721.919673 (var=0.54%) (100.0%)
    FastOGL: 26848.88814 (var=0.0%) (307.83%)
    NoOGL: 72328.80247 (var=0.55%) (829.28%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 7600.502512 (var=1.04%) (100.0%)
    FastOGL: 40968.49865 (var=0.03%) (539.02%)
    NoOGL: 31877.76642 (var=0.03%) (419.42%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 4041.666666 (var=0.0%) (100.0%)
    FastOGL: 37198.39142 (var=0.0%) (920.37%)
    NoOGL: 6875.0 (var=0.0%) (170.1%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 5173.424932 (var=0.54%) (100.0%)
    FastOGL: 36758.54557 (var=0.03%) (710.53%)
    NoOGL: 6875.0 (var=0.0%) (132.89%)

Summary:
    SlowOGL:
        Number of tests: 16
        Overall average: 5272.168058651091
        Best spread: 0.0% variance
        Worst spread: 1.04% variance
        (Basis for results comparison)
    FastOGL:
        Number of tests: 16
        Overall average: 285773.404520359
        Best spread: 0.0% variance
        Worst spread: 20.51% variance
        Comparison to basis:
            Best result: 72453.97% of basis
            Worst result: 229.35% of basis
            Number of wins: 16
            Number of ties: 0
            Number of losses: 0
    NoOGL:
        Number of tests: 16
        Overall average: 16483.256760270844
        Best spread: 0.0% variance
        Worst spread: 1318.05% variance
        Comparison to basis:
            Best result: 1337.84% of basis
            Worst result: 97.28% of basis
            Number of wins: 15
            Number of ties: 0
            Number of losses: 1

ATI Radeon 9800
-------------------

(Note: The SlowOGL numbers look especially bad on this board due to known glDrawPixels() slowness on this series of ATI boards on Windows. The key thing to note though is how much better the "FastOGL" numbers are when compared to "NoOGL", often hundreds of times faster.)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 118.8958477 (var=0.03%) (100.0%)
    FastOGL: 279169.15407 (var=3.2%) (234801.43%)
    NoOGL: 1951.572102 (var=0.54%) (1641.41%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 118.2412250 (var=0.52%) (100.0%)
    FastOGL: 79684.56333 (var=0.0%) (67391.52%)
    NoOGL: 8274.383713 (var=367.83%) (6997.88%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 116.9533696 (var=0.03%) (100.0%)
    FastOGL: 57971.74700 (var=0.0%) (49568.26%)
    NoOGL: 4259.140033 (var=0.03%) (3641.74%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 117.5937713 (var=0.55%) (100.0%)
    FastOGL: 57988.10933 (var=0.5%) (49312.23%)
    NoOGL: 4237.844333 (var=0.54%) (3603.8%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 598.4192346 (var=0.0%) (100.0%)
    FastOGL: 26618.41360 (var=0.0%) (4448.12%)
    NoOGL: 34952.34989 (var=1.06%) (5840.78%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 595.1943415 (var=0.0%) (100.0%)
    FastOGL: 13449.83799 (var=0.54%) (2259.74%)
    NoOGL: 20335.80666 (var=0.03%) (3416.67%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 545.5948130 (var=0.0%) (100.0%)
    FastOGL: 13220.76533 (var=0.0%) (2423.18%)
    NoOGL: 5350.483000 (var=0.0%) (980.67%)

graphics.render.tests.fillOval,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 548.5509651 (var=0.0%) (100.0%)
    FastOGL: 13106.22899 (var=0.0%) (2389.25%)
    NoOGL: 5350.483000 (var=0.0%) (975.38%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear2:
    SlowOGL: 430.1445285 (var=0.52%) (100.0%)
    FastOGL: 1254420.68036 (var=0.54%) (291627.72%)
    NoOGL: 21773.92183 (var=1.05%) (5062.0%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=linear3:
    SlowOGL: 430.1445285 (var=0.55%) (100.0%)
    FastOGL: 180895.83333 (var=0.0%) (42054.66%)
    NoOGL: 36557.78894 (var=134.05%) (8498.95%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial2:
    SlowOGL: 408.4967320 (var=0.55%) (100.0%)
    FastOGL: 127052.61394 (var=0.03%) (31102.48%)
    NoOGL: 7125.0 (var=0.0%) (1744.2%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=false,graphics.render.opts.paint=radial3:
    SlowOGL: 410.7575233 (var=0.55%) (100.0%)
    FastOGL: 126458.33333 (var=0.54%) (30786.61%)
    NoOGL: 7125.0 (var=0.0%) (1734.6%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear2:
    SlowOGL: 481.5745393 (var=0.03%) (100.0%)
    FastOGL: 58245.85167 (var=0.54%) (12094.88%)
    NoOGL: 73865.56044 (var=0.51%) (15338.34%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=linear3:
    SlowOGL: 481.7359249 (var=0.03%) (100.0%)
    FastOGL: 24625.0 (var=0.0%) (5111.72%)
    NoOGL: 32186.01089 (var=0.03%) (6681.26%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial2:
    SlowOGL: 453.9951573 (var=0.03%) (100.0%)
    FastOGL: 23500.33512 (var=0.03%) (5176.34%)
    NoOGL: 6930.485762 (var=0.03%) (1526.55%)

graphics.render.tests.fillRect,graphics.render.opts.antialias=true,graphics.render.opts.paint=radial3:
    SlowOGL: 456.5217391 (var=0.0%) (100.0%)
    FastOGL: 22997.65415 (var=0.03%) (5037.58%)
    NoOGL: 6937.5 (var=0.53%) (1519.64%)

Summary:
    SlowOGL:
        Number of tests: 16
        Overall average: 394.55089009593866
        Best spread: 0.0% variance
        Worst spread: 0.55% variance
        (Basis for results comparison)
    FastOGL:
        Number of tests: 16
        Overall average: 147462.82010035522
        Best spread: 0.0% variance
        Worst spread: 3.2% variance
        Comparison to basis:
            Best result: 291627.72% of basis
            Worst result: 2259.74% of basis
            Number of wins: 16
            Number of ties: 0
            Number of losses: 0
    NoOGL:
        Number of tests: 16
        Overall average: 17325.83316423228
        Best spread: 0.0% variance
        Worst spread: 367.83% variance
        Comparison to basis:
            Best result: 15338.34% of basis
            Worst result: 975.38% of basis
            Number of wins: 16
            Number of ties: 0
            Number of losses: 0
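When reading the figures above, it helps to remember that the last column is simply each score divided by the SlowOGL score for the same test, expressed as a percentage, so 84665.07% means roughly 847 times the baseline. A minimal sketch of that arithmetic follows; the function name is hypothetical and is not part of J2DBench.

```c
#include <assert.h>

/* Hypothetical helper: express a benchmark score as a percentage of the
 * baseline (SlowOGL) score for the same test. */
double percent_of_basis(double score, double basis)
{
    assert(basis > 0.0);   /* a zero baseline would make the ratio meaningless */
    return 100.0 * score / basis;
}
```

For the first Quadro FX 1100 test, percent_of_basis(426025.72541, 503.1894662) reproduces the reported FastOGL figure to within rounding.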
13-02-2007

EVALUATION There are quite a few different options on LGP and RGP:

  - CycleMethod (3 choices)
  - ColorSpaceType (2 choices)
  - Number of colors/fractions (lots of choices)
  - Antialiasing on/off

To handle any of these cases with maximum performance, it was necessary to make the shader source code pluggable so that at runtime we could compile a really optimized shader that works specifically for that set of options. (We did something similar for the BufferedImageOp shaders in 6514990.) So for example, we'd compile/link a shader for REFLECT/SRGB/4/AAon, and another for NO_CYCLE/LINEAR_RGB/12/AAoff, and so on.

The tricky part is figuring out the maximum number of "stops" to support. Most shader-level hardware available today does not support non-constant sized loop bounds, and most only support a limited number of array/texture lookups, which means we can't just arbitrarily support any number of gradient stops (we have to cap it somehow). Since it's relatively uncommon for developers to specify lots of gradient stops, we decided to have two separate shader versions: one that supports up to 4 stops (doesn't have to iterate lots of times and is therefore highly optimized) and another one that supports up to 12 stops (more general and supports more colors, but at the cost of performance). Detailed performance data will come later in this evaluation.

The fragment shaders devised for this fix worked well even on the oldest shader-level hardware from Nvidia, but we ran into some problems when running on older ATI hardware (specifically the R300 series, which includes Radeon 9500, 9800, x300, etc). There were two specific bugs encountered:

  1) The 4-color shader would run in hardware, but the 16-color shader would run in software (painfully slow).
  2) The 16-color shader looked correct, but the 4-color shader would produce totally wrong results.

Well, for (1) I determined that, as suspected, the hardware is only able to access a certain number of array values (it probably puts each array value into a constant, and the hardware can only handle a certain number of constants in hardware). After some experimentation and loop unrolling I eventually figured out that it can handle 12 fractions, but anything greater is too much. So the fix for this can just be to put a cap on the number of fractions we can support at 12.

(2) turned out to be really bizarre. I noticed in my testcase that the visual results were actually correct in the MaskFill case, but totally wrong in the "solid" case. It took me a while to figure out why one would work, but not the other. The following comment sums it up:

    /*
     * REMIND: This is really wacky, but the gradient shaders will
     * produce completely incorrect results on ATI hardware (at least
     * on first-gen (R300-based) boards) if the shader program does not
     * try to access texture coordinates by using a gl_TexCoord[*]
     * variable.  This problem really should be addressed by ATI, but
     * in the meantime it seems we can workaround the issue by inserting
     * a benign operation that accesses gl_TexCoord[0].  Note that we
     * only need to do this for ATI boards and only in the !useMask case,
     * because the useMask case already does access gl_TexCoord[1] and
     * is therefore not affected by this driver bug.
     */
    const char *vendor = (const char *)j2d_glGetString(GL_VENDOR);
    if (vendor != NULL && strncmp(vendor, "ATI", 3) == 0) {
        maskCode = "dist = gl_TexCoord[0].s;";
    }

As much as possible I try to keep vendor-specific workarounds like this out of our codebase, but in this case the fix is very localized and won't harm anything in the future even if ATI fixes the problem, so it should be safe.

Now on to the performance data.
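
Before the numbers, here is a minimal sketch of the stop-count policy that the 12-fraction cap leads to, together with one way a per-option-set shader program cache could be indexed. It uses the MAX_FRACTIONS_SMALL and MAX_FRACTIONS_LARGE values quoted later in this evaluation; the enum names, function names, and index layout are hypothetical and are not taken from the actual fix.

```c
#include <assert.h>

/* Values as quoted later in this evaluation. */
#define MAX_FRACTIONS_SMALL 4
#define MAX_FRACTIONS_LARGE 12

/* Hypothetical names, for illustration only. */
typedef enum { VARIANT_SMALL, VARIANT_LARGE, VARIANT_SOFTWARE } ShaderVariant;
typedef enum { CYCLE_NONE, CYCLE_REFLECT, CYCLE_REPEAT } CycleMethod;

/* Pick the shader variant for a gradient with numStops color stops. */
ShaderVariant choose_variant(int numStops)
{
    assert(numStops >= 2);  /* a gradient needs at least two stops */
    if (numStops <= MAX_FRACTIONS_SMALL) {
        return VARIANT_SMALL;
    }
    if (numStops <= MAX_FRACTIONS_LARGE) {
        return VARIANT_LARGE;
    }
    return VARIANT_SOFTWARE;  /* too many stops: use the software loops */
}

/* ASSUMPTION: one compiled program per combination of options, stored in
 * a flat table.  3 cycle methods x 2 color spaces x 2 variants x
 * mask on/off gives indices 0..23. */
int program_index(CycleMethod cycle, int linearRGB,
                  ShaderVariant variant, int useMask)
{
    assert(variant != VARIANT_SOFTWARE);  /* no shader for that case */
    int idx = (int)cycle;
    idx = idx * 2 + (linearRGB ? 1 : 0);
    idx = idx * 2 + (variant == VARIANT_LARGE ? 1 : 0);
    idx = idx * 2 + (useMask ? 1 : 0);
    return idx;
}
```

A caller would first check choose_variant() and only look up (or lazily compile) a program at program_index() when the result is not VARIANT_SOFTWARE.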

While designing these shaders we came up with some competing algorithms (some faster than others), so it's worth including this information for the record; in case someone in the future wants to try to improve performance, they can learn from things we've already tried.

First, Jim found a cool way to use an accumulation function for calculating the current gradient position (the texcoord for the 1D gradient texture). My original code looked like this:

    static const char *texCoordCalcCode =
        "int i;"
        "int clrIndex = 0;"
        "float relFraction = 0.0;"
        ""
        // iterate through fractions to determine the subrange to
        // which this fragment belongs
        "for (i = 1; i < MAX_FRACTIONS; i++) {"
        "    if (dist < fractions[i]) {"
        // this is a value in [0,1] representing the relative position
        // between the two colors
        "        relFraction ="
        "            (dist - fractions[i-1]) /"
        "            (fractions[i] - fractions[i-1]);"
        // save the index for later
        "        clrIndex = i-1;"
        // hack
        "        dist = 10.0;"
        "    }"
        "}"
        ""
        // we offset by half a texel so that we find the linearly interpolated
        // color between the two texel centers of interest
        "tc = HALF_TEXEL + (FULL_TEXEL * (float(clrIndex) + relFraction));";

This is not ideal because there is a conditional in each iteration of the loop. The break statement is not supported on most hardware, so we can't just break out of the loop once we've found the correct position; instead the "hack" line sets dist to 10.0 so that no later comparison against fractions[i] can succeed. It's best to avoid conditionals whenever possible in fragment shaders for performance reasons.
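
For reference, the loop above can be ported to plain C to check what it computes for a given dist. This is only an illustrative sketch: the function name is hypothetical, the fraction count is passed in rather than fixed at MAX_FRACTIONS, and the final HALF_TEXEL/FULL_TEXEL scaling is left out because those values are not given here.

```c
#include <assert.h>

/* Hypothetical CPU port of the conditional lookup.  Returns the position
 * in "texel units", i.e. float(clrIndex) + relFraction from the shader. */
float gradient_pos_conditional(float dist, const float *fractions, int n)
{
    int clrIndex = 0;
    float relFraction = 0.0f;

    assert(n >= 2);  /* need at least one subrange */
    for (int i = 1; i < n; i++) {
        if (dist < fractions[i]) {
            /* relative position in [0,1] between the two colors */
            relFraction = (dist - fractions[i - 1]) /
                          (fractions[i] - fractions[i - 1]);
            clrIndex = i - 1;
            /* same trick as the shader: stop later iterations matching */
            dist = 10.0f;
        }
    }
    return (float)clrIndex + relFraction;
}
```

With fractions {0.0, 0.25, 1.0}, a dist of 0.625 lies halfway through the second subrange, so the function returns 1.5.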

Jim's approach is much cleaner because it avoids the conditional:

    static const char *texCoordCalcCode =
        "int i;"
        "float relFraction = 0.0;"
        "for (i = 0; i < MAX_FRACTIONS-1; i++) {"
        "    relFraction +="
        "        clamp((dist - fractions[i]) * scaleFactors[i], 0.0, 1.0);"
        "}"
        // we offset by half a texel so that we find the linearly interpolated
        // color between the two texel centers of interest
        "tc = HALF_TEXEL + (FULL_TEXEL * relFraction);";

I measured the performance difference between the two options on my Nvidia GeForce 6800 GT (AGP), and found that the latter code is at least 50% faster than the older code (this is for 1000x1000 non-AA fillRects):

graphics.render.opts.paint=linear2:
    tc1: 149798.92761 (var=0.03%) (100.0%)
    tc2: 245333.33333 (var=0.54%) (163.78%)

graphics.render.opts.paint=radial2:
    tc1: 129310.34482 (var=0.53%) (100.0%)
    tc2: 196362.41158 (var=0.54%) (151.85%)

The next area of optimization was in the REFLECT code. My original code was straightforward, but again had a conditional:

    static const char *reflectCode =
        "dist = mod(dist, 2.0);"
        "if (dist > 1.0) {"
        "    dist = 2.0 - dist;"
        "}"
        // (placeholder for texcoord calculation)
        "%s";

Jim came up with a way to do this using an abs instead of a conditional (again, more calculations, but even that's better than having a conditional):

    static const char *reflectCode =
        "dist = 1.0 - (abs(fract(dist * 0.5) - 0.5) * 2.0);"
        // (placeholder for texcoord calculation)
        "%s";

In this case, there wasn't a major speedup on the 6800, but it was better than nothing so we went with it:

graphics.render.opts.paint=linear2:
    ref1: 145728.64321 (var=0.03%) (100.0%)
    ref2: 149798.92761 (var=0.03%) (102.79%)

graphics.render.opts.paint=radial2:
    ref1: 123333.33333 (var=0.5%) (100.0%)
    ref2: 129353.23383 (var=0.5%) (104.88%)

(One thing to note about these linear2 results: this was before I added code that redirected 2-stop LinearGradientPaints to the older-but-faster basic GradientPaint acceleration code. But nonetheless, one would see very similar performance for linear3 or any of the other options.)

Finally, as mentioned earlier we had to figure out optimal values for the "small" and "large" variants of the shaders. To measure this, I changed the MAX_FRACTIONS value in my source code and recompiled before running the same J2DBench tests. This allows us to see the impact of having a smaller/larger number of iterations in our main loop. As expected, there is a pretty linear speedup seen with having a smaller number of iterations:

graphics.render.opts.paint=linear2:
    num32: 143048.57621 (var=0.54%) (100.0%)
    num16: 245333.33333 (var=0.54%) (171.5%)
    num8: 414879.35656 (var=0.57%) (290.03%)
    num4: 583474.43278 (var=0.54%) (407.89%)

graphics.render.opts.paint=radial2:
    num32: 123666.66666 (var=0.5%) (100.0%)
    num16: 196362.41158 (var=0.54%) (158.78%)
    num8: 275376.88442 (var=0.54%) (222.68%)
    num4: 364000.0 (var=0.53%) (294.34%)

Here we're using MAX_FRACTIONS=32 as a baseline, and comparing 16, 8, and 4 to that baseline. Not surprisingly, a smaller number of iterations means better performance. For example, using only 4 iterations is 3-4x faster than using 32 iterations.

To hit the sweet spots described earlier, we chose to use MAX_FRACTIONS_SMALL=4 and MAX_FRACTIONS_LARGE=12 (we would have gone with 16, but this would cause problems on older ATI hardware, see above). This means that for MultipleGradientPaints with 2-4 stops, we will use the optimized 4-color shader variant; for 5-12 stops, we will use the slightly slower 12-color shader variant; for anything larger, we will simply fall back on our existing software loops.

Final performance comparisons between various boards are coming soon...
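
As a sanity check on the two branch-free rewrites discussed in this evaluation (the clamp accumulation and the abs-based REFLECT), the sketch below ports them to plain C next to the conditional REFLECT form so that they can be compared on the CPU. It assumes scaleFactors[i] = 1 / (fractions[i+1] - fractions[i]), which this evaluation does not spell out; all function names are hypothetical, and fract() and abs are written out by hand only to keep the sketch free of math library calls.

```c
#include <assert.h>

static float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* GLSL fract(): x - floor(x), always in [0,1). */
static float fract(float x)
{
    float f = x - (float)(long)x;   /* truncates toward zero */
    return f < 0.0f ? f + 1.0f : f;
}

/* Accumulation form: each subrange contributes a clamped [0,1] share, so
 * the sum equals (index of subrange) + (relative position inside it). */
float gradient_pos_accumulated(float dist, const float *fractions, int n)
{
    float relFraction = 0.0f;
    assert(n >= 2);
    for (int i = 0; i < n - 1; i++) {
        /* ASSUMPTION: this is what scaleFactors[i] holds */
        float scale = 1.0f / (fractions[i + 1] - fractions[i]);
        relFraction += clamp01((dist - fractions[i]) * scale);
    }
    return relFraction;
}

/* REFLECT with a conditional; GLSL mod(d, 2.0) equals 2.0 * fract(d / 2.0). */
float reflect_branch(float dist)
{
    dist = 2.0f * fract(dist * 0.5f);
    if (dist > 1.0f) {
        dist = 2.0f - dist;
    }
    return dist;
}

/* REFLECT using abs instead of a conditional in the shader. */
float reflect_abs(float dist)
{
    float d = fract(dist * 0.5f) - 0.5f;
    if (d < 0.0f) {
        d = -d;                     /* abs(), written out by hand */
    }
    return 1.0f - d * 2.0f;
}
```

With fractions {0.0, 0.25, 1.0}, gradient_pos_accumulated() gives 0.5 for dist 0.125 and approximately 1.5 for dist 0.625, the same index-plus-fraction positions the conditional loop produces for in-range inputs, and the two REFLECT forms agree on inputs such as 0.25, 1.5, 2.75, and -0.5.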
12-02-2007

EVALUATION There are some big performance gains to be had (at least 4x faster than software even on the oldest boards with fragment shader support, and as much as 100x faster on newer boards). More detailed performance data will be added here in the near future. Now would also be a good time to clean up the way that we share the paint acceleration code between OGLRenderer (for non-AA rendering) and OGLMaskFill (for AA rendering). This should also pave the way to sharing much of this code between the OGL pipeline and the new D3D pipeline when it's ready.
05-02-2007