Bug ID: JDK-6273431 OGL: improve performance of parameter queuing

Details
Type: Bug
Submit Date: 2005-05-20
Status: Resolved
Updated Date: 2008-02-05
Project Name: JDK
Resolved Date: 2005-06-27
Component: client-libs
OS: solaris_9
Sub-Component: 2d
CPU: generic
Priority: P4
Resolution: Fixed
Affected Versions: 5.0
Fixed Versions:

Related Reports
Relates:

Sub Tasks

Description
In our single-threaded OGL pipeline, we enqueue parameters for each rendering
operation onto a java.nio.ByteBuffer.  For example, a fillRect() operation
currently looks like this:
    buf.putInt(FILL_RECT);
    buf.putInt(x).putInt(y).putInt(w).putInt(h);

We have a number of operations whose parameters consist entirely of integers
(no floats, longs, etc.).  For these operations we can exploit the fact that
we always keep the current buffer position aligned on a 4-byte boundary, and
therefore use an IntBuffer view on the original ByteBuffer to enqueue those
int parameters.  (In a microbenchmark, tested on both Solaris/SPARC and Linux,
I found that using IntBuffer.put() can be up to 3x faster than
ByteBuffer.putInt() when the buffer is in the native machine endianness.)
So for the above example, we would instead use:
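    // The view starts at buf's current position and shares its byte order
    IntBuffer ibuf = buf.asIntBuffer();
    ibuf.put(FILL_RECT);
    ibuf.put(x).put(y).put(w).put(h);
    // The view tracks its own position; advance buf by 5 ints * 4 bytes
    buf.position(buf.position() + 20);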
###@###.### 2005-05-20 00:09:21 GMT
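
As a sanity check on the idea above, the following minimal sketch writes the
same fillRect() parameters both ways and compares the resulting bytes.  The
class name, the helper methods, the opcode value 20, and the 64-byte buffer
size are all assumptions made for this example; they are not taken from the
actual pipeline code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.util.Arrays;

final class IntViewQueueDemo {
    // Hypothetical opcode value, chosen only for this example.
    static final int FILL_RECT = 20;

    /** Allocates a small direct buffer in the native byte order. */
    static ByteBuffer newQueue() {
        return ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder());
    }

    /** Baseline: one ByteBuffer.putInt() call per parameter. */
    static void fillRectPutInt(ByteBuffer buf, int x, int y, int w, int h) {
        buf.putInt(FILL_RECT);
        buf.putInt(x).putInt(y).putInt(w).putInt(h);
    }

    /** Variant from the description: write through an IntBuffer view. */
    static void fillRectIntView(ByteBuffer buf, int x, int y, int w, int h) {
        // The view starts at buf's current position and inherits its byte order.
        IntBuffer ibuf = buf.asIntBuffer();
        ibuf.put(FILL_RECT);
        ibuf.put(x).put(y).put(w).put(h);
        // The view has its own position, so advance buf by hand: 5 ints * 4 bytes.
        buf.position(buf.position() + 5 * Integer.BYTES);
    }

    /** Copies the bytes written so far (index 0 up to the current position). */
    static byte[] contents(ByteBuffer buf) {
        byte[] out = new byte[buf.position()];
        ByteBuffer dup = buf.duplicate();
        dup.flip();
        dup.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer a = newQueue();
        ByteBuffer b = newQueue();
        fillRectPutInt(a, 10, 20, 30, 40);
        fillRectIntView(b, 10, 20, 30, 40);
        System.out.println("same bytes: "
                + Arrays.equals(contents(a), contents(b)));
    }
}
```

This only checks that the two encodings are interchangeable; it does not
measure speed.  Note that the view approach relies on the buffer position
being 4-byte aligned, as stated above.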

Comments
EVALUATION

Using the approach described above, it appears that we can improve performance
of a number of common operations.  For example, on Solaris/SPARC with XVR-1200:

Operation           Improvement
---------           -----------
20x20 fillRect()       +45%
20x20 drawLine()       +52%
20x20 drawImage()      +20%
20x20 copyArea()       +10%

There are a few other operations that will also likely improve using these
techniques (e.g. setClip(), MaskFill, MaskBlit).
###@###.### 2005-05-20 00:07:37 GMT

Similar improvements are seen on Linux as well (JDS, Nvidia GF FX 5600,
7590 drivers, 2x 2.6GHz P4):

Operation           Improvement
---------           -----------
20x20 fillRect()       + 3%
20x20 drawLine()       +37%
20x20 drawImage()      +17%
20x20 copyArea()       +26%

###@###.### 2005-05-20 05:45:28 GMT

While the proposed changes certainly improve performance, the approach is
a bit clunky ("round peg, square hole").  If we are getting to the point
where we are using tricks to get better performance out of NIO ByteBuffers,
why not just write a thin Unsafe wrapper that meets our needs?  We are
already going out of our way to maintain 4-byte alignment, so only a few
more changes would be required to achieve 8-byte alignment when necessary
(i.e. when adding long and double parameters to the buffer).  This approach
has a couple of added benefits:
  - interface is mostly compatible with NIO classes
  - performance gains for all existing code without creating view buffers
    and such, as suggested earlier
  - no temporary object creation (before, we would create one or more
    view buffers for each drawGlyphList() call; while not too expensive,
    it would be nice to avoid this)
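
The alignment bookkeeping mentioned above can be sketched as follows.  This
is not the Unsafe wrapper itself: to keep the example runnable anywhere, a
direct ByteBuffer stands in for the raw memory, and the class and method
names are hypothetical.  It assumes the position is always a multiple of 4
and that offset 0 of the backing memory is itself 8-byte aligned.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical parameter queue; a ByteBuffer stands in for raw memory. */
final class AlignedQueueSketch {
    private final ByteBuffer mem;

    AlignedQueueSketch(int capacity) {
        mem = ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    /** Current write offset in bytes. */
    int position() {
        return mem.position();
    }

    /** Ints keep the position on a 4-byte boundary. */
    AlignedQueueSketch putInt(int v) {
        mem.putInt(v);
        return this;
    }

    /** Longs are written only at offsets that are multiples of 8. */
    AlignedQueueSketch putLong(long v) {
        alignTo8();
        mem.putLong(v);
        return this;
    }

    /** Doubles follow the same 8-byte rule as longs. */
    AlignedQueueSketch putDouble(double v) {
        alignTo8();
        mem.putDouble(v);
        return this;
    }

    /** Rounds the position up to the next multiple of 8 (skips 0 or 4 bytes). */
    private void alignTo8() {
        int pos = mem.position();
        mem.position((pos + 7) & ~7);
    }

    // Absolute reads, used here only to inspect what was written.
    int getInt(int index) {
        return mem.getInt(index);
    }

    long getLong(int index) {
        return mem.getLong(index);
    }
}
```

For example, `new AlignedQueueSketch(64).putInt(7).putLong(42L)` writes the
int at offset 0, skips offsets 4..7, and writes the long at offset 8.  A
consumer of such a queue would have to apply the same padding rule when
reading the parameters back.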

Here are some updated performance numbers with these changes in place
(on the Solaris/SPARC configuration listed above):

Operation             Improvement
---------             -----------
  1x1   fillRect()       +64%
 20x20  fillRect()       +52%
  1x1   drawLine()       +57%
 20x20  drawLine()       +64%
100x100 drawLine()       +39%
  1x1   drawImage()      +34%
 20x20  drawImage()      +32%
100x100 drawImage()      + 6%
  1x1   copyArea()       +11%
 20x20  copyArea()       +11%
  4 ch  drawString()     +20%
 32 ch  drawString()     +17%

And on my Windows XP machine (2x 2.6GHz P4, GF FX 5600):

Operation             Improvement
---------             -----------
  1x1   fillRect()       +41%
 20x20  fillRect()       +33%
  1x1   drawLine()       +49%
 20x20  drawLine()       +47%
100x100 drawLine()       +46%
  1x1   drawImage()      +68%
 20x20  drawImage()      +69%
100x100 drawImage()      + 3%
 20x20  copyArea()         0% (known driver slowness)
  4 ch  drawString()     +38%
 32 ch  drawString()     + 6%

###@###.### 2005-05-25 23:44:30 GMT