JDK-6273431 : OGL: improve performance of parameter queuing
  • Type: Bug
  • Component: client-libs
  • Sub-Component: 2d
  • Affected Version: 5.0
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_9
  • CPU: generic
  • Submitted: 2005-05-20
  • Updated: 2008-02-05
  • Resolved: 2005-06-27
JDK 6: 6 b43 (Fixed)
Description
In our single-threaded OGL pipeline, we enqueue parameters for each rendering
operation onto a java.nio.ByteBuffer.  For example, a fillRect() operation
currently looks like this:
    buf.putInt(FILL_RECT);
    buf.putInt(x).putInt(y).putInt(w).putInt(h);

We have a number of operations whose parameters consist entirely of integers
(no floats, longs, etc).  For these operations we can exploit the fact that
we always keep the current buffer position aligned on a 4-byte boundary, and
therefore use an IntBuffer view on the original ByteBuffer to enqueue those
int parameters.  (In a microbenchmark, tested on both Solaris/SPARC and Linux,
I found that using IntBuffer.put() can be up to 3x faster than
ByteBuffer.putInt() when the buffer is in the native machine endianness.)
So for the above example, we would instead use:
    IntBuffer ibuf = buf.asIntBuffer();   // view starts at buf's current position
    ibuf.put(FILL_RECT);
    ibuf.put(x).put(y).put(w).put(h);
    buf.position(buf.position() + 20);    // skip the 5 ints (5 * 4 bytes) just written
###@###.### 2005-05-20 00:09:21 GMT
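
The IntBuffer-view technique described above can be sketched as a small self-contained program. This is only an illustration under stated assumptions: the class name IntViewQueueSketch, the helper enqueueFillRect(), and the value of FILL_RECT are hypothetical and do not come from the actual pipeline sources. The buffer is explicitly placed in native byte order, since that is the case the microbenchmark refers to.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

class IntViewQueueSketch {
    // Hypothetical opcode value; the real constant is not shown in the report.
    static final int FILL_RECT = 20;

    // Enqueues one all-int operation through an IntBuffer view.
    static void enqueueFillRect(ByteBuffer buf, int x, int y, int w, int h) {
        // The technique relies on the position being 4-byte aligned.
        if ((buf.position() & 3) != 0) {
            throw new IllegalStateException("position not 4-byte aligned");
        }
        // The view starts at buf's current position and inherits its byte order.
        IntBuffer ibuf = buf.asIntBuffer();
        ibuf.put(FILL_RECT).put(x).put(y).put(w).put(h);
        // Writes through the view do not move buf's own position,
        // so advance it by the number of bytes written via the view.
        buf.position(buf.position() + ibuf.position() * Integer.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024)
                                   .order(ByteOrder.nativeOrder());
        enqueueFillRect(buf, 10, 20, 30, 40);
        enqueueFillRect(buf, 1, 2, 3, 4);
        System.out.println("bytes queued: " + buf.position());
    }
}
```

Deriving the byte count from ibuf.position() instead of hard-coding 20 keeps the bookkeeping correct if the number of parameters changes.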

Comments
EVALUATION
Using the approach described above, it appears that we can improve
performance of a number of common operations.  For example, on
Solaris/SPARC with XVR-1200:

    Operation            Performance
    ---------            -----------
    20x20 fillRect()        +45%
    20x20 drawLine()        +52%
    20x20 drawImage()       +20%
    20x20 copyArea()        +10%

There are a few other operations that will also likely improve using these
techniques (e.g. setClip(), MaskFill, MaskBlit).
###@###.### 2005-05-20 00:07:37 GMT

Similar improvements are seen on Linux as well (JDS, Nvidia GF FX 5600,
7590 drivers, 2x 2.6GHz P4):

    Operation            Performance
    ---------            -----------
    20x20 fillRect()        + 3%
    20x20 drawLine()        +37%
    20x20 drawImage()       +17%
    20x20 copyArea()        +26%

###@###.### 2005-05-20 05:45:28 GMT

While the proposed changes certainly improve performance, the approach is a
bit clunky ("round peg, square hole").  If we are getting to the point where
we are using tricks to get better performance out of NIO ByteBuffers, why not
just write a thin Unsafe wrapper that meets our needs?  We are already going
out of our way to maintain 4-byte alignment, so only a few more changes would
be required to achieve 8-byte alignment when necessary (i.e. when adding long
and double parameters to the buffer).
This approach has a couple added benefits:
  - interface is mostly compatible with NIO classes
  - performance gains for all existing code without creating view buffers
    and such, as suggested earlier
  - no temporary object creation (before, we would create one or more view
    buffers for each drawGlyphList() call; while not too expensive, it would
    be nice to avoid this)

Here are some updated performance numbers with these changes in place (on
the Solaris/SPARC configuration listed above):

    Operation            Performance
    ---------            -----------
    1x1 fillRect()          +64%
    20x20 fillRect()        +52%
    1x1 drawLine()          +57%
    20x20 drawLine()        +64%
    100x100 drawLine()      +39%
    1x1 drawImage()         +34%
    20x20 drawImage()       +32%
    100x100 drawImage()     + 6%
    1x1 copyArea()          +11%
    20x20 copyArea()        +11%
    4 ch drawString()       +20%
    32 ch drawString()      +17%

And on my Windows XP machine (2x 2.6GHz P4, GF FX 5600):

    Operation            Performance
    ---------            -----------
    1x1 fillRect()          +41%
    20x20 fillRect()        +33%
    1x1 drawLine()          +49%
    20x20 drawLine()        +47%
    100x100 drawLine()      +46%
    1x1 drawImage()         +68%
    20x20 drawImage()       +69%
    100x100 drawImage()     + 3%
    20x20 copyArea()          0%  (known driver slowness)
    4 ch drawString()       +38%
    32 ch drawString()      + 6%

###@###.### 2005-05-25 23:44:30 GMT
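
The 8-byte alignment bookkeeping mentioned in the evaluation can be illustrated with a short sketch. The source of the Unsafe-based wrapper is not shown in this report, so this is not that implementation: it uses a plain ByteBuffer as a stand-in, and the class name AlignedQueueSketch and its methods are hypothetical. It only shows how a queue that is always 4-byte aligned might pad to an 8-byte boundary before writing a long.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class AlignedQueueSketch {
    private final ByteBuffer buf;

    AlignedQueueSketch(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity).order(ByteOrder.nativeOrder());
    }

    // Ints keep the position on a 4-byte boundary.
    AlignedQueueSketch putInt(int v) {
        buf.putInt(v);
        return this;
    }

    // Longs are written only at offsets that are multiples of 8.
    AlignedQueueSketch putLong(long v) {
        alignTo8();
        buf.putLong(v);
        return this;
    }

    // Offsets are relative to the start of the buffer; whether the
    // underlying native address is itself 8-byte aligned is outside
    // the scope of this sketch.
    private void alignTo8() {
        if ((buf.position() & 7) != 0) {
            // The position is always a multiple of 4 here, so a single
            // 4-byte pad reaches the next 8-byte boundary.
            buf.putInt(0);
        }
    }

    int position() {
        return buf.position();
    }

    long getLong(int index) {
        return buf.getLong(index);
    }

    public static void main(String[] args) {
        AlignedQueueSketch q = new AlignedQueueSketch(64);
        q.putInt(7).putLong(42L);
        System.out.println("position after int + long: " + q.position());
    }
}
```

Whatever consumes the queue would need to apply the same padding rule when reading, so that it skips the pad bytes before each long or double.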