Bug ID: JDK-6652116 D3D: SW->Accelerated surface blits are slower with the new pipeline

Details
Type: Bug
Submit Date: 2008-01-17
Status: Open
Updated Date: 2012-01-16
Project Name: JDK
Resolved Date:
Component: client-libs
OS: windows_xp
Sub-Component: 2d
CPU: x86
Priority: P4
Resolution: Unresolved
Affected Versions: dr2,6u10
Targeted Versions:

Related Reports
Duplicate:

Sub Tasks

Description
Copying uncached heap-based images to an accelerated surface is slower with
the new pipeline than in earlier releases, especially when copying
to the screen.

This affects applications which do software-only rendering
(direct pixel manipulation) and then copy images to the screen.
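
The affected usage pattern can be sketched as follows. This is not the attached PerfTest, only a minimal illustration with hypothetical class and method names: the frame is drawn by writing into the image's backing int[] array, and the image is then copied to a destination with drawImage. On Sun JDKs, grabbing the backing array is assumed to be what keeps the image an uncached heap-based one. The destination here is another BufferedImage so the sketch runs without a display; in the reported case it would be the screen or a BufferStrategy buffer.

```java
import java.awt.Graphics;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;

// Hypothetical sketch of software-only rendering followed by a copy.
final class SoftwareRenderSketch {
    // Draws one frame by writing pixels directly into the backing array.
    static void renderFrame(BufferedImage img, int frame) {
        int[] pixels = ((DataBufferInt) img.getRaster().getDataBuffer()).getData();
        int w = img.getWidth();
        for (int i = 0; i < pixels.length; i++) {
            int x = i % w;
            int y = i / w;
            // Simple moving gradient: red depends on x and frame, green on y.
            pixels[i] = (((x + frame) & 0xFF) << 16) | ((y & 0xFF) << 8);
        }
    }

    // Copies the software-rendered image to the destination graphics.
    static void copyTo(Graphics g, BufferedImage img) {
        g.drawImage(img, 0, 0, null);
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(300, 400, BufferedImage.TYPE_INT_RGB);
        BufferedImage dst = new BufferedImage(300, 400, BufferedImage.TYPE_INT_RGB);
        renderFrame(src, 1);
        Graphics g = dst.createGraphics();
        copyTo(g, src);
        g.dispose();
        System.out.println(Integer.toHexString(dst.getRGB(5, 7)));
    }
}
```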

With the attached application:
6u4:
#>java PerfTest 
using AWT
using BI->Screen
1367
1622
1641
1636

6u10:
#>java PerfTest
using AWT
using BI->Screen
812
908
997
936

6u4: (BI->BS->screen)
#> java -Dusebs=true PerfTest
using AWT
using BS
1786
2256
2259
2260

6u10: (BI->BS->screen)
#>java -Dusebs=true PerfTest
using AWT
using BS
1204
1334
1382
1381

In both 6u4 and 6u10 the GDI pipeline is the fastest for this
particular application:

6u4:
#> java -Dsun.java2d.noddraw=true  PerfTest
using AWT
using BI->Screen
3147
3425
3429

6u10:
#>java -Dsun.java2d.d3d=false PerfTest
using AWT
using BI->Screen
3321
3497
3538

                                    

Comments
EVALUATION

The BI->Screen case is the worst because of the onscreen
accelerated rendering support added in 6u10.
(we redirect on-screen rendering to an off-screen d3d 
surface, and then present it sometime later).

Since we don't have a trigger to synchronously flip
the off-screen surface after a BI->Screen blit, it
appears slower.

However even if Toolkit.sync() is added (which will force
the flip right away) it's still slower.

Using BufferStrategy instead of rendering directly to
the screen improves the situation somewhat.

But in general we should try to improve the performance
of our BI->D3D blit.

See this thread for more information:
http://forums.java.net/jive/thread.jspa?threadID=35484&tstart=0
                                     
2008-01-17
EVALUATION

Here are the results of an investigation (same attached
PerfTest was used) - the result is in fps:


The code path to get pixels from a BI to the back-buffer
or screen is as follows:
  1. the pixels are copied to a texture
  2. texture is drawn to the back-buffer
This is because creating lockable render targets (like a back-buffer)
is highly inadvisable since locking stalls the gpu.

If the source image is too large to fit in a texture
it is tiled (steps 1-2 repeated for each tile).

Step 1 consists of this call:
  D3DBL_CopyImageToIntXrgbSurface - takes care of copying pixels to the blit texture

  It locks the destination surface and calls an optimized software 
  loop which copies the pixels from the src image to the texture,
  performing format conversion on the fly. If no conversion is needed
  it is just a memcpy (specific method for this case is 
  AnyIntIsomorphicCopy).
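
As an illustration of the no-conversion case, here is a minimal Java sketch of a per-scan-line copy. It is not the native AnyIntIsomorphicCopy loop; the class name, method name, and the convention of measuring strides in pixels are assumptions made for the example. Each row is copied separately because the source image and the locked texture generally have different scan strides.

```java
// Hypothetical sketch: copy a w x h block of 32-bit pixels one row at a time.
final class ScanlineCopySketch {
    // srcStride and dstStride are the distances between row starts, in pixels.
    static void copyRows(int[] src, int srcStride,
                         int[] dst, int dstStride,
                         int w, int h) {
        for (int y = 0; y < h; y++) {
            // One bulk copy per scan line, the Java analogue of a memcpy per row.
            System.arraycopy(src, y * srcStride, dst, y * dstStride, w);
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5, 6};   // 3x2 image, stride 3
        int[] dst = new int[8];           // 4x2 destination, stride 4
        copyRows(src, 3, dst, 4, 3, 2);
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```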

Given this information here are the results of the investigation:
(the numbers are frames per second, for 300x400 image on nvidia fx 7800)
6u4  ddraw: 1060
6u4  noddraw: 1823
6u10 nod3d: 2262 (100%)
6u10 d3d default: 881 (38%)
1. 6u10 D3DBL_CopyImageToIntXrgbSurface no-oped: 1968 (87%)
2. 6u10 AnyIntIsomorphicCopy no-oped: 1300 (57%)
3. 6u10 AnyIntIsomorphicCopy no-oped + DYNAMIC disabled: 1050

1. means that the whole copying to the texture is no-oped, we just
   draw the texture to the destination. This gives us an
   approximation of what we could get if copying to the texture was
   free.
2. the blit loop which copies the pixels to the texture is no-oped, so
   we just lock and unlock the surface. This gives us an approximation
   of how much time we spend actually copying the pixels
3. same as 2., but we also disable the use of DYNAMIC textures. We already use
   DYNAMIC textures for this purpose to improve performance, so this is just
   to illustrate how much we gain by using DYNAMIC textures

By default the d3d pipeline is 62% slower than the no-d3d case in this benchmark. 
Note that in most cases people will be comparing with the old pipeline which
had ddraw enabled, and the performance drop for them will only be around
10%. And if the dimensions of the image increase (depending on configuration)
the performance difference decreases.

We spend most of the time getting the pixels
into the texture. Just copying the pixels alone takes around 30%
of the time (and this is w/o conversion). We could try to improve there, 
not sure how though, unless we use sse/mmx instructions, which is 
out of scope of this bug.
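
The time shares can be rechecked from the fps figures listed above by converting them to time per frame. The small calculation below is only a sketch with hypothetical names. Taking 881 fps as the default d3d baseline, no-oping the copy loop (1300 fps) removes roughly 32% of the frame time, and no-oping the whole texture upload (1968 fps) removes roughly 55%.

```java
// Hypothetical helper for turning the fps figures into frame-time shares.
final class FrameTimeSketch {
    static double msPerFrame(double fps) {
        return 1000.0 / fps;
    }

    // Fraction of the baseline frame time that disappears when a step is no-oped.
    static double shareSaved(double baselineFps, double noOpFps) {
        return 1.0 - msPerFrame(noOpFps) / msPerFrame(baselineFps);
    }

    public static void main(String[] args) {
        double d3dDefault = 881;      // 6u10 d3d default
        double copyLoopNoOp = 1300;   // AnyIntIsomorphicCopy no-oped
        double uploadNoOp = 1968;     // D3DBL_CopyImageToIntXrgbSurface no-oped
        System.out.println("copy loop share:   " + shareSaved(d3dDefault, copyLoopNoOp));
        System.out.println("whole upload share: " + shareSaved(d3dDefault, uploadNoOp));
    }
}
```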

I did some experiments with an ideal case (when the scan stride of the
source and destination are the same - like if we're copying a 256x256
image, which happens to be the same size as our blit texture).

In this case we could just use a single lock,memcpy(),unlock to copy the pixels
to the texture (instead of a memcpy per scan line). The overall improvement 
was around 8%. But this case is relatively rare. Creating a blit texture
of the size of the source image seems prone to thrashing, and caused
a significant slowdown in some cases (like bouncing between
two images of different sizes).
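
The ideal case can be expressed as a simple check before the copy. The sketch below uses hypothetical names and is not the code that was tried. It assumes strides measured in pixels and takes the single-copy path only when both strides equal the row width, so that no padding is touched.

```java
// Hypothetical sketch: one bulk copy when the layouts match, rows otherwise.
final class BlockCopySketch {
    static boolean canCopyInOneCall(int srcStride, int dstStride, int w) {
        return srcStride == dstStride && srcStride == w;
    }

    static void copy(int[] src, int srcStride, int[] dst, int dstStride, int w, int h) {
        if (canCopyInOneCall(srcStride, dstStride, w)) {
            // Ideal case: the whole block is contiguous in both buffers.
            System.arraycopy(src, 0, dst, 0, w * h);
        } else {
            // General case: copy each scan line separately.
            for (int y = 0; y < h; y++) {
                System.arraycopy(src, y * srcStride, dst, y * dstStride, w);
            }
        }
    }

    public static void main(String[] args) {
        int[] src = new int[256 * 256];
        int[] dst = new int[256 * 256];
        java.util.Arrays.fill(src, 0x00FF00);
        copy(src, 256, dst, 256, 256, 256);
        System.out.println(java.util.Arrays.equals(src, dst));
    }
}
```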

So no clear solution so far.
                                     
2008-02-01
EVALUATION

I have tried a couple of other approaches.

1. Instead of loading the image into the texture and then 
drawing this texture to the destination, I tried to use
IDirect3DDevice9::UpdateSurface. This is a method for
uploading pixels into an unlockable surface created in DEFAULT
pool (vram). Unfortunately this approach didn't yield any
benefits and is in fact slower (at least on my PCIX board).

2. When uploading images larger than 256x256 we
tile the image by uploading it piece by piece into the texture
and rendering it. The texture is a DYNAMIC texture (and thus resides
in vram), which can be locked with DISCARD flag (which allows the hw not to
stall every time the texture is locked).

But we were only locking the texture with this flag if 
we were filling the whole texture. If the image is say 300x300, 
the 3 pieces which didn't fit into the 256x256 temp. texture 
were uploaded w/o the use of the DISCARD flag. I've fixed that,
but again, there was not much benefit.

Approach 2) is probably worth integrating anyway, and probably 
applying it to the MaskBlit image upload code as well.
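
The tiling and the lock-flag decision from approach 2 can be illustrated with a small sketch. All names here are hypothetical and no Direct3D calls are made. The code only enumerates the tiles for a given image size and shows which of them would have been locked with the DISCARD flag before and after the change. For a 300x300 image it yields four tiles, of which one fills the 256x256 texture and three do not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting an image into blit-texture-sized tiles.
final class TileSketch {
    static final int TILE = 256; // blit texture size mentioned in the report

    static final class Tile {
        final int x, y, w, h;

        Tile(int x, int y, int w, int h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }

        boolean fillsTexture() {
            return w == TILE && h == TILE;
        }
    }

    static List<Tile> tiles(int imgW, int imgH) {
        List<Tile> out = new ArrayList<>();
        for (int y = 0; y < imgH; y += TILE) {
            for (int x = 0; x < imgW; x += TILE) {
                out.add(new Tile(x, y, Math.min(TILE, imgW - x), Math.min(TILE, imgH - y)));
            }
        }
        return out;
    }

    // Before the change: DISCARD only for tiles filling the whole texture.
    // After the change: DISCARD for every tile.
    static boolean lockWithDiscard(Tile t, boolean afterFix) {
        return afterFix || t.fillsTexture();
    }

    public static void main(String[] args) {
        for (Tile t : tiles(300, 300)) {
            System.out.println(t.x + "," + t.y + " " + t.w + "x" + t.h
                    + " discard before=" + lockWithDiscard(t, false)
                    + " after=" + lockWithDiscard(t, true));
        }
    }
}
```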
                                     
2008-04-26


