JDK-6652116 : D3D: SW->Accelerated surface blits are slower with the new pipeline
  • Type: Bug
  • Component: client-libs
  • Sub-Component: 2d
  • Affected Version: dr2,6u10
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2008-01-17
  • Updated: 2012-01-16
Copying uncached heap-based images to an accelerated surface is slower with
the new pipeline than in earlier releases, especially when copying
to the screen.

This affects applications which do software-only rendering
(direct pixel manipulation) and then copy images to the screen.
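
For illustration, the affected usage pattern might look like the sketch below. This is not the attached PerfTest; the class name `SoftwareFrame` and its methods are hypothetical. It assumes an `INT_RGB` image whose backing array is written directly and then drawn to some destination `Graphics2D` (the screen or a back buffer in a real application).

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;
import java.util.Arrays;

// Hypothetical stand-in for the kind of application described above;
// this is not the attached PerfTest.
class SoftwareFrame {
    final BufferedImage image;
    final int[] pixels; // the image's heap storage, one int per pixel (0xRRGGBB)

    SoftwareFrame(int width, int height) {
        image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        // Direct access to the backing array is what makes per-pixel writes cheap.
        pixels = ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
    }

    // "Software-only rendering": write pixel values straight into the array.
    void fill(int rgb) {
        Arrays.fill(pixels, rgb);
    }

    // Copy the whole frame to a destination surface.
    void copyTo(Graphics2D destination) {
        destination.drawImage(image, 0, 0, null);
    }
}
```

Because the pixels change every frame, a cached copy of the image in video memory would be of little use, so each `copyTo` call has to move the pixels from the heap to the destination again.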

With the attached application:
#>java PerfTest 
using AWT
using BI->Screen

#>java PerfTest
using AWT
using BI->Screen

6u4: (BI->BS->screen)
#> java -Dusebs=true PerfTest
using AWT
using BS

6u10: (BI->BS->screen)
#>java -Dusebs=true PerfTest
using AWT
using BS

In both 6u4 and 6u10 the GDI pipeline is the fastest for this
particular application:

#> java -Dsun.java2d.noddraw=true  PerfTest
using AWT
using BI->Screen

#>java -Dsun.java2d.d3d=false PerfTest
using AWT
using BI->Screen

EVALUATION
I have tried a couple of other approaches.

1. Instead of loading the image into the texture and then drawing this texture to the destination, I tried to use IDirect3DDevice9::UpdateSurface. This is a method for uploading pixels into an unlockable surface created in the DEFAULT pool (VRAM). Unfortunately this approach didn't yield any benefit and is in fact slower (at least on my PCIX board).

2. When uploading images larger than 256x256 we tile the image by uploading it piece by piece into the texture and rendering it. The texture is a DYNAMIC texture (and thus resides in VRAM), which can be locked with the DISCARD flag (which allows the hardware not to stall every time the texture is locked). But we were only locking the texture with this flag if we were filling the whole texture. If the image is, say, 300x300, the 3 pieces which didn't fit into the 256x256 temporary texture were uploaded without the DISCARD flag. I've fixed that, but again, there was not much benefit.

Approach 2 is probably worth integrating anyway, and probably worth applying to the MaskBlit image upload code as well.
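
The tiling described in approach 2 can be sketched as follows. This is a minimal model, not the pipeline's native code: the class `BlitTiles` and its members are hypothetical, and only the 256x256 texture size comes from the evaluation above. It shows why a 300x300 image yields one full tile and three partial ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the tiling; names are illustrative only.
class BlitTiles {
    static final int TILE = 256; // blit texture edge, per the evaluation above

    // One upload: a source rectangle of at most TILE x TILE pixels.
    static final class Tile {
        final int x, y, w, h;

        Tile(int x, int y, int w, int h) {
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        // True when the upload covers the whole texture.
        boolean fillsTexture() {
            return w == TILE && h == TILE;
        }
    }

    // Split a width x height image into row-major tiles.
    static List<Tile> split(int width, int height) {
        List<Tile> tiles = new ArrayList<>();
        for (int y = 0; y < height; y += TILE) {
            for (int x = 0; x < width; x += TILE) {
                tiles.add(new Tile(x, y,
                        Math.min(TILE, width - x),
                        Math.min(TILE, height - y)));
            }
        }
        return tiles;
    }
}
```

For a 300x300 image, `split` returns tiles of 256x256, 44x256, 256x44 and 44x44. Only the first fills the texture, which matches the three pieces that were previously uploaded without the DISCARD flag.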

EVALUATION
Here are the results of an investigation (the same attached PerfTest was used); the results are in fps.

The code path to get pixels from a BI to the back-buffer or screen is as follows:
1. the pixels are copied to a texture
2. the texture is drawn to the back-buffer

This is because creating lockable render targets (like a back-buffer) is highly inadvisable, since locking stalls the GPU. If the source image is too large to fit in a texture it is tiled (steps 1-2 are repeated for each tile).

Step 1 consists of a call to D3DBL_CopyImageToIntXrgbSurface, which takes care of copying pixels to the blit texture. It locks the destination surface and calls an optimized software loop which copies the pixels from the source image to the texture, converting the format on the fly. If no conversion is needed it is just a memcpy (the specific method for this case is AnyIntIsomorphicCopy).

Given this information, here are the results of the investigation (the numbers are frames per second, for a 300x400 image on nvidia fx 7800):

  6u4 ddraw:                                                1060
  6u4 noddraw:                                              1823
  6u10 nod3d:                                               2262 (100%)
  6u10 d3d default:                                          881 (38%)
  1. 6u10 D3DBL_CopyImageToIntXrgbSurface no-oped:          1968 (87%)
  2. 6u10 AnyIntIsomorphicCopy no-oped:                     1300 (57%)
  3. 6u10 AnyIntIsomorphicCopy no-oped + DYNAMIC disabled:  1050

Case 1 means that the whole copy to the texture is no-oped; we just draw the texture to the destination. This gives us an approximation of what we could get if copying to the texture were free.

In case 2 the blit loop which copies the pixels to the texture is no-oped, so we just lock and unlock the surface. This gives us an approximation of how much time we spend actually copying the pixels.

Case 3 is case 2 with the use of DYNAMIC textures disabled as well. We already use DYNAMIC textures for this purpose to improve performance, so this just illustrates how much we gain by using them.

By default D3D is 62% slower than the no-D3D case in this benchmark.
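
The percentages quoted above can be reproduced from the fps figures. The sketch below assumes the report truncates to whole percent, which matches every figure given; the class `FpsRatios` is hypothetical and exists only to show the arithmetic.

```java
// Hypothetical helper that reproduces the percentages from the fps figures.
class FpsRatios {
    static final int BASELINE_FPS = 2262; // 6u10 with D3D disabled (the 100% case)

    // Percentage of the baseline, truncated to a whole number (an assumption
    // about how the report rounded, chosen because it matches the figures).
    static int percentOfBaseline(int fps) {
        return fps * 100 / BASELINE_FPS;
    }

    // How much slower than the baseline a configuration is, in whole percent.
    static int percentSlower(int fps) {
        return 100 - percentOfBaseline(fps);
    }
}
```

With these figures, `percentOfBaseline(881)` is 38 and `percentSlower(881)` is 62. The gap between case 1 (87%) and case 2 (57%) is the share of time attributable to the pixel copy itself, about 30 percentage points.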
Note that in most cases people will be comparing with the old pipeline, which had ddraw enabled, and the performance drop for them will only be around 10%. And as the dimensions of the image increase (depending on configuration) the performance difference decreases.

We spend most of the time getting the pixels into the texture. Just copying the pixels takes around 30% of the time (and this is without conversion). We could try to improve that, though it is not clear how, short of using SSE/MMX instructions, which is out of scope for this bug.

I did some experiments with an ideal case (when the scan stride of the source and destination are the same, as when copying a 256x256 image, which happens to be the same size as our blit texture). In this case we could use a single lock, memcpy(), unlock sequence to copy the pixels to the texture (instead of a memcpy per scan line). The overall improvement was around 8%, but this case is relatively rare.

Creating a blit texture the size of the source image seems prone to thrashing, and caused a significant slowdown in some cases (like bouncing between two images of different sizes). So there is no clear solution so far.
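
The difference between the per-scan-line copy and the single-copy ideal case can be modeled in Java as below. This is only a sketch of the idea, not the native AnyIntIsomorphicCopy loop: the class `ScanlineCopy` is hypothetical, strides are counted in pixels rather than bytes, and `System.arraycopy` stands in for memcpy.

```java
// Hypothetical model of an isomorphic (no format conversion) pixel copy.
class ScanlineCopy {
    // Copy a width x height block of pixels between two buffers whose rows
    // are srcStride and dstStride ints apart.
    static void copy(int[] src, int srcStride,
                     int[] dst, int dstStride,
                     int width, int height) {
        if (height <= 0 || width <= 0) {
            return;
        }
        if (srcStride == dstStride) {
            // Ideal case: rows line up, so one bulk copy is enough.
            // Any padding between rows is copied along with the pixels.
            System.arraycopy(src, 0, dst, 0, srcStride * (height - 1) + width);
            return;
        }
        // General case: one copy per scan line.
        for (int y = 0; y < height; y++) {
            System.arraycopy(src, y * srcStride, dst, y * dstStride, width);
        }
    }
}
```

The fast path only applies when both buffers happen to share a stride, which is why the evaluation calls the case relatively rare.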

EVALUATION
The BI->Screen case is the worst because of the on-screen accelerated rendering support added in 6u10 (we redirect on-screen rendering to an off-screen D3D surface, and then present it some time later). Since we don't have a trigger to synchronously flip the off-screen surface after a BI->Screen copy, it appears slower. However, even if Toolkit.sync() is added (which forces the flip right away) it is still slower.

Using BufferStrategy instead of rendering directly to the screen improves the situation somewhat. But in general we should try to improve the performance of our BI->D3D blit.

See this thread for more information: http://forums.java.net/jive/thread.jspa?threadID=35484&tstart=0
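
The two presentation styles compared above might look like this in application code. This is a hedged sketch using the standard AWT API: the class `FramePresenter` and its method names are hypothetical, and it is not taken from the attached PerfTest. Both methods return false when the canvas is not yet displayable.

```java
import java.awt.Canvas;
import java.awt.Graphics;
import java.awt.Toolkit;
import java.awt.image.BufferStrategy;
import java.awt.image.BufferedImage;

// Hypothetical helper contrasting direct rendering with BufferStrategy.
class FramePresenter {
    // BI->Screen: draw straight to the component, then ask the toolkit to sync.
    static boolean presentDirect(Canvas canvas, BufferedImage frame) {
        Graphics g = canvas.getGraphics();
        if (g == null) {
            return false; // component is not displayable yet
        }
        try {
            g.drawImage(frame, 0, 0, null);
        } finally {
            g.dispose();
        }
        Toolkit.getDefaultToolkit().sync();
        return true;
    }

    // BI->BS->Screen: draw into a back buffer and show it explicitly.
    static boolean presentBuffered(Canvas canvas, BufferedImage frame) {
        if (!canvas.isDisplayable()) {
            return false; // createBufferStrategy requires a displayable component
        }
        BufferStrategy strategy = canvas.getBufferStrategy();
        if (strategy == null) {
            canvas.createBufferStrategy(2); // double buffering
            strategy = canvas.getBufferStrategy();
        }
        do {
            do {
                Graphics g = strategy.getDrawGraphics();
                try {
                    g.drawImage(frame, 0, 0, null);
                } finally {
                    g.dispose();
                }
            } while (strategy.contentsRestored()); // redraw if the buffer was recreated
            strategy.show();
        } while (strategy.contentsLost()); // retry if the buffer was lost
        return true;
    }
}
```

With a BufferStrategy the application decides when the frame is shown, so it does not depend on the pipeline presenting the redirected on-screen surface at some later point.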