JDK-4481344 : Performance of small image copies to screen slower in 1.4 than 1.1 (win32)
  • Type: Bug
  • Component: client-libs
  • Sub-Component: 2d
  • Affected Version: 1.4.0
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: windows_nt
  • CPU: x86
  • Submitted: 2001-07-18
  • Updated: 2001-08-31
Description
This bug is being filed to take over the small-image part of bug 4276423.
That original bug was filed against all image copies to the screen in JDK 1.2.
Since then, we have implemented faster image copies through hardware accelerated
images (on win32) and have seen substantial improvements in these operations
(vs. jdk 1.1, 1.2, and 1.3).

However, small images (20x20 and smaller) still leave room for performance
improvement.  Here are some numbers, in pixels per second, from testing for
that other bug report (PIII-500 win98 system, Matrox G400, 32 bits per pixel):

jdk1.1:
	20x20		37,724,773
	100x100		36,661,107
	300x300		36,808,163

jdk1.2:
	20x20		3,631,375
	100x100		31,629,213
	300x300		43,824,489

jdk1.3:
	20x20		3,728,680
	100x100		33,635,275
	300x300		42,096,774

jdk1.4:
	20x20		5,750,953
	100x100		108,602,065
	300x300		165,263,578

We should determine the bottlenecks for small image performance and try
to eliminate them.
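
The figures above are pixels per second for repeated image copies. As a rough illustration of how such a number can be obtained, here is a minimal sketch. It is not the test used for this report: the class and method names are hypothetical, it uses modern Java APIs, and it copies into an offscreen BufferedImage rather than to the screen, so it runs without a display and does not exercise the win32 DirectDraw or GDI paths discussed here.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Rough pixels-per-second measurement for repeated image copies (hypothetical helper). */
final class CopyThroughput {

    /** Pixels copied per second for `copies` blits of a size x size image. */
    static double pixelsPerSecond(int size, long copies, long elapsedNanos) {
        double pixels = (double) size * size * copies;
        return pixels / (elapsedNanos / 1e9);
    }

    /** Times `copies` drawImage calls from one offscreen image into another. */
    static double measure(int size, int copies) {
        // Assumption: offscreen destination, so no screen or window is involved.
        BufferedImage src = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        BufferedImage dst = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            long start = System.nanoTime();
            for (int i = 0; i < copies; i++) {
                g.drawImage(src, 0, 0, null);
            }
            // Guard against a zero reading from a coarse timer.
            long elapsed = Math.max(1, System.nanoTime() - start);
            return pixelsPerSecond(size, copies, elapsed);
        } finally {
            g.dispose();
        }
    }

    public static void main(String[] args) {
        // Same three sizes as in the tables above; the copy count is arbitrary.
        for (int size : new int[] {20, 100, 300}) {
            System.out.printf("%dx%d\t%,.0f pps%n", size, size, measure(size, 20_000));
        }
    }
}
```

Because every size runs the same number of drawImage calls, the fixed cost of each call weighs far more heavily on the 20x20 result than on the 300x300 one.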

Comments
EVALUATION

I will include the Evaluation from bug 4276423, since most of that information is about the performance of small images:

I recently got the following results on my PIII-dual 866 NT4 system (video card ATI Rage Pro Turbo), at 32 bits per pixel:

jdk1.1:
	20x20		16,909,090 pps
	100x100		22,268,000 pps
	300x300		24,488,304 pps

jdk1.2:
	20x20		9,440,362 pps
	100x100		21,333,600 pps
	300x300		24,570,419 pps

jdk1.3:
	20x20		9,106,579 pps
	100x100		21,231,683 pps
	300x300		24,616,363 pps

jdk1.4 (my most recent build):
	20x20		7,495,593
	100x100		22,236,607
	300x300		23,695,652

And on my PIII-500 (single CPU) win98 system with a Matrox G400 running 32 bits per pixel:

jdk1.1:
	20x20		37,724,773
	100x100		36,661,107
	300x300		36,808,163

jdk1.2:
	20x20		3,631,375
	100x100		31,629,213
	300x300		43,824,489

jdk1.3:
	20x20		3,728,680
	100x100		33,635,275
	300x300		42,096,774

jdk1.4:
	20x20		5,750,953
	100x100		108,602,065
	300x300		165,263,578

From these results, it looks like:

- There are definitely differences between OS's and video cards, especially when we are comparing hardware-accelerated images and non-accelerated images.
- The overhead of the small (20x20) images appears to drag down the performance of 1.4 offscreen images to nearly the level of the 1.2/1.3 software-based images. In fact, on the older ATI video card, the hw-based images were even slower than the software-based images.
- NT performance of all images seems gated at some maximum amount. This might be a restriction on NT, or it could be a constraint of the older video card. More investigation would be necessary to figure it out. But all larger image sizes on all releases seem about the same.
- win98 shows the difference between jdk1.1 hw-based images (flying at about 36M pps) versus jdk1.2/1.3 sw-based images (limited to only about 3M on the smallest image).
- win98 on this fast video card shows the advantage of DirectDraw in the latest jdk1.4 builds; performance of jdk1.1 was gated at around 36M pps, but the performance of DirectDraw-based images appears much higher, at around 165M pps for the largest image size.

More work is necessary. We need to make sure that we eliminate any overhead that might be contributing to the lower scores in jdk1.4 for small image sizes. Profiling is necessary...

chet.haase@Eng 2001-04-24

I did a little more debugging/profiling and got the following information:

One of the key pieces of overhead in our Blt processing is due to the ddraw Clipper object. When I eliminate the Clipper (i.e., I don't attach it to the window or set the clipper on the primary), then I more than double the performance of the smallest (20x20) image copies. On my test system (PIII-866 dual processor, nVidia TNT2), this made the performance go from 11 M pixels per second to over 26 M pixels per second.

Of course, this is a bottleneck that we cannot do much about: drawing without a Clipper object requires that we do our own clipping to the window (not too hard) but it also means that we would be subject to Windows events that could cause rendering artifacts. For example, if our window was obstructed, we would do our Blts over any overlapping windows, regardless of which window was supposed to be on top (ddraw draws directly to the screen without regard for Window properties). And even if our window was on top at the time we issued the Blt call, this might not prevent some event (such as the user dragging a window) from overlapping the window at the time of the actual Blt operation (there is a delay between our issuing the call and that call actually being processed by the hardware). Actually, this situation might be handled for us through context switching mechanisms of the driver/hardware (hopefully the hardware would flush the graphics pipe before allowing the window system to move things around).
But there is still a small window of opportunity between our checking for obstruction and actually issuing the call.

Anyway, this got our performance up to 26 M pixels per second. But the jdk1.1 version is still at 44 M, nearly twice the performance of our non-clipped jdk1.4 version. I think this difference can be attributed to various overhead elements in our drawImage() processing. During a profiling run (using Compuware's TrueTime product), I found that we are spending significant amounts of time (on the order of one to five percent) in the following routines:

- ClipInfo (used to derive the actual src/dst values after clipping against sg.getCompBounds())
- Blit.getFromCache() (gets the cache entry for our Blit call)
- DrawImage.blitSurfaceData (spends a couple of percent just dealing with setting the CompositeType)
- AcceleratedOffScreenImage.getSourceSurfaceData (gets the accelerated surfaceData object for accelerated images)

There are various other methods and simple operations which end up taking over a percent of the runtime. Many of these functions are very simple (like the equals() comparison when retrieving the Blit from the cache), but when called over 60000 times (in this case), they add up to significant overhead.

The reason for performance loss due to overhead in this case is that the primitives in question are so small (20x20) that the more we do between issuing the call from the application and actually issuing the ddraw call, the more we suffer from each intermediate step. For the larger primitives, the amount of overhead is now insignificant compared to the time of the actual rendering, so we see the performance benefits of ddraw much more clearly.

(End of Evaluation text from 4276423)

More info from further analysis:

I wrote a native app that tested similar image copies using ddraw and GDI images. It varied the size of the images and performed three tests with each size: ddraw without a Clipper, ddraw with a Clipper, and GDI compatible bitmap.
These tests were chosen to represent jdk 1.4 (ddraw with a Clipper), jdk 1.4 maximum possible (ddraw without a Clipper - not necessarily something we can do, but a nice theoretical maximum to know about), and jdk 1.1 (they used compatible bitmaps for the offscreen images in that release).

The numbers I got were interesting. It turns out that GDI performs significantly better than ddraw w/o Clipper up to a size of about 32x32. It performs better than ddraw with a Clipper (i.e., a comparison of jdk1.1 and jdk1.4) up to a size of about 47x47. Beyond those sizes, GDI drops off significantly and ddraw achieves much better performance both with and without the Clipper. In fact, GDI hits a limit of about 65 million pixels per second, but ddraw is able to achieve over 200 million pixels per second with larger primitives.

This data tells me that our performance bottleneck may not be due to our overhead in getting to the ddraw call but might, instead, be due to simple GDI performance advantages for smaller primitives.

chet.haase@Eng 2001-07-18
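
One way to picture the crossover described above is a simple linear cost model: each copy costs a fixed per-call overhead plus a per-pixel cost, so a path with low overhead but a low peak rate wins on small images and loses on large ones. The sketch below is only an illustration of that reasoning, not something measured for this report. The class name is hypothetical, the per-call overheads in main() are made-up values, and only the peak rates (about 65M pps for GDI, 200M pps as a stand-in for "over 200M" for ddraw) come from the evaluation.

```java
/** Linear cost model: time per copy = fixed overhead + pixels / peak rate (hypothetical). */
final class BlitCostModel {
    final double overheadSeconds;     // fixed per-call cost
    final double peakPixelsPerSecond; // asymptotic rate for very large copies

    BlitCostModel(double overheadSeconds, double peakPixelsPerSecond) {
        this.overheadSeconds = overheadSeconds;
        this.peakPixelsPerSecond = peakPixelsPerSecond;
    }

    /** Effective pixels per second when each copy moves `pixels` pixels. */
    double throughput(double pixels) {
        return pixels / (overheadSeconds + pixels / peakPixelsPerSecond);
    }

    /**
     * Pixel count at which both models take the same time per copy,
     * or NaN if the two cost lines never cross at a positive size.
     */
    static double crossoverPixels(BlitCostModel a, BlitCostModel b) {
        double perPixelDiff = 1.0 / a.peakPixelsPerSecond - 1.0 / b.peakPixelsPerSecond;
        double overheadDiff = b.overheadSeconds - a.overheadSeconds;
        double n = overheadDiff / perPixelDiff;
        return (n > 0 && !Double.isInfinite(n)) ? n : Double.NaN;
    }

    public static void main(String[] args) {
        // Overheads are illustrative assumptions; peak rates are taken from the evaluation text.
        BlitCostModel gdi = new BlitCostModel(5e-6, 65e6);
        BlitCostModel ddraw = new BlitCostModel(30e-6, 200e6);
        double n = crossoverPixels(gdi, ddraw);
        System.out.printf("crossover at about %.0f pixels (side %.1f)%n", n, Math.sqrt(n));
        for (int side : new int[] {20, 100, 300}) {
            double px = (double) side * side;
            System.out.printf("%dx%d\tgdi %,.0f pps\tddraw %,.0f pps%n",
                    side, side, gdi.throughput(px), ddraw.throughput(px));
        }
    }
}
```

Under this model, lowering the fixed overhead of the ddraw path moves the crossover toward smaller sizes but cannot remove it unless that overhead drops below the GDI one.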
18-07-2001
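
The evaluation mentions that drawing without a Clipper would require clipping to the window ourselves. A minimal sketch of that step, using java.awt.Rectangle, might look like the following. The class and method names are hypothetical, and this covers only the rectangle arithmetic: it does nothing about the harder problem described above, where another window overlaps ours between the obstruction check and the actual Blt.

```java
import java.awt.Rectangle;

/** Manual clipping of a blit against a window rectangle (hypothetical helper). */
final class ManualClip {

    /**
     * Clips a srcW x srcH copy placed at (dstX, dstY) against `window`.
     * Returns {sourceRect, destRect}, or null if nothing is visible.
     * The source rect is relative to the image, the dest rect uses the
     * same coordinate space as `window`.
     */
    static Rectangle[] clipBlit(int srcW, int srcH, int dstX, int dstY, Rectangle window) {
        Rectangle dst = new Rectangle(dstX, dstY, srcW, srcH).intersection(window);
        if (dst.isEmpty()) {
            return null; // fully outside the window, skip the copy
        }
        // Shift the visible part back into image coordinates.
        Rectangle src = new Rectangle(dst.x - dstX, dst.y - dstY, dst.width, dst.height);
        return new Rectangle[] {src, dst};
    }

    public static void main(String[] args) {
        Rectangle window = new Rectangle(0, 0, 100, 100);
        Rectangle[] r = clipBlit(20, 20, -5, -5, window);
        System.out.println("src=" + r[0] + " dst=" + r[1]);
    }
}
```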