JDK-6355402 : Java2D Font implementation should improve MT scaleability of getting outline (visual) bounds.
  • Type: Bug
  • Component: client-libs
  • Sub-Component: 2d
  • Affected Version: 1.4.2
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2005-11-25
  • Updated: 2010-04-02
  • Resolved: 2006-01-09
Fixed-in releases:
  • JDK 6: b67 (Fixed)
  • Other: 5.0u7 (Fixed)
Description
FULL PRODUCT VERSION :


A DESCRIPTION OF THE PROBLEM :
We have a problem with heavy report rendering in a server environment (8+ threads, 4 CPUs). We don't get 100% CPU usage; threads are blocked about 70% of the time.

We had this problem in 1.4.2 because of bug 4641861. I was expecting that upgrading to 1.5.0 would resolve the issue. But it didn't.

In 1.4.2, threads were blocked on NativeFontWrapper's synchronized static native methods (which no longer exist in 1.5). Now they are waiting in FileFontStrike.getGlyphOutlineBounds for access to the boundsMap field.

I think using a ReadWriteLock for access to the boundsMap (instead of simply synchronizing on it) would greatly improve scalability. Currently, threads that are only reading the map block each other, which is unnecessary.
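As a hedged illustration of this suggestion (the class and method names below are hypothetical, not the actual FileFontStrike internals), a per-strike bounds cache guarded by a ReadWriteLock might look roughly like this:

import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the submitter's suggestion: guard the per-strike
// bounds cache with a ReadWriteLock so concurrent readers do not block each other.
class GlyphBoundsCache {
    private final Map<Integer, Rectangle2D.Float> boundsMap =
            new HashMap<Integer, Rectangle2D.Float>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    Rectangle2D.Float getGlyphOutlineBounds(int glyphCode) {
        lock.readLock().lock();
        try {
            Rectangle2D.Float cached = boundsMap.get(glyphCode);
            if (cached != null) {
                return cached;
            }
        } finally {
            lock.readLock().unlock();
        }
        // Cache miss: compute outside any lock, then publish under the write lock.
        Rectangle2D.Float computed = computeOutlineBounds(glyphCode);
        lock.writeLock().lock();
        try {
            boundsMap.put(glyphCode, computed);
        } finally {
            lock.writeLock().unlock();
        }
        return computed;
    }

    private Rectangle2D.Float computeOutlineBounds(int glyphCode) {
        // Placeholder for the expensive native outline computation.
        return new Rectangle2D.Float();
    }
}

Note that even with a ReadWriteLock, every reader still pays to acquire the read lock; the evaluation below found that removing locking from the read path entirely scales far better.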



STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
You'll need an SMP machine (genuinely multi-CPU, not just hyper-threaded). Run at least twice as many threads as there are CPUs. Each thread should use something like TextLayout to calculate the bounds of a string rendered in a font.
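A minimal sketch of such a reproduction, assuming a Dialog font and an arbitrary test string (neither is from the original report), might be:

import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.TextLayout;

// Illustrative reproduction: each thread repeatedly measures the visual
// (outline) bounds of a string via TextLayout; on a genuine SMP machine,
// CPU usage should approach 100% if the measurement path scales.
public class BoundsContention {
    public static void main(String[] args) throws InterruptedException {
        final Font font = new Font("Dialog", Font.PLAIN, 12);
        final FontRenderContext frc = new FontRenderContext(null, true, true);
        int nThreads = 2 * Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int c = 0; c < 100000; c++) {
                        TextLayout layout =
                            new TextLayout("The quick brown fox", font, frc);
                        layout.getBounds(); // visual (outline) bounds
                    }
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < nThreads; i++) {
            threads[i].join();
        }
    }
}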

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
100% CPU usage
ACTUAL -
~30%-50% CPU usage

REPRODUCIBILITY :
This bug can be reproduced always.

Comments
EVALUATION Worth noting here, I think, that the fix simply changed from a HashMap with synchronised accesses (as HashMap requires) to java.util.concurrent.ConcurrentHashMap, which has no such synchronisation requirement (a sketch of this approach follows this comment).
07-12-2005
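A hedged sketch of that kind of change, using hypothetical class and method names rather than the real FileFontStrike code, is simply to let ConcurrentHashMap handle the concurrency so readers never block each other:

import java.awt.geom.Rectangle2D;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the approach described above: replace a synchronised HashMap
// with a ConcurrentHashMap so readers never block each other.
class ConcurrentGlyphBoundsCache {
    private final ConcurrentMap<Integer, Rectangle2D.Float> boundsMap =
            new ConcurrentHashMap<Integer, Rectangle2D.Float>();

    Rectangle2D.Float getGlyphOutlineBounds(int glyphCode) {
        Rectangle2D.Float bounds = boundsMap.get(glyphCode); // no external locking
        if (bounds == null) {
            bounds = computeOutlineBounds(glyphCode);
            // Racing threads may compute the same value; putIfAbsent ensures
            // only one value is ever published for a given glyph.
            Rectangle2D.Float prior = boundsMap.putIfAbsent(glyphCode, bounds);
            if (prior != null) {
                bounds = prior;
            }
        }
        return bounds;
    }

    private Rectangle2D.Float computeOutlineBounds(int glyphCode) {
        // Placeholder for the expensive native outline computation.
        return new Rectangle2D.Float();
    }
}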

EVALUATION The outline bounds is a relatively rare case, since most text measurement uses the logical bounds, and that scales extremely well in 1.5. The cache used for outline bounds is per strike, so that does help in some cases where the threads are using different strikes. The main reason there's still synchronisation in this case is that it was not considered a case that merited the extra per-strike storage that would avoid it. So instead a simple cache was added, and it's worth noting that the benefits of ANY cache here are enormous: one way to reach 100% CPU utilisation is to remove that cache :-v (every thread would then spend its time redundantly recomputing bounds). Of course the storage only needs to be allocated once it's needed, so maybe that wasn't the best call, but at the time we lacked evidence this was an issue for any real application.

So this sounds to be specifically an issue for clients (of the server variety) that need the visual (outline) bounds for a single strike from many threads, particularly where there are CPU resources to service those threads concurrently. The submitter did not provide a test case, so of course there can be no guarantee we can accurately characterise his problem; there's some guesswork involved.

The submitter mentioned TextLayout, but it's probably better to look at a GlyphVector, as that is a more lightweight object and so we can get closer to profiling the effects of this synchronisation point as a more dominant factor. There's also the question of whether to use the same instance or a new instance for every call, since if the instance caches the result so that it doesn't need to refer back to the strike, that isn't telling us anything about the synchronisation costs. I think TextLayout does this, and as of mustang b03 so does GlyphVector, since the fix for 5074057 (glyph visual bounds incorrect when frc is rotated) changed the implementation of getVisualBounds() so that it unions the bounds of the individual glyphs and accesses a cache of these. That's a great improvement, but only useful if reusing the same instance, which seems unlikely in the case of the submitter. So using a new GlyphVector for each measurement call is what will be tested (a sketch of such a test appears after this comment). The resulting test should emphasise scalability as more CPUs are added; the actual time per CPU, whilst important, isn't the focus here.

So what benefit can we show from more MT-friendly caching? Whilst a ReadWriteLock may perform a little better, the real benefit comes from avoiding any kind of locking! Here's a test with the ReadWriteLock suggested by the submitter, run on a 24-way E6500 running Solaris 8 (note: the results can depend on the machine architecture as well as the actual number of CPUs/cores):

1 threads took 1195ms.
2 threads took 1753ms.
3 threads took 2243ms.
4 threads took 2651ms.
5 threads took 3295ms.
6 threads took 3925ms.
7 threads took 4171ms.
8 threads took 4613ms.
9 threads took 5662ms.
10 threads took 6030ms.
11 threads took 6653ms.
12 threads took 7013ms.
13 threads took 7789ms.
14 threads took 8191ms.
15 threads took 8928ms.
16 threads took 9593ms.
17 threads took 10338ms.
18 threads took 10430ms.
19 threads took 11530ms.
20 threads took 12102ms.
21 threads took 12541ms.
22 threads took 13728ms.
23 threads took 13870ms.
24 threads took 14206ms.
25 threads took 14948ms.
26 threads took 15397ms.

It performed some small percentage better than the existing code, but it was not scalable. However, here's the result of avoiding synchronisation completely:

1 threads took 880ms.
2 threads took 906ms.
3 threads took 840ms.
4 threads took 1011ms.
5 threads took 1084ms.
6 threads took 1156ms.
7 threads took 988ms.
8 threads took 957ms.
9 threads took 3787ms.
10 threads took 895ms.
11 threads took 885ms.
12 threads took 877ms.
13 threads took 877ms.
14 threads took 879ms.
15 threads took 872ms.
16 threads took 895ms.
17 threads took 915ms.
18 threads took 905ms.
19 threads took 933ms.
20 threads took 934ms.
21 threads took 969ms.
22 threads took 966ms.
23 threads took 969ms.
24 threads took 1109ms.
25 threads took 1221ms.
26 threads took 1176ms.

There's one aberrant number in there, but you can see that right up to #CPUs-1 the latter scales completely, whereas the ReadWriteLock scales minimally. I expect we could also backport this simple fix to 1.5.0_07.

One other point to note: on Solaris at least, the "server" VM performs clearly better than the "client" VM in this test, i.e. use the "-server" option, which you get by default anyway in 1.5 if you have 2 CPUs and 2GB RAM.
30-11-2005
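For reference, here is a hedged sketch of the kind of scalability test described in the evaluation above: a new GlyphVector per measurement call, with the thread count varied. The font, string, and iteration count are illustrative, not taken from the actual test harness.

import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

// Illustrative scalability test: for each thread count, every thread creates
// a fresh GlyphVector per call and asks for its visual (outline) bounds, so
// each measurement goes back to the strike rather than a per-instance cache.
public class OutlineBoundsScaling {
    private static final int CALLS_PER_THREAD = 50000;

    public static void main(String[] args) throws InterruptedException {
        final Font font = new Font("Dialog", Font.PLAIN, 12);
        final FontRenderContext frc = new FontRenderContext(null, true, true);
        int maxThreads = 2 * Runtime.getRuntime().availableProcessors();
        for (int n = 1; n <= maxThreads; n++) {
            long start = System.currentTimeMillis();
            Thread[] threads = new Thread[n];
            for (int i = 0; i < n; i++) {
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        for (int c = 0; c < CALLS_PER_THREAD; c++) {
                            GlyphVector gv =
                                font.createGlyphVector(frc, "The quick brown fox");
                            gv.getVisualBounds();
                        }
                    }
                });
                threads[i].start();
            }
            for (int i = 0; i < n; i++) {
                threads[i].join();
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(n + " threads took " + elapsed + "ms.");
        }
    }
}

If the measurement path scales, the reported times should stay roughly flat as the thread count rises toward the number of CPUs, as in the second set of results above.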