United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: JDK-6868503 RuleBasedBreakIterator is inefficient
JDK-6868503 : RuleBasedBreakIterator is inefficient

Details
Type:
Bug
Submit Date:
2009-08-04
Status:
Closed
Updated Date:
2010-07-29
Project Name:
JDK
Resolved Date:
2010-01-13
Component:
core-libs
OS:
generic
Sub-Component:
java.text
CPU:
generic
Priority:
P2
Resolution:
Fixed
Affected Versions:
6
Fixed Versions:
6u18 (b02)

Related Reports
Backport:

Sub Tasks

Description
LineBreakMeasurer essentially uses following logic to split text into the lines:
   a) find what is the last char that fits into the line based on the metrics
   b) find last work break before this char

For step b BreakIterator is used (RuleBasedBreakIterator in my case) and i observe that preceeding() and following() methods are called very often for attached benchmark.

The reason is RuleBasedBreakIterator looks for previous work break by scanning forward from last known 
"\n" occurence. For long text strings without explicit line breaks and which has to be broken into short lines this is very inefficient. Performance improves dramatically if we can cache and use word break found on previous iteration. See suggested change in the attachment (you may want to refine it, e.g. reset cache if text was replaced, etc.)

With this fix benchmark time drops from 500ms to 20ms on my PC. 
It also saves about 30% time of LineBreakMeasurer time (for original FX testcase).

Similar problems may be applicable to other iterator implementations (and may show up in different locales).

                                    

Comments
EVALUATION

I added a fix to RuleBaseBreakIterator based on the suggested fix.
                                     
2009-08-19



Hardware and Software, Engineered to Work Together