JDK-6868503 : RuleBasedBreakIterator is inefficient
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.text
  • Affected Version: 6
  • Priority: P2
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2009-08-04
  • Updated: 2013-11-01
  • Resolved: 2010-01-13
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
JDK 6 JDK 7
6u18 b02Fixed 7Fixed
Description
LineBreakMeasurer essentially uses following logic to split text into the lines:
   a) find what is the last char that fits into the line based on the metrics
   b) find last work break before this char

For step b BreakIterator is used (RuleBasedBreakIterator in my case) and i observe that preceeding() and following() methods are called very often for attached benchmark.

The reason is RuleBasedBreakIterator looks for previous work break by scanning forward from last known 
"\n" occurence. For long text strings without explicit line breaks and which has to be broken into short lines this is very inefficient. Performance improves dramatically if we can cache and use word break found on previous iteration. See suggested change in the attachment (you may want to refine it, e.g. reset cache if text was replaced, etc.)

With this fix benchmark time drops from 500ms to 20ms on my PC. 
It also saves about 30% time of LineBreakMeasurer time (for original FX testcase).

Similar problems may be applicable to other iterator implementations (and may show up in different locales).

Comments
EVALUATION I added a fix to RuleBaseBreakIterator based on the suggested fix.
19-08-2009