JDK-8174266 : Text segmentation revisit
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.text
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: generic
  • CPU: generic
  • Submitted: 2017-02-09
  • Updated: 2024-05-07
Other: tbd (Unresolved)
Related Reports
Relates :  
Relates :  
Description
The JDK's text segmentation API, BreakIterator, has its own text segmentation rules (sun.text.resources.BreakIteratorRules.java), which have not been updated. Meanwhile, Unicode's UAX #14 and UAX #29 have been updated constantly. The JDK's implementation should catch up with them.
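
To observe how a given JDK build currently segments text at character boundaries, a minimal sketch along these lines may help. The class name CharacterBreakDemo and the helper segments are hypothetical names chosen for illustration. The sketch uses only the public java.text.BreakIterator API, and the number of segments reported for the emoji sequence is expected to depend on which rules the running JDK ships, so no particular output is claimed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CharacterBreakDemo {

    /** Splits text at the character boundaries reported by BreakIterator. */
    public static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns BreakIterator.DONE once the end of the text is passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // MAN + ZWJ + WOMAN + ZWJ + GIRL: a single extended grapheme cluster
        // under current UAX #29 rules (GB11).
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";
        // The count printed here varies with the JDK's segmentation rules.
        System.out.println("segments: " + segments(family).size());
    }
}
```

Comparing the printed count across JDK builds gives a rough indication of whether the ZWJ and emoji sequence rules are honored.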

Comments
A recent discussion with the client team revealed they certainly support this feature.
07-05-2024

For character boundaries, the parts of UAX #29 (grapheme clusters) that are possibly missing are:
- ZWJ support (GB9)
- Extended support (GB9a/GB9b)
- Emoji modifier/ZWJ/flag sequence support (GB11/GB12/GB13)
Supporting these would possibly require replacing the current RuleBasedBreakIterator rules with something based on Unicode's text segmentation data, i.e.,
http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt
http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt
21-08-2018
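
As a rough illustration of what "based on Unicode's text segmentation data" could involve, the sketch below parses lines in the format used by the UCD property files named in the comment above (a code point or range, a semicolon, a property value, then an optional # comment). The class BreakPropertyParser, its nested Range type, and the methods parse and lookup are hypothetical names, not part of the JDK, and the sample lines are a small excerpt in the file's format rather than the full data. This is only a sketch under those assumptions, not a proposal for the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BreakPropertyParser {

    /** An inclusive code point range and its break property value. */
    public static final class Range {
        public final int start;
        public final int end;
        public final String property;

        Range(int start, int end, String property) {
            this.start = start;
            this.end = end;
            this.property = property;
        }
    }

    /** Parses lines such as "1F1E6..1F1FF  ; Regional_Indicator # So ...". */
    public static List<Range> parse(List<String> lines) {
        List<Range> ranges = new ArrayList<>();
        for (String line : lines) {
            // Everything after '#' is a comment.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2) {
                continue; // not a data line in the expected format
            }
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new Range(start, end, fields[1].trim()));
        }
        return ranges;
    }

    /** Returns the property for a code point; unlisted code points default to "Other". */
    public static String lookup(List<Range> ranges, int codePoint) {
        for (Range r : ranges) {
            if (codePoint >= r.start && codePoint <= r.end) {
                return r.property;
            }
        }
        return "Other";
    }

    public static void main(String[] args) {
        List<Range> ranges = parse(Arrays.asList(
                "# sample excerpt in the GraphemeBreakProperty.txt format",
                "000D          ; CR # Cc       <control-000D>",
                "200D          ; ZWJ # Cf       ZERO WIDTH JOINER",
                "1F1E6..1F1FF  ; Regional_Indicator # So  [26] REGIONAL INDICATOR SYMBOL LETTER A..Z"));
        System.out.println(lookup(ranges, 0x200D));
        System.out.println(lookup(ranges, 0x1F1E7));
        System.out.println(lookup(ranges, 'a'));
    }
}
```

A linear scan keeps the sketch short; a real implementation would more likely use a sorted array with binary search or a trie for lookups.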