JDK-8292387 : Grapheme support in BreakIterator
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.text
  • Priority: P4
  • Status: Closed
  • Resolution: Approved
  • Fix Versions: 20
  • Submitted: 2022-08-15
  • Updated: 2022-09-19
  • Resolved: 2022-08-31
Description
Summary
-------

Enhance the existing `java.text.BreakIterator#getCharacterInstance()` to support grapheme clusters.

Problem
-------

`BreakIterator` was designed before the Unicode Consortium introduced the concept of [`Grapheme Clusters`][1]. The class has provided the `getCharacterInstance()` method for breaking text into "characters" (from the user's perspective), but it cannot handle the breaks defined in the grapheme cluster specification.

Solution
--------

Enhance `getCharacterInstance()` to support grapheme clusters. This introduces an intentional behavioral change, because the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, consider a String that contains the US flag followed by the emoji for a four-member family:
```
"🇺🇸👨‍👩‍👧‍👦"
```
This String will be broken into two graphemes with the new implementation:
```
"🇺🇸", "👨‍👩‍👧‍👦"
```
whereas the old implementation simply breaks at the code point boundaries:
```
"🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"
```
where `(zwj)` denotes ZERO WIDTH JOINER (U+200D).
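
The difference can be observed with a small program. The following is a minimal sketch, not part of the CSR itself: the class name `GraphemeDemo` and the helpers `sample()` and `segments()` are hypothetical names chosen for illustration. It builds the example String from its code points (avoiding source-encoding issues), walks the boundaries reported by `BreakIterator.getCharacterInstance()`, and prints each segment as a list of code points. According to the description above, a JDK with this change should report two segments, while an older JDK should report more.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeDemo {
    // Builds the example text: two regional indicators (US flag) followed by
    // man, ZWJ, woman, ZWJ, girl, ZWJ, boy (four-member family).
    static String sample() {
        int[] cps = {
            0x1F1FA, 0x1F1F8,
            0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466
        };
        return new String(cps, 0, cps.length);
    }

    // Splits the text at the boundaries reported by the character instance.
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = segments(sample());
        // The count depends on the JDK version in use.
        System.out.println("segments: " + parts.size());
        for (String part : parts) {
            StringBuilder sb = new StringBuilder();
            part.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
            System.out.println(sb.toString().trim());
        }
    }
}
```

Printing code points rather than the emoji themselves keeps the output readable on terminals that cannot render them.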

Specification
-------------

Insert the following `@implSpec` after the character boundary analysis paragraph in the class description of the `BreakIterator` class:

    + * @implSpec The default implementation of the character boundary analysis
    + * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
    + * For more detail, refer to
    + * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
    + * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.

  [1]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
Comments
Supporting supplementary characters in BreakIterator (https://bugs.openjdk.org/browse/JDK-4900739) is the predecessor to this CSR. That change altered the character analysis behavior of BreakIterator by enhancing the `BreakIterator.getCharacterInstance()` method to support supplementary characters. Since there was no pushback on that enhancement, I would expect the same for this CSR.
31-08-2022

The BreakIterator spec is, in a way, a snapshot of ICU4J at the time of JDK 1.1. ICU4J's spec has evolved along with the Unicode spec, and its character break now conforms to grapheme cluster breaks (https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html). I think it would be more natural to enhance the existing method than to introduce a new one. The JDK's BreakIterator spec intentionally leaves what users perceive as a character vaguely defined, anticipating future enhancements, and I think this enhancement falls into that category.
30-08-2022

Moving to Provisional for JDK 20, not Approved. It is difficult for me to gauge how impactful this behavioral change would be. An approach with less behavioral impact on existing code would be to introduce a new "default" method on BreakIterator, e.g. getGraphemeInstance, that has the new behavior.
30-08-2022