Bug ID: JDK-8292387 Grapheme support in BreakIterator

Type: CSR
Component: core-libs
Sub-Component: java.text

Priority: P4
Status: Closed
Resolution: Approved
Fix Versions: 20

Submitted: 2022-08-15
Updated: 2022-09-19
Resolved: 2022-08-31

Summary
-------

Enhance the existing `java.text.BreakIterator#getCharacterInstance()` to support Grapheme Clusters.

Problem
-------

`BreakIterator` was designed before the Unicode Consortium introduced the concept of [`Grapheme Clusters`][1]. The class provides the `getCharacterInstance()` method for breaking text into "characters" (from the user's perspective), but it cannot handle the breaks defined in the Grapheme specification.

Solution
--------

Enhance `getCharacterInstance()` to support Grapheme Clusters. This introduces an intentional behavioral change, because the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, consider a String that contains the US flag followed by a grapheme for a four-member family:
```
"🇺🇸👨‍👩‍👧‍👦"
```
This String will be broken into two graphemes with the new implementation:
```
"🇺🇸", "👨‍👩‍👧‍👦"
```
whereas the old implementation simply breaks at the code point boundaries:
```
"🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"
```
where `(zwj)` denotes ZERO WIDTH JOINER (U+200D).
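
To observe the difference on a given runtime, the segmentation can be printed with a small program. The following is a minimal sketch, not part of the CSR: the class name `GraphemeDemo` and the helper `segments` are hypothetical, and the sample string is built from code points so that the source file does not depend on its encoding. According to the description above, a JDK that includes this change should report two segments, while earlier releases should report more.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only java.text.BreakIterator is the API under discussion.
public class GraphemeDemo {

    // U+1F1FA U+1F1F8 (regional indicators U and S), then
    // MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY (ZWJ is U+200D).
    static final int[] CODE_POINTS = {
        0x1F1FA, 0x1F1F8,
        0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466
    };
    static final String SAMPLE = new String(CODE_POINTS, 0, CODE_POINTS.length);

    // Splits text at the boundaries reported by getCharacterInstance().
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = segments(SAMPLE);
        // The count depends on the runtime: the CSR text implies 2 with
        // grapheme support and one segment per code point without it.
        System.out.println("segments: " + parts.size());
        for (String part : parts) {
            System.out.println(part.codePointCount(0, part.length()) + " code point(s)");
        }
    }
}
```

Because the result depends on the JDK version, code that stores or compares boundary offsets across releases may need to be re-validated after upgrading.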

Specification
-------------

Insert the following `@implSpec` after the character boundary analysis paragraph in the class description of the `BreakIterator` class:

    + * @implSpec The default implementation of the character boundary analysis
    + * conforms to the Unicode Consortium's Extended Grapheme Cluster breaks.
    + * For more detail, refer to
    + * <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">
    + * Grapheme Cluster Boundaries</a> section in the Unicode Standard Annex #29.

  [1]: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Supporting supplementary characters in BreakIterator (https://bugs.openjdk.org/browse/JDK-4900739) is the predecessor to this CSR. That enhancement changed the character analysis behavior of BreakIterator to support supplementary characters by enhancing the `BreakIterator.getCharacterInstance()` method. Since there wasn't any pushback for that enhancement, I would expect the same with this CSR case.

31-08-2022

The BreakIterator spec is, in a way, a snapshot of ICU4J at the time of JDK 1.1. Their spec has evolved along with the Unicode spec, and their character break conforms to grapheme cluster breaks: https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/BreakIterator.html I think it would be natural to enhance the existing method rather than introduce a new one. The JDK's break iterator spec intentionally leaves vague what users perceive as a character, anticipating future enhancements, and I think this enhancement falls into that category.

30-08-2022

Moving to Provisional for JDK 20, not Approved. It is difficult for me to gauge how impactful this behavioral change would be. An approach with less of a behavioral impact on existing code would be to introduce a new "default" method on BreakIterator, e.g. `getGraphemeInstance`, that had the new behavior.

30-08-2022

CSR :	JDK-8291660 - Grapheme support in BreakIterator
Relates :	JDK-8294008 - Grapheme implementation of setText() throws IndexOutOfBoundsException