Bug ID: JDK-8292992 Release Note: Grapheme support in BreakIterator

Type: Sub-task
Component: core-libs
Sub-Component: java.text

Priority: P4
Status: Resolved
Resolution: Delivered
OS: generic
CPU: generic

Submitted: 2022-08-26
Updated: 2022-09-09
Resolved: 2022-09-09

Versions (Unresolved/Resolved/Fixed)

JDK 20
20 Resolved

Character boundary analysis in `java.text.BreakIterator` now conforms to the Extended Grapheme Cluster boundaries defined in <a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">Unicode Standard Annex #29</a>. This is an intentional behavioral change: the old implementation simply breaks at code point boundaries for the vast majority of characters. For example, the following String contains the US flag and a grapheme for a four-member family.
```
"🇺🇸👨‍👩‍👧‍👦"
```
This String will be broken into two graphemes with the new implementation:
```
"🇺🇸", "👨‍👩‍👧‍👦"
```
whereas the old implementation simply breaks at the code point boundaries:
```
"🇺", "🇸", "👨", "(zwj)", "👩", "(zwj)", "👧", "(zwj)", "👦"
```
where (zwj) denotes ZERO WIDTH JOINER (U+200D).
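
To observe the difference on a given runtime, the boundaries can be walked with the character instance of `BreakIterator`. The following is a minimal sketch, not part of the original note: the class name `GraphemeDemo` and the helper `graphemes` are hypothetical, and the example string is written with Unicode escapes only so that it does not depend on source file encoding.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class GraphemeDemo {

    // Hypothetical helper: splits text at the boundaries reported by
    // BreakIterator.getCharacterInstance() for the default locale.
    static List<String> graphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // REGIONAL INDICATOR SYMBOL LETTER U (U+1F1FA) + LETTER S (U+1F1F8)
        String flag = "\uD83C\uDDFA\uD83C\uDDF8";
        // MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

        List<String> pieces = graphemes(flag + family);
        // The count depends on the runtime: per the note above, JDK 20 or
        // later should report two pieces, earlier releases more.
        System.out.println(pieces.size() + " pieces: " + pieces);
    }
}
```

Code that needs the old per-code-point behavior regardless of JDK version can iterate with `String.codePoints()` instead of relying on `BreakIterator`.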