JDK-8291660 : Grapheme support in BreakIterator
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.text
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2022-08-01
  • Updated: 2024-05-07
  • Resolved: 2022-09-09
JDK 20 : 20 b15 (Fixed)
Related Reports
CSR :  
Relates :  
Relates :  
Relates :  
Relates :  
Sub Tasks
JDK-8292992 :  
Description
In order to stream extended grapheme clusters in a String, users have to create a String array first, then stream it, such as:
```
Arrays.stream("🇺🇸👨‍👩‍👧‍👦".split("\\b{g}"))
```
Instead of relying on this rather obscure idiom, it would be appropriate to enhance the existing `BreakIterator.getCharacterInstance()` method so that it supports extended grapheme cluster breaks.
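For illustration, a minimal sketch of traversing a string with `BreakIterator.getCharacterInstance()`. The class and method names (`GraphemeTraversal`, `segments`) are hypothetical and not part of the JDK. The sketch assumes JDK 20 or later, where this issue is marked fixed, for each segment to be an extended grapheme cluster; earlier releases may split emoji sequences into smaller pieces.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class GraphemeTraversal {

    // Splits the text at the boundaries reported by the character BreakIterator.
    static List<String> segments(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // US flag followed by a family emoji joined with ZWJ (U+200D),
        // written as escapes so the source does not depend on file encoding.
        String text = "\uD83C\uDDFA\uD83C\uDDF8"
                + "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
        // The number of segments depends on the JDK release in use.
        System.out.println(segments(text).size());
    }
}
```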
Comments
Changeset: b8598b02 Author: Naoto Sato <naoto@openjdk.org> Date: 2022-09-09 17:13:51 +0000 URL: https://git.openjdk.org/jdk/commit/b8598b02979dff8a947a523a6d76768a1bfe594b
09-09-2022

A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/9991 Date: 2022-08-23 22:44:13 +0000
23-08-2022

Rather than introducing a new method, enhancing the existing `getCharacterInstance()` better serves the original intention of that method.
19-08-2022

I am now inclined *not* to introduce a method to stream graphemes, but to add `BreakIterator.getGraphemeInstance()` for traversing graphemes. The reasoning behind it is:
- "Grapheme" is not as common as other units, such as code points / chars, so putting it in the Character or String class may not be necessary.
- BreakIterator is conceptually the right place for iterating graphemes over a text, parallel to character/word/sentence breaks.
- I cannot come up with an actual use case for streaming graphemes as opposed to traversing text.
There are alternatives using RegEx with the `\b{g}` and/or `\X` constructs. Obscure, but usable to stream graphemes.
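The two RegEx alternatives mentioned above can be sketched as follows. The class and method names (`RegexGraphemes`, `bySplit`, `byMatch`) are hypothetical; the sketch assumes Java 9 or later, where `\b{g}`, `\X` and `Matcher.results()` are available. How emoji sequences are clustered may vary with the Unicode data of the JDK release.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class RegexGraphemes {
    private static final Pattern BOUNDARY = Pattern.compile("\\b{g}");
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Splits at grapheme cluster boundaries.
    static List<String> bySplit(String text) {
        return BOUNDARY.splitAsStream(text).collect(Collectors.toList());
    }

    // Collects each match of the extended grapheme cluster construct.
    static List<String> byMatch(String text) {
        return CLUSTER.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT, then 'a'
        String text = "e\u0301a";
        System.out.println(byMatch(text).size());
        System.out.println(bySplit(text).size());
    }
}
```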
05-08-2022

PoC (for String class):
```
/**
 * {@return {@code Stream} of graphemes}
 */
public Stream<String> graphemes() {
    return Pattern.compile("\\b{g}").splitAsStream(this);
}

/**
 * {@return the grapheme at the specified index}
 *
 * @param index The index to query
 */
public String graphemeAt(int index) {
    Objects.checkIndex(index, length());
    return Pattern.compile("\\X").matcher(this).results()
        .filter(r -> r.start() <= index && r.end() > index)
        .findFirst()
        .map(MatchResult::group)
        .orElseThrow();
}
```
02-08-2022

graphemeAt/Before do not seem feasible, as they require the previous break point, i.e., they are stateful, whereas supplementary characters are stateless. Yet another possibility is to add `Stream<CharSequence> graphemes()` to the `CharSequence` interface. -> RegEx depends on String, which is an implementation of CharSequence, thus not feasible.
02-08-2022

As to the first method, the boundary cannot be decided from two code points alone. For example, two flag emojis "🇦🇺🇸🇦" (AU/SA) should break between 'U' and 'S', while "🇺🇸" shouldn't (duh!). graphemeAt/Before/After may be useful as fundamental building-block methods. Another possibility is to add a grapheme equivalent in `java.text.BreakIterator`.
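To make the flag example concrete, here is a minimal sketch of why context is needed. It covers only the regional indicator pairing rule from Unicode UAX #29 (no break inside a pair, counting pairs from the start of the run) and is not a full grapheme segmenter. The class and method names are hypothetical.

```java
// Hypothetical helper class, for illustration only.
class RegionalIndicatorBreaks {

    // Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    // Decides whether there is a break at a char offset that sits between two
    // regional indicators. The answer depends on how many indicators precede
    // the offset, not only on the two adjacent code points.
    static boolean isBreakBetweenIndicators(String text, int offset) {
        if (offset <= 0 || offset >= text.length()
                || !isRegionalIndicator(text.codePointBefore(offset))
                || !isRegionalIndicator(text.codePointAt(offset))) {
            throw new IllegalArgumentException("offset is not between two regional indicators");
        }
        int preceding = 0;
        int i = offset;
        while (i > 0) {
            int cp = text.codePointBefore(i);
            if (!isRegionalIndicator(cp)) {
                break;
            }
            preceding++;
            i -= Character.charCount(cp);
        }
        // An even count means the preceding indicators already form complete pairs.
        return preceding % 2 == 0;
    }

    public static void main(String[] args) {
        // Regional indicators A, U, S, A (the AU and SA flags); each takes two chars.
        String auSa = "\uD83C\uDDE6\uD83C\uDDFA\uD83C\uDDF8\uD83C\uDDE6";
        System.out.println(isBreakBetweenIndicators(auSa, 2)); // false: inside the AU pair
        System.out.println(isBreakBetweenIndicators(auSa, 4)); // true: between 'U' and 'S'
        System.out.println(isBreakBetweenIndicators(auSa, 6)); // false: inside the SA pair
    }
}
```

Here the code points around offsets 2 and 6 differ from those around offset 4, but the rule never inspects which letters they are: only the count of preceding indicators decides, which is why a two-argument `isGraphemeBoundary(int cp1, int cp2)` cannot give the right answer in general.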
01-08-2022

I'm wondering if there is a more fundamental operation that could be exposed, for example, determining if there is a grapheme boundary between two adjacent code points, e.g. `boolean isGraphemeBoundary(int cp1, int cp2)`. I realize that more context might be necessary to determine whether there is such a boundary. If so, the API will certainly look different. An alternative might look something like this: `int graphemeAt(char[] a, int index, int limit)` (similar to `Character::codePointAt`). This might return the number of char values that form the next grapheme, starting at index, but not going beyond limit. Or something like that.
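As a rough user-level sketch of the second shape suggested above, the following returns the number of char values in the grapheme starting at `index`, never looking past `limit`. It leans on the RegEx `\X` construct (Java 9 or later) and is only an assumption about how such a method could behave, not a proposed JDK implementation. The class name `GraphemeLength` and method name `graphemeLengthAt` are hypothetical.

```java
import java.nio.CharBuffer;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class GraphemeLength {
    private static final Pattern CLUSTER = Pattern.compile("\\X");

    // Returns how many chars form the grapheme that starts at index,
    // considering only the chars in [index, limit).
    static int graphemeLengthAt(char[] a, int index, int limit) {
        Objects.checkFromToIndex(index, limit, a.length);
        if (index == limit) {
            return 0;
        }
        // CharBuffer is a CharSequence view over the array; no copy is made.
        Matcher m = CLUSTER.matcher(CharBuffer.wrap(a, index, limit - index));
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT, then 'a'
        char[] a = "e\u0301a".toCharArray();
        System.out.println(graphemeLengthAt(a, 0, a.length));
        System.out.println(graphemeLengthAt(a, 2, a.length));
    }
}
```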
01-08-2022