Bug ID: JDK-8209777 \b{g} in regexes fails to break between flag emoji

Type: Enhancement
Component: core-libs
Sub-Component: java.util.regex
Affected Version: 10,11

Priority: P3
Status: Closed
Resolution: Duplicate
OS: windows_10
CPU: x86_64

Submitted: 2018-08-19
Updated: 2019-04-25
Resolved: 2019-04-25

Other
tbdResolved

ADDITIONAL SYSTEM INFORMATION :
Windows 10, x64, version 1803 (OS Build 17134.228)

Oracle JRE and JDK 10.0.2
$ java -version
java version "10.0.2" 2018-07-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

A DESCRIPTION OF THE PROBLEM :
A flag emoji is composed of two regional indicator (RI) symbols (U+1F1E6 through U+1F1FF); multiple RI pairs can be placed adjacently for multiple flags, e.g. U+1F1FA U+1F1F8 U+1F1EB U+1F1F7 for the US and French flags.
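As a rough illustration of the composition described above, the following sketch maps a two-letter country code to its pair of RI symbols. The class name RegionalIndicatorDemo and the helper flag() are hypothetical names chosen for this example; only the code point range comes from the report.

```java
final class RegionalIndicatorDemo {
    // U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; letters B..Z follow consecutively.
    static final int RI_BASE = 0x1F1E6;

    // Hypothetical helper: converts an uppercase two-letter code such as "US" to two RI symbols.
    static String flag(String countryCode) {
        if (countryCode.length() != 2) {
            throw new IllegalArgumentException("expected two letters: " + countryCode);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2; i++) {
            char c = countryCode.charAt(i);
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("expected A-Z: " + countryCode);
            }
            // Each RI symbol is a supplementary code point, i.e. a surrogate pair in a Java String.
            sb.appendCodePoint(RI_BASE + (c - 'A'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoFlags = flag("US") + flag("FR");
        // Prints U+1F1FA, U+1F1F8, U+1F1EB, U+1F1F7, one per line.
        twoFlags.codePoints().forEach(cp -> System.out.println(String.format("U+%X", cp)));
    }
}
```

Because each RI symbol lies outside the Basic Multilingual Plane, a single flag occupies four char values in a Java String, which is why the escaped test input in this report is so long.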

UAX #29 (TR29) "Unicode Text Segmentation" determines how text is split into grapheme clusters; rules GB12 and GB13 handle RI pairs (https://unicode.org/reports/tr29/#GB12).

The report states: "Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. [...] Otherwise, break everywhere." In other words, break between flags at even boundaries. (In a run with an odd number of RI characters, the final RI character has no partner and forms a cluster by itself.)
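The parity rule quoted above can be sketched in a few lines. This is a minimal illustration of the RI pairing only, not the regex engine's implementation: the class RiPairSplitter is a hypothetical name, and as a simplifying assumption every non-RI code point is emitted as its own segment, which ignores all other UAX #29 rules.

```java
import java.util.ArrayList;
import java.util.List;

final class RiPairSplitter {
    static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    // Pairs RI symbols greedily from the start of each run, so a break occurs
    // only where an even number of RI symbols precedes it within the run.
    static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int end = i + Character.charCount(cp);
            if (isRegionalIndicator(cp) && end < text.length()) {
                int next = text.codePointAt(end);
                if (isRegionalIndicator(next)) {
                    end += Character.charCount(next); // keep the pair together
                }
            }
            segments.add(text.substring(i, end));
            i = end;
        }
        return segments;
    }

    public static void main(String[] args) {
        // Three RI symbols: one pair followed by a single unpaired symbol.
        String three = "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb";
        System.out.println(split(three).size()); // prints 2
    }
}
```

Applied to the eight RI symbols used in this report's test case, the sketch returns four segments of two RI symbols each, which is the split the reporter expects from \b{g}.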

However, the \b{g} boundary matcher in java.util.regex.Pattern does not break at all within flag sequences; any number of consecutive RI characters is treated as one grapheme cluster.

Also, I had to spend quite some time fixing my answer because this form destroyed all the flag characters (and non-ASCII punctuation) when I failed the captcha.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
$ javac BoundaryRegex.java
$ java -ea BoundaryRegex

The first assertion fails and aborts the program; the second would fail as well if it were reached. Note that you may need to add an -encoding argument to the javac command for correct compilation.

Note: Unicode codepoints have been escaped because this form does not appear to be Unicode-friendly, but the input string is the RI characters corresponding to 

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Expected string "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" to split into 4 flag graphemes of 2 RI characters each, i.e. {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"}
ACTUAL -
Got one grapheme of all flags together, { "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" }, i.e. the input string unchanged.

---------- BEGIN SOURCE ----------
// BoundaryRegex.java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BoundaryRegex {
    public static void main(String[] args) {
        // Eight regional indicator symbols (as surrogate pairs) forming four flags: AG, GA, US, FR.
        String input = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        // \b{g} matches at grapheme cluster boundaries, so splitting on it should yield one element per cluster.
        var graphemes = Pattern.compile("\\b{g}").split(input);
        assert graphemes.length == 4
                : "Input has 4 flags but only " + graphemes.length + " was found";
        assert Arrays.equals(graphemes, new String[] {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"})
                : "Flags split unexpectedly; " + Arrays.toString(graphemes);
    }
}
---------- END SOURCE ----------
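For checking a particular JDK build, a small probe along the following lines may be convenient. It uses \X (an extended grapheme cluster, available in java.util.regex since JDK 9) rather than \b{g}; that the two constructs share the same boundary logic is an assumption here, not something this report confirms. The class name GraphemeProbe is hypothetical, and the number of clusters printed depends on the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GraphemeProbe {
    // Collects every match of \X, i.e. what the regex engine considers one grapheme cluster.
    static List<String> clusters(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same eight RI symbols as in the test case above.
        String flags = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        List<String> found = clusters(flags);
        System.out.println("java.version " + System.getProperty("java.version")
                + ": " + found.size() + " cluster(s)");
        for (String cluster : found) {
            StringBuilder line = new StringBuilder();
            // Print code points in U+ notation so the output survives non-Unicode consoles.
            cluster.codePoints().forEach(cp -> line.append(String.format("U+%X ", cp)));
            System.out.println(line.toString().trim());
        }
    }
}
```

Printing code points instead of the raw characters avoids the console and form encoding problems mentioned earlier in the report.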

FREQUENCY : always

The current regex grapheme support was implemented in JDK 9, in which Unicode 8.0 was the supported Unicode version. The "emoji rules" GB12/GB13 appear to have been added in a later version. The engine needs to be updated to support them.

21-08-2018

To reproduce the issue, run the attached test case.
JDK 10.0.2 - Fail
JDK 11-ea+25 - Fail
Output:
Exception in thread "main" java.lang.AssertionError: Input has 4 flags but only 1 was found
    at JI9056766.main(JI9056766.java:8)

21-08-2018