ADDITIONAL SYSTEM INFORMATION :
Windows 10, x64, version 1803 (OS Build 17134.228)
Oracle JRE and JDK 10.0.2
$ java -version
java version "10.0.2" 2018-07-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)
A DESCRIPTION OF THE PROBLEM :
A flag emoji is composed of two regional indicator (RI) symbols (U+1F1E6 through U+1F1FF); multiple RI pairs can be placed adjacently for multiple flags, e.g. U+1F1FA U+1F1F8 U+1F1EB U+1F1F7 for the US and the French flags.
UAX #29 "Unicode Text Segmentation" determines how text is split into grapheme clusters; rules GB12 and GB13 handle RI pairs (https://unicode.org/reports/tr29/#GB12).
The report states: "Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point. [...] Otherwise, break everywhere." In other words, a break between two RI characters is allowed only when an even number of RI characters precedes it, so a run of RI characters splits into consecutive pairs. (In a run with an odd number of RI characters, the final unpaired RI forms its own cluster.)
However, the \b{g} grapheme cluster boundary matcher in java.util.regex.Pattern doesn't break at all within flag sequences; any number of consecutive RI characters is treated as one grapheme cluster.
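For comparison, the RI pairing behaviour described by GB12 and GB13 can be approximated by hand. The following is only a minimal sketch, not the JDK's implementation: the class and method names (RiPairSplitter, split, isRegionalIndicator) are made up for illustration, and it applies the RI parity rule alone, treating every other code point boundary as a break and ignoring the remaining UAX #29 rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class RiPairSplitter {
    // Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    static boolean isRegionalIndicator(int cp) {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    // Applies only the RI parity rule: two adjacent RIs stay together when an
    // odd number of RIs precedes the candidate break point. Every other code
    // point boundary is treated as a break (a simplification).
    static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int riRun = 0; // number of consecutive RIs just before position i
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            boolean ri = isRegionalIndicator(cp);
            boolean join = ri && riRun % 2 == 1;
            if (!join && current.length() > 0) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            riRun = ri ? riRun + 1 : 0;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        String input = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                     + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        System.out.println(split(input).size()); // prints 4
    }
}
```

With three RI characters in a row, this sketch yields one pair followed by a single unpaired RI.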
Also, I had to spend quite some time fixing my answer because this form destroyed all the flag characters (and non-ASCII punctuation) when I failed the captcha.
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
$ javac BoundaryRegex.java
$ java -ea BoundaryRegex
The first assertion fails with an AssertionError (the second would fail as well if it were reached); note that you may need to add an -encoding argument to the javac command for correct compilation.
Note: Unicode codepoints have been escaped because this form does not appear to be Unicode-friendly, but the input string is the RI characters corresponding to
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Expected string "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" to split into 4 flag graphemes of 2 RI characters each, i.e. {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"}
ACTUAL -
Got one grapheme of all flags together, { "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7" }, i.e. the input string unchanged.
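Because the escaped surrogate pairs above are hard to read, it may help to map each RI character back to its ASCII region letter. This is a small sketch for that purpose only; the class and method names (RiDecoder, toRegionLetters) are hypothetical and not part of any JDK API.

```java
// Hypothetical helper, for illustration only.
class RiDecoder {
    // Maps each regional indicator (U+1F1E6..U+1F1FF) to its letter A..Z;
    // any other code point is copied through unchanged.
    static String toRegionLetters(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp >= 0x1F1E6 && cp <= 0x1F1FF) {
                out.append((char) ('A' + (cp - 0x1F1E6)));
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6"
                     + "\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7";
        System.out.println(toRegionLetters(input)); // prints AGGAUSFR
    }
}
```

Read in pairs, the output gives the two-letter region code behind each expected flag.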
---------- BEGIN SOURCE ----------
// BoundaryRegex.java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BoundaryRegex {
    public static void main(String[] args) {
        // Input: eight RI characters, i.e. four flags, written as escaped surrogate pairs.
        // \b{g} matches at Unicode extended grapheme cluster boundaries.
        var graphemes = Pattern.compile("\\b{g}").split("\ud83c\udde6\ud83c\uddec\ud83c\uddec\ud83c\udde6\ud83c\uddfa\ud83c\uddf8\ud83c\uddeb\ud83c\uddf7");
        // Expect one grapheme cluster per flag.
        assert graphemes.length == 4
            : "Input has 4 flags but only " + graphemes.length + " was found";
        // Expect each cluster to be exactly one RI pair, in input order.
        assert Arrays.equals(graphemes, new String[] {"\ud83c\udde6\ud83c\uddec", "\ud83c\uddec\ud83c\udde6", "\ud83c\uddfa\ud83c\uddf8", "\ud83c\uddeb\ud83c\uddf7"})
            : "Flags split unexpectedly; " + Arrays.toString(graphemes);
    }
}
---------- END SOURCE ----------
FREQUENCY : always