JDK-8253058 : Case insensitive regexes for supplementary characters
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Priority: P4
  • Status: Resolved
  • Resolution: Not an Issue
  • OS: generic
  • CPU: generic
  • Submitted: 2020-09-11
  • Updated: 2020-09-11
  • Resolved: 2020-09-11
Description
Raised on the jdk-dev mailing list:
https://mail.openjdk.java.net/pipermail/jdk-dev/2020-September/004727.html

---
For the scripts Deseret, Osage, Old Hungarian, Warang Citi,
Medefaidrin, and Adlam, when the strings lower and upper hold
the lowercase and uppercase variants of the same letter, the
following assertion fails:

Pattern pattern = Pattern.compile(lower, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(upper);
assertThat(matcher.matches()).isTrue();
Comments
The spec of Pattern.CASE_INSENSITIVE reads:
---
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
---
And in fact, the following piece of code returns true:
---
var lower = "\ud83a\udd2e";
var upper = "\ud83a\udd0c";
Pattern.compile(lower, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
    .matcher(upper)
    .matches();
---
11-09-2020
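For comparison, here is a minimal sketch that runs both flag combinations side by side, using the Adlam pair from the comment above. The class name SupplementaryCaseDemo and the helper matches are illustrative only and not part of the report. It assumes a JDK whose Unicode data includes Adlam case mappings (JDK 11 or later should). The embedded form (?iu) is the inline equivalent of CASE_INSENSITIVE | UNICODE_CASE.

```java
import java.util.regex.Pattern;

public class SupplementaryCaseDemo {

    // Adlam letter pair from the comment above, written as surrogate pairs:
    // U+1E92E (lowercase) and U+1E90C (uppercase).
    static final String LOWER = "\ud83a\udd2e";
    static final String UPPER = "\ud83a\udd0c";

    // Hypothetical helper: compiles regex with the given flags and
    // tests whether it matches the entire input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE alone folds case only for US-ASCII characters,
        // so the two supplementary letters are compared as-is.
        boolean asciiOnly = matches(LOWER, Pattern.CASE_INSENSITIVE, UPPER);

        // Adding UNICODE_CASE enables Unicode-aware case folding.
        boolean unicodeAware =
                matches(LOWER, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, UPPER);

        // The same two flags expressed as an embedded flag expression.
        boolean embedded = matches("(?iu)" + LOWER, 0, UPPER);

        System.out.println("CASE_INSENSITIVE only:           " + asciiOnly);
        System.out.println("CASE_INSENSITIVE | UNICODE_CASE: " + unicodeAware);
        System.out.println("(?iu) embedded flags:            " + embedded);
    }
}
```

Going by the report and the comment, the first line should print false and the other two true, which is the documented behavior rather than a bug.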