Bug ID: JDK-8253058 Case insensitive regexes for supplementary characters

Type: Bug
Component: core-libs
Sub-Component: java.util.regex

Priority: P4
Status: Resolved
Resolution: Not an Issue
OS: generic
CPU: generic

Submitted: 2020-09-11
Updated: 2020-09-11
Resolved: 2020-09-11

Related Reports

Relates :

JDK-8248655 - Support supplementary characters in String case insensitive operations

Description

Raised on the jdk-dev mailing list:
https://mail.openjdk.java.net/pipermail/jdk-dev/2020-September/004727.html

---
For the scripts Deseret, Osage, Old Hungarian, Warang Citi,
Medefaidrin, and Adlam, given strings lower and upper that
hold the lowercase and uppercase variants of the same letter,
the following code fails:

Pattern pattern = Pattern.compile(lower, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(upper);
assertThat(matcher.matches()).isTrue();

Comments

The spec of Pattern.CASE_INSENSITIVE reads:

---
By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.
---

And in fact, the following piece of code returns true.

---
var lower = "\ud83a\udd2e";
var upper = "\ud83a\udd0c";
Pattern.compile(lower, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
       .matcher(upper)
       .matches();
---

11-09-2020
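
To make the difference between the two flag combinations concrete, here is a minimal sketch that runs the same match both ways. The class name SupplementaryCaseDemo and its helper methods are hypothetical, introduced only for illustration. It uses the Deseret pair U+10400 / U+10428 (one of the scripts named in the report) rather than the Adlam pair from the comment above, on the assumption that Deseret case mappings are available on any reasonably recent JDK.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the original report.
public class SupplementaryCaseDemo {
    // DESERET CAPITAL LETTER LONG I and DESERET SMALL LETTER LONG I,
    // built from code points so the surrogate pairs need not be typed by hand.
    static final String UPPER = new String(Character.toChars(0x10400));
    static final String LOWER = new String(Character.toChars(0x10428));

    // CASE_INSENSITIVE alone: per the spec, only US-ASCII letters are folded.
    static boolean asciiOnlyMatch(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE)
                      .matcher(input)
                      .matches();
    }

    // CASE_INSENSITIVE | UNICODE_CASE: Unicode-aware case-insensitive matching.
    static boolean unicodeAwareMatch(String pattern, String input) {
        return Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println("CASE_INSENSITIVE only:           " + asciiOnlyMatch(LOWER, UPPER));
        System.out.println("CASE_INSENSITIVE | UNICODE_CASE: " + unicodeAwareMatch(LOWER, UPPER));
    }
}
```

With the first helper the match is expected to fail, as described in the report, and with the second it is expected to succeed. This is consistent with the resolution "Not an Issue": the observed behavior follows from the documented ASCII-only default of CASE_INSENSITIVE.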