Bug ID: JDK-8253059 Case insensitive collators for supplementary characters

Type: Bug
Component: core-libs
Sub-Component: java.text

Priority: P4
Status: Resolved
Resolution: Won't Fix
OS: generic
CPU: generic

Submitted: 2020-09-11
Updated: 2020-09-14
Resolved: 2020-09-14

Raised on the jdk-dev mailing list:
https://mail.openjdk.java.net/pipermail/jdk-dev/2020-September/004727.html

---
For the scripts Deseret, Osage, Old Hungarian, Warang Citi,
Medefaidrin, and Adlam, the following code fails when lower
and upper hold the lowercase and uppercase variants of the
same letter:

Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
assertThat(collator.compare(lower, upper)).isEqualTo(0);
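
A self-contained way to try the comparison above, as a sketch only. The class name SupplementaryCaseDemo, the helper compareAtPrimary, the choice of the Deseret pair U+10400 / U+10428, and the use of Locale.ROOT instead of the default locale are assumptions made for illustration and are not part of the report. The Latin pair is included for contrast; per this report, the Deseret pair is expected to compare as nonzero.

```java
import java.text.Collator;
import java.util.Locale;

class SupplementaryCaseDemo {
    // Deseret capital and small letter LONG I, picked as one example pair.
    static final String UPPER = new String(Character.toChars(0x10400));
    static final String LOWER = new String(Character.toChars(0x10428));

    // Compares two strings with a PRIMARY-strength collator for the given locale.
    static int compareAtPrimary(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Latin case variants differ only at the TERTIARY level in the default rules.
        System.out.println("Latin a/A:    " + compareAtPrimary(Locale.ROOT, "a", "A"));
        // Supplementary case variants, as described in this report.
        System.out.println("Deseret pair: " + compareAtPrimary(Locale.ROOT, LOWER, UPPER));
    }
}
```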

According to the Collator class specification, "The exact assignment of strengths to language features is locale dependent." So specifying PRIMARY does not necessarily mean that case differences are ignored.

Looking at the implementation in the sun.util.locale.provider.CollationRules class, only the Latin letters (with combining marks) are supposed to have TERTIARY differences. For example, Russian "A" (U+0410) and Russian "a" (U+0430) are not treated as a TERTIARY difference. (With a "ru" locale Collator instance they are, though.) Thus the case difference between supplementary characters is not a TERTIARY difference in the default collation rules, i.e., this is working as expected.

I am not sure this default behavior is intended, but I would not replace it with a different one, because:
- It would cause incompatibility.
- Anyone who needs it can provide their own java.text.spi.CollatorProvider implementation.

So unless there is a dire need to change the default behavior, I would not fix this as suggested.
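
As a rough illustration of the workaround mentioned above, the following sketch shows the kind of Collator a custom java.text.spi.CollatorProvider might return. It is an assumption-laden example, not the JDK's approach: the class name CaseFoldingCollator is hypothetical, and it swaps in simple case folding via String.toLowerCase(Locale.ROOT) ahead of an existing collator rather than changing any collation rules. The provider class itself and its registration are not shown.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical wrapper: folds case before delegating when strength is below TERTIARY.
// Not thread-safe; the wrapped collator is mutated to mirror this object's settings.
class CaseFoldingCollator extends Collator {
    private final Collator delegate;

    CaseFoldingCollator(Collator delegate) {
        this.delegate = delegate;
    }

    // Copies strength and decomposition from this wrapper to the wrapped collator.
    private void syncDelegate() {
        delegate.setStrength(getStrength());
        delegate.setDecomposition(getDecomposition());
    }

    // String.toLowerCase operates on code points, so supplementary letters are folded too.
    private String fold(String s) {
        return getStrength() < Collator.TERTIARY ? s.toLowerCase(Locale.ROOT) : s;
    }

    @Override
    public int compare(String source, String target) {
        syncDelegate();
        return delegate.compare(fold(source), fold(target));
    }

    @Override
    public CollationKey getCollationKey(String source) {
        if (source == null) {
            return null;
        }
        syncDelegate();
        // Note: the returned key reports the folded string as its source string.
        return delegate.getCollationKey(fold(source));
    }

    @Override
    public int hashCode() {
        return delegate.hashCode();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // Deseret capital LONG I
        String lower = new String(Character.toChars(0x10428)); // Deseret small LONG I
        Collator collator = new CaseFoldingCollator(Collator.getInstance(Locale.ROOT));
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.compare(lower, upper));
    }
}
```

Folding only below TERTIARY keeps case significant at TERTIARY and IDENTICAL strength, which is one possible design choice among several.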

14-09-2020