JDK-8248516 : some newly added locale cannot parse uppercased date string.
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.lang
  • Priority: P3
  • Status: Closed
  • Resolution: Withdrawn
  • Fix Versions: tbd
  • Submitted: 2020-06-29
  • Updated: 2020-07-01
  • Resolved: 2020-07-01
Description
Summary
-------

Date/Time names with supplementary characters cannot be parsed in a case-insensitive manner.

Problem
-------

JDK 15 added a new locale, `ff-Adlm-LR`, whose locale data includes month and day names written in the Adlam script, which is encoded in a supplementary plane. `java.text.DateFormat` parses those names case-insensitively, but it throws an exception because the underlying `String.regionMatches(boolean ignoreCase, ...)` call with `ignoreCase == true` fails to match supplementary characters. For example, the call

    "\ud83a\udd2e".regionMatches(true, 0, "\ud83a\udd0c", 0, 2)

returns `false`, where:

    "\ud83a\udd2e" == 'ADLAM SMALL LETTER O' (U+1E92E)
    "\ud83a\udd0c" == 'ADLAM CAPITAL LETTER O' (U+1E90C)

This happens even though each of the following expressions evaluates to `true`:

    "\ud83a\udd2e".toUpperCase(Locale.ROOT).equals("\ud83a\udd0c")
    Character.toUpperCase(0x1e92e) == 0x1e90c
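The mismatch can be reproduced without relying on any particular JDK's `regionMatches` behavior. The following is a minimal sketch, not the JDK implementation: `SurrogateCaseDemo` and `unitWiseEqualsIgnoreCase` are hypothetical names that emulate folding case one UTF-16 code unit at a time. It assumes a runtime whose Unicode data includes Adlam case mappings, as JDK 15 does.

```java
final class SurrogateCaseDemo {

    // Hypothetical helper: folds case one UTF-16 code unit at a time,
    // which is the style of comparison described in the Problem section.
    static boolean unitWiseEqualsIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char x = Character.toUpperCase(a.charAt(i));
            char y = Character.toUpperCase(b.charAt(i));
            if (x != y && Character.toLowerCase(x) != Character.toLowerCase(y)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String small = "\ud83a\udd2e";   // ADLAM SMALL LETTER O (U+1E92E)
        String capital = "\ud83a\udd0c"; // ADLAM CAPITAL LETTER O (U+1E90C)

        // A lone surrogate code unit has no case mapping, so it maps to itself.
        System.out.println(Character.toUpperCase('\udd2e') == '\udd2e');

        // The low surrogates differ, so the unit-wise comparison fails.
        System.out.println(unitWiseEqualsIgnoreCase(small, capital));

        // The whole code point does have a case mapping.
        System.out.println(
                Character.toUpperCase(small.codePointAt(0)) == capital.codePointAt(0));
    }
}
```

The two strings share the high surrogate `\ud83a` and differ only in the low surrogate, which is why a comparison that never reassembles the pair cannot see the case relationship.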

Solution
--------

Change the specifications of `String.regionMatches(boolean, ...)`, `String.equalsIgnoreCase()`, and `String.compareToIgnoreCase()` so that supplementary characters are compared by code point. Characters in the Basic Multilingual Plane (`<= \uFFFF`) continue to be compared as code units obtained from the `charAt()` method.

Although this change alters how the strings are traversed during comparison, the rationale is that these `String` methods should behave consistently for all characters (code points), whether or not they are in the Basic Multilingual Plane. There is no reason to exclude supplementary characters from case-insensitive string comparison.
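One way to read the proposed traversal is sketched below. This is an illustration under stated assumptions, not the actual JDK change: `CodePointRegionMatch`, `regionMatchesIgnoreCase`, `isPairAt`, and `sameIgnoringCase` are hypothetical names. Code points are used only when both positions hold a complete surrogate pair inside the region; everything else, including unpaired surrogates, falls back to `charAt()`.

```java
final class CodePointRegionMatch {

    // Hypothetical stand-in for String.regionMatches(true, ...).
    static boolean regionMatchesIgnoreCase(String a, int aOff, String b, int bOff, int len) {
        // Reject regions that fall outside either string.
        if (aOff < 0 || bOff < 0
                || aOff > (long) a.length() - len
                || bOff > (long) b.length() - len) {
            return false;
        }
        int k = 0;
        while (k < len) {
            int ia = aOff + k;
            int ib = bOff + k;
            if (isPairAt(a, ia, aOff + len) && isPairAt(b, ib, bOff + len)) {
                // Both sides hold a supplementary character: compare code points
                // and skip the low surrogate at k+1.
                if (!sameIgnoringCase(a.codePointAt(ia), b.codePointAt(ib))) {
                    return false;
                }
                k += 2;
            } else {
                // BMP characters and unpaired surrogates: compare code units.
                if (!sameIgnoringCase(a.charAt(ia), b.charAt(ib))) {
                    return false;
                }
                k += 1;
            }
        }
        return true;
    }

    // True when a high surrogate at i is followed by a low surrogate before end.
    private static boolean isPairAt(String s, int i, int end) {
        return i + 1 < end
                && Character.isHighSurrogate(s.charAt(i))
                && Character.isLowSurrogate(s.charAt(i + 1));
    }

    private static boolean sameIgnoringCase(int x, int y) {
        if (x == y) {
            return true;
        }
        int ux = Character.toUpperCase(x);
        int uy = Character.toUpperCase(y);
        return ux == uy || Character.toLowerCase(ux) == Character.toLowerCase(uy);
    }
}
```

With this sketch, `regionMatchesIgnoreCase("\ud83a\udd2e", 0, "\ud83a\udd0c", 0, 2)` matches, assuming the runtime's Unicode data includes the Adlam case mappings.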

Specification
-------------

Append the following sentence immediately after the last list item of conditions in the method description of `String.regionMatches(boolean, ...)`:

    * In case that both <i>toffset+k</i> and <i>ooffset+k</i> point to
    * supplementary characters, that is <i>k</i> point to high surrogates
    * and <i>k+1</i> point to low surrogates, {@code codePointAt()} is
    * used to retrieve the code points in place for {@code charAt()} method,
    * and <i>k+1</i> is excluded from the above condition. If they point
    * to an unpaired high or low surrogates, they are compared using
    * {@code charAt()} method.

Change the following list item of conditions in the method description of `String.equalsIgnoreCase()` from:

    *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(char))}
    *        on each character produces the same result

to:

    *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(int))}
    *        on each code point produces the same result

Change the following text in the method description of `String.compareToIgnoreCase()` from:

    * {@code Character.toLowerCase(Character.toUpperCase(character))} on
    * each character.

to:

    * {@code Character.toLowerCase(Character.toUpperCase(int))} on
    * each code point of the character.
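
The revised wording for `equalsIgnoreCase()` and `compareToIgnoreCase()` can be illustrated with a small code-point-based comparison. This is a minimal sketch, not the JDK implementation: `CodePointCaseCompare`, `fold`, `compareIgnoreCase`, and `equalsIgnoreCase` are hypothetical names, and the ordering it produces between unequal strings is by folded code point value, which is an assumption of this sketch rather than something the specification text above states.

```java
final class CodePointCaseCompare {

    // Applies Character.toLowerCase(Character.toUpperCase(int)) to one code point.
    static int fold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Hypothetical stand-in for String.compareToIgnoreCase().
    static int compareIgnoreCase(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            // codePointAt() returns the code unit itself for an unpaired surrogate.
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                int fa = fold(ca);
                int fb = fold(cb);
                if (fa != fb) {
                    return fa - fb; // code points fit in 21 bits, so no overflow
                }
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a case-insensitive prefix of the other (or they are equal).
        return (a.length() - i) - (b.length() - j);
    }

    // Hypothetical stand-in for String.equalsIgnoreCase().
    static boolean equalsIgnoreCase(String a, String b) {
        return compareIgnoreCase(a, b) == 0;
    }
}
```

Under this sketch, `equalsIgnoreCase("\ud83a\udd2e", "\ud83a\udd0c")` holds because both code points fold to U+1E92E, assuming the runtime's Unicode data includes the Adlam case mappings.
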
Comments
Cloned to JDK-8248664, with a clearer focus on the String case-insensitive operations.
01-07-2020