JDK-8248664 : Support supplementary characters in String case insensitive operations
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.lang
  • Priority: P4
  • Status: Closed
  • Resolution: Approved
  • Fix Versions: 16
  • Submitted: 2020-07-01
  • Updated: 2020-07-18
  • Resolved: 2020-07-18
Description
Summary
-------

Support supplementary characters' case mappings in the `java.lang.String` methods that perform case-insensitive comparison or matching.

Problem
-------

`String.regionMatches(ignoreCase=true, ...)`, `String.equalsIgnoreCase()`, and `String.compareToIgnoreCase()` are supposed to match or compare strings in a case-insensitive manner. However, their specs and implementations are `char`-based, so they cannot handle supplementary characters correctly. For example,

    "\ud83a\udd2e".regionMatches(true, 0, "\ud83a\udd0c", 0, 2)

returns `false` (conforming to the existing spec), although `"\ud83a\udd2e"` is the character `'ADLAM SMALL LETTER O'` (code point `U+1E92E`) and `"\ud83a\udd0c"` is the character `'ADLAM CAPITAL LETTER O'` (code point `U+1E90C`). To be true to the meaning of "ignore case," it should return `true`. The current behavior also contradicts the fact that each of the following expressions evaluates to `true`:

    "\ud83a\udd2e".toUpperCase(Locale.ROOT).equals("\ud83a\udd0c")
    Character.toUpperCase(0x1e92e) == 0x1e90c
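
The mismatch can be reproduced independently of the JDK version with a small sketch. The helper `charWiseEqualsIgnoreCase` below is hypothetical; it only mimics the per-`char` folding that the existing spec describes and is not the JDK implementation.

```java
class CharWiseFoldingDemo {
    // Hypothetical helper: folds case one UTF-16 code unit at a time,
    // as described by the pre-change spec.
    static boolean charWiseEqualsIgnoreCase(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char c1 = Character.toLowerCase(Character.toUpperCase(a.charAt(i)));
            char c2 = Character.toLowerCase(Character.toUpperCase(b.charAt(i)));
            if (c1 != c2) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String small = "\ud83a\udd2e";   // U+1E92E ADLAM SMALL LETTER O
        String capital = "\ud83a\udd0c"; // U+1E90C ADLAM CAPITAL LETTER O

        // Surrogate code units have no case mapping, so the low surrogates differ.
        System.out.println(charWiseEqualsIgnoreCase(small, capital)); // false

        // The code point based mapping does relate the two letters.
        System.out.println(
            Character.toUpperCase(small.codePointAt(0)) == capital.codePointAt(0)); // true
    }
}
```

The first result is `false` because each surrogate is folded in isolation, which leaves it unchanged.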

Solution
--------

Change the specs of `String.regionMatches(boolean, ...)`, `String.equalsIgnoreCase()`, and `String.compareToIgnoreCase()` to perform code point comparison so that supplementary characters are case-mapped correctly. Characters in the Basic Multilingual Plane (`<= \uFFFF`) continue to be compared using the code units obtained from the `charAt()` method.

Although this change alters how the strings are traversed for comparison, the rationale is that these `String` methods should behave consistently across all characters (code points), whether or not they are in the Basic Multilingual Plane. There is no reason to exclude supplementary characters from case-insensitive string comparison.
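
To illustrate the difference in traversal, here is a minimal sketch that walks a string by code point rather than by `char`. The class and the helper `describeCodePoints` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TraversalSketch {
    // Hypothetical helper: lists the code points of s in "U+XXXX" form.
    static List<String> describeCodePoints(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);          // combines a valid surrogate pair
            out.add(String.format("U+%04X", cp));
            i += Character.charCount(cp);       // advance by 1 (BMP) or 2 (supplementary)
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a' is in the BMP; U+1E92E needs a surrogate pair.
        String s = "a\ud83a\udd2e";
        System.out.println(s.length());            // 3 code units
        System.out.println(describeCodePoints(s)); // [U+0061, U+1E92E]
    }
}
```

A BMP character occupies one index either way, so only strings containing surrogate pairs are traversed differently.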

Specification
-------------

Change the method description of the `String.regionMatches(boolean, ...)` method as follows:

       * A substring of this {@code String} object is compared to a substring
       * of the argument {@code other}. The result is {@code true} if these
    -  * substrings represent character sequences that are the same, ignoring
    -  * case if and only if {@code ignoreCase} is true. The substring of 
    -  * this {@code String} object to be compared begins at index
    -  * {@code toffset} and has length {@code len}. The substring of 
    -  * {@code other} to be compared begins at index {@code ooffset} and
    -  * has length {@code len}. The result is {@code false} if and only if 
    -  * at least one of the following is true:
    -  * <ul><li>{@code toffset} is negative.
    -  * <li>{@code ooffset} is negative.
    -  * <li>{@code toffset+len} is greater than the length of this
    +  * substrings represent Unicode code point sequences that are the same,
    +  * ignoring case if and only if {@code ignoreCase} is true.
    +  * The sequences {@code tsequence} and {@code osequence} are compared,
    +  * where {@code tsequence} is the sequence produced as if by calling
    +  * {@code this.substring(toffset, len).codePoints()} and {@code osequence}
    +  * is the sequence produced as if by calling
    +  * {@code other.substring(ooffset, len).codePoints()}.
    +  * The result is {@code true} if and only if all of the following
    +  * are true:
    +  * <ul><li>{@code toffset} is non-negative.
    +  * <li>{@code ooffset} is non-negative.
    +  * <li>{@code toffset+len} is less than or equal to the length of this
       * {@code String} object.
    -  * <li>{@code ooffset+len} is greater than the length of the other
    +  * <li>{@code ooffset+len} is less than or equal to the length of the other
       * argument.
    -  * <li>{@code ignoreCase} is {@code false} and there is some nonnegative
    -  * integer <i>k</i> less than {@code len} such that:
    -  * <blockquote><pre>
    -  * this.charAt(toffset+k) != other.charAt(ooffset+k)
    -  * </pre></blockquote>
    -  * <li>{@code ignoreCase} is {@code true} and there is some nonnegative
    -  * integer <i>k</i> less than {@code len} such that:
    -  * <blockquote><pre>
    -  * Character.toLowerCase(Character.toUpperCase(this.charAt(toffset+k))) != 
    -  * Character.toLowerCase(Character.toUpperCase(other.charAt(ooffset+k)))
    -  * </pre></blockquote>
    +  * <li>if {@code ignoreCase} is {@code false}, all pairs of corresponding Unicode
    +  * code points are equal integer values; or if {@code ignoreCase} is {@code true},
    +  * {@link Character#toLowerCase(int) Character.toLowerCase(}
    +  * {@link Character#toUpperCase(int)}{@code )} on all pairs of Unicode code points
    +  * results in equal integer values.
       * </ul>

<snip>

    -  * @param   len          the number of characters to compare.
    +  * @param   len          the number of characters (Unicode code units -
    +  *                       16bit {@code char} value) to compare.
       * @return  {@code true} if the specified subregion of this string
       *          matches the specified subregion of the string argument;
       *          {@code false} otherwise. Whether the matching is exact
       *          or case insensitive depends on the {@code ignoreCase}
       *          argument.
    +  * @see     #codePoints()
       */
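
The "as if" wording above can be read as the following sketch. It is not the JDK implementation: `regionMatchesByCodePoint` and `fold` are hypothetical helpers, and the sketch makes two assumptions. Because `String.substring(int, int)` takes an end index rather than a length, it extracts the region with `substring(offset, offset + len)`. It also treats a non-positive `len` as an empty region that matches.

```java
class RegionMatchSketch {
    // Case folding as named in the spec text.
    static int fold(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    // Hypothetical helper; offsets and len are counted in UTF-16 code units.
    static boolean regionMatchesByCodePoint(String self, boolean ignoreCase, int toffset,
                                            String other, int ooffset, int len) {
        // Bounds conditions from the spec; long arithmetic avoids int overflow.
        if (toffset < 0 || ooffset < 0
                || (long) toffset + len > self.length()
                || (long) ooffset + len > other.length()) {
            return false;
        }
        if (len <= 0) {
            return true; // assumption of this sketch: an empty region matches
        }
        int[] t = self.substring(toffset, toffset + len).codePoints().toArray();
        int[] o = other.substring(ooffset, ooffset + len).codePoints().toArray();
        if (t.length != o.length) {
            return false;
        }
        for (int i = 0; i < t.length; i++) {
            int a = ignoreCase ? fold(t[i]) : t[i];
            int b = ignoreCase ? fold(o[i]) : o[i];
            if (a != b) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String small = "\ud83a\udd2e";   // U+1E92E
        String capital = "\ud83a\udd0c"; // U+1E90C
        System.out.println(regionMatchesByCodePoint(small, true, 0, capital, 0, 2));  // true
        System.out.println(regionMatchesByCodePoint(small, false, 0, capital, 0, 2)); // false
        // A region starting in the middle of a pair sees a lone low surrogate,
        // which codePoints() yields as-is and which has no case mapping.
        System.out.println(regionMatchesByCodePoint(small, true, 1, capital, 1, 1));  // false
    }
}
```

A region that starts or ends inside a surrogate pair contains an unpaired surrogate, which is compared as its own code point without case mapping.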

Change the method description of the `String.equalsIgnoreCase()` method as follows:

      /**
       * Compares this {@code String} to another {@code String}, ignoring case
       * considerations.  Two strings are considered equal ignoring case if they
    -  * are of the same length and corresponding characters in the two strings
    -  * are equal ignoring case.
    +  * are of the same length and corresponding Unicode code points in the two
    +  * strings are equal ignoring case.
       *
    -  * <p> Two characters {@code c1} and {@code c2} are considered the same
    +  * <p> Two Unicode code points are considered the same
       * ignoring case if at least one of the following is true:
       * <ul>
    -  *   <li> The two characters are the same (as compared by the
    +  *   <li> The two Unicode code points are the same (as compared by the
       *        {@code ==} operator)
    -  *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(char))}
    -  *        on each character produces the same result
    +  *   <li> Calling {@code Character.toLowerCase(Character.toUpperCase(int))}
    +  *        on each Unicode code point produces the same result
       * </ul>
       *

<snip>

       * @see  #equals(Object)
    +  * @see  #codePoints()
       */
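
As a rough illustration of the revised wording, the two conditions can be sketched as below. The names `sameIgnoringCase` and `equalsIgnoreCaseByCodePoint` are hypothetical, and the sketch follows the spec text rather than the actual JDK code.

```java
class EqualsIgnoreCaseSketch {
    // The per-code-point rule from the spec: identical, or equal after folding.
    static boolean sameIgnoringCase(int cp1, int cp2) {
        return cp1 == cp2
            || Character.toLowerCase(Character.toUpperCase(cp1))
               == Character.toLowerCase(Character.toUpperCase(cp2));
    }

    // Hypothetical helper; a null argument never matches.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        if (b == null || a.length() != b.length()) {
            return false; // "of the same length" is measured in code units
        }
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        if (x.length != y.length) {
            return false;
        }
        for (int i = 0; i < x.length; i++) {
            if (!sameIgnoringCase(x[i], y[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoreCaseByCodePoint("\ud83a\udd2e", "\ud83a\udd0c")); // true
        System.out.println(equalsIgnoreCaseByCodePoint("Hello", "hELLO"));               // true
        System.out.println(equalsIgnoreCaseByCodePoint("a", "ab"));                      // false
    }
}
```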

Change the method description of the `String.compareToIgnoreCase()` method as follows:

      /**
       * Compares two strings lexicographically, ignoring case
       * differences. This method returns an integer whose sign is that of
    -  * calling {@code compareTo} with normalized versions of the strings
    +  * calling {@code compareTo} with case folded versions of the strings
       * where case differences have been eliminated by calling
    -  * {@code Character.toLowerCase(Character.toUpperCase(character))} on
    -  * each character.
    +  * {@code Character.toLowerCase(Character.toUpperCase(int))} on
    +  * each Unicode code point.
       * <p>

<snip>

       * @see     java.text.Collator
    +  * @see     #codePoints()
       * @since   1.2
       */

Comments
I've filed JDK-8249718 as a follow-up to refine the code point discussions; moving this CSR to Approved.
18-07-2020

As Naoto mentioned, the codePoints() method spec covers all the cases for BMP code points, supplementary code points represented as surrogate pairs, unpaired surrogates, and undefined code units. The spec here is rather compressed in that all of these behaviors are implications of the phrase "as if by calling xxx.codePoints()." I think the spec is well-defined, though I admit that all the behaviors aren't obvious. I'm not sure this can be remedied by adding an API note. I think there is room for a followup pass to the String spec, as there are several other String methods that deal with a String as a "sequence of code points" which might involve combining surrogate pairs, and handling all those same special cases. Such methods include codePointAt, codePointBefore, codePointCount, codePoints, offsetByCodePoints, toLowerCase, toUpperCase, as well as the methods modified by this CSR. It seems reasonable to me for that to be undertaken separately from this CSR, which is trying to bring the three methods in question into the "Unicode code point" umbrella and remove the ambiguity about "character" (while also fixing bugs).
16-07-2020

How about something like an apiNote -- "Make sure you start on good code point boundaries for meaningful results." etc.
16-07-2020

How the code point sequence is generated is spelled out in the codePoints() method, including the case of unpaired surrogates (zero-extended to int values), so I added the `@see codePoints()` tag. Do you think an additional, similar explanation would help here?
16-07-2020

The new spec states:

    +  * The sequences {@code tsequence} and {@code osequence} are compared,
    +  * where {@code tsequence} is the sequence produced as if by calling
    +  * {@code this.substring(toffset, len).codePoints()} and {@code osequence}
    +  * is the sequence produced as if by calling
    +  * {@code other.substring(ooffset, len).codePoints()}.

Operationally, what happens if the starting offset into either string falls in the middle of a surrogate pair? Does this need to be spelled out in the spec?
16-07-2020

I don't think there would be any new exceptional cases, because if the last "char" is a high surrogate, it is simply treated as a code point, which is the way it used to be. The same is true if a lone high (or low) surrogate exists in the middle of the string: it is treated as a single code point (and no case mapping occurs).
15-07-2020

Moving to Provisional; not Approved. Are any new kinds of exceptional cases possible if the len offset ends on the first half of a surrogate pair?
15-07-2020

[~naoto], right, no new CSR or CSR revision is needed; please roll it into the rest of your patch. (Not quite done reviewing this CSR, but I wanted to leave the comment in the interim.)
15-07-2020

Thank you, [~darcy]. I will apply this patch:

```
      * A Comparator that orders {@code String} objects as by
    - * {@code compareToIgnoreCase}. This comparator is serializable.
    + * {@link #compareToIgnoreCase(String) compareToIgnoreCase}.
    + * This comparator is serializable.
      * <p>
```

I expect this won't warrant a CSR revision, right?
15-07-2020

As a code review comment, I suggest updating the javadoc for CASE_INSENSITIVE_ORDER to use a link to refer to compareToIgnoreCase.
14-07-2020