JDK-6664636 : [Col] API improvements to minimize memory allocations during unicode processing
  • Type: Enhancement
  • Component: core-libs
  • Sub-Component: java.text
  • Affected Version: 6
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: solaris_8
  • CPU: x86
  • Submitted: 2008-02-19
  • Updated: 2019-04-11
Description
A DESCRIPTION OF THE REQUEST :
While writing some code to get the maximum common prefix of two Unicode CharSequences, it became apparent that the current API was not sufficient for an efficient implementation. Suggested changes:

Collator.compare(String, String) -> Collator.compare(CharSequence, CharSequence)

Suggested additions:

Collator.compare(int codepoint1, int codepoint2)
Character.toString(int codepoint)

JUSTIFICATION :
While writing some code to get the maximum common prefix of two Unicode CharSequences, it became apparent that the current API was not sufficient for an efficient implementation. See the attached source code for an example. Basically, Strings are immutable and the only comparison provided by the Collator is String-based, rather than based on the more generic CharSequence. If the data you are processing is not stored as Strings, then you are forced to allocate Strings to do basic processing. Also, since there is no API for comparing single code points, processing such as finding the maximum common prefix requires up to (# of codepoints in smaller sequence * 2) memory allocations.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Require fewer memory allocations when doing unicode processing of CharSequences.
ACTUAL -
For example, currently requires (# of codepoints in smaller sequence * 2) memory allocations to find maximum common prefix of 2 unicode CharSequences.

---------- BEGIN SOURCE ----------
// Requires: import java.text.Collator;
// Returns the length, in code points, of the longest common prefix under the given collator.
private static int getLengthOfMaxCommonPrefix(CharSequence str1, CharSequence str2, Collator collator) {
    if ((str1 == null) || (str2 == null)) { return 0; }
    // Only as many code points as the shorter sequence holds can match.
    int size = Math.min(Character.codePointCount(str1, 0, str1.length()),
                        Character.codePointCount(str2, 0, str2.length()));
    // @todo get rid of memory allocation
    char[] charArray = new char[4];
    int index1 = 0; // char index into str1 (not a code point index)
    int index2 = 0; // char index into str2
    for (int i = 0; i < size; i++) {
      // toChars returns the number of chars written: 1 for BMP, 2 for supplementary code points
      int len1 = Character.toChars(Character.codePointAt(str1, index1), charArray, 0);
      int len2 = Character.toChars(Character.codePointAt(str2, index2), charArray, 2);
      // @todo get rid of memory allocation
      String char1Str = new String(charArray, 0, len1);
      // @todo get rid of memory allocation
      String char2Str = new String(charArray, 2, len2);
      if (collator.compare(char1Str, char2Str) != 0) {
        return i;
      }
      index1 += len1;
      index2 += len2;
    }
    return size;
  }

---------- END SOURCE ----------

Comments
In principle, this could be done without even changing the API signature, since there is already a method to compare Objects. We would just need to extend the definition of that method to accommodate CharSequence as well as String. Adding compare(CharSequence, CharSequence) would be even better.
18-01-2016
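For illustration, the following is a minimal sketch of what callers have to do today when their data is held in a CharSequence. The class CharSequenceCollation and its compare helper are hypothetical names, not part of java.text; the sketch only assumes the existing Collator.compare(String, String).

```java
import java.text.Collator;
import java.util.Locale;

class CharSequenceCollation {
    // Hypothetical helper, not part of java.text: adapts CharSequence arguments
    // to the existing String-based Collator.compare.
    static int compare(Collator collator, CharSequence a, CharSequence b) {
        // String.toString() returns the same instance, so a String argument is not copied.
        // Other CharSequence implementations (StringBuilder, CharBuffer, ...) typically
        // allocate a new String here, which is the cost this request wants to avoid.
        return collator.compare(a.toString(), b.toString());
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        System.out.println(compare(collator, new StringBuilder("abc"), "abd"));
    }
}
```

A compare(CharSequence, CharSequence) overload on Collator itself would let an implementation skip the toString() step entirely.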

How difficult would it be for the compare methods to accept CharSequence? This would help improve performance with the new Javadoc.
18-01-2016

WORK AROUND String(int[] codePoints, int offset, int count) can be used to convert a single code point to a String.
20-02-2008
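A minimal sketch of this workaround follows. The class and method names (CodePointToString, codePointToString) are hypothetical; only the String(int[], int, int) constructor comes from the JDK. Note that it still allocates a one-element int[] and a String per call, so it stands in for the requested Character.toString(int codepoint) but does not reduce allocations.

```java
class CodePointToString {
    // Hypothetical helper: wraps the code point in a one-element array and
    // uses the existing String(int[] codePoints, int offset, int count) constructor.
    static String codePointToString(int codePoint) {
        return new String(new int[] { codePoint }, 0, 1);
    }

    public static void main(String[] args) {
        // A supplementary code point becomes a surrogate pair, i.e. two chars.
        System.out.println(codePointToString(0x1F600).length());
    }
}
```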

EVALUATION Note that in many locales collation cannot be performed on single code points, because a collating element can consist of multiple code points. The proposed Collator.compare(int codepoint1, int codepoint2) therefore would not work in general.
20-02-2008
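The point can be illustrated with a precomposed character and its decomposed form. The sketch below is an assumption-laden example, not part of the report: the class CollatingElementDemo and the helper equalPerCodePoint are hypothetical, and it assumes the default collator for Locale.US with canonical decomposition enabled. The whole-string comparison sees one collating element on each side, while a comparison that walks code point by code point cannot even pair the inputs up.

```java
import java.text.Collator;
import java.util.Locale;

class CollatingElementDemo {
    // Hypothetical helper: compares two strings one code point at a time and
    // reports equality only if every pair of code points collates as equal.
    static boolean equalPerCodePoint(Collator collator, String a, String b) {
        int[] cpsA = a.codePoints().toArray();
        int[] cpsB = b.codePoints().toArray();
        if (cpsA.length != cpsB.length) {
            return false; // different code point counts cannot be paired up
        }
        for (int i = 0; i < cpsA.length; i++) {
            String sa = new String(cpsA, i, 1);
            String sb = new String(cpsB, i, 1);
            if (collator.compare(sa, sb) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        String precomposed = "\u00E9"; // one code point: e with acute
        String decomposed = "e\u0301"; // two code points: e + combining acute accent

        System.out.println("whole strings: " + collator.compare(precomposed, decomposed));
        System.out.println("per code point equal: "
                + equalPerCodePoint(collator, precomposed, decomposed));
    }
}
```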