JDK-4867170 : Pattern doesn't work with composite character in CANON_EQ mode
  • Type: Bug
  • Status: Closed
  • Resolution: Fixed
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Priority: P3
  • Affected Version: 1.4.0,6
  • OS: generic,windows
  • CPU: generic,x86
  • Submit Date: 2003-05-21
  • Updated Date: 2016-08-26
  • Resolved Date: 2016-05-11
JDK 9: 9 b119 (Fixed)
Description
(1) A "Character Classes" pattern that contains only composite characters
    throws an exception in CANON_EQ mode. The example below shows the problem.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegTest {

    public static void main(String[] args) {

        // Input contains the precomposed character U+1F82.
        CharSequence inputStr = "ab\u1f82cd";
        // Character class made up of composite characters only.
        String patternStr = "[\u1f80\u1f82]";

        // Compiling with CANON_EQ is where the reported exception occurs.
        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            // Print the match as <start,end>.
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        }
    }
}

(2) Replacing the pattern with
    String patternStr = "\u1f80\u1f82";
    also throws an exception.


(3) Pattern "[\u1f80-\u1f82]" finds no match in the input string
    "ab\u1f81cd" in CANON_EQ mode, although it does match the characters
    \u1f80 and \u1f82. In CANON_EQ mode, every character in the "Range"
    needs to be iterated and all of its "EquivalentAlternation" forms listed.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegTest {

    public static void main(String[] args) {

        // U+1F81 lies inside the range used by the pattern.
        CharSequence inputStr = "ab\u1f81cd";
        String patternStr = "[\u1f80-\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            // Print the match as <start,end>.
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        } else {
            // This branch is the one reported for this bug.
            System.out.println("No Match");
        }
    }
}
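
The range iteration suggested in (3) could look roughly like the sketch below. It is only an illustration built on java.text.Normalizer, not the code inside Pattern: the class RangeExpansionSketch and the helper expandRange are hypothetical names, and the sketch lists only the precomposed form and the full canonical decomposition (NFD) of each character. Partially composed forms and reordered combining marks are not covered.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeExpansionSketch {

    // Hypothetical helper: for every code point in [first, last], add the
    // character itself and its NFD form to a non-capturing alternation.
    static String expandRange(int first, int last) {
        Set<String> alternatives = new LinkedHashSet<>();
        for (int cp = first; cp <= last; cp++) {
            String composed = new String(Character.toChars(cp));
            String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
            // Pattern.quote keeps each alternative literal.
            alternatives.add(Pattern.quote(composed));
            alternatives.add(Pattern.quote(decomposed));
        }
        return "(?:" + String.join("|", alternatives) + ")";
    }

    public static void main(String[] args) {
        // Compiled WITHOUT CANON_EQ: the equivalents are spelled out explicitly.
        Pattern pattern = Pattern.compile(expandRange(0x1f80, 0x1f82));
        Matcher matcher = pattern.matcher("ab\u1f81cd");
        if (matcher.find()) {
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">");
        } else {
            System.out.println("No Match");
        }
    }
}
```

Because the alternatives are collected in a set, a character whose NFD form equals itself contributes only one entry.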

(4) Though not critical, produceEquivalentAlternation() seems to create
    some redundant patterns when dealing with multiple combining
    characters in CANON_EQ mode.

   For example, the pattern "\u1f80" creates

   (?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)

   and "\u1f82" creates

   (?: 0x3b1 0x313 0x300 0x345 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x3b1 0x313 0x345 0x300 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x3b1 0x345 0x313 0x300 | 0x1fb3 0x313 0x300 | 0x1f80 0x300 | 0x1f82)

   (Spaces have been added between the hexadecimal numbers for readability.)
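
To make the redundancy in (4) concrete, here is a small sketch that removes repeated alternatives from the printed form of the first alternation above. It works on that text dump only and says nothing about how Pattern stores the alternation internally. DedupAlternationSketch and dedupe are hypothetical names used for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class DedupAlternationSketch {

    // Hypothetical helper: takes a dump of the form "(?: a | b | a)" and
    // returns it with duplicates dropped, keeping first-seen order.
    static String dedupe(String alternation) {
        String trimmed = alternation.trim();
        // Strip the leading "(?:" and the trailing ")".
        String inner = trimmed.substring("(?:".length(), trimmed.length() - 1);
        Set<String> unique = new LinkedHashSet<>();
        for (String alternative : inner.split("\\|")) {
            unique.add(alternative.trim());
        }
        return "(?: " + String.join(" | ", unique) + ")";
    }

    public static void main(String[] args) {
        String reported = "(?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80"
                + " | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)";
        // The second 0x1f80 is dropped, leaving five alternatives.
        System.out.println(dedupe(reported));
    }
}
```

Collecting the alternatives in a LinkedHashSet keeps their order stable, so the result stays easy to compare with the original dump.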

Comments
EVALUATION It is correct that CANON_EQ mode does have certain limitations and inefficiencies. The complexity of Unicode regular expression support prevents us from supporting much beyond Level 1 as described in Unicode Technical Standard #18. Perhaps our equivalence support will be extended in a future release. 2004-02-19
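
WORK AROUND (sketch, not part of the original report) Given the CANON_EQ limitations described above, one possible way around them is to normalize both the pattern and the input to NFC with java.text.Normalizer and then match without CANON_EQ. The names NormalizeFirstSketch and findNormalized are hypothetical, and the approach assumes the caller can accept match offsets that refer to the normalized text, not the original input.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class NormalizeFirstSketch {

    // Hypothetical helper: normalizes pattern and input to NFC, then searches
    // with a plain (non-CANON_EQ) pattern.
    static boolean findNormalized(String patternStr, CharSequence input) {
        String nfcPattern = Normalizer.normalize(patternStr, Normalizer.Form.NFC);
        String nfcInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        return Pattern.compile(nfcPattern).matcher(nfcInput).find();
    }

    public static void main(String[] args) {
        // Precomposed input from case (3).
        System.out.println(findNormalized("[\u1f80-\u1f82]", "ab\u1f81cd"));
        // Same text with U+1F81 written as base letter plus combining marks.
        System.out.println(findNormalized("[\u1f80-\u1f82]", "ab\u03b1\u0314\u0345cd"));
    }
}
```

Normalizing up front means the range only has to contain precomposed characters, at the cost of an extra pass over the input.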