Bug ID: JDK-4867170 Pattern doesn't work with composite character in CANON

Type: Bug
Component: core-libs
Sub-Component: java.util.regex
Affected Version: 1.4.0,6

Priority: P3
Status: Closed
Resolution: Fixed
OS: generic,windows
CPU: generic,x86

Submitted: 2003-05-21
Updated: 2016-08-26
Resolved: 2016-05-11

Fix Version: 9 b119

(1) A pattern that consists only of a "Character Classes" expression
    containing composite characters throws an exception. The example
    below shows the problem.

import java.util.regex.*;

public class RegTest {

    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f82cd";
        String patternStr = "[\u1f80\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        }
    }
}

(2) Replacing the pattern with
    String patternStr = "\u1f80\u1f82";
    also throws an exception.


(3) Pattern "[\u1f80-\u1f82]" does not match the input string
    "ab\u1f81cd" in CANON_EQ mode, though it does match the characters
    \u1f80 and \u1f82. In CANON_EQ mode, every character in a "Range"
    needs to be enumerated and its "EquivalentAlternation" listed.

import java.util.regex.*;
public class RegTest {
    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f81cd";
        String patternStr = "[\u1f80-\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        } else {
            System.out.println("No Match");
        }

    }
}
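
A possible workaround for case (3), sketched here under stated
assumptions rather than taken from the eventual fix: normalize the input
to NFC with java.text.Normalizer and match the range without CANON_EQ.
The class name NfcRangeMatch and the helper findInNfc are hypothetical.
The sketch assumes the pattern itself is already in composed (NFC) form,
and the reported offsets refer to the normalized text, not the original
input.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NfcRangeMatch {

    // Hypothetical helper: normalizes the input to NFC, then searches it
    // with a plain (non-CANON_EQ) pattern. Returns "<start,end>" for the
    // first match, or "No Match". Offsets are relative to the normalized text.
    static String findInNfc(String patternStr, String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        Matcher matcher = Pattern.compile(patternStr).matcher(normalized);
        return matcher.find()
                ? "<" + matcher.start() + "," + matcher.end() + ">"
                : "No Match";
    }

    public static void main(String[] args) {
        // Fully decomposed spelling of U+1F81: U+03B1 U+0314 U+0345
        String decomposed = "ab\u03b1\u0314\u0345cd";
        System.out.println(findInNfc("[\u1f80-\u1f82]", decomposed));
    }
}
```

This sidesteps the range problem only when the caller controls both the
pattern and the input; it does not change the behavior of CANON_EQ.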

(4) Though not critical, produceEquivalentAlternation() seems to create
    some redundant patterns when dealing with multiple combining
    characters in CANON_EQ mode.

   for example

   pattern "\u1f80" will create
 (?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)     

   and "\u1f82" will create
(?: 0x3b1 0x313 0x300 0x345 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x3b1 0x313 0x345 0x300 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x3b1 0x345 0x313 0x300 | 0x1fb3 0x313 0x300 | 0x1f80 0x300 | 0x1f82)

   (Spaces have been added between the hexadecimal numbers for readability.)
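
The redundancy in (4) could in principle be removed by collecting the
alternatives in an insertion-ordered set before joining them. The
following is a minimal sketch of that idea, not the actual code of
produceEquivalentAlternation(); the class AlternationDedup and the
helper buildAlternation are hypothetical names, and the alternatives
are assumed to contain no regex metacharacters.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AlternationDedup {

    // Hypothetical helper: drops duplicate alternatives while keeping
    // first-seen order, then joins the rest into a non-capturing group.
    static String buildAlternation(List<String> alternatives) {
        Set<String> unique = new LinkedHashSet<>(alternatives);
        return "(?:" + String.join("|", unique) + ")";
    }

    public static void main(String[] args) {
        // The six alternatives listed above for U+1F80; the last one
        // repeats the third.
        List<String> forms = Arrays.asList(
                "\u03b1\u0313\u0345", "\u1f00\u0345", "\u1f80",
                "\u03b1\u0345\u0313", "\u1fb3\u0313", "\u1f80");
        String group = buildAlternation(forms);
        // Print the group as hex code points, one alternative per line
        for (String alt : group.substring(3, group.length() - 1).split("\\|")) {
            StringBuilder hex = new StringBuilder();
            alt.codePoints().forEach(cp -> hex.append(String.format("0x%x ", cp)));
            System.out.println(hex.toString().trim());
        }
    }
}
```

For the U+1F80 list this leaves five alternatives instead of six.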

EVALUATION It is correct that CANON_EQ mode does have certain limitations and inefficiencies. The complexity of Unicode regular expression support prevents us from supporting much beyond Level 1 as described in Unicode Technical Standard #18. Perhaps our equivalence support will be extended in a future release. ###@###.### 2004-02-19
