JDK-4867170 : Pattern doesn't work with composite character in CANON_EQ mode

Details
Type: Bug
Submit Date: 2003-05-21
Status: Resolved
Updated Date: 2016-05-19
Project Name: JDK
Resolved Date: 2016-05-11
Component: core-libs
OS: generic,windows
Sub-Component: java.util.regex
CPU: generic,x86
Priority: P3
Resolution: Fixed
Affected Versions: 1.4.0,6
Fixed Versions:

Related Reports
Backport:

Sub Tasks

Description
(1) A "Character Classes" pattern that contains only composite characters
    throws an exception in CANON_EQ mode. The example below shows the problem.

import java.util.regex.*;

public class RegTest {

    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f82cd";
        String patternStr = "[\u1f80\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + Integer.toString(matcher.start())
                               + ","
                               + Integer.toString(matcher.end())
                               + ">  ");
        }
    }
}

(2) Replacing the pattern with
    String patternStr = "\u1f80\u1f82";
    also throws an exception.


(3) The pattern "[\u1f80-\u1f82]" does not match the input string
   "ab\u1f81cd" in CANON_EQ mode, though it does match the characters
   \u1f80 and \u1f82. The implementation needs to iterate over all
   characters in the "Range" and list all of their
   "EquivalentAlternation" forms in CANON_EQ mode.

import java.util.regex.*;
public class RegTest {
    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f81cd";
        String patternStr = "[\u1f80-\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + Integer.toString(matcher.start())
                               + ","
                               + Integer.toString(matcher.end())
                               + ">  ");
        } else {
            System.out.println("No Match");
        }

    }
}
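
To illustrate what "iterate all characters in the range" involves, here is a minimal sketch that walks the range U+1F80..U+1F82 and prints the full canonical decomposition of each character. It uses java.text.Normalizer rather than the regex engine's internal code, and the class and method names (RangeDecomposition, decompose, toHex) are made up for this illustration. It shows only the decomposed form of each character, not the full set of equivalent alternatives a fix would have to generate.

```java
import java.text.Normalizer;

class RangeDecomposition {

    // Returns the full canonical decomposition (NFD) of a single code point.
    static String decompose(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Formats a string as space-separated hexadecimal code points.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("0x").append(Integer.toHexString(cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every character in the range has its own decomposition, including
        // \u1f81, which the range pattern in (3) fails to match.
        for (int cp = 0x1f80; cp <= 0x1f82; cp++) {
            System.out.println("0x" + Integer.toHexString(cp)
                               + " -> " + toHex(decompose(cp)));
        }
    }
}
```

The decomposition printed for \u1f82 corresponds to the first alternative listed for that character in item (4).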

(4) Though not critical, produceEquivalentAlternation() seems to create
   some redundant patterns when dealing with multiple combining
   characters in CANON_EQ mode.

   For example:

   pattern "\u1f80" will create
 (?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)     

   and "\u1f82" will create
(?: 0x3b1 0x313 0x300 0x345 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x3b1 0x313 0x345 0x300 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x3b1 0x345 0x313 0x300 | 0x1fb3 0x313 0x300 | 0x1f80 0x300 | 0x1f82)

   # Spaces have been added between the hexadecimal numbers.
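
The redundancy in the alternation reported for "\u1f80" can be checked with a small sketch like the one below. It is not how produceEquivalentAlternation() is implemented; it only takes the alternatives exactly as listed in this report, treats each as a string, and drops repeats while keeping the original order. The class and method names (AlternationDedup, uniqueBranches) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class AlternationDedup {

    // Splits a "|"-separated alternation body and drops repeated branches,
    // keeping the first occurrence of each branch in its original position.
    static Set<String> uniqueBranches(String alternation) {
        String[] branches = alternation.trim().split("\\s*\\|\\s*");
        return new LinkedHashSet<>(Arrays.asList(branches));
    }

    public static void main(String[] args) {
        // Alternatives copied from the report for the pattern "\u1f80".
        String reported = "0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | "
                        + "0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80";
        Set<String> unique = uniqueBranches(reported);
        System.out.println("(?: " + String.join(" | ", unique) + ")");
    }
}
```

Of the six alternatives listed for "\u1f80", five are distinct; only the trailing 0x1f80 is a repeat.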

                                    

Comments
URL:   http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/d0c319c32334
User:  lana
Date:  2016-05-18 20:42:24 +0000

                                     
2016-05-18
URL:   http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d0c319c32334
User:  sherman
Date:  2016-05-11 04:19:37 +0000

                                     
2016-05-11
EVALUATION

It is correct that CANON_EQ mode does have certain limitations and inefficiencies. The complexity of Unicode regular expression support prevents us from supporting much beyond Level 1 as described in Unicode Technical Standard #18. Perhaps our equivalence support will be extended in a future release.
###@###.### 2004-02-19
                                     
2004-02-19
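
One possible way to sidestep the CANON_EQ limitations described in the evaluation, offered here only as a sketch and not as something the report recommends, is to normalize the input to NFC with java.text.Normalizer and compile the pattern without CANON_EQ. This assumes the pattern itself is written with precomposed characters, as in case (3). Note that the reported offsets refer to the normalized string, not the original input. The class and method names (NormalizedMatch, findNfc) are hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NormalizedMatch {

    // Normalizes the input to NFC and searches it with a plain pattern
    // (no CANON_EQ). Returns {start, end} of the first match in the
    // normalized string, or null if there is no match.
    static int[] findNfc(String regex, CharSequence input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        Matcher matcher = Pattern.compile(regex).matcher(normalized);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        // Input holds the decomposed form of \u1f81 (0x3b1 0x314 0x345).
        CharSequence inputStr = "ab\u03b1\u0314\u0345cd";
        String patternStr = "[\u1f80-\u1f82]";

        int[] span = findNfc(patternStr, inputStr);
        if (span != null) {
            System.out.println("<" + span[0] + "," + span[1] + ">");
        } else {
            System.out.println("No Match");
        }
    }
}
```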


