JDK-4867170 : Pattern doesn't work with composite character in CANON_EQ mode

Details
Type:
Bug
Submit Date:
2003-05-21
Status:
Open
Updated Date:
2014-09-08
Project Name:
JDK
Resolved Date:
Component:
core-libs
OS:
generic,windows
Sub-Component:
java.util.regex
CPU:
x86,generic
Priority:
P3
Resolution:
Unresolved
Affected Versions:
1.4.0,6
Targeted Versions:
tbd_major

Related Reports
Backport:

Sub Tasks

Description
(1) A pattern consisting only of a character class of composite
    characters throws an exception. The example below shows the problem.

import java.util.regex.*;

public class RegTest {

    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f82cd";
        String patternStr = "[\u1f80\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        }
    }
}

(2) Replacing the pattern with
    String patternStr = "\u1f80\u1f82";
    also throws an exception.


(3) Pattern "[\u1f80-\u1f82]" does not match the input string
   "ab\u1f81cd" in CANON_EQ mode, though it does match the characters
   \u1f80 and \u1f82. The implementation needs to iterate over all
   characters in the range and list the "EquivalentAlternation" of
   each one in CANON_EQ mode.

import java.util.regex.*;
public class RegTest {
    public static void main(String args[]) {

        CharSequence inputStr = "ab\u1f81cd";
        String patternStr = "[\u1f80-\u1f82]";

        Pattern pattern = Pattern.compile(patternStr, Pattern.CANON_EQ);
        Matcher matcher = pattern.matcher(inputStr);
        boolean matchFound = matcher.find();

        if (matchFound) {
            System.out.println("<" + matcher.start() + "," + matcher.end() + ">  ");
        } else {
            System.out.println("No Match");
        }

    }
}
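
A possible workaround for case (3), sketched here under assumptions and not taken from the JDK sources: compose the input to NFC with java.text.Normalizer and match the range without CANON_EQ. The class name NormalizedRangeMatch and the helper findInNormalized are hypothetical. Note that the reported offsets refer to the normalized string, not to the original input.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NormalizedRangeMatch {

    // Hypothetical helper: composes the input (NFC) and matches WITHOUT CANON_EQ.
    // Returns {start, end} relative to the normalized string, or null if no match.
    static int[] findInNormalized(String patternStr, String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        Matcher matcher = Pattern.compile(patternStr).matcher(normalized);
        return matcher.find() ? new int[] {matcher.start(), matcher.end()} : null;
    }

    public static void main(String[] args) {
        String patternStr = "[\u1f80-\u1f82]";

        // Same input as in the report, plus its fully decomposed form
        // (U+03B1 U+0314 U+0345 composes to U+1F81 under NFC).
        String precomposed = "ab\u1f81cd";
        String decomposed = "ab\u03b1\u0314\u0345cd";

        for (String input : new String[] {precomposed, decomposed}) {
            int[] span = findInNormalized(patternStr, input);
            if (span != null) {
                System.out.println("<" + span[0] + "," + span[1] + ">  ");
            } else {
                System.out.println("No Match");
            }
        }
    }
}
```

This only helps when every character in the range has a precomposed form and the caller can tolerate offsets into the normalized text. It is not a substitute for expanding the range inside Pattern itself.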

(4) Though not critical, produceEquivalentAlternation() seems to create
   some redundant patterns when dealing with multiple combining
   characters in CANON_EQ mode.

   for example

   pattern "\u1f80" will create
 (?: 0x3b1 0x313 0x345 | 0x1f00 0x345 | 0x1f80 | 0x3b1 0x345 0x313 | 0x1fb3 0x313 | 0x1f80)     

   and "\u1f82" will create
(?: 0x3b1 0x313 0x300 0x345 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x3b1 0x313 0x345 0x300 | 0x1f00 0x345 0x300 | 0x1f80 0x300 | 0x1f82 | 0x1f00 0x300 0x345 | 0x1f02 0x345 | 0x1f82 | 0x3b1 0x345 0x313 0x300 | 0x1fb3 0x313 0x300 | 0x1f80 0x300 | 0x1f82)

   # Spaces have been added between the hexadecimal numbers for readability.
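
The redundancy in the alternation listed above for "\u1f80" can be checked with a small sketch. It is not the Pattern implementation: the class EquivalentAlternationCheck and its helpers are hypothetical, and the alternatives are copied by hand from the listing. It removes repeated alternatives with a LinkedHashSet and uses java.text.Normalizer to confirm that each remaining one is canonically equivalent to U+1F80.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EquivalentAlternationCheck {

    // Hypothetical helper: drops repeated alternatives, keeping first-seen order.
    static Set<String> dedupe(List<String> alternatives) {
        return new LinkedHashSet<>(alternatives);
    }

    // True if both strings have the same canonical decomposition (NFD).
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // The six alternatives listed in the report for the pattern "\u1f80".
        List<String> listed = Arrays.asList(
                "\u03b1\u0313\u0345", "\u1f00\u0345", "\u1f80",
                "\u03b1\u0345\u0313", "\u1fb3\u0313", "\u1f80");

        Set<String> unique = dedupe(listed);
        System.out.println(listed.size() + " listed, " + unique.size() + " unique");

        // Each unique alternative should still be equivalent to U+1F80.
        for (String alt : unique) {
            System.out.println(canonicallyEquivalent(alt, "\u1f80"));
        }
    }
}
```

For this input the sketch reports 6 listed alternatives and 5 unique ones. The same check applied to the 18 alternatives listed for "\u1f82" would leave 9.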

                                    

Comments
EVALUATION

It is correct that CANON_EQ mode has certain limitations and inefficiencies. The complexity of Unicode regular expression support prevents us from supporting much beyond Level 1 as described in Unicode Technical Standard #18. Perhaps our equivalence support will be extended in a future release.
2004-02-19


