JDK-6486934 : RegEx case_insensitive match is broken

Details
Type: Bug
Submit Date: 2006-10-26
Status: Closed
Updated Date: 2011-03-08
Project Name: JDK
Resolved Date: 2011-03-08
Component: core-libs
OS: generic,windows_xp
Sub-Component: java.util.regex
CPU: x86,generic
Priority: P3
Resolution: Fixed
Affected Versions: 5.0,6
Fixed Versions:

Description
The case-folding spec for regex clearly says:

CASE_INSENSITIVE
 By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched. Unicode-aware case-insensitive
matching can be enabled by specifying the UNICODE_CASE flag in conjunction
with this flag.

UNICODE_CASE
 When this flag is specified then case-insensitive matching, when enabled
by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
Standard. By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched.
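
Read literally, the two flag descriptions above mean that CASE_INSENSITIVE alone folds only ASCII letters, and that non-ASCII letters fold only when both flags are set. A minimal sketch of that reading follows; the class and helper names are hypothetical, and the printed results depend on the JDK in use (an affected 5.0 or 6 release may not follow the spec).

```java
import java.util.regex.Pattern;

// Sketch only: SpecFlagsDemo and matches() are illustrative names, not JDK API.
final class SpecFlagsDemo {
    // True if the whole of text matches regex under the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        int ciUc = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // ASCII letter: per the spec, CASE_INSENSITIVE alone is enough.
        System.out.println("a  vs A,  CI    : " + matches("a", ci, "A"));
        // Cyrillic letter: per the spec, folding needs both flags.
        System.out.println("Cyrillic, CI    : " + matches("\u0430", ci, "\u0410"));
        System.out.println("Cyrillic, CI+UC : " + matches("\u0430", ciUc, "\u0410"));
    }
}
```

On an implementation that follows the spec, the middle line is the only one expected to report no match.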

But our implementation disagrees with our own spec in two ways:

(1) UNICODE_CASE is mostly treated as UNICODE_CASE_INSENSITIVE,
which means the match is case-insensitive whether or not
CASE_INSENSITIVE is enabled. We only "accidentally" follow
the spec in the character class case, when the specified character is
in Basic Latin (ASCII) or the Latin-1 Supplement (<=0xff).

1.4.x does follow the spec; the "regression" started with Tiger. Based
on the sccs history, the change was introduced by the fix for
#4908476 (which I believe is a mistake: the test cases shown in that
bug report use (?u) alone instead of (?iu)).
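
The difference between (?u) alone and (?iu) can be checked directly with embedded flag expressions. The sketch below is an assumption-laden probe, not the test from #4908476: the class name is hypothetical, and the outcome depends on whether the running JDK treats UNICODE_CASE alone as case-insensitive.

```java
import java.util.regex.Pattern;

// Sketch only: EmbeddedFlagsDemo is an illustrative name.
final class EmbeddedFlagsDemo {
    // True if the whole of text matches regex (flags are embedded in regex).
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String lower = "\u0430"; // CYRILLIC SMALL LETTER A
        String upper = "\u0410"; // CYRILLIC CAPITAL LETTER A

        // (?u) sets UNICODE_CASE only; per the spec it should not fold case.
        System.out.println("(?u)  -> " + matches("(?u)" + lower, upper));
        // (?iu) sets CASE_INSENSITIVE and UNICODE_CASE together.
        System.out.println("(?iu) -> " + matches("(?iu)" + lower, upper));
    }
}
```

If the first line reports a match, the implementation is showing the behavior described in (1).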

(2) When CASE_INSENSITIVE is not accompanied by UNICODE_CASE,
case-insensitive matching is still being done for

a) Class_Single   Latin-1 Supplement
b) Class_Range    Non-ASCII
c) BackReference  Non-ASCII

We have had this buggy behavior since day one. It might be OK (really???) to extend
the interpretation of ASCII a little to cover all characters up to \u00ff, but
the inconsistency between different constructs is really a big deal.
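
One way to make the inconsistency between constructs visible is to run the same Latin-1 Supplement letter through each construct under a single flag setting and compare the answers. The following is a rough sketch with hypothetical names (ConstructProbe, probe, m); it reuses the Latin-1 patterns and inputs from the test case in this report and makes no claim about which JDK builds agree or disagree.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch only: ConstructProbe and its methods are illustrative names.
final class ConstructProbe {
    private static boolean m(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    // Matches lower-case Latin-1 patterns against upper-case input,
    // once per regex construct, and records each result.
    static Map<String, Boolean> probe(int flags) {
        Map<String, Boolean> results = new LinkedHashMap<String, Boolean>();
        results.put("single char", m("\u00e0", flags, "\u00c0"));
        results.put("slice", m("\u00e0\u00e1", flags, "\u00c0\u00c1"));
        results.put("class single", m("[\u00e0]", flags, "\u00c0"));
        results.put("class range", m("[\u00e0-\u00e5]", flags, "\u00c2"));
        results.put("back reference", m("(\u00e0)\\1", flags, "\u00e0\u00c0"));
        return results;
    }

    public static void main(String[] args) {
        // With CASE_INSENSITIVE alone, a consistent implementation gives
        // the same answer for every construct.
        for (Map.Entry<String, Boolean> e : probe(Pattern.CASE_INSENSITIVE).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Any mix of true and false in the output points at the construct-specific behavior listed in (2).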


The test cases are attached.

import java.util.regex.*;

public class Foo {
    public static void main(String[] args) {
        int failCount = 0;
        Pattern pattern;
        Matcher matcher;
        int flags;

        // ASCII               \u0061   "a"
        // Latin-1 Supplement  \u00e0   "a" + grave
        // Cyrillic            \u0430   cyrillic "a"
        String[] patterns = new String[] {
            // single char
            "a", "\u00e0", "\u0430",
            // slice of chars
            "ab", "\u00e0\u00e1", "\u0430\u0431",
            // class single
            "[a]", "[\u00e0]", "[\u0430]",
            // class range
            "[a-b]", "[\u00e0-\u00e5]", "[\u0430-\u0431]",
            // back reference
            "(a)\\1", "(\u00e0)\\1", "(\u0430)\\1"
        };
        // texts[i] is the input matched against patterns[i]
        String[] texts = new String[] {
            "A", "\u00c0", "\u0410",
            "AB", "\u00c0\u00c1", "\u0410\u0411",
            "A", "\u00c0", "\u0410",
            "B", "\u00c2", "\u0411",
            "aA", "\u00e0\u00c0", "\u0430\u0410"
        };
        // expected results with CASE_INSENSITIVE alone: only ASCII folds
        boolean[] expected = new boolean[] {
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false
        };

        // CASE_INSENSITIVE alone: ASCII-only case folding
        flags = Pattern.CASE_INSENSITIVE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches() != expected[i]) {
                System.out.println("<CI>    Failed at " + i);
                failCount++;
            }
        }

        // CASE_INSENSITIVE + UNICODE_CASE: every pair should match
        flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (!matcher.matches()) {
                System.out.println("<CI+UC> Failed at " + i);
                failCount++;
            }
        }

        // UNICODE_CASE alone should have no effect: nothing should match
        flags = Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches()) {
                System.out.println("<UC>    Failed at " + i);
                failCount++;
            }
        }
        System.out.println("Total failures: " + failCount);
    }
}

                                    

Comments
EVALUATION

To solve the above problems, we have several options.

For issue (1):
(a) Change the spec to say that UNICODE_CASE means UNICODE_CASE_INSENSITIVE, so
the match is case-insensitive even without CASE_INSENSITIVE.
(b) Roll back the fix for #4908476 so that Unicode case-folding matching applies
iff both UNICODE_CASE and CASE_INSENSITIVE are set.

For issue (2), when UNICODE_CASE is not present:
(a) Enforce CASE_INSENSITIVE for ASCII only.
(b) Modify the spec to say CASE_INSENSITIVE covers ASCII and the Latin-1 Supplement (<=0xff),
and update the implementation accordingly.
                                     
2006-10-26
EVALUATION

The proposed solution now is to follow the spec, which means
CASE_INSENSITIVE is for ASCII only, and UNICODE_CASE must be
accompanied by CASE_INSENSITIVE to get a "Unicode
case-insensitive match".
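
The rule in this evaluation can be written down as a small decision function. This is only a model of the proposed semantics, not the Pattern implementation; FoldingRule and foldingApplies are hypothetical names, and "ASCII" is taken here to mean code points below 0x80.

```java
import java.util.regex.Pattern;

// Sketch only: a model of the proposed rule, not code from java.util.regex.
final class FoldingRule {
    // True if, under the proposed rule, a pattern character with this code
    // point may be matched case-insensitively for the given flags.
    static boolean foldingApplies(int flags, int codePoint) {
        boolean ci = (flags & Pattern.CASE_INSENSITIVE) != 0;
        boolean uc = (flags & Pattern.UNICODE_CASE) != 0;
        if (!ci) {
            return false;              // UNICODE_CASE alone does nothing
        }
        return uc || codePoint < 0x80; // ASCII only unless UNICODE_CASE is set
    }

    public static void main(String[] args) {
        int[] flagSets = {
            0,
            Pattern.CASE_INSENSITIVE,
            Pattern.UNICODE_CASE,
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
        };
        String[] names = { "none", "CI", "UC", "CI+UC" };
        for (int i = 0; i < flagSets.length; i++) {
            System.out.println(names[i]
                + ": ascii=" + foldingApplies(flagSets[i], 'a')
                + " latin1=" + foldingApplies(flagSets[i], 0x00e0)
                + " cyrillic=" + foldingApplies(flagSets[i], 0x0430));
        }
    }
}
```

Under this model, Latin-1 Supplement characters behave exactly like any other non-ASCII character, which is option (a) for issue (2).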
                                     
2006-12-01


