JDK-6486934 : RegEx case_insensitive match is broken
  • Type: Bug
  • Status: Closed
  • Resolution: Fixed
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Priority: P3
  • Affected Version: 5.0,6
  • OS: generic,windows_xp
  • CPU: generic,x86
  • Submit Date: 2006-10-26
  • Updated Date: 2017-05-16
  • Resolved Date: 2011-03-08
JDK 6: 6u2 (Resolved)
JDK 7: 7 b06 (Fixed)
Description
The case-folding spec for java.util.regex.Pattern clearly says:

CASE_INSENSITIVE
 By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched. Unicode-aware case-insensitive
matching can be enabled by specifying the UNICODE_CASE flag in conjunction
with this flag.

UNICODE_CASE
 When this flag is specified then case-insensitive matching, when enabled
by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode
Standard. By default, case-insensitive matching assumes that only characters
in the US-ASCII charset are being matched.
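
As a minimal sketch of what the quoted spec implies, the following program matches a lowercase pattern against its uppercase counterpart under each flag combination. The class name CaseFlagDemo and the helper matches are illustrative only. The values in the comments are what the spec calls for; releases affected by this bug may print something different.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {
    // Returns true if the whole text matches the regex under the given flags.
    static boolean matches(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        int uc = Pattern.UNICODE_CASE;

        // ASCII: CASE_INSENSITIVE alone is enough.
        System.out.println(matches("a", "A", ci));                // spec: true
        // Latin-1 Supplement: folding requires both flags.
        System.out.println(matches("\u00e0", "\u00c0", ci));      // spec: false
        System.out.println(matches("\u00e0", "\u00c0", ci | uc)); // spec: true
        // UNICODE_CASE alone should not enable case-insensitive matching.
        System.out.println(matches("\u00e0", "\u00c0", uc));      // spec: false
        System.out.println(matches("a", "A", uc));                // spec: false
    }
}
```
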

But our implementation disagrees with our own spec in two ways.

(1) UNICODE_CASE is mostly treated as UNICODE_CASE_INSENSITIVE,
which means the match is case insensitive whether or not
CASE_INSENSITIVE is enabled. We only "accidentally" follow
the spec in the character class case when the specified character is
in Basic Latin (ASCII) or the Latin-1 Supplement (<=0xff).

1.4.x does follow the spec; the "regression" started with Tiger. Based
on the SCCS history, the change was introduced by the fix for
#4908476 (which I believe was a mistake: the test cases shown in that
bug report use (?u) alone instead of (?iu)).
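
The difference between (?u) and (?iu) can be checked with the embedded-flag forms directly. This is a hedged sketch: EmbeddedFlagDemo and its matches helper are hypothetical names, and the commented values are the spec-conforming results, which the Tiger behavior described above would not produce for the (?u)-only case.

```java
import java.util.regex.Pattern;

class EmbeddedFlagDemo {
    // Thin wrapper over Pattern.matches so each case reads as one line.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String lower = "\u0430"; // Cyrillic small letter a
        String upper = "\u0410"; // Cyrillic capital letter A

        // (?u) alone sets UNICODE_CASE only; no case folding per the spec.
        System.out.println(matches("(?u)" + lower, upper));  // spec: false
        // (?i) alone folds ASCII only, so a Cyrillic letter is not folded.
        System.out.println(matches("(?i)" + lower, upper));  // spec: false
        // (?iu) sets both flags and enables Unicode-aware folding.
        System.out.println(matches("(?iu)" + lower, upper)); // spec: true
    }
}
```
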

(2) When CASE_INSENSITIVE is not accompanied by UNICODE_CASE,
case-insensitive matching is still done for

a) Class_Single   Latin-1 Supplement
b) Class_Range    Non-ASCII
c) BackReference  Non-ASCII

We have had this buggy behavior since day one. It might be OK (really???) to extend
the interpretation of ASCII a little to cover all characters up to \u00ff, but
the inconsistency between the different constructs is a real problem.


The test case is attached below.

import java.util.regex.*;

public class Foo {
    public static void main(String[] args) {
        int failCount = 0;
        Pattern pattern;
        Matcher matcher;
        int flags = 0;
        // ASCII               \u0061   "a"
        // Latin-1 Supplement  \u00e0   "a" + grave
        // Cyrillic            \u0430   cyrillic "a"
        String[] patterns = new String[] {
            // single char
            "a", "\u00e0", "\u0430",
            // slice of chars
            "ab", "\u00e0\u00e1", "\u0430\u0431",
            // class single
            "[a]", "[\u00e0]", "[\u0430]",
            // class range
            "[a-b]", "[\u00e0-\u00e5]", "[\u0430-\u0431]",
            // back reference
            "(a)\\1", "(\u00e0)\\1", "(\u0430)\\1"
        };
        // texts[i] is matched against patterns[i]; each differs only in case
        String[] texts = new String[] {
            "A", "\u00c0", "\u0410",
            "AB", "\u00c0\u00c1", "\u0410\u0411",
            "A", "\u00c0", "\u0410",
            "B", "\u00c2", "\u0411",
            "aA", "\u00e0\u00c0", "\u0430\u0410"
        };
        // expected results for CASE_INSENSITIVE alone: only ASCII matches
        boolean[] expected = new boolean[] {
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false,
            true, false, false
        };

        // CASE_INSENSITIVE alone should fold ASCII only
        flags = Pattern.CASE_INSENSITIVE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches() != expected[i]) {
                System.out.println("<CI>    Failed at " + i);
                failCount++;
            }
        }

        // both flags together should match every case
        flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (!matcher.matches()) {
                System.out.println("<CI+UC> Failed at " + i);
                failCount++;
            }
        }

        // flag unicode_case alone should do nothing
        flags = Pattern.UNICODE_CASE;
        for (int i = 0; i < patterns.length; i++) {
            pattern = Pattern.compile(patterns[i], flags);
            matcher = pattern.matcher(texts[i]);
            if (matcher.matches()) {
                System.out.println("<UC>    Failed at " + i);
                failCount++;
            }
        }
        System.out.println("Total failure :" + failCount);
    }
}

Comments
EVALUATION The proposed solution now is to follow the spec: CASE_INSENSITIVE is for ASCII only, and UNICODE_CASE must be accompanied by CASE_INSENSITIVE to get a Unicode case-insensitive match.
2006-12-01

EVALUATION To solve the above problems, we have several options.
For issue (1):
(a) Change the spec to say that UNICODE_CASE means UNICODE_CASE_INSENSITIVE, so the match is case insensitive even without CASE_INSENSITIVE.
(b) Roll back the fix for #4908476 so that Unicode case-folding matching is applied if and only if both UNICODE_CASE and CASE_INSENSITIVE are set.
For issue (2), when UNICODE_CASE is not present:
(a) Enforce CASE_INSENSITIVE for ASCII only.
(b) Modify the spec to say CASE_INSENSITIVE covers ASCII and the Latin-1 Supplement (<=0xff), and update the implementation accordingly.
2006-10-26