JDK-4881248 : REGRESSION: Regular expression matching bug with text with non-ascii characters
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 1.4.2
  • Priority: P3
  • Status: Closed
  • Resolution: Duplicate
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2003-06-19
  • Updated: 2003-06-19
  • Resolved: 2003-06-19
Related Reports
Duplicate :  
Description

Name: rmT116609			Date: 06/19/2003


FULL PRODUCT VERSION :
java version "1.4.2-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-beta-b19)
Java HotSpot(TM) Client VM (build 1.4.2-beta-b19, mixed mode)

FULL OS VERSION :
Microsoft Windows XP [Version 5.1.2600]

EXTRA RELEVANT SYSTEM CONFIGURATION :
Regional Settings: Turkish

A DESCRIPTION OF THE PROBLEM :
J2SDK 1.4.2 beta appears to have a serious regex matching bug with strings that contain non-ASCII Unicode characters. In my case, the string contained some Turkish characters.
The regex is simple: <[^>]*>, which matches runs of text enclosed in <>
(e.g. <field>).
Matching succeeds with J2SDK 1.4.1_02, but with 1.4.2 beta the same pattern fails to match text that contains non-ASCII characters.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the following code excerpt with JDK 1.4.2 beta:

String text="text with some <ascii> and non ascii<����������> characters>";
Pattern pt=Pattern.compile("<([^>]*)>");
Matcher mc=pt.matcher(text);
while (mc.find()){
    String s = mc.group();
    System.out.println("s = " + s);
}


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
s = <ascii>
s = <����������>
ACTUAL -
s = <ascii>

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugTest {
    public static void main(String[] args) {
        String text="text with some <ascii> and non ascii<����������> characters>";
        Pattern pt=Pattern.compile("<([^>]*)>");
        Matcher mc=pt.matcher(text);
        while (mc.find()){
            String s = mc.group();
            System.out.println("s = " + s);
        }
    }
}

---------- END SOURCE ----------
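
The test case above embeds the non-ASCII characters directly in the source file, so what the compiler sees depends on the encoding javac assumes. A minimal sketch of an encoding-independent variant follows, written with Unicode escapes and without language features newer than 1.4. The class name BugTestEscaped, the helper findTags, and the six Turkish letters chosen are assumptions made for illustration; they are not part of the original report, and the characters the submitter actually used are not known.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BugTestEscaped {
    // Same pattern as in the report: '<', any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<([^>]*)>");

    // Hypothetical helper: returns every match of TAG in text, one per line.
    static String findTags(String text) {
        StringBuffer out = new StringBuffer();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(m.group());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed stand-in for the characters in the report: six Turkish
        // letters written as Unicode escapes, so the source file is pure ASCII.
        String text = "text with some <ascii> and non ascii"
                + "<\u011f\u00fc\u015f\u0131\u00f6\u00e7> characters>";
        String tags = findTags(text);
        System.out.println(tags);
        // Report the number of matches separately, because the console
        // encoding may not be able to display the matched characters.
        int count = tags.length() == 0 ? 0 : tags.split("\n").length;
        System.out.println("matches = " + count);
    }
}
```

On a release without the regression, two matches would be expected from this input; a single match would be consistent with the behavior described in this report.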

CUSTOMER SUBMITTED WORKAROUND :
Switching back to JDK 1.4.1_02, where that is possible, appears to be the only workaround.

Release Regression From : 1.4.1_02
The release above is the last known release in which this
behavior worked correctly. It has regressed since then.

(Review ID: 187695) 
======================================================================