Duplicate :
|
Name: rmT116609
Date: 06/19/2003

FULL PRODUCT VERSION :
java version "1.4.2-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-beta-b19)
Java HotSpot(TM) Client VM (build 1.4.2-beta-b19, mixed mode)

FULL OS VERSION :
Microsoft Windows XP [Version 5.1.2600]

EXTRA RELEVANT SYSTEM CONFIGURATION :
Regional Settings: Turkish

A DESCRIPTION OF THE PROBLEM :
j2sdk1.4.2b appears to have a serious regex matching bug with strings that contain Unicode characters. In my case, the string contained some Turkish characters.

The regex is simple: <[^>]*> matches runs of text enclosed in <> (e.g. <field>). Matching succeeds with j2sdk1.4.1_02, but 1.4.2b does not match text that contains Unicode characters.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the following code excerpt with JDK1.4.2b:

String text="text with some <ascii> and non ascii<����������> characters>";
Pattern pt=Pattern.compile("<([^>]*)>");
Matcher mc=pt.matcher(text);
while (mc.find()){
    String s = mc.group();
    System.out.println("s = " + s);
}

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
s = <ascii>
s = <����������>
ACTUAL -
s = <ascii>

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BugTest {
    public static void main(String[] args) {
        String text="text with some <ascii> and non ascii<����������> characters>";
        Pattern pt=Pattern.compile("<([^>]*)>");
        Matcher mc=pt.matcher(text);
        while (mc.find()){
            String s = mc.group();
            System.out.println("s = " + s);
        }
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Switching to JDK1.4.1_02, where possible, seems to be the only workaround.

Release Regression From : 1.4.1_02
The above release value was the last known release where this bug was known to work. Since then there has been a regression.

(Review ID: 187695)
======================================================================
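
The non-ASCII characters in the test string above did not survive in this copy of the report, so the test case cannot be re-run exactly as submitted. The following is a minimal sketch of an encoding-independent variant. It assumes the lost text was a run of Turkish letters and substitutes a hypothetical stand-in written with Unicode escapes; the class name UnicodeTagMatch and the helper findTags are illustrative and not part of the original submission. The pattern is the one from the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeTagMatch {
    // Same pattern as in the report: '<', any run of non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<([^>]*)>");

    // Collects every substring matched by TAG, in order of appearance.
    public static List<String> findTags(String text) {
        List<String> tags = new ArrayList<String>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical stand-in for the characters lost from the
        // report (c-cedilla, g-breve, dotless i, o-umlaut, s-cedilla, u-umlaut).
        // Unicode escapes keep the source file pure ASCII, so the result does
        // not depend on the compiler's source encoding or the console charset.
        String turkish = "\u00e7\u011f\u0131\u00f6\u015f\u00fc";
        String text = "text with some <ascii> and non ascii<" + turkish + "> characters>";

        List<String> tags = findTags(text);
        boolean secondIntact =
                tags.size() == 2 && tags.get(1).equals("<" + turkish + ">");

        // Report the outcome in ASCII only, so a misconfigured console
        // cannot hide or fake the result.
        System.out.println("matches found: " + tags.size());
        System.out.println("second match intact: " + secondIntact);
    }
}
```

On a runtime without the regression, this should report two matches with the second one intact. A count of 1 would correspond to the ACTUAL behavior described in the report.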