FULL PRODUCT VERSION :
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Linux mclane 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
If the CANON_EQ flag of java.util.regex.Pattern is used together with the quotation escapes \Q and \E, the resulting matcher does not behave as expected: decomposable characters inside the quoted section are not recognized.

Debugging and some analysis of the source code suggest that the problem lies in the Pattern.normalize method. This method seems to analyse the original pattern string and replace each decomposable character with a non-capturing group containing the various alternatives. However, it does not seem to account for the fact that the replacement may take place inside a \Q...\E sequence. As these escapes seem to be retained, the later processing treats the content as literal, i.e. \Qü\E becomes \Q(?:...|...|...)\E, so that the resulting matcher actually looks for an open parenthesis, a question mark, a colon and so on.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the supplied test program.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
p1 matches: true
p2 matches: true
ACTUAL -
p1 matches: true
p2 matches: false

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test {
    public static void main (String[] args) {
        String test = "\u00fc"; // u umlaut

        // Unquoted pattern with CANON_EQ: matches as expected.
        Pattern p1 = Pattern.compile ("\u00fc", Pattern.CANON_EQ);
        System.out.println ("p1 matches: " + p1.matcher (test).matches ());

        // Same character wrapped in \Q...\E with CANON_EQ: fails to match.
        Pattern p2 = Pattern.compile ("\\Q\u00fc\\E", Pattern.CANON_EQ);
        System.out.println ("p2 matches: " + p2.matcher (test).matches ());
    }
}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
None, other than not using CANON_EQ and \Q...\E at the same time.
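
For the case where the whole pattern is a quoted literal, one way to avoid combining CANON_EQ with \Q...\E might be to normalize both the literal and the input to the same Unicode form before matching. The following is only a sketch under that assumption, not part of the original report: the class name CanonEqWorkaround and the helper literalMatchesCanonically are hypothetical, and the choice of NFD is arbitrary (any single form applied to both sides should do). It does not cover patterns that mix quoted and unquoted parts.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class CanonEqWorkaround {

    // Hypothetical helper: treats `literal` as plain text and checks whether
    // `input` is canonically equivalent to it, without using CANON_EQ.
    static boolean literalMatchesCanonically(String literal, String input) {
        // Bring both sides to the same normalization form (NFD chosen here).
        String normLiteral = Normalizer.normalize(literal, Normalizer.Form.NFD);
        String normInput = Normalizer.normalize(input, Normalizer.Form.NFD);

        // Pattern.quote wraps the literal in quotation escapes; since CANON_EQ
        // is not set, the pattern is not rewritten before compilation.
        Pattern p = Pattern.compile(Pattern.quote(normLiteral));
        return p.matcher(normInput).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00fc";    // u umlaut as a single code point
        String decomposed = "u\u0308"; // u followed by combining diaeresis

        System.out.println("composed input:   "
                + literalMatchesCanonically(composed, composed));
        System.out.println("decomposed input: "
                + literalMatchesCanonically(composed, decomposed));
    }
}
```

Both calls in main should report true, because the composed and decomposed spellings normalize to the same sequence. The trade-off is that the input has to be normalized up front, so match offsets refer to the normalized string rather than the original one.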