Duplicate :
|
|
Relates :
|
|
Relates :
|
Name: jl125535 Date: 03/15/2004 FULL PRODUCT VERSION : java version "1.5.0-beta" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-beta-b32c) Java HotSpot(TM) Client VM (build 1.5.0-beta-b32c, mixed mode) ADDITIONAL OS VERSION INFORMATION : Microsoft Windows XP [Version 5.1.2600] A DESCRIPTION OF THE PROBLEM : This problem has been present since regex support was introduced in version 1.4.0, and may have led some users to conclude that the java.util.regex package is too slow. The problem, however, is not the matching speed, but the time it takes to *compile* a pattern that contains many optional groups. The accompanying sample code measures how long it takes to compile several patterns, each containing twenty-six optional groups. The first pattern, consisting of capturing groups with the default greedy question mark, takes 13.3 seconds to compile on my machine (Athlon XP 1700+ w/512M RAM, running WinXP Home). In the next pattern, I made the question marks lazy - "(x)??" - and it takes even longer to compile, about 15 seconds. But when I make the question marks possessive - "(x)?+" - the pattern compiles in no time at all. The fourth pattern shows that changing to non-capturing groups makes things slightly worse; that one takes 14.5 seconds to compile on my machine. The fifth pattern, however, is the most interesting; simply adding the '^' start anchor to the beginning of the pattern brings the compilation time down to nothing. Anchoring only at the end, as in the sixth pattern, actually increases the compilation time 16.4 seconds), but a pattern that's anchored at both ends compiles instantly. I realize this bug is not very likely to affect the average user, so I don't expect it to be assigned a high priority. It is a bug, though, and should be fixed eventually; it shouldn't be possible to write a regex that takes fifteen seconds to compile. In the meantime, you could add a section to the Pattern class doc containing tips to help speed up pattern compilation and/or matching. The first two of my suggested workarounds are good general advice anyway, the third one could be generalized to "use possessive closures or independent groups whenever you can", and the fourth could be listed as applying only to this edge case. REPRODUCIBILITY : This bug can be reproduced always. ---------- BEGIN SOURCE ---------- import java.util.regex.*; public class Test { public static void main(String[] args) { String[] regexes = { "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" + "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?", "(a)??(b)??(c)??(d)??(e)??(f)??(g)??(h)??(i)??(j)??" + "(k)??(l)??(m)??(n)??(o)??(p)??(q)??(r)??(s)??(t)??" + "(u)??(v)??(w)??(x)??(y)??(z)??", "(a)?+(b)?+(c)?+(d)?+(e)?+(f)?+(g)?+(h)?+(i)?+(j)?+" + "(k)?+(l)?+(m)?+(n)?+(o)?+(p)?+(q)?+(r)?+(s)?+(t)?+" + "(u)?+(v)?+(w)?+(x)?+(y)?+(z)?+", "(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" + "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" + "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?", "^(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" + "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?", "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" + "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?$", "^(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" + "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" + "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?$" }; for (int i = 0; i < regexes.length; i++) { long startTime = System.currentTimeMillis(); Pattern p = Pattern.compile(regexes[i]); long msec = (System.currentTimeMillis() - startTime); System.out.println("Pattern " + i + " compiles in " + msec + " milliseconds"); } } } ---------- END SOURCE ---------- CUSTOMER SUBMITTED WORKAROUND : 1) Don't compile patterns inside loops if you don't have to. 2) Always anchor patterns at both ends if appropriate, even if you're using the matches() method. 3) Use the possesive form of the '?' closure if you can. 4) Avoid patterns with many optional groups in them. (Incident Review ID: 243126) ====================================================================== ###@###.### 11/1/04 20:33 GMT
|