JDK-5013651 : java.util.regex.Pattern: Optional groups take too long to compile
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 1.4.2
  • Priority: P4
  • Status: Resolved
  • Resolution: Fixed
  • OS: solaris_2.5.1,windows_xp
  • CPU: x86
  • Submitted: 2004-03-15
  • Updated: 2014-09-22
  • Resolved: 2005-12-17
The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.
Other JDK 6
5.0u81Fixed 6 b65Fixed
Related Reports
Duplicate :  
Relates :  
Relates :  
Description
Name: jl125535			Date: 03/15/2004


FULL PRODUCT VERSION :
java version "1.5.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-beta-b32c)
Java HotSpot(TM) Client VM (build 1.5.0-beta-b32c, mixed mode)

ADDITIONAL OS VERSION INFORMATION :
Microsoft Windows XP [Version 5.1.2600]

A DESCRIPTION OF THE PROBLEM :
This problem has been present since regex support was introduced
in version 1.4.0, and may have led some users to conclude that
the java.util.regex package is too slow.  The problem, however,
is not the matching speed, but the time it takes to *compile* a
pattern that contains many optional groups.

The accompanying sample code measures how long it takes to
compile several patterns, each containing twenty-six optional
groups.  The first pattern, consisting of capturing groups with
the default greedy question mark, takes 13.3 seconds to compile
on my machine (Athlon XP 1700+ w/512M RAM, running WinXP Home).
In the next pattern, I made the question marks lazy - "(x)??" -
and it takes even longer to compile, about 15 seconds.  But when
I make the question marks possessive - "(x)?+" - the pattern
compiles in no time at all.

The fourth pattern shows that changing to non-capturing groups
makes things slightly worse; that one takes 14.5 seconds to
compile on my machine.  The fifth pattern, however, is the most
interesting; simply adding the '^' start anchor to the beginning
of the pattern brings the compilation time down to nothing.
Anchoring only at the end, as in the sixth pattern, actually
increases the compilation time 16.4 seconds), but a pattern
that's anchored at both ends compiles instantly.

I realize this bug is not very likely to affect the average user, so I
don't expect it to be assigned a high priority.  It is a bug, though,
and should be fixed eventually; it shouldn't be possible to write a
regex that takes fifteen seconds to compile.  In the meantime, you
could add a section to the Pattern class doc containing tips to help
speed up pattern compilation and/or matching.  The first two of my
suggested workarounds are good general advice anyway, the third one
could be generalized to "use possessive closures or independent groups
whenever you can", and the fourth could be listed as applying only to
this edge case.


REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
import java.util.regex.*;


public class Test
{
  public static void main(String[] args)
  {
    String[] regexes = {
      "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
      "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?",

      "(a)??(b)??(c)??(d)??(e)??(f)??(g)??(h)??(i)??(j)??" +
      "(k)??(l)??(m)??(n)??(o)??(p)??(q)??(r)??(s)??(t)??" +
      "(u)??(v)??(w)??(x)??(y)??(z)??",

      "(a)?+(b)?+(c)?+(d)?+(e)?+(f)?+(g)?+(h)?+(i)?+(j)?+" +
      "(k)?+(l)?+(m)?+(n)?+(o)?+(p)?+(q)?+(r)?+(s)?+(t)?+" +
      "(u)?+(v)?+(w)?+(x)?+(y)?+(z)?+",

      "(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" +
      "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" +
      "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?",

      "^(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
      "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?",

      "(a)?(b)?(c)?(d)?(e)?(f)?(g)?(h)?(i)?(j)?(k)?(l)?(m)?" +
      "(n)?(o)?(p)?(q)?(r)?(s)?(t)?(u)?(v)?(w)?(x)?(y)?(z)?$",

      "^(?:a)?(?:b)?(?:c)?(?:d)?(?:e)?(?:f)?(?:g)?(?:h)?(?:i)?" +
      "(?:j)?(?:k)?(?:l)?(?:m)?(?:n)?(?:o)?(?:p)?(?:q)?(?:r)?" +
      "(?:s)?(?:t)?(?:u)?(?:v)?(?:w)?(?:x)?(?:y)?(?:z)?$"
    };

    for (int i = 0; i < regexes.length; i++)
    {
      long startTime = System.currentTimeMillis();
      Pattern p = Pattern.compile(regexes[i]);
      long msec = (System.currentTimeMillis() - startTime);
      System.out.println("Pattern " + i + " compiles in " +
                         msec + " milliseconds");
    }
  }
}

---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
1) Don't compile patterns inside loops if you don't have to.

2) Always anchor patterns at both ends if appropriate, even if
   you're using the matches() method.

3) Use the possesive form of the '?' closure if you can.

4) Avoid patterns with many optional groups in them.
(Incident Review ID: 243126) 
======================================================================
###@###.### 11/1/04 20:33 GMT

Comments
EVALUATION Apparently the study methods of the Pattern node classes have poor time efficiency. Should be investigated. ###@###.### 2004-03-15 Branch is used for optional groups and its study computes the info for each branch separately. It is quite wasteful when applied recursively since each branch is computed separately in each node all the way down. This has been fixed by taking advantage of the fact that in a Branch created for an optional group, the two branches are almost the same-- they differ only by the optional group. Thus study can calculate the info for the rest of the tree only once, and add in the info for the optional group to get the result for the other branch. ###@###.### 2004-04-23
23-04-2004