JDK-4785712 : The '#' in regex character class is treated as comment
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 1.4.1
  • Priority: P4
  • Status: Closed
  • Resolution: Won't Fix
  • OS: windows_2000
  • CPU: x86
  • Submitted: 2002-11-27
  • Updated: 2006-06-15
  • Resolved: 2006-06-15
Description
Name: nt126004			Date: 11/27/2002


FULL PRODUCT VERSION :
java version "1.4.1_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01-b01)
Java HotSpot(TM) Client VM (build 1.4.1_01-b01, mixed mode)


FULL OPERATING SYSTEM VERSION :
Microsoft Windows 2000 [Version 5.00.2195]

ADDITIONAL OPERATING SYSTEMS :
Linux


A DESCRIPTION OF THE PROBLEM :
When you use extended REs (m//x) in Perl, the hash symbol
introduces a comment that lasts until the end of the line.
That doesn't happen if the hash is inside a character class.

Although (?x) enables extended REs in Java, the hash symbol
is not treated as a literal inside a character class; it
still starts a comment that runs to the end of the line.
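
A minimal sketch of the difference described above, assuming the
java.util.regex behavior reported here. The class name HashInClassDemo
and the helper compiles() are hypothetical and are not part of the
original report; the patterns are shortened forms of the one in the
report.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the original report.
class HashInClassDemo {

    /** Returns true if the regex compiles, false if it is rejected. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Comments mode, unescaped '#': the comment swallows "/]* (\d+)",
        // so the character class is never closed.
        System.out.println(compiles("(?x) [ #/]* (\\d+)"));

        // Comments mode, escaped '#': the hash is a literal class member.
        System.out.println(compiles("(?x) [ \\#/]* (\\d+)"));

        // No comments mode: '#' has no special meaning at all.
        System.out.println(compiles("[ #/]* (\\d+)"));
    }
}
```

Under the reported behavior, only the first pattern should be rejected.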


STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile and run the code given below.


EXPECTED VERSUS ACTUAL BEHAVIOR :
I would expect no errors from this. However, the second
regex is parsed incorrectly, resulting in a
PatternSyntaxException. This behavior differs from Perl,
which treats # in a character class as a literal character,
not a comment character.


ERROR MESSAGES/STACK TRACES THAT OCCUR :
 $ javac example.java
 $ java example
Exception in thread "main"
java.util.regex.PatternSyntaxException : Unclosed character
class near index 71
 (?x)(?i) \b ( (?: D (?:efect)? | B (?:ug)? | Fix\ for) [ #/]* ) (\d+) \b
                                                             ^
        at java.util.regex.Pattern.error(Pattern.java:1489)
        at java.util.regex.Pattern.clazz(Pattern.java:2002)
        at java.util.regex.Pattern.sequence(Pattern.java:1546)
        at java.util.regex.Pattern.expr(Pattern.java:1506)
        at java.util.regex.Pattern.group0(Pattern.java:2248)
        at java.util.regex.Pattern.sequence(Pattern.java:1534)
        at java.util.regex.Pattern.expr(Pattern.java:1506)
        at java.util.regex.Pattern.compile(Pattern.java:1274)
        at java.util.regex.Pattern.<init>(Pattern.java:1030)
        at java.util.regex.Pattern.compile(Pattern.java:777)
        at java.lang.String.replaceAll(String.java:1710)
        at example.main(example.java:10)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
// example.java
import java.util.regex.*;
public class example {
 public static void main(String[] args) {
  String line = "this is a fix for defect 1234.";

  // this works...
  line = line.replaceAll("(?x)(?i) \\b ( (?: D (?:efect)? | B (?:ug)? | Fix\\ for)  [ \\#/]* ) (\\d+) \\b", "<a href=\"/bugdb.cgi?bug=$2\">$1$2</a>");

  // this should, but doesn't...
  line = line.replaceAll("(?x)(?i) \\b ( (?: D (?:efect)? | B (?:ug)? | Fix\\ for)  [ #/]* ) (\\d+) \\b", "<a href=\"/bugdb.cgi?bug=$2\">$1$2</a>");
 }
}
---------- END SOURCE ----------

CUSTOMER WORKAROUND :
 Use "\\#" instead of "#" inside a character class.
(Review ID: 166822) 
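
The workaround can be sketched as follows, applied to the regex from the
report. WorkaroundDemo and linkify() are hypothetical names introduced
for illustration. One assumption goes beyond the report: the space inside
the character class is escaped as well, on the understanding that
comments mode also skips unescaped whitespace there.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original report.
class WorkaroundDemo {

    // Same regex as in the report, with '#' escaped inside the class.
    // The space in the class is escaped too (assumption, see above).
    private static final Pattern BUG_REF = Pattern.compile(
        "(?x)(?i) \\b ( (?: D (?:efect)? | B (?:ug)? | Fix\\ for) [\\ \\#/]* ) (\\d+) \\b");

    /** Wraps each bug reference in a link to the bug database. */
    public static String linkify(String line) {
        return BUG_REF.matcher(line)
                      .replaceAll("<a href=\"/bugdb.cgi?bug=$2\">$1$2</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("this is a fix for defect 1234."));
    }
}
```

For the input line used in the report, this is expected to wrap
"defect 1234" in a link to /bugdb.cgi?bug=1234 and leave the rest of the
line unchanged.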
======================================================================

Comments
EVALUATION Matching Perl's behavior is not compelling enough to justify changing the current behavior after two major releases; compatibility weighs more here. Closed as "will not fix".
15-06-2006

EVALUATION It is true that we handle this differently than Perl does, but it is not a critical issue. This is only a problem when the extended comments flag (?x) is used and a hash appears inside a character class, and it is easy to work around. ###@###.### 2002-12-02
02-12-2002