JDK-6609854 : Regex does not match correctly for negative nested character classes
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 6
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2007-09-26
  • Updated: 2021-10-13
  • Resolved: 2016-05-11
9 b119: Fixed
Sub Tasks
JDK-8216391
> >> I have been looking into the definition of [character set]
> >> expressions in Java regular expressions, to understand what needs to
> >> be done to make ICU be compatible, or more compatible at least.
> >>
> >> There does not appear to be any formal definition for [set
> >> expressions], or at least not that I can find.
> >>
> >> Trying tests, one aspect of the behavior seems really odd. It would
> >> be good if we could find out from Sun whether it was really intended
> >> to work the way that it does.
> >>
> >> The question concerns the negation of a set,
> >> [^0-9], to get everything except for the ASCII digits, for example.
> >>
> >> In Java, the negation does _not_ apply to anything appearing in
> >> nested [brackets]
> >>
> >> So [^c]  does not match "c", as you would expect.
> >> [^[c]]  does match "c".  Not what I would expect.
> >> [[^c]]  does not match "c"
> >>
> >> The same holds true for ranges or property expressions - if they're
> >> inside brackets, a negation at an outer level does not affect them.
> >>
> >> [^a-z]  is opposite from [^[a-z]]
> >>
> >> And the same seems to hold for set expressions with &&, although the
> >> cases become hard to understand.
> >>
> >> Perl and POSIX behavior don't provide any guidance here, as they do
> >> not support nested brackets at all - a '[' is not special within a
> >> set, and just becomes yet another member of the set.
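
The three cases from the message above can be checked with a short program. This is only a sketch: the class name NestedNegationDemo and its matches helper are made up for illustration. The result for [^[c]] is deliberately left unannotated, since it is the behavior this report is about and it depends on the JDK release in use (the fix is in 9 b119).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the fix for this bug.
class NestedNegationDemo {

    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] regexes = { "[^c]", "[^[c]]", "[[^c]]" };
        for (String regex : regexes) {
            // "[^[c]]" is the contested case: its result differs across JDK releases.
            System.out.println(regex + " matches \"c\": " + matches(regex, "c"));
        }
    }
}
```

Running this on a release before and after the fix should make any difference in the [^[c]] row visible, while [^c] and [[^c]] reject "c" either way.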

The idea of nested character classes seems to be a Java-only construct. In Perl and Python it appears that, within a bracketed character class, the [ character is not treated specially; that is, it's treated as an ordinary character. Thus, for a "nested" regex such as "[ab[cd]ef]", the [ is treated as a member of the character class just like a, b, c, and d; the ] following the d closes the character class; and the trailing ef] is treated as being outside the character class. An unmatched close bracket ] is apparently also treated as an ordinary character: it matches an actual close bracket and is not a syntax error. Thus, in both Perl and Python, the string "aef]" will match the regex "[ab[cd]ef]".
https://docs.python.org/3.7/library/re.html
https://perldoc.perl.org/perlrecharclass#Bracketed-Character-Classes
In addition, Friedl, "Mastering Regular Expressions" 3/e (O'Reilly, 2006), refers to "Full class set operations" as a Java-only construct (chapter 3). There is also mention of character class nesting in Ruby and PowerGREP here:
https://www.regular-expressions.info/charclassintersect.html
This supports the idea that "nested" character classes are at least mostly a Java-only construct. Thus we can't obtain any guidance by looking at how other systems handle negation of nested character classes. [Heh, Xueming mentioned this in the description already. Good to know that we've reached the same conclusion.]
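
For contrast with the Perl and Python reading of "[ab[cd]ef]", here is a minimal sketch of how java.util.regex treats the same pattern: the inner [cd] is a nested class that is unioned with the surrounding members, so the whole expression is a single class matching one character from a through f. The class name NestedUnionDemo and its matches helper are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating Java's union reading of a nested class.
class NestedUnionDemo {

    // In Java this is one character class: the union of a, b, [cd], e, f.
    static final Pattern NESTED = Pattern.compile("[ab[cd]ef]");

    // Returns true if the whole input matches the nested-class pattern.
    static boolean matches(String input) {
        return NESTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "a", "c", "f", "[", "aef]" }) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```

With a whole-input match, Java accepts "a", "c", and "f" but rejects both "[" and "aef]", which is the opposite of the Perl and Python outcome described above for "aef]".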

This changeset ended up in the unified repo as follows:
http://hg.openjdk.java.net/jdk/jdk/rev/e7f3cf12e739
This changeset was also discussed in several other email review threads:
https://mail.openjdk.java.net/pipermail/core-libs-dev/2014-February/025314.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039366.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039540.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039551.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-May/040780.html
Unfortunately the history is convoluted. This particular behavior change seems to have been swept in with several other changes, and perhaps its full impact wasn't properly assessed.

For further detail see: http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html

URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/d0c319c32334 User: lana Date: 2016-05-18 20:42:24 +0000

URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d0c319c32334 User: sherman Date: 2016-05-11 04:19:37 +0000