Bug ID: JDK-6609854 Regex does not match correctly for negative nested character classes

Type: Bug
Component: core-libs
Sub-Component: java.util.regex
Affected Version: 6

Priority: P3
Status: Resolved
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2007-09-26
Updated: 2021-10-13
Resolved: 2016-05-11

Versions (Unresolved/Resolved/Fixed)

JDK 9 :	9 b119 (Fixed)

Related Reports

CSR :	JDK-8275184 - change in regex character class operator precedence
Relates :	JDK-8247728 - Regex behavior is different and now wrong comparing 8 and 11 (now)
Relates :	JDK-8189343 - Change of behavior of java.util.regex.Pattern between JDK 8 and JDK 9
Relates :	JDK-8215626 - The '^' operator (negation in char classes) in regex does not work properly
Relates :	JDK-8264671 - Update Pattern spec to provide details of character class syntax and behavior
Relates :	JDK-8228606 - Negation on nested character classes does not work

Sub Tasks

JDK-8216391 :	Release Note: Correction to negation function of RegEx character classes - Closed

Description

> >> I have been looking into the definition of [character set]
> >> expressions in Java regular expressions, to understand what needs to
> >> be done to make ICU be compatible, or more compatible at least.
> >>
> >> There does not appear to be any formal definition for [set
> >> expressions], or at least not that I can find.
> >>
> >> Trying tests, one aspect of the behavior seems really odd.  It would
> >> be good if we could find out from Sun whether it was really intended
> >> to work the way that it does.
> >>
> >> The question concerns the negation of a set,
> >> [^0-9], to get everything except for the ASCII digits, for example.
> >>
> >> In Java, the negation does _not_ apply to anything appearing in
> >> nested [brackets]
> >>
> >> So [^c]  does not match "c", as you would expect.
> >> [^[c]]  does match "c".  Not what I would expect.
> >> [[^c]]  does not match "c"
> >>
> >> The same holds true for ranges or property expressions - if they're
> >> inside brackets, a negation at an outer level does not affect them.
> >>
> >> [^a-z]  is opposite from [^[a-z]]
> >>
> >> And the same seems to hold for set expressions with &&, although the
> >> cases become hard to understand.
> >>
> >> Perl and POSIX behavior doesn't provide any guidance here, as they do
> >> not support nested brackets at all - a '[' is not special within a
> >> set, and just becomes yet another member of the set.
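
The cases listed in the description can be checked with a short Java program. This is a minimal sketch, not code from the report: the class name NestedNegationDemo and its matches helper are hypothetical names chosen for illustration. It prints the running Java version next to each result because the answer for the nested forms depends on the release: per this report, JDK 8 and earlier match "c" against [^[c]] and [^[a-z]], while builds containing the fix (JDK 9 b119 and later) should not.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class NestedNegationDemo {

    // True if the entire input matches the regex (Pattern.matches anchors at both ends).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Patterns taken from the description above.
        String[] patterns = { "[^c]", "[^[c]]", "[[^c]]", "[^a-z]", "[^[a-z]]" };

        // The results for the nested negations differ between JDK 8 and JDK 9+,
        // so report which runtime produced them.
        System.out.println("java.version = " + System.getProperty("java.version"));

        for (String p : patterns) {
            System.out.println(p + " matches \"c\": " + matches(p, "c"));
        }
    }
}
```

Running the same program on a JDK 8 and a JDK 9 or later runtime and comparing the lines for [^[c]] and [^[a-z]] shows the behavior change tracked by this issue.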

Comments

The idea of nested character classes seems to be a Java-only construct. In Perl and Python it appears that, within a bracketed character class, the [ character is not treated specially; that is, it's treated as an ordinary character. Thus, for a "nested" regex such as "[ab[cd]ef]" the [ is treated as a member of the character class like a, b, c, and d; the ] following the d closes the character class; and the ef] that trail are treated as if they are outside the character class. An unmatched close bracket ] is apparently also treated as an ordinary character, that is, it matches an actual close bracket, and it's not treated as a syntax error. Thus, in both Perl and Python, the string "aef]" will match the regex "[ab[cd]ef]".

https://docs.python.org/3.7/library/re.html
https://perldoc.perl.org/perlrecharclass#Bracketed-Character-Classes

In addition, Friedl "Mastering Regular Expressions" 3/e (O'Reilly, 2006) refers to "Full class set operations" as a Java-only construct (chapter 3). There is also mention of character class nesting in Ruby and PowerGREP here:

https://www.regular-expressions.info/charclassintersect.html

This supports the idea of "nested" character classes being at least mostly a Java-only construct. Thus we can't obtain any guidance by looking at other systems and how they handle negation of nested character classes.

[Heh, Xueming mentioned this in the description already. Good to know that we've reached the same conclusion.]
11-10-2021
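The difference between the two readings of "[ab[cd]ef]" can be sketched in Java itself. The class name BracketNestingDemo and the PERL_STYLE pattern are assumptions made for illustration: the second pattern escapes the inner [ and the trailing ] so that Java parses the expression the way the comment above describes Perl and Python parsing the unescaped form. It does not run Perl or Python.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class BracketNestingDemo {

    // Java reading: [cd] is a nested class, so the whole expression is a single
    // character class equal to the union {a, b, c, d, e, f}.
    static final Pattern JAVA_NESTED = Pattern.compile("[ab[cd]ef]");

    // Perl/Python reading, emulated in Java by escaping: the class {a, b, '[', c, d}
    // followed by the literal text "ef]".
    static final Pattern PERL_STYLE = Pattern.compile("[ab\\[cd]ef\\]");

    public static void main(String[] args) {
        System.out.println(JAVA_NESTED.matcher("e").matches());    // true: 'e' is in the union
        System.out.println(JAVA_NESTED.matcher("aef]").matches()); // false: the class matches one character only
        System.out.println(PERL_STYLE.matcher("aef]").matches());  // true: 'a' from the class, then "ef]"
        System.out.println(PERL_STYLE.matcher("[ef]").matches());  // true: '[' is an ordinary class member
    }
}
```

Because the same source text denotes two different languages in the two systems, the Perl and Python behavior gives no precedent for how negation should interact with nesting.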
This changeset ended up in the unified repo as follows:

http://hg.openjdk.java.net/jdk/jdk/rev/e7f3cf12e739

This changeset was also discussed in several other email review threads:

https://mail.openjdk.java.net/pipermail/core-libs-dev/2014-February/025314.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039366.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039540.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039551.html
https://mail.openjdk.java.net/pipermail/core-libs-dev/2016-May/040780.html

Unfortunately the history is convoluted. This particular behavior change seems to have been swept in with several other changes, and perhaps its full impact wasn't properly assessed.
16-07-2020
For further detail see: http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
10-01-2019
URL: http://hg.openjdk.java.net/jdk9/jdk9/jdk/rev/d0c319c32334 User: lana Date: 2016-05-18 20:42:24 +0000
18-05-2016
URL: http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d0c319c32334 User: sherman Date: 2016-05-11 04:19:37 +0000
11-05-2016