JDK-8275184 : change in regex character class operator precedence
  • Type: CSR
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Priority: P3
  • Status: Closed
  • Resolution: Approved
  • Fix Versions: 9
  • Submitted: 2021-10-13
  • Updated: 2021-10-19
  • Resolved: 2021-10-19
Description
Summary
-------

This is a retroactive CSR for [JDK-6609854][1], which changed the behavior of regex pattern operators on character classes. That behavior change could be regarded as incompatible, but it was never properly CSRed or documented. This CSR is a first attempt at a fairly rigorous description of the change. Subsequent work may update the appropriate specifications.

Problem
-------

Character classes are a feature of regex patterns. There are several set arithmetic operations possible on character classes. The operators are documented in the [Pattern][2] class specification, and they are described by some simple examples shown there. However, the behaviors of *combinations* of operators are not specified. The behavior in JDK 8 and earlier was well-defined and predictable, but it was complex, counterintuitive, and hard to explain.

This change (integrated in JDK 9) makes the behavior more sensible. However, it changed the behavior in a way that broke some existing uses of regex patterns. This has resulted in several bug reports as users stumbled over the behavior change. (See links from the main bug.) Given that this change was integrated in JDK 9 and has remained in place through JDK 11 -- an LTS release -- it seems like it's too late to revert this change. Instead, it's better to leave it in place and improve the documentation of the new behavior.

Solution
--------

BACKGROUND

Character classes occur within square brackets. Nesting of character classes is also possible by nesting sets of brackets. The operators on character classes are as follows:

*Range:* `-`

Constructs a character class consisting of a range between two literal characters. For example, `[a-z]` is a character class consisting of lowercase characters in the range from `a` to `z`, inclusive.

*Negation:* `^`

Immediately after the opening square bracket of a character class, negates (complements) the character class. For example, `[^a-z]` is a character class consisting of any character other than lowercase characters in the range `a` to `z`.
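
As a quick check of the range and negation operators, the following minimal sketch tests single characters against `[a-z]` and `[^a-z]`. The class name `RangeAndNegationDemo` and the helper `matches` are illustrative names only, not part of the JDK. These single-operator cases are not affected by the precedence change discussed later.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and helper names are hypothetical.
class RangeAndNegationDemo {
    // True if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-z]", "m"));   // true: 'm' lies within a..z
        System.out.println(matches("[a-z]", "M"));   // false: uppercase is outside the range
        System.out.println(matches("[^a-z]", "m"));  // false: the negation excludes a..z
        System.out.println(matches("[^a-z]", "5"));  // true: any other character matches
    }
}
```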

*Union:* (empty)

Results in the union of nested character classes, if they are adjacent to one another. For example, `[[a-f][d-h]]` is equivalent to `[a-h]`. A union also occurs between an outer character class and a nested character class. For example, `[a-m[n-z]]` is equivalent to the union of `[a-m]` and `[n-z]` which in turn is equivalent to `[a-z]`. 

Note that literal characters and character ranges written at the same level of a character class together define that one class; they do not form a union of multiple classes. This is true even if the characters or ranges at the same level are separated by an intervening nested class. For example, `[a-d[e-g]h-j]` is equivalent to the union of the top-level class `[a-dh-j]` with the nested class `[e-g]`. It is _not_ equivalent to the union of three character classes `[a-d]`, `[e-g]`, and `[h-j]`. This distinction is significant only in the JDK 8 behavior, where the union operator has a lower precedence than the negation operator.
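
The union equivalences above can be checked empirically. This is a minimal sketch that compares two character classes over the ASCII range only; the class name `UnionDemo` and the helper `sameOverAscii` are hypothetical names introduced for illustration. No negation is involved, so the results should not depend on the JDK version.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and helper names are hypothetical.
class UnionDemo {
    // True if both character classes accept exactly the same ASCII characters.
    static boolean sameOverAscii(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Adjacent nested classes are unioned.
        System.out.println(sameOverAscii("[[a-f][d-h]]", "[a-h]"));   // true
        // An outer class is unioned with a nested class.
        System.out.println(sameOverAscii("[a-m[n-z]]", "[a-z]"));     // true
        // Top-level class [a-dh-j] unioned with nested class [e-g].
        System.out.println(sameOverAscii("[a-d[e-g]h-j]", "[a-j]"));  // true
    }
}
```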

*Intersection:* `&&`

Results in a character class that is the intersection of two character classes. For example, `[a-h&&d-k]` is equivalent to `[d-h]`.
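
A short sketch of the intersection operator follows. The first pattern is the example from the text; the second, `[a-z&&[^bc]]`, is the subtraction idiom shown in the Pattern class documentation. The class name `IntersectionDemo` and the helper `matches` are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and helper names are hypothetical.
class IntersectionDemo {
    // True if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // [a-h&&d-k] keeps only the characters present in both ranges: d..h
        System.out.println(matches("[a-h&&d-k]", "c"));   // false: only in a-h
        System.out.println(matches("[a-h&&d-k]", "d"));   // true
        System.out.println(matches("[a-h&&d-k]", "h"));   // true
        System.out.println(matches("[a-h&&d-k]", "i"));   // false: only in d-k

        // Intersecting with an explicitly bracketed negated class subtracts characters.
        System.out.println(matches("[a-z&&[^bc]]", "b")); // false
        System.out.println(matches("[a-z&&[^bc]]", "d")); // true
    }
}
```

Because the negated class in the second pattern is enclosed in its own brackets, its scope is explicit and does not rely on operator precedence.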

(The examples above show the character class operators using literal characters and character ranges for the sake of simplicity. The set algebra operators are more useful when combined with the various predefined character classes.)

PRECEDENCE CHANGES

The range operator `-` constructs a character class from character literals, not from other character classes, so syntactically it has the highest precedence. This remains unchanged. The precedence among the negation, union, and intersection operators was changed.

In JDK 8, the operator precedence was as follows, from highest to lowest:

1. range `-`
2. negation `^`
3. union `[a][b]`
4. intersection `&&`

In JDK 9 and later, the operator precedence was changed to be as follows, from highest to lowest:

1. range `-`
2. union `[a][b]`
3. intersection `&&`
4. negation `^`

The net effect is that the precedence of the negation operator was moved from a very high precedence to the lowest precedence. Although this is an incompatible change, it actually makes a good deal of sense. For example, given any character class `[...]`, adding a negation operator `[^...]` now always negates the entire character class. This was not true in JDK 8, and its actual behavior was quite difficult to understand.
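
One way to observe the "negates the entire character class" property is to check that `[^...]` accepts exactly the characters that `[...]` rejects. The sketch below does this over the ASCII range only. The class name, the helper names, and the use of the `java.specification.version` system property to detect the runtime are assumptions made for illustration. Based on the description in this CSR, the check is expected to report `true` on JDK 9 and later, and `false` on JDK 8 for class bodies that combine negation with union or intersection.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and helper names are hypothetical.
class WholeClassNegationDemo {
    // JDK 8 and earlier report versions such as "1.8"; JDK 9 and later report "9", "11", ...
    static boolean isJdk9OrLater() {
        return !System.getProperty("java.specification.version").startsWith("1.");
    }

    // True if [^body] accepts exactly the ASCII characters that [body] rejects.
    static boolean negationComplementsWholeClass(String body) {
        Pattern plain = Pattern.compile("[" + body + "]");
        Pattern negated = Pattern.compile("[^" + body + "]");
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (plain.matcher(s).matches() == negated.matcher(s).matches()) {
                return false; // both accept or both reject: not a complement
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("JDK 9 or later: " + isJdk9OrLater());
        for (String body : new String[] {"a[b]c", "a&&b", "a-c[x-z]"}) {
            System.out.println(body + " -> " + negationComplementsWholeClass(body));
        }
    }
}
```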

EXAMPLES

----------

`Pattern.compile("[^a[b]c]").matcher("b").matches()`

JDK 8: true. The negation is performed on the outer character class, which effectively is `[^ac]`. This is then unioned with `[b]`.

JDK 9: false. The union is performed before the negation, resulting in a character class that is equivalent to `[^abc]`.

----------

`Pattern.compile("[^a&&b]").matcher("a").matches()`

JDK 8: false. The negation is performed first, effectively giving `[^a]`, which is then intersected with `[b]`.

JDK 9: true. The intersection `a&&b` is performed first, resulting in the empty set, which is then negated, giving a character class that matches everything.

----------

`Pattern.compile("[a[b]&&b[c]]").matcher("a").matches()`

All JDKs: false. (The behavior here hasn't changed; it is shown here for completeness of the discussion of operator precedence.) The union operations are performed first. Thus we have `[ab]` intersected with `[bc]`, which results in `[b]`, and `[b]` does not match `a`.
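
The three examples above can be run together with a small harness such as the following sketch. The class name `PrecedenceExamples` and the helper `matches` are hypothetical. The values noted in the comments are those stated in this CSR, not independently measured, so the program prints the runtime version alongside the actual results.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class and helper names are hypothetical.
class PrecedenceExamples {
    // True if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("Java " + System.getProperty("java.specification.version"));

        // Per the CSR: true on JDK 8, false on JDK 9 and later.
        System.out.println("[^a[b]c] on \"b\": " + matches("[^a[b]c]", "b"));

        // Per the CSR: false on JDK 8, true on JDK 9 and later.
        System.out.println("[^a&&b] on \"a\": " + matches("[^a&&b]", "a"));

        // Per the CSR: false on all JDKs.
        System.out.println("[a[b]&&b[c]] on \"a\": " + matches("[a[b]&&b[c]]", "a"));
    }
}
```

Code that must behave identically across JDK 8 and later releases can avoid depending on precedence by bracketing the negated part explicitly, for example `[[^ac][b]]` or `[[^a]&&[b]]`.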

----------

Specification
-------------

The precedence of character class operators has never been specified, and this change did not update the Pattern specification at all. Eventually, the Pattern specification should be updated to be more precise in its treatment of the character class operators. This work is covered by [JDK-8264671][3].

  [1]: https://bugs.openjdk.java.net/browse/JDK-6609854

  [2]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html

  [3]: https://bugs.openjdk.java.net/browse/JDK-8264671
Comments
[~smarks]; thank you for filing the retroactive CSR, largest known gap between push of fix and retroactive CSR! Voting to retroactively Approve, but it seems that producing more extensive documentation on these points is prudent.
19-10-2021

As far as I can tell, Perl 5 doesn't support nested character classes (thus there is no union operator), nor does it support intersection of character classes. See the comment: https://bugs.openjdk.java.net/browse/JDK-6609854?focusedCommentId=14410632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14410632
18-10-2021

How does this behavior compare to that of Perl 5?
18-10-2021