JDK-8258119 : Linebreak pattern needs adjustment to conform to Unicode TR18 and PCRE
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 15
  • Priority: P3
  • Status: Open
  • Resolution: Unresolved
  • Submitted: 2020-12-11
  • Updated: 2020-12-17
  • Fix Version (Other): tbd, Unresolved
Description
JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. That change will be backed out by JDK-8258259.

The problem stated in JDK-8235812 was that the pattern \R{2} did not match the string "\r\n", and the fix changed the behavior so that the match succeeds. This *seemed* the correct thing to do, as the Pattern class spec defines \R as essentially

-----
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----

and the behavior after the change conforms to that definition.
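
To make the consequence concrete, here is a minimal sketch that spells out the spec's definition by hand and applies {2} to it. The class name, constant, and helper method are hypothetical and exist only for illustration; the sketch exercises the hand-written expansion, not the built-in \R, so its result does not depend on which JDK release runs it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: a hand-written expansion of the Pattern spec's
// definition of the linebreak matcher (CR LF as a unit, or any single
// linebreak character). This is an illustration, not the JDK implementation.
class SpecLinebreakSketch {
    public static final String SPEC_LINEBREAK =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Returns true if the whole input consists of exactly two repetitions of unit.
    public static boolean matchesTwice(String unit, String input) {
        return Pattern.matches(unit + "{2}", input);
    }

    public static void main(String[] args) {
        // The regex engine first consumes CR LF as one unit, fails to find a
        // second unit, then backtracks and consumes CR and LF separately.
        System.out.println(matchesTwice(SPEC_LINEBREAK, "\r\n"));     // prints true
        System.out.println(matchesTwice(SPEC_LINEBREAK, "\r\n\r\n")); // prints true
    }
}
```

Under this definition nothing stops the engine from backtracking into the CR LF pair, which is why a single "\r\n" can count as two linebreaks.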

The problem is that this definition of \R doesn't agree with the recommendation from TR18, which is

-----
(?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----

(Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)

The salient difference is the negative lookahead, (?!\u000D\u000A), in front of the character class. It prevents the single-character alternative from matching a \r that is immediately followed by \n, so a \r\n pair can only ever be consumed as one unit. Thus, under the TR18 recommendation, the pattern \R{2} would NOT match the string "\r\n". Indeed, PCRE has this behavior.
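
The effect of the lookahead can be checked with a small sketch that uses the TR18-style expression written out by hand. The class name, constant, and helper method are hypothetical. The last line probes the built-in \R of whatever JDK runs the code; its result depends on the release, so no expected output is given for it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the TR18-style linebreak expression, transliterated
// to Java regex syntax as in this report. Not the JDK implementation.
class Tr18LinebreakSketch {
    public static final String TR18_LINEBREAK =
        "(?:(?:\\u000D\\u000A)|(?!\\u000D\\u000A)[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Returns true if the whole input consists of exactly two repetitions of unit.
    public static boolean matchesTwice(String unit, String input) {
        return Pattern.matches(unit + "{2}", input);
    }

    public static void main(String[] args) {
        // The lookahead stops the character class from taking the CR of a
        // CR LF pair, so the pair cannot be split into two linebreaks.
        System.out.println(matchesTwice(TR18_LINEBREAK, "\r\n"));     // prints false
        System.out.println(matchesTwice(TR18_LINEBREAK, "\r\r"));     // prints true
        System.out.println(matchesTwice(TR18_LINEBREAK, "\n\r"));     // prints true
        System.out.println(matchesTwice(TR18_LINEBREAK, "\r\n\r\n")); // prints true

        // Built-in linebreak matcher of the running JDK; the result varies by release.
        System.out.println(Pattern.matches("\\R{2}", "\r\n"));
    }
}
```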

The Pattern spec's definition of \R should be revisited to see whether it should be adjusted to match TR18 more closely. The test cases removed in the backout changeset JDK-8258259 should be revisited as well, along with the code changes. It seems odd that the implementation of \R doesn't simply expand to something more or less equivalent to the TR18 expression. It may be that the code handles \R through special cases instead of treating it as a "macro" that is expanded to a more complicated sequence. It's not clear which approach is preferable.