JDK-8258259 : Unicode linebreak matching behavior is incorrect; backout JDK-8235812
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 15,16
  • Priority: P3
  • Status: Resolved
  • Resolution: Fixed
  • Submitted: 2020-12-14
  • Updated: 2021-01-20
  • Resolved: 2020-12-18
Fixed Versions
  • JDK 16: 16 b30 (Fixed)
  • JDK 17: 17 (Fixed)
Related Reports
Relates :  
Relates :  
Sub Tasks
JDK-8258456 :  
Description
JDK-8235812 changed how the Unicode linebreak pattern, \R, is matched. That change should be reverted.

The problem stated in that bug report was that the pattern \R{2} did not match the string "\r\n". The fix changed the behavior so that the match succeeds. This *seemed* the correct thing to do, because the Pattern class spec defines \R as essentially

-----
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----

and the behavior after the change conforms to that definition.
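
To make the consequence concrete, here is a minimal sketch that applies the spec-style definition above with a {2} quantifier. The class and method names are hypothetical and chosen for illustration only. The pattern is written out explicitly rather than as \R, because what \R itself does depends on the JDK build in use.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class SpecLinebreakDemo {
    // Spec-style linebreak definition from the report, written with
    // regex-level escapes (doubled backslashes) so the regex engine,
    // not the Java compiler, interprets the code points.
    static final String SPEC_LINEBREAK =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // True if the whole input consists of exactly two linebreaks
    // according to the spec-style definition.
    static boolean isTwoLinebreaks(String input) {
        return Pattern.compile("(?:" + SPEC_LINEBREAK + "){2}")
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        // CR LF is accepted as two linebreaks: the engine backtracks and
        // matches CR and LF separately through the character class.
        System.out.println(isTwoLinebreaks("\r\n"));     // true
        System.out.println(isTwoLinebreaks("\r\n\r\n")); // true
        System.out.println(isTwoLinebreaks("\n"));       // false
    }
}
```

With nothing stopping the character class from consuming a lone CR, backtracking can split a CR LF pair into two separate linebreaks.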

The problem is that this definition of the \R pattern does not agree with the recommendation from TR18, which is

-----
(?:\u000D\u000A)|(?!\u000D\u000A)[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
-----

(Based on http://unicode.org/reports/tr18/#Line_Boundaries and corrected and transliterated to Java regex syntax.)

The salient difference is the negative lookahead, (?!\u000D\u000A), which prevents the character class from matching a \r that is immediately followed by \n. Thus, under the TR18 recommendation, the pattern \R{2} does NOT match the string "\r\n". Indeed, PCRE has this behavior.
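
The effect of the lookahead can be checked with a short sketch that uses the TR18-style pattern quoted above. The class and method names are hypothetical, and the pattern is spelled out explicitly instead of relying on \R, whose behavior varies by JDK build.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class Tr18LinebreakDemo {
    // TR18-style linebreak definition from the report, using regex-level
    // escapes (doubled backslashes) for the code points.
    static final String TR18_LINEBREAK =
        "(?:\\u000D\\u000A)|(?!\\u000D\\u000A)"
        + "[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // True if the whole input consists of exactly two linebreaks
    // according to the TR18-style definition.
    static boolean isTwoLinebreaks(String input) {
        return Pattern.compile("(?:" + TR18_LINEBREAK + "){2}")
                      .matcher(input)
                      .matches();
    }

    // Counts successive linebreak matches found in the input.
    static int countLinebreaks(String input) {
        Matcher m = Pattern.compile(TR18_LINEBREAK).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The lookahead blocks the class from taking the CR of a CR LF
        // pair, so CR LF cannot be split into two linebreaks.
        System.out.println(isTwoLinebreaks("\r\n"));     // false
        System.out.println(isTwoLinebreaks("\n\n"));     // true
        System.out.println(isTwoLinebreaks("\r\n\r\n")); // true
        System.out.println(countLinebreaks("\r\r\n"));   // 2
    }
}
```

In the last call, the first CR is a linebreak on its own because it is followed by another CR, and the remaining CR LF is consumed as a single unit.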

This bug covers backing out the JDK-8235812 change. Follow-on bug JDK-8258119 covers further changes in this area. In particular, the Pattern spec's definition of \R should be revisited to decide whether it should be adjusted to follow TR18 more closely.
Comments
Changeset: cbc3feeb
Author: Stuart Marks <smarks@openjdk.org>
Date: 2020-12-18 00:36:33 +0000
URL: https://git.openjdk.java.net/jdk16/commit/cbc3feeb
18-12-2020