JDK-8247728 : Regex behavior is different and now wrong comparing 8 and 11 (now)
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 11,15
  • Priority: P3
  • Status: Closed
  • Resolution: Not an Issue
  • Submitted: 2020-06-16
  • Updated: 2020-11-18
  • Resolved: 2020-07-16
Description
ADDITIONAL SYSTEM INFORMATION :
Tested on Windows and MacOS

A DESCRIPTION OF THE PROBLEM :
A regex used for natural-language tokenization, meant to find the boundaries between non-repeating punctuation and symbol characters, now also finds boundaries between different whitespace characters.

REGRESSION : Last worked in version 8u251

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
The results for the 1st and 2nd regex patterns differ between the Java 8 and Java 11 runtimes.
The 3rd and 4th patterns seem stable but should be equivalent.
Intended match: a non-alphanumeric, non-whitespace character that is not followed by the same character.
In Java 11, the 1st pattern now also matches whitespace, and the 2nd also matches alphanumeric characters.
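
The intent stated above, a character that is not followed by another copy of itself, rests on a capturing group followed by a negative lookahead on its backreference. As a minimal sketch of that technique alone, the example below swaps in the POSIX class \p{Punct} for the report's intersection classes, so it does not depend on the behavior under discussion. The names RunEndFinder and runEnds are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RunEndFinder {

    // Captures one punctuation character, then requires that the next
    // character is not the same one (negative lookahead on backreference \1).
    private static final Pattern LAST_OF_RUN = Pattern.compile("(\\p{Punct})(?!\\1)");

    // Returns the start index of the last character of every run of
    // identical punctuation characters.
    static List<Integer> runEnds(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = LAST_OF_RUN.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(runEnds("..,;;"));
    }
}
```

For the input "..,;;" the matches start at indices 1, 2 and 4, which correspond to indices 15, 16 and 18 in the report's longer test text.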


EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Java HotSpot(TM) 64-Bit Server VM
Oracle Corporation
25.221-b11
Running Test
Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;

Pattern: ([^0-9a-z&&[^\s]])(?!\1)
	Index: 15, 16 Group: . Next Char: ,
	Index: 16, 17 Group: , Next Char: ;
	Index: 18, 19 Group: ; Next Char: 
Pattern: ([^0-9a-z&&[\s]])(?!\1)
	Index: 5, 6 Group: <space> Next Char: <newline>
	Index: 7, 8 Group: <newline> Next Char: <tab>
	Index: 9, 10 Group: <tab> Next Char: <newline>
	Index: 10, 11 Group: <newline> Next Char: <space>
	Index: 11, 12 Group: <space> Next Char: <newline>
	Index: 13, 14 Group: <newline> Next Char: .
Pattern: ([\S&&[\w]])(?!\1)
	Index: 1, 2 Group: a Next Char: b
	Index: 2, 3 Group: b Next Char: c
	Index: 4, 5 Group: c Next Char: <space>
Pattern: ([\S&&[\W]])(?!\1)
	Index: 15, 16 Group: . Next Char: ,
	Index: 16, 17 Group: , Next Char: ;
	Index: 18, 19 Group: ; Next Char: 

ACTUAL -
Java HotSpot(TM) 64-Bit Server VM
Oracle Corporation
11.0.7+8-LTS
Running Test
Text: aabcc<space><newline><newline><tab><tab><newline><space><newline><newline>..,;;

Pattern: ([^0-9a-z&&[^\s]])(?!\1)
	Index: 5, 6 Group: <space> Next Char: <newline>
	Index: 7, 8 Group: <newline> Next Char: <tab>
	Index: 9, 10 Group: <tab> Next Char: <newline>
	Index: 10, 11 Group: <newline> Next Char: <space>
	Index: 11, 12 Group: <space> Next Char: <newline>
	Index: 13, 14 Group: <newline> Next Char: .
	Index: 15, 16 Group: . Next Char: ,
	Index: 16, 17 Group: , Next Char: ;
	Index: 18, 19 Group: ; Next Char: 
Pattern: ([^0-9a-z&&[\s]])(?!\1)
	Index: 1, 2 Group: a Next Char: b
	Index: 2, 3 Group: b Next Char: c
	Index: 4, 5 Group: c Next Char: <space>
	Index: 5, 6 Group: <space> Next Char: <newline>
	Index: 7, 8 Group: <newline> Next Char: <tab>
	Index: 9, 10 Group: <tab> Next Char: <newline>
	Index: 10, 11 Group: <newline> Next Char: <space>
	Index: 11, 12 Group: <space> Next Char: <newline>
	Index: 13, 14 Group: <newline> Next Char: .
	Index: 15, 16 Group: . Next Char: ,
	Index: 16, 17 Group: , Next Char: ;
	Index: 18, 19 Group: ; Next Char: 
Pattern: ([\S&&[\w]])(?!\1)
	Index: 1, 2 Group: a Next Char: b
	Index: 2, 3 Group: b Next Char: c
	Index: 4, 5 Group: c Next Char: <space>
Pattern: ([\S&&[\W]])(?!\1)
	Index: 15, 16 Group: . Next Char: ,
	Index: 16, 17 Group: , Next Char: ;
	Index: 18, 19 Group: ; Next Char: 


---------- BEGIN SOURCE ----------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestRegex {

	public static void main(String[] args) {
		System.out.println(System.getProperty("java.vm.name"));
		System.out.println(System.getProperty("java.vm.vendor"));
		System.out.println(System.getProperty("java.vm.version"));

		System.out.println("Running Test");
		String[] testRegex = new String[] { "([^0-9a-z&&[^\\s]])(?!\\1)", "([^0-9a-z&&[\\s]])(?!\\1)", "([\\S&&[\\w]])(?!\\1)", "([\\S&&[\\W]])(?!\\1)" };
		String text = "aabcc \n\n\t\t\n \n\n..,;;";
		System.out.println("Text: " + printable(text));
		System.out.println();
		for (String regex : testRegex) {
			Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
			System.out.println("Pattern: " + pattern.pattern());
			Matcher m = pattern.matcher(text);
			while (m.find()) {
				System.out
						.println("\tIndex: " + m.start() + ", " + m.end() + " Group: " + printable(m.group()) + " Next Char: " + printable((m.end() < text.length() ? "" + text.charAt(m.end()) : "")));
			}
		}

	}

	public static String printable(String text) {
		text = text.replaceAll("\t", "<tab>");
		text = text.replaceAll("\n", "<newline>");
		text = text.replaceAll(" ", "<space>");
		return text;
	}

}
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
We have to use the \S and \W character classes, which also allow the underscore character; this is a loss of precision.
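
Another option, not part of the submitted workaround, is to avoid the intersection syntax by folding the whitespace class into the negated set. The sketch below assumes that a plain negated class such as [^0-9a-z\s] is read the same way by both runtimes; this report did not verify that assumption. The names PunctuationTokenizer and lastOfRunStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PunctuationTokenizer {

    // One negated class: anything that is neither alphanumeric nor whitespace.
    // No "&&" is involved, so the intersection precedence question does not arise.
    private static final Pattern SYMBOL_RUN_END =
            Pattern.compile("([^0-9a-z\\s])(?!\\1)", Pattern.CASE_INSENSITIVE);

    // Returns the start index of each symbol that is not followed by itself.
    static List<Integer> lastOfRunStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = SYMBOL_RUN_END.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Same test text as the reproducer in this report.
        String text = "aabcc \n\n\t\t\n \n\n..,;;";
        System.out.println(lastOfRunStarts(text));
        // The underscore is treated as a symbol, unlike with \W.
        System.out.println(lastOfRunStarts("a_b"));
    }
}
```

Because the class lists the excluded characters explicitly, the underscore stays in the symbol set.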

FREQUENCY : always



Comments
This is an intended behavior change introduced by JDK-6609854 in JDK 9.
16-07-2020
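
The comment above does not spell out the new semantics. Going by the outputs in this report, JDK 9+ appears to apply the leading ^ to the whole class, intersection included, while Java 8 negated only the left operand. The sketch below, with the hypothetical name NegatedIntersectionProbe, checks which reading the running JDK implements by comparing the class against hand-written predicates over the ASCII range. It assumes \s covers only space, \t, \n, \x0B, \f and \r, which is the documented set when UNICODE_CHARACTER_CLASS is not set.

```java
import java.util.regex.Pattern;

final class NegatedIntersectionProbe {

    // Character class of the report's 1st pattern, without group or lookahead.
    private static final Pattern CLASS_UNDER_TEST = Pattern.compile("[^0-9a-z&&[^\\s]]");

    private static boolean isAlnum(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z');
    }

    // Assumed definition of \s: space, tab, newline, vertical tab, form feed, CR.
    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == 0x0B || c == '\f' || c == '\r';
    }

    // Reports which reading the running JDK's regex engine agrees with.
    static String reading() {
        boolean intersectionOfNegations = true; // [^0-9a-z] AND [^\s]
        boolean negatedIntersection = true;     // NOT ([0-9a-z] AND [^\s])
        for (char c = 0; c < 128; c++) {
            boolean actual = CLASS_UNDER_TEST.matcher(String.valueOf(c)).matches();
            if (actual != (!isAlnum(c) && !isSpace(c))) {
                intersectionOfNegations = false;
            }
            if (actual != !(isAlnum(c) && !isSpace(c))) {
                negatedIntersection = false;
            }
        }
        if (negatedIntersection) {
            return "negated intersection";
        }
        if (intersectionOfNegations) {
            return "intersection of negations";
        }
        return "neither";
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("java.version") + ": " + reading());
    }
}
```

The two readings differ only on whitespace characters, which is where the Java 8 and Java 11 outputs in this report diverge for the 1st pattern.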

Additional Information from submitter:
===========================
Using this online tester, the patterns with no results seem to be correctly implemented for Ruby: https://rubular.com/
The problem is seen in both Java 8 and 11.
29-06-2020

Additional Information from submitter:
===========================
Upon discussion and more research, this may be an adjustment related to https://bugs.openjdk.java.net/browse/JDK-8189343

It appears that the Java 8 precedence is different from Java 11 for:
([^0-9a-z&&[^\s]])(?!\1)

In Java 8 it is read equivalently as:
([[^0-9a-z]&&[^\s]]
(not alphanumeric, intersected with not whitespace) -> neither alphanumeric nor whitespace

In Java 9+ it is read equivalently as:
([^[[0-9a-z]&&[^\s]]
(not (alphanumeric intersected with not whitespace)) -> not alphanumeric

Testing the order of operations further (union vs. intersection), we seem to have hit a different bug:

Test String: "11233aabcc"
Pattern: ([0-9&&\S])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a
Pattern: ([0-9&&\Sa-z])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a
Pattern: ([0-9&&[\S]])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a
Pattern: ([0-9&&[\S]a-z])(?!)
Pattern: ([0-9&&[^\s]])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a
Pattern: ([0-9&&[^\s]a-z])(?!)

All the patterns are semantically equivalent given the published precedence order: group, then union (implicit), then intersection.
The 1st, 2nd, 3rd, and 5th patterns have the same results.
The 4th and 6th patterns have no results.

Adding additional bracing around the right side of the intersection produces the correct values again.

Pattern: ([0-9&&[[\S]a-z]])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a
Pattern: ([0-9&&[[^\s]a-z]])(?!)
	Index: 0, 1 Group: 1 Next Char: 1
	Index: 1, 2 Group: 1 Next Char: 2
	Index: 2, 3 Group: 2 Next Char: 3
	Index: 3, 4 Group: 3 Next Char: 3
	Index: 4, 5 Group: 3 Next Char: a

So something is wrong with the braces around the right side of the intersection, as if they turn the expression into [0-9&&a-z], which has no intersection values and therefore produces no matches in the test string.
29-06-2020
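
The grouping observation in the comment above can be checked with a small sketch. It tests the character classes on their own, without the trailing lookahead shown in the comment, and the names IntersectionGrouping and countMatches are hypothetical. Only the fully bracketed form is expected to match the five digits; the comment reports that the partially bracketed form matches nothing, and that result may depend on the JDK in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IntersectionGrouping {

    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "11233aabcc"; // test string from the comment

        // Right side fully bracketed: digits AND (non-whitespace OR a-z).
        System.out.println("[0-9&&[[\\S]a-z]] -> "
                + countMatches("[0-9&&[[\\S]a-z]]", text));

        // Right side partially bracketed: the form reported as matching nothing.
        System.out.println("[0-9&&[\\S]a-z]   -> "
                + countMatches("[0-9&&[\\S]a-z]", text));
    }
}
```

Wrapping the whole right-hand operand in its own brackets makes the intended union explicit, so the result does not rely on how the parser combines a nested class with the ranges that follow it.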

The observations on Windows 10:
JDK 11: Fail
JDK 15: Fail

ILW=MML=P4
17-06-2020