JDK-6984178 : + repetition on a regex causes StringIndexOutOfBoundsException (++ and +? work)
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 6u21
  • Priority: P4
  • Status: Open
  • Resolution: Unresolved
  • OS: windows_xp
  • CPU: x86
  • Submitted: 2010-09-13
  • Updated: 2012-03-20
Description
FULL PRODUCT VERSION :
1.6.0_07

A DESCRIPTION OF THE PROBLEM :
The following pattern works as expected:

		String FIBONACCI =
			"(?x) .{0,2} | (?: (?=(\\2|^)) (?=(\\2\\3|^.)) (?=(\\1)) \\2)++ . ";
			
		for (int n = 0; n < 1000; n++) {
			String s = new String(new char[n]);
			if (s.matches(FIBONACCI)) {
				System.out.printf("%s ", n);
			}
		}
		// 0 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987

Note that the above uses ++ possessive repetition. Modifying it to +? reluctant backtracking repetition also works. However, using just + greedy backtracking repetition throws StringIndexOutOfBoundsException with index -1.



STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the provided snippet; it should work as expected.

Then change ++ to +?; it should still work as expected.

Then change to just +; now a StringIndexOutOfBoundsException is thrown for no apparent reason.
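
The three steps above can be sketched as one program that tries each quantifier in turn and reports the outcome instead of aborting on the first exception. This is only an illustrative harness: the class name QuantifierRepro and the helpers pattern and tryMatch are hypothetical names, the pattern is the compact form from the source section of this report, and the input length 42 is taken from there as well. The result of the + case depends on the JDK in use, so no expected output is shown.

```java
class QuantifierRepro {
    // Builds the compact pattern from this report with the given quantifier
    // ("++", "+?" or "+") applied to the non-capturing group.
    static String pattern(String quantifier) {
        return "(?:(?=(\\2|^))(?=(\\2\\3|^.))(?=(\\1))\\2)" + quantifier + ".";
    }

    // Matches a string of n NUL characters against the pattern and returns
    // "true", "false", or a description of the index exception that was thrown.
    static String tryMatch(String quantifier, int n) {
        String input = new String(new char[n]);
        try {
            return String.valueOf(input.matches(pattern(quantifier)));
        } catch (IndexOutOfBoundsException e) {
            // StringIndexOutOfBoundsException is a subclass of this type.
            return "threw " + e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        for (String q : new String[] {"++", "+?", "+"}) {
            System.out.println(q + " -> " + tryMatch(q, 42));
        }
    }
}
```

Catching the exception per variant keeps the comparison on one screen, which makes it easier to confirm that only the greedy form misbehaves on an affected runtime.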

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
All three variations (++, +?, and +) should work correctly.
ACTUAL -
++ works, +? works, but + throws a StringIndexOutOfBoundsException.

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Exception in thread "main" java.lang.StringIndexOutOfBoundsException:
String index out of range: -1
	at java.lang.String.charAt(Unknown Source)
	at java.lang.Character.codePointAt(Unknown Source)
	at java.util.regex.Pattern$CharProperty.match(Unknown Source)
	at java.util.regex.Pattern$GroupCurly.match0(Unknown Source)
	at java.util.regex.Pattern$GroupCurly.match0(Unknown Source)
	at java.util.regex.Pattern$GroupCurly.match(Unknown Source)
	at java.util.regex.Pattern$Branch.match(Unknown Source)
	at java.util.regex.Matcher.match(Unknown Source)
	at java.util.regex.Matcher.matches(Unknown Source)
	at java.util.regex.Pattern.matches(Unknown Source)
	at java.lang.String.matches(Unknown Source)

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
new String(new char[42]).matches("(?:(?=(\\2|^))(?=(\\2\\3|^.))(?=(\\1))\\2)+.");
// throws StringIndexOutOfBoundsException: -1
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
As mentioned, ++ and +? still work correctly in this case, but they have different semantics than + in the general case.
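
To illustrate why the workaround is not a drop-in replacement, here is a minimal sketch of how the three quantifier kinds differ on a much simpler pattern. The class name QuantifierSemantics, the helper firstMatch, and the input "aaab" are hypothetical and chosen only for this example; they are unrelated to the Fibonacci pattern above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierSemantics {
    // Returns the first substring of input matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "aaab";
        // Greedy: takes as many 'a' characters as possible.
        System.out.println(firstMatch("a+", input));   // aaa
        // Reluctant: takes as few 'a' characters as possible.
        System.out.println(firstMatch("a+?", input));  // a
        // Greedy with a trailing 'a': backtracks one step so the last 'a' can match.
        System.out.println(firstMatch("a+a", input));  // aaa
        // Possessive with a trailing 'a': never gives characters back, so no match.
        System.out.println(firstMatch("a++a", input)); // null
    }
}
```

The last two lines show the difference that matters here: a pattern that relies on greedy backtracking can stop matching altogether once + is replaced by ++.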