JDK-8007395 : StringIndexOutOfBoundsException in Matcher.find() when input String contains surrogate UTF-16 characters
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 5.0,6u34,7,8
  • Priority: P3
  • Status: Closed
  • Resolution: Fixed
  • OS: generic
  • CPU: generic
  • Submitted: 2013-02-01
  • Updated: 2019-06-27
  • Resolved: 2013-04-26
Version table:
  • JDK 7: 7u241, Fixed
  • JDK 8: 8 b89, Fixed
Description
SYNOPSIS
--------
StringIndexOutOfBoundsException in Matcher.find() when input String contains surrogate UTF-16 characters
       
OPERATING SYSTEMS
-----------------
All
       
FULL JDK VERSIONS
-----------------
All (Since JDK 1.5.0)

PROBLEM DESCRIPTION
-------------------
When Matcher.find() is called on an input String that contains surrogate characters, it throws a StringIndexOutOfBoundsException under the following circumstances:

1. When a regex pattern results in a call to the GroupCurly.match0() method
2. When the surrogate pair appears in the String at an index greater than 4 + the minimum input length expected by the pattern
3. When the pattern does not match the input string
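
To make the second condition more concrete, the following minimal sketch shows where the surrogate pair sits in the input used by the test case in this report. The class name SurrogateLayout and its two helper methods are hypothetical and exist only for illustration; the String and Character calls are standard Java API.

```java
public class SurrogateLayout {
    // Hypothetical helper: index of the first high-surrogate char in s,
    // or -1 if s contains none.
    static int firstSurrogateIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isHighSurrogate(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical helper: number of Unicode code points in s.
    // A surrogate pair counts once here but twice in s.length().
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String inputStr = "test this as \ud83d\ude0d";
        System.out.println(inputStr.length());              // 15
        System.out.println(codePoints(inputStr));           // 14
        System.out.println(firstSurrogateIndex(inputStr));  // 13
        // The pair at index 13 encodes a single supplementary code point.
        System.out.println(Integer.toHexString(inputStr.codePointAt(13))); // 1f60d
    }
}
```

The input has 15 chars but only 14 code points, so any index arithmetic that assumes one char per matched character can be off by one around the pair.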
       
REPRODUCTION INSTRUCTIONS
-------------------------
Simply compile and run the attached test case.

Observed behaviour (this specific trace is from 7u9):
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.charAt(String.java:658)
        at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
        at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4360)
        at java.util.regex.Pattern$GroupCurly.match0(Pattern.java:4354)
        at java.util.regex.Pattern$GroupCurly.match(Pattern.java:4304)
        at java.util.regex.Pattern$SliceI.match(Pattern.java:3895)
        at java.util.regex.Pattern$Start.match(Pattern.java:3408)
        at java.util.regex.Matcher.search(Matcher.java:1199)
        at java.util.regex.Matcher.find(Matcher.java:592)
        at RegexTestCase.main(RegexTestCase.java:11)
       
Expected behaviour:
No exception should be thrown. The pattern does not match, so Matcher.find() should return false.

TEST CASE
---------
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestCase {
    public static void main(String[] args) {
        String ptrnStr = "test(.)+(@[a-zA-Z.]+)";
        Pattern ptrn = Pattern.compile (ptrnStr, Pattern.CASE_INSENSITIVE);
        String inputStr = "test this as \ud83d\ude0d";
        Matcher matcher = ptrn.matcher(inputStr);
        try {
            if (matcher.find()) {
                System.out.println("Found String");
            } else {
                System.out.println("Not found");
            }
        } catch (StringIndexOutOfBoundsException siob) {
            System.out.println("Testcase Failed");
            siob.printStackTrace();
        }
    }
}

WORK AROUND
-----------
Catch the exception and treat it as a "false" return value.
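
A minimal sketch of this workaround, assuming it is acceptable to treat the exception as "no match". The class name SafeFind and the helper findOrFalse are hypothetical names chosen for illustration; they are not part of the JDK or of the attached fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeFind {
    // Hypothetical helper: behaves like matcher.find(), but maps the
    // StringIndexOutOfBoundsException described in this report to "no match".
    static boolean findOrFalse(Matcher matcher) {
        try {
            return matcher.find();
        } catch (StringIndexOutOfBoundsException e) {
            // On affected releases the failing search did not match anyway.
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern ptrn = Pattern.compile("test(.)+(@[a-zA-Z.]+)", Pattern.CASE_INSENSITIVE);
        // Input from the test case: no match, with or without the bug.
        System.out.println(findOrFalse(ptrn.matcher("test this as \ud83d\ude0d"))); // false
        // A matching input is unaffected by the wrapper.
        System.out.println(findOrFalse(ptrn.matcher("test this as @example")));     // true
    }
}
```

On releases that contain the fix the catch block should never run, so the wrapper can be left in place or removed. It is probably safest not to reuse a Matcher after the exception has been caught and to create a fresh one instead.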

SUGGESTED FIX
-------------
See attachment.
Comments
CharProperty becomes somewhat "non-deterministic" when it can match both BMP and supplementary characters: it always matches exactly ONE code point, but that code point occupies one or two "char"s in the input CharSequence. As a result, the existing "iterative optimization" in GroupCurly fails when it backtracks. The GroupCurly implementation does have a mechanism to "recursively" step into a new layer for a differently "sized" match, but it appears that the implementation fails to back off correctly (it should not back off past its start iteration "j").
26-04-2013
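
The variable width mentioned in the comment above can be observed from the public API. This is only an illustrative sketch: the class DotWidth and its helper dotMatchWidth are hypothetical names, and the example does not exercise GroupCurly or reproduce the bug itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotWidth {
    // Hypothetical helper: number of chars consumed by the first match
    // of "." in input, or -1 if there is no match.
    static int dotMatchWidth(String input) {
        Matcher m = Pattern.compile(".").matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // A BMP character is one code point stored in one char.
        System.out.println(dotMatchWidth("a"));            // 1
        // A supplementary character is one code point stored in two chars.
        System.out.println(dotMatchWidth("\ud83d\ude0d")); // 2
    }
}
```

The same node therefore advances the input position by a different number of chars from one iteration to the next, which is why a loop that backs off by a fixed step per iteration can land on the wrong index.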

Fix provided by Licensee. Please review and provide comments.
18-02-2013

After some investigation, this is reproducible using JDK 1.5.0 beta. We do not archive earlier builds, but there was a definite change made to this code between b01 and beta.
01-02-2013