JDK-8247546 : Pattern matching does not skip correctly over supplementary characters
  • Type: Bug
  • Component: core-libs
  • Sub-Component: java.util.regex
  • Affected Version: 8,11,15
  • Priority: P4
  • Status: Closed
  • Resolution: Fixed
  • OS: linux,windows_10
  • CPU: x86_64
  • Submitted: 2020-06-11
  • Updated: 2022-08-16
  • Resolved: 2020-07-29

Version table:
  • JDK 11: 11.0.17 (Fixed)
  • JDK 16: 16 b09 (Fixed)

Description
A DESCRIPTION OF THE PROBLEM :
The find method in java.util.regex.Matcher incorrectly skips only the first char of a supplementary code point when searching for an initial pattern match. The problematic code is in the java.util.regex.Pattern.Start node, which contains the following code:

            for (; i [...]

SYSTEM / OS / JAVA RUNTIME INFORMATION :
Tested on openjdk 14.0.1 and 11.0.5

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See the attached source code. The goal of the program is to replace invalid (lone) surrogate characters; properly encoded supplementary characters, like the example emoji, should be left unchanged.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The input string containing the emoji should not be matched or replaced by the pattern.
ACTUAL -
The pattern does not match at char index 0, but the matcher then steps only one char forward (instead of one code point), landing on the second half (the low surrogate) of the supplementary code point. This second char is then matched and replaced. Output (the question mark is due to terminal encoding):

? d83d
X 58
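
The difference between stepping by char and stepping by code point can be illustrated with a small sketch. This is not the code from java.util.regex.Pattern; the class and method names below are hypothetical and exist only to show which start indices each stepping strategy would try for a string beginning with U+1F4A9.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the JDK implementation.
public class CodePointStepping {
    // Start indices tried when advancing one char at a time.
    // Index 1 falls in the middle of the surrogate pair.
    static List<Integer> charStarts(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            starts.add(i);
        }
        return starts;
    }

    // Start indices tried when advancing one code point at a time.
    // Character.charCount returns 2 for supplementary code points, 1 otherwise.
    static List<Integer> codePointStarts(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            starts.add(i);
        }
        return starts;
    }

    public static void main(String[] args) {
        String s = new StringBuilder().appendCodePoint(0x1F4A9).append('a').toString();
        System.out.println(charStarts(s));      // [0, 1, 2]
        System.out.println(codePointStarts(s)); // [0, 2]
    }
}
```

With char stepping, index 1 (the low surrogate 0xDCA9) becomes a candidate match position, which corresponds to the behavior described above.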

---------- BEGIN SOURCE ----------
import java.util.regex.Pattern;

public class ReplaceInvalidSurrogates {
    public static void main(String[] args) {
        String pileofpoo = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        System.out.println(pileofpoo);

        // match low and high surrogate ranges. should only match lone surrogates, not any correctly encoded supplementary characters
        Pattern surrogates = Pattern.compile("[\\x{D800}-\\x{DBFF}\\x{DC00}-\\x{DFFF}]");

        String result = surrogates.matcher(pileofpoo).replaceAll("X");

        System.out.println(result);
        System.out.println(result.charAt(0) + " " + Integer.toHexString(result.charAt(0)));
        System.out.println(result.charAt(1) + " " + Integer.toHexString(result.charAt(1)));
    }
}

---------- END SOURCE ----------
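
On an affected JDK, the stated goal of the program (replacing lone surrogates while leaving well-formed pairs alone) could presumably be reached without java.util.regex. The following is a minimal sketch under that assumption; it is not part of the original report or of the fix, and the class and method names are hypothetical.

```java
// Hypothetical workaround sketch; does not use java.util.regex.
public class LoneSurrogateReplacer {
    // Replaces every unpaired surrogate char with the given replacement,
    // leaving well-formed surrogate pairs untouched.
    static String replaceLoneSurrogates(String input, String replacement) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Well-formed pair: copy both chars and skip past them.
                out.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate: replace it.
                out.append(replacement);
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = new StringBuilder().appendCodePoint(0x1F4A9).toString();
        System.out.println(replaceLoneSurrogates(emoji, "X").equals(emoji)); // true
        System.out.println(replaceLoneSurrogates("a\uD83Db", "X"));          // aXb
    }
}
```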

FREQUENCY : always



Comments
A pull request was submitted for review. URL: https://git.openjdk.org/jdk11u-dev/pull/1319 Date: 2022-08-09 02:13:19 +0000
09-08-2022

Fix Request (11u): Backporting this patch to fix supplementary character handling. Risk should be minimal, since the original fix didn't warrant a CSR.
08-08-2022

URL: https://hg.openjdk.java.net/jdk/jdk/rev/7d9dbad25be9 User: naoto Date: 2020-07-29 16:50:11 +0000
29-07-2020

This code will do:
```
--- old/src/java.base/share/classes/java/util/regex/Pattern.java 2020-07-25 13:12:37.000000000 -0700
+++ new/src/java.base/share/classes/java/util/regex/Pattern.java 2020-07-25 13:12:37.000000000 -0700
@@ -2948,8 +2948,10 @@
             return null;
         if (p instanceof BmpCharPredicate)
             return new BmpCharProperty((BmpCharPredicate)p);
-        else
+        else {
+            hasSupplementary = true;
             return new CharProperty(p);
+        }
     }
```
But I am not sure it is the right way. Who is the current Tzar of RE?
25-07-2020

Please take a look and reassign if you think someone else should look at it.
23-07-2020

The high and low surrogates for 0x1F4A9 are 0xD83D and 0xDCA9. The regex pattern checks the range 0xD800-0xDBFF for the high surrogate and 0xDC00-0xDFFF for the low one. The code does not match the first one but matches the second one and replaces it.
The issue is always reproducible.
Observation on Windows 10:
JDK 8: Fail
JDK 11: Fail
JDK 14: Fail
JDK 15 b29: Fail
07-07-2020
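
The surrogate values quoted in the comment above can be checked with the standard Character API. This is a small sketch; the class and helper names are hypothetical, while Character.highSurrogate and Character.lowSurrogate are existing JDK methods.

```java
// Hypothetical helper for checking the UTF-16 encoding of a code point.
public class SurrogateCheck {
    // Returns the high and low surrogate of a supplementary code point
    // as lowercase hex, separated by a space.
    static String surrogatesHex(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return Integer.toHexString(high) + " " + Integer.toHexString(low);
    }

    public static void main(String[] args) {
        System.out.println(surrogatesHex(0x1F4A9)); // d83d dca9
    }
}
```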