Bug ID: JDK-8311906 Improve robustness of String constructors with mutable array inputs

JDK-8311906 : Improve robustness of String constructors with mutable array inputs

Type: Bug
Component: core-libs
Sub-Component: java.lang
Affected Version: 8-pool-perf,11,17,20,21

Priority: P4
Status: Closed
Resolution: Fixed
OS: generic
CPU: generic

Submitted: 2023-07-11
Updated: 2025-01-28
Resolved: 2023-12-04

Versions (Unresolved/Resolved/Fixed)

The Version table provides details related to the release that this issue/RFE will be addressed.

Unresolved : Release in which this issue/RFE will be addressed.
Resolved: Release in which this issue/RFE has been resolved.
Fixed : Release in which this issue/RFE has been fixed. The release containing this fix may be available for download as an Early Access Release or a General Availability Release.

To download the current JDK release, click here.

JDK 11	JDK 17	JDK 21	JDK 22	JDK 8
11.0.28-oracleUnresolved	17.0.16-oracleUnresolved	21.0.8-oracleUnresolved	22 b27Fixed	8u461Unresolved

Related Reports

CSR :	JDK-8319228 - Improve robustness of String constructors with mutable array inputs
Relates :	JDK-8325737 - Release Note: `Files.readString` May Return Incorrect String When Using UTF-16 or Other Charsets
Relates :	JDK-8321180 - Condition for non-latin1 string size too large exception is off by one
Relates :	JDK-8321514 - UTF16 string gets constructed incorrectly from codepoints if CompactStrings is not enabled
Relates :	JDK-8325590 - Regression in round-tripping UTF-16 strings after JDK-8311906

Description

A DESCRIPTION OF THE PROBLEM :
A race condition in the String constructor taking a char[] (and probably other constructors too) allows creating a String with an incorrect coder: A String only containing latin-1 characters, but still encoded using UTF-16.
This is because in between the constructor checking if the content can be encoded using latin-1 and it being encoded as UTF-16, the content of the passed in array may have changed

See https://wouter.coekaerts.be/2023/breaking-string

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Concurrently modify the char[] passed into the String constructor. See example code.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
A String where .equals and other methods behave correctly
ACTUAL -
A String where .equals and other methods are inconsistent with its contents

---------- BEGIN SOURCE ----------
/**
 * Given a latin-1 String, creates a copy that is
 * incorrectly encoded as UTF-16.
 */
static String breakIt(String original) {
  if (original.chars().max().orElseThrow() > 256) {
    throw new IllegalArgumentException(
        "Can only break latin-1 Strings");
  }

  char[] chars = original.toCharArray();

  // in another thread, flip the first character back
  // and forth between being encodable as latin-1 or not
  Thread thread = new Thread(() -> {
    while (!Thread.interrupted()) {
      chars[0] ^= 256;
    }
  });
  thread.start();

  // at the same time call the String constructor,
  // until we hit the race condition
  while (true) {
    String s = new String(chars);
    if (s.charAt(0) < 256 && !original.equals(s)) {
      thread.interrupt();
      return s;
    }
  }
}


String a = "foo";
String b = breakIt(a);

// they are not equal to each other
System.out.println(a.equals(b));
// => false

// they do contain the same series of characters
System.out.println(Arrays.equals(a.toCharArray(),
    b.toCharArray()));
// => true
---------- END SOURCE ----------

FREQUENCY : always

Comments

verified JDK22 b27 ATR -- corelibs parts https://mach5.us.oracle.com/mdash/jobs/root-jdk-22-27-2262-20231208-1723-3257153/results?search=status%3A%20AND%20name%3A%2FStringRacyConstructor.java https://mach5.us.oracle.com/mdash/jobs/root-jdk-22-27-2262-20231208-1723-3257153/results?search=status%3A%20AND%20name%3A%2FChars.java JDK22 b30 CI for TestStringConstructionIntrinsics.java https://mach5.us.oracle.com/mdash/builds/jdk-22+30-2287/results?search=status%3A%20AND%20name%3Acompiler%2Fintrinsics%2Fstring%2FTestStringConstructionIntrinsics
11-01-2024
Changeset: 155abc57 Author: Roger Riggs <rriggs@openjdk.org> Date: 2023-12-04 18:28:59 +0000 URL: https://git.openjdk.org/jdk/commit/155abc576a0212932825485380d4e2a9c7dd2fdc
04-12-2023
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/16425 Date: 2023-10-30 18:34:44 +0000
06-11-2023
PR 16425 In Review: https://github.com/openjdk/jdk/pull/16425
03-11-2023
[~rriggs] Didn't know you are working on this bug, thought it was just another assignment that's forgotten over time. So here's a quick summary of my approach: Problem: racy input may lead to coder == UTF16 but value is totally LATIN1 compatible, and is the only case we need to defend against. Solution: when we promote Strings to UTF16, we must perform a final check that the value is not LATIN1 compatible when we obtain a trusted UTF16 value array. Implementation: I moved the compress calls in such untrusted contexts into a guarded method, and decode UTF-8 (which requires double reads) from a trusted array. The final checks are added, but there's still a pass before copying to create a UTF16 array (much like double null check + synchronization) Shortcoming: This will lead to increased CPU cost (compress array once more after putting into trusted UTF16 array) for correctness. The memory consumption should be the same. Testing: I ran test/jdk/java/lang/String tests. Might need to pass my current patch to more powerful devices to test the exact performance impacts and whether it affects hotspot. I also ran the reproduction case given here, but we might need JCStress tests to ensure correctness across platforms.
25-09-2023
A pull request was submitted for review. URL: https://git.openjdk.org/jdk/pull/15902 Date: 2023-09-25 12:28:40 +0000
25-09-2023
Requiring the input array to be unchanged during the call is most likely not sufficient to ensure correctness per Java memory model. A first plain read may see a write before that subsequent reads never see, so the writes to the array must be consistently visible to the plain array index reads.
17-07-2023
BigInteger constructor has the same "problem". https://gist.github.com/turbanoff/bcd234bd3469e177f5da0e9f3cf85c2a But its javadoc has note that "val" array is assumed to be unchanged for the duration of the constructor call. https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/math/BigInteger.html#%3Cinit%3E(byte%5B%5D)
17-07-2023
In general, we should read user arrays like how we read from benign races, where we only read once. The methods in StringLatin1 etc. should be updated to take the user array and convert it to byte array + coder at once, instead of reading the user array twice (once in trying to encode in latin1 and the other in encoding to utf16) It's also possible for the reverse situation, where an array is incorrectly interpreted as able to be encoded into latin1, as countNegatives + actual encoding performs more than one reads as well.
13-07-2023
The observations on Windows 10: JDK 8: The reproducer is not applicable. JDK 11: Failed, false value returned. JDK 17: Failed. JDK 20: Failed. JDK 21ea+20: Failed.
12-07-2023